This guide documents lessons learned from optimizing Deriva's LLM-based pipeline for consistency and quality. It's intended for developers working on prompt engineering and configuration tuning.
For running benchmarks, see BENCHMARKS.md for the user guide and CLI reference.
- Optimization Methodology
- Prompt Engineering Principles
- ArchiMate Knowledge
- Case Study: Initial Optimization
- Graph-Based Optimization
- Optimization Log
- Phase 4: Advanced Optimizations
- Token Efficiency Optimizations (v0.6.9)
- References
Instead of testing all configs together (expensive, noisy), use targeted optimization:
- Identify worst-performing element type by analyzing summary.json
- Run 10+ iterations with only that config uncached using --nocache-configs
- Iterate on prompt until 100% consistency for that element type
- Move to next worst element type
This approach is:
- Cost-efficient: Only LLM calls for the config being tested
- Fast iteration: 10 runs in ~4 minutes vs 3 full runs in ~6 minutes
- Clear signal: Isolates the effect of prompt changes
# Step 1: Run baseline to identify worst element type
uv run python -m deriva.cli.cli benchmark run \
--repos flask_invoice_generator \
--models mistral-devstral2 \
--runs 3 \
--no-cache
# Step 2: Analyze by element type prefix
uv run python -c "
import json
from collections import defaultdict

with open('workspace/benchmarks/<session>/analysis/summary.json') as f:
    data = json.load(f)
intra = data['intra_model'][0]

element_types = defaultdict(lambda: {'stable': 0, 'unstable': 0})
for e in intra['stable_elements']:
    prefix = e.split('_')[0]
    element_types[prefix]['stable'] += 1
for e in intra['unstable_elements']:
    prefix = e.split('_')[0]
    element_types[prefix]['unstable'] += 1

for prefix, counts in sorted(element_types.items()):
    total = counts['stable'] + counts['unstable']
    pct = (counts['stable'] / total * 100) if total > 0 else 0
    print(f'{prefix}: {counts[\"stable\"]}/{total} ({pct:.0f}%)')
"
# Step 3: Run targeted test for worst element type
uv run python -m deriva.cli.cli benchmark run \
--repos flask_invoice_generator \
--models mistral-devstral2 \
--runs 10 \
--nocache-configs TechnologyService
# Step 4: Update config and repeat until 100%

This is the most important rule for config optimization.
When writing prompts, NEVER include:
- Specific entity names from test repositories (invoice, customer, position)
- Specific file names (app.py, models.py)
- Specific technology stacks (Flask, SQLAlchemy)
- Specific project structures
Examples: Bad vs Good
BAD - Overfitting:
# DON'T DO THIS
Create services for: invoice management, customer handling, position tracking
Exclude files like: app.py, __init__.py
Do not create BusinessActor for "flask_invoice_generator" concepts
GOOD - Generalizable:
# DO THIS INSTEAD
Create services for: entity management, data validation, document generation
Exclude: framework initialization methods, internal utilities
Filter source nodes where out_degree = 0 AND pagerank < 0.01
Test for overfitting: ask yourself, "Would this prompt work identically on a completely different repository (e.g., an e-commerce app, a healthcare system, a gaming backend)?"
If the answer is "no" or "it depends on the domain", the prompt is overfitting.
| Approach | Example | Consistency |
|---|---|---|
| Too specific | as_validate_invoice_input | Low (varies by domain) |
| Correct level | as_validate_data | High (generalizes) |
Guide the LLM to use GENERIC category names (data, entity, document) rather than domain-specific names.
Empirical support: Liang 2025 achieved 100% accuracy on domain-specific tasks by providing carefully engineered in-context learning prompts with explicit domain constraints. Their finding that domain-specific instructions improved performance by 30% on complex cases validates the importance of abstraction-level guidance in prompts.
1. Explicit Naming Rules
"Use snake_case" is not enough. Provide exact format examples:
NAMING RULES (CRITICAL FOR CONSISTENCY):
1. Use SINGULAR form always (Invoice not Invoices)
2. Use lowercase snake_case for identifier (bus_obj_invoice)
3. Use Title Case for display name (Invoice)
2. Ban Synonyms Explicitly
MANDATORY SYNONYM RULES - ALWAYS use these canonical names:
- Customer (NEVER: Client, User, Buyer, Account)
- Order (NEVER: Purchase, Transaction, Sale)
- Position (NEVER: Line Item, Order Line, Item)
3. Canonical Identifier Tables
For DataObject and similar types, provide lookup tables:
| File Pattern | Identifier |
|--------------|------------|
| .env, .flaskenv | do_environment_configuration |
| requirements.txt | do_dependency_manifest |
| *.db | do_application_database |
| .gitignore | do_version_control_configuration |
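Such a table can also be enforced deterministically in post-processing, independent of the LLM. A minimal sketch, assuming a pattern-to-identifier dict mirroring the table above (the helper name and matching approach are illustrative, not part of the Deriva API):

```python
import fnmatch

# Illustrative canonical identifier table (mirrors the prompt table above)
CANONICAL_IDENTIFIERS = {
    ".env": "do_environment_configuration",
    ".flaskenv": "do_environment_configuration",
    "requirements.txt": "do_dependency_manifest",
    "*.db": "do_application_database",
    ".gitignore": "do_version_control_configuration",
}

def canonical_identifier(filename: str):
    """Return the canonical DataObject identifier for a file, or None."""
    for pattern, identifier in CANONICAL_IDENTIFIERS.items():
        if fnmatch.fnmatch(filename, pattern):
            return identifier
    return None
```

Applying the same table in both the prompt and a post-processing check catches runs where the LLM drifts from the canonical names.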
4. Graph-Based Filtering
Filter by structural properties rather than naming patterns:
DO NOT create TechnologyService for:
- Nodes with low structural importance (pagerank < threshold)
- Transitive dependencies (out_degree = 0)
- Nodes not in k-core >= 2
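Before encoding such filters in a derivation query, they can be prototyped offline. A sketch using networkx (the thresholds mirror the ones above; the library choice and function name are assumptions):

```python
import networkx as nx

def filter_candidates(g: nx.DiGraph, pagerank_min: float = 0.01, kcore_min: int = 2):
    """Keep nodes structurally important enough to derive elements from."""
    pagerank = nx.pagerank(g)
    kcore = nx.core_number(g.to_undirected())
    return [
        n for n in g.nodes
        if g.out_degree(n) > 0            # not a transitive leaf dependency
        and pagerank[n] >= pagerank_min   # structurally important
        and kcore.get(n, 0) >= kcore_min  # embedded in a dense neighborhood
    ]
```

Running this against the enriched code graph shows which candidates a query-level filter would admit before any LLM call is spent on them.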
5. Examples Drive Consistency
Claude follows example patterns closely. A well-structured example JSON is more effective than verbose rules:
{
  "elements": [
    {
      "identifier": "as_manage_entities",
      "name": "Entity Management",
      "description": "CRUD operations for domain entities"
    }
  ]
}

6. Use XML Tags for Structure
Aligns with Claude's prompt engineering best practices:
<definition>
ApplicationService represents a behavior element...
</definition>
<naming>
Use verb phrases: "Invoice Processing", "Payment Service"
</naming>
<constraints>
Maximum 5 services per repository
</constraints>
| Element Type | Naming Pattern | Examples |
|---|---|---|
| ApplicationService | Verb phrases | "Invoice Processing", "Payment Service" |
| DataObject | Singular noun phrases | "Environment Configuration" |
| BusinessObject | Singular nouns | "Customer", "Invoice" |
| ApplicationComponent | Directory-based | "templates", "static" |
For comprehensive ArchiMate reference including element definitions, relationship rules, and metamodel constraints, see ARCHIMATE.md.
Key sections for prompt engineering:
- The Three Aspects - Active/Behavior/Passive classification
- Relationships - Valid relationship types and constraints
- Common Pitfalls - Modeling mistakes to avoid
Key findings from academic research on LLM-based ArchiMate derivation:
| Finding | Source | Implication for Deriva |
|---|---|---|
| Few-shot prompting works without fine-tuning | Chaaben 2022 | Use in-context examples, not trained models |
| Domain-specific ICL prompts can achieve 100% accuracy | Liang 2025 | Invest in tailored prompt engineering per element type |
| Guidance texts significantly improve output | Coutinho 2025 | Include domain-specific instruction documents |
| Chain-of-thought may decrease performance | Chen 2023 | Prefer direct instructions over reasoning chains |
| High precision, low recall is the norm | Chen 2023 | Expect correct but incomplete outputs |
| Code-to-ArchiMate: 68% precision, 80% recall | Castillo 2019 | Industrial benchmark baseline for extraction |
| NLP model extraction: 83-96% correctness | Arora 2016 | Achievable with explicit naming rules |
| LLMs show higher consistency than humans | Reitemeyer 2025 | Multiple runs can improve reliability |
| Consistency ≠ accuracy (independent properties) | Raj 2025 | Validate correctness separately from consistency |
| Human-in-the-loop is essential | All sources | Design for validation, not full automation |
| Temperature | Use Case | Trade-off |
|---|---|---|
| 0.0-0.2 | Element derivation | Maximum consistency, less creativity |
| 0.3-0.5 | Relationship discovery | Balanced |
| 0.6-0.8 | Name generation | More variety, less consistency |
Recommendation: Use low temperature (0.2-0.3) for element derivation to maximize consistency across runs.
Critical caveat: Consistency and accuracy are independent properties (Raj 2025). High consistency does NOT guarantee correctness. A process could consistently produce incorrect results. Always validate accuracy separately through manual review or ground truth comparison.
Multi-Run Aggregation
Run derivation 3-5 times and aggregate results (Wang 2025):
def aggregate_elements(runs: list[list[dict]]) -> list[dict]:
    """Keep elements appearing in majority of runs."""
    element_counts = {}
    for run in runs:
        for element in run:
            key = element["identifier"]
            if key not in element_counts:
                element_counts[key] = {"element": element, "count": 0}
            element_counts[key]["count"] += 1
    threshold = len(runs) // 2 + 1
    return [
        data["element"]
        for data in element_counts.values()
        if data["count"] >= threshold
    ]

Confidence Thresholds
| Confidence | Interpretation | Action |
|---|---|---|
| 0.9-1.0 | High confidence | Include automatically |
| 0.7-0.9 | Moderate confidence | Include with review flag |
| 0.5-0.7 | Low confidence | Manual review required |
| < 0.5 | Very low | Exclude or investigate |
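The table maps directly onto a triage helper; a minimal sketch (the function and action labels are illustrative):

```python
def triage(confidence: float) -> str:
    """Map a confidence score to the review action from the table above."""
    if confidence >= 0.9:
        return "include"
    if confidence >= 0.7:
        return "include_with_review_flag"
    if confidence >= 0.5:
        return "manual_review"
    return "exclude"
```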
Identifier Hallucination
Problem: LLM invents identifiers not in the provided list.
Solution: Explicitly constrain in the prompt:
CRITICAL: You MUST use identifiers EXACTLY as shown in this list:
["ac_auth", "bo_customer", "do_user_data"]
Do NOT:
- Invent new identifiers
- Modify existing identifiers
- Use partial matches
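Prompt constraints reduce but rarely eliminate hallucinated identifiers, so a defensive post-validation pass is worth keeping. A sketch (the function name and relationship dict shape are assumptions):

```python
def validate_identifiers(relationships: list[dict], allowed: set[str]):
    """Split LLM output into relationships whose endpoints are all known
    identifiers and those referencing invented ones."""
    valid, rejected = [], []
    for rel in relationships:
        if rel["source"] in allowed and rel["target"] in allowed:
            valid.append(rel)
        else:
            rejected.append(rel)
    return valid, rejected
```

Rejected relationships can be logged for review rather than silently written to the model.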
Over-Generation
Problem: LLM creates too many elements/relationships.
Solution: Add explicit constraints:
## Constraints
- Maximum 3 relationships per source element
- Only create elements where confidence > 0.5
- If no candidates are suitable, return {"elements": []}
Generic Names
Problem: LLM uses code names instead of business names.
Solution: Specify naming requirements:
Naming rules:
- Use business-meaningful names, not code identifiers
- "User Authentication Service" not "auth_service"
- "Customer Order" not "customer_order_model"
- Names should be understandable to business stakeholders
Chain-of-Thought Degradation
Problem: Asking LLM to explain reasoning decreases quality (Chen 2023).
Solution:
- Use direct instructions, not reasoning chains
- Don't ask "think step by step" for ArchiMate derivation
- Focus prompts on what to output, not how to think
Initial benchmark with 5 runs showed 28% consistency with 18 unstable elements:
unstable_elements:
bus_obj_positions: 3/5 runs # plural vs singular
bus_obj_invoicedetails: 4/5 # camelCase vs snake_case
bus_obj_customer: 4/5 # vs "client" synonym
app_comp_static: 2/5 # inconsistent naming
app_comp_flask_invoice_generator_static: 3/5 # repo prefix included
The original prompts were too vague:
- BusinessObject: "Derive BusinessObject elements from business concepts"
- ApplicationComponent: "Use directory name as component name, include repo for context"
- TechnologyService: "Group related dependencies into logical services"
BusinessObject Prompt (Improved)
NAMING RULES (CRITICAL FOR CONSISTENCY):
1. Use SINGULAR form always (Invoice not Invoices)
2. Use lowercase snake_case for identifier (bus_obj_invoice)
3. Use Title Case for display name (Invoice)
MANDATORY SYNONYM RULES - ALWAYS use these canonical names:
- Customer (NEVER: Client, User, Buyer, Account)
- Order (NEVER: Purchase, Transaction, Sale)
- Position (NEVER: Line Item, Order Line, Item)
Output stable, deterministic results.
ApplicationComponent Prompt (Improved)
NAMING RULES (CRITICAL FOR CONSISTENCY):
1. Use ONLY the directory name, NEVER include repository name prefix
- Correct: app_comp_static
- Wrong: app_comp_flask_invoice_generator_static
2. Use lowercase snake_case for identifier
Output stable, deterministic results.
| Metric | Before | After | Improvement |
|---|---|---|---|
| Consistency | 28% | 78.6% | +50.6% |
| Unstable elements | 18 | 3 | -83% |
| Count variance | 1.84 | 0.24 | -87% |
Unstable elements may correlate with graph properties of their source nodes. By analyzing stable vs unstable elements' sources, we can identify patterns and apply graph-based filters.
- Run enrichment algorithms on the graph (PageRank, Louvain, k-core, etc.)
- Correlate stability with graph properties
- Apply filters in derivation queries
Step 1: Run Enrichment
from deriva.modules.derivation import enrich
enrichments = enrich.enrich_graph(
    edges=edges,
    algorithms=['pagerank', 'louvain', 'kcore', 'articulation_points', 'degree']
)

# Write to graph: graph_manager.batch_update_properties(enrichments)

Step 2: Correlate Stability
Query source nodes for stable vs unstable elements:
// Get graph properties for element sources
MATCH (e) WHERE e.identifier IN $element_ids
WITH e.properties_json as props
MATCH (n {id: source_id})
RETURN n.pagerank, n.kcore_level, n.out_degree, n.in_degree

Analysis result:
STABLE vs UNSTABLE Source Nodes:
+-----------+---------+----------+------------+
| Metric | Stable | Unstable | Difference |
+-----------+---------+----------+------------+
| PageRank | 0.0188 | 0.0071 | +164% |
| K-core | 1.15 | 1.00 | +15% |
| Out-degree| 2.31 | 0.00 | +inf |
+-----------+---------+----------+------------+
Step 3: Apply Graph-Based Filters
Update input_graph_query to filter on graph properties:
MATCH (n)
WHERE (n:`Graph:TypeDefinition` OR n:`Graph:BusinessConcept`)
AND n.active = true
AND (n.out_degree > 0 OR n.pagerank > 0.01) -- Filter floating nodes
RETURN n.id, n.name, n.pagerank, n.kcore_level

| Source Type | Graph Properties | Stability |
|---|---|---|
| Structural (TypeDefinition, Method, File) | High out-degree, connected | More stable |
| Semantic (BusinessConcept) | Zero out-degree, floating | Less stable |
Semantic nodes extracted by LLM have no structural relationships in the code graph. When derivation uses these as sources, the LLM has less context, leading to inconsistent outputs.
This observation aligns with broader challenges in neural-symbolic integration: Cai 2025 identifies "representation gaps between neural network outputs and structured symbolic representations" as a fundamental challenge, particularly for complex relational reasoning. The graph-based filtering approach helps bridge this gap by grounding LLM interpretation in structural context.
Recommendation: For element types that can use either structural or semantic sources, prefer structural sources or require minimum graph connectivity.
Detailed chronological record of optimization sessions and findings.
2026-01-03: Initial Config Optimization
Repository: flask_invoice_generator (small) Model: openai-gptx Runs: 5
| Session | Consistency | Element Counts | Issues |
|---|---|---|---|
| bench_20260103_094609 | 28% | 13-17 | 18 unstable elements |
Main Problems:
- BusinessObject: naming variants (positions/position, invoicedetails/invoice_details)
- ApplicationComponent: repo prefix inconsistency
- TechnologyService: detection variance
- BusinessObject (v1-v3): Added explicit naming rules, mandatory synonym rules, singular form requirement
- ApplicationComponent (v1-v2): Never include repo name prefix, use only directory name
- TechnologyService (v1-v2): Standard service categories list, grouping rules
- DataObject (v1-v2): Generic names only
| Session | Consistency | Element Counts | Issues |
|---|---|---|---|
| bench_20260103_095630 | 78.6% | 12-13 | 3 unstable elements |
| bench_20260103_101845 | 100% | 12 | 0 (DataObject test) |
Improvement: +50.6% consistency, 83% fewer unstable elements
- Explicit naming rules are critical
- Ban synonyms explicitly
- Standard category lists reduce variance
- Add determinism instruction to every LLM prompt
- Test one config at a time with --nocache-configs
2026-01-03: Medium Repository Test
Repository: full-stack-fastapi-template (medium) Model: openai-gptx
- Extraction failures - "Response missing 'dependencies' array" in ExternalDependency extraction
- Edge creation failures - Node ID mismatches for TypeDefinition and Test extractions
- These are infrastructure/schema issues, not derivation LLM issues
| Session | Consistency | Notes |
|---|---|---|
| bench_20260103_100150 | 61.1% | DataObject naming variants |
Observation: Medium repo has underlying extraction issues to resolve before clean benchmarking.
2026-01-03: Relationship Derivation Fix
Issue: Run failures with "Invalid relationship type: Association"
The LLM was outputting "Association" which is not a valid ArchiMate relationship type.
Updated build_relationship_prompt() in deriva/modules/derivation/base.py:
VALID RELATIONSHIP TYPES (use ONLY these exact names):
- Composition, Aggregation, Serving, Realization, Access, Flow, Assignment
INVALID TYPES (NEVER use these):
- Association (use Serving or Flow instead)
- Dependency (use Serving instead)
- Uses (use Serving instead)
| Session | Consistency | Runs | Failures |
|---|---|---|---|
| bench_20260103_103526 | 85.7% | 3 | 0 |
| Metric | Baseline | Final | Improvement |
|---|---|---|---|
| Consistency | 28% | 85.7% | +57.7% |
| Stable elements | 7 | 12 | +71% |
| Unstable elements | 18 | 2 | -89% |
| Run failures | ~33% | 0% | -100% |
2026-01-03: Cross-Repository Generalization Test
Objective: Verify configs are generic and don't overfit to test repositories
Repositories tested:
- flask_invoice_generator (small)
- full-stack-fastapi-template (medium)
| Repository | Runs | Consistency | Status |
|---|---|---|---|
| flask_invoice_generator | 3/3 | 85.7% | Configs work well |
| full-stack-fastapi-template | 3/3 | 57.1% | Extraction infra issues |
Evidence of generalization:
- Consistent naming patterns across repos (same prefixes)
- Small repo high consistency (85.7%) proves prompts work generically
- Medium repo failures are infrastructure bugs, NOT config issues
Conclusion: No config adjustments needed for generalization. Derivation configs are generic.
2026-01-08: Efficient Targeted Optimization Workflow
Model: mistral-devstral2 Repository: flask_invoice_generator
See Optimization Methodology for the full workflow.
| Version | Stable | Unstable | Consistency | Key Change |
|---|---|---|---|---|
| v1 | 0/5 | 5 | 0% | Original vague prompt |
| v2 | 1/3 | 2 | 33% | Added canonical names |
| v3 | 3/8 | 5 | 38% | Added determinism instruction |
| v4 | 3/4 | 1 | 75% | Excluded transitive deps |
| Element Type | Baseline | After Optimization |
|---|---|---|
| ApplicationService | 0% | 100% |
| BusinessActor | 0% | 100% |
| DataObject | 50% | 100% |
| BusinessProcess | 0% | 50% |
| BusinessObject | 25% | 50% |
| TechnologyService | 0% | 75% |
Overall: 28% → 75%
| Config | Method | Variance | Status |
|---|---|---|---|
| Repository | Deterministic | 0 | 100% |
| Directory | Deterministic | 0 | 100% |
| File | Deterministic | 0 | 100% |
| Method | AST (Python) | 0 | 100% |
| TypeDefinition | AST (Python) | 0 | 100% |
| ExternalDependency | Parser | 0 | 100% |
| Test | Cache | 0 | 100% |
| BusinessConcept | LLM | 84-86 | 1-2 unstable |
| Technology | LLM | 86-87 | 1 unstable |
Key finding: Only LLM-based extraction configs have variance. AST-based and deterministic parsers are 100% consistent.
2026-01-08: Graph Property-Based Optimization
See Graph-Based Optimization for the full methodology.
| Metric | Baseline | After Filter | Change |
|---|---|---|---|
| Element Consistency | 81.25% | 80.0% | -1.25% |
| Unstable Elements | 3 | 3 | Same count |
Key outcome: The specific unstable elements changed completely. Graph filter eliminated instability from "floating" semantic nodes.
2026-01-09: File Classification and CLI Improvements
Objective: Improve file extraction quality and add CLI support for file type management.
- Files with unknown extensions had fileType=None instead of a meaningful default
- No CLI for file types - registry could only be managed via UI
- Improved File Classification Logic - Files with unknown extensions now get file_type="unknown" with extension as subtype
- CLI File Type Management - Added deriva config filetype commands
| Metric | Before | After |
|---|---|---|
| Files with null fileType | ~15% | 0% |
| CLI file type commands | 0 | 4 |
2026-01-10: A/B Testing Framework and Derivation Optimization
Objective: Create fast A/B testing workflow and improve derivation consistency to >=80%
Problem: 41.7% consistency with naming variants
Solution: Canonical identifier table in prompt
Result: 41.7% → 100% consistency
Problem: 28.6% consistency with variants like as_validate_data vs as_validate_input
Solution: Abstraction principle + example-driven prompt with XML tags
Result: 28.6% → 100% consistency
| Element Type | Before | After | Change |
|---|---|---|---|
| DataObject | 41.7% | 100% | +58.3% |
| ApplicationService | 28.6% | 100% | +71.4% |
- Explicit naming rules are critical - "Use snake_case" is not enough; provide exact examples
- Ban synonyms explicitly - "Customer (NEVER: Client, User, Buyer)" works better than "use consistent names"
- Standard category lists reduce variance - Enumerate allowed values
- Add determinism instruction - "Output stable, deterministic results" in every LLM prompt
- Test one config at a time - Use --nocache-configs for targeted testing
- Examples drive consistency - A good example JSON is more effective than verbose rules
- Abstraction level is key - Use generic category names, not domain-specific names (Liang 2025: +30% improvement)
- Graph-based selection over name-based - Filter by structural properties (in_degree, pagerank)
- Never use repository-specific rules - All optimizations must be generic
- Prefer structural sources over semantic - TypeDefinition/Method sources are more stable than BusinessConcept
- Consistency ≠ accuracy - High consistency doesn't guarantee correctness; validate both independently (Raj 2025)
The following advanced optimizations were implemented to further improve token efficiency and consistency.
Functions in deriva/modules/derivation/base.py:
| Function | Purpose |
|---|---|
| estimate_tokens(text) | Estimates token count (~4 chars/token) |
| get_model_context_limit(model) | Returns context limit for model |
| check_prompt_size(prompt, model) | Warns if prompt exceeds 80% of limit |
| limit_existing_elements(elements, max=50) | Keeps top-N elements by confidence |
| stratified_sample_elements(elements, max_per_type=10) | Samples across element types |
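As a rough sketch of how the first and third helpers could work (the context limits and the ~4 chars/token heuristic here are assumptions, not the exact Deriva implementation):

```python
# Assumed context limits per model; illustrative values only
CONTEXT_LIMITS = {"mistral-devstral2": 128_000, "openai-gptx": 128_000}
DEFAULT_LIMIT = 32_000

def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token heuristic."""
    return len(text) // 4

def check_prompt_size(prompt: str, model: str) -> bool:
    """Return True (and warn) if the prompt exceeds 80% of the model's limit."""
    limit = CONTEXT_LIMITS.get(model, DEFAULT_LIMIT)
    tokens = estimate_tokens(prompt)
    if tokens > 0.8 * limit:
        print(f"WARNING: prompt ~{tokens} tokens exceeds 80% of {limit}-token limit")
        return True
    return False
```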
Only include existing elements with graph proximity to new elements:
from deriva.modules.derivation.base import (
    get_connected_source_ids,
    filter_by_graph_proximity,
)

# Get nodes connected within 2 hops
connected_ids = get_connected_source_ids(graph_manager, new_source_ids, max_hops=2)

# Filter to only graph neighbors
filtered = filter_by_graph_proximity(existing_elements, connected_ids)

Benefits:
- 60-90% reduction in context size
- Better relationship quality (only related elements in context)
- Reduced hallucination of spurious relationships
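One plausible implementation of the hop-limited neighbor collection (a sketch of the idea behind get_connected_source_ids, not the actual code):

```python
from collections import deque

def connected_within_hops(edges, seeds: set, max_hops: int = 2) -> set:
    """Collect all node ids reachable from the seeds within max_hops,
    treating edges as undirected for proximity purposes."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```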
The defer_relationships parameter enables a two-phase architecture.
Default Mode (defer_relationships=True): (Recommended)
Phase 1: Create ALL elements (skip relationships)
Phase 2: Single consolidated relationship pass
Legacy Mode (defer_relationships=False):
For each element type:
1. Create elements → 2. Derive relationships → Repeat
Usage:
from deriva.services.derivation import generate_element
from deriva.modules.derivation.base import derive_consolidated_relationships

# Phase 1: Generate all elements without relationships
all_elements = []
for element_type in element_types:
    result = generate_element(
        element_type=element_type,
        defer_relationships=True,  # Skip per-batch relationships
        # ... other params
    )
    all_elements.extend(result["elements"])

# Phase 2: Derive all relationships in one pass
relationships = derive_consolidated_relationships(
    all_elements=all_elements,
    relationship_rules=rules_by_type,
    llm_query_fn=llm.query,
    graph_manager=graph_manager,
)

Benefits:
- Better context - ALL elements available during relationship derivation
- Fewer LLM calls - One pass per element type instead of per batch
- More consistent - Reduces ordering effects
- Graph-aware filtering works better with complete element set
Batch size adapts to candidate count and token limits:
from deriva.modules.derivation.base import (
    calculate_dynamic_batch_size,
    adjust_batch_for_tokens,
)

# Auto-size based on candidate count
batch_size = calculate_dynamic_batch_size(len(candidates))  # 10-25 range

# Reduce if tokens exceed model limit
batch_size = adjust_batch_for_tokens(batch_size, estimated_tokens, model_name)

After Phase 4 optimizations (5 runs, mistral-devstral2, flask_invoice_generator):
| Metric | Result |
|---|---|
| Structural edge consistency | 100% |
| Duration | ~411s |
| Node variance | 87-89 (stable) |
| Elements per run | 22-24 |
| Relationships per run | 20-30 |
Version 0.6.9 introduced several token efficiency improvements that reduce extraction costs by an estimated 40-60%.
Problem: Default JSON formatting with indent=2 adds significant whitespace overhead.
Solution: Use compact serialization with no whitespace:
# Before (wasteful)
json.dumps(data, indent=2)
# {
#   "elements": [
#     {
#       "id": "bus_concept_1",
#       "name": "Customer"
#     }
#   ]
# }

# After (efficient)
json.dumps(data, separators=(",", ":"))
# {"elements":[{"id":"bus_concept_1","name":"Customer"}]}

Savings: ~15% token reduction for JSON payloads.
Where to apply:
- Existing concepts/elements passed to LLM prompts
- Any structured data in prompt context
- NOT for human-readable output or logs
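The whitespace overhead is easy to measure directly; a quick check (the payload is a toy example):

```python
import json

# Toy payload shaped like the element lists passed to prompts
payload = {"elements": [{"id": f"bus_concept_{i}", "name": f"Concept {i}"}
                        for i in range(50)]}

pretty = json.dumps(payload, indent=2)
compact = json.dumps(payload, separators=(",", ":"))

# Compact serialization strips all inter-token whitespace
saved = 1 - len(compact) / len(pretty)
print(f"compact is {saved:.0%} smaller")
```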
Problem: Static instructions repeated in every LLM call waste tokens.
Solution: Separate prompts into system (static) and user (dynamic) portions:
| Prompt Type | Content | Sent When |
|---|---|---|
| System prompt | Role definition, naming rules, output format, constraints | Once per session (cached by provider) |
| User prompt | File content, existing concepts, specific context | Every call |
Implementation pattern:
# System prompt - static instructions (sent once per session)
system_prompt = """
You are an expert at extracting business concepts from source code.
NAMING RULES:
1. Use singular form (Invoice not Invoices)
2. Use Title Case for names
3. Use lowercase snake_case for identifiers
OUTPUT FORMAT:
Return valid JSON with "concepts" array.
"""
# User prompt - dynamic content (per file/batch)
user_prompt = f"""
<existing_concepts>
{json.dumps(existing, separators=(",", ":"))}
</existing_concepts>
<file path="{file_path}">
{file_content}
</file>
Extract business concepts from this file.
"""Benefits:
- Many providers cache system prompts across calls
- Reduces redundant instruction tokens
- Cleaner separation of concerns
- Easier to maintain and update instructions
Problem: Each small file requires a separate LLM call with full prompt overhead.
Solution: Batch multiple small files into single LLM calls using the batch_size configuration.
Configuration:
# CLI usage
uv run deriva-cli run extraction --repo myrepo --batch-size 5
# Or set in extraction config
uv run deriva-cli config update extraction BusinessConcept \
-p '{"batch_size": 5}'

How batching works:
- Files are sorted by size (smallest first)
- Files are grouped until batch token limit is reached
- Each batch is sent as a single LLM call
- Results are disaggregated back to individual files
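The grouping steps above can be sketched as follows (the ~4 chars/token estimate is a heuristic and the function itself is illustrative, not the Deriva implementation):

```python
def build_batches(files, batch_size: int = 5, max_batch_tokens: int = 4000):
    """Group (path, content) pairs, smallest first, until either the
    file-count or the approximate token limit for the batch is reached."""
    batches, current, current_tokens = [], [], 0
    for path, content in sorted(files, key=lambda f: len(f[1])):  # smallest first
        tokens = len(content) // 4  # ~4 chars/token heuristic
        if current and (len(current) >= batch_size
                        or current_tokens + tokens > max_batch_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append((path, content))
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Each resulting batch is then rendered into a single prompt like the example below, and per-file results are matched back by index.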
Batching parameters:
| Parameter | Default | Description |
|---|---|---|
| batch_size | 1 | Maximum files per batch |
| max_batch_tokens | 4000 | Token limit per batch |
| batch_by_directory | false | Group files from same directory |
Example batch prompt:
<files>
<file path="models/customer.py" index="0">
class Customer:
name: str
email: str
</file>
<file path="models/order.py" index="1">
class Order:
customer_id: int
total: float
</file>
<file path="models/item.py" index="2">
class Item:
name: str
price: float
</file>
</files>
Extract business concepts from each file. Return results indexed by file.
Savings: 30-50% token reduction depending on file sizes and batch efficiency.
Best practices for batching:
- Start with batch_size=3-5 for initial testing
- Increase for small files - config files, models, schemas batch well
- Keep batch_size=1 for large files - complex modules need individual attention
- Monitor quality - very high batch sizes may reduce extraction quality
- Use batch_by_directory when files in same directory share context
When all three optimizations are applied together:
| Optimization | Individual Savings | Cumulative |
|---|---|---|
| Compact JSON | ~15% | 15% |
| System/User separation | ~10-15% | 25-30% |
| Multi-file batching | ~30-50% | 40-60% |
Measuring token usage:
# Run extraction with verbose logging to see token counts
uv run deriva-cli run extraction --repo myrepo -v
# Check logs for token usage per step
grep "tokens" workspace/logs/extraction_*.jsonlBefore optimizing prompts for consistency, ensure token efficiency:
- JSON payloads use compact serialization (separators=(",", ":"))
- Static instructions are in system prompt, dynamic content in user prompt
- Small files are batched appropriately (batch_size > 1)
- Large context is filtered (use limit_existing_elements() or stratified_sample_elements())
- Graph proximity filtering is enabled for relationship derivation
| Citation | Reference | Key Contribution |
|---|---|---|
| Arora 2016 | Arora et al., "Extracting domain models from natural-language requirements" | Industrial NLP extraction: 83-96% correctness, explicit naming rules |
| Cai 2025 | Cai et al., "Practices, opportunities and challenges in the fusion of knowledge graphs and large language models" | KG-LLM integration taxonomy (KEL/LEK/LKC), neural-symbolic representation gaps |
| Castillo 2019 | Castillo et al., "ArchiRev - Reverse engineering toward ArchiMate models" | Code-to-ArchiMate benchmark: 68% precision, 80% recall |
| Chaaben 2022 | Chaaben et al., "Towards using Few-Shot Prompt Learning for Automating Model Completion" | Few-shot prompting without fine-tuning, frequency-based ranking |
| Chaaben 2024 | Chaaben et al., "On the Utility of Domain Modeling Assistance with LLMs" | 20% time reduction, 33-56% suggestion contribution rates |
| Chen 2023 | Chen et al., "Automated Domain Modeling with LLMs: A Comparative Study" | F1 scores (0.76 classes, 0.34 relationships), chain-of-thought caution |
| Coutinho 2025 | Coutinho et al., "LLM-Based Modeling Assistance for Textual Ontology-Driven Conceptual Modeling" | Guidance texts significantly improve output quality |
| Liang 2025 | Liang et al., "Integrating Large Language Models for Automated Structural Analysis" | Domain-specific ICL achieves 100% accuracy; benchmarking methodology |
| Raj 2025 | Raj et al., "Semantic Consistency for Assuring Reliability of Large Language Models" | Critical: Consistency and accuracy are independent properties |
| Reitemeyer 2025 | Reitemeyer & Fill, "Applying LLMs in Knowledge Graph-based Enterprise Modeling" | LLMs show higher consistency than humans, human-in-the-loop essential |
| Wang 2025 | Wang & Wang, "Assessing Consistency and Reproducibility in LLM Outputs" | 3-5 runs optimal for consistency |
| Resource | Description |
|---|---|
| ArchiMate 3.2 Specification | Official ArchiMate standard from The Open Group |
| Mastering ArchiMate | Gerben Wierda's comprehensive guide to ArchiMate modeling |
| ArchiMate Best Practices | Community-curated best practices for Archi tool usage |
| ArchiMate Cookbook | Eero Hosiaisluoma's practical ArchiMate patterns |
- ANSI/NISO Z39.19-2005 (R2010) - Guidelines for Controlled Vocabularies
- ISO 704:2022 - Terminology work: Principles and methods
- OMG SBVR - Semantics of Business Vocabulary and Business Rules
- ARCHIMATE.md - ArchiMate element definitions, relationships, and metamodel reference
- BENCHMARKS.md - User guide for running benchmarks
- CONTRIBUTING.md - Architecture and development patterns
- ArchiMate Best Practices & Resource Guide - Detailed prompt templates for ArchiMate derivation