This guide documents lessons learned from optimizing Deriva's LLM-based pipeline for consistency and quality. It's intended for developers working on prompt engineering and configuration tuning.
For running benchmarks, see BENCHMARKS.md for the user guide and CLI reference.
- Optimization Methodology
- Prompt Engineering Principles
- ArchiMate Knowledge
- Case Study: Initial Optimization
- Graph-Based Optimization
- Optimization Log
- Phase 4: Advanced Optimizations
- Token Efficiency Optimizations (v0.6.9)
- References
Instead of testing all configs together (expensive, noisy), use targeted optimization:
- Identify worst-performing element type by analyzing summary.json
- Run 10+ iterations with only that config uncached using --nocache-configs
- Iterate on prompt until 100% consistency for that element type
- Move to next worst element type
This approach is:
- Cost-efficient: Only LLM calls for the config being tested
- Fast iteration: 10 runs in ~4 minutes vs 3 full runs in ~6 minutes
- Clear signal: Isolates the effect of prompt changes
# Step 1: Run baseline to identify worst element type
uv run python -m deriva.cli.cli benchmark run \
--repos flask_invoice_generator \
--models mistral-devstral2 \
--runs 3 \
--no-cache
# Step 2: Analyze by element type prefix
uv run python -c "
import json
from collections import defaultdict

with open('workspace/benchmarks/<session>/analysis/summary.json') as f:
    data = json.load(f)
intra = data['intra_model'][0]

element_types = defaultdict(lambda: {'stable': 0, 'unstable': 0})
for e in intra['stable_elements']:
    prefix = e.split('_')[0]
    element_types[prefix]['stable'] += 1
for e in intra['unstable_elements']:
    prefix = e.split('_')[0]
    element_types[prefix]['unstable'] += 1

for prefix, counts in sorted(element_types.items()):
    total = counts['stable'] + counts['unstable']
    pct = (counts['stable'] / total * 100) if total > 0 else 0
    print(f'{prefix}: {counts[\"stable\"]}/{total} ({pct:.0f}%)')
"
# Step 3: Run targeted test for worst element type
uv run python -m deriva.cli.cli benchmark run \
--repos flask_invoice_generator \
--models mistral-devstral2 \
--runs 10 \
--nocache-configs TechnologyService
# Step 4: Update config and repeat until 100%

This is the most important rule for config optimization.
When writing prompts, NEVER include:
- Specific entity names from test repositories (invoice, customer, position)
- Specific file names (app.py, models.py)
- Specific technology stacks (Flask, SQLAlchemy)
- Specific project structures
Examples: Bad vs Good
BAD - Overfitting:
# DON'T DO THIS
Create services for: invoice management, customer handling, position tracking
Exclude files like: app.py, __init__.py
Do not create BusinessActor for "flask_invoice_generator" concepts
GOOD - Generalizable:
# DO THIS INSTEAD
Create services for: entity management, data validation, document generation
Exclude: framework initialization methods, internal utilities
Filter source nodes where out_degree = 0 AND pagerank < 0.01
Test for overfitting: ask yourself, "Would this prompt work identically on a completely different repository (e.g., an e-commerce app, a healthcare system, a gaming backend)?"
If the answer is "no" or "it depends on the domain", the prompt is overfitting.
| Approach | Example | Consistency |
|---|---|---|
| Too specific | as_validate_invoice_input | Low (varies by domain) |
| Correct level | as_validate_data | High (generalizes) |
Guide the LLM to use GENERIC category names (data, entity, document) rather than domain-specific names.
Empirical support: Liang 2025 achieved 100% accuracy on domain-specific tasks by providing carefully engineered in-context learning prompts with explicit domain constraints. Their finding that domain-specific instructions improved performance by 30% on complex cases validates the importance of abstraction-level guidance in prompts.
1. Explicit Naming Rules
"Use snake_case" is not enough. Provide exact format examples:
NAMING RULES (CRITICAL FOR CONSISTENCY):
1. Use SINGULAR form always (Invoice not Invoices)
2. Use lowercase snake_case for identifier (bus_obj_invoice)
3. Use Title Case for display name (Invoice)
2. Ban Synonyms Explicitly
MANDATORY SYNONYM RULES - ALWAYS use these canonical names:
- Customer (NEVER: Client, User, Buyer, Account)
- Order (NEVER: Purchase, Transaction, Sale)
- Position (NEVER: Line Item, Order Line, Item)
3. Canonical Identifier Tables
For DataObject and similar types, provide lookup tables:
| File Pattern | Identifier |
|--------------|------------|
| .env, .flaskenv | do_environment_configuration |
| requirements.txt | do_dependency_manifest |
| *.db | do_application_database |
| .gitignore | do_version_control_configuration |
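Such a table can also be enforced deterministically in post-processing, independent of the LLM. A minimal sketch, assuming a pattern-to-identifier dict mirroring the table above (the helper name and matching approach are illustrative, not part of the Deriva API):

```python
import fnmatch

# Illustrative canonical identifier table (mirrors the prompt table above)
CANONICAL_IDENTIFIERS = {
    ".env": "do_environment_configuration",
    ".flaskenv": "do_environment_configuration",
    "requirements.txt": "do_dependency_manifest",
    "*.db": "do_application_database",
    ".gitignore": "do_version_control_configuration",
}

def canonical_identifier(filename: str):
    """Return the canonical DataObject identifier for a file, or None."""
    for pattern, identifier in CANONICAL_IDENTIFIERS.items():
        if fnmatch.fnmatch(filename, pattern):
            return identifier
    return None
```

Applying the same table in both the prompt and a post-processing check catches runs where the LLM drifts from the canonical names.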
4. Graph-Based Filtering
Filter by structural properties rather than naming patterns:
DO NOT create TechnologyService for:
- Nodes with low structural importance (pagerank < threshold)
- Transitive dependencies (out_degree = 0)
- Nodes not in k-core >= 2
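Before encoding such filters in a derivation query, they can be prototyped offline. A sketch using networkx (the thresholds mirror the ones above; the library choice and function name are assumptions):

```python
import networkx as nx

def filter_candidates(g: nx.DiGraph, pagerank_min: float = 0.01, kcore_min: int = 2):
    """Keep nodes structurally important enough to derive elements from."""
    pagerank = nx.pagerank(g)
    kcore = nx.core_number(g.to_undirected())
    return [
        n for n in g.nodes
        if g.out_degree(n) > 0            # not a transitive leaf dependency
        and pagerank[n] >= pagerank_min   # structurally important
        and kcore.get(n, 0) >= kcore_min  # embedded in a dense neighborhood
    ]
```

Running this against the enriched code graph shows which candidates a query-level filter would admit before any LLM call is spent on them.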
5. Examples Drive Consistency
Claude follows example patterns closely. A well-structured example JSON is more effective than verbose rules:
{
  "elements": [
    {
      "identifier": "as_manage_entities",
      "name": "Entity Management",
      "description": "CRUD operations for domain entities"
    }
  ]
}

6. Use XML Tags for Structure
Aligns with Claude's prompt engineering best practices:
<definition>
ApplicationService represents a behavior element...
</definition>
<naming>
Use verb phrases: "Invoice Processing", "Payment Service"
</naming>
<constraints>
Maximum 5 services per repository
</constraints>
| Element Type | Naming Pattern | Examples |
|---|---|---|
| ApplicationService | Verb phrases | "Invoice Processing", "Payment Service" |
| DataObject | Singular noun phrases | "Environment Configuration" |
| BusinessObject | Singular nouns | "Customer", "Invoice" |
| ApplicationComponent | Directory-based | "templates", "static" |
For comprehensive ArchiMate reference including element definitions, relationship rules, and metamodel constraints, see ARCHIMATE.md.
Key sections for prompt engineering:
- The Three Aspects - Active/Behavior/Passive classification
- Relationships - Valid relationship types and constraints
- Common Pitfalls - Modeling mistakes to avoid
Key findings from academic research on LLM-based ArchiMate derivation:
| Finding | Source | Implication for Deriva |
|---|---|---|
| Few-shot prompting works without fine-tuning | Chaaben 2022 | Use in-context examples, not trained models |
| Domain-specific ICL prompts can achieve 100% accuracy | Liang 2025 | Invest in tailored prompt engineering per element type |
| Guidance texts significantly improve output | Coutinho 2025 | Include domain-specific instruction documents |
| Chain-of-thought may decrease performance | Chen 2023 | Prefer direct instructions over reasoning chains |
| High precision, low recall is the norm | Chen 2023 | Expect correct but incomplete outputs |
| Code-to-ArchiMate: 68% precision, 80% recall | Castillo 2019 | Industrial benchmark baseline for extraction |
| NLP model extraction: 83-96% correctness | Arora 2016 | Achievable with explicit naming rules |
| LLMs show higher consistency than humans | Reitemeyer 2025 | Multiple runs can improve reliability |
| Consistency ≠ accuracy (independent properties) | Raj 2025 | Validate correctness separately from consistency |
| Human-in-the-loop is essential | All sources | Design for validation, not full automation |
| Temperature | Use Case | Trade-off |
|---|---|---|
| 0.0-0.2 | Element derivation | Maximum consistency, less creativity |
| 0.3-0.5 | Relationship discovery | Balanced |
| 0.6-0.8 | Name generation | More variety, less consistency |
Recommendation: Use low temperature (0.2-0.3) for element derivation to maximize consistency across runs.
Critical caveat: Consistency and accuracy are independent properties (Raj 2025). High consistency does NOT guarantee correctness. A process could consistently produce incorrect results. Always validate accuracy separately through manual review or ground truth comparison.
Multi-Run Aggregation
Run derivation 3-5 times and aggregate results (Wang 2025):
def aggregate_elements(runs: list[list[dict]]) -> list[dict]:
    """Keep elements appearing in majority of runs."""
    element_counts = {}
    for run in runs:
        for element in run:
            key = element["identifier"]
            if key not in element_counts:
                element_counts[key] = {"element": element, "count": 0}
            element_counts[key]["count"] += 1
    threshold = len(runs) // 2 + 1
    return [
        data["element"]
        for data in element_counts.values()
        if data["count"] >= threshold
    ]

Confidence Thresholds
| Confidence | Interpretation | Action |
|---|---|---|
| 0.9-1.0 | High confidence | Include automatically |
| 0.7-0.9 | Moderate confidence | Include with review flag |
| 0.5-0.7 | Low confidence | Manual review required |
| < 0.5 | Very low | Exclude or investigate |
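The table maps directly onto a triage helper; a minimal sketch (the function and action labels are illustrative):

```python
def triage(confidence: float) -> str:
    """Map a confidence score to the review action from the table above."""
    if confidence >= 0.9:
        return "include"
    if confidence >= 0.7:
        return "include_with_review_flag"
    if confidence >= 0.5:
        return "manual_review"
    return "exclude"
```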
Identifier Hallucination
Problem: LLM invents identifiers not in the provided list.
Solution: Explicitly constrain in the prompt:
CRITICAL: You MUST use identifiers EXACTLY as shown in this list:
["ac_auth", "bo_customer", "do_user_data"]
Do NOT:
- Invent new identifiers
- Modify existing identifiers
- Use partial matches
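Prompt constraints reduce but rarely eliminate hallucinated identifiers, so a defensive post-validation pass is worth keeping. A sketch (the function name and relationship dict shape are assumptions):

```python
def validate_identifiers(relationships: list[dict], allowed: set[str]):
    """Split LLM output into relationships whose endpoints are all known
    identifiers and those referencing invented ones."""
    valid, rejected = [], []
    for rel in relationships:
        if rel["source"] in allowed and rel["target"] in allowed:
            valid.append(rel)
        else:
            rejected.append(rel)
    return valid, rejected
```

Rejected relationships can be logged for review rather than silently written to the model.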
Over-Generation
Problem: LLM creates too many elements/relationships.
Solution: Add explicit constraints:
## Constraints
- Maximum 3 relationships per source element
- Only create elements where confidence > 0.5
- If no candidates are suitable, return {"elements": []}
Generic Names
Problem: LLM uses code names instead of business names.
Solution: Specify naming requirements:
Naming rules:
- Use business-meaningful names, not code identifiers
- "User Authentication Service" not "auth_service"
- "Customer Order" not "customer_order_model"
- Names should be understandable to business stakeholders
Chain-of-Thought Degradation
Problem: Asking LLM to explain reasoning decreases quality (Chen 2023).
Solution:
- Use direct instructions, not reasoning chains
- Don't ask "think step by step" for ArchiMate derivation
- Focus prompts on what to output, not how to think
Initial benchmark with 5 runs showed 28% consistency with 18 unstable elements:
unstable_elements:
bus_obj_positions: 3/5 runs # plural vs singular
bus_obj_invoicedetails: 4/5 # camelCase vs snake_case
bus_obj_customer: 4/5 # vs "client" synonym
app_comp_static: 2/5 # inconsistent naming
app_comp_flask_invoice_generator_static: 3/5 # repo prefix included
The original prompts were too vague:
- BusinessObject: "Derive BusinessObject elements from business concepts"
- ApplicationComponent: "Use directory name as component name, include repo for context"
- TechnologyService: "Group related dependencies into logical services"
BusinessObject Prompt (Improved)
NAMING RULES (CRITICAL FOR CONSISTENCY):
1. Use SINGULAR form always (Invoice not Invoices)
2. Use lowercase snake_case for identifier (bus_obj_invoice)
3. Use Title Case for display name (Invoice)
MANDATORY SYNONYM RULES - ALWAYS use these canonical names:
- Customer (NEVER: Client, User, Buyer, Account)
- Order (NEVER: Purchase, Transaction, Sale)
- Position (NEVER: Line Item, Order Line, Item)
Output stable, deterministic results.
ApplicationComponent Prompt (Improved)
NAMING RULES (CRITICAL FOR CONSISTENCY):
1. Use ONLY the directory name, NEVER include repository name prefix
- Correct: app_comp_static
- Wrong: app_comp_flask_invoice_generator_static
2. Use lowercase snake_case for identifier
Output stable, deterministic results.
| Metric | Before | After | Improvement |
|---|---|---|---|
| Consistency | 28% | 78.6% | +50.6% |
| Unstable elements | 18 | 3 | -83% |
| Count variance | 1.84 | 0.24 | -87% |
Unstable elements may correlate with graph properties of their source nodes. By analyzing stable vs unstable elements' sources, we can identify patterns and apply graph-based filters.
- Run enrichment algorithms on the graph (PageRank, Louvain, k-core, etc.)
- Correlate stability with graph properties
- Apply filters in derivation queries
Step 1: Run Enrichment
from deriva.modules.derivation import enrich
enrichments = enrich.enrich_graph(
    edges=edges,
    algorithms=['pagerank', 'louvain', 'kcore', 'articulation_points', 'degree']
)

# Write to graph: graph_manager.batch_update_properties(enrichments)

Step 2: Correlate Stability
Query source nodes for stable vs unstable elements:
// Get graph properties for element sources
MATCH (e) WHERE e.identifier IN $element_ids
WITH e.properties_json as props
MATCH (n {id: source_id})
RETURN n.pagerank, n.kcore_level, n.out_degree, n.in_degree

Analysis result:
STABLE vs UNSTABLE Source Nodes:
+-----------+---------+----------+------------+
| Metric | Stable | Unstable | Difference |
+-----------+---------+----------+------------+
| PageRank | 0.0188 | 0.0071 | +164% |
| K-core | 1.15 | 1.00 | +15% |
| Out-degree| 2.31 | 0.00 | +inf |
+-----------+---------+----------+------------+
Step 3: Apply Graph-Based Filters
Update input_graph_query to filter on graph properties:
MATCH (n)
WHERE (n:`Graph:TypeDefinition` OR n:`Graph:BusinessConcept`)
AND n.active = true
AND (n.out_degree > 0 OR n.pagerank > 0.01) -- Filter floating nodes
RETURN n.id, n.name, n.pagerank, n.kcore_level

| Source Type | Graph Properties | Stability |
|---|---|---|
| Structural (TypeDefinition, Method, File) | High out-degree, connected | More stable |
| Semantic (BusinessConcept) | Zero out-degree, floating | Less stable |
Semantic nodes extracted by LLM have no structural relationships in the code graph. When derivation uses these as sources, the LLM has less context, leading to inconsistent outputs.
This observation aligns with broader challenges in neural-symbolic integration: Cai 2025 identifies "representation gaps between neural network outputs and structured symbolic representations" as a fundamental challenge, particularly for complex relational reasoning. The graph-based filtering approach helps bridge this gap by grounding LLM interpretation in structural context.
Recommendation: For element types that can use either structural or semantic sources, prefer structural sources or require minimum graph connectivity.
Detailed chronological record of optimization sessions and findings.
2026-01-03: Initial Config Optimization
Repository: flask_invoice_generator (small) Model: openai-gptx Runs: 5
| Session | Consistency | Element Counts | Issues |
|---|---|---|---|
| bench_20260103_094609 | 28% | 13-17 | 18 unstable elements |
Main Problems:
- BusinessObject: naming variants (positions/position, invoicedetails/invoice_details)
- ApplicationComponent: repo prefix inconsistency
- TechnologyService: detection variance
- BusinessObject (v1-v3): Added explicit naming rules, mandatory synonym rules, singular form requirement
- ApplicationComponent (v1-v2): Never include repo name prefix, use only directory name
- TechnologyService (v1-v2): Standard service categories list, grouping rules
- DataObject (v1-v2): Generic names only
| Session | Consistency | Element Counts | Issues |
|---|---|---|---|
| bench_20260103_095630 | 78.6% | 12-13 | 3 unstable elements |
| bench_20260103_101845 | 100% | 12 | 0 (DataObject test) |
Improvement: +50.6% consistency, 83% fewer unstable elements
- Explicit naming rules are critical
- Ban synonyms explicitly
- Standard category lists reduce variance
- Add determinism instruction to every LLM prompt
- Test one config at a time with --nocache-configs
2026-01-03: Medium Repository Test
Repository: full-stack-fastapi-template (medium) Model: openai-gptx
- Extraction failures - "Response missing 'dependencies' array" in ExternalDependency extraction
- Edge creation failures - Node ID mismatches for TypeDefinition and Test extractions
- These are infrastructure/schema issues, not derivation LLM issues
| Session | Consistency | Notes |
|---|---|---|
| bench_20260103_100150 | 61.1% | DataObject naming variants |
Observation: Medium repo has underlying extraction issues to resolve before clean benchmarking.
2026-01-03: Relationship Derivation Fix
Issue: Run failures with "Invalid relationship type: Association"
The LLM was outputting "Association" which is not a valid ArchiMate relationship type.
Updated build_relationship_prompt() in deriva/modules/derivation/base.py:
VALID RELATIONSHIP TYPES (use ONLY these exact names):
- Composition, Aggregation, Serving, Realization, Access, Flow, Assignment
INVALID TYPES (NEVER use these):
- Association (use Serving or Flow instead)
- Dependency (use Serving instead)
- Uses (use Serving instead)
| Session | Consistency | Runs | Failures |
|---|---|---|---|
| bench_20260103_103526 | 85.7% | 3 | 0 |
| Metric | Baseline | Final | Improvement |
|---|---|---|---|
| Consistency | 28% | 85.7% | +57.7% |
| Stable elements | 7 | 12 | +71% |
| Unstable elements | 18 | 2 | -89% |
| Run failures | ~33% | 0% | -100% |
2026-01-03: Cross-Repository Generalization Test
Objective: Verify configs are generic and don't overfit to test repositories
Repositories tested:
- flask_invoice_generator (small)
- full-stack-fastapi-template (medium)
| Repository | Runs | Consistency | Status |
|---|---|---|---|
| flask_invoice_generator | 3/3 | 85.7% | Configs work well |
| full-stack-fastapi-template | 3/3 | 57.1% | Extraction infra issues |
Evidence of generalization:
- Consistent naming patterns across repos (same prefixes)
- Small repo high consistency (85.7%) proves prompts work generically
- Medium repo failures are infrastructure bugs, NOT config issues
Conclusion: No config adjustments needed for generalization. Derivation configs are generic.
2026-01-08: Efficient Targeted Optimization Workflow
Model: mistral-devstral2 Repository: flask_invoice_generator
See Optimization Methodology for the full workflow.
| Version | Stable | Unstable | Consistency | Key Change |
|---|---|---|---|---|
| v1 | 0/5 | 5 | 0% | Original vague prompt |
| v2 | 1/3 | 2 | 33% | Added canonical names |
| v3 | 3/8 | 5 | 38% | Added determinism instruction |
| v4 | 3/4 | 1 | 75% | Excluded transitive deps |
| Element Type | Baseline | After Optimization |
|---|---|---|
| ApplicationService | 0% | 100% |
| BusinessActor | 0% | 100% |
| DataObject | 50% | 100% |
| BusinessProcess | 0% | 50% |
| BusinessObject | 25% | 50% |
| TechnologyService | 0% | 75% |
Overall: 28% → 75%
| Config | Method | Variance | Status |
|---|---|---|---|
| Repository | Deterministic | 0 | 100% |
| Directory | Deterministic | 0 | 100% |
| File | Deterministic | 0 | 100% |
| Method | AST (Python) | 0 | 100% |
| TypeDefinition | AST (Python) | 0 | 100% |
| ExternalDependency | Parser | 0 | 100% |
| Test | Cache | 0 | 100% |
| BusinessConcept | LLM | 84-86 | 1-2 unstable |
| Technology | LLM | 86-87 | 1 unstable |
Key finding: Only LLM-based extraction configs have variance. AST-based and deterministic parsers are 100% consistent.
2026-01-08: Graph Property-Based Optimization
See Graph-Based Optimization for the full methodology.
| Metric | Baseline | After Filter | Change |
|---|---|---|---|
| Element Consistency | 81.25% | 80.0% | -1.25% |
| Unstable Elements | 3 | 3 | Same count |
Key outcome: The specific unstable elements changed completely. Graph filter eliminated instability from "floating" semantic nodes.
2026-01-09: File Classification and CLI Improvements
Objective: Improve file extraction quality and add CLI support for file type management.
- Files with unknown extensions had fileType=None instead of a meaningful default
- No CLI for file types - registry could only be managed via UI
- Improved File Classification Logic - Files with unknown extensions now get file_type="unknown" with extension as subtype
- CLI File Type Management - Added deriva config filetype commands
| Metric | Before | After |
|---|---|---|
| Files with null fileType | ~15% | 0% |
| CLI file type commands | 0 | 4 |
2026-01-10: A/B Testing Framework and Derivation Optimization
Objective: Create fast A/B testing workflow and improve derivation consistency to >=80%
Problem: 41.7% consistency with naming variants
Solution: Canonical identifier table in prompt
Result: 41.7% → 100% consistency
Problem: 28.6% consistency with variants like as_validate_data vs as_validate_input
Solution: Abstraction principle + example-driven prompt with XML tags
Result: 28.6% → 100% consistency
| Element Type | Before | After | Change |
|---|---|---|---|
| DataObject | 41.7% | 100% | +58.3% |
| ApplicationService | 28.6% | 100% | +71.4% |
- Explicit naming rules are critical - "Use snake_case" is not enough; provide exact examples
- Ban synonyms explicitly - "Customer (NEVER: Client, User, Buyer)" works better than "use consistent names"
- Standard category lists reduce variance - Enumerate allowed values
- Add determinism instruction - "Output stable, deterministic results" in every LLM prompt
- Test one config at a time - Use --nocache-configs for targeted testing
- Examples drive consistency - A good example JSON is more effective than verbose rules
- Abstraction level is key - Use generic category names, not domain-specific names (Liang 2025: +30% improvement)
- Graph-based selection over name-based - Filter by structural properties (in_degree, pagerank)
- Never use repository-specific rules - All optimizations must be generic
- Prefer structural sources over semantic - TypeDefinition/Method sources are more stable than BusinessConcept
- Consistency ≠ accuracy - High consistency doesn't guarantee correctness; validate both independently (Raj 2025)
The following advanced optimizations were implemented to further improve token efficiency and consistency.
Functions in deriva/modules/derivation/base.py:
| Function | Purpose |
|---|---|
| estimate_tokens(text) | Estimates token count (~4 chars/token) |
| get_model_context_limit(model) | Returns context limit for model |
| check_prompt_size(prompt, model) | Warns if prompt exceeds 80% of limit |
| limit_existing_elements(elements, max=50) | Keeps top-N elements by confidence |
| stratified_sample_elements(elements, max_per_type=10) | Samples across element types |
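As a rough sketch of how the first and third helpers could work (the context limits and the ~4 chars/token heuristic here are assumptions, not the exact Deriva implementation):

```python
# Assumed context limits per model; illustrative values only
CONTEXT_LIMITS = {"mistral-devstral2": 128_000, "openai-gptx": 128_000}
DEFAULT_LIMIT = 32_000

def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token heuristic."""
    return len(text) // 4

def check_prompt_size(prompt: str, model: str) -> bool:
    """Return True (and warn) if the prompt exceeds 80% of the model's limit."""
    limit = CONTEXT_LIMITS.get(model, DEFAULT_LIMIT)
    tokens = estimate_tokens(prompt)
    if tokens > 0.8 * limit:
        print(f"WARNING: prompt ~{tokens} tokens exceeds 80% of {limit}-token limit")
        return True
    return False
```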
Only include existing elements with graph proximity to new elements:
from deriva.modules.derivation.base import (
    get_connected_source_ids,
    filter_by_graph_proximity,
)

# Get nodes connected within 2 hops
connected_ids = get_connected_source_ids(graph_manager, new_source_ids, max_hops=2)

# Filter to only graph neighbors
filtered = filter_by_graph_proximity(existing_elements, connected_ids)

Benefits:
- 60-90% reduction in context size
- Better relationship quality (only related elements in context)
- Reduced hallucination of spurious relationships
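One plausible implementation of the hop-limited neighbor collection (a sketch of the idea behind get_connected_source_ids, not the actual code):

```python
from collections import deque

def connected_within_hops(edges, seeds: set, max_hops: int = 2) -> set:
    """Collect all node ids reachable from the seeds within max_hops,
    treating edges as undirected for proximity purposes."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```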
The defer_relationships parameter enables a two-phase architecture.
Default Mode (defer_relationships=True): (Recommended)
Phase 1: Create ALL elements (skip relationships)
Phase 2: Single consolidated relationship pass
Legacy Mode (defer_relationships=False):
For each element type:
1. Create elements → 2. Derive relationships → Repeat
Usage:
from deriva.services.derivation import generate_element
from deriva.modules.derivation.base import derive_consolidated_relationships

# Phase 1: Generate all elements without relationships
all_elements = []
for element_type in element_types:
    result = generate_element(
        element_type=element_type,
        defer_relationships=True,  # Skip per-batch relationships
        # ... other params
    )
    all_elements.extend(result["elements"])

# Phase 2: Derive all relationships in one pass
relationships = derive_consolidated_relationships(
    all_elements=all_elements,
    relationship_rules=rules_by_type,
    llm_query_fn=llm.query,
    graph_manager=graph_manager,
)

Benefits:
- Better context - ALL elements available during relationship derivation
- Fewer LLM calls - One pass per element type instead of per batch
- More consistent - Reduces ordering effects
- Graph-aware filtering works better with complete element set
Batch size adapts to candidate count and token limits:
from deriva.modules.derivation.base import (
    calculate_dynamic_batch_size,
    adjust_batch_for_tokens,
)

# Auto-size based on candidate count
batch_size = calculate_dynamic_batch_size(len(candidates))  # 10-25 range

# Reduce if tokens exceed model limit
batch_size = adjust_batch_for_tokens(batch_size, estimated_tokens, model_name)

After Phase 4 optimizations (5 runs, mistral-devstral2, flask_invoice_generator):
| Metric | Result |
|---|---|
| Structural edge consistency | 100% |
| Duration | ~411s |
| Node variance | 87-89 (stable) |
| Elements per run | 22-24 |
| Relationships per run | 20-30 |
Version 0.6.9 introduced several token efficiency improvements that reduce extraction costs by an estimated 40-60%.
Problem: Default JSON formatting with indent=2 adds significant whitespace overhead.
Solution: Use compact serialization with no whitespace:
# Before (wasteful)
json.dumps(data, indent=2)
# {
#   "elements": [
#     {
#       "id": "bus_concept_1",
#       "name": "Customer"
#     }
#   ]
# }

# After (efficient)
json.dumps(data, separators=(",", ":"))
# {"elements":[{"id":"bus_concept_1","name":"Customer"}]}

Savings: ~15% token reduction for JSON payloads.
Where to apply:
- Existing concepts/elements passed to LLM prompts
- Any structured data in prompt context
- NOT for human-readable output or logs
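The whitespace overhead is easy to measure directly; a quick check (the payload is a toy example):

```python
import json

# Toy payload shaped like the element lists passed to prompts
payload = {"elements": [{"id": f"bus_concept_{i}", "name": f"Concept {i}"}
                        for i in range(50)]}

pretty = json.dumps(payload, indent=2)
compact = json.dumps(payload, separators=(",", ":"))

# Compact serialization strips all inter-token whitespace
saved = 1 - len(compact) / len(pretty)
print(f"compact is {saved:.0%} smaller")
```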
Problem: Static instructions repeated in every LLM call waste tokens.
Solution: Separate prompts into system (static) and user (dynamic) portions:
| Prompt Type | Content | Sent When |
|---|---|---|
| System prompt | Role definition, naming rules, output format, constraints | Once per session (cached by provider) |
| User prompt | File content, existing concepts, specific context | Every call |
Implementation pattern:
# System prompt - static instructions (sent once per session)
system_prompt = """
You are an expert at extracting business concepts from source code.
NAMING RULES:
1. Use singular form (Invoice not Invoices)
2. Use Title Case for names
3. Use lowercase snake_case for identifiers
OUTPUT FORMAT:
Return valid JSON with "concepts" array.
"""
# User prompt - dynamic content (per file/batch)
user_prompt = f"""
<existing_concepts>
{json.dumps(existing, separators=(",", ":"))}
</existing_concepts>
<file path="{file_path}">
{file_content}
</file>
Extract business concepts from this file.
"""Benefits:
- Many providers cache system prompts across calls
- Reduces redundant instruction tokens
- Cleaner separation of concerns
- Easier to maintain and update instructions
Problem: Each small file requires a separate LLM call with full prompt overhead.
Solution: Batch multiple small files into single LLM calls using the batch_size configuration.
Configuration:
# CLI usage
uv run deriva-cli run extraction --repo myrepo --batch-size 5
# Or set in extraction config
uv run deriva-cli config update extraction BusinessConcept \
-p '{"batch_size": 5}'

How batching works:
- Files are sorted by size (smallest first)
- Files are grouped until batch token limit is reached
- Each batch is sent as a single LLM call
- Results are disaggregated back to individual files
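The grouping steps above can be sketched as follows (the ~4 chars/token estimate is a heuristic and the function itself is illustrative, not the Deriva implementation):

```python
def build_batches(files, batch_size: int = 5, max_batch_tokens: int = 4000):
    """Group (path, content) pairs, smallest first, until either the
    file-count or the approximate token limit for the batch is reached."""
    batches, current, current_tokens = [], [], 0
    for path, content in sorted(files, key=lambda f: len(f[1])):  # smallest first
        tokens = len(content) // 4  # ~4 chars/token heuristic
        if current and (len(current) >= batch_size
                        or current_tokens + tokens > max_batch_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append((path, content))
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Each resulting batch is then rendered into a single prompt like the example below, and per-file results are matched back by index.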
Batching parameters:
| Parameter | Default | Description |
|---|---|---|
| batch_size | 1 | Maximum files per batch |
| max_batch_tokens | 4000 | Token limit per batch |
| batch_by_directory | false | Group files from same directory |
Example batch prompt:
<files>
<file path="models/customer.py" index="0">
class Customer:
name: str
email: str
</file>
<file path="models/order.py" index="1">
class Order:
customer_id: int
total: float
</file>
<file path="models/item.py" index="2">
class Item:
name: str
price: float
</file>
</files>
Extract business concepts from each file. Return results indexed by file.
Savings: 30-50% token reduction depending on file sizes and batch efficiency.
Best practices for batching:
- Start with batch_size=3-5 for initial testing
- Increase for small files - config files, models, schemas batch well
- Keep batch_size=1 for large files - complex modules need individual attention
- Monitor quality - very high batch sizes may reduce extraction quality
- Use batch_by_directory when files in same directory share context
When all three optimizations are applied together:
| Optimization | Individual Savings | Cumulative |
|---|---|---|
| Compact JSON | ~15% | 15% |
| System/User separation | ~10-15% | 25-30% |
| Multi-file batching | ~30-50% | 40-60% |
Measuring token usage:
# Run extraction with verbose logging to see token counts
uv run deriva-cli run extraction --repo myrepo -v
# Check logs for token usage per step
grep "tokens" workspace/logs/extraction_*.jsonlBefore optimizing prompts for consistency, ensure token efficiency:
- JSON payloads use compact serialization (separators=(",", ":"))
- Static instructions are in system prompt, dynamic content in user prompt
- Small files are batched appropriately (batch_size > 1)
- Large context is filtered (use limit_existing_elements() or stratified_sample_elements())
- Graph proximity filtering is enabled for relationship derivation
| Citation | Reference | Key Contribution |
|---|---|---|
| Arora 2016 | Arora et al., "Extracting domain models from natural-language requirements" | Industrial NLP extraction: 83-96% correctness, explicit naming rules |
| Cai 2025 | Cai et al., "Practices, opportunities and challenges in the fusion of knowledge graphs and large language models" | KG-LLM integration taxonomy (KEL/LEK/LKC), neural-symbolic representation gaps |
| Castillo 2019 | Castillo et al., "ArchiRev - Reverse engineering toward ArchiMate models" | Code-to-ArchiMate benchmark: 68% precision, 80% recall |
| Chaaben 2022 | Chaaben et al., "Towards using Few-Shot Prompt Learning for Automating Model Completion" | Few-shot prompting without fine-tuning, frequency-based ranking |
| Chaaben 2024 | Chaaben et al., "On the Utility of Domain Modeling Assistance with LLMs" | 20% time reduction, 33-56% suggestion contribution rates |
| Chen 2023 | Chen et al., "Automated Domain Modeling with LLMs: A Comparative Study" | F1 scores (0.76 classes, 0.34 relationships), chain-of-thought caution |
| Coutinho 2025 | Coutinho et al., "LLM-Based Modeling Assistance for Textual Ontology-Driven Conceptual Modeling" | Guidance texts significantly improve output quality |
| Liang 2025 | Liang et al., "Integrating Large Language Models for Automated Structural Analysis" | Domain-specific ICL achieves 100% accuracy; benchmarking methodology |
| Raj 2025 | Raj et al., "Semantic Consistency for Assuring Reliability of Large Language Models" | Critical: Consistency and accuracy are independent properties |
| Reitemeyer 2025 | Reitemeyer & Fill, "Applying LLMs in Knowledge Graph-based Enterprise Modeling" | LLMs show higher consistency than humans, human-in-the-loop essential |
| Wang 2025 | Wang & Wang, "Assessing Consistency and Reproducibility in LLM Outputs" | 3-5 runs optimal for consistency |
| Resource | Description |
|---|---|
| ArchiMate 3.2 Specification | Official ArchiMate standard from The Open Group |
| Mastering ArchiMate | Gerben Wierda's comprehensive guide to ArchiMate modeling |
| ArchiMate Best Practices | Community-curated best practices for Archi tool usage |
| ArchiMate Cookbook | Eero Hosiaisluoma's practical ArchiMate patterns |
- ANSI/NISO Z39.19-2005 (R2010) - Guidelines for Controlled Vocabularies
- ISO 704:2022 - Terminology work: Principles and methods
- OMG SBVR - Semantics of Business Vocabulary and Business Rules
- ARCHIMATE.md - ArchiMate element definitions, relationships, and metamodel reference
- BENCHMARKS.md - User guide for running benchmarks
- CONTRIBUTING.md - Architecture and development patterns
- ArchiMate Best Practices & Resource Guide - Detailed prompt templates for ArchiMate derivation