143 commits
541bf37
Add concat op to webgpu. (#20068)
yomaytk Mar 4, 2026
24d2ee0
[WebGPU] Fix wait logic for inflight jobs (#20096)
nikhilJain17 Mar 4, 2026
1a29907
hexagon: add llama-completion runner script (#20095)
tboinovski1 Mar 4, 2026
69fd345
opencl: add `SET`, support i32 for `CPY`, minor refactor for cpy (#20…
lhez Mar 5, 2026
7a99dc8
hexagon: Flash Attention optimizations (dma, mpyacc, multi-row) and M…
max-krasnyansky Mar 5, 2026
92f7da0
chore : correct typos [no ci] (#20041)
marcelpetrick Mar 5, 2026
5e335ba
webui: Improvements for Models Selector UI (#20066)
allozaur Mar 5, 2026
cf23251
convert : register Qwen 3.5 ForCausalLM for text only (#20119)
CISC Mar 5, 2026
b5ed0e0
cli : add command and file auto-completion (#19985)
CISC Mar 5, 2026
872646b
model : update Qwen3.5 model type detection (#20126)
EZForever Mar 5, 2026
2cd20b7
CUDA: Improve performance via less synchronizations between token (#…
aendk Mar 5, 2026
a0ed91a
models : kda chunk size = 16 (#19827)
ymcki Mar 5, 2026
2b10b62
hexagon: add fp16 support for binary ops: add,sub,mul,div (#20139)
YardenTal44 Mar 6, 2026
6c97bff
opencl: add neg, exp and diag (#20127)
lhez Mar 6, 2026
f7db3f3
cli : Don't clear system prompt when using '/clear' (#20067)
roj234 Mar 6, 2026
17a4258
kv-cache : fix M-RoPE checkpoints (#20132)
ggerganov Mar 6, 2026
2850bc6
ggml-cpu: fix data race for debug asserts (#20148)
JohannesGaessler Mar 6, 2026
f6235a4
webui: Agentic Loop + MCP Client with support for Tools, Resources an…
allozaur Mar 6, 2026
f5ddcd1
Checkpoint every n tokens: squash (#20087)
pwilkin Mar 6, 2026
388baab
context: ignore zero scale LoRAs when checking sameness (#20166)
TimNN Mar 6, 2026
1e38a7a
CUDA: use shared mem for ssm_conv (#20128)
am17an Mar 6, 2026
c6980ff
ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083) (#20130)
shalinib-ibm Mar 6, 2026
ba2ff79
ggml: update comments for backends which have no memory to report (#2…
taronaeo Mar 6, 2026
d48e876
ggml-cuda: add mem check for fusion (#19916)
am17an Mar 6, 2026
ba2fd11
cpu: skip redudant ROPE cache updates (#20149)
max-krasnyansky Mar 6, 2026
e68f2fb
server : preserve anthropic thinking blocks in conversion (#20120)
T0mSIlver Mar 6, 2026
34df42f
hexagon: add f32 ssm_conv op (#20122)
tboinovski1 Mar 6, 2026
566059a
Autoparser - complete refactoring of parser architecture (#18675)
pwilkin Mar 6, 2026
7463687
Add @pwilkin to CODEOWNERS for autoparser code (#20174)
pwilkin Mar 6, 2026
649f064
quants : Add memsets and other fixes for IQ quants (#19861)
bartowski1182 Mar 6, 2026
2f2923f
Autoparser: add optional argument reshuffle capability (#20171)
pwilkin Mar 6, 2026
c024d85
Autoparser: True streaming (#20177)
pwilkin Mar 7, 2026
6fce5c6
opencl: add l2_norm (#20160)
lhez Mar 7, 2026
c5a7788
ggml: add GATED_DELTA_NET op (#19504)
am17an Mar 7, 2026
213c4a0
[SYCL] supprt Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190)
arthw Mar 8, 2026
ff52ee9
server : correct index on finish in OAI completion streams (#20226)
decahedron1 Mar 8, 2026
b283f6d
Revert to OAI-compatible args (#20213)
pwilkin Mar 8, 2026
a950479
readme : update infra list (#20212)
Defilan Mar 8, 2026
a976ff0
llama: end-to-end tests (#19802)
JohannesGaessler Mar 8, 2026
cd18a50
vulkan: Fix data races in coopmat1 mul_mat(_id) (#20084)
jeffbolznv Mar 8, 2026
d088d5b
ggml-vulkan: Add ELU op support (#20183)
GiantPrince Mar 8, 2026
62b8143
Fix structured outputs (#20223)
pwilkin Mar 8, 2026
9b24886
Fix compile bug (#20203)
pwilkin Mar 8, 2026
451ef08
common : gracefully handle incomplete output (#20191)
aldehir Mar 8, 2026
35bee03
graph : remove redundant scale_w parameter (#20235)
CISC Mar 8, 2026
d417bc4
server : do not create checkpoints right after mtmd chunks (#20232)
ggerganov Mar 8, 2026
92bde36
docs: add detailed breakdown of KV cache compaction via Attention Mat…
claude Mar 8, 2026
48d5dc0
feat: add KV cache compaction via Attention Matching POC tool
claude Mar 8, 2026
97c64fb
PEG parser for LFM2 (#20251)
pwilkin Mar 9, 2026
ae87863
llama-bench: introduce `-hf` and `-hff` flags & use `--mmap 1` by def…
taronaeo Mar 9, 2026
5f4cdac
cuda : display total and free VRAM capacity during device initializat…
tehsiuhuang Mar 9, 2026
b2f460b
vulkan: skip zero size tensors in backend copies (#20233)
0cc4m Mar 9, 2026
0beb8db
ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (#…
bertaye Mar 9, 2026
55faeab
test: add unit tests for KV cache compaction math utilities
claude Mar 9, 2026
e2763a6
contributing: limit open PRs for new contributors to 1 (#20036)
am17an Mar 9, 2026
b518195
llama-quant : left-align tensor names in output (#20117)
ddh0 Mar 9, 2026
18c4c25
docs: add user stories for KV cache compaction feature
claude Mar 9, 2026
509907d
docs: add algorithms and techniques reference for KV cache compaction
claude Mar 9, 2026
8569f90
docs: add researcher rationale and improvement opportunities for KV c…
claude Mar 9, 2026
e8bbc73
ggml-cuda: disable gdn for musa (#20278)
am17an Mar 9, 2026
107d599
server : add kill switch when server is stuck (#20277)
ggerganov Mar 9, 2026
7681ba5
docs: add Cartridges (Eyuboglu 2025) as prior art context in rational…
claude Mar 9, 2026
d4947b8
docs: add adjacent concepts cross-pollination map for KV compaction
claude Mar 9, 2026
7f86341
docs: enrich adjacent concepts with deep research findings
claude Mar 9, 2026
7a006a6
docs: add WildCat/RPCholesky as unified framework, enrich Nystrom sec…
claude Mar 9, 2026
5d98a7c
docs: add Frank-Wolfe/attention equivalence, CKA, CS-VLM, grand unifi…
claude Mar 9, 2026
80339f7
docs: add 6 more adjacent concepts — sparse GP, token merging, submod…
claude Mar 9, 2026
bf28d86
docs: enrich coresets, WildCat, and Caratheodory with deep research f…
claude Mar 9, 2026
c152c85
docs: enrich FW, sketching, CS sections + add open theoretical questions
claude Mar 9, 2026
43e1cbd
models : fix assert in mamba2 graph (#20270)
ggerganov Mar 9, 2026
f76565d
common: map developer role to system (#20215)
pwilkin Mar 9, 2026
d6e1556
server : fix off-by-1 in server_tokens::size_up_to_pos() (#20279)
ggerganov Mar 9, 2026
344ee2a
server : warn swa-full is not supported for non-SWA models (#20291)
ggerganov Mar 9, 2026
ed0007a
metal : add upscale (#20284)
ggerganov Mar 9, 2026
96cfc49
server : fix checkpoints n_tokens calculation (#20287)
ggerganov Mar 9, 2026
e22cd0a
metal : extend mul_mv_ext to BF16, Q2_K, Q3_K (#20250)
arkavo-com Mar 9, 2026
60a4387
Add strix-halo-optimizer skill for AMD Ryzen AI Max+ 395
fabiantax Mar 4, 2026
30dd1bc
Add eval workspace and metadata for strix-halo-optimizer iteration 1
fabiantax Mar 4, 2026
8ad6b83
Add eval run outputs and timing for iteration 1
fabiantax Mar 4, 2026
959bc8f
Add grading results and benchmark summary for iteration 1
fabiantax Mar 4, 2026
384266b
Rewrite AGENTS.md for AI-optimized fork development
fabiantax Mar 5, 2026
28020a0
Move strix-halo-optimizer skill to .claude/skills/ and expand content
fabiantax Mar 5, 2026
938e033
Add iteration 2 eval results for strix-halo-optimizer skill
fabiantax Mar 5, 2026
ea0497a
Update skill with research findings: gfx1151, Vulkan parity, MoE models
fabiantax Mar 5, 2026
73ff9e9
Add Vulkan cooperative matrix and FA tuning details
fabiantax Mar 5, 2026
96ebc81
Add iteration-3 eval metadata for strix-halo-optimizer skill
fabiantax Mar 5, 2026
80f1655
Add timing data for eval-3 with_skill run
fabiantax Mar 5, 2026
10e3f20
Add timing data for eval-1 with_skill run
fabiantax Mar 5, 2026
ca89fec
Add timing data for eval-2 with_skill run
fabiantax Mar 5, 2026
9b32d31
Add eval-3 without_skill outputs and timing
fabiantax Mar 5, 2026
ef72276
Add eval-2 without_skill outputs and timing
fabiantax Mar 5, 2026
8f85db6
Add eval-1 without_skill outputs and timing
fabiantax Mar 5, 2026
c25396f
Add eval-4 with_skill outputs and timing
fabiantax Mar 5, 2026
adc2b8c
Add eval-5 with_skill outputs and timing
fabiantax Mar 5, 2026
71db90a
Add eval-5 without_skill outputs and timing
fabiantax Mar 5, 2026
6ef221d
Add eval-4 without_skill outputs and timing
fabiantax Mar 5, 2026
c31cfa6
Add eval-6 with_skill outputs and timing
fabiantax Mar 5, 2026
4c9b6dd
Add iteration-3 eval results, grading, and benchmark viewer
fabiantax Mar 5, 2026
4e5622c
ggml-cuda: auto-detect AMD APU unified memory and improve iGPU support
fabiantax Mar 5, 2026
9230d66
ggml-cuda: tune flash attention kernels for RDNA 3.5
fabiantax Mar 5, 2026
2034f53
ggml-cuda: add UMA weight prefetching for integrated GPUs
fabiantax Mar 5, 2026
f229ccc
ggml-backend: selective backend sync and eliminate redundant sync on …
fabiantax Mar 5, 2026
4bdaa8f
ggml-cuda: add auto-tuning for flash attention parallel_blocks and GE…
fabiantax Mar 5, 2026
66c2d8d
docs: add implementation plan for auto-tuning flash attention
fabiantax Mar 5, 2026
e7fa9d0
docs: add flash attention auto-tuning documentation
fabiantax Mar 5, 2026
305fec9
common: add UMA auto-configuration for iGPU unified memory systems
fabiantax Mar 5, 2026
c9c1948
common: add APEX-inspired bandwidth-aware layer splitting for UMA
fabiantax Mar 5, 2026
51cbe35
docs: add user stories for APEX runtime scheduling
fabiantax Mar 5, 2026
428d70e
docs: add APEX runtime scheduling implementation plan
fabiantax Mar 5, 2026
e0bb11e
implement APEX runtime scheduling for hybrid CPU-GPU inference
fabiantax Mar 5, 2026
7d21e43
vulkan: add RDNA3 pipeline config, iGPU flash attention tuning, and U…
fabiantax Mar 5, 2026
7435012
fix: AVX512 build errors with GGML_ZEN5 non-native builds
fabiantax Mar 5, 2026
298be3e
Add Strix Halo benchmark script for validating build performance
fabiantax Mar 5, 2026
70bed2d
docs: add MoE token generation optimization stories, benchmark script…
fabiantax Mar 5, 2026
9ce49f3
Add MoE expert selection analyzer tool for prefetch optimization
fabiantax Mar 5, 2026
a2f7157
Add software prefetch for MoE expert weights in CPU MUL_MAT_ID
fabiantax Mar 6, 2026
3f39d21
Make UMA prefetch expert-aware for MoE on Strix Halo
fabiantax Mar 6, 2026
5677806
fix: moe-analyzer arg parsing, expert auto-detection, and token gener…
fabiantax Mar 6, 2026
e69b782
vulkan: fused SSM recurrence + batched elementwise + shared memory ti…
fabiantax Mar 8, 2026
4c5671f
chore: update .gitignore for graphrag pipeline artifacts
fabiantax Mar 9, 2026
c79e782
docs: add Strix Halo optimization report and user stories
fabiantax Mar 9, 2026
44a2af4
feat: add ModernBERT NER+RE fine-tuning pipeline for GraphRAG
fabiantax Mar 9, 2026
0163079
feat: add GraphRAG pipeline — Rust NER+RE with FalkorDB integration
fabiantax Mar 9, 2026
3cf69a2
chore: add Claude Code commands, settings, and unfinished prototypes
fabiantax Mar 9, 2026
8bea599
docs: add developer machine specs (Ryzen AI MAX+ 395) to CLAUDE.md
claude Mar 10, 2026
b20cc4b
docs: add 500 t/s throughput goal and Attention Matching strategy to …
claude Mar 10, 2026
9952316
docs: add implementation plan for KV cache compaction integration
claude Mar 10, 2026
4ad6e55
docs: add comprehensive gap analysis for KV cache compaction
claude Mar 10, 2026
2271991
feat: implement KV cache compaction via Attention Matching (Phase 1+2)
claude Mar 10, 2026
41d78b4
refactor: extract hparams from mctx to reduce upstream merge fragility
claude Mar 10, 2026
5a4c985
feat: add Q capture for repeat-prefill reference queries (Phase 4)
claude Mar 10, 2026
ae3c1fd
feat: add non-uniform per-head budgets for KV cache compaction (Phase 5)
claude Mar 10, 2026
0fa9866
feat: add online auto-compaction and bias serialization (Phase 7)
claude Mar 10, 2026
9491826
fix: correct AM compaction index mismatch and C_v solver instability
claude Mar 10, 2026
5f15148
feat: add per-layer beta injection and KV cache defragmentation
claude Mar 10, 2026
1dbbc87
chore: add wikitext-2-raw/ to .gitignore
claude Mar 11, 2026
10eda02
Add KV cache compaction development roadmap timeline to README
claude Mar 11, 2026
cfbf02b
fix: disable UMA profiler auto-enable — fixes 5x server regression
fabiantax Mar 14, 2026
3b472fb
feat: cache-aware expert routing API + test harness (WIP)
fabiantax Mar 15, 2026
000a010
feat: working cache-aware expert routing via llm_graph_input
fabiantax Mar 15, 2026
0a3eda7
feat: --expert-cache-bonus flag for cache-aware MoE routing in server
fabiantax Mar 15, 2026
15c5198
Merge remote-tracking branch 'origin/main' into claude/kv-cache-compa…
fabiantax Mar 27, 2026
f548100
wip: uncommitted work before Linux migration
fabiantax Apr 10, 2026
41 changes: 41 additions & 0 deletions .claude/commands/anno-ner.md
@@ -0,0 +1,41 @@
Run the Rust NER extraction pipeline using the anno crate (GLiNER zero-shot). Pass a source file path and options.

## Instructions

1. Build and run the pipeline in `graphrag-pipeline/`:
```bash
cd C:/Users/fabia/Projects/llama.cpp/llama.cpp/graphrag-pipeline
cargo run -- $ARGUMENTS
```

2. If no arguments given, show usage help:
```bash
cargo run -- --help
```

## CLI Options

- `--source <file>` — Input text file (paper, profiling log, code comments)
- `--dry-run` — Print extracted entities/relations without writing to FalkorDB
- `--ner-only` — Skip LLM relation extraction, NER pass only (fast, free)
- `--labels <csv>` — Custom entity types (default: hardware,gpu_feature,optimization_technique,algorithm,software_framework,performance_metric,memory_pattern,kernel_operation,model_architecture,constraint,data_structure,research_paper)

## How It Works

- **anno crate**: Uses `GLiNEROnnx` with model `onnx-community/gliner_small-v2.1`
- **ZeroShotNER trait**: `extract_with_types(text, labels, threshold=0.5)`
- Chunks text at ~600 tokens (2400 chars) with 400-char overlap (GraphRAG optimal)
- Outputs `<source>_extracted.json` with entities and relations
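
The chunking scheme described above (~2400-character windows with a 400-character overlap) can be sketched as follows. This is an illustration only — `chunk_text`, its signature, and the purely character-based windowing are assumptions; the actual implementation lives in `graphrag-pipeline/` and may differ:

```rust
/// Hypothetical sketch of the chunker: fixed-size character windows
/// with overlap, so entities straddling a boundary appear in both chunks.
fn chunk_text(text: &str, chunk_chars: usize, overlap_chars: usize) -> Vec<String> {
    assert!(overlap_chars < chunk_chars);
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_chars - overlap_chars; // advance per chunk
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_chars).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

With the defaults above, a 5000-character input yields three chunks starting at offsets 0, 2000, and 4000.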

## Typical Workflows

- **Quick NER scan**: `--source paper.txt --ner-only --dry-run`
- **Full pipeline**: `--source paper.txt` (needs ANTHROPIC_API_KEY + FalkorDB running)
- **Custom domain**: `--source log.txt --labels "kernel,bandwidth,latency,occupancy" --ner-only`

## Build

```bash
cd C:/Users/fabia/Projects/llama.cpp/llama.cpp/graphrag-pipeline
cargo build
```
29 changes: 29 additions & 0 deletions .claude/commands/arxiv.md
@@ -0,0 +1,29 @@
Search arXiv for papers and read their contents. Pass a search query or paper ID.

## Instructions

1. Load arXiv MCP tools:
- Use ToolSearch to load: `select:mcp__arxiv-server__search_papers,mcp__arxiv-server__read_paper,mcp__arxiv-server__list_papers,mcp__arxiv-server__download_paper`

2. Execute the user's request: $ARGUMENTS

3. Search phase:
- Use `mcp__arxiv-server__search_papers` with the query
- Search tips: prefix `ti:` for title, `au:` for author
- Category filters: `cs.LG` (ML), `cs.AR` (architecture), `cs.DC` (distributed), `cs.PF` (performance)
- Example: `ti:flash attention au:dao cat:cs.LG`

4. Read phase:
- For each relevant paper, use `mcp__arxiv-server__read_paper` with the arXiv ID
- Summarize: problem, method, key results, relevance to GPU optimization

5. Save findings:
- If FalkorDB is running (use /falkordb), create research_paper nodes and relations
- Save paper text to `graphrag-pipeline/sources/` for later NER extraction
- Report paper IDs, titles, and key takeaways

## Quick Examples

- Search: `mcp__arxiv-server__search_papers` with query `"flash attention v3 hopper"`
- Read: `mcp__arxiv-server__read_paper` with id `"2307.08691"`
- List recent: `mcp__arxiv-server__list_papers` with category `"cs.LG"` and max_results `5`
32 changes: 32 additions & 0 deletions .claude/commands/falkordb.md
@@ -0,0 +1,32 @@
Query or modify the FalkorDB gpu_optimization knowledge graph. Pass a Cypher query or describe what you want.

## Instructions

1. Load FalkorDB MCP tools:
- Use ToolSearch to load: `+falkordb` (finds graph query/create tools)

2. Connection details:
- Host: localhost:6379 (Redis protocol)
- Graph name: `gpu_optimization`
- Browser UI: http://localhost:3000

3. Execute the user's request: $ARGUMENTS

## Common Cypher Patterns

- **List all node labels**: `CALL db.labels()`
- **List all relation types**: `CALL db.relationshipTypes()`
- **Find a node**: `MATCH (n {name: 'SharedMemoryTiling'}) RETURN n`
- **All neighbors**: `MATCH (n {name: 'DeltaNet'})-[r]-(m) RETURN n, type(r), m`
- **Shortest path**: `MATCH p=shortestPath((a {name: 'X'})-[*]-(b {name: 'Y'})) RETURN p`
- **Create node**: `CREATE (:optimization_technique {name: 'MyTech', description: 'desc', type: 'optimization_technique'})`
- **Create relation**: `MATCH (a {name: 'X'}), (b {name: 'Y'}) CREATE (a)-[:IMPROVES]->(b)`
- **Fuzzy search**: `MATCH (n) WHERE n.name CONTAINS 'SSM' RETURN n`
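
The create-node pattern above interpolates entity names directly into the Cypher string; when generating queries programmatically, quotes in names need escaping. A minimal sketch, assuming backslash escaping for Cypher string literals (`create_node_query` is an illustrative helper, not part of this repo):

```rust
/// Build a CREATE statement matching the pattern above, escaping
/// backslashes and single quotes so names like "O'Neill" are safe.
fn create_node_query(label: &str, name: &str, description: &str) -> String {
    let esc = |s: &str| s.replace('\\', "\\\\").replace('\'', "\\'");
    format!(
        "CREATE (:{} {{name: '{}', description: '{}', type: '{}'}})",
        label,
        esc(name),
        esc(description),
        label
    )
}
```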

## Relation Types

IMPLEMENTS, USES, OPTIMIZES, TARGETS, IMPROVES, REDUCES, ELIMINATES, MEASURES, LIMITS, ENABLES, EXTENDS, BUILDS_ON, VALIDATES, COMPETES_WITH, IS_PART_OF, IS_FEATURE_OF, REQUIRES, COULD_IMPROVE, INTRODUCES, PORTS_TO

## Entity Types

hardware, gpu_feature, optimization_technique, algorithm, software_framework, performance_metric, memory_pattern, kernel_operation, model_architecture, constraint, data_structure, research_paper
149 changes: 149 additions & 0 deletions .claude/commands/gliner.md
@@ -0,0 +1,149 @@
GLiNER zero-shot NER and relation extraction reference. Use when working with GLiNER models, gline-rs, or the graphrag-pipeline.

## GLiNER Model Modes

GLiNER supports multiple extraction modes via different model architectures:

### 1. Span Mode (NER only)
- Models: `gliner_small-v2.1`, `gliner_large-v2.1`, `gliner_multi-v2.1`
- API: `GLiNER::<SpanMode>::new(params, runtime_params, tokenizer, model)`
- Input: `TextInput::from_str(&texts, &labels)`
- Output: `SpanOutput` — list of entity spans with class, text, probability
- Fast, small models (175MB int8). Good for high-throughput NER.

### 2. Token Mode (NER, multitask models)
- Models: `gliner-multitask-large-v0.5`, `gliner-relex-large-v0.5`
- API: `TokenPipeline::new(tokenizer)?.to_composable(&model, &params)`
- Same input/output as span mode but uses token-level classification
- Required for multitask models that also support relation extraction

### 3. Relation Extraction (via composed pipeline)
- Model: `gliner-multitask-large-v0.5` (same model does both NER + RE)
- Requires `TokenPipeline` (NER) chained with `RelationPipeline` (RE)
- Relations are schema-driven: define allowed subject/object entity types per relation

## gline-rs API (Rust crate v1.0.1)

### NER (Span Mode)
```rust
use gliner::model::GLiNER;
use gliner::model::pipeline::span::SpanMode;
use gliner::model::params::Parameters;
use gliner::model::input::text::TextInput;
use orp::params::RuntimeParameters;

let model = GLiNER::<SpanMode>::new(
    Parameters::default(),
    RuntimeParameters::default().with_threads(2),
    "models/gliner_small-v2.1/tokenizer.json",
    "models/gliner_small-v2.1/onnx/model_int8.onnx",
)?;
let input = TextInput::from_str(&["some text"], &["person", "company"])?;
let output = model.inference(input)?;
for spans in &output.spans {
    for span in spans {
        println!("{} [{}] {:.0}%", span.text(), span.class(), span.probability() * 100.0);
    }
}
```

### NER + Relation Extraction (Composed Pipeline)
```rust
use composable::*;
use orp::model::Model;
use orp::params::RuntimeParameters;
use gliner::model::params::Parameters;
use gliner::model::pipeline::{token::TokenPipeline, relation::RelationPipeline};
use gliner::model::input::{text::TextInput, relation::schema::RelationSchema};

let params = Parameters::default();
let model = Model::new(
    "models/gliner-multitask-large-v0.5/onnx/model_q4f16.onnx",
    RuntimeParameters::default(),
)?;

let mut schema = RelationSchema::new();
schema.push_with_allowed_labels("USES", &["software_framework"], &["algorithm"]);
schema.push_with_allowed_labels("TARGETS", &["optimization_technique"], &["hardware"]);
// Or unconstrained:
schema.push("IMPROVES");

let pipeline = composed![
    TokenPipeline::new("models/gliner-multitask-large-v0.5/tokenizer.json")?
        .to_composable(&model, &params),
    RelationPipeline::default("models/gliner-multitask-large-v0.5/tokenizer.json", &schema)?
        .to_composable(&model, &params),
];

let input = TextInput::from_str(&["text"], &["person", "company"])?;
let output = pipeline.apply(input)?;
```

### Output Structures
```rust
// Entity (from SpanOutput or TokenPipeline)
span.text() -> &str // "Bill Gates"
span.class() -> &str // "person"
span.probability() -> f32 // 0.999
span.offsets() -> (usize, usize)

// Relation (from RelationOutput)
relation.subject() -> &str // "Bill Gates"
relation.object() -> &str // "Microsoft"
relation.class() -> &str // "founded"
relation.probability() -> f32 // 0.997
```

### Parameters
```rust
Parameters::default()
    .with_threshold(0.5)        // confidence threshold
    .with_flat_ner(true)        // no overlapping entities
    .with_multi_label(false)    // no overlapping different-class spans
    .with_max_length(Some(512)) // max sequence length
```

## Available Models (local)

| Model | Path | Size | Mode | Capabilities |
|-------|------|------|------|-------------|
| gliner_small-v2.1 | `models/gliner_small-v2.1/` | 175MB (int8) | Span | NER only |
| gliner-multitask-large-v0.5 | `models/gliner-multitask-large-v0.5/` | 519MB (q4f16) | Token | NER + Relations |

## ONNX Models on HuggingFace

| Repo | Tasks | License |
|------|-------|---------|
| `onnx-community/gliner_small-v2.1` | NER | Apache 2.0 |
| `onnx-community/gliner_large-v2.1` | NER | Apache 2.0 |
| `onnx-community/gliner-multitask-large-v0.5` | NER + RE | Apache 2.0 |
| `knowledgator/gliner-relex-large-v0.5` | NER + RE (needs ONNX conversion) | Apache 2.0 |

## Domain Entity Types (GPU optimization)

```
hardware, gpu_feature, optimization_technique, algorithm,
software_framework, performance_metric, memory_pattern,
kernel_operation, model_architecture, constraint,
data_structure, research_paper
```

## Domain Relation Types

```
IMPLEMENTS, USES, OPTIMIZES, TARGETS, IMPROVES, REDUCES,
ELIMINATES, MEASURES, LIMITS, ENABLES, EXTENDS, BUILDS_ON,
VALIDATES, COMPETES_WITH, IS_PART_OF, IS_FEATURE_OF,
REQUIRES, COULD_IMPROVE, INTRODUCES, PORTS_TO
```

## Docker

```bash
# NER only (fast, no API key needed)
docker compose run --rm graphrag --source sources/paper.txt --ner-only --dry-run

# Full pipeline (NER + LLM relations + FalkorDB)
ANTHROPIC_API_KEY=sk-... docker compose run --rm graphrag --source sources/paper.txt

# Skip local NER, LLM-only
docker compose run --rm graphrag --source sources/paper.txt --skip-ner
```
36 changes: 36 additions & 0 deletions .claude/commands/graph-enrichment.md
@@ -0,0 +1,36 @@
Run the full knowledge graph enrichment pipeline: arXiv paper -> chunk -> NER -> LLM relations -> dedup -> FalkorDB. Pass a paper ID, URL, or topic.

## Instructions

1. Load tools:
- Use ToolSearch to load: `select:mcp__arxiv-server__search_papers,mcp__arxiv-server__read_paper,mcp__arxiv-server__download_paper`
- Use ToolSearch to load: `+falkordb` (for graph merge verification)

2. **Acquire source** from $ARGUMENTS:
- If arXiv ID (e.g. `2307.08691`): use `mcp__arxiv-server__read_paper` to get full text
- If search topic: use `mcp__arxiv-server__search_papers`, pick best result, then read it
- Save text to `C:/Users/fabia/Projects/llama.cpp/llama.cpp/graphrag-pipeline/sources/<id>.txt`

3. **Run extraction pipeline**:
```bash
cd C:/Users/fabia/Projects/llama.cpp/llama.cpp/graphrag-pipeline
cargo run -- --source sources/<id>.txt
```
- Without ANTHROPIC_API_KEY: add `--ner-only` (NER pass only, no LLM gleaning)
- For preview: add `--dry-run` (print results, skip FalkorDB merge)

4. **Verify in FalkorDB**:
- Query new nodes: `MATCH (n) WHERE n.name CONTAINS '<keyword>' RETURN n`
- Check relations: `MATCH (n)-[r]->(m) RETURN n.name, type(r), m.name ORDER BY n.name LIMIT 20`

5. **Report**: Summarize entities created, relations found, and any dedup merges.

## Pipeline Stages (GraphRAG + LightRAG hybrid)

| Stage | Technique | Detail |
|-------|-----------|--------|
| Chunk | GraphRAG | 600-token chunks, 100-token overlap |
| NER | anno/GLiNER | Zero-shot with 12 GPU-domain entity types |
| Relations | GraphRAG gleaning | Claude Haiku, multi-round extraction per chunk |
| Dedup | LightRAG | Normalize names, merge properties, deduplicate rels |
| Merge | Incremental | MATCH-or-CREATE into FalkorDB `gpu_optimization` graph |
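
The LightRAG-style dedup stage in the table can be sketched as name normalization to a canonical key, so variant spellings merge into one node before the FalkorDB MATCH-or-CREATE. Everything here is illustrative (`canonical_key`, `dedup_entities`, and the exact normalization rule are assumptions, not the pipeline's actual code):

```rust
use std::collections::HashMap;

/// Normalize an entity name: keep ASCII alphanumerics, lowercase.
/// "Flash-Attention" and "flash attention" map to the same key.
fn canonical_key(name: &str) -> String {
    name.chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_lowercase())
        .collect()
}

/// Group raw entity names by canonical key; each group becomes one node,
/// with the variant spellings retained for property merging.
fn dedup_entities(names: &[&str]) -> HashMap<String, Vec<String>> {
    let mut groups: HashMap<String, Vec<String>> = HashMap::new();
    for n in names {
        groups.entry(canonical_key(n)).or_default().push(n.to_string());
    }
    groups
}
```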
29 changes: 29 additions & 0 deletions .claude/commands/research.md
@@ -0,0 +1,29 @@
Research a topic using web search, arXiv papers, and Hugging Face. Synthesize findings into actionable results.

## Instructions

1. Load research tools first:
- Use ToolSearch to load: WebSearch, WebFetch
- Use ToolSearch to load arxiv MCP tools: mcp__arxiv-server__search_papers, mcp__arxiv-server__read_paper, mcp__arxiv-server__list_papers
- Use ToolSearch to load HuggingFace MCP tools: mcp__claude_ai_Hugging_Face__paper_search, mcp__claude_ai_Hugging_Face__hub_repo_search

2. Search phase — run these in parallel:
- WebSearch for: $ARGUMENTS
- mcp__arxiv-server__search_papers for relevant papers
- mcp__claude_ai_Hugging_Face__paper_search if ML models are relevant

3. Deep-dive phase:
- For each promising result, use WebFetch to get details (GitHub READMEs, docs)
- For key papers, use mcp__arxiv-server__read_paper to read full content
- Use mcp__claude_ai_Hugging_Face__hub_repo_search for relevant models/datasets

4. Compile findings into a structured comparison table with:
- Project name + URL
- Language/SDK (TypeScript, Rust, Python, etc.)
- Key features relevant to the query
- Maturity (stars, last update, version)
- How it could apply to our GPU optimization work

5. If FalkorDB is running, suggest new entities/relations to add from findings

6. Eval: verify at least 3 sources were consulted and findings are cross-referenced
5 changes: 5 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,5 @@
{
"env": {
"CLAUDE_CODE_TASK_LIST_ID": "llama-cpp-graphrag"
}
}