LLM-driven evolutionary optimizer for JAX Pallas TPU kernels. Automatically discovers high-performance kernel variants through iterative mutation, evaluation on real TPU hardware (GKE TPU v7x), multi-signal profiling, and structured reflection.
Glaucis operates as an interactive Claude Code skill loop:
```
                pallas-evolve:start
                         |
           +-------------+-------------+
           |                           |
     THINK (LLM)                Read config &
  Generate N kernel             create GitHub
  variant mutations             tracking Issue
           |                           |
           +-------------+-------------+
                         |
                pallas-evolve:submit
             Build payloads, create K8s
             Job on GKE, collect results
                         |
                pallas-evolve:analyze
              Multi-signal bottleneck
              classification, top-K
                 lineage selection
                         |
                pallas-evolve:reflect
             Extract failure patterns &
             successful optimizations
                   into AGENT.md
                         |
                      compact
                   (loop back)
```
Each iteration:

- Think -- Generate N kernel variants by mutating code between `EVOLVE-BLOCK-START`/`EVOLVE-BLOCK-END` markers, guided by learnings in `AGENT.md`
- Submit -- Package variants into a K8s Job, deploy to GKE TPU v7x, run a 5-stage evaluation pipeline (compile, correctness, performance, XPlane trace, deep IR analysis)
- Analyze -- Classify bottlenecks using compute ratio, VLIW bundles, MXU utilization, arithmetic intensity, HBM bandwidth, register spills, and more. Select top-K lineages for the next round.
- Reflect -- Record failure patterns (`[Fxxx]`) and successful optimizations (`[Sxxx]`) in `AGENT.md`. Post a round summary to the GitHub Issue.
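The Think phase's marker handling can be sketched as follows (an illustrative stand-in for `mutation.py`; the actual extraction/injection logic may differ):

```python
# Marker-based mutation sketch: only the region between the markers is
# ever rewritten by the LLM; everything outside stays fixed.
START = "# EVOLVE-BLOCK-START"
END = "# EVOLVE-BLOCK-END"

def extract_block(source: str) -> str:
    """Return the mutable region between the markers (markers excluded)."""
    start = source.index(START) + len(START)
    end = source.index(END, start)
    return source[start:end]

def inject_block(source: str, new_block: str) -> str:
    """Replace the mutable region with an LLM-generated variant."""
    start = source.index(START) + len(START)
    end = source.index(END, start)
    return source[:start] + new_block + source[end:]
```

Keeping the markers themselves intact means a variant can always be re-mutated in a later round.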
```
Glaucis/
├── kernel-evolve/                    # Core Python package
│   ├── src/kernel_evolve/
│   │   ├── config.py                 # Pydantic YAML config validation
│   │   ├── mutation.py               # EVOLVE-BLOCK extraction/injection
│   │   ├── evaluator.py              # EvalResult/EvalRequest data classes
│   │   ├── kube_evaluator.py         # K8s Job submission via kubectl
│   │   ├── profiler.py               # XPlane trace & IR analysis (624 lines)
│   │   └── docker_evaluate_helpers.py  # Batch subprocess dispatch
│   ├── docker/
│   │   ├── Dockerfile                # TPU evaluation container
│   │   └── evaluate.py               # 5-stage evaluator (runs inside K8s Pod)
│   ├── examples/
│   │   ├── matmul.yaml               # Simple tiled matmul config
│   │   ├── chunk_gla.yaml            # Chunked Gated Linear Attention config
│   │   ├── gmm_fp8_blockwise.yaml    # Grouped MatMul FP8 config
│   │   └── kernels/                  # Template & reference implementations
│   ├── plugins/pallas-evolve/        # Claude Code skill plugin
│   │   └── skills/{start,submit,analyze,reflect}/
│   ├── scripts/                      # Utility scripts
│   └── tests/                        # 13 test files
├── .github/
│   ├── workflows/
│   │   ├── kernel-eval.yaml          # TPU evaluation workflow
│   │   └── build-image.yaml          # Docker image build & push
│   └── ci/
│       ├── kernel-eval-job.yaml      # K8s Job template
│       ├── kernel-eval-gmm-job.yaml  # GMM FP8 variant Job template
│       └── xplane-explore-job.yaml   # XPlane exploration Job
├── AGENT.md                          # Accumulated optimization learnings
├── docs/plans/                       # Design & implementation documents
└── LICENSE                           # Apache 2.0
```
- Python >= 3.10
- `kubectl` configured for a GKE cluster with TPU v7x nodes
- GCS bucket for profile artifact storage (default: `glaucis-profiles`)
- Claude Code CLI installed
```shell
cd kernel-evolve
pip install -e ".[dev]"
```

Optional extras:

```shell
pip install -e ".[charts]"   # matplotlib for visualization
pip install -e ".[profile]"  # xprof for trace analysis
```

The pallas-evolve plugin ships with 6 skills that drive the optimization loop inside Claude Code.
Option A: Local plugin (for development)

Add to your project's `.claude/settings.local.json`:

```json
{
  "enabledPlugins": {
    "pallas-evolve@local": true
  }
}
```

Claude Code will discover the plugin from `kernel-evolve/plugins/pallas-evolve/.claude-plugin/plugin.json`.

Option B: From repository

```json
{
  "enabledPlugins": {
    "pallas-evolve@https://github.com/sii-xinglong/Glaucis": true
  }
}
```

Launch Claude Code in the project directory. You should see these skills available:
| Skill | Purpose |
|---|---|
| `pallas-evolve:init-kernel` | Initialize a kernel project from upstream |
| `pallas-evolve:profile-brief` | Generate a profile brief from TPU eval artifacts |
| `pallas-evolve:start` | Start an optimization session |
| `pallas-evolve:submit` | Submit a batch of variants for TPU eval |
| `pallas-evolve:analyze` | Analyze batch evaluation results |
| `pallas-evolve:reflect` | Record learnings to AGENT.md |
In Claude Code, invoke the start skill with a config:
```
/start examples/matmul.yaml
```
This will:
- Parse the config and validate shapes/correctness settings
- Create a GitHub Issue for tracking
- Enter the think-submit-analyze-reflect loop
Two run modes are available:

- Step-by-step (default) -- Pauses after each phase for human review
- Autonomous -- Runs the full loop unattended until max iterations or the target speedup is reached
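In autonomous mode, the control flow reduces to a simple loop. The sketch below is hypothetical, with stub phase functions standing in for the Claude Code skills (the real loop is driven by the skills themselves, not a Python driver):

```python
# Stub phases for illustration only; in Glaucis each is a Claude Code skill.
def think():
    return ["variant-a", "variant-b"]          # generate N mutations

def submit(variants):
    return {v: 1.1 for v in variants}          # speedup per variant (stubbed)

def analyze(results):
    return max(results.values())               # best speedup this round

def reflect(results):
    pass                                       # record [Fxxx]/[Sxxx] learnings

def run_session(max_iterations: int = 20, target_speedup: float = 2.0) -> float:
    """Hypothetical autonomous driver: loop until max iterations or target."""
    best = 1.0
    for _ in range(max_iterations):
        variants = think()
        results = submit(variants)
        best = max(best, analyze(results))
        reflect(results)
        if best >= target_speedup:
            break
    return best
```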
Create a YAML config pointing to your kernel template and reference:

```yaml
kernel:
  name: "my_kernel"
  template: "kernels/my_kernel.py"
  reference: "kernels/my_kernel_ref.py"
  markers:
    start: "# EVOLVE-BLOCK-START"
    end: "# EVOLVE-BLOCK-END"

shapes:
  - { M: 1024, N: 1024, K: 1024 }

correctness:
  atol: 1e-2
  rtol: 1e-2

evaluator:
  namespace: "default"
  job_template: ".github/ci/kernel-eval-job.yaml"
  repo: "sii-xinglong/Glaucis"
  branch: "main"
  tpu:
    cluster: "tpu7x-cluster"
    zone: "us-central1"

session:
  max_iterations: 20
  output_dir: "runs/my_kernel"
  batch:
    variants_per_round: 4
    top_k: 2
    max_active_lineages: 6
```

Mark the mutable region in your kernel template:
```python
# EVOLVE-BLOCK-START
def my_kernel(x_ref, y_ref, o_ref):
    # ... kernel code to be optimized ...
# EVOLVE-BLOCK-END
```

| Kernel | Description | Config |
|---|---|---|
| `matmul` | Tiled matrix multiplication | `examples/matmul.yaml` |
| `chunk_gla` | Chunked Gated Linear Attention (fwd + bwd) | `examples/chunk_gla.yaml` |
| `gmm_fp8_blockwise` | Grouped MatMul with FP8 block-wise quantization | `examples/gmm_fp8_blockwise.yaml` |
The evaluator runs inside a K8s Pod on GKE TPU v7x and executes 5 stages:

- Compile -- `exec()` the kernel, `jax.jit` + lower to HLO
- Correctness -- Compare against the reference implementation with `np.testing.assert_allclose`
- Performance -- 10 warmup + 50 timed iterations; report median latency and speedup
- XPlane Trace -- JAX profiler capture: compute ratio, memory transfer ratio, per-unit utilization (MXU, Scalar ALU, Vector ALU, Vector Load/Store)
- Deep IR -- HLO/LLO/Mosaic dump parsing: VLIW bundles, MXU distribution, VMEM allocations, bundle density, DMA analysis, HBM bandwidth, FLOP counts, arithmetic intensity
Profile artifacts (HLO, LLO, trace events) are uploaded to GCS for post-hoc analysis.
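The performance stage's timing discipline can be approximated with a framework-agnostic sketch (the real evaluator times `jax.jit`-compiled calls and blocks on device completion; `median_latency` is an illustrative name):

```python
import statistics
import time

def median_latency(fn, warmup: int = 10, iters: int = 50) -> float:
    """10 warmup runs absorb compilation and caching effects, then the
    median of 50 timed runs is reported; the median resists outlier jitter
    better than the mean."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

Speedup is then the reference's median latency divided by the variant's.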
| Component | Detail |
|---|---|
| Cloud | Google Cloud Platform |
| Cluster | tpu7x-cluster (GKE, us-central1) |
| TPU | v7x (Ironwood), 2x2x1 topology, 4 chips |
| Peak Compute | 2307 TFLOPS (BF16) |
| HBM Bandwidth | 3690 GB/s |
| Container Registry | us-central1-docker.pkg.dev/tpu-service-473302/glaucis/kernel-eval |
| Artifact Storage | GCS bucket glaucis-profiles |
| CI Auth | Workload Identity Federation for GitHub Actions |
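The peak numbers above fix the roofline ridge point: a kernel needs roughly 2307e12 / 3690e9 ≈ 625 FLOP per HBM byte of arithmetic intensity before it is compute-bound rather than bandwidth-bound. A quick sanity check (simple roofline model using only the table's figures):

```python
PEAK_FLOPS = 2307e12  # BF16 peak, TPU v7x (from the table above)
HBM_BW = 3690e9       # HBM bandwidth in bytes/s

def attainable_tflops(arithmetic_intensity: float) -> float:
    """Roofline model: attainable TFLOP/s at a given arithmetic
    intensity (FLOP per HBM byte moved)."""
    return min(PEAK_FLOPS, arithmetic_intensity * HBM_BW) / 1e12

ridge = PEAK_FLOPS / HBM_BW  # ~625 FLOP/byte: the compute/memory crossover
```

This is why the analyze phase tracks arithmetic intensity alongside MXU utilization: kernels well below the ridge point cannot reach peak compute no matter how well they schedule.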
```shell
cd kernel-evolve

# Run tests
pytest

# Lint
ruff check src/ tests/
ruff format --check src/ tests/
```

Apache 2.0 -- see LICENSE.