Second-year AI student at Inland Norway University of Applied Sciences, currently on exchange at UC Berkeley. I build ML systems that work under real-world constraints — crash-prone hardware, tight budgets, noisy quantum backends — and I publish the results honestly, including when things don't work.
Papers on arXiv:
- Training-Free Lexical–Dense Fusion for Conversational-Memory Retrieval — Training-free, CPU-only score-level fusion of BM25 with turn-level late-interaction dense retrieval on the LoCoMo conversational-memory benchmark. 0.752 Hit@1 vs. 0.640 for BM25 (+8.8–17.2 pp over late interaction alone across six encoders). Includes the negative results: a cross-encoder reranker hurts, and the gain fades on LongMemEval-S. (arXiv:2606.04194, June 2026)
- Feasible-First Exploration for Constrained ML Deployment Optimization — Crash-aware TBA→TPE hybrid optimizer. 80% discovery rate of the globally optimal model vs. 30% for standalone TPE; reduced wasted trials from 74% to 42%. Benchmarked on DeployBench: 5 architectures × 3 backends × 3 quantizations × 6 batch sizes across 5 NVIDIA GPUs (H100, A100, RTX 5080, L4, T4), 10 seeds each. 46/46 tests passing. (arXiv:2604.25073, April 2026)
- SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving — Two-phase TBA→TPE optimizer for vLLM tuning. 150-trial A100 study: 75/75 feasibility, zero crashes; statistically tied with random search on best latency (p=0.84) but 4.4× tighter cross-seed variance under concurrent load. (arXiv:2604.17627, April 2026)
- Hidden Device Heterogeneity in Constrained ML Deployment — PyTorch's INT8 dynamic quantization silently moves inference from GPU to CPU, producing a 39% feasibility flip rate. (submitted April 2026)
- Expanding multi-GPU benchmark results for the TBA deployment optimizer (H100, A100, RTX 5080, L4, T4)
- Waiting on D-Wave LaunchPad QPU access to complete the quantum annealing benchmark
- Cross-hardware validation of SLO-Guard on non-A100 GPUs
- Iterating on the personal-assistant retrieval stack — speaker-aware ranking and operational event-log feedback as signals for retrieval quality
- Extending the lexical–dense fusion retrieval study (opsem, arXiv:2606.04194) — encoder scaling and additional conversational-memory benchmarks beyond LoCoMo and LongMemEval-S
- Building Solon, an autonomous ML-research agent — verification-first architecture (skeleton-constrained authoring, receipt-traced claims, a 2σ + out-of-sample-persistence credibility gate) with a FunSearch-style evolutionary search engine, MAP-Elites quality-diversity, and a holdout gate against selection-bias overfit
- Coursework at UC Berkeley (CS 61C, concurrent with research)
Enrolled:
- CS 61C — Great Ideas in Computer Architecture
- EECS 127 — Optimization Models in Engineering
- ENGIN 183 — Technology Innovation and Entrepreneurship
- ASTRON C12 — The Planets (astrophysics breadth)
Auditing:
- CS 152/252A — Computer Architecture and Engineering
- CS 170 — Efficient Algorithms and Intractable Problems
- CS 185/285 — Deep Reinforcement Learning, Decision Making, and Control
- CS 61B — Data Structures
Focus areas: systems architecture, optimization theory, and deep RL — chosen to complement my ML deployment research with hardware-level understanding and formal optimization foundations.
ML systems and deployment optimization — How do you find the best inference configuration (backend, quantization, batch size) when most of the search space crashes or violates constraints? I built a two-phase optimizer (Thermal Budget Annealing → constrained TPE) that treats crashes as data and maps feasible regions before exploiting them. I also packaged one finding from that work into deploy-doctor — a small PyTorch CLI that catches the silent failure where an int8 "GPU" model actually runs on the CPU.
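The feasible-first idea can be illustrated with a toy sketch. This is not the TBA or TPE algorithm: uniform random sampling stands in for the exploration phase, a greedy neighbor search stands in for the exploitation phase, and the search space, the `evaluate` function, and its crash rules are all hypothetical. The only point it shows is that a crashed trial is recorded as data instead of being discarded.

```python
import random

# Hypothetical search space, loosely mirroring the backend/quantization/batch-size axes.
SPACE = {
    "backend": ["eager", "onnx", "tensorrt"],
    "quant": ["fp32", "fp16", "int8"],
    "batch": [1, 2, 4, 8, 16, 32],
}

def evaluate(cfg):
    """Toy stand-in for a benchmark run: returns a latency or raises on a simulated crash."""
    if cfg["backend"] == "tensorrt" and cfg["quant"] == "int8":
        raise RuntimeError("simulated crash: unsupported combination")
    if cfg["batch"] > 16 and cfg["quant"] == "fp32":
        raise MemoryError("simulated out-of-memory")
    base = {"eager": 10.0, "onnx": 7.0, "tensorrt": 5.0}[cfg["backend"]]
    scale = {"fp32": 1.0, "fp16": 0.7, "int8": 0.6}[cfg["quant"]]
    return base * scale / cfg["batch"] ** 0.5

def key(cfg):
    return tuple(sorted(cfg.items()))

def run_trial(cfg, history):
    """Run one config and log the outcome; a crash becomes a record, not an exception."""
    try:
        history.append({"cfg": cfg, "latency": evaluate(cfg), "crashed": False})
    except Exception as exc:
        history.append({"cfg": cfg, "latency": None, "crashed": True,
                        "error": type(exc).__name__})

def neighbors(cfg):
    """All configs that differ from cfg in exactly one dimension."""
    for name, values in SPACE.items():
        for value in values:
            if value != cfg[name]:
                yield {**cfg, name: value}

def two_phase_search(explore_trials=12, exploit_trials=12, seed=0):
    rng = random.Random(seed)
    history = []
    # Phase 1: explore broadly to learn where feasible configs live.
    for _ in range(explore_trials):
        run_trial({name: rng.choice(values) for name, values in SPACE.items()}, history)
    # Phase 2: exploit around the best feasible config, skipping anything already tried.
    for _ in range(exploit_trials):
        feasible = [h for h in history if not h["crashed"]]
        if not feasible:
            break
        best = min(feasible, key=lambda h: h["latency"])
        seen = {key(h["cfg"]) for h in history}
        candidates = [c for c in neighbors(best["cfg"]) if key(c) not in seen]
        if not candidates:
            break
        run_trial(rng.choice(candidates), history)
    feasible = [h for h in history if not h["crashed"]]
    best = min(feasible, key=lambda h: h["latency"]) if feasible else None
    return best, history

best, history = two_phase_search(seed=0)
wasted = sum(h["crashed"] for h in history)
print(best, f"crashed trials: {wasted}/{len(history)}")
```

Because crashed configs stay in `history`, the second phase never re-runs them, which is the simplest form of treating crashes as information.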
Quantum-classical benchmarking — Fair comparisons between classical simulated annealing, D-Wave quantum annealing, and QAOA on IBM hardware. I design standardized solver interfaces and report negative results when the hypothesis doesn't hold.
Agentic AI and applied CV — LLM-powered tool-calling agents for domain-specific automation, and competition-grade object detection pipelines with ONNX inference and ensemble methods. The same applied-CV thread runs through an honest exploration framework for underwater aquaculture net-damage detection, where I worked the full pipeline end to end: foundation-model and self-supervised anomaly detection (PatchCore, DINOv2, from-scratch SimCLR), synthetic-to-real domain-gap handling, adversarial and out-of-distribution evaluation, temporal video reasoning, and ONNX/FastAPI deployment — with, deliberately, no validated real-world claims, since all damage is synthetic (net-inspection-cv, private repo).
Personal AI and grounded retrieval — How does a local-first assistant remember what matters across years of notes and conversations, and ground its answers without hallucinating? I'm building one (private repo): JARVIS, a voice-driven personal "cortex" running entirely on a local Windows machine — an always-listening wake-word HUD, a dual-brain runtime that hot-swaps between the Claude API and a local Ollama model for a zero-network privacy mode, 60+ tools for actually controlling the computer (apps, shell, media, mail/calendar), and a markdown memory vault rendered as a navigable 3D memory atlas. Under the hood: a hybrid BM25 + dense-embedding + reciprocal-rank-fusion retrieval stack, a classifier-driven subsystem distilling raw conversation logs into a queryable knowledge vault, and a fixed-query eval harness so retrieval changes are measured, not hand-waved. The retrieval recipe is published separately as opsem (arXiv:2606.04194).
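For readers unfamiliar with reciprocal-rank fusion, here is a minimal sketch of the standard formula, where each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The document IDs and the constant `k=60` (a commonly used default) are assumptions for illustration, not values taken from JARVIS.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives sum(1 / (k + rank)) over every list that contains it,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["note_a", "note_b", "note_c"]
dense_ranking = ["note_b", "note_d", "note_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['note_b', 'note_a', 'note_d', 'note_c']
```

RRF uses only ranks, so it needs no score normalization across retrievers; that is one reason it is a common baseline for hybrid stacks.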
Autonomous research agents and verifiable AI discovery — Can an AI agent run the scientific loop end to end — propose, implement, evaluate, and report a result you can actually trust? I'm building one (private repo): Solon, a verification-first autonomous ML-research agent. Because LLM-agent ML results are too often fabricated or later invalidated, the writer can't invent numbers: every metric is parsed from real stdout, every claim traces to a reproducibility receipt, and a credibility gate certifies an effect only if it clears 2σ and survives fresh seeds. On that spine sits a FunSearch/AlphaEvolve-style evolutionary engine — a MAP-Elites archive of diverse "stepping stones" plus verified-fragment memory, so discoveries compound across runs. Pointed at the real LoCoMo benchmark (same as opsem), it produced an honest null: a holdout gate caught a seed-lucky +14 pp Hit@1 that reversed to −2.3 pp on unseen seeds — exactly the selection-bias overfit it exists to stop. The lesson: the bottleneck isn't the model, it's the objective and the rigor.
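A credibility gate of this kind might look roughly like the sketch below. The exact statistics Solon uses are not described here, so the definitions are assumptions: "clears 2σ" is read as the mean per-seed improvement exceeding twice its standard error, and "survives fresh seeds" as the mean improvement on held-out seeds staying positive. The numbers in the example are made up to mimic a seed-lucky result; they are not Solon's measurements.

```python
import math
import statistics

def clears_n_sigma(deltas, n_sigma=2.0):
    """Return (passed, mean, standard_error) for per-seed improvements over a baseline."""
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean > n_sigma * sem, mean, sem

def credibility_gate(selection_deltas, holdout_deltas, n_sigma=2.0):
    """Certify an effect only if it is significant on the seeds used for selection
    AND keeps a positive mean on seeds that played no part in selection."""
    significant, mean_sel, sem_sel = clears_n_sigma(selection_deltas, n_sigma)
    mean_holdout = statistics.mean(holdout_deltas)
    persists = mean_holdout > 0
    return {
        "certified": significant and persists,
        "selection_mean": mean_sel,
        "selection_sem": sem_sel,
        "holdout_mean": mean_holdout,
    }

# Illustrative only: a gain that looks strong on the selection seeds but reverses on fresh ones.
lucky = credibility_gate([0.14, 0.15, 0.13, 0.14], [-0.02, -0.03, -0.02])
# Illustrative only: a smaller gain that persists on fresh seeds.
real = credibility_gate([0.05, 0.06, 0.04, 0.05], [0.04, 0.05, 0.03])
print(lucky["certified"], real["certified"])  # False True
```

The holdout check matters because the candidate was chosen by looking at the selection seeds, so significance on those seeds alone is biased upward.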
Constrained-ML-Deployment Two research papers sharing the DeployBench infrastructure. (1) TBA: crash-aware two-phase optimizer for constrained ML deployment. (2) Hidden Device Heterogeneity: empirical study showing INT8 dynamic quantization silently moves inference to CPU, creating stochastic feasibility boundaries. 2,150 measurement trials, 5 GPU types, full reproducibility.
SLO-Guard Crash-aware autotuner for vLLM serving. Optimizes vLLM configs (batching, memory, execution mode) under hard latency/memory SLOs. Crashes are encoded as constraint violations and replayed into a warm-started TPE phase, so failed trials inform subsequent search. 150-trial A100 study on Qwen2-1.5B: 75/75 feasibility, zero crashes; statistically tied with random search on best latency (Mann-Whitney p=0.84), but with 4.4× tighter cross-seed variance under concurrent load. Paper at arXiv:2604.17627. Both sequential and concurrent harness datasets published for replication.
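One way to picture "crashes encoded as constraint violations" is the sketch below. It is an assumption-laden illustration, not SLO-Guard's code: the `Trial` record, the `CRASH_PENALTY` value, and the normalization are hypothetical, and the sign convention (a value above 0 means violated) follows what constrained samplers such as Optuna's commonly expect. The parameter names in the sample data are vLLM engine arguments used purely as examples.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical penalty assigned to both constraints when a trial crashes.
CRASH_PENALTY = 1.0

@dataclass
class Trial:
    params: dict
    latency_ms: Optional[float]  # None when the server crashed before measuring
    memory_gb: Optional[float]

def constraint_values(trial, latency_slo_ms, memory_slo_gb):
    """Map a trial to constraint values where <= 0 means satisfied and > 0 means violated."""
    if trial.latency_ms is None or trial.memory_gb is None:
        return (CRASH_PENALTY, CRASH_PENALTY)
    return (trial.latency_ms / latency_slo_ms - 1.0,
            trial.memory_gb / memory_slo_gb - 1.0)

def warm_start_records(trials, latency_slo_ms, memory_slo_gb):
    """Turn phase-one trials, crashes included, into records a second phase can replay."""
    records = []
    for trial in trials:
        constraints = constraint_values(trial, latency_slo_ms, memory_slo_gb)
        records.append({
            "params": trial.params,
            "objective": trial.latency_ms if trial.latency_ms is not None else float("inf"),
            "constraints": constraints,
            "feasible": all(value <= 0 for value in constraints),
        })
    return records

# Illustrative trial log: one feasible run, one SLO miss, one crash.
TRIALS = [
    Trial({"max_num_seqs": 64, "gpu_memory_utilization": 0.80}, 150.0, 30.0),
    Trial({"max_num_seqs": 256, "gpu_memory_utilization": 0.80}, 260.0, 30.0),
    Trial({"max_num_seqs": 512, "gpu_memory_utilization": 0.98}, None, None),
]
records = warm_start_records(TRIALS, latency_slo_ms=200.0, memory_slo_gb=40.0)
for record in records:
    print(record["params"], record["constraints"], record["feasible"])
```

Keeping the crash as an infeasible record, instead of dropping it, is what lets a later model-based phase steer away from that region.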
opsem Reproduction code and paper for Training-Free Lexical–Dense Fusion for Conversational-Memory Retrieval. Fuses BM25 with turn-level late-interaction (max-sim over per-turn vectors) dense retrieval at the score level — no training, runs on CPU, one leave-one-conversation-out weight. On LoCoMo: 0.752 Hit@1 vs. 0.640 BM25, +8.8–17.2 pp over late interaction alone across six encoders. Every number in the paper has a JSON + Markdown receipt; honest leave-one-conversation-out cross-validation throughout. Paper at arXiv:2606.04194.
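The fusion step can be sketched in a few lines of plain Python. Several details are assumptions for illustration and may differ from the opsem code: the dense score is taken as the maximum cosine similarity between one query vector and a candidate's per-turn vectors, both score lists are min-max normalized before mixing, and the BM25 scores, vectors, and weights are toy values. In the paper the single weight comes from leave-one-conversation-out cross-validation; here it is simply passed in.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_sim(query_vec, turn_vecs):
    """Late-interaction score: the best-matching turn decides the candidate's score."""
    return max(cosine(query_vec, turn) for turn in turn_vecs)

def min_max(scores):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, dense_scores, weight):
    """Score-level fusion: weight * lexical + (1 - weight) * dense."""
    lexical, dense = min_max(bm25_scores), min_max(dense_scores)
    return [weight * l + (1.0 - weight) * d for l, d in zip(lexical, dense)]

# Toy data: three candidates, each with per-turn vectors and a precomputed BM25 score.
query = [1.0, 0.0]
candidate_turns = [
    [[0.0, 1.0], [0.6, 0.8]],   # strong lexical match, moderate dense match
    [[1.0, 0.1]],               # weak lexical match, strong dense match
    [[-1.0, 0.0], [0.0, 1.0]],  # matches neither
]
bm25_scores = [5.0, 2.0, 0.0]
dense_scores = [max_sim(query, turns) for turns in candidate_turns]

balanced = fuse(bm25_scores, dense_scores, weight=0.5)
dense_heavy = fuse(bm25_scores, dense_scores, weight=0.2)
print(balanced.index(max(balanced)), dense_heavy.index(max(dense_heavy)))  # 0 1
```

The top-ranked candidate changes with the weight, which is why the weight has to be chosen on held-out conversations and not on the ones being evaluated.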
deploy-agent Productized version of TBA. CLI + FastAPI dashboard + MCP server for automated ML deployment optimization. Give it a model and hardware constraints, and it searches backends, quantization modes, and batch sizes, then returns the best feasible config with full evidence. Crash handling, structured JSON logs, live WebSocket charts.
deploy-doctor A small PyTorch CLI that flags silent device-placement footguns — e.g. an int8 model that quietly runs on the CPU instead of the GPU you asked for. GPU-free static diagnosis, CI-friendly. MIT.
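To give a feel for what a GPU-free static check can look like, here is a hypothetical heuristic built on Python's `ast` module. It is not deploy-doctor's implementation: the function name and warning text are invented, and the rule is deliberately crude (it flags any file that both calls `quantize_dynamic` and moves something to CUDA). The scanned source is only parsed, never executed, so neither PyTorch nor a GPU is needed.

```python
import ast

def find_int8_cuda_footgun(source):
    """Flag source code that dynamically quantizes a model and also targets CUDA.

    Hypothetical heuristic: PyTorch's dynamically quantized modules run on the CPU,
    so seeing both calls in one file is worth a warning. It does not track which
    variable is which, so false positives are possible.
    """
    quantize_lines, cuda_lines = [], []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name == "quantize_dynamic":
            quantize_lines.append(node.lineno)
        elif name == "cuda":
            cuda_lines.append(node.lineno)
        elif name == "to" and any(
            isinstance(arg, ast.Constant)
            and isinstance(arg.value, str)
            and arg.value.startswith("cuda")
            for arg in node.args
        ):
            cuda_lines.append(node.lineno)
    if quantize_lines and cuda_lines:
        return [
            f"line {min(quantize_lines)}: quantize_dynamic output runs on CPU, "
            f"but line {min(cuda_lines)} requests a CUDA device"
        ]
    return []

RISKY = '''
import torch
model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
model = model.to("cuda")
'''
CLEAN = '''
import torch
model = model.to("cuda")
'''
print(find_int8_cuda_footgun(RISKY))
print(find_int8_cuda_footgun(CLEAN))  # []
```

A check like this returns a list of warnings, so a CI job can fail the build whenever the list is non-empty.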
dwave-benchmark Classical SA vs D-Wave quantum annealing on Max-Cut and spin glass problems. Phase 1 complete: all classical solvers converge to identical solutions up to n=500 with zero quality gap. Phase 2 (QPU) pending D-Wave access. Common solver interface, reproducible seeds, timing analysis showing Neal SA ~400× faster than pure Python SA.
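As a reference point for what a pure Python SA baseline for Max-Cut involves, here is a minimal sketch. It is not the benchmark's solver: the function name, the geometric cooling schedule, and the default temperatures are assumptions, and the graph is a toy 4-cycle whose optimal cut is known to be 4.

```python
import math
import random

def cut_value(edges, side):
    """Number of edges whose endpoints lie on opposite sides of the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])

def simulated_annealing_maxcut(n, edges, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    side = [rng.choice((0, 1)) for _ in range(n)]
    current = cut_value(edges, side)
    best_side, best = side[:], current

    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        node = rng.randrange(n)
        same = sum(1 for nb in adjacency[node] if side[nb] == side[node])
        # Flipping the node cuts its same-side edges and un-cuts its currently cut edges.
        delta = same - (len(adjacency[node]) - same)
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            side[node] ^= 1
            current += delta
            if current > best:
                best_side, best = side[:], current
    return best_side, best

# Toy instance: a 4-cycle is bipartite, so every edge can be cut.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]
best_side, best_cut = simulated_annealing_maxcut(4, EDGES)
print(best_side, best_cut)
```

Fixing the seed makes each run reproducible, which is what allows solvers behind a common interface to be compared trial for trial.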
qaoa-benchmark Negative-result study: budget-aware classical optimizers vs COBYLA/SPSA for QAOA parameter tuning under noisy simulation. Finding: shallow QAOA landscapes are too smooth — COBYLA with fixed defaults matched or beat the learning optimizer. 375 total runs, 3 graphs, 5 budget levels, 5 seeds.
net-inspection-cv (private repo) Honest, research-grade framework for flagging damage (holes/tears) in aquaculture net footage. Benchmarks five detectors (classical, anomaly, label-free PatchCore, supervised YOLOv8 detect/seg, and a det∧seg ensemble), closing the synthetic-to-real gap by compositing labelled damage onto real SINTEF SOLAQUA ROV frames — localisation F1 0.12 → 0.50 → 0.78 → 0.97. Adversarial "is it cheating?" eval, OOD review gate, temporal confirmation, SSL backbone ablation (DINOv2 / from-scratch SimCLR), ROS-bag ingestion, ONNX export, and a FastAPI/Streamlit service. Reports its failures too, and claims no validated real-world numbers, since all damage is synthetic.
Norwegian-AI-Championship NM i AI 2026 competition entry (Team INNBerkeley). Three tasks: YOLOv8x object detection with ONNX inference, multi-scale TTA, and WBF ensembling; a FastAPI accounting agent using Gemini 2.5 Flash + Tripletex API; and an A* pathfinding agent for Norse world prediction.
ML2-Exam Machine Learning 2 exam work.
Python, PyTorch, vLLM, ONNX Runtime, Optuna, Qiskit, D-Wave Ocean SDK, FastAPI, Docker, LaTeX, SciPy, NetworkX, Matplotlib, NumPy, scikit-learn, Qiskit Aer, Google Colab, sentence-transformers, Hugging Face Transformers, Hugging Face Datasets, BM25 / rank fusion, Ollama, SQLite, OpenCV, Ultralytics YOLOv8, torchvision, scikit-image, Streamlit, rosbags, Pandas, Pillow, Modal, MAP-Elites / quality-diversity search, PyTorch quantization, pytest, GitHub Actions, Ruff
UC Berkeley (exchange 2025–2026) · INN Norway (home institution)