Describe the model you want. Get a trained, deployed model on HuggingFace.
No YAML configs. No training scripts. No juggling API keys manually. You type what you need in plain English, and a pipeline of specialized AI agents handles everything — from parsing your intent, to fetching your dataset, to writing training code, to pushing the final model to HuggingFace Hub.
This is not a wrapper around a training library. It is a fully orchestrated agentic system, built from scratch, that converts natural language into end-to-end ML pipelines across 13 task types and 5 LLM providers.
Training and deploying a machine learning model involves an exhausting sequence of decisions: pick a backbone, set hyperparameters, preprocess the data correctly, write a training loop that doesn't break, monitor metrics, then navigate the HuggingFace API to push everything. Each step depends on the last. One wrong choice early cascades into hours of debugging later.
Most people either invest weeks learning every component, or rely on AutoML tools that give you no control. There is no middle ground where you describe what you want and get working code and a deployed model back.
This project is that middle ground.
You run mltrainer in your terminal. An agent greets you, asks what you want to build, collects your credentials, and validates them in real time. Once it has everything it needs, the pipeline fires — automatically, sequentially, without you touching anything else.
$ mltrainer
┌──────────────────────────────────────────────────────┐
│ Fully Autonomous ML Trainer Agentic System │
└──────────────────────────────────────────────────────┘
Agent: Hi! Tell me what you'd like to build.
You: I want to classify spam SMS messages. Dataset is on
HuggingFace at sms_spam. I'll train on Kaggle.
Agent: Got it. I'll need your HuggingFace token and your
Kaggle credentials to proceed...
>> intent_parser ✓ 2.1s
>> dataset ✓ 4.8s
>> preprocessing ✓ 3.2s
>> config ✓ 1.9s
>> architecture ✓ 2.4s
>> codegen ✓ 6.1s
>> monitor ✓ 142.3s
>> deploy ✓ 8.7s
Model live at: https://huggingface.co/you/spam-classifier
Everything between your description and that final URL is handled by the system.
The system has four major layers that work together:
- CLI Terminal App — the interface you interact with. It displays real-time events from the pipeline (LLM calls, tool calls, retries, errors) and routes your input to the orchestrator.
- Orchestrator — the central coordinator. It initializes jobs, runs the pipeline, manages sub-agents, handles failures, and owns the retry logic.
- 9 Complex Sub-Agents — each responsible for one stage of the ML pipeline. They all read from and write to a single shared Job Context object, which is the system's central memory.
- Tool Calling Mechanism — a structured tool registry that agents use to call external services (validate credentials, fetch dataset metadata, etc.) through a uniform interface.
All state is persisted to SQLite after every stage. If your pipeline crashes at step 6, you resume from step 6 — not from step 1.
Before any ML work begins, the Intake Manager Agent has a conversation with you. This is not a form — it is a multi-turn dialogue managed by an LLM.
The agent collects your dataset URL, runtime preference (Kaggle or Modal), HuggingFace credentials, Kaggle API keys, and optionally your own LLM provider key. Every credential gets validated in real time via tool calls before the agent marks itself ready. It calls the HuggingFace /api/whoami endpoint, the Kaggle competitions API, and the LLM provider's authentication endpoint — and if any of them fail, it tells you exactly what's wrong and asks again.
The backend, not the agent, decides when intake is complete. The agent cannot declare itself ready if a required field is missing or a credential failed validation. This prevents any downstream stage from ever receiving broken inputs.
The api keys the user enters in conversation is not exposed to LLMs at all , that would be a great security risk, instead I developed a system where the intake manager agent will first sanitize for api keys and store them again reference ids , these reference ids are passed to the llm, when llm sends these reference ids in tool calls, we replace it with actual api key and excecute the function behind the tool call. This way the api keys remain safe from uploading to LLM provider servers and olso in conversation history that is stored.
Once intake is complete, the orchestrator takes over. It calls the Intent Parser Agent first, which translates your natural language description into a universal structured JSON — the parsed_intent. This object captures your task type, architecture preferences, hyperparameters, dataset configuration, and deployment settings.
The task type determines the execution plan. A tabular_classification job runs: dataset → preprocessing → config → architecture → codegen → monitor → deploy. An llm_finetuning job skips preprocessing. A clustering job skips deploy. The ExecutionPlan module handles this routing.
From there, each stage runs in sequence, with the result of every stage written into the shared Job Context for the next agent to read.
Each agent is a clean unit of responsibility. They extend BaseAgent, implement a single _execute(context) method, and return a dict that gets merged into the shared context.
| Agent | What It Does |
|---|---|
| Intake Manager | Collects and validates all user inputs through multi-turn conversation |
| Intent Parser | Converts natural language description into a structured JSON specification |
| Dataset Fetch | Downloads dataset from HuggingFace, Kaggle, or URL; computes data report |
| Dataset Preprocess | Cleans, encodes, and splits the data for training |
| Config | Selects or infers all hyperparameters based on data characteristics |
| Architecture | Picks a backbone model and designs the task-specific head |
| Code Gen | Writes the full training script and requirements.txt |
| Model Training | Executes training on Kaggle or Modal, captures logs and metrics |
| Deployment | Pushes the trained model and creates a HuggingFace Space |
The Intake Manager and Intent Parser are fully implemented. The remaining seven are next in the build sequence.
The orchestrator is the brain of the system. It does not just call agents in a loop — it manages the entire lifecycle of a job.
Job initialization — creates a job_id, allocates the intake agent, saves the initial state to SQLite.
Pipeline execution — loads the stage list for the detected task type, iterates through each stage, routes to the correct agent, merges results back into context, and saves a checkpoint after every completed stage.
Failure handling — when an agent fails, the orchestrator does not silently continue or crash. It captures the failure reason, updates the job's failure state, emits a STAGE_FAILED event to the CLI, and stops the pipeline. The partial state is preserved in SQLite.
Resumption — when you restart the CLI and a previous session exists, you can resume. The orchestrator loads the context from SQLite, identifies which stages are already in stage_results, skips them, and continues from where it stopped.
This is one of the more interesting architectural choices in the system. Every stage has an independent retry budget and backoff schedule. But the retry logic does not just retry blindly.
When a stage fails and gets retried, the previous error message is injected into the agent's LLM conversation as a new user message before the next attempt. The agent literally sees what went wrong in its context window and can reason about how to fix it.
Attempt 1: Code gen produces invalid Python syntax
Error captured: "SyntaxError: unexpected indent at line 47"
Attempt 2: Agent sees previous error in prompt
→ Agent self-corrects, produces valid training script
→ Stage succeeds
Different stages have different retry configurations. The monitoring stage gets 10 attempts with a 30-second base delay (polling a training run). The intent parser gets 3 attempts with a 2-second delay. Each stage's config is tuned to its expected failure modes.
Every agent reads from and writes to a single JobContext object. This is not a message bus or a shared database — it is a Python dataclass that flows through the pipeline as the single source of truth.
The context holds everything: the raw user prompt, the full conversation history from intake, the parsed intent, the dataset report, the preprocessed data path, the final config, the architecture spec, the training script, the training logs, the best metric, and the deployed model URL.
When any agent writes to the context, every downstream agent sees it. When the context is serialized to SQLite, the entire job state is captured in a single JSON blob. This makes resumption exact — there is no reconstruction or inference about what happened before. The context tells you everything.
No agent imports an LLM SDK directly. Every agent calls self.llm_router.complete(system_prompt, user_message, ...) and gets back a response. The router handles everything else.
Supported providers:
- Anthropic (Claude Sonnet, Opus, Haiku)
- OpenAI (GPT-4o, GPT-4o-mini)
- Google (Gemini 2.0 Flash, Gemini 1.5 Pro)
- Groq (Qwen 32B, LLaMA 3.3 70B)
- Ollama (local models — Qwen 2.5 Coder, LLaMA 3, Mistral, Phi-3)
The router is built on LiteLLM, which normalizes the API surface across all providers. If you provide an API key during intake, the router uses your chosen provider. If you provide no key, it falls back automatically — Groq's free tier first, then local Ollama models.
You can run this entire system without spending a dollar on LLM API calls.
Agents that need to interact with external services use a structured tool registry. The ToolExecuter maps each agent type to a BaseTool subclass, which defines its tools in OpenAI format and implements the execution logic.
The Intake Manager uses CredentialValidatorTools to call three external APIs live during the intake conversation. The Dataset Agent uses DatasetAgentTools to fetch HuggingFace dataset metadata — splits, features, row counts, column types.
Every tool call is sanitized before logging — API keys and tokens are masked automatically. Tool calls and their results are also emitted as events to the CLI, so you see exactly what the agent is doing in real time.
The orchestrator and every agent emit typed events to a thread-safe queue. A background thread in the CLI consumes this queue and renders each event to the terminal as it arrives.
You see LLM calls happening. You see tool calls with their arguments. You see stage completions with timing. You see retries with the error that caused them. You see everything.
This is not just good UX — it is how you debug a multi-agent pipeline that runs for several minutes. When something fails at the codegen stage, you can see the exact LLM response that caused it and the exact error injected before the retry.
| Domain | Task Types |
|---|---|
| Tabular | Classification, Regression |
| Image | Classification, Regression, Object Detection |
| Text | Classification, Generation, Token Classification, Summarization, Translation |
| LLM | Fine-tuning |
| Time Series | Forecasting |
| Unsupervised | Clustering |
The entire orchestration layer is custom Python. No LangGraph, no CrewAI, no AutoGen.
The reason is control. Each stage in this pipeline has meaningfully different retry behavior, different context requirements, and different failure modes. Expressing that in a general-purpose graph framework means fighting the framework every time you need something specific.
A custom orchestrator is ~500 lines of explicit Python. You can read it top to bottom and understand exactly what happens when a stage fails, exactly when a checkpoint is saved, and exactly how error feedback gets injected into a retry. There are no black boxes.
The same reasoning applies to the context system. A JobContext dataclass is simpler than a graph state, easier to serialize, easier to inspect in SQLite, and easier to reason about when something goes wrong at 2am.
Install dependencies (requires Python 3.11+):
uv syncSet up environment (optional — or provide a key during intake):
# Free tier available on Groq — no cost to run the system
GROQ_API_KEY=your_keyRun:
mltrainerThe CLI will guide you from there. You will be asked for your dataset URL, HuggingFace token, runtime preference, and any credentials your setup requires. Everything is validated before the pipeline starts.
To resume a previous session after a crash or interruption:
mltrainer
# → "Previous session found. [R] Resume [N] New session"The intake and intent parsing stages are complete and production-quality. The orchestration infrastructure — job context, retry handler, event system, SQLite persistence, resumption — is fully built and ready for the remaining agents.
The seven pipeline agents (dataset through deployment) are being implemented one by one. Each will be documented here as it ships.
Built to make the entire ML training pipeline disappear behind a conversation.