Fully Autonomous ML Trainer Agentic System (In Development)

Describe the model you want. Get a trained, deployed model on HuggingFace.

No YAML configs. No training scripts. No juggling API keys manually. You type what you need in plain English, and a pipeline of specialized AI agents handles everything — from parsing your intent, to fetching your dataset, to writing training code, to pushing the final model to HuggingFace Hub.

This is not a wrapper around a training library. It is a fully orchestrated agentic system, built from scratch, that converts natural language into end-to-end ML pipelines across 13 task types and 5 LLM providers.

The Problem This Solves

Training and deploying a machine learning model involves an exhausting sequence of decisions: pick a backbone, set hyperparameters, preprocess the data correctly, write a training loop that doesn't break, monitor metrics, then navigate the HuggingFace API to push everything. Each step depends on the last. One wrong choice early cascades into hours of debugging later.

Most people either invest weeks learning every component, or rely on AutoML tools that give you no control. There is no middle ground where you describe what you want and get working code and a deployed model back.

This project is that middle ground.

What It Does

You run mltrainer in your terminal. An agent greets you, asks what you want to build, collects your credentials, and validates them in real time. Once it has everything it needs, the pipeline fires — automatically, sequentially, without you touching anything else.

$ mltrainer

  ┌──────────────────────────────────────────────────────┐
  │         Fully Autonomous ML Trainer Agentic System   │
  └──────────────────────────────────────────────────────┘

Agent: Hi! Tell me what you'd like to build.

You: I want to classify spam SMS messages. Dataset is on
     HuggingFace at sms_spam. I'll train on Kaggle.

Agent: Got it. I'll need your HuggingFace token and your
       Kaggle credentials to proceed...

>> intent_parser        ✓  2.1s
>> dataset              ✓  4.8s
>> preprocessing        ✓  3.2s
>> config               ✓  1.9s
>> architecture         ✓  2.4s
>> codegen              ✓  6.1s
>> monitor              ✓  142.3s
>> deploy               ✓  8.7s

Model live at: https://huggingface.co/you/spam-classifier

Everything between your description and that final URL is handled by the system.

Architecture Overview

The system has four major layers that work together:

CLI Terminal App — the interface you interact with. It displays real-time events from the pipeline (LLM calls, tool calls, retries, errors) and routes your input to the orchestrator.
Orchestrator — the central coordinator. It initializes jobs, runs the pipeline, manages sub-agents, handles failures, and owns the retry logic.
9 Complex Sub-Agents — each responsible for one stage of the ML pipeline. They all read from and write to a single shared Job Context object, which is the system's central memory.
Tool Calling Mechanism — a structured tool registry that agents use to call external services (validate credentials, fetch dataset metadata, etc.) through a uniform interface.

All state is persisted to SQLite after every stage. If your pipeline crashes at step 6, you resume from step 6 — not from step 1.

The Two Phases

Phase 1: Interactive Intake

Before any ML work begins, the Intake Manager Agent has a conversation with you. This is not a form — it is a multi-turn dialogue managed by an LLM.

The agent collects your dataset URL, runtime preference (Kaggle or Modal), HuggingFace credentials, Kaggle API keys, and optionally your own LLM provider key. Every credential gets validated in real time via tool calls before the agent marks itself ready. It calls the HuggingFace /api/whoami endpoint, the Kaggle competitions API, and the LLM provider's authentication endpoint — and if any of them fail, it tells you exactly what's wrong and asks again.

The backend, not the agent, decides when intake is complete. The agent cannot declare itself ready if a required field is missing or a credential failed validation. This prevents any downstream stage from ever receiving broken inputs.

Note

The api keys the user enters in conversation is not exposed to LLMs at all , that would be a great security risk, instead I developed a system where the intake manager agent will first sanitize for api keys and store them again reference ids , these reference ids are passed to the llm, when llm sends these reference ids in tool calls, we replace it with actual api key and excecute the function behind the tool call. This way the api keys remain safe from uploading to LLM provider servers and olso in conversation history that is stored.

Phase 2: Automated Pipeline

Once intake is complete, the orchestrator takes over. It calls the Intent Parser Agent first, which translates your natural language description into a universal structured JSON — the parsed_intent. This object captures your task type, architecture preferences, hyperparameters, dataset configuration, and deployment settings.

The task type determines the execution plan. A tabular_classification job runs: dataset → preprocessing → config → architecture → codegen → monitor → deploy. An llm_finetuning job skips preprocessing. A clustering job skips deploy. The ExecutionPlan module handles this routing.

From there, each stage runs in sequence, with the result of every stage written into the shared Job Context for the next agent to read.

The Agents

Each agent is a clean unit of responsibility. They extend BaseAgent, implement a single _execute(context) method, and return a dict that gets merged into the shared context.

Agent	What It Does
Intake Manager	Collects and validates all user inputs through multi-turn conversation
Intent Parser	Converts natural language description into a structured JSON specification
Dataset Fetch	Downloads dataset from HuggingFace, Kaggle, or URL; computes data report
Dataset Preprocess	Cleans, encodes, and splits the data for training
Config	Selects or infers all hyperparameters based on data characteristics
Architecture	Picks a backbone model and designs the task-specific head
Code Gen	Writes the full training script and requirements.txt
Model Training	Executes training on Kaggle or Modal, captures logs and metrics
Deployment	Pushes the trained model and creates a HuggingFace Space

The Intake Manager and Intent Parser are fully implemented. The remaining seven are next in the build sequence.

The Orchestrator

The orchestrator is the brain of the system. It does not just call agents in a loop — it manages the entire lifecycle of a job.

Job initialization — creates a job_id, allocates the intake agent, saves the initial state to SQLite.

Pipeline execution — loads the stage list for the detected task type, iterates through each stage, routes to the correct agent, merges results back into context, and saves a checkpoint after every completed stage.

Failure handling — when an agent fails, the orchestrator does not silently continue or crash. It captures the failure reason, updates the job's failure state, emits a STAGE_FAILED event to the CLI, and stops the pipeline. The partial state is preserved in SQLite.

Resumption — when you restart the CLI and a previous session exists, you can resume. The orchestrator loads the context from SQLite, identifies which stages are already in stage_results, skips them, and continues from where it stopped.

The Retry Mechanism

This is one of the more interesting architectural choices in the system. Every stage has an independent retry budget and backoff schedule. But the retry logic does not just retry blindly.

When a stage fails and gets retried, the previous error message is injected into the agent's LLM conversation as a new user message before the next attempt. The agent literally sees what went wrong in its context window and can reason about how to fix it.

Attempt 1: Code gen produces invalid Python syntax
Error captured: "SyntaxError: unexpected indent at line 47"

Attempt 2: Agent sees previous error in prompt
→ Agent self-corrects, produces valid training script
→ Stage succeeds

Different stages have different retry configurations. The monitoring stage gets 10 attempts with a 30-second base delay (polling a training run). The intent parser gets 3 attempts with a 2-second delay. Each stage's config is tuned to its expected failure modes.

The Job Context — Central Memory

Every agent reads from and writes to a single JobContext object. This is not a message bus or a shared database — it is a Python dataclass that flows through the pipeline as the single source of truth.

The context holds everything: the raw user prompt, the full conversation history from intake, the parsed intent, the dataset report, the preprocessed data path, the final config, the architecture spec, the training script, the training logs, the best metric, and the deployed model URL.

When any agent writes to the context, every downstream agent sees it. When the context is serialized to SQLite, the entire job state is captured in a single JSON blob. This makes resumption exact — there is no reconstruction or inference about what happened before. The context tells you everything.

The LLM Router — Vendor Independence

No agent imports an LLM SDK directly. Every agent calls self.llm_router.complete(system_prompt, user_message, ...) and gets back a response. The router handles everything else.

Supported providers:

Anthropic (Claude Sonnet, Opus, Haiku)
OpenAI (GPT-4o, GPT-4o-mini)
Google (Gemini 2.0 Flash, Gemini 1.5 Pro)
Groq (Qwen 32B, LLaMA 3.3 70B)
Ollama (local models — Qwen 2.5 Coder, LLaMA 3, Mistral, Phi-3)

The router is built on LiteLLM, which normalizes the API surface across all providers. If you provide an API key during intake, the router uses your chosen provider. If you provide no key, it falls back automatically — Groq's free tier first, then local Ollama models.

You can run this entire system without spending a dollar on LLM API calls.

The Tool Calling Mechanism

Agents that need to interact with external services use a structured tool registry. The ToolExecuter maps each agent type to a BaseTool subclass, which defines its tools in OpenAI format and implements the execution logic.

The Intake Manager uses CredentialValidatorTools to call three external APIs live during the intake conversation. The Dataset Agent uses DatasetAgentTools to fetch HuggingFace dataset metadata — splits, features, row counts, column types.

Every tool call is sanitized before logging — API keys and tokens are masked automatically. Tool calls and their results are also emitted as events to the CLI, so you see exactly what the agent is doing in real time.

The Event System

The orchestrator and every agent emit typed events to a thread-safe queue. A background thread in the CLI consumes this queue and renders each event to the terminal as it arrives.

You see LLM calls happening. You see tool calls with their arguments. You see stage completions with timing. You see retries with the error that caused them. You see everything.

This is not just good UX — it is how you debug a multi-agent pipeline that runs for several minutes. When something fails at the codegen stage, you can see the exact LLM response that caused it and the exact error injected before the retry.

Supported Task Types

Domain	Task Types
Tabular	Classification, Regression
Image	Classification, Regression, Object Detection
Text	Classification, Generation, Token Classification, Summarization, Translation
LLM	Fine-tuning
Time Series	Forecasting
Unsupervised	Clustering

Why No LangGraph or Agent Framework

The entire orchestration layer is custom Python. No LangGraph, no CrewAI, no AutoGen.

The reason is control. Each stage in this pipeline has meaningfully different retry behavior, different context requirements, and different failure modes. Expressing that in a general-purpose graph framework means fighting the framework every time you need something specific.

A custom orchestrator is ~500 lines of explicit Python. You can read it top to bottom and understand exactly what happens when a stage fails, exactly when a checkpoint is saved, and exactly how error feedback gets injected into a retry. There are no black boxes.

The same reasoning applies to the context system. A JobContext dataclass is simpler than a graph state, easier to serialize, easier to inspect in SQLite, and easier to reason about when something goes wrong at 2am.

Getting Started

Install dependencies (requires Python 3.11+):

uv sync

Set up environment (optional — or provide a key during intake):

# Free tier available on Groq — no cost to run the system
GROQ_API_KEY=your_key

Run:

mltrainer

The CLI will guide you from there. You will be asked for your dataset URL, HuggingFace token, runtime preference, and any credentials your setup requires. Everything is validated before the pipeline starts.

To resume a previous session after a crash or interruption:

mltrainer
# → "Previous session found. [R] Resume  [N] New session"

Project Status

The intake and intent parsing stages are complete and production-quality. The orchestration infrastructure — job context, retry handler, event system, SQLite persistence, resumption — is fully built and ready for the remaining agents.

The seven pipeline agents (dataset through deployment) are being implemented one by one. Each will be documented here as it ships.

Built to make the entire ML training pipeline disappear behind a conversation.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
agent_schemas		agent_schemas
agents		agents
api		api
cli		cli
config		config
data_sources		data_sources
llm		llm
orchestrator		orchestrator
storage		storage
tests		tests
tool		tool
tool_methods		tool_methods
.gitignore		.gitignore
.python-version		.python-version
LLM_ENV_HANDLING_PLAN.md		LLM_ENV_HANDLING_PLAN.md
README.md		README.md
logging_config.py		logging_config.py
main.py		main.py
pyproject.toml		pyproject.toml
to_do.txt		to_do.txt
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fully Autonomous ML Trainer Agentic System (In Development)

The Problem This Solves

What It Does

Architecture Overview

The Two Phases

Phase 1: Interactive Intake

Note

Phase 2: Automated Pipeline

The Agents

The Orchestrator

The Retry Mechanism

The Job Context — Central Memory

The LLM Router — Vendor Independence

The Tool Calling Mechanism

The Event System

Supported Task Types

Why No LangGraph or Agent Framework

Getting Started

Project Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fully Autonomous ML Trainer Agentic System (In Development)

The Problem This Solves

What It Does

Architecture Overview

The Two Phases

Phase 1: Interactive Intake

Note

Phase 2: Automated Pipeline

The Agents

The Orchestrator

The Retry Mechanism

The Job Context — Central Memory

The LLM Router — Vendor Independence

The Tool Calling Mechanism

The Event System

Supported Task Types

Why No LangGraph or Agent Framework

Getting Started

Project Status

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages