autofit2

Few-shot text classification: a massively multilingual (50+ languages), fully automated pipeline built on SetFit and SBERT embeddings.

Key Features

Few-Shot Learning: High precision (95–99%) with a few dozen labeled examples.
Multilingual Support: Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.
Automated Pipeline: End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.
Reproducibility & Transparency: JSON-based configuration, model card generation, and CO₂ emission tracking.

Usage

1. Prepare Data: Use dataload or implement a custom loader that provides labeled examples.

2. Configure: Create myproject.json specifying dataset paths, model settings, and output directories. Multiple language and task blocks are supported.

3. Run

The pipeline supports resumable execution.

python train.py myproject.json

4. Output

Deployable model archive.
Generated model card (training details, intended use, performance metrics, bias evaluation).

Configuration

myproject.json defines the training parameters. Its structure depends on the target type: Base Models (all), Custom Models (custom), or benchmark-only targets (benchmark 1..N).

General Structure

{
  "<task-key>": {
    "<language-key>": {
      "base": {
        "model file": "<path>",          // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
        "model type": "<string>",          // e.g., "bert"
        "pretraining task": "<string>",  // e.g., "sentence similarity"
        "downstream task": "<string>"    // e.g., "binary text classification"
      },
      "targets": {
        "<id-key>": { ... }             // See Target Options below
      }
    }
  }
}
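
The nesting above (task, then language, then "base" and "targets") can be walked with a few lines of Python. This is a minimal sketch, not the tool's own parser: the helper name `iter_targets` and the idea that all four "base" keys are required are assumptions made for illustration. Standard JSON has no comment syntax, so the `//` annotations from the template are left out of the sample text.

```python
import json

# Sample config following the structure above, without the // annotations.
CONFIG_TEXT = """
{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/all-MiniLM-L6-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "all": {"loader": ["el_offense20()"]}
      }
    }
  }
}
"""

# Assumption: every "base" block carries these four keys.
REQUIRED_BASE_KEYS = {
    "model file", "model type", "pretraining task", "downstream task",
}


def iter_targets(config):
    """Yield (task, language, target_key, options) for every target (hypothetical helper)."""
    for task, languages in config.items():
        for lang, block in languages.items():
            missing = REQUIRED_BASE_KEYS - set(block.get("base", {}))
            if missing:
                raise ValueError(f"{task}/{lang}: missing base keys {sorted(missing)}")
            for target_key, options in block.get("targets", {}).items():
                yield task, lang, target_key, options


config = json.loads(CONFIG_TEXT)
rows = list(iter_targets(config))
for task, lang, target_key, options in rows:
    print(task, lang, target_key, options.get("loader"))
```

A check like this can catch a misspelled key before a long training run starts.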

Target Types

The "targets" dictionary supports three specific key types:

all (Base Model)
- Generates a full set of artifacts: model folder, archive, and card.
- Model ID: Derived from the config filename ({config_name}-{task}-{lang}). The config filename must be stable.
custom (Custom Model)
- Generates a full set of artifacts: model folder, archive, and card.
- Model ID: Can be auto-generated as a 14–16 character lowercase alphanumeric string.
benchmark 1..N (Benchmarking Only)
- Does not generate model artifacts.
- Outputs only score logs.
- Must be used in conjunction with an all target to produce output.
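
The three key types and the two model ID schemes can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, the target keys are taken to be the literal strings "all", "custom", and "benchmark N", `{config_name}` is taken to be the config filename without its extension, and the random generator is just one way to produce a lowercase alphanumeric ID of the stated length.

```python
import re
import secrets
import string
from pathlib import Path

# Assumption: benchmark keys look like "benchmark 1", "benchmark 2", ...
BENCHMARK_RE = re.compile(r"^benchmark [1-9]\d*$")


def target_kind(key):
    """Classify a key of the "targets" dictionary (hypothetical helper)."""
    if key == "all":
        return "base"
    if key == "custom":
        return "custom"
    if BENCHMARK_RE.match(key):
        return "benchmark"
    raise ValueError(f"unsupported target key: {key!r}")


def base_model_id(config_path, task, lang):
    """Build {config_name}-{task}-{lang}; config_name is assumed to be the file stem."""
    return f"{Path(config_path).stem}-{task}-{lang}"


def random_custom_id(length=16):
    """Generate a lowercase alphanumeric ID of 14 to 16 characters."""
    if not 14 <= length <= 16:
        raise ValueError("length must be between 14 and 16")
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(target_kind("benchmark 1"))                    # benchmark
print(base_model_id("myproject.json", "mod", "el"))  # myproject-mod-el
print(len(random_custom_id()))                       # 16
```

Because the base model ID is built from the filename, renaming the config would change the ID of every "all" target it defines.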

Target Options

Each entry in the "targets" dictionary supports the following keys:

| Key | Type | Description |
| --- | --- | --- |
| `description` | `string` | Free-form description of the target. |
| `link` | `string` | URL to source data or documentation. |
| `train embedding` | `bool` | Set to `true` to fine-tune embeddings during training. |
| `base clf` | `string` | ID string pointing to a `.joblib` file located in `BASE_PATH`. Must match exactly. |
| `sample ratio` | `float` | Random sample of total data for full training (e.g., `0.5` = 50%). |
| `embedding sample ratio` | `float` | Random sample of data used only for embedding fine-tuning (e.g., `0.1` = 10%). |
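
To make the two ratio options concrete, here is a small sketch of ratio-based subsampling. It is an assumption about how such a ratio might be applied (seeded sampling without replacement, rounded to the nearest example), not a description of the pipeline's internals; `sample_by_ratio` is a hypothetical name.

```python
import random


def sample_by_ratio(examples, ratio, seed=0):
    """Return a random subset holding roughly `ratio` of the examples."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not examples:
        return []
    # Keep at least one example so a tiny ratio never yields an empty set.
    k = max(1, round(len(examples) * ratio))
    return random.Random(seed).sample(examples, k)


data = [{"text": f"example {i}", "label": i % 2} for i in range(10)]

full_training = sample_by_ratio(data, 0.5)    # like "sample ratio": 0.5
embedding_only = sample_by_ratio(data, 0.1)   # like "embedding sample ratio": 0.1
print(len(full_training), len(embedding_only))  # 5 1
```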

Loaders

The "loader" field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):

"loader": ["command_1", "command_2"]

Command Definition: Each command must return a list of dictionaries with keys text and label. Commands can be raw loader functions or wrapped transformations (e.g., list comprehensions, lambdas).
Data Splitting Logic:
- If there are exactly 2 commands AND target != all:
  - Command 1 → Training Data
  - Command 2 → Evaluation Data
- Else if target = all:
  - All commands are concatenated into a single dataset.
  - Split: 100/100 (no split; the entire set is used for training).
- Else (other targets, e.g., custom or benchmarks with 1 command):
  - All commands are concatenated into a single dataset.
  - Split: 70/30 (Train/Test).
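
The splitting rules can be sketched in Python as below. This is an illustration under assumptions rather than the pipeline's code: `split_for_target` is a hypothetical name, the 70/30 split is shown as a seeded shuffle, and for the all target the sketch returns an empty evaluation set as one reading of "no split". Each command output is a list of dictionaries with keys text and label, as described above.

```python
import random


def split_for_target(target_key, command_outputs, train_fraction=0.7, seed=0):
    """Return (train, evaluation) lists following the three rules above."""
    # Rule 1: two commands on a non-"all" target are used as given.
    if len(command_outputs) == 2 and target_key != "all":
        train, evaluation = command_outputs
        return list(train), list(evaluation)

    # Rules 2 and 3 both start by concatenating every command's output.
    merged = [example for output in command_outputs for example in output]

    # Rule 2: "all" trains on the whole set (no held-out data in this sketch).
    if target_key == "all":
        return merged, []

    # Rule 3: other targets get a 70/30 train/test split (shuffling is an assumption).
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy command outputs in the expected shape.
train_cmd = [{"text": f"train {i}", "label": i % 2} for i in range(10)]
test_cmd = [{"text": f"test {i}", "label": i % 2} for i in range(4)]

for key, outputs in [
    ("benchmark 1", [train_cmd, test_cmd]),
    ("all", [train_cmd, test_cmd]),
    ("custom", [train_cmd]),
]:
    train, evaluation = split_for_target(key, outputs)
    print(key, len(train), len(evaluation))
```

With the toy data, the benchmark target keeps its 10 and 4 examples apart, the all target trains on all 14, and the custom target is split 7 to 3.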

Configuration Example

{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "benchmark 1": {
          "description": "Pitenis et al. - Offensive Language Identification in Greek",
          "link": "https://arxiv.org/abs/2003.07459",
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {
          "loader": [
            "el_offense20()"
          ]
        }
      }
    }
  }
}

Breakdown: Finetuning a Sentence Transformer

To fine-tune a base model for a specific task and language, define a config block like the one below. This example sets up a text moderation (mod) pipeline for Greek (el) using a multilingual sentence transformer.

Base Model Setup

"base": {
  "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
  "model type": "bert",
  "pretraining task": "sentence similarity",
  "downstream task": "binary text classification"
}

Model file: Path to the pretrained transformer.
Model type: Architecture type (e.g., BERT).
Pretraining task: Original task the model was trained on.
Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).

Targets

You can specify multiple fine-tuning targets. Each target defines a dataset and a training strategy.

benchmark 1

"benchmark 1": {
  "description": "Pitenis et al. - Offensive Language Identification in Greek",
  "link": "https://arxiv.org/abs/2003.07459",
  "loader": [
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
  ]
}

Uses separate training and test files: the first loader command supplies the training data, the second the evaluation data.
Based on a published benchmark dataset.

all

"all": {
  "loader": ["el_offense20()"]
}

Uses the full dataset for training.
No explicit evaluation; this target is for production-grade fine-tuning.
