neospe/autofit2


autofit2

Few-shot text classification: a massively multilingual (50+ languages), fully automated pipeline built on SetFit and SBERT embeddings.

Key Features

  • Few-Shot Learning: High precision (95–99%) with a few dozen labeled examples.
  • Multilingual Support: Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.
  • Automated Pipeline: End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.
  • Reproducibility & Transparency: JSON-based configuration, model card generation, and CO₂ emission tracking.

Usage

1. Prepare Data: Use dataload or implement a custom loader that provides labeled examples.

2. Configure: Create myproject.json specifying dataset paths, model settings, and output directories. A single file can hold blocks for multiple tasks and languages.

3. Run

The pipeline supports resumable execution.

python train.py myproject.json

4. Output

  • Deployable model archive.
  • Generated model card (training details, intended use, performance metrics, bias evaluation).

Configuration

myproject.json defines the training parameters. What each target produces depends on its key: all (base model), custom (custom model), or benchmark 1..N (benchmarking only).

General Structure

{
  "<task-key>": {
    "<language-key>": {
      "base": {
        "model file": "<path>",          // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
        "model type": "<string>",          // e.g., "bert"
        "pretraining task": "<string>",  // e.g., "sentence similarity"
        "downstream task": "<string>"    // e.g., "binary text classification"
      },
      "targets": {
        "<id-key>": { ... }             // See Target Options below
      }
    }
  }
}
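The nesting above (task, then language, then base and targets) can be checked before a run with a small script. This is a minimal sketch, not part of autofit2: the function check_config is hypothetical, and treating all four base keys as required is an assumption drawn from the structure shown above.

```python
REQUIRED_BASE_KEYS = {"model file", "model type", "pretraining task", "downstream task"}


def check_config(config):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    problems = []
    for task, languages in config.items():
        for lang, block in languages.items():
            where = f"{task}/{lang}"
            base = block.get("base", {})
            missing = REQUIRED_BASE_KEYS - set(base)
            if missing:
                problems.append(f"{where}: base is missing {sorted(missing)}")
            # The README asks for a relative path with no trailing slash.
            if base.get("model file", "").endswith("/"):
                problems.append(f"{where}: 'model file' has a trailing slash")
            if not block.get("targets"):
                problems.append(f"{where}: no targets defined")
    return problems


config = {
    "mod": {
        "el": {
            "base": {
                "model file": "models-in/all-MiniLM-L6-v2",
                "model type": "bert",
                "pretraining task": "sentence similarity",
                "downstream task": "binary text classification",
            },
            "targets": {"all": {"loader": ["el_offense20()"]}},
        }
    }
}
print(check_config(config))
```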

Target Types

The "targets" dictionary supports three specific key types:

  1. all (Base Model)
    • Generates a full set of artifacts: model folder, archive, and card.
    • Model ID: Derived from the config filename ({config_name}-{task}-{lang}). The config filename must be stable.
  2. custom (Custom Model)
    • Generates a full set of artifacts: model folder, archive, and card.
    • Model ID: Can be auto-generated as a 14–16 character lowercase alphanumeric string.
  3. benchmark 1..N (Benchmarking Only)
    • Does not generate model artifacts.
    • Outputs only score logs.
    • Must be used in conjunction with an all target to produce output.
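The two model ID schemes described above can be sketched as follows. Both helper names are hypothetical, and two details are assumptions: that {config_name} means the config filename without its extension, and that the random ID is drawn uniformly from lowercase letters and digits.

```python
import secrets
import string
from pathlib import Path


def base_model_id(config_path, task, lang):
    """ID for an 'all' target: {config_name}-{task}-{lang}."""
    # Assumption: config_name is the filename stem, e.g. "myproject".
    return f"{Path(config_path).stem}-{task}-{lang}"


def random_custom_id(length=16):
    """ID for a 'custom' target: 14-16 lowercase alphanumeric characters."""
    if not 14 <= length <= 16:
        raise ValueError("length must be between 14 and 16")
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(base_model_id("myproject.json", "mod", "el"))
print(random_custom_id())
```

Because the base model ID is derived from the filename, renaming the config would change the ID, which is why the filename must stay stable.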

Target Options

Each entry in the "targets" dictionary supports the following keys:

| Key | Type | Description |
|---|---|---|
| description | string | Free-form description of the target. |
| link | string | URL to source data or documentation. |
| train embedding | bool | Set to true to fine-tune embeddings during training. |
| base clf | string | ID string pointing to a .joblib file located in BASE_PATH. Must match exactly. |
| sample ratio | float | Random sample of total data for full training (e.g., 0.5 = 50%). |
| embedding sample ratio | float | Random sample of data used only for embedding fine-tuning (e.g., 0.1 = 10%). |
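To make the two ratio options concrete, the sketch below draws a random subset of examples for a given ratio. It is an illustration only: the function sample_examples, the fixed seed, and the rounding rule are assumptions, and the pipeline's actual sampling procedure is not described here.

```python
import random


def sample_examples(examples, ratio, seed=0):
    """Return a random subset containing roughly `ratio` of the examples."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in the interval (0, 1]")
    if not examples:
        return []
    # Keep at least one example so a small ratio never yields an empty set.
    k = max(1, round(len(examples) * ratio))
    return random.Random(seed).sample(examples, k)


data = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
full_training = sample_examples(data, 0.5)   # like "sample ratio": 0.5
embedding_only = sample_examples(data, 0.1)  # like "embedding sample ratio": 0.1
print(len(full_training), len(embedding_only))
```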

Loaders

The "loader" field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):

"loader": ["command_1", "command_2"]
  • Command Definition: Each command must return a list of dictionaries with keys text and label. Commands can be raw loader functions or wrapped transformations (e.g., list comprehensions, lambdas).
  • Data Splitting Logic:
    • If there are exactly 2 commands and the target is not all:
      • Command 1 → Training Data
      • Command 2 → Evaluation Data
    • Else if the target is all:
      • All commands are concatenated into a single dataset.
      • Split: none (the entire set is used for training).
    • Otherwise (e.g., custom or a benchmark with 1 command):
      • All commands are concatenated into a single dataset.
      • Split: 70/30 (Train/Test).
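The splitting rules above can be written as a single function. This is a minimal sketch under stated assumptions: split_data is a hypothetical name, the 70/30 split is done by a seeded shuffle (whether the pipeline shuffles or stratifies is not stated), and the all target returns an empty evaluation set, which is one reading of "no split".

```python
import random


def split_data(target, command_outputs, test_fraction=0.3, seed=0):
    """Return (train, evaluation) lists from the outputs of the loader commands."""
    # Two commands on a non-"all" target: explicit train and evaluation sets.
    if len(command_outputs) == 2 and target != "all":
        train, evaluation = command_outputs
        return list(train), list(evaluation)

    # Every other case starts by concatenating all command outputs.
    merged = [example for output in command_outputs for example in output]
    if target == "all":
        return merged, []  # entire set is used for training

    # Remaining targets: shuffle, then hold out 30% for testing.
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
train, evaluation = split_data("custom", [rows])
print(len(train), len(evaluation))
```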

Configuration Example

{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "benchmark 1": {
          "description": "Pitenis et al. - Offensive Language Identification in Greek",
          "link": "https://arxiv.org/abs/2003.07459",
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {
          "loader": [
            "el_offense20()"
          ]
        }
      }
    }
  }
}
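One way to inspect a config like this before training is to walk it and list each target with its loader count and whether it produces model artifacts. The sketch below is illustrative only: describe_targets is a hypothetical helper, and the artifact rule simply restates the Target Types section (all and custom produce artifacts, benchmarks do not).

```python
import json

CONFIG_TEXT = """
{
  "mod": {
    "el": {
      "base": {
        "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
        "model type": "bert",
        "pretraining task": "sentence similarity",
        "downstream task": "binary text classification"
      },
      "targets": {
        "benchmark 1": {
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {"loader": ["el_offense20()"]}
      }
    }
  }
}
"""


def describe_targets(config):
    """Return (task, language, target, loader count, produces artifacts) rows."""
    rows = []
    for task, languages in config.items():
        for lang, block in languages.items():
            for name, target in block["targets"].items():
                produces_artifacts = name in ("all", "custom")
                rows.append((task, lang, name, len(target.get("loader", [])), produces_artifacts))
    return rows


for row in describe_targets(json.loads(CONFIG_TEXT)):
    print(row)
```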

Breakdown: Fine-tuning a Sentence Transformer

To fine-tune a base model for a specific task and language, define a config block like the one below. This example sets up a text moderation (mod) pipeline for Greek (el) using a multilingual sentence transformer.

Base Model Setup

"base": {
  "model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
  "model type": "bert",
  "pretraining task": "sentence similarity",
  "downstream task": "binary text classification"
}
  • Model file: Path to the pretrained transformer.
  • Model type: Architecture type (e.g., BERT).
  • Pretraining task: Original task the model was trained on.
  • Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).

Targets

You can specify multiple fine-tuning targets. Each target defines a dataset and training strategy.

  1. benchmark 1
"benchmark 1": {
  "description": "Pitenis et al. - Offensive Language Identification in Greek",
  "link": "https://arxiv.org/abs/2003.07459",
  "loader": [
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
    "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
  ]
}
  • Uses an explicit train/test split: the first loader command supplies the training data, the second the evaluation data.
  • Based on a published benchmark dataset.
  2. all
"all": {
  "loader": ["el_offense20()"]
}
  • Uses the full dataset for training.
  • No explicit evaluation; this target is for production-grade fine-tuning.
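The loader entries in these targets are strings such as "el_offense20()". This README does not say how such strings are resolved, so the following is only a guess at one possible approach: evaluating each command against a registry of known loader functions. The names run_command, REGISTRY, and el_offense20_stub are hypothetical, and the stub returns made-up rows rather than real data.

```python
def el_offense20_stub(files=None):
    """Hypothetical stand-in for a real loader: one fake row per file name."""
    names = files or ["default.tsv"]
    return [{"text": f"row from {name}", "label": i % 2} for i, name in enumerate(names)]


# Only names listed here are visible to a command string.
REGISTRY = {"el_offense20": el_offense20_stub}


def run_command(command, registry):
    """Evaluate one loader command string against the registry."""
    # Builtins are removed to limit what a command can reach. This is not a
    # security boundary, so configs should still come from trusted sources.
    return eval(command, {"__builtins__": {}, **registry})


print(run_command("el_offense20(files=['a.tsv', 'b.tsv'])", REGISTRY))
# A wrapped transformation, here a list comprehension that filters rows:
print(run_command("[r for r in el_offense20(files=['a.tsv', 'b.tsv']) if r['label'] == 1]", REGISTRY))
```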
