Few-shot text classification. Massively multilingual (50+ languages), fully automated pipeline built on setfit and SBERT embeddings.
- Few-Shot Learning: High precision (95–99%) with a few dozen labeled examples.
- Multilingual Support: Pretrained models for 20 languages; evaluation corpora for 50+. Scalable to 100+ via Common Crawl.
- Automated Pipeline: End-to-end preprocessing, fine-tuning, evaluation, and deployment from a single JSON config.
- Reproducibility & Transparency: JSON-based configuration, model card generation, and CO₂ emission tracking.
1. Prepare Data
Use dataload or implement a custom loader providing labeled examples.
2. Configure
Create myproject.json specifying dataset paths, model settings, and output directories. Supports multi-language/task blocks.
3. Run
The pipeline supports resumable execution.
python train.py myproject.json
4. Output
- Deployable model archive.
- Generated model card (training details, intended use, performance metrics, bias evaluation).
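For step 1, a custom loader could look roughly like the sketch below. It assumes the contract given in the loader description of this README (a list of dictionaries with keys "text" and "label"). The function name my_tsv_loader, the two-column tab-separated layout, and the integer labels are hypothetical and not part of the project.

```python
import csv
import io

def my_tsv_loader(stream):
    """Hypothetical loader: read "text<TAB>label" rows from a file-like object.

    Returns a list of {"text": ..., "label": ...} dictionaries.
    """
    rows = []
    for record in csv.reader(stream, delimiter="\t"):
        if len(record) != 2:
            continue  # skip blank or malformed lines
        text, label = record
        rows.append({"text": text.strip(), "label": int(label)})
    return rows

# In-memory stand-in for a real data file (assumed layout, not a project format).
sample = io.StringIO("good morning\t0\nbuy cheap pills now\t1\n")
examples = my_tsv_loader(sample)
```

Reading from a file-like object keeps the sketch testable without touching disk; a real loader would open the dataset file first.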
myproject.json defines the training parameters. Its structure depends on the target type: Base Models (all) or Custom Models (custom).
{
"<task-key>": {
"<language-key>": {
"base": {
"model file": "<path>", // Relative path, no trailing slash (e.g. "models-in/all-MiniLM-L6-v2")
"model type": "<string>", // e.g., "bert"
"pretraining task": "<string>", // e.g., "sentence similarity"
"downstream task": "<string>" // e.g., "binary text classification"
},
"targets": {
"<id-key>": { ... } // See Target Options below
}
}
}
}
The "targets" dictionary supports three specific key types:
all (Base Model)
- Generates a full set of artifacts: model folder, archive, and card.
- Model ID: Derived from the config filename ({config_name}-{task}-{lang}). The config filename must be stable.

custom (Custom Model)
- Generates a full set of artifacts: model folder, archive, and card.
- Model ID: can be auto-generated as a 14–16 character lowercase alphanumeric string.

benchmark 1..N (Benchmarking Only)
- Does not generate model artifacts.
- Outputs only score logs.
- Must be used in conjunction with an all target to produce output.
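The two model ID rules can be illustrated with a short sketch. The function names are hypothetical, and the README does not say how the pipeline actually generates a custom ID, so the random generator here is only an assumption that satisfies the stated format (14–16 lowercase alphanumeric characters).

```python
import secrets
import string
from pathlib import Path

def base_model_id(config_path, task, lang):
    """ID for an "all" target: {config_name}-{task}-{lang}."""
    config_name = Path(config_path).stem  # "configs/myproject.json" -> "myproject"
    return f"{config_name}-{task}-{lang}"

def random_model_id(length=16):
    """Assumed scheme for a custom ID: random lowercase letters and digits."""
    if not 14 <= length <= 16:
        raise ValueError("length must be between 14 and 16")
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

model_id = base_model_id("configs/myproject.json", "mod", "el")
custom_id = random_model_id()
```

Because the base model ID is built from the file stem, renaming the config file changes the ID, which is why the filename has to stay stable.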
Each entry in the "targets" dictionary supports the following keys:
| Key | Type | Description |
|---|---|---|
| description | string | Free-form description of the target. |
| link | string | URL to source data or documentation. |
| train embedding | bool | Set to true to fine-tune embeddings during training. |
| base clf | string | ID string pointing to a .joblib file located in BASE_PATH. Must match exactly. |
| sample ratio | float | Random sample of total data for full training (e.g., 0.5 = 50%). |
| embedding sample ratio | float | Random sample of data used only for embedding fine-tuning (e.g., 0.1 = 10%). |
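As an illustration of the option table, the following sketch checks a single target entry against the documented keys and types. The check_target helper is hypothetical and not part of the pipeline; the rule that ratios must lie in (0, 1] is an assumption inferred from the examples (0.5 = 50%, 0.1 = 10%).

```python
# Documented option keys and their types, taken from the table above.
EXPECTED_TYPES = {
    "description": str,
    "link": str,
    "train embedding": bool,
    "base clf": str,
    "sample ratio": float,
    "embedding sample ratio": float,
}

def check_target(target):
    """Return a list of human-readable problems found in one target entry."""
    problems = []
    for key, value in target.items():
        if key == "loader":
            continue  # "loader" is documented separately from the table
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            problems.append(f"unknown key: {key!r}")
        elif expected is float:
            # JSON numbers may arrive as int; bool is rejected explicitly
            # because bool is a subclass of int in Python.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{key!r} must be a number")
            elif not 0 < value <= 1:  # assumed valid range
                problems.append(f"{key!r} must be in (0, 1]")
        elif not isinstance(value, expected):
            problems.append(f"{key!r} must be of type {expected.__name__}")
    return problems

ok = check_target({"description": "demo", "train embedding": True, "sample ratio": 0.5})
bad = check_target({"sample ratio": 1.5, "train embedding": "yes", "foo": 1})
```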
The "loader" field defines how data is ingested and transformed. It expects a list of commands (functions or transformations):
"loader": ["command_1", "command_2"]
- Command Definition: Each command must return a list of dictionaries with keys "text" and "label". Commands can be raw loader functions or wrapped transformations (e.g., list comprehensions, lambdas).
- Data Splitting Logic:
  - If 2 commands AND target != all:
    - Command 1 → Training Data
    - Command 2 → Evaluation Data
  - Else (Target = all):
    - All commands are concatenated into a single dataset.
    - Split: 100/100 (No split; entire set used for training).
  - Else (Other Targets, e.g., custom or benchmarks with 1 command):
    - All commands are concatenated into a single dataset.
    - Split: 70/30 (Train/Test).
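The splitting rules can be sketched as a small function. This is only an illustration under assumptions: the name split_examples is hypothetical, "100/100" is read here as reusing the full set on both sides, and the seeded shuffle before the 70/30 cut is a guess, since the README does not say whether the pipeline shuffles or stratifies.

```python
import random

def split_examples(target_key, command_outputs, test_fraction=0.3, seed=0):
    """Return (train, evaluation) lists following the documented rules.

    command_outputs holds one list of {"text", "label"} dicts per loader command.
    """
    # Rule 1: two commands on a non-"all" target are train and evaluation data.
    if len(command_outputs) == 2 and target_key != "all":
        train, evaluation = command_outputs
        return list(train), list(evaluation)

    # Otherwise every command output is concatenated into one dataset.
    merged = [row for output in command_outputs for row in output]

    # Rule 2: "all" uses the entire set for training (assumed reading of 100/100).
    if target_key == "all":
        return merged, list(merged)

    # Rule 3: other targets get a 70/30 train/test split (shuffle is an assumption).
    shuffled = merged[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = [{"text": str(i), "label": i % 2} for i in range(10)]
train, test = split_examples("custom", [rows])
```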
{
"mod": {
"el": {
"base": {
"model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
"model type": "bert",
"pretraining task": "sentence similarity",
"downstream task": "binary text classification"
},
"targets": {
"benchmark 1": {
"description": "Pitenis et al. - Offensive Language Identification in Greek",
"link": "https://arxiv.org/abs/2003.07459",
"loader": [
"el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
"el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
]
},
"all": {
"loader": [
"el_offense20()"
]
}
}
}
}
}
To fine-tune a base model for a specific task and language, define a config block like the one above. This example sets up a text moderation (mod) pipeline for Greek (el) using a multilingual sentence transformer.
Base Model Setup
"base": {
"model file": "models-in/paraphrase-multilingual-MiniLM-L12-v2",
"model type": "bert",
"pretraining task": "sentence similarity",
"downstream task": "binary text classification"
}
- Model file: Path to the pretrained transformer.
- Model type: Architecture type (e.g., BERT).
- Pretraining task: Original task the model was trained on.
- Downstream task: Task you're adapting it to (e.g., moderation, sentiment analysis).
Targets
You can specify multiple fine-tuning targets. Each target defines a dataset and training strategy.
benchmark 1
"benchmark 1": {
"description": "Pitenis et al. - Offensive Language Identification in Greek",
"link": "https://arxiv.org/abs/2003.07459",
"loader": [
"el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
"el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
]
}
- Uses a train/test split for evaluation.
- Based on a published benchmark dataset.
all
"all": {
"loader": ["el_offense20()"]
}
- Uses the full dataset for training.
- No explicit evaluation; this is for production-grade fine-tuning.
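To see how the task / language / target nesting fits together, the sketch below walks a trimmed copy of the Greek example and lists each target with its kind. The helpers iter_targets and target_kind are hypothetical, and treating the literal key "custom" as the custom-model target is an assumption; how train.py actually traverses the file is not documented here.

```python
import json

def iter_targets(config):
    """Yield (task, language, target_key, target) for every target in a config."""
    for task, languages in config.items():
        for lang, block in languages.items():
            for target_key, target in block.get("targets", {}).items():
                yield task, lang, target_key, target

def target_kind(target_key):
    """Classify a target key into the three documented key types."""
    if target_key == "all":
        return "base model"
    if target_key == "custom":  # assumed literal key for custom models
        return "custom model"
    if target_key.startswith("benchmark "):
        return "benchmark"
    return "unknown"

# Trimmed version of the example config (loaders only).
config = json.loads("""
{
  "mod": {
    "el": {
      "targets": {
        "benchmark 1": {
          "loader": [
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-training-v1/offenseval-gr-training-v1.tsv'])",
            "el_offense20(files=['offenseval2020-greek/offenseval-gr-testsetv1/offenseval-gr-test-v1-combined.tsv'])"
          ]
        },
        "all": {"loader": ["el_offense20()"]}
      }
    }
  }
}
""")

summary = [
    (task, lang, key, target_kind(key), len(target.get("loader", [])))
    for task, lang, key, target in iter_targets(config)
]
```

Here the benchmark target has two loader commands (train and evaluation files) while the all target has one, which matches the splitting rules described earlier.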