Danish Foundation Models is a collaborative project for training foundational Danish language models. The project seeks to:
- Develop and maintain state-of-the-art language models for Danish,
- validate the models thoroughly across a wide range of tasks,
- provide good documentation, allowing users to critically assess a model for their use case,
- open-source both the models and the source code.
Note: This repository contains the data-processing code for DFM.
└── dfm-processing/
├── .github
│ └── workflows
├── LICENSE
├── README.md
├── config
│ └── example.yaml
├── pyproject.toml
├── src
│ └── dfm_processing
├── tests
│ ├── cli
│ ├── data_pipeline
│ └── document_processing
    └── uv.lock

This project requires the following dependencies:
- Programming Language: Python
- Package Manager: uv
Build dfm-processing from source and install its dependencies:

- Clone the repository:

  ❯ git clone https://github.com/danish-foundation-models/dfm-processing

- Navigate to the project directory:

  ❯ cd dfm-processing

- Install the dependencies using uv:

  ❯ uv sync --all-extras
The CLI is divided into two sections, "document" and "pipeline". Each section contains specific commands for different tasks.
- Process Directory:
  - Purpose: Extract text data from various file types in a directory.
  - Usage:
    uv run dfm-processing document process-directory path_to_dir output_dir dataset_name
  - Example:
    uv run dfm-processing document process-directory ./data ./output my_dataset
- Process Web Crawl:
  - Purpose: Extract text data from a web crawl.
  - Usage:
    uv run dfm-processing document process-web-crawl crawl_log output_dir crawled_data dataset_name
  - Example:
    uv run dfm-processing document process-web-crawl example.com.log ./output ./crawled_data/ example.com
- Filter:
  - Purpose: Run a filtering pipeline on a dataset to filter out poor-quality data.
  - Usage:
    uv run dfm-processing pipeline filter yaml_config
  - Example:
    uv run dfm-processing pipeline filter ./config/example.yaml
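The filters themselves are defined in the YAML config. As an illustration of what a quality filter typically does (the heuristic below is hypothetical, not one of dfm-processing's actual filters), consider rejecting documents that are too short or dominated by symbols:

```python
def is_good_quality(text: str, min_chars: int = 200, max_symbol_ratio: float = 0.1) -> bool:
    """Hypothetical quality heuristic: reject very short documents and
    documents dominated by non-alphanumeric symbols."""
    if len(text) < min_chars:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

docs = ["en " * 100, "kort tekst", "???" * 100]
kept = [d for d in docs if is_good_quality(d)]  # only the first document survives
```

A real pipeline chains many such predicates and logs why each document was dropped.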
- Sentence Deduplication (sent_dedup):
  - Purpose: Perform sentence deduplication on a given dataset.
  - Usage:
    uv run dfm-processing pipeline sent_dedup yaml_config
  - Example:
    uv run dfm-processing pipeline sent_dedup ./config/example.yaml
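Conceptually, sentence deduplication drops sentences whose normalized content has already been seen elsewhere in the corpus. A minimal sketch of the idea (not the actual pipeline, whose behaviour is set in the YAML config):

```python
import hashlib

def sent_dedup(documents: list[str]) -> list[str]:
    """Toy sentence deduplication: drop any sentence whose normalized
    hash has already been seen anywhere in the corpus."""
    seen: set[str] = set()
    result = []
    for doc in documents:
        kept = []
        for sent in doc.split(". "):
            key = hashlib.sha1(sent.strip().lower().encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(sent)
        result.append(". ".join(kept))
    return result
```

Real implementations use proper sentence splitting and disk-backed hash stores so the corpus need not fit in memory.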
- MinHash Deduplication (minhash-dedup):
  - Purpose: Perform MinHash deduplication on a given dataset.
  - Usage:
    uv run dfm-processing pipeline minhash-dedup yaml_config
  - Example:
    uv run dfm-processing pipeline minhash-dedup ./config/example.yaml
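MinHash deduplication finds near-duplicate documents by comparing compact signatures instead of full texts: two signatures agree in roughly the same fraction of positions as the Jaccard similarity of the documents' shingle sets. A toy sketch of the idea (signature size, shingle size, and function names are illustrative; the real pipeline's parameters live in the YAML config):

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 64, shingle_size: int = 3) -> list[int]:
    """Toy MinHash: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the document's word shingles."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    return [min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles)
            for seed in range(num_hashes)]

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Production systems additionally bucket signatures with locality-sensitive hashing so only likely duplicates are ever compared pairwise.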
For more information, please check out the following links:
| Resource | Description |
| --- | --- |
| 📑 About | An overview of the DFM project |
| Research Paper | A paper introducing DFM and its rationale |
| 🚀 Models | An overview of the models currently available through the DFM project |
| 💽 Datasets | Datasheets describing the datasets, including preprocessing, reasons for construction, and more |
DFM is a collaborative project for training and maintaining Danish language models. If you wish to contribute, don't hesitate to reach out through one of the following channels:
| Channel | Purpose |
| --- | --- |
| 🗣 DDSC Slack | Join the discussion in the "danish-foundation-models" channel |
| 💬 GitHub Discussions | Ask questions or start a discussion |
| 🚨 GitHub Issues | Noticed a bug in the code? Please create an issue |
You can contribute in several ways:
- Developer time, the lifeblood of any open-source project
- Pre-training datasets you wish to include in the model training
- Validation tasks, which can even be private benchmarks where you only share the performance metrics
- And probably many other ways
