- Python >= 3.10 & <= 3.12
- Git (for source installation)
- uv (recommended package installer)
- uv can be installed in either of the following ways:
# Using curl
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or using pip
pip install uv
Data-Juicer is now available on PyPI. The minimal installation includes core data processing capabilities:
uv pip install py-data-juicer
This provides:
- Data loading and manipulation
- File system operations
- Parallel processing
- Basic I/O and utilities
For component details, please refer to pyproject.toml.
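Before running any of the install commands, it can help to confirm that the interpreter falls inside the supported range listed in the prerequisites (Python >= 3.10 and <= 3.12). The following is a minimal sketch and not part of Data-Juicer: the helper name `python_version_supported` is hypothetical, and it assumes that "<= 3.12" covers every 3.12.x patch release.

```python
import sys

# Supported range taken from the prerequisites (inclusive on both ends).
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def python_version_supported(version=None):
    """Return True if the (major, minor) version lies in the supported range.

    Hypothetical helper for illustration; it is not part of Data-Juicer.
    """
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_supported():
        print(f"Python {current} is within the supported range.")
    else:
        print(f"Python {current} is outside the supported range 3.10 to 3.12.")
```

Comparing only the major and minor components keeps patch releases such as 3.12.4 inside the range.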
Core ML & DL
# Generic ML/DL capabilities
uv pip install "py-data-juicer[generic]"
Includes: PyTorch, Transformers, vLLM, etc.
Domain-Specific Features
# Computer Vision
uv pip install "py-data-juicer[vision]"
# Natural Language Processing
uv pip install "py-data-juicer[nlp]"
# Audio Processing
uv pip install "py-data-juicer[audio]"
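The bracketed names in the commands above (generic, vision, nlp, audio) are optional dependency groups, usually called extras, and several of them can be requested in one requirement by separating them with commas. The following is a minimal sketch of composing such a requirement string. The helper `requirement_with_extras` is hypothetical; it only formats text and does not install anything.

```python
def requirement_with_extras(package, extras=()):
    """Build a requirement string such as 'py-data-juicer[generic,nlp]'.

    Hypothetical helper for illustration; duplicates are dropped while
    keeping the order in which extras were first given.
    """
    unique = list(dict.fromkeys(e.strip() for e in extras if e.strip()))
    if not unique:
        return package
    return f"{package}[{','.join(unique)}]"


if __name__ == "__main__":
    spec = requirement_with_extras("py-data-juicer", ["generic", "nlp"])
    # Quote the spec so the shell does not interpret the square brackets.
    print(f'uv pip install "{spec}"')
```

Quoting the requirement matters in shells such as zsh, where unquoted square brackets are treated as glob patterns.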
**Additional Components**
```bash
# Distributed Computing
uv pip install "py-data-juicer[distributed]"
# AI Services & APIs
uv pip install "py-data-juicer[ai_services]"
```
**Development Tools**
```bash
# Development & Testing
uv pip install "py-data-juicer[dev]"
```
1. Text Processing Setup
uv pip install "py-data-juicer[generic,nlp]"
2. Vision Processing Setup
uv pip install "py-data-juicer[generic,vision]"
3. Full Processing Pipeline
uv pip install "py-data-juicer[generic,nlp,vision,distributed]"
4. Complete Installation
# Install all features (except sandbox)
uv pip install "py-data-juicer[all]"
5. For Development Mode
For contributors and developers:
# Clone repository
git clone https://github.com/modelscope/data-juicer.git
cd data-juicer
# Install dev dependencies
uv pip install -e ".[dev]"
Besides scenario-based installation, we also provide OP-based and recipe-based installation methods.
- Install dependencies for specific OPs
As the number of OPs grows, the combined dependencies of all OPs become very heavy. Instead of using the command pip install -v -e .[all] to install all dependencies,
we provide two lighter alternatives:
- Automatic Minimal Dependency Installation: During the execution of Data-Juicer, minimal dependencies are installed automatically. This allows for immediate execution but may lead to dependency conflicts.
- Manual Minimal Dependency Installation: To manually install the minimal dependencies tailored to a specific execution configuration, run one of the following commands:
# only for installation from source
python tools/dj_install.py --config path_to_your_data-juicer_config_file
# use command line tool
dj-install --config path_to_your_data-juicer_config_file
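The two manual variants above differ only in their entry point. The following is a small sketch of selecting between them from a script, assuming the script runs from the repository root when installing from source. The function `build_dj_install_command` and the config file name in the usage example are hypothetical; the two entry points and the `--config` flag are the ones shown above.

```python
import shlex
import sys


def build_dj_install_command(config_path, from_source=False):
    """Return the argument list for the minimal-dependency installer.

    from_source=True  -> python tools/dj_install.py --config <config_path>
    from_source=False -> dj-install --config <config_path>
    """
    if from_source:
        entry_point = [sys.executable, "tools/dj_install.py"]
    else:
        entry_point = ["dj-install"]
    return entry_point + ["--config", str(config_path)]


if __name__ == "__main__":
    # "my_config.yaml" is a placeholder, not a file shipped with Data-Juicer.
    command = build_dj_install_command("my_config.yaml")
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the config path contains spaces.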
- You can
  - either pull our pre-built image from DockerHub:
    docker pull datajuicer/data-juicer:<version_tag>
    - If you cannot connect to DockerHub, please use another registry mirror (you can find some on the Internet):
      docker pull <other_registry_mirror>/datajuicer/data-juicer:<version_tag>
  - or run the following command to build the Docker image, including the latest data-juicer, with the provided Dockerfile:
    docker build -t datajuicer/data-juicer:<version_tag> .
  - The format of <version_tag> is like v0.2.0, which is the same as the release version tag.
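The image references used above follow one pattern: an optional mirror prefix, the repository datajuicer/data-juicer, and a version tag. The following is a minimal sketch that assembles the reference and checks the tag shape. The function name, the regular expression (a "v" followed by three dot-separated numbers, inferred from the v0.2.0 example), and the mirror host in the usage example are assumptions for illustration.

```python
import re

REPOSITORY = "datajuicer/data-juicer"
# Assumed tag shape, inferred from the example "v0.2.0".
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def image_reference(version_tag, mirror=None):
    """Return '<mirror>/datajuicer/data-juicer:<tag>' or the DockerHub form."""
    if not TAG_PATTERN.fullmatch(version_tag):
        raise ValueError(f"unexpected version tag: {version_tag!r}")
    prefix = f"{mirror.rstrip('/')}/" if mirror else ""
    return f"{prefix}{REPOSITORY}:{version_tag}"


if __name__ == "__main__":
    print("docker pull " + image_reference("v0.2.0"))
    print("docker pull " + image_reference("v0.2.0", mirror="mirror.example.com"))
```

Rejecting tags without the leading "v" catches a common slip, such as typing 0.2.0, before the pull or build is attempted.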
- Installation check:
import data_juicer as dj
print(dj.__version__)
- Modular Installation
  - Install only what you need
  - Combine components as required
  - Use all for complete installation
- Sandbox Environment
  - Separate installation for experimental features
  - Will be provided as micro-services in the future
- For Video-related Operators
  - Before using video-related operators, FFmpeg must be installed and accessible via the $PATH environment variable.
  - You can install FFmpeg using a package manager (e.g. sudo apt install ffmpeg on Debian/Ubuntu, brew install ffmpeg on OS X) or visit the official FFmpeg link.
  - Check that your environment path is set correctly by running the ffmpeg command from the terminal.
- Getting Help
  - Please check the documentation and existing issues first
  - Create GitHub issues when necessary
  - Join community channels for discussions
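Returning to the FFmpeg note above, the check that ffmpeg is reachable through the PATH environment variable can also be done from Python. The following is a minimal sketch using only the standard library; the helper names `find_executable` and `ffmpeg_version_line` are hypothetical and not part of Data-Juicer.

```python
import shutil
import subprocess


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def ffmpeg_version_line():
    """Return the first line printed by `ffmpeg -version`, or None if unavailable."""
    path = find_executable("ffmpeg")
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0 or not result.stdout:
        return None
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    line = ffmpeg_version_line()
    if line is None:
        print("FFmpeg was not found on PATH; install it before using video-related operators.")
    else:
        print(line)
```

The lookup uses the PATH of the Python process, so running the script in the same environment as Data-Juicer gives the most relevant result.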