Permission error when using the model with PySpark on worker machines #103

@LovAsawa-Draup

EasyNMT currently loads OpusMT models as follows:

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

However, this approach does not let the caller specify a custom cache directory for model storage. This becomes a problem when the model is deployed in a distributed environment, such as the worker nodes of a Spark cluster. The model is downloaded to the default Hugging Face cache directory (/home/.cache). The master node typically has the necessary permissions for this directory, but worker nodes often lack write access to /home/.

As a result, when the model is initialized on worker nodes, they attempt to download the model to the same default location, leading to permission errors.

Proposed Solution:
To avoid permission issues and ensure proper model distribution across worker nodes, the cache directory should be explicitly set during model initialization. The cache_dir parameter can be passed directly to the from_pretrained() method, ensuring models are downloaded and cached in a specified directory accessible by all nodes.
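
A minimal sketch of what this could look like, assuming a small wrapper around the two from_pretrained() calls. The helper names resolve_cache_dir and load_opus_mt and the "easynmt_cache" fallback folder are hypothetical and not part of EasyNMT; only the cache_dir keyword of from_pretrained() comes from the transformers API.

```python
import os
import tempfile


def resolve_cache_dir(preferred=None):
    """Return a writable directory to use as the model cache.

    Tries `preferred` first, then falls back to a folder under the system
    temp directory (hypothetical name), which is usually writable on workers.
    """
    candidates = [preferred] if preferred else []
    candidates.append(os.path.join(tempfile.gettempdir(), "easynmt_cache"))
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            # Cannot create this directory (e.g. permission denied); try the next one.
            continue
        if os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable cache directory found")


def load_opus_mt(model_name, cache_dir=None):
    """Load tokenizer and model, caching downloads in a writable directory."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    cache_dir = resolve_cache_dir(cache_dir)
    tokenizer = MarianTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = MarianMTModel.from_pretrained(model_name, cache_dir=cache_dir)
    return tokenizer, model
```

On a worker this might be called as load_opus_mt("Helsinki-NLP/opus-mt-de-en", cache_dir="/tmp/hf_cache"), where the path is only an example of a location the worker process can write to. The fallback to the temp directory is a design choice for this sketch; a shared mount readable by all nodes would avoid each worker downloading its own copy.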
