Enhance encode_targets.py with dataset and model support #11

smokyngt wants to merge 1 commit into qdrant:master from
Conversation
Refactor encoding script to support new dataset loading and model options.
Pull Request Overview
This PR enhances the encode_targets.py script to support multiple data sources and embedding models. The refactor transforms a simple file-based encoding script into a more flexible tool that can handle both HuggingFace datasets and local files with different embedding frameworks.
Key changes:
- Added support for HuggingFace datasets (MS MARCO) alongside existing file-based input
- Implemented dual model support (FastEmbed and SentenceTransformer) with automatic fallback
- Enhanced argument parsing with vocabulary filtering and improved batch processing
```python
        if max_count and i >= max_count:
            break
    else:
        raise ValueError(f"Dataset non supporté: {name}")
```
Error message should be in English to maintain consistency with the rest of the codebase.
Suggested change:

```diff
-        raise ValueError(f"Dataset non supporté: {name}")
+        raise ValueError(f"Unsupported dataset: {name}")
```
```python
        )
        yield embeddings
    else:
        raise ValueError(f"Model type non supporté: {model_type}")
```
Error message should be in English to maintain consistency with the rest of the codebase.
Suggested change:

```diff
-        raise ValueError(f"Model type non supporté: {model_type}")
+        raise ValueError(f"Unsupported model type: {model_type}")
```
```python
    texts_list = list(texts_generator)
    for embeddings_batch in encode_texts(texts_list, model, model_type, args.batch_size, args.use_cuda):
```
Converting the entire generator to a list loads all texts into memory at once, which could cause memory issues with large datasets. Consider processing texts in chunks or streaming them directly to the encoding function.
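The chunked processing suggested here can be done with a small helper that pulls fixed-size batches from the generator instead of materializing it; `chunked` is an illustrative sketch, not part of the PR.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items, so only one
    chunk of texts is held in memory at a time."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

The loop above could then iterate `for texts_chunk in chunked(texts_generator, args.batch_size)` and pass each chunk to the encoding function, keeping memory usage bounded regardless of dataset size.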
```python
        embeddings = model.encode(
            list(texts),
            batch_size=batch_size,
            convert_to_numpy=True,
            show_progress_bar=True
        )
        yield embeddings
```
The SentenceTransformer branch converts texts to a list again and yields all embeddings at once, negating the benefit of the generator pattern. Consider yielding embeddings in batches to maintain memory efficiency.
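Batch-wise yielding for the SentenceTransformer branch could look like the following sketch; `encode_in_batches` is a hypothetical helper, and it assumes only that the model exposes an `encode(list, convert_to_numpy=...)` method as SentenceTransformer does.

```python
def encode_in_batches(model, texts, batch_size=32):
    """Encode an iterable of texts batch by batch, yielding each
    batch's embeddings instead of accumulating them all in memory."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) >= batch_size:
            yield model.encode(batch, convert_to_numpy=True)
            batch = []
    if batch:  # flush the final partial batch
        yield model.encode(batch, convert_to_numpy=True)
```

This keeps the generator contract honest: the caller receives one batch of embeddings at a time, matching the memory profile the FastEmbed branch presumably already has.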
```python
        raise ValueError("Must specify either --input-file or --dataset")

    model, model_type = get_model(args.model_name)
    os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
```
This will fail if args.output_file is just a filename without a directory path, as os.path.dirname() would return an empty string. Consider checking if dirname is non-empty before calling makedirs.
Suggested change:

```diff
-    os.makedirs(os.path.dirname(args.output_file), exist_ok=True)
+    output_dir = os.path.dirname(args.output_file)
+    if output_dir:
+        os.makedirs(output_dir, exist_ok=True)
```
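An equivalent fix is possible with `pathlib`, which avoids the empty-dirname edge case entirely because `Path("file.npy").parent` is `"."` and `mkdir(exist_ok=True)` tolerates an existing directory; `ensure_parent_dir` is an illustrative alternative, not what the PR uses.

```python
import os
from pathlib import Path

def ensure_parent_dir(output_file: str) -> None:
    """Create the parent directory of output_file if needed.
    Safe for bare filenames: their parent resolves to '.'."""
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
```

Whether to prefer this over the guarded `os.makedirs` is a style choice; both behave identically for bare filenames and nested paths.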