lucas-ebi/sloth

SLOTH – Structural Loader with On-demand Traversal Handling

Lazy by design. Fast by default.


SLOTH is a fast, flexible mmCIF parser for structural biology workflows. Built on the C++ gemmi backend, it parses files eagerly and constructs Python objects lazily, which keeps it efficient for both large-scale pipelines and interactive exploration.
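The split between eager parsing and lazy object construction can be illustrated with a minimal sketch. It is not SLOTH's implementation: `LazyCategory`, `Row`, and `rows_built` are hypothetical names, and the column data is made up. The idea shown is that parsed columns are held as plain lists, while per-row objects are only created (and cached) when a row is actually requested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One materialized row; built only when somebody asks for it."""
    values: dict


class LazyCategory:
    """Hypothetical container: columns are stored eagerly, rows are built lazily."""

    def __init__(self, columns):
        # Eager step: the parsed columns are kept as plain lists of strings.
        self._columns = columns
        # Lazy step: Row objects are cached here the first time they are requested.
        self._row_cache = {}

    def __len__(self):
        # Number of rows = length of any column (0 if there are no columns).
        return len(next(iter(self._columns.values()), []))

    def __getitem__(self, index):
        if index not in self._row_cache:
            self._row_cache[index] = Row(
                {name: col[index] for name, col in self._columns.items()}
            )
        return self._row_cache[index]

    @property
    def rows_built(self):
        # How many Row objects have been materialized so far.
        return len(self._row_cache)


# Made-up data standing in for an already-parsed _atom_site category.
atom_site = LazyCategory({
    "Cartn_x": ["10.1", "11.2", "12.3"],
    "type_symbol": ["N", "C", "O"],
})
first = atom_site[0]  # only this row is constructed
```

With this design, touching one row of a three-row category builds exactly one `Row`; the other two stay as raw column entries until they are needed.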

  • High-speed parsing via gemmi
  • Lazy construction of row and item objects for memory efficiency
  • Pythonic dot-notation access to mmCIF data
  • Multi-level validation: MMCIFValidator().validate() runs the full mmCIF dictionary + wwPDB rule suite and returns a ValidationReport
  • Schema-aware warnings: unknown categories/items trigger SchemaWarning with "Did you mean …?" suggestions
  • Tab completion & fuzzy matching: __dir__() exposes item/category/block names; typos produce helpful AttributeError messages
  • Pluggable validation with cross-category support and model-level plugin registration
  • JSON export/import with automatic relationship resolution
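The tab-completion and "Did you mean …?" behaviour listed above can be approximated with standard-library tools. The sketch below is an assumption about how such a feature might look, not SLOTH's code: the `Category` class and its error wording are hypothetical, and the suggestion logic uses `difflib.get_close_matches`.

```python
import difflib


class Category:
    """Hypothetical stand-in for a category object with dot-notation access."""

    def __init__(self, items):
        self._items = dict(items)

    def __dir__(self):
        # Expose item names so interactive shells can tab-complete them.
        return sorted(set(super().__dir__()) | set(self._items))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        items = self.__dict__.get("_items", {})
        if name in items:
            return items[name]
        # Suggest the closest known item name, if any is similar enough.
        close = difflib.get_close_matches(name, list(items), n=1)
        hint = f" Did you mean {close[0]!r}?" if close else ""
        raise AttributeError(f"No item named {name!r}.{hint}")


# Made-up data; a real category would come from a parsed file.
atom_site = Category({"Cartn_x": ["10.1"], "type_symbol": ["N"]})

try:
    atom_site.Cartn_X  # typo: wrong case on the last letter
except AttributeError as exc:
    message = str(exc)
```

Here `message` names the misspelled attribute and suggests `Cartn_x`, while `dir(atom_site)` includes both item names for completion.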

Installation

pip install -i https://test.pypi.org/simple/ sloth-mmcif

Or from source:

git clone https://github.com/lucas-ebi/sloth.git
cd sloth
pip install -e ".[dev]"

Quick Start

from sloth import MMCIFHandler

handler = MMCIFHandler()
mmcif = handler.read("1abc.cif")

# Dot notation
print(mmcif.data_1ABC._struct.title[0])
print(mmcif.data_1ABC._atom_site.Cartn_x[0])

# Dictionary notation
x = mmcif.data[0]["_atom_site"]["Cartn_x"]

# Export to nested JSON
handler.export(mmcif, file_path="output.json", indent=2)

Validation

from sloth import MMCIFValidator

# Full validation (dictionary schema + wwPDB rules)
vp = MMCIFValidator()
report = vp.validate(mmcif)
print(report.is_valid)      # True / False
print(report.errors)        # ERROR-level issues
print(report.warnings)      # WARNING-level issues
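A report like this is often condensed into a short summary for logs or CI output. The helper below is a sketch that relies only on the three attributes shown above (`is_valid`, `errors`, `warnings`); it assumes `errors` and `warnings` are sized iterables whose entries can be converted to strings. `summarize` is a hypothetical helper, and the stand-in report is built with `SimpleNamespace` so the example runs without SLOTH installed.

```python
from types import SimpleNamespace


def summarize(report):
    """Return a short, human-readable summary of a validation report."""
    status = "valid" if report.is_valid else "invalid"
    lines = [
        f"{status}: {len(report.errors)} error(s), "
        f"{len(report.warnings)} warning(s)"
    ]
    # One line per issue, errors first.
    lines += [f"  ERROR   {issue}" for issue in report.errors]
    lines += [f"  WARNING {issue}" for issue in report.warnings]
    return "\n".join(lines)


# Stand-in object with made-up content; a real report would come from validate().
fake_report = SimpleNamespace(
    is_valid=False,
    errors=["example error message"],
    warnings=[],
)
summary = summarize(fake_report)
```

Passing the real `report` from the snippet above in place of `fake_report` should work the same way, provided the stated assumptions about its attributes hold.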

Performance

Benchmarks on synthetic mmCIF files (macOS, Python 3.10):

File Size   Full Parse   Selective   Access Speed   Memory (Parse)   Memory (Access)
1KB         12ms         13ms        40μs           198KB            4KB
10KB        12ms         13ms        97μs           222KB            13KB
100KB       13ms         14ms        594μs          1.0MB            104KB
1.0MB       19ms         25ms        6ms            7.7MB            954KB
50.7MB      394ms        693ms       298ms          205.4MB          46.1MB
102.0MB     817ms        1.4s        607ms          386.8MB          75.5MB

Note: Access memory can appear smaller than the file on disk because Python's string interning deduplicates repeated values in mmCIF columns (e.g., atom type symbols, residue names, chain IDs). When many rows share the same string, Python stores it only once — so memory usage after access reflects unique string content rather than total row count.
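The deduplication effect described in the note can be reproduced in isolation. The sketch below uses `sys.intern` explicitly; whether SLOTH interns strings explicitly or benefits from it indirectly is not stated here, so treat this purely as an illustration of the mechanism. The residue name `"ALA"` and the row count are made up.

```python
import sys

# Build equal strings at run time so they typically start out as separate
# objects (this is CPython behaviour, not a language guarantee).
raw = ["".join(["A", "L", "A"]) for _ in range(1000)]
distinct_before = len({id(s) for s in raw})

# Interning maps every equal string to one shared object.
interned = [sys.intern(s) for s in raw]
distinct_after = len({id(s) for s in interned})

print(distinct_before, distinct_after)
```

After interning, all 1000 entries refer to a single string object, so the column costs roughly one string plus 1000 references rather than 1000 separate strings.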

Documentation

Full documentation, API reference, and interactive cookbook:

Contributing

  1. Fork the repo
  2. Create a feature branch
  3. Add tests
  4. Submit a PR

License

MIT License — use freely, modify responsibly.

About

A Python library for parsing and writing mmCIF (macromolecular Crystallographic Information Framework) files with an ultra-simple API that's automatically optimized for performance.
