HDDFlyzer is a CLI-first cheminformatics toolkit with a Python API for reproducible, traceable molecular descriptor-space workflows. It generates descriptor, similarity, projection, and visualization outputs from compound collections, stores them with metadata, and supports later reconstruction, inspection, and analysis of completed runs.
HDDFlyzer can be used in two complementary ways:
- as a command-line tool, through the
hddflyzercommand; - as a Python library, for workflow execution, run reconstruction, artifact inspection, and scientific analysis over existing outputs.
The CLI and Python API are two entry points into the same traceable workflow engine.
HDDFlyzer organizes a cheminformatics analysis around a local molecule registry and a dataset-first results folder. A typical workflow starts from a CSV of compounds, creates a canonical registry, annotates molecules, computes Tanimoto similarity and molecular descriptor tables, performs feature curation and correlation-based pruning, projects molecular spaces with PCA, t-SNE, or UMAP, and writes figures plus metadata-rich outputs.
The design goal is practical reproducibility: every major operation writes
stable files under results/<tag>/, and each step keeps enough metadata to
understand how the output was generated.
Cheminformatics workflows often produce many disconnected files: descriptor tables, similarity matrices, dimensionality-reduction coordinates, figures, and metadata. HDDFlyzer keeps these products organized under a dataset-specific results folder and records how they were generated.
This makes the package useful for exploratory molecular-space analysis, teaching, workflow prototyping, and reproducible local studies where the relationship between input molecules, computational steps, and generated artifacts must remain visible.
HDDFlyzer currently focuses on traceable molecular descriptor-space workflows for local datasets. It supports registry preparation, natural-product class annotation, similarity calculation, descriptor engineering, feature curation, dimensionality reduction, visualization, run reconstruction, and scientific views over existing outputs.
HDDFlyzer is not a docking tool, a complete drug-discovery platform, an automatic enrichment workflow, or an automatic chemical interpretation system. Its scientific helpers operate on already generated artifacts; they do not rerun the pipeline or infer chemical conclusions automatically.
molecule CSV
-> canonical registry
-> NP-class annotation
-> Tanimoto similarity
-> descriptor tables
-> feature curation and pruning
-> PCA/t-SNE/UMAP projections
-> figures, metadata, manifest, and workflow summary
-> reconstructed result views from Python
All outputs for a collection are stored under one dataset-specific directory, for example:
results/aocd/
Python 3.11 or newer is recommended. RDKit is best installed from conda-forge.
After v0.1.5 is merged, tagged, and published through the manual PyPI
workflow, HDDFlyzer can be installed from PyPI:
pip install hddflyzerFor local development from a source checkout:
conda create -n hddflyzer_env python=3.11
conda activate hddflyzer_env
conda install -c conda-forge rdkit
pip install -e ".[dev]"UMAP support is included through umap-learn in the package dependencies.
Place local input collections in:
examples/
For example:
examples/valid_metadata_aocd.csv
Then prepare the registry with:
hddflyzer data prepare aocdThe default input directory is separate from hddflyzer/data/, which contains
package code. To use another input directory, set HDDFLYZER_DATA_DIR.
PowerShell:
$env:HDDFLYZER_DATA_DIR = "C:\path\to\csvs"
hddflyzer data prepare aocdUnix-like shells:
HDDFLYZER_DATA_DIR=/path/to/csvs hddflyzer data prepare aocdPrepare a molecule registry:
hddflyzer data prepare aocdRun the standard workflow:
hddflyzer pipeline run aocdRun a lighter workflow that stops before t-SNE and UMAP:
hddflyzer pipeline run aocd --skip-dimredExecute only selected stages:
hddflyzer pipeline run aocd --stages chem.features,chem.pruningLoad the completed run from Python:
from hddflyzer.results import load_workflow_run
run = load_workflow_run("aocd")
run.workflow_contract
run.outputs(category="dimred")Use the unified CLI:
hddflyzer <module> <subcommand> [args]Recommended real-data sequence:
hddflyzer data prepare aocd
hddflyzer annotate npc aocd
hddflyzer chem tanimoto aocd
hddflyzer chem sample aocd
hddflyzer chem features aocd
hddflyzer chem curate-features aocd
hddflyzer chem reference-features aocd --reference
hddflyzer chem stats base aocd
hddflyzer chem stats hddf aocd
hddflyzer chem pruning aocdDimensionality-reduction commands include:
hddflyzer dimred pca aocd
hddflyzer dimred pca-joint aocd dianatdb
hddflyzer dimred tsne features aocd
hddflyzer dimred tsne tanimoto aocd
hddflyzer dimred tsne pruning aocd
hddflyzer dimred umap features aocd
hddflyzer dimred umap tanimoto aocd
hddflyzer dimred umap pruning aocdProjection modes mean:
features = descriptor space from full feature table
tanimoto = structural similarity space from Morgan fingerprints
pruning = descriptor space after correlation-based feature pruning
Visualization commands include:
hddflyzer viz npc aocd
hddflyzer viz similarity tanimoto aocd
hddflyzer viz similarity fingerprints aocd
hddflyzer viz correlations hddf aocd
hddflyzer viz pca analysis aocd
hddflyzer viz tsne features aocd QED 30
hddflyzer viz umap features aocd QED
hddflyzer viz umap tanimoto aocd PathwayReference-dependent similarity descriptors are intentionally separate from the intrinsic feature table. They describe molecule-reference pairs, not molecules alone:
hddflyzer chem reference-features aocd --referenceOptional sampled feature calculation is available when Tanimoto sampling keeps fewer compounds than the full registry:
hddflyzer chem features aocd --sampledOptional pickle output is explicit and not the default:
hddflyzer chem features aocd --save-pklThe same workflow engine can be called from Python:
from hddflyzer.pipeline import execute_workflow
execution = execute_workflow("aocd")
execution.ok
execution.stage_results
run = execution.runFor lower-level control, run_pipeline() returns list[StageResult], while
run_workflow() returns a reconstructed WorkflowRun or propagates an error if
the run cannot be reconstructed.
The project uses a dataset-first results layout:
results/
<tag>/
manifest.json
workflow_summary.md
registry/
annotations/
chemistry/
features/
dimred/
figures/
For aocd, representative outputs include:
results/aocd/registry/molecules.csv
results/aocd/annotations/npclassifier/npclassifier.csv
results/aocd/chemistry/tanimoto/tanimoto_matrix.npz
results/aocd/chemistry/tanimoto/tanimoto_ids.csv
results/aocd/features/full/features.csv
results/aocd/features/curated/features_ml.csv
results/aocd/features/pruning/selected_features.txt
results/aocd/dimred/pca/pca_coordinates.csv
results/aocd/dimred/tsne/*/
results/aocd/dimred/umap/*/
results/aocd/manifest.json
results/aocd/workflow_summary.md
Each major step writes one interpretable data file plus one metadata file where
possible. Large matrices are stored in compressed .npz. Pickle output is
optional, not default.
manifest.json is the dataset-level index for a run. It records operation
history, generated files, stale file references, per-operation metadata,
current_outputs, output_categories, and a workflow_contract block that
makes the canonical registry -> chem -> dimred -> viz -> metadata/results
workflow explicit. workflow_summary.md presents the same contract as a concise
human-readable index.
The result API reconstructs completed runs from results/<tag>/manifest.json.
It does not create new state or recalculate scientific outputs.
from hddflyzer.results import load_workflow_run, load_artifact
run = load_workflow_run("aocd")
artifact = run.artifact(kind="tanimoto_matrix", category="chem")
loaded = load_artifact(artifact)
loaded.artifact
loaded.data
loaded.metadataLoaded result artifacts can also be selected by semantic kind:
artifact = run.artifact(
kind="descriptor_table",
category="chem",
operation="chem.features",
)
loaded = run.load_artifact(
kind="descriptor_table",
category="chem",
operation="chem.features",
)
loaded.data
loaded.metadataSupported artifact kinds are:
molecule_registry
descriptor_table
tanimoto_matrix
projection_coordinates
figure
metadata
workflow_summary
unknown
run.artifacts(...) returns all matching artifacts. run.artifact(...) returns
exactly one artifact or raises a clear error when none or multiple candidates
match. run.load_artifact(...) selects one artifact and loads it using the same
contracts as load_artifact(...). The optional required filter can be used to
disambiguate artifacts by a path fragment, for example
required="features/full/features.csv".
Loaded artifacts can be wrapped as lightweight scientific views:
LoadedArtifact -> DescriptorSpace
LoadedArtifact -> SimilaritySpace
LoadedArtifact -> ProjectionSpace
These views do not execute the pipeline and do not recalculate descriptors, similarity, or projections. They wrap artifacts that already exist in a reconstructed run, so Python code can work with results in domain language.
from hddflyzer.results import load_workflow_run
run = load_workflow_run("aocd")
descriptors = run.descriptor_space(
category="chem",
operation="chem.features",
)
similarity = run.similarity_space(category="chem")
projection = run.projection_space(
category="dimred",
operation="dimred.pca",
)Each scientific space exposes molecule_ids when identifiers are available.
Spaces can be compared or aligned by molecular identity:
from hddflyzer.science import align_spaces, has_aligned_molecule_ids, shared_molecule_ids
shared = shared_molecule_ids(descriptors, projection)
aligned = has_aligned_molecule_ids(descriptors, projection)
descriptors_aligned, projection_aligned = align_spaces(descriptors, projection)hddflyzer.science also provides small, pure analysis helpers over existing
spaces:
from hddflyzer.science import (
compare_descriptor_groups,
descriptor_projection_correlations,
projection_neighborhood_preservation,
similarity_projection_correlation,
similarity_projection_neighbor_overlap,
)
global_corr = similarity_projection_correlation(similarity, projection)
neighbor_overlap = similarity_projection_neighbor_overlap(similarity, projection, k=10)
descriptor_ranking = descriptor_projection_correlations(descriptors, projection)
local_preservation = projection_neighborhood_preservation(similarity, projection, k=10)
group_differences = compare_descriptor_groups(descriptors, labels="class_label")These helpers use only already reconstructed artifacts. They do not generate plots, do not perform automatic clustering, do not run enrichment, and do not make automatic chemical interpretations.
Visualization inputs can be resolved from reconstructed results:
from hddflyzer.results import load_workflow_run
from hddflyzer.viz import resolve_viz_inputs, plot_hddf_scatters
run = load_workflow_run("aocd")
inputs = resolve_viz_inputs(
run,
kind="descriptor_table",
category="chem",
required="features/full/features.csv",
)
plot_hddf_scatters(inputs)plot_hddf_scatters() also accepts a loaded descriptor-table artifact:
loaded = run.load_artifact(
kind="descriptor_table",
category="chem",
operation="chem.features",
)
plot_hddf_scatters(loaded)HDDFlyzer is designed as a local Python library and CLI application with defensive handling around reconstructed results:
- Pickle (
.pkl) loading is blocked by default in public loaders. - Pickle can be loaded only with
allow_pickle=True, and should be used only for trusted local files produced in a controlled environment. - Run tags are validated against empty values, path traversal, absolute paths, and path separators.
- Artifacts reconstructed from
manifest.jsonmust resolve inside the correspondingresults/<tag>/run directory. update_manifest()rejects files outsideresults/<tag>/.
These checks reduce accidental unsafe file access and unsafe deserialization. They are not a substitute for treating unknown files and manifests as untrusted input.
HDDFlyzer separates command-line execution, workflow orchestration, result reconstruction, scientific views, visualization, and low-level I/O.
hddflyzer/
cli.py # command-line entry point
config/ # global settings and canonical output paths
pipeline/ # workflow execution contracts and orchestration
results/ # WorkflowRun, ResultArtifact, LoadedArtifact
science/ # scientific views, alignment, metrics, interpretation
io/ # readers, writers, manifests, path safety
data/ # canonical molecule registry logic
chem/ # descriptors, similarity, sampling, pruning, references
dimred/ # PCA, t-SNE, UMAP
viz/ # plotting layer and visualization input resolution
utils/ # shared helpers
The domain modules keep the scientific logic. The io/ layer owns canonical
read/write helpers. The pipeline/ layer exposes run_pipeline(),
run_workflow(), and execute_workflow() for programmatic execution. The
results/ layer reconstructs completed runs from manifest.json through
WorkflowRun, ResultArtifact, and LoadedArtifact.
The central CLI lives in hddflyzer/cli.py. Subpackages may still expose
module-level main() functions for direct execution, but hddflyzer is the
user-facing entry point.
HDDFlyzer follows a simple separation of responsibilities:
pure/helper functions reusable logic for tests, notebooks, and scripts
run() functions file-based workflow execution and manifest updates
main()/hddflyzer command-line parsing and process exit behavior
| Family | Description |
|---|---|
| Constitutional | Molecular weight, LogP, H-bonding, rings, rotatable bonds |
| Topological | Chi, Kappa, Balaban, Bertz, graph indices |
| Electronic | Partial charges and VSA descriptors |
| Geometrical | Currently limited; conformer-based 3D descriptors are planned |
| Hybrid | Derived ratios from base descriptors |
| HDDF | QED, lead-likeness, pharmacophore complexity, synthetic accessibility, desirability |
| Reference Similarity | Morgan, FeatMorgan, AtomPair, RDKFingerprint, Torsion, Layered, MACCS; stored under features/reference/ |
| Reference MCS | Maximum common substructure size, molecule atom counts, Tanimoto, and overlap; stored under features/reference/ |
HDDFlyzer follows these output conventions:
CSV = canonical tabular results
JSON = metadata, parameters, provenance, compact summaries
NPZ = compressed numeric matrices
PKL = optional cache only
PNG = visualization output
TXT = only for small human-facing operational files such as selected_features.txt
The goal is low redundancy, reproducible outputs, stable paths, and readable data products suitable for real dataset analysis.
HDDFlyzer currently focuses on 1D/2D descriptor tables, similarity matrices, dimensionality reduction, visualization, and reconstructed result analysis. Future versions may add explicit conformer-generation and 3D descriptor workflows, including reproducible conformer parameters, energy-minimization metadata, and clear separation between 2D descriptor tables and conformer-based 3D descriptor tables.
Run tests:
pytest -qThe test suite currently covers core utilities, chemistry calculations, dimensionality-reduction helpers, configuration, persistence, orchestration, CLI routing, reconstructed results, scientific views, and visualization APIs.
HDDFlyzer is pre-stable research software. The command structure is stabilizing, but output contracts may still evolve as the real-data pipeline matures.
Contributions are welcome. Please open an issue before submitting a pull request. Follow the existing code style: NumPy-style docstrings, type hints, and SPDX license headers in all source files.
See CONTRIBUTING.md for full guidelines. Please also read our Code of Conduct.
If you use HDDFlyzer in your research, please cite it using the metadata in CITATION.cff or the format below:
Contreras-Torres, F. F. and Saldivar-González, F. I. (2026). HDDFlyzer: High-Dimensional Descriptor-based Feature Space Analyzer. https://github.com/NanoBiostructuresRG/hddflyzer
Developed by Flavio F. Contreras-Torres (Tecnologico de Monterrey) and Fernanda I. Saldivar-González (Tecnologico de Monterrey) Monterrey, Mexico - June 2026
LGPL-3.0-or-later. See LICENSE, COPYING, and COPYING.LESSER.