This is a curated CATH 4.3 dataset for PiFold, an updated version of the CATH 4.2 dataset by Ingraham et al. (NeurIPS 2019). Compared with that version, it uses better structures (PDB-REDO), contains more chains, and is based on the latest CATH release. Gaps are included and marked with "-", missing regions are marked as "X" with NaN coordinates, and tags and cases with large missing regions have been removed.
Preprocessed data and splits are provided in cathPi.tgz:
- chain_set.jsonl: maximum sequence length 500 aa
- chain_set_splits.json: Test: 1422, Train: 18960, Validation: 1436
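
The two files can be read with the Python standard library. The following is a minimal sketch, not the official PiFold loader. It assumes that cathPi.tgz has already been extracted, that each line of chain_set.jsonl is a JSON object with "name", "seq", and "coords" fields, and that chain_set_splits.json maps split names to lists of chain names. These field names and the helper names (`load_chains`, `split_chains`, `count_unknown`) are assumptions for illustration and should be checked against the actual files.

```python
import json


def load_chains(jsonl_path, max_len=500):
    """Read one JSON object per line and index the chains by name.

    Assumes each record has "name" and "seq" fields (an assumption,
    not confirmed by the dataset description).
    """
    chains = {}
    with open(jsonl_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Python's json module accepts NaN literals by default,
            # which matters for missing regions with NaN coordinates.
            entry = json.loads(line)
            if len(entry["seq"]) <= max_len:
                chains[entry["name"]] = entry
    return chains


def split_chains(chains, splits_path):
    """Group loaded chains by split.

    Assumes the splits file is a JSON object mapping a split name
    to a list of chain names (an assumption).
    """
    with open(splits_path) as fh:
        splits = json.load(fh)
    return {
        part: [chains[name] for name in names if name in chains]
        for part, names in splits.items()
    }


def count_unknown(seq):
    """Count positions marked as a gap ("-") or a missing residue ("X")."""
    return sum(1 for aa in seq if aa in "-X")
```

A typical call would be `chains = load_chains("chain_set.jsonl")` followed by `parts = split_chains(chains, "chain_set_splits.json")`, after which the number of chains in each split can be compared with the counts listed above.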