Skip to content

[I/O] KeyErrors are thrown for info columns not present in a region #150

@edawson

Description

@edawson

When passing a list of INFO fields and reading a VCF region where no variants have an entry for the field, an error is thrown. This happens with the 1000 Genomes VCF and the population-specific allele frequencies such as AFR. I assume it's because the header may be poorly formatted and the expected number does not match the real number.

More graceful behavior would be helpful, and we should probably have a test or two to catch things like this (and to declare whether or not we consider this a bug).

Example read:

onekg = utils.read_vcf_to_pandas("/home/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz", regions=["1:1-1000000"], info_keys=["AA", "AC", "AF", "AMR", "AN"], num_threads=1)

Error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-9addf9417f5b> in <module>
      1 ## Read in a VCF using the VariantWorks VCFReader
----> 2 onekg = utils.read_vcf_to_pandas("/home/edawson/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz", regions=["1:1-1000000"], info_keys=["AFR"], num_threads=1)

~/sandbox/somatic-vc/somaticdf/utils.py in read_vcf_to_pandas(vcf_file, regions, num_threads, tags, info_keys, filter_keys, format_keys, chunk_size)
    343     Read a VCF file into a PANDAS Variant DataFrame.
    344     """
--> 345     v = VCFReader(vcf_file, regions=regions, num_threads=num_threads, tags=tags, require_genotype=False, info_keys=info_keys, format_keys=format_keys, filter_keys=filter_keys, chunksize=chunk_size)
    346     df = v.dataframe
    347     df = set_default_types(df)

~/sandbox/VariantWorks/variantworks/io/vcfio.py in __init__(self, vcf, bams, is_fp, require_genotype, tags, info_keys, filter_keys, format_keys, regions, num_threads, chunksize, sort, unbounded_val_max_cols)
    172 
    173         # Parse the VCF
--> 174         self._parallel_parse_vcf()
    175 
    176     @property

~/sandbox/VariantWorks/variantworks/io/vcfio.py in _parallel_parse_vcf(self)
    599                     self._info_vcf_keys.append(h['ID'])
    600         for k in self._info_vcf_keys:
--> 601             header_number = vcf.get_header_type(k)['Number']
    602             self._info_vcf_key_counts[k] = self._get_normalized_count(header_number, 1, len(vcf.samples))
    603             self._header_number[k] = header_number

~/anaconda3/envs/rapids16/lib/python3.8/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.get_header_type()

~/anaconda3/envs/rapids16/lib/python3.8/site-packages/cyvcf2/cyvcf2.pyx in cyvcf2.cyvcf2.VCF.get_header_type()

KeyError: b'AFR'

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions