Skip to content

How to import pseudogenes from GFF3 file as GRanges object? #68

@zhewa

Description

@zhewa

Thank you for developing such a powerful package for working with genomes in R. My questions is how do I import pseudogenes, e.g. MTCO2P2, from a NCBI RefSeq human assembly GFF3 file TxDb as GRanges object using GenomicFeatures functions? This entry was not found in the output GRanges objects of functions transcripts, exons, cds, or genes.

Pseudogene MTCO2P2 is on line 3862455 of GFF3 file GCF_000001405.40_GRCh38.p14_genomic.gff:
# NC_000018.10 Curated Genomic pseudogene 47853230 47853472 . - . ID=gene-MTCO2P2;Dbxref=GeneID:100873202,HGNC:HGNC:25354;Name=MTCO2P2;description=MT-CO2 pseudogene 2;gbkey=Gene;gene=MTCO2P2;gene_biotype=pseudogene;gene_synonym=HsT4010;pseudo=true

The GFF3 file (Last modified: 2023-10-11 12:39) was downloaded from here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz

library(GenomicFeatures)
db <- makeTxDbFromGFF("GCF_000001405.40_GRCh38.p14_genomic.gff")

# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning messages:
# 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#     some transcripts have no "transcript_id" attribute ==> their
#     name ("tx_name" column in the TxDb object) was set to NA
# 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
#     the transcript names ("tx_name" column in the TxDb object)
#     imported from the "transcript_id" attribute are not unique
# 3: In .find_exon_cds(exons, cds) :
#     The following transcripts have exons that contain more than
#     one CDS (only the first CDS was kept for each exon):
#     rna-NM_001134939.1, rna-NM_001172437.2, rna-NM_001184961.1,
#     rna-NM_001301020.1, rna-NM_001301302.1, rna-NM_001301371.1,
#     rna-NM_002537.3, rna-NM_004152.3, rna-NM_015068.3,
#     rna-NM_016178.2

tx <- transcripts(db, columns = c("tx_id", "tx_name", "gene_id"))
ex <- exons(db, columns = c("exon_id", "gene_id"))
cds <- cds(db, columns = c("cds_id", "gene_id"))
ge <- genes(db)

txdf <- as.data.frame(tx)
exdf <- as.data.frame(ex)
cdsdf <- as.data.frame(cds)
gedf <- as.data.frame(ge)

"MTCO2P2" %in% txdf$gene_id
# [1] FALSE
47853230 %in% txdf$start
# [1] FALSE
47853472 %in% txdf$end
# [1] FALSE

"MTCO2P2" %in% exdf$gene_id
# [1] FALSE
47853230 %in% exdf$start
# [1] FALSE
47853472 %in% exdf$end
# [1] FALSE

"MTCO2P2" %in% cdsdf$gene_id
# [1] FALSE
47853230 %in% cdsdf$start
# [1] FALSE
47853472 %in% cdsdf$end
# [1] FALSE

"MTCO2P2" %in% gedf$gene_id
# [1] FALSE
47853230 %in% gedf$start
# [1] FALSE
47853472 %in% gedf$end
# [1] FALSE
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] GenomicFeatures_1.54.1 AnnotationDbi_1.64.1  
[3] Biobase_2.62.0         GenomicRanges_1.54.1  
[5] GenomeInfoDb_1.38.2    IRanges_2.36.0        
[7] S4Vectors_0.40.2       BiocGenerics_0.48.1   

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0             SummarizedExperiment_1.32.0
 [3] xfun_0.41                   rjson_0.2.21               
 [5] lattice_0.22-5              vctrs_0.6.5                
 [7] tools_4.3.2                 bitops_1.0-7               
 [9] generics_0.1.3              curl_5.2.0                 
[11] parallel_4.3.2              tibble_3.2.1               
[13] fansi_1.0.6                 RSQLite_2.3.4              
[15] blob_1.2.4                  pkgconfig_2.0.3            
[17] Matrix_1.6-4                dbplyr_2.4.0               
[19] lifecycle_1.0.4             GenomeInfoDbData_1.2.11    
[21] compiler_4.3.2              stringr_1.5.1              
[23] Rsamtools_2.18.0            Biostrings_2.70.1          
[25] progress_1.2.3              codetools_0.2-19           
[27] htmltools_0.5.7             RCurl_1.98-1.13            
[29] yaml_2.3.8                  pillar_1.9.0               
[31] crayon_1.5.2                BiocParallel_1.36.0        
[33] DelayedArray_0.28.0         cachem_1.0.8               
[35] abind_1.4-5                 tidyselect_1.2.0           
[37] digest_0.6.33               stringi_1.8.3              
[39] dplyr_1.1.4                 restfulr_0.0.15            
[41] grid_4.3.2                  biomaRt_2.58.0             
[43] fastmap_1.1.1               SparseArray_1.2.2          
[45] cli_3.6.2                   magrittr_2.0.3             
[47] S4Arrays_1.2.0              XML_3.99-0.16              
[49] utf8_1.2.4                  prettyunits_1.2.0          
[51] filelock_1.0.3              rappdirs_0.3.3             
[53] bit64_4.0.5                 rmarkdown_2.25             
[55] XVector_0.42.0              httr_1.4.7                 
[57] matrixStats_1.2.0           bit_4.0.5                  
[59] png_0.1-8                   hms_1.1.3                  
[61] evaluate_0.23               memoise_2.0.1              
[63] knitr_1.45                  BiocIO_1.12.0              
[65] BiocFileCache_2.10.1        rtracklayer_1.62.0         
[67] rlang_1.1.2                 glue_1.6.2                 
[69] DBI_1.1.3                   xml2_1.3.6                 
[71] rstudioapi_0.15.0           R6_2.5.1                   
[73] MatrixGenerics_1.14.0       GenomicAlignments_1.38.0   
[75] zlibbioc_1.48.0    

Thank you

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions