Transcript discovery

While evaluating RNA alignment on real datasets, the RNAseqEval.py script can try to discover new transcripts. Transcript discovery is enabled by setting the option --calc_new_annotations. New transcripts are discovered in two ways: by combining existing annotations and by removing (skipping) small introns. New transcripts are discovered only for those alignments that do not perfectly match any available annotation. Finally, new transcripts are reported to the user only if they are supported by a minimum number of reads (currently set to three).

During the regular evaluation process, a set of candidate annotations is constructed for each alignment, consisting of all annotations that overlap the alignment. From the set of candidate annotations, a best_match_annotation is chosen based on the number of nucleotides from the alignment that fall inside and outside of each annotation. New transcripts are calculated only for those alignments that do not perfectly match the best_match_annotation.
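The selection step described above can be sketched as follows. This is a minimal illustration and not the code used in RNAseqEval.py: the function names, the inclusive coordinate convention, and the scoring rule (bases inside minus bases outside) are all assumptions, since the text only states that both counts are taken into account.

```python
# Hypothetical sketch: pick a best-match annotation for one alignment.
# Intervals are (start, end) pairs with inclusive coordinates (assumption).

def bases_inside(alignment, exons):
    """Count alignment bases that fall inside the exons of one annotation.

    Assumes the exons of a single annotation do not overlap each other.
    """
    total = 0
    for a_start, a_end in alignment:
        for e_start, e_end in exons:
            overlap = min(a_end, e_end) - max(a_start, e_start) + 1
            if overlap > 0:
                total += overlap
    return total


def choose_best_match(alignment, candidates):
    """Return the name of the candidate annotation with the highest score.

    candidates maps an annotation name to its list of exons.
    The score (inside minus outside) is an assumed stand-in for the
    real criterion, which the documentation does not spell out.
    """
    aligned_len = sum(end - start + 1 for start, end in alignment)
    best_name, best_score = None, None
    for name, exons in candidates.items():
        inside = bases_inside(alignment, exons)
        outside = aligned_len - inside
        score = inside - outside
        if best_score is None or score > best_score:
            best_name, best_score = name, score
    return best_name
```

With an alignment of two 100-base blocks, a candidate that covers both blocks completely wins over one that covers only half of the first block.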

After all new transcripts are determined, they are collected and compared, and only those transcripts that are supported by a minimum number of reads (currently set to 3) are reported to the user.
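The read-support filter can be illustrated with a short sketch. It assumes that two candidate transcripts count as the same transcript when their exon lists are identical; the text only says that transcripts are "collected and compared", so the actual comparison may be more tolerant. The names `filter_supported` and `MIN_READS` are hypothetical.

```python
from collections import defaultdict

MIN_READS = 3  # minimum read support stated in the text


def filter_supported(candidates, min_reads=MIN_READS):
    """Group candidate transcripts by exon structure and keep supported ones.

    candidates is an iterable of (read_name, exons) pairs, where exons is
    a list of (start, end) pairs. Returns a dict that maps each exon
    structure (as a tuple of tuples) to the list of supporting reads.
    """
    support = defaultdict(list)
    for read_name, exons in candidates:
        key = tuple(tuple(exon) for exon in exons)  # hashable exon structure
        support[key].append(read_name)
    return {key: reads for key, reads in support.items()
            if len(reads) >= min_reads}
```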

Output

The general evaluation report contains only the number of new transcripts that were discovered and the number of reads that were used to construct them:

Found 213 potential new annotations with 4860 alignments
                
Detailed report on annotations can be found in an '_annotations.report' file.

Detailed information on discovered transcripts is written to a separate file. If an output file is specified (options -o and --output), the text '_annotations.report' is appended to the output filename; otherwise, the detailed annotation report is written to a file named '_annotations.report'.

An example of a detailed annotation report is given below.

Name: New annotation 18 
Based on:NM_166774
Strand: +
Number of reads: 7
Type:FUSED ANNOTATION
Reads:
m160615_181138_42182_c101000182550000001823232709161603_s1_p0/135263/ccs
m160615_181138_42182_c101000182550000001823232709161603_s1_p0/138904/ccs
m160713_175918_42182_c101000162550000001823232709161621_s1_p0/136269/ccs
m160615_181138_42182_c101000182550000001823232709161603_s1_p0/39346/ccs
m160615_181138_42182_c101000182550000001823232709161603_s1_p0/137514/ccs
m160713_175918_42182_c101000162550000001823232709161621_s1_p0/14795/ccs
m160615_181138_42182_c101000182550000001823232709161603_s1_p0/43113/ccs
Items: [610684, 610897] [611727, 611807] [613615, 613796] [619848, 620178]

New annotation names are generated sequentially. The field Based on contains the transcript the new annotation was based on (the initial best_match_annotation for the alignment). The field Type indicates whether the annotation was constructed by combining existing annotations or by intron skipping. The report also lists all reads used to construct the annotation and, finally, the exons themselves (field Items).
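For downstream processing, a single record in the layout of the example above could be read with a small parser such as the one below. This is a sketch based only on that one example: the function name `parse_record` is hypothetical, and the handling of files containing several records is not covered.

```python
import re


def parse_record(text):
    """Parse one annotation record into a dict.

    Header fields are stored as strings under their own names, read names
    are collected under "reads", and the Items line becomes a list of
    (start, end) integer pairs under "items".
    """
    record = {"reads": []}
    in_reads = False
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Items:"):
            pairs = re.findall(r"\[(\d+),\s*(\d+)\]", line)
            record["items"] = [(int(s), int(e)) for s, e in pairs]
            in_reads = False
        elif line == "Reads:":
            in_reads = True  # following lines are read names
        elif in_reads:
            record["reads"].append(line)
        else:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    return record
```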

Combining existing annotations

The first method for new transcript discovery tries to combine existing annotations so that the result matches the alignment. The process is shown in the figure below.

The blue annotation is chosen as the best_match_annotation because, compared to the purple annotation, more nucleotide bases from the alignment fall within it and fewer fall outside it. However, the first part of the alignment does not perfectly match the first exon of the blue annotation. Therefore, the set of candidate annotations is searched for an exon that does perfectly match the first part of the alignment. In this case, that is the first exon of the purple candidate annotation. To construct the new transcript/annotation, the first exon of the blue best_match_annotation is replaced by the first exon of the purple candidate annotation.
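The exon replacement can be sketched in a few lines. The sketch makes strong simplifying assumptions that are not confirmed by the text: alignment blocks and exons are paired by position, a "perfect match" means identical start and end coordinates, and the function name `combine_annotations` is hypothetical. The real script may tolerate small boundary differences or partially covered terminal exons.

```python
def combine_annotations(alignment, best_match, candidates):
    """Build a new exon list by borrowing exons from other candidates.

    alignment and best_match are lists of (start, end) pairs;
    candidates is a list of exon lists from the other overlapping
    annotations. Returns the new exon list, or None when no new
    annotation can be built under the assumptions of this sketch.
    """
    if len(alignment) != len(best_match):
        return None  # positional pairing is not possible
    known_exons = {tuple(exon) for exons in candidates for exon in exons}
    new_exons = []
    changed = False
    for block, exon in zip(alignment, best_match):
        block, exon = tuple(block), tuple(exon)
        if block == exon:
            new_exons.append(exon)       # already a perfect match
        elif block in known_exons:
            new_exons.append(block)      # borrow the exon from a candidate
            changed = True
        else:
            return None                  # no candidate exon fits this block
    return new_exons if changed else None
```

In the situation described above, `best_match` plays the role of the blue annotation and `candidates` contains the exons of the purple one.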

Skipping small introns

The second method for new transcript discovery creates new annotations by starting with an existing annotation and removing introns smaller than a preset value (currently set to 10). The new annotation simply merges the exons that were separated by small introns. The process is shown in the figure below.
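A minimal sketch of the merging step is given below. It assumes sorted exons with inclusive coordinates and interprets the threshold of 10 as a length in bases with a strict "smaller than" comparison; the names `skip_small_introns` and `MAX_SKIPPED_INTRON` are hypothetical.

```python
MAX_SKIPPED_INTRON = 10  # threshold stated in the text; unit assumed to be bases


def skip_small_introns(exons, max_intron=MAX_SKIPPED_INTRON):
    """Merge neighbouring exons separated by an intron shorter than max_intron.

    exons is a list of (start, end) pairs, sorted by start, with
    inclusive coordinates (assumption).
    """
    if not exons:
        return []
    merged = [tuple(exons[0])]
    for start, end in exons[1:]:
        prev_start, prev_end = merged[-1]
        intron_len = start - prev_end - 1
        if intron_len < max_intron:
            merged[-1] = (prev_start, end)  # absorb the small intron
        else:
            merged.append((start, end))
    return merged
```

For example, exons (100, 200) and (206, 300) are separated by a 5-base intron and would be merged into (100, 300), while a following exon at (400, 500) stays separate.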

Results

The algorithm was tested on several datasets from our RNA benchmark paper Evaluation of tools for long read RNAseq splice-aware alignment, published in Oxford Journals Bioinformatics. We tested it on alignments obtained by the two best tools from the paper (Minimap2 and GMAP). We also tested the algorithm on the new version of our Graphmap tool (https://github.com/isovic/graphmap - in the last stage of development), tailored for mapping 3rd generation RNA reads.

All test datasets were obtained by RNA sequencing of Drosophila melanogaster.

Dataset 1 contains PacBio ROI (Reads of Insert) - 192,000 reads
Dataset 2 contains PacBio subreads - 243,000 reads
Dataset 3 contains Oxford Nanopore MinION reads, using an R9 flowcell - 40,000 reads

The results are given in the table below. The column Common gives the number of new transcripts found by all three RNA mapping tools. Although this number is relatively small, those transcripts are the most likely to be accurate.

Dataset	Graphmap	Minimap2	GMAP	Common
dataset 1	247	213	174	33
dataset 2	217	192	178	18
dataset 3	13	13	10	2
