The proposed tool enables the analysis of DNA sequences using three alignment-free techniques, combining text-based and Graph Learning approaches in one tool. We conducted an in-depth exploration, from an ML perspective, of the problem of recognizing hidden patterns that allow us to identify chimeric RNAs deriving from oncogenic gene fusions. We approach gene fusion as the chromosomal rearrangement that joins two genes into a single fusion gene, resulting in a chimeric transcript composed of two parts, each originating from one of the fused genes. We propose three distinct ML-based methods, each one based on a specific representation technique for the RNA-Seq reads: sentence-based, graph-based, and hypergraph-based. The sentence-based method leverages techniques from NLP, treating the nucleotide sequences as textual data, to extract semantic patterns from the reads. The graph-based approach advances this by employing De Bruijn graphs and Graph Neural Networks (GNNs) to capture complex topological relationships. Finally, the hypergraph-based approach introduces the use of Hypergraph Neural Networks (HGNNs), allowing us to model higher-order interactions by constructing hyperedges from maximal cliques in the De Bruijn graph. Through these progressively more sophisticated representations, we show that deeper models are better equipped to uncover hidden patterns critical for detecting chimeric reads. As the representational depth increases, so does the capacity to capture the underlying structure of the genomic data. However, this improvement comes with the need for more advanced and computationally demanding ML models, to handle the complexity of graph-based data.
To use the proposed tools, install the requirements first. For technical reasons, an Ubuntu environment is advised.
pip install -r requirements.txt
The gt-shredder tool from GenomeTools is also required:
apt install genometools
The project is tested in a Conda environment with the required packages:
conda create --name inside-gene-fusion --file requirements.txt
It is preferable to run the scripts from the Inside-Gene-Fusion folder. Remember to change the paths and the dataset name inside the scripts to match your own experiments.
The data folder contains the file download_transcript.py, which downloads the transcripts of the genes listed in genes_panel.txt.
After the genes are processed, the transcripts folder contains the fastq files of the transcripts for each gene.
cd source/gene_fusion_kmer-main/data
python3 download_transcripts.py
A synthetic dataset can be created with fusim from the gene transcripts downloaded by download_transcript.py.
To install fusim:
wget https://github.com/aebruno/fusim/raw/master/releases/fusim-0.2.2-bin.zip
unzip fusim-0.2.2-bin.zip
rm fusim-0.2.2-bin.zip
wget -O refFlat.txt.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
gunzip refFlat.txt.gz
mv refFlat.txt fusim-0.2.2/refFlat.txt
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar -xzf chromFa.tar.gz
cat chr*.fa > fusim-0.2.2/hg19.fa
rm chromFa.tar.gz
rm chr*.fa
apt install samtools
samtools faidx fusim-0.2.2/hg19.fa
To run fusim on two genes, gene1 and gene2, use the following command, replacing dir_path with your output directory:
java -jar ./fusim-0.2.2/fusim.jar \
--gene-model=./fusim-0.2.2/refFlat.txt \
--fusions=10 \
--reference=./fusim-0.2.2/hg19.fa \
--fasta-output=dir_path/fusion_gene1_gene2.fasta \
--text-output=dir_path/fusion_gene1_gene2.txt \
-1 gene1 \
-2 gene2 \
--cds-only \
--auto-correct-orientation
Run the Python script source/gene-fusion-kmer-main/data/fusim_dataset.py to generate all possible fusions between the genes listed in genes_panel.txt.
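The idea of iterating fusim over a gene panel can be sketched as follows. This is a minimal illustration, not the contents of fusim_dataset.py: the helper names `fusim_command` and `all_pair_commands` are hypothetical, the flags are copied from the command above, and the choice of unordered pairs is an assumption (the actual script may also swap the gene order).

```python
from itertools import combinations


def fusim_command(gene1, gene2, dir_path, fusim_dir="./fusim-0.2.2"):
    """Build the fusim argument list for one gene pair.

    The flags mirror the manual command shown in this README.
    """
    prefix = f"{dir_path}/fusion_{gene1}_{gene2}"
    return [
        "java", "-jar", f"{fusim_dir}/fusim.jar",
        f"--gene-model={fusim_dir}/refFlat.txt",
        "--fusions=10",
        f"--reference={fusim_dir}/hg19.fa",
        f"--fasta-output={prefix}.fasta",
        f"--text-output={prefix}.txt",
        "-1", gene1,
        "-2", gene2,
        "--cds-only",
        "--auto-correct-orientation",
    ]


def all_pair_commands(genes, dir_path):
    """Return one command per unordered gene pair (assumption: order is not varied)."""
    return [fusim_command(a, b, dir_path) for a, b in combinations(genes, 2)]


# Three genes taken from the panel; in practice the list would come from genes_panel.txt.
genes = ["RUNX1", "ETV6", "PAX5"]
for command in all_pair_commands(genes, "fusions"):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run` once fusim and the reference files are installed as described above.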
ART ILLUMINA is the tool used to obtain synthetic reads from fasta files.
To install ART ILLUMINA, install the following package:
apt install art-nextgen-simulation-tools
ART ILLUMINA takes a folder of fasta files and returns the synthetic reads.
Run the Python script source/gene-fusion-kmer-main/data/art_dataset.py to generate the reads for all the fasta files obtained from fusim.
To create a non-chimeric dataset from the downloaded transcripts for training the models, run source/gene-fusion-kmer-main/datasets/create_chimeric_no_chimeric.py.
The non-chimeric dataset is saved in source/gene-fusion-kmer-main/datasets/ with the name dataset_non_chimeric.fastq.
The chimeric dataset is composed of sequences generated by the fusim + ART pipeline and is saved in source/gene-fusion-kmer-main/datasets/ with the name dataset_chimeric.fastq.
This approach is based on a model that analyzes and classifies lists of k-mers. The reads are represented as sets of sentences composed of k-mers (sentence-based representation), so that BERT can be used to uncover the hidden semantic structures within genomic data (see the following Figure).
This sentence-based representation is in turn exploited by a DL-based model for the detection of chimeric reads, built as an ensemble of two sub-models: Gene classifier and Fusion classifier. The goal of Gene classifier is to assign each sentence to the gene from which it was generated. It is trained on all the sentences derived from non-chimeric reads extracted from the transcripts of a reference set of genes (see the following Figure). To train Fusion classifier, a set of chimeric and non-chimeric reads is generated from the same reference set of genes used for training Gene classifier.
The k-mers, i.e., all the substrings of length k that can be extracted from a DNA or RNA sequence, capture the local characteristics of the sequences while lessening the impact of sequencing errors. In this work we represent a read as the list of its k-mers. This representation allows the model to learn the local characteristics of reads and perform accurate classification.
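The k-mer and sentence representations described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `kmerize` and `sentences` are hypothetical, k=6 follows the kmers_6 folder name, and building sentences as a sliding window of n_words consecutive k-mers is an assumption about how the sentences are formed.

```python
def kmerize(sequence, k=6):
    """Return the list of all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sentences(kmers, n_words):
    """Slide a window of n_words k-mers over the list and join each window with spaces."""
    return [
        " ".join(kmers[i:i + n_words])
        for i in range(len(kmers) - n_words + 1)
    ]


read = "ACGTACGTAC"
kmers = kmerize(read, k=6)
print(kmers)
for sentence in sentences(kmers, n_words=3):
    print(sentence)
```

A read of length L yields L - k + 1 k-mers, so the 10-base read above gives 5 k-mers and, with n_words=3, 3 sentences.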
Run the following command to prepare your data for the DNABERT Gene Classifier model:
python3 source/gene_classifier_pre_process_data_filter.py
When the script finishes, the folder gene_fusion_kmer_main/data/kmers_6 contains the kmerized subsequences generated by gt_shredder, and the folder gene_fusion_kmer_main/dataset/ contains sentences.csv with the associated labels.
To fine-tune the pre-trained DNABERT model for gene classification, run the following command:
python3 source/dnabert_geneclassifier_fine_tune.py
Make sure to set the n_labels variable to the number of labels in your customized dataset.
To identify a DNA sequence as chimeric or not, a Fusion Classifier model is trained on the embedding representations of the sequences produced by the fine-tuned DNABERT Gene Classifier.
To train Fusion classifier, a set of chimeric and non-chimeric reads is generated from the same reference set of genes used for training Gene classifier. Then, for
each read all the sentences of length n_words are generated and then provided as input to Gene classifier, previously trained. Gene classifier includes an embedding
layer, as well as several classification layers. The outputs of the embedding layer for all the generated sentences are grouped into a single embedding matrix, which
constitutes the input for Fusion classifier. Then, Fusion classifier uses such embedding matrices to distinguish between reads that arise from the fusion of
two genes and reads that originate from a single gene.
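The data flow from a read to its embedding matrix can be sketched as below. This is only an illustration of the shapes involved: `toy_embedding` is a hypothetical stand-in that returns nucleotide frequencies, not the learned embedding layer of Gene classifier, and the sliding-window sentence construction with k=6 and n_words=3 is an assumption.

```python
def toy_embedding(sentence):
    """Hypothetical stand-in for the Gene classifier embedding layer.

    Returns the A/C/G/T frequencies of the sentence; a real run would use
    the output of the fine-tuned DNABERT embedding layer instead.
    """
    letters = sentence.replace(" ", "")
    return [letters.count(base) / len(letters) for base in "ACGT"]


def embedding_matrix(read, k=6, n_words=3, embed=toy_embedding):
    """Build one row per sentence of the read, in sentence order."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    sents = [
        " ".join(kmers[i:i + n_words])
        for i in range(len(kmers) - n_words + 1)
    ]
    return [embed(sentence) for sentence in sents]


matrix = embedding_matrix("ACGTACGTACGT")
print(len(matrix), "sentences x", len(matrix[0]), "dimensions")
```

The resulting matrix (one row per sentence) is the kind of per-read input that Fusion classifier receives; only the row contents differ when the real embedding layer is used.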
Run the following script to create an embedding dataset for your sequences:
python3 source/create_embedding_dataset_with_dnabert_geneclassifier.py
The Fusion Classifier can now be trained by running the following script:
python3 source/gene_fusion_dnn.py
All the trained models are saved in the source folder.
To overcome the limitations of the sentence-based approach, we employ a more advanced graph-based approach, utilizing De Bruijn graphs. In a De Bruijn graph, nodes represent k-mers, and edges indicate overlaps between consecutive k-mers. By applying GNNs, we are able to capture the complex topological dependencies between nodes through message-passing mechanisms. GNNs allow nodes to aggregate information from neighboring nodes, effectively learning intricate patterns that are essential for accurately identifying fusion events.
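The graph definition above (nodes are k-mers, edges link consecutive k-mers) can be sketched with a plain adjacency dictionary. This is a minimal illustration, not the code in pre_process_data_graph_and_hyper.py; the function name `de_bruijn_graph` is hypothetical and k=3 is used only to keep the example small.

```python
def de_bruijn_graph(read, k=6):
    """Map each k-mer (node) to the set of k-mers that directly follow it in the read."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    graph = {}
    for current, following in zip(kmers, kmers[1:]):
        graph.setdefault(current, set()).add(following)
    if kmers:
        # The last k-mer has no successor but is still a node.
        graph.setdefault(kmers[-1], set())
    return graph


graph = de_bruijn_graph("ACGTACG", k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Repeated k-mers collapse into a single node, which is why the 5 k-mers of the example read produce only 4 nodes arranged in a cycle.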
To create the De Bruijn graphs needed for the GNN experiments, run the following script:
python3 source/pre_process_data_graph_and_hyper.py
We propose a novel approach to efficiently train GNNs on DNA sequences. For each kmerized sequence, a De Bruijn graph is created. The information stored in the graph nodes is used to fine-tune a pre-trained DNABERT model, under the assumption that each De Bruijn graph represents either a chimeric or a non-chimeric sequence. As in the previous approach, the fine-tuned DNABERT model is used to extract the embedding representations of the k-mers in the nodes of the De Bruijn graph. This step is crucial to generate informative input for the GNN model and obtain a more precise classifier.
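How node embeddings are combined by a GNN can be illustrated with a single round of mean-aggregation message passing. This is a generic sketch, not the architecture in gene_fusion_graph.py: the function name `message_passing_step` is hypothetical, the two-dimensional feature vectors are toy values standing in for DNABERT k-mer embeddings, and there are no learned weights.

```python
def message_passing_step(adjacency, features):
    """Run one round of mean aggregation.

    Each node replaces its feature vector with the average of its own
    vector and those of its neighbours (a self-loop is included).
    """
    updated = {}
    for node, neighbours in adjacency.items():
        group = [features[node]] + [features[n] for n in neighbours]
        dim = len(features[node])
        updated[node] = [
            sum(vector[d] for vector in group) / len(group)
            for d in range(dim)
        ]
    return updated


# Undirected toy graph over three k-mers, with toy 2-d node features.
adjacency = {"ACG": {"CGT"}, "CGT": {"ACG", "GTA"}, "GTA": {"CGT"}}
features = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0], "GTA": [1.0, 1.0]}
print(message_passing_step(adjacency, features))
```

Stacking several such rounds lets information travel along longer paths of the graph; a trained GNN additionally applies learned transformations at each round.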
To fine-tune a pre-trained DNABERT model on chimeric and non-chimeric sequences, run:
python3 source/dnabert_fusion_fine_tune.py
Once the De Bruijn graphs have been created and the DNABERT model has been fine-tuned on chimeric and non-chimeric sequences, the GNN model can be trained by running the script:
python3 source/gene_fusion_graph.py
To further deepen the representational power, we introduce a hypergraph-based approach. Unlike traditional graphs, where edges connect only two nodes, hypergraphs allow for hyperedges that connect multiple nodes simultaneously, thus capturing multi-way interactions that are commonly observed in biological systems. This higher-order representation is particularly well-suited for modeling the complex interactions in gene fusion events. By using Hypergraph Neural Networks (HGNNs), which extend the capabilities of GNNs to hypergraph-structured data, we can extract deeper patterns and relationships, offering a more sophisticated level of analysis for detecting chimeric reads. The methodology followed for this approach is similar to that used in the graph-based approach, with the difference that in this case each read is represented with a special hypergraph, which we call De Bruijn H-graph. A crucial aspect of defining a hypergraph is establishing the rule for constructing hyperedges. In our approach, we used the maximal cliques within De Bruijn graphs to generate hyperedges, capturing the structural complexity of reads. In a De Bruijn graph, nodes represent k-mers, and edges indicate overlaps between consecutive k-mers. A clique is a subset of nodes where every pair is connected, representing complete connectivity. In this context, cliques highlight regions where k-mers overlap across multiple positions, forming continuous subsequences.
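The hyperedge rule described above (one hyperedge per maximal clique) can be sketched with a basic Bron-Kerbosch enumeration. This is an illustration under stated assumptions, not the code in gene_fusion_hypergraph.py: the function names are hypothetical, the directed graph is first made undirected with self-loops dropped, and abstract node labels replace k-mers to keep the cliques easy to see.

```python
def undirected(directed):
    """Symmetrise a directed adjacency dict and drop self-loops."""
    adjacency = {node: set() for node in directed}
    for node, successors in directed.items():
        for successor in successors:
            if successor != node:
                adjacency[node].add(successor)
                adjacency.setdefault(successor, set()).add(node)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate the maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            if current:
                cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Toy directed graph: A, B and C are mutually connected, D hangs off C.
directed = {"A": {"B", "C"}, "B": {"C"}, "C": {"D"}, "D": set()}
hyperedges = maximal_cliques(undirected(directed))
for hyperedge in sorted(hyperedges, key=sorted):
    print(sorted(hyperedge))
```

Each returned clique becomes one hyperedge, so a node that belongs to several maximal cliques (C in the example) is shared by several hyperedges.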
To identify the cliques in the De Bruijn graphs, run the script:
python3 source/gene_fusion_hypergraph.py
The fine-tuned DNABERT models can be retrieved from the following link: Fine Tuned Models
The DNABERT models are fine-tuned on the genes in the gene panel list: RUNX1, ETV6, RIPOR1, CTCF, KMT2A, EZR, PAX5, PTEN, PMEL, TAL1, DUX4, CRLF2, MEF2D, BCL9, TCF3, ZNF384, PBX1.
It is suggested to move the models into the models folder to match the paths.
All the datasets and models are saved at the end of each script, so that they can be reused to develop new scripts. The test folder contains the script to run the saved models on new ad hoc data.

