-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path09-IsolateAssembly.Rmd
More file actions
215 lines (148 loc) · 6.67 KB
/
09-IsolateAssembly.Rmd
File metadata and controls
215 lines (148 loc) · 6.67 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
# Isolate Assembly
We have been focusing on metagenomes—communities of microbial organisms. Now we are going to shift our focus to isolates. We will assemble and annotate single microbial organisms that have been isolated and cultured.
## Get example data
We will use some pathogenic E. coli sequence data isolated from birds in South Africa, sequenced with Oxford Nanopore.
First, make a directory to work in and go into that directory.
```{bash, eval=FALSE}
mkdir ~/isolate_assembly
cd ~/isolate_assembly
```
Now, get the example data from the sequence read archive, which houses publicly available sequencing reads. We will prefetch the data before we extract the sequencing reads, which makes it faster.
```{bash, eval=FALSE}
prefetch SRR32629793 --output-directory .
fasterq-dump --outdir . --outfile ecoli.fastq --threads 4 --progress ./SRR32629793/SRR32629793.sra
```
Now, let's clean up the reads and get some information about them. We'll use fastplong, which is for long reads. By default, it will remove reads with more than 40% of nts having a quality score <15. Since we don't have access to the nanopore output files, we can't use pycoQC.
```{bash, eval=FALSE}
/opt/fastplong/fastplong -i ecoli.fastq -o ecoli.trim.fq
```
We now have a file called ecoli.trim.fq. Download the .html file onto your computer and take a look at the read data before and after trimming. If you are using scp instead of gui, an example command is below.
```{bash, eval=FALSE}
scp -P 2508 <username>@inbre.ncgr.org:/home/jm/isolate_assembly/fastplong.html .
```
Find the total number of bases after filtering in the .html file. Calculate the approximate coverage by dividing this number by the E. coli genome size (~5 Mb). We want to make sure the coverage is at least 30X.
<details>
<summary>Click for Answer</summary>
```{bash, eval=FALSE}
After filtering, we have 238.209373 Mb of sequence data.
238.209373 Mb / 5 Mb = 47.642X coverage
```
</details>
## Assemble the genome
We will use flye to assemble the genome. It has some preset parameters specifically for nanopore data. We'll use the --nano-raw parameter. Newer versions of flye have a --nano-hq version for high quality reads called with the Guppy5+ SUP base callers.
More information is available at https://github.com/mikolmogorov/Flye/blob/flye/docs/USAGE.md
Activate the environment.
```{bash, eval=FALSE}
conda activate flye
```
Now run the assembly. This will take about 10 minutes.
```{bash, eval=FALSE}
flye --nano-raw ecoli.trim.fq --out-dir ecoli_flye --threads 4 --no-alt-contigs
```
As it is finishing up, it will print out some assembly statistics and the path to the final assembly. Discuss with your neighbor what you think each one means then discuss as a group.
<details>
<summary>Click for Answer</summary>
```{bash, eval=FALSE}
[2025-03-11 04:19:23] INFO: Assembly statistics:
Total length: 5369202
Fragments: 3
Fragments N50: 5120307
Largest frg: 5120307
Scaffolds: 0
Mean coverage: 43
[2025-03-11 16:05:17] INFO: Final assembly: /home/jm/isolate_assembly/ecoli_flye/assembly.fasta
```
</details>
We can get a little more information in one of the output files. Take a look and see what you understand and what questions you have.
Columns
* Contig/scaffold id
* Length
* Coverage
* Is circular, (Y)es or (N)o
* Is repetitive, (Y)es or (N)o
* Multiplicity (based on coverage)
* Alternative group (alternative haplotypes)
* Graph path (graph path corresponding to this contig/scaffold).
```{bash, eval=FALSE}
cat ecoli_flye/assembly_info.txt
```
<details>
<summary>Click for Answer</summary>
```{bash, eval=FALSE}
#seq_name length cov. circ. repeat mult. alt_group graph_path
contig_1 5120307 44 Y N 1 * 1
contig_3 161968 38 Y N 1 * 3
contig_2 86927 41 Y N 1 * 2
contig_1 is the main genome and is circular.
Contigs 2, 3 are plasmids (much smaller and also circular).
```
</details>
## Assembly Assessment
We will use checkM2, a successor to checkM, to assess the quality of our assembly. It uses machine learning to figure out what lineage each genome has (this works on metagenomic assemblies as well) and whether the genome has the complete complement of genes expected for that lineage.
More information is available here: https://github.com/chklovski/checkm2
First, activate the environment.
```{bash, eval=FALSE}
conda activate checkm2
```
Then run checkM2.
```{bash, eval=FALSE}
checkm2 predict --threads 20 --input ecoli_flye/assembly.fasta --output-directory ecoli_checkm2 --database_path /opt/checkm2/CheckM2_database/uniref100.KO.1.dmnd
```
Take a look at the report.
```{bash, eval=FALSE}
cat ecoli_checkm2/quality_report.tsv
```
<details>
<summary>Click for Answer</summary>
```{bash, eval=FALSE}
Name Completeness Contamination Completeness_Model_Used Translation_Table_Used Coding_Density Contig_N50 Average_Gene_Length Genome_Size GC_Content Total_Coding_Sequences Total_Contigs Max_Contig_Length Additional_Notes
assembly 100.0 0.47 Neural Network (Specific Model) 11 0.872 5120307 308.58413888340897 5369202 0.51 5069 35120307 None
```
</details>
## Annotation
We will use the National Center for Biotechnology Information's (NCBI's) Prokaryotic Genome Annotation Pipeline (PGAP).
More information is available here: https://www.ncbi.nlm.nih.gov/refseq/annotation_prok/process/ and here:
https://github.com/ncbi/pgap/wiki/Quick-Start
The parameters:
-r report anonymized usage data
-o output directory (can include full path)
-g genome assembly (fasta)
-s 'organism_name' (genus or genus and species)
Note: To save time and because we are having some permissions issues, we have run this for you.
**This has already been run for you so don't run it.**
```{bash, eval=FALSE}
/opt/pgap/pgap.py -r -o ecoli_annotation -g /home/jm/isolate_assembly/ecoli_flye/assembly.fasta -s 'Escherichia coli'
```
The annotation is put into a file in GFF format. More information on the GFF annotation format is here: https://useast.ensembl.org/info/website/upload/gff.html
Link to the GFF file that we ran previously and take a look at the file.
```{bash, eval=FALSE}
ln -s /opt/pgap/ecoli_annotation/ .
```
Let's count how many of each type of annotation there is in the gff file.
```{bash, eval=FALSE}
grep -v "^#" ecoli_annotation/annot.gff |awk '{print $3}' |sort|uniq -c
```
<details>
<summary>Click for Answer</summary>
```{bash, eval=FALSE}
1 antisense_RNA
24 CDS
1 direct_repeat
121 exon
4852 gene
5088 Homology
7 ncRNA
355 pseudogene
3 region
7 riboswitch
1 RNase_P_RNA
22 rRNA
7 sequence_feature
1 SRP_RNA
1 tmRNA
88 tRNA
```
</details>
<!--The annotation took about 45 minutes to run-->
<!--Need get pgap working for everyone-->
<!--Fungi? Annotation-->