Metagenomics-Bookdown/09-IsolateAssembly.Rmd at main · ncgr/Metagenomics-Bookdown · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
# Isolate Assembly

We have been focusing  on metagenomes—communities of microbial organisms. Now we are going to shift our focus to isolates. We will assemble and annotate single microbial organisms that have been isolated and cultured.

## Get example data

We will use some pathogenic E. coli sequence data isolated from birds in South Africa, sequenced with Oxford Nanopore.

First, make a directory to work in and go into that directory.

```{bash, eval=FALSE}
mkdir ~/isolate_assembly

cd ~/isolate_assembly

```

Now, get the example data from the sequence read archive, which houses publicly available sequencing reads. We will prefetch the data before we extract the sequencing reads, which makes it faster.

```{bash, eval=FALSE}
prefetch SRR32629793 --output-directory .

fasterq-dump --outdir . --outfile ecoli.fastq --threads 4 --progress ./SRR32629793/SRR32629793.sra
```

Now, let's clean up the reads and get some information about them. We'll use fastplong, which is for long reads. By default, it will remove reads with more than 40% of nts having a quality score <15. Since we don't have access to the nanopore output files, we can't use pycoQC.

```{bash, eval=FALSE}
/opt/fastplong/fastplong -i ecoli.fastq -o ecoli.trim.fq
```

We now have a file called ecoli.trim.fq. Download the .html file onto your computer and take a look at the read data before and after trimming. If you are using scp instead of gui, an example command is below.

```{bash, eval=FALSE}
scp -P 2508 <username>@inbre.ncgr.org:/home/jm/isolate_assembly/fastplong.html .

```

Find the total number of bases after filtering in the .html file. Calculate the approximate coverage by dividing this number by the E. coli genome size (~5 Mb). We want to make sure the coverage is at least 30X.

<details>
  <summary>Click for Answer</summary>
```{bash, eval=FALSE}
After filtering, we have 238.209373 Mb of sequence data.

238.209373 Mb / 5 Mb = 47.642X coverage
```
</details>

## Assemble the genome

We will use flye to assemble the genome. It has some preset parameters specifically for nanopore data. We'll use the --nano-raw parameter. Newer versions of flye have a --nano-hq version for high quality reads called with the Guppy5+ SUP base callers.

More information is available at https://github.com/mikolmogorov/Flye/blob/flye/docs/USAGE.md

Activate the environment.

```{bash, eval=FALSE}
conda activate flye
```

Now run the assembly. This will take about 10 minutes.

```{bash, eval=FALSE}
flye --nano-raw ecoli.trim.fq --out-dir ecoli_flye --threads 4 --no-alt-contigs
```

As it is finishing up, it will print out some assembly statistics and the path to the final assembly. Discuss with your neighbor what you think each one means then discuss as a group.

<details>
  <summary>Click for Answer</summary>
```{bash, eval=FALSE}
[2025-03-11 04:19:23] INFO: Assembly statistics:

	Total length:	5369202
	Fragments:	3
	Fragments N50:	5120307
	Largest frg:	5120307
	Scaffolds:	0
	Mean coverage:	43

[2025-03-11 16:05:17] INFO: Final assembly:   /home/jm/isolate_assembly/ecoli_flye/assembly.fasta
```
</details>

We can get a little more information in one of the output files. Take a look and see what you understand and what questions you have.

Columns
* Contig/scaffold id
* Length
* Coverage
* Is circular, (Y)es or (N)o
* Is repetitive, (Y)es or (N)o
* Multiplicity (based on coverage)
* Alternative group (alternative haplotypes)
* Graph path (graph path corresponding to this contig/scaffold).


```{bash, eval=FALSE}
cat ecoli_flye/assembly_info.txt

```

<details>
  <summary>Click for Answer</summary>
```{bash, eval=FALSE}
#seq_name	length	cov.	circ.	repeat	mult.	alt_group	graph_path
contig_1	5120307	44	Y	N	1	*	1
contig_3	161968	38	Y	N	1	*	3
contig_2	86927	41	Y	N	1	*	2

contig_1 is the main genome and is circular.

Contigs 2, 3 are plasmids (much smaller and also circular).
```
</details>


## Assembly Assessment

We will use checkM2, a successor to checkM, to assess the quality of our assembly. It uses machine learning to figure out what lineage each genome has (this works on metagenomic assemblies as well) and whether the genome has the complete complement of genes expected for that lineage.

More information is available here: https://github.com/chklovski/checkm2

First, activate the environment.

```{bash, eval=FALSE}
conda activate checkm2
```

Then run checkM2.

```{bash, eval=FALSE}
checkm2 predict --threads 20 --input ecoli_flye/assembly.fasta --output-directory ecoli_checkm2 --database_path /opt/checkm2/CheckM2_database/uniref100.KO.1.dmnd

```


Take a look at the report.

```{bash, eval=FALSE}
cat ecoli_checkm2/quality_report.tsv

```

<details>
  <summary>Click for Answer</summary>
```{bash, eval=FALSE}
Name	Completeness	Contamination	Completeness_Model_Used	Translation_Table_Used	Coding_Density	Contig_N50	Average_Gene_Length	Genome_Size	 GC_Content	Total_Coding_Sequences	Total_Contigs	Max_Contig_Length	Additional_Notes
assembly	100.0	0.47	Neural Network (Specific Model)	11	0.872	5120307	308.58413888340897	  5369202	0.51	5069	35120307	None
```
</details>


## Annotation

We will use the National Center for Biotechnology Information's (NCBI's) Prokaryotic Genome Annotation Pipeline (PGAP).

More information is available here: https://www.ncbi.nlm.nih.gov/refseq/annotation_prok/process/ and here:
https://github.com/ncbi/pgap/wiki/Quick-Start

The parameters:
-r report anonymized usage data
-o output directory (can include full path)
-g genome assembly (fasta)
-s 'organism_name' (genus or genus and species)

Note: To save time and because we are having some permissions issues, we have run this for you.

**This has already been run for you so don't run it.**

```{bash, eval=FALSE}
/opt/pgap/pgap.py -r -o ecoli_annotation -g /home/jm/isolate_assembly/ecoli_flye/assembly.fasta -s 'Escherichia coli'
```

The annotation is put into a file in GFF format. More information on the GFF annotation format is here: https://useast.ensembl.org/info/website/upload/gff.html

Link to the GFF file that we ran previously and take a look at the file.

```{bash, eval=FALSE}
ln -s /opt/pgap/ecoli_annotation/ .

```

Let's count how many of each type of annotation there is in the gff file.

```{bash, eval=FALSE}
grep -v "^#" ecoli_annotation/annot.gff |awk '{print $3}' |sort|uniq -c
```

<details>
  <summary>Click for Answer</summary>
```{bash, eval=FALSE}
      1 antisense_RNA
     24 CDS
      1 direct_repeat
    121 exon
   4852 gene
   5088 Homology
      7 ncRNA
    355 pseudogene
      3 region
      7 riboswitch
      1 RNase_P_RNA
     22 rRNA
      7 sequence_feature
      1 SRP_RNA
      1 tmRNA
     88 tRNA
```
</details>

<!--The annotation took about 45 minutes to run-->
<!--Need get pgap working for everyone-->
<!--Fungi? Annotation-->