@anna-parker
Collaborator

@anna-parker anna-parker commented Nov 4, 2024

This is a copy of a dataset I created here: GenSpectrum/nextclade-datasets#4

You can view the dataset by opening

https://clades.nextstrain.org/?dataset-server=https://raw.githubusercontent.com/genspectrum/nextclade-datasets/add_marburg/data

The example sequences (data/marburg/unreleased/sequences.fasta) are a subsample of 20 Marburg sequences with over 10% coverage.

I also added clades and lineages:
image

Example sequences fall with their corresponding sequence in the tree:
image

Tree built here:
https://github.com/anna-parker/marburg-virus-tree/tree/main

@rneher
Member

rneher commented Nov 4, 2024

Thanks, Anna. This looks good! The example sequences are probably too many though.

It might also be interesting to make the default ref_node the clade founder. This should work by setting default = '__clade_founder__', instead of __root__, in the dataset config in tree.json, here:

image
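
For readers following along, here is a minimal sketch of the change suggested above, applied to a tree.json that has been loaded as a Python dict. The key path `meta.extensions.nextclade.ref_nodes` and the helper name `set_default_ref_node` are assumptions made for illustration; only the `default` key and the values `__root__` and `__clade_founder__` come from this thread.

```python
import json


def set_default_ref_node(tree: dict, default: str = "__clade_founder__") -> dict:
    """Set the default reference node in a tree.json-like dict, in place."""
    # Assumed location of the setting; adjust if the dataset stores it elsewhere.
    ref_nodes = (
        tree.setdefault("meta", {})
        .setdefault("extensions", {})
        .setdefault("nextclade", {})
        .setdefault("ref_nodes", {})
    )
    ref_nodes["default"] = default
    return tree


# Toy stand-in for a real tree.json, which would be read with json.load().
tree = {
    "meta": {
        "extensions": {
            "nextclade": {"ref_nodes": {"default": "__root__", "search": []}}
        }
    }
}

set_default_ref_node(tree)
print(json.dumps(tree["meta"]["extensions"]["nextclade"]["ref_nodes"]))
```

Other keys under `ref_nodes` are left untouched, so any existing search entries survive the change.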

@ivan-aksamentov
Member

ivan-aksamentov commented Nov 4, 2024

Again, not a huge expert on this particular virus, but my usual concern is that the path is too generic:

data/community/genspectrum/marburg

It's like having a flu named just flu. Not very future-proof or clear.

Paths are immutable, global, unique identifiers of datasets, and they cannot be nested inside existing dataset paths. So once /marburg is taken, /marburg/A and /marburg/hello/world are no longer available.
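
To make the nesting rule concrete, here is a small sketch of how a candidate path could be checked against paths that are already taken. The function name `conflicting_paths` is hypothetical and this is not the actual indexing code. It also assumes the rule is symmetric, so a new path may neither sit under an existing one nor be a parent of one.

```python
def conflicting_paths(new_path: str, existing: list[str]) -> list[str]:
    """Return the existing dataset paths that a new path would clash with.

    Two paths clash when they are equal or when one is a segment-wise
    prefix of the other, i.e. one would be nested inside the other.
    """
    new_parts = new_path.strip("/").split("/")
    clashes = []
    for path in existing:
        parts = path.strip("/").split("/")
        shared = min(len(parts), len(new_parts))
        # Compare whole segments, so "marburg" does not clash with "marburgvirus".
        if parts[:shared] == new_parts[:shared]:
            clashes.append(path)
    return clashes


taken = ["community/genspectrum/marburg"]
print(conflicting_paths("community/genspectrum/marburg/A", taken))
print(conflicting_paths("community/genspectrum/marburgvirus", taken))
```

The first call reports a clash with the path that is already taken, while the second returns an empty list because the final segments differ.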

If any immediate identification or sub-classification groups come to mind, it's better to include them in the path. Or, as a lazy solution, add a path segment with the reference sequence accession.

The usual candidate keywords could be in the readme:

Orthomarburgvirus marburgense species taxon (taxonId: 3052505), members of this species are called marburgviruses. However, the species has two distinct lineages: ravn virus (RAVV) and marburg virus (MARV). Alignments use the official INSDC marburg virus reference sequence NC_001608.3.

as well as in the GenBank record:

https://www.ncbi.nlm.nih.gov/nuccore/NC_001608.3

LOCUS       NC_001608              19111 bp    cRNA    linear   VRL 13-AUG-2018
DEFINITION  Marburg marburgvirus isolate Marburg
            virus/H.sapiens-tc/KEN/1980/Mt. Elgon-Musoke, complete genome.
ACCESSION   NC_001608
VERSION     NC_001608.3
DBLINK      BioProject: [PRJNA485481](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA485481)
KEYWORDS    RefSeq.
SOURCE      Orthomarburgvirus marburgense
  ORGANISM  [Orthomarburgvirus marburgense](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3052505)
            Viruses; Riboviria; Orthornavirae; Negarnaviricota;
            Haploviricotina; Monjiviricetes; Mononegavirales; Filoviridae;
            Orthomarburgvirus.
REFERENCE   1  (bases 1 to 19111)
  AUTHORS   Enterlein,S., Volchkov,V., Weik,M., Kolesnikova,L., Volchkova,V.,
            Klenk,H.D. and Muhlberger,E.
  TITLE     Rescue of recombinant Marburg virus from cDNA is dependent on
            nucleocapsid protein VP30
  JOURNAL   J. Virol. 80 (2), 1038-1043 (2006)
   PUBMED   [16379005](https://www.ncbi.nlm.nih.gov/pubmed/16379005)
REFERENCE   2  (bases 11291 to 19109)
  AUTHORS   Muhlberger,E., Sanchez,A., Randolf,A., Will,C., Kiley,M.P.,
            Klenk,H.D. and Feldmann,H.
  TITLE     The nucleotide sequence of the L gene of Marburg virus, a
            filovirus: homologies with paramyxoviruses and rhabdoviruses
  JOURNAL   Virology 187 (2), 534-547 (1992)
   PUBMED   [1546452](https://www.ncbi.nlm.nih.gov/pubmed/1546452)
REFERENCE   3  (bases 1 to 19111)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-2007) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   4  (bases 1 to 19111)
  AUTHORS   Muhlberger,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-SEP-2005) Department of Virology, Philipps University
            Marburg, Robert-Koch-Str. 17, Marburg 35037, Germany
COMMENT     REVIEWED [REFSEQ](https://www.ncbi.nlm.nih.gov/RefSeq/): This record has been curated by NCBI staff. The
            reference sequence was derived from [DQ217792](https://www.ncbi.nlm.nih.gov/nuccore/DQ217792).

If there is, and only ever will be, one Marburg, then ignore this :)

@anna-parker
Collaborator Author

@ivan-aksamentov thanks! I renamed it to community/genspectrum/marburg/HK1980/all-lineages (H = human, K = Kenya, 1980 = collection date)

@ivan-aksamentov
Member

ivan-aksamentov commented Nov 4, 2024

I allowed myself to push the rebuild (instead of the bot :)

This way we can preview how indexing and autosuggestions work. Also, it is better to test the data in this PR rather than a copy at a remote location (the genspectrum repo), so that we test what we actually have here:

https://master.clades.nextstrain.org/?dataset-server=gh:anna-parker/nextclade_data@add-marburg@/data_output

At this point LGTM, but I'll let the science team make the final decision and merge if all is good.

@rneher
Member

rneher commented Nov 4, 2024

LGTM. Happy to see that the clade founder thing works!

@ivan-aksamentov ivan-aksamentov merged commit a140238 into nextstrain:master Nov 4, 2024