Add Sort Priority System for Gene Quality #91

@tmushayahama

Description

Related to #89

Add a sort_priority field to rank genes by annotation quality, prioritizing well-annotated genes with known GO terms over those with unknown terms or unresolved symbols.

Problem

Currently, genes are not sorted by annotation quality, which means:

  • Well-annotated genes are hard to reach quickly
  • The first pages of results are filled with genes that have unknown terms

Solution

Implement a priority system based on:

  1. Number of unknown GO terms (UNKNOWN:CC, UNKNOWN:BP, UNKNOWN:MF)
  2. Whether the gene has a resolved symbol

Priority Levels

Priority | Condition       | Description
1        | Default         | Genes with known GO terms
10       | 1 unknown term  | Contains 1 unknown GO term
20       | 2 unknown terms | Contains 2 unknown GO terms
30       | 3 unknown terms | Contains 3 unknown GO terms
50       | Unnamed gene    | named_gene: false
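
A minimal sketch of how the table above could be computed. The function name, the `go_terms` argument, and its representation as a list of term ID strings are assumptions made for illustration, not the actual implementation. The table does not say what happens when an unnamed gene also has unknown terms; this sketch assumes the unnamed-gene rule wins.

```python
# The three placeholder terms listed in the issue.
UNKNOWN_TERMS = {"UNKNOWN:CC", "UNKNOWN:BP", "UNKNOWN:MF"}

# Priority by number of unknown terms, copied from the table.
PRIORITY_BY_UNKNOWN_COUNT = {0: 1, 1: 10, 2: 20, 3: 30}

UNNAMED_GENE_PRIORITY = 50


def compute_sort_priority(go_terms, named_gene=True):
    """Return the sort_priority for one gene (lower means better annotated).

    go_terms: iterable of GO term IDs (hypothetical input shape).
    named_gene: False when the gene symbol could not be resolved.
    """
    # Assumption: an unnamed gene gets 50 regardless of its GO terms.
    if not named_gene:
        return UNNAMED_GENE_PRIORITY
    unknown_count = len(UNKNOWN_TERMS.intersection(go_terms))
    return PRIORITY_BY_UNKNOWN_COUNT[unknown_count]
```

For example, `compute_sort_priority(["GO:0005634", "UNKNOWN:BP"])` falls into the "1 unknown term" row.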

Sorting Order

Genes are sorted by the following keys, in order:

  1. sort_priority (ascending) - Best quality first
  2. coordinates_chr_num (ascending) - Chromosome number
  3. gene_symbol (ascending) - Alphabetical for deterministic ordering

Elasticsearch Query:

sort=[
    {"sort_priority": {"order": "asc"}},
    {"coordinates_chr_num.keyword": {"order": "asc"}},
    {"gene_symbol.keyword": {"order": "asc"}}
]
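
As a rough local equivalent of the query above, the same three-key ordering can be reproduced with a tuple sort key. The sample records and gene symbols are made up for illustration. The sketch also assumes `coordinates_chr_num` is stored as a string, which the `.keyword` suffix suggests; if so, ascending order is lexicographic, so "10" sorts before "2".

```python
# Made-up sample records; only the three sort fields are included.
genes = [
    {"gene_symbol": "GENE_C", "coordinates_chr_num": "2", "sort_priority": 1},
    {"gene_symbol": "GENE_A", "coordinates_chr_num": "10", "sort_priority": 1},
    {"gene_symbol": "GENE_B", "coordinates_chr_num": "1", "sort_priority": 10},
    {"gene_symbol": "GENE_D", "coordinates_chr_num": "1", "sort_priority": 1},
    {"gene_symbol": "GENE_E", "coordinates_chr_num": "1", "sort_priority": 50},
]


def sort_key(gene):
    # Same key order as the query: priority, then chromosome, then symbol.
    return (
        gene["sort_priority"],
        gene["coordinates_chr_num"],  # compared as a string, like a keyword field
        gene["gene_symbol"],
    )


ordered = sorted(genes, key=sort_key)
ordered_symbols = [gene["gene_symbol"] for gene in ordered]
print(ordered_symbols)
```

With this data, the chromosome "10" gene lands between chromosomes "1" and "2" inside the priority 1 group, which may or may not be the intended chromosome order.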

Benefits

  • Highlights high-quality, well-annotated genes
  • Consistent quality ranking across entire dataset
  • Deterministic ordering via chromosome and gene symbol

Discussion

Should genes with lower priority (10, 20, 30, 50) be sorted:

  1. At the bottom of the entire list (current implementation) - All priority 1 genes first (sorted by chromosome then gene_symbol), then all priority 10+ genes (sorted by chromosome then gene_symbol)
  2. At the bottom of each chromosome - Within each chromosome, priority 1 genes first (sorted by gene_symbol), then priority 10+ genes for that chromosome (sorted by gene_symbol)
  3. At the bottom of each gene_symbol letter - Within each letter group (A*, B*, C*, etc.), priority 1 genes first, then priority 10+ genes for that letter
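
To make the difference between the three options concrete, each one can be written as a different ordering of sort keys. This is only a sketch on made-up records. Option 3 does not say where chromosome fits, so it is left out of that key here, and grouping by the uppercased first character of `gene_symbol` is an assumption.

```python
# Made-up records: (gene_symbol, coordinates_chr_num, sort_priority).
genes = [
    {"gene_symbol": "ALPHA1", "coordinates_chr_num": "1", "sort_priority": 1},
    {"gene_symbol": "ALPHA2", "coordinates_chr_num": "2", "sort_priority": 10},
    {"gene_symbol": "BETA1", "coordinates_chr_num": "1", "sort_priority": 10},
    {"gene_symbol": "BETA2", "coordinates_chr_num": "2", "sort_priority": 1},
]


def key_option_1(g):
    # Bottom of the entire list: priority is the outermost key.
    return (g["sort_priority"], g["coordinates_chr_num"], g["gene_symbol"])


def key_option_2(g):
    # Bottom of each chromosome: chromosome first, then priority.
    return (g["coordinates_chr_num"], g["sort_priority"], g["gene_symbol"])


def key_option_3(g):
    # Bottom of each letter group: first letter, then priority (assumed grouping).
    return (g["gene_symbol"][0].upper(), g["sort_priority"], g["gene_symbol"])


def symbols_sorted_by(key):
    return [g["gene_symbol"] for g in sorted(genes, key=key)]


option_1 = symbols_sorted_by(key_option_1)
option_2 = symbols_sorted_by(key_option_2)
option_3 = symbols_sorted_by(key_option_3)
```

Options 1 and 2 only reorder the existing sort clauses. Option 3 would likely need the first letter available as its own sortable value, which is not described in this issue.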

Testing

  • Verify priority calculation for each unknown term count
  • Confirm unnamed genes get priority 50
  • Check sorting order: priority → chromosome → gene_symbol
  • Validate output JSON includes sort_priority field
  • Test Elasticsearch query returns genes in correct order
