Findings

Skill Writing Conventions

  • Standardized operator-manual structure is now the active house style for manual refinements.
  • description should describe when to use the tool, not summarize the workflow.
  • Real CLI behavior often differs from autogenerated assumptions; wrapper quirks belong in Guardrails.

Tool Behavior Findings

Bedtools

  • Many bedtools wrappers reject GNU-style --help / --version and behave better with -h.
  • bedToIgv writes the IGV batch script to stdout; -path controls snapshot directory inside IGV, not the script file destination.
  • tagBam writes tagged BAM to stdout and requires a payload mode such as -labels, -names, or -scores.
  • subtractBed -wo/-wb switch the output to diagnostic layouts; the results are not plain trimmed interval files.
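
The help-flag quirk in the first bullet above suggests trying -h before GNU-style flags when checking a wrapper. The following is a minimal Python sketch of that probing order, not a tested bedtools recipe. The names `probe_help`, `run_command`, `looks_like_help`, and `REJECTION_MARKERS` are hypothetical, and the "looks like usage text" rule is an assumed heuristic rather than anything bedtools guarantees. Exit codes are ignored on purpose, because usage text can also appear on error paths.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# A runner maps argv to (exit_code, combined stdout+stderr text).
Runner = Callable[[Sequence[str]], Tuple[int, str]]

# Assumed markers of a rejected flag; real tools word this differently.
REJECTION_MARKERS = ("unrecognized", "invalid option", "unknown option")


def run_command(argv: Sequence[str]) -> Tuple[int, str]:
    """Run argv with stdin closed and a timeout so a probe cannot hang."""
    proc = subprocess.run(
        list(argv),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=10,
        check=False,
    )
    return proc.returncode, proc.stdout


def looks_like_help(text: str) -> bool:
    """Heuristic: mentions 'usage' and does not complain about the flag."""
    lowered = text.lower()
    if "usage" not in lowered:
        return False
    return not any(marker in lowered for marker in REJECTION_MARKERS)


def probe_help(tool: str,
               flags: Sequence[str] = ("-h", "--help"),
               runner: Runner = run_command) -> Optional[str]:
    """Return the first flag that yields clean usage text, else None."""
    for flag in flags:
        _exit_code, text = runner([tool, flag])
        if looks_like_help(text):
            return flag
    return None
```

The `runner` parameter exists so the probing order can be exercised with a stub instead of a live binary. Only point the default runner at tools known to be side-effect free; the EDirect archive-* wrappers described later in this file are not safe to probe this way.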

STAR / STARlong

  • CPU-specific wrappers mostly share the same operational semantics; differences are primarily binary build/ISA related.
  • Compressed read inputs still require --readFilesCommand zcat or equivalent.
  • Genome FASTA files for index generation must remain uncompressed.
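
The two input rules above can be encoded as a small pre-flight check. This is a sketch under stated assumptions: `star_align_argv` and `check_genome_fasta` are hypothetical helper names, suffix matching stands in for real compression detection, and rejecting a mix of compressed and uncompressed reads is a conservative choice of this sketch rather than a documented STAR rule. The flags themselves (--genomeDir, --readFilesIn, --readFilesCommand) are standard STAR parameters.

```python
from pathlib import Path
from typing import List, Sequence


def star_align_argv(genome_dir: str,
                    reads: Sequence[str],
                    star_bin: str = "STAR") -> List[str]:
    """Build a STAR alignment argv, adding a decompression command for .gz reads."""
    if not reads:
        raise ValueError("at least one read file is required")
    is_gz = [Path(r).suffix == ".gz" for r in reads]
    if any(is_gz) and not all(is_gz):
        # Sketch-level policy: one --readFilesCommand applies to every read file.
        raise ValueError("mix of compressed and uncompressed read files")
    argv = [star_bin, "--genomeDir", genome_dir, "--readFilesIn", *reads]
    if all(is_gz):
        argv += ["--readFilesCommand", "zcat"]
    return argv


def check_genome_fasta(fasta_paths: Sequence[str]) -> None:
    """Raise if any genome FASTA looks compressed (judged by suffix only)."""
    compressed = [p for p in fasta_paths if Path(p).suffix in (".gz", ".bz2")]
    if compressed:
        raise ValueError(
            "genome FASTA must be uncompressed for index generation: "
            + ", ".join(compressed)
        )
```

The `star_bin` argument lets the same builder target a CPU-specific wrapper, since those share the same operational semantics.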

Legacy Helper Scripts

  • wgsim_eval.pl is a multi-command evaluator, not a single-purpose command.
  • vcfutils.pl and samtools.pl are command multiplexers with legacy workflows centered on older SAMtools/BCFtools conventions.
  • split-at-intron is an EDirect shell filter that consumes a tag/value stream from stdin, not a generic genomic interval file.
  • hmmsim studies HMM score distributions on random sequences; it is not a general biological sequence generator.

Structural Validation History

  • Bedtools family has been fully standardized and structurally validated.
  • STAR / STARlong plain and CPU-specific wrappers have been standardized and validated.
  • The recent helper-script batch has been standardized and validated.

2026-03-27

  • ViennaRNA tools were missing only the standardized ## When To Use This Tool and ## Common Patterns sections; the rest of their operator-manual structure was already stable.
  • RNAsnoop uses --help for usage output. The short option -h is a real algorithm parameter (--minimal-stem-length), so RNAsnoop -h fails with an argument error.
  • The skill folder rnapkplex maps to the executable RNAPKplex; the case mismatch is real and worth documenting to avoid false "binary not found" assumptions.
  • RNAplex is most reusable as a query-vs-target scanner with optional RNAplfold accessibility directories rather than as a generic cofolding tool.
  • RNAdos is landscape-summary oriented: it counts structures per energy band rather than enumerating individual suboptimal folds.
  • Bowtie2 direct executables consistently warn that the wrapper scripts (bowtie2, bowtie2-build, bowtie2-inspect) are preferred. Those warnings should be documented rather than treated as failures.
  • bowtie2-inspect can force large-index inspection with --large-index, while the -s and -l executables pin the small/large format explicitly.
  • HISAT2 core wrapper binaries follow the same pattern: direct execution works, but hisat2, hisat2-build, and hisat2-inspect wrapper scripts are the recommended public entry points.
  • In HISAT2, spliced alignment is on by default for aligners; fragment-length controls -I/-X become relevant only with --no-spliced-alignment.
  • hisat2-inspect can export embedded graph annotations (--snp, --ss, --ss-all, --exon), making it more than a simple FASTA/name inspector.
  • HISAT2 helper scripts mostly reject --version; hisat2_read_statistics.py defaults to sampling 10000 reads, and -n 0 means scan the whole file rather than zero reads.
  • hisat2_extract_snps_haplotypes_UCSC.py writes <base>.snp and <base>.haplotype, plus .ref.testset.fa and .alt.testset.fa when --testset is used.
  • hisat2_extract_snps_haplotypes_VCF.py writes <base>.snp and <base>.haplotype, and --extra-files additionally emits .ref plus _backbone.fa.
  • hisat2_simulate_reads.py locally emits a non-fatal Python SyntaxWarning before -h usage text, and paired-end simulation writes <base>.sam, <base>_1.fa, and <base>_2.fa.
  • Easel tools generally prefer -h; most reject --help and --version, while esl-mixdchlet is a notable exception because top-level --version works and the command is really a subcommand dispatcher.
  • esl-construct rebuild modes (-x, -r, -c, --indi, --ffreq, --fmin) all require -o, so they are edit-and-write operations rather than pure inspection.
  • esl-histplot shows a real local docs mismatch: the installed man page says the default output is a survival plot, but the live binary emits histogram-style XY output unless --surv is set.
  • esl-mask consumes a three-column seqname start end mask file; order must match the sequence file unless -R is used with an SSI index.
  • esl-selectn is line-level reservoir sampling, not sequence-record sampling.
  • esl-seqrange requires an SSI-indexed input file, and runtime testing confirms procidx is 1-based.
  • esl-shuffle has a smaller docs mismatch: live -h advertises RNA as the default alphabet in -G mode, while the man page advises choosing --rna, --dna, or --amino explicitly.
  • esl-ssdraw only draws the first alignment in a Stockholm file and cannot reuse its generated PostScript output as a fresh template.
  • esl-weight cannot currently start in this environment because libopenblas.so.0 is missing; local man-page and binary-string evidence confirm -g, -p, and -b as the documented weighting modes.
  • hmmbuild is also blocked locally by the same missing libopenblas.so.0, but the man page confirms important behavior that autogenerated docs missed: msa_file may be -, hmm_out may not, and -n only works for single-alignment input.
  • hmmconvert is a stdout-emitting format converter; -2 is a legacy HMMER2 compatibility path, while --outfmt 3/a through 3/f selects named HMMER3 ASCII revisions.
  • hmmemit has a real short-help mismatch: usage text implies a single-HMM input, but runtime testing with a multi-model library emitted one sample per model successfully.
  • hmmlogo does not render a final image. The real default output is plain text tables beginning with values such as max expected height = ... and Residue heights.
  • hmmpgmd is a master-worker daemon layer in front of phmmer, hmmsearch, and hmmscan; its sequence database input must already be in hmmpgmd format, which the man page ties back to esl-reformat.
  • The hmmpgmd-shard skill maps to the real executable hmmpgmd_shard. In shard mode, only sequence databases are sharded, and --num_shards on the master must equal the worker count.
  • Subread-family help handling is inconsistent: --help and --version are not true switches for most commands, and some tools print usage only after complaining about the invalid option.
  • subread-fullscan is a single-read diagnostic scanner, not a FASTQ aligner. Its final argument is a literal read sequence string.
  • sublong requires a full one-block Subread index and uses -v as its real version flag; --help and --version print usage text but still end as unrecognized-option paths.
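
The esl-selectn note in this list (line-level reservoir sampling, not sequence-record sampling) is easier to see with a concrete sketch. The code below is textbook reservoir sampling (Algorithm R) written in Python for illustration. It is not esl-selectn's source, and the function name `reservoir_sample_lines` and its `seed` parameter are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample_lines(lines: Iterable[str],
                           n: int,
                           seed: Optional[int] = None) -> List[str]:
    """Pick n lines uniformly at random in one pass (Algorithm R).

    Every line is treated as an independent item, so a FASTA header and
    its sequence line are sampled separately.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < n:
            # Fill the reservoir with the first n lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = line
    return reservoir


if __name__ == "__main__":
    fasta_lines = [">a", "ACGT", ">b", "GGCC", ">c", "TTAA"]
    # The sample may contain headers without their sequences and vice versa.
    print(reservoir_sample_lines(fasta_lines, 3, seed=1))
```

Because each line is an independent item, sampling a FASTA file this way can return a header without its sequence. Any workflow that needs whole records has to select at the record level instead.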

2026-03-28

  • blast2sam.pl is not a general BLAST formatter. The bundled POD and source show it is specifically a parser for legacy default-format blastn text output.
  • blast2sam.pl --help works only because Perl Getopt::Std injects generic help/version handling; -help is parsed as -h -e -l -p and fails.
  • blast2sam.pl -s prints the aligned query sequence, not necessarily the original raw read sequence, and -d emits dummy I quality characters so downstream SAM consumers can tolerate the record.
  • blast2sam.pl emits headerless SAM and silently drops unaligned queries instead of outputting unmapped SAM rows.
  • tblastn_vdb follows normal BLAST+ single-dash controls: -help and -version work, while prior autogenerated --help / --version captures were wrong for the live binary.
  • Bare tblastn_vdb invocation fails with Must specify at least one SRA/WGS database, making a missing -db the first diagnostic check.
  • tblastn_vdb -db expects an SRA or WGS source name rather than a standard local BLAST database path.
  • tblastn_vdb -sra_mode selects materially different search universes: 0 searches unaligned reads, 1 searches aligned reference sequences, and 2 searches both.
  • ace2sam writes the SAM body to stdout but emits header text on stderr, and its ACE parser expects strict block ordering plus matching AF/RD read order.
  • bowtie2sam.pl has no true CLI help: --help and --version are treated as filenames. It also chooses a single best Bowtie hit per read-name block rather than emitting all alignments.
  • export2sam.pl is one of the few legacy helper scripts in this cluster with real long-option help. --read1 is mandatory, --nofilter keeps failed-purity reads, and --qlogodds is only for pre-1.3 Solexa-style quality encoding.
  • maq2sam-long and maq2sam-short expose almost no self-description beyond Usage: maq2sam <in.map> [<readGroup>]; --help and --version are just treated as filenames.
  • novo2sam.pl skips comment/QC/NM-style lines and silently drops alignments whose status is not U, so it is not a lossless Novoalign-to-SAM converter.
  • psl2sam.pl uses -a/-b/-q/-r only for computing AS:i and explicitly does not emit N CIGAR operators for intron-like gaps.
  • soap2sam.pl and zoom2sam.pl use -p only to interpret mate relationships; they assume mate adjacency in the input rather than repairing arbitrary reordering.
  • zoom2sam.pl requires a manual read-length argument and emits * for sequence and quality fields because that information is not recovered from the handled Zoom format.
  • interpolate_sam.pl is not a generic SAM interpolator; it builds an interpolated per-base count track from a sorted SAM file and assumes simple M/I/D CIGARs plus an older MAQ-style reference naming convention.
  • sam2vcf.pl expects legacy samtools pileup -c input on stdin and emits VCFv3.3; --help works but --version does not.
  • disambiguate-nucleotides is a stdin/stdout shell filter that expands IUPAC ambiguity codes, uppercases output, and silently waits for input instead of providing built-in help text.
  • Most remaining ViennaRNA executables in this batch use the modern --help / --version interface cleanly, unlike many older samtools helper scripts.
  • RNAcofold and RNAmultifold both treat & as the strand separator and continue reading batch input until a single @ line or EOF.
  • RNAconsensus is a shell wrapper around a Python implementation. --help works, but --version is not implemented and errors out.
  • RNAconsensus hardcons can consume either RNAalifold stdout or an RNAalifold dot plot, then emit per-sequence constraints suitable for piping into RNAfold -C.
  • RNAdistance reads structures from stdin, and -B[=file] writes an aligned backtrack of matching substructures.
  • RNApvmin cannot start in this environment because libopenblas.so.0 is missing. Binary strings still reveal the real usage: it reads the sequence from stdin and a SHAPE file as the positional argument.
  • RNApvmin expects SHAPE input lines in the form [position] [nucleotide] [absolute shape reactivity] and writes the resulting perturbation vector to stdout while optimization progress goes to stderr.
  • kinwalker --help works, but --version is unrecognized and falls back to usage text.
  • EDirect archive-* wrappers are not safe to probe with --help or --version; those switches usually still trigger the real setup / refresh path and then fail on missing helpers or EDIRECT_LOCAL_ARCHIVE.
  • archive-nihocc downloads the NIH Open Citation Collection zip and warns the transfer can take hours; daily rebuilds index/invert layers, while -index additionally merges and posts.
  • archive-nlmnlp builds a local PubMed concept archive from PubTator Central plus GeneRIF / gene metadata files, and it also needs a local go toolchain.
  • archive-nmcds is a full RefSeq NM CDS archiver that creates a master accession list and CDS offset table; its cleanup flags are hierarchical, with -zap removing the deepest layers.
  • archive-pids builds local PMCID postings from PubMed metadata; it is not a simple one-shot identifier extractor.
  • archive-pmc and archive-pubmed have real operational submodes like -missing, -verify, and -index; they should be documented as distinct maintenance paths rather than as generic “download archive” commands.
  • archive-pubmed changes index content when -stem or -strict is used, and mixed-year local state can force a cleanup path before reindexing.
  • archive-taxonomy also requires a local go compiler and uses the same destructive cleanup ladder (-clean through -zap) seen in other archive wrappers.
  • combine-uid-lists is just sort -nu "$@": it always returns a numeric-sorted union with duplicates removed.
  • difference-uid-lists computes the symmetric difference (FILE1 △ FILE2), not a directional subtraction and not an overlap.
  • exclude-uid-lists computes FILE1 - FILE2, while intersect-uid-lists computes the shared set only.
  • The UID-list helpers sort their inputs internally, so pre-sorting is optional, but original record order is lost.
  • difference-uid-lists, exclude-uid-lists, and intersect-uid-lists have no clean help/version interface; passing --help or --version leaks through to sort / comm and can still emit missing-file noise.
  • Many EDirect converters are only thin shell wrappers around transmute -?2x; using the wrapper by absolute path is not sufficient unless the rest of the EDirect bin directory is also on PATH.
  • Those thin converter wrappers usually have no meaningful --help / --version output. In the current environment, most emit nothing; json2xml is a special case that converts --help / --version into literal XML tags.
  • fsa2xml emits one <FASTA> block per FASTA record, not a single enclosing document for the whole file.
  • ini2xml, toml2xml, and yaml2xml currently emit a <ConfigFile> root for simple mappings.
  • csv2xml and tbl2xml are also transmute wrappers, but in trivial smoke tests they produced no output instead of a friendly schema error, so representative-input validation matters.
  • jsonl2xml is a line loop: every JSONL line is independently converted to its own <root>...</root> fragment, so the combined stream is not a single well-formed XML document unless the caller wraps it.
  • gff2xml is not a simple format toggle; it pipelines tbl2xml, xtract, and transmute, splitting the semicolon-delimited GFF Attributes field into nested XML tags.
  • xml2fsa and xml2tbl are fixed xtract recipes over INSDSeq XML and do not pass through positional filenames, so stdin piping is the reliable invocation pattern.
  • xml2fsa builds FASTA headers from the first available accession / id / locus field plus the definition text.
  • xml2tbl produces a feature-centric table beginning with >Feature <accession> lines followed by interval and qualifier rows.
  • xml2json is a Perl stdin-only converter based on XML::Simple and JSON::PP; in the current environment it cannot start because XML::Simple.pm is missing.
  • fill-aa supports --help but not --version; it can annotate only selected event classes via -t, expects sorted VCF input, and falls back from the raw -a path to <prefix><chrom>.fa.gz.
  • fill-an-ac recomputes both AC and AN from genotype columns and hard-codes diploid counting with recalc_ac_an(2).
  • fill-fs annotates only against the first ALT allele at multiallelic sites, and the -m mask-character switch only affects the next -b, -v, or -c target that follows it on the command line.
  • fill-ref-md5 depends on tabix, samtools faidx, and md5sum; -d alone only works if the dictionary already covers all chromosomes needed by the indexed VCF.
  • filter-columns is just a tab-delimited awk predicate wrapper; the expression must be passed as one quoted shell argument, and it injects YR / DT variables automatically.
  • filter-genbank and filter-record are thin transmute wrappers whose actual filtering rules live inside transmute; both reject --help / --version as unrecognized arguments.
  • filter-stop-words is line-oriented and drops stop words entirely by default; -plus emits + placeholders instead. The first command-line token is always consumed, so arbitrary replacement arguments are awkward.
  • extract_exons.py outputs unique zero-based exon intervals and merges neighboring exons separated by 5 bp or less before printing.
  • extract_splice_sites.py uses the same exon-merging rule, then emits unique zero-based splice junction boundaries; with -v it prints transcript/exon/intron summary stats to stderr.
  • gbf2facds has only two real modes: nucleotide CDS (-na, default) and translated protein (-aa). Both reject --help / --version, and the FASTA headers are metadata-rich rather than bare accessions.
  • gbf2fsa is just gbf2xml | xml2fsa, so its behavior and dependencies inherit both wrappers.
  • gbf2info emits structured GenBankInfo XML with <info>, <feature>, and <sequence> sections, and internally remaps problematic feature names (for example 3'UTR -> 3_UTR) before wrapping them as XML tags.
  • gbf2ref is a thin transmute -g2r wrapper; in the current build --help / --version fall through to the generic “Unable to create GenBank reference indexer” error.
  • gbf2tbl is gbf2xml | xml2tbl, so it emits the same >Feature <accession> table structure as xml2tbl, just starting from GenBank flatfile input.
  • gff-sort is a real EDirect pipeline, not a self-contained sorter. It strips all comment/directive lines, depends on tbl2xml, xtract, transmute, and sort-table being on PATH, and hard-codes a feature priority of gene/pseudogene -> RNA-like -> CDS -> exon / intron -> everything else.
  • gff2gff is a stdin/stdout bcftools helper that appends missing ID, biotype, and Name fields when it can infer them from gene_id, gene_type, gene_name, transcript_id, and transcript_type. Even successful runs print a "Fixed N records" summary to stderr.
  • gff2gff.py is a much more brittle legacy converter than autogenerated docs implied: it requires gffutils just to start, takes a scratch database path as its second positional argument, writes converted GFF3 to stdout, skips ncRNA groups, and assumes attributes like Name and locus_tag.
  • flattenGTF is a Subread binary that falls through to usage text when probed with --help or --version. It writes SAF to disk, defaults to -t exon -g gene_id, and -C keeps exon edges while still producing non-overlapping output.
  • fuse-ranges expects four tab-delimited columns where column 3 is strand and column 4 is a comma-separated start..end list. It merges adjacent intervals too, silently drops rows whose first field does not begin with [1-9], and can emit a bogus 0 0 1 sentinel on empty/non-matching input.
  • fuse-segments behaves similarly for simple start/end tables: it normalizes reversed coordinates, merges adjacent segments, ignores everything after column 2, still needs sort-table on PATH, and shares the same bogus 0 0 1 empty-input behavior.
  • find-in-gene is really find-in-gene <strand> <min> <max> over stdin GENE XML, despite the shell wrapper only checking for two arguments and printing a misleading "must have start and stop position" error. A fourth argument is accepted but unused.
  • accn-at-a-time is just a lowercase text tokenizer: it splits on anything outside [A-Za-z0-9_.] and does not validate that the resulting tokens are real accessions.
  • align-columns is tab-delimited pretty-printing around transmute -align; -help works, but -version is noisy because it shells out to einfo -version, which does not return a clean version string here.
  • between-two-genes is a local awk block slicer, not a gene-lookup tool. Its arguments are regex patterns, output is inclusive of the boundary rows, and if the second boundary is missing it prints from the first match to EOF.
  • expand-current is an operational local-archive rebuild script. It deletes previous derived index files, requires EDIRECT_LOCAL_ARCHIVE plus helpers like pm-collect, and can still stumble through a broken environment while exiting 0.
  • gene2range takes a chromosome name and turns DocumentSummary XML into sorted GENE XML. It emits XML, not TSV, and still depends on xtract, sort-table, and tbl2xml being on PATH.
  • join-into-groups-of defaults to batches of 10000 and emits comma-joined lines. Because it is xargs-based, embedded whitespace inside identifiers is destructive.
  • just-top-hits does not rank by score. It simply keeps the first N first-column groups from an already grouped table, preserving every row inside those retained groups.
  • amino-acid-composition is line-based and not FASTA-aware. It will happily count letters from FASTA headers unless those lines are stripped first.
  • annot-tsv uses --help for usage. The short flag -h is a real option for header-row specification and therefore errors without an argument.
  • asn2ref is a compact xtract recipe over Seq-entry citation content. It emits citation XML blocks and normalizes page ranges down to the first page value.
  • cit2pmid defaults to remote matching, supports explicit modes (-eutils, -local, -exact, -verify), treats repeated -author fields as first and last author, truncates page ranges to the first page, and does not implement -help / -version as metadata flags.
  • download-ncbi-software only special-cases magic-blast, datasets / dataformat, and sra-toolkit; it also depends on nquire, and in the current Linux x86_64 branch the sra-toolkit case reaches an empty suffix path that effectively no-ops while still exiting successfully.
  • download-pmc is a bulk operational downloader, not a metadata probe. It defaults to FTP, can switch to HTTPS, iterates both baseline and incr trees across oa_comm, oa_noncomm, and oa_other, and deletes tarballs that fail XML verification after retries.
  • exact-snp wraps the real executable exactSNP; no-argument invocation prints usage, -v is the real version flag, and output is VCF even though some legacy examples still suggest .txt.
  • fasta-sanitize.pl handles both FASTA and FASTQ, sanitizes only the first whitespace-delimited token in each header, and prints rename messages only when a record name actually changes.
  • get_species_taxids.sh uses underscores in the real executable name, checks for esearch, efetch, and esummary before normal usage handling, and treats -t and -n as mutually exclusive output modes.
  • gm2ranges expects at least five whitespace-delimited columns and preserves minus-strand segments as descending stop..start strings, making it a normalization precursor rather than a final interval file.
  • gm2segs depends on xtract, print-columns, sort-table, and fuse-segments being on PATH, filters the label BLASTN - mrna exactly, emits RAW / PLS / MNS / CMB report blocks, and can inherit bogus 0 0 1 sentinels from fuse-segments on empty strand partitions.
  • pair-at-a-time is a text-stream bigram helper, not a sequencing read-pair tool. It lowercases text, strips punctuation, collapses consecutive duplicates via uniq, and can fail by absolute path if word-at-a-time is not also on PATH.
  • color-chrs.pl writes a single <prefix>.svg, requires -p, and hard-codes human chromosomes 1..22 plus X; it also treats any existing path on the command line as an input file before option parsing.
  • pma2apa converts PubmedArticle XML to either a tab-prefixed APA citation line or a structured <APASet> XML record. --help is not implemented, and -ascii can be combined with either text or XML output.
  • pma2pme converts PubmedArticle XML to Pubmed-entry ASN.1 text by default, with -xml exposing the intermediate XML and -std switching away from the default compact Medline-style author representation.
  • nhance.sh only special-cases four shortcut flags (-pathway, -gene-to-pathway, -litvar, -citmatch). In the current environment, active shortcut calls fail immediately with Escape: command not found, while --help or plain invocation can exit silently with no output.
  • print-columns is just an awk "{print ...}" wrapper over tab-delimited stdin. The expression must be shell-quoted, YR and DT are injected automatically, and --help with no stdin produces no useful output.
  • print-missing-subranges assumes ascending one-column integers, reports gaps as inclusive start-end spans, anchors the sequence at 1, and does not infer any terminal upper bound after the last observed value.
  • quote-grouped-elements drops blank lines and turns every literal space into a quoted comma separator; it does not escape embedded quotes and is only CSV-like, not a full CSV serializer.
  • qualfa2fq.pl requires exactly two positional arguments, auto-decompresses only by .gz suffix, does not verify matching FASTA / QUAL record IDs, converts qualities with score + 33, and wraps only the quality string at 60 characters.
  • qualityScores is the real executable behind the quality-scores skill. -h / --help are treated as invalid or unrecognized options before usage text is shown, -i and -o are mandatory, the default sample size is 10000, and output is one comma-separated quality vector per sampled read.
  • guess-ploidy.py is a pure plotting wrapper around bcftools +guess-ploidy -v. It requires exactly two positional arguments, skips comment lines, uses only SEX rows, and writes a static PNG through Matplotlib's Agg backend.
  • hgvs2spdi does not read raw HGVS strings directly; it expects HGVS XML on stdin and optionally a positional transform table file with accession-to-offset mappings. Without that file, it performs live efetch / gbf2xml lookups to derive CDS-start offsets.
  • ds2pme is the docsum analogue of pma2pme: it converts PubMed DocumentSummary XML to Pubmed-entry ASN.1 by default and exposes the intermediate XML with -xml.
  • bsmp2info emits compact XML, not TSV. It lowercases harmonized_name attributes into element names, collapses multiple links into one pipe-delimited Link field, and live BioSample fetches can hit NCBI 429 rate limits.
  • genRandomReads is the real executable behind the gen-random-reads skill. --help is reported as an unrecognized option before usage text, omitting --totalReads defaults to one million reads with a warning, and the tool has a distinct --summarizeFasta mode for transcript-length inventory.
  • ct2db has a clean modern help interface (-h, --help, -V) and writes extended FASTA with dot-bracket strings to stdout; --no-pk removes pseudoknots and --no-modified replaces modified bases with N.
  • datatool is a large NCBI CLI with required -m moduleFile for real work, detailed -help output, and single-dash long options such as -version; the local binary reports datatool: 2.24.0 from the BLAST 2.17.0 package build.
  • popt is a ViennaRNA post-filter for RNAsubopt -s output. The local binary does not expose a normal help flag, but its embedded usage string explicitly states RNAsubopt -s < seq | popt.
  • clustalw2 uses legacy uppercase option conventions (-HELP, -INFILE=..., -ALIGN, -TREE) and a bare invocation drops into an interactive menu rather than exiting with usage. -h and --help are both wrong.
  • blst2gm is a stdin xtract wrapper, not a standalone BLAST parser. It errors cleanly on empty stdin, filters only annotations labeled BLASTN - mrna, and emits compact tab-delimited rows with pipe-joined multi-value fields.
  • blst2tkns is another tiny xtract recipe, this time over Seq-align-set_E XML. It is not a generic BLAST text/tabular converter, does not expose real help, requires xtract on PATH, and with dependencies available but no input it fails with No data supplied to xtract from stdin or file.
  • ecommon.sh is a shared EDirect shell library, not a meaningful standalone CLI. Running it directly or with -version is silent, while sourcing it exposes version 24.0 and functions such as ParseCommonArgs, RunWithLogging, and WriteEDirect.
  • ecollect is a UID-source normalizer and PubMed SOLR workaround wrapper, not a record fetcher. -help is unrecognized, -db is mandatory, -count / -subset are PubMed-specific query modes, and final UID output is deduplicated with sort -n | uniq so original order is lost.
  • pmc2info converts PMC <article> XML to normalized PMCInfo XML, including an internal mapping of common section titles such as introduction, results, discussion, and methods. It has no real help mode and depends on both xtract and transmute being on PATH.
  • pmc2bioc converts PMC <article> XML to BioC collection XML, and the source explicitly labels it WORK IN PROGRESS. Like pmc2info, it lacks real help behavior and immediately fails if xtract / transmute are unavailable.
  • nquire is the low-level EDirect transport layer, not just an EUtils helper. -h / --help print real usage text, -version returns 24.0, direct GET requests to EUtils work, but an FTP listing smoke test against ftp.ncbi.nlm.nih.gov failed here with curl: (56) response reading failed.
  • AnalyseDists is the real executable behind the analyse-dists skill, and its embedded usage string even misspells itself as AnalyseDist. -h, --help, and --version all route to the same usage error, and a tiny 2 x 2 stdin matrix smoke test simply echoed the matrix back, even with -Xn.
  • alimask cannot currently start because libopenblas.so.0 is missing, so its skill has to rely on binary-string evidence. Those strings show that it requires one of --modelrange, --alirange, --model2ali, or --ali2model, that the mapping modes take no postmsa argument, that stdin input requires --informat, and that --hand needs an RF line.
  • AnalyseSeqs is the real executable behind analyse-seqs. -h, --help, and --version all print the same usage banner, the installed man page confirms stdin-driven equal-length sequence blocks terminated by @ or %, and live four-sequence runs showed -Xb, -Xn, and -Xw creating demo_box.ps, demo_nj.ps, and demo_wards.ps respectively. A tiny two-sequence smoke test still exited 0 with no stdout or sidecar output, and the man page explicitly warns that only Hamming distance is well tested.
  • b2ct has no usable help path in this environment: bare b2ct and b2ct -h were silent. The confirmed happy path is stdin, for example printf '>test\nAAAA\n.... (0.00)\n' | b2ct, which emitted CT rows to stdout beginning with 4 ENERGY = 0.0 test. In contrast, positional file invocation such as b2ct fold.out exited 0 but produced no stdout and no sidecar file in local smoke tests. An invalid sample with mismatched sequence and structure lengths emitted sequence and structure have unequal length, and binary strings reveal a separate unbalanced brackets error path.
  • md5fa is not a whole-file checksum tool. It emits one digest per FASTA record plus aggregate >ordered and >unordered lines. Live reordered-FASTA tests showed that swapping record order changes >ordered but leaves >unordered unchanged. md5fa -h is treated as a filename, and an empty FASTA emitted the normal empty MD5 for >ordered but an all-zero digest for >unordered.
  • md5sum-lite is a minimal HTSlib-backed checksum helper that hashes files or stdin and prints digest target, using - for stdin. md5sum-lite -h is treated as a filename, error messages are prefixed with md5sum:, and no real help, version, or GNU-style verification mode was evidenced from live tests or binary strings.
  • plot-ampliconstats is a real Perl script with built-in usage and a hard dependency on gnuplot 5.0+. It reads stdin if FILE is omitted, uses the first positional argument as a filename prefix, and source inspection shows it generates many prefix-*.gp, prefix-*.png, and index.html outputs. In this environment, a minimal smoke test failed immediately because gnuplot was not installed.
  • plot-bamstats parses samtools stats output (not raw BAM and not flagstat) and has real options such as -p, -m, -s, and -r in the script source, but local execution is currently blocked before help or plotting because Perl cannot load URI::Escape.pm.
  • plot-vcfstats accepts true bcftools stats .vchk files, not ad hoc approximations. A real one-record smoke test generated plot.py, plot-vcfstats.log, PNG panels, and .dat files, while a fake hand-written input failed the script's sanity check. Adding -P skipped the PDF stage cleanly; without it, local runs failed because pdflatex / tectonic were missing.
  • propmapped does not appear to emit useful stdout by default. In local tests, output was captured only with -o and took the form path,total,mapped,fraction. -f switches to fragment counting, -p restricts to properly paired fragments, and --help / --version are both unrecognized even though the binary still prints propMapped v2.1.1.
  • rchive is a shell dispatcher, not the real binary. It locates a platform-specific executable such as rchive.Linux. --help printed the built-in archive/index help in this environment, -version printed 24.0, and --version instead fell through to a no-input error path.
  • ref-cache is an HTSlib CRAM reference-caching proxy with a clean -h path and a useful installed man page. It requires -d, uses short options, defaults to the EBI upstream MD5 service unless -U is set, and the man page states that it exits silently if another instance is already running on the chosen port.
  • ref2pmid is just a one-line wrapper around transmute -r2p "$@". It has no standalone help/version behavior; calling it without stdin simply fails inside transmute.
  • refseq-nm-cds is a heavyweight operational script, not a query helper. Source inspection shows it defaults to human if no species is given, supports aliases such as man, mice, and fish, downloads many *.rna.gbff.gz files through nquire, and writes species-specific outputs like human_cds.txt. Unsupported arguments such as --help and --version are treated as species names and trigger a noisy shell error from a stray break.
  • reorder-columns is a tiny tab-only awk wrapper. It has no help path, uses 1-based positional column numbers, and can duplicate columns because it simply expands $N expressions into an awk print list.
  • repair always writes BAM, can accept SAM with -S, and adds dummy mate records for singleton/unpaired reads unless -d is supplied. Local tests showed dummy mates with sequence N and quality A, while -d suppressed them completely. -h / --help are invalid-option paths that still print the usage banner repair Version 2.1.1.
  • easel is a dispatcher-style HMMER/Easel front end, but in this environment the binary cannot start because libopenblas.so.0 is missing. readelf confirms dependencies on libgsl.so.25, libopenblas.so.0, and libmpi.so.40, while binary strings still expose the intended top-level interface: easel -h, easel --version, easel <cmd> -h, and easel <cmd> [<args>...].
  • plot-roh.py does not accept raw bcftools roh output by itself. The source requires gzipped *.txt.gz files containing both GT rows and eight-column RG rows, and explicitly says extra bcftools query genotype lines may be needed. A minimal mixed GT/RG sample produced a 3000 x 150 PNG locally, while an RG-only sample crashed with IndexError at row[7].
  • removeDup is a thresholded location-purge tool, not a duplicate marker. With three reads at one locus and -r 2, local testing removed all three reads. -h and --version are invalid-option paths that still dump the usage banner removeDup Version 2.1.1, and BAM remains the default output unless -S is set.
  • run-ncbi-converter is a Perl bootstrap wrapper for downloadable NCBI converter binaries. It hardcodes ftp.ncbi.nlm.nih.gov, caches into ~/.cache/ncbi-converters unless NCBI_CONVERTER_DIR is set, and treats the first positional argument as the converter basename used to construct <name>.<platform>.gz. There is no safe local help/version path because even -h immediately attempts FTP access; in this workspace that failed with Unable to connect to FTP server: Bad file descriptor.
  • run-roh.pl is more than a simple loop over bcftools roh. It normalizes chromosome names, optionally annotates AF1KG frequencies, writes per-input .bcf intermediates plus .txt.gz and .log outputs, appends genotype rows via bcftools query, and then merges ROH presence into outdir/merged.txt. --version is not implemented.
  • skip-if-file-exists is a newline-delimited path filter, not a conditional command launcher. It simply echoes paths whose -f test is false. Existing regular files are silently suppressed, while directories still pass through because the script checks only -f.
  • snp2hgvs is a tiny xtract | transmute wrapper over dbSNP docsum XML, not a free-form SNP string converter. With real docsum input for rs104894914 it emitted structured <HGVS> XML containing multiple variant representations, including genomic and coding forms.
  • snp2tbl is only three lines of shell, but the composition matters: snp2hgvs | hgvs2spdi "$@" | spdi2tbl. Its -h behavior is unsafe because argv is forwarded downstream, which locally triggered cat: invalid option -- 'h' before the no-input xtract failure.
  • sort-by-length is just a Perl line sorter: print sort { length($a) <=> length($b) } <>. It does not understand FASTA records or other multi-line biological structures.
  • sort-table is a very thin wrapper around GNU sort with a forced tab delimiter and an unconditional grep '.' prefilter, so blank lines are always dropped. -h is passed through as a human-numeric sort flag, not as help.
  • sort-uniq-count does not require pre-sorted input because it sorts internally. The wrapper always performs case-insensitive grouping via uniq -i -c, defaults to sort -f, and rewrites the result as count<TAB>value.
  • sort-uniq-count-rank adds a final sort -k 1,1nr -k "2$flags" stage on top of sort-uniq-count, so counts are always ranked descending first. Its apparent help/version flags are unsafe because argv is repurposed into compact sort-flag letters.
  • spdi2tbl is a wrapper over xtract on <SPDI> XML, followed by sort-table, cut, and uniq. It emits 8-column rows shaped like rsid accession position deleted inserted class type gene, and its class ordering is explicitly normalized as Genomic, Coding, Protein.
  • tbl2prod is not an NCBI feature-table converter despite its name. It expects spdi2tbl-style 8-column rows, skips genomic variants, fetches the relevant nucleotide or protein record from NCBI, and emits reference (:+) plus altered product rows as a 3-column table after sorting/cutting away the protein ID.
  • test-edirect is a long-running smoke/demo harness for the full EDirect stack. With no arguments it prints EDirect 24.0, platform info, and many titled example sections such as INFO HELP, FIELD EXAMPLE, and LINK EXAMPLE. -test is a special traced pipeline mode, while -h / --version are unrecognized.
  • test-eutils is a smaller endpoint-health checker for E-utilities. It has real help text but no working --version, and -alive emits a mode header followed by progress dots or failure markers. In a bounded local run, test-eutils -alive produced .... before the external timeout fired.
  • test-pcre maps to the real executable test_pcre, which is the standard pcre2test CLI. -help and -version both work, and a minimal stdin script /abc/ plus subject abc produced a successful 0: abc match.
  • test-pmc-index has no argument parser at all. It always runs a random-ID PMC title roundtrip using xfetch -db pmc and xsearch -db pmc -title. Without EDIRECT_LOCAL_ARCHIVE, it still stumbles into rchive-level errors after printing the missing-environment warning.
  • test-pubmed-index is even more environment-sensitive: it sources xcommon.sh, looks for archive/postings/data folders, exercises xfetch, xsearch, xinfo, cit2pmid -local, and a local meshconv.xml-based MeSH climb. With EDIRECT_LOCAL_ARCHIVE unset, it emits a noisy mix of repeated path errors rather than stopping cleanly.
  • word-at-a-time is just sed 's/[^a-zA-Z0-9]/ /g; s/^ *//' | tr 'A-Z' 'a-z' | fmt -w 1. It strips punctuation/underscores, lowercases everything, and emits one token per line.
  • xcommon.sh is a shared implementation library for the local-archive x* tools, not a meaningful standalone command. Key functions include FindArchiveFolder, FindPostingsFolder, FindDataFolder, ParseStdin, and GetUIDs, all of which shape the behavior of sibling wrappers.
  • xfetch is a local archive retrieval wrapper over rchive -fetch / rchive -stream, not a remote efetch clone. Without EDIRECT_LOCAL_ARCHIVE, it can still print outer XML wrappers before failing inside rchive.
  • xfilter is a local postings query helper. It tokenizes incoming UIDs with word-at-a-time and then calls rchive -query; it is not a remote efilter replacement.
  • xinfo is a local postings inspector. In particular, -fields literally lists directories inside the postings folder. With EDIRECT_LOCAL_ARCHIVE unset, the failed cd "$postingsBase" can accidentally fall through and list the current working directory.
  • xsearch is the local-search counterpart in the same stack. -query wraps hits in ENTREZ_DIRECT XML unless -raw is set, -match / -exact / -title delegate directly to rchive, -words / -pairs tokenize via word-at-a-time plus filter-stop-words, and omitting -db defaults to pubmed.
  • xlink is a local link resolver over rchive -link, not a remote Entrez linker. It accepts UIDs from stdin, -id, -input, or an upstream ENTREZ_DIRECT message; -target is mandatory; and the current xlink.ini only defines [pubmed] CITED=pubmed, CITES=pubmed, and PMCID=pmc.
  • xa2multi.pl is a minimal Perl SAM filter for BWA XA:Z tags. It prints the original line unchanged, emits one secondary SAM record per alternate hit, copies the mismatch count into NM:i, reverse-complements sequence/qualities if orientation flips, and explicitly leaves TLEN/ISIZE uncomputed.
  • uniq-table is a column-pruning AWK script, not a row deduplicator. It uses row 2 as the baseline, marks a column as interesting only when some later row differs, and therefore removes columns that are invariant from row 2 onward. Running uniq-table -help just exposes generic gawk help.
  • run_with_lock is a compiled NCBI locking helper, but the local installation is incomplete: bare invocation fails with Unable to exec get_lock. Binary strings and official NCBI source still expose the real option surface: -base, -getter, -log, -map, -reviewer, and a standalone ! marker that suppresses exit-status propagation. --help and --version are merely reported as unsupported options.
  • seq_cache_populate.pl builds htslib/CRAM REF_CACHE trees by uppercasing and whitespace-stripping FASTA sequences, hashing them by MD5, and writing them under hex-split subdirectories. It supports direct FASTA arguments, -find <dir>, or stdin; reruns print Already exists; and -subdirs 16 is rejected even though the error message misleadingly says “less than 15”.
  • subindel exposes only a usage-banner interface: bare invocation prints usage, -h is an invalid option, and --version is unrecognized. The live banner documents -i, -g, -o, -d, -I, and --paired-end, while binary strings suggest the -o value is treated more like an output prefix (%s.indel.vcf) than a literal VCF filename.
  • STARlong is a shell dispatcher over STARlong-avx2, STARlong-avx, STARlong-sse4.1, STARlong-ssse3, STARlong-sse3, and STARlong-plain. bash -x STARlong --version showed this host selecting STARlong-avx2. Help output is the generic STAR manual, not wrapper-specific text, and reports version 2.7.11b with earliest compatible genome index 2.7.4a.
  • project_tree_builder is a compiled NCBI Unix C++ tree generator with full live help and version output. The help text explicitly says the root should end with c++, the subtree can be either a path or a file list, and the tool still requires a solution argument. Local -dryrun testing with placeholder arguments exited silently with status 0.
  • roh-viz is a Perl HTML generator over bcftools roh plus bcftools query. Source inspection shows both -i (ROH file) and -v (VCF/BCF) are mandatory, only RG rows from the ROH file are plotted, and the built-in example/error text is wrong because -r is actually the regions filter, not the ROH input flag.
  • systematic-mutations is a stdin-only bash wrapper that uppercases the first whitespace-delimited field, substitutes every position with A/C/G/T via transmute -replace, optionally appends the second field as :<pattern>, then case-insensitively deduplicates the emitted variants. Command-line flags are ignored.
  • vrfs-variances is a stdin parser for SITE rows from bcftools/vrfs. In default mode it prints MEAN and VAR2 to stderr and also emits the final selected SITE row to stdout. In -s mode the current code can emit that last site twice. In -v mode it emits only the variance vector, one value per line.
  • Biomni is present as a local repo and import biomni works if the repo is added to sys.path, but deeper tool imports currently fail because langchain_core is missing. The repo expects a dedicated biomni_e1 environment and API-key configuration for agent workflows.
  • evo2 is present as a local repo with examples and a phage_gen subproject, but direct import fails because vortex is missing and the Docker image is not built yet.
  • RFdiffusion is present as a local repo with test_rfdiffusion.sh, design_ebola_binder.sh, and models/, but the Docker image is not built yet.
  • protein-structure in this workspace is best treated as a gateway/planning skill, not a ready predictor stack: colabfold_batch, pymol, chimera, chimerax, foldseek, mmseqs, and fpocket are all absent from PATH.
  • sequence-analysis can safely route to real installed tools: blastn, blastp, makeblastdb, bowtie2, bwa, samtools, hmmscan, hmmbuild, mafft, muscle, hisat2, featureCounts, prodigal, RNAfold, seqkit, and Biopython.
  • bioinformatics-toolkit is best documented as an umbrella router over the verified local CLI families plus repo-backed projects, not as a monolithic executable.
  • phage-design is a local subproject under evo2/phage_gen, with a real Python pipeline script and config template. Its bundled Slurm launcher contains placeholder /path/to/... values and is not runnable as-is.
  • yeast_database is a local learning project under projects/yeast_genome_learning; its primary entrypoints are the staged Bash scripts, while scripts/pipeline.py --steps is a compatibility helper that works locally.
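
The seq_cache_populate.pl finding above describes a normalize, hash, and split layout. The following is a minimal Python sketch of that idea, not the script's actual code. The helper names ref_cache_key and ref_cache_path are hypothetical, and the layout (two hex characters per directory level, two levels by default) is an assumption that should be checked against the installed script before relying on it.

```python
import hashlib
from pathlib import PurePosixPath


def ref_cache_key(sequence: str) -> str:
    """Return the MD5 hex digest of a sequence after normalization.

    Normalization follows the finding above: strip all whitespace,
    then uppercase. (Hypothetical helper, not part of htslib.)
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def ref_cache_path(root: str, sequence: str, subdirs: int = 2) -> PurePosixPath:
    """Build a hex-split cache path under root.

    Assumption: each directory level consumes two hex characters of the
    digest, and the remainder of the digest becomes the file name.
    """
    digest = ref_cache_key(sequence)
    levels = [digest[2 * i:2 * i + 2] for i in range(subdirs)]
    return PurePosixPath(root, *levels, digest[2 * subdirs:])


if __name__ == "__main__":
    # Case and whitespace differences collapse to the same cache entry.
    print(ref_cache_path("/tmp/ref_cache", "acgt\nACGT \n"))
    print(ref_cache_path("/tmp/ref_cache", "ACGTACGT"))
```

Both calls print the same path, which shows why line wrapping or soft-masked lowercase bases in the source FASTA should not change where a sequence lands in the cache.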