@link-assistant/web-search

A web search microservice and library that aggregates results from 40 providers (search engines plus knowledge, paper, and code APIs), with result merging and reranking. It ships as two first-class implementations, JavaScript (@link-assistant/web-search) and Rust (the web-search crate), that stay in lock-step: the same provider catalog, categories, merge strategies, and CLI/HTTP surface in both languages.

Features

Many providers, four categories: 40 providers grouped into search, knowledge, papers, and code — a superset of FormalAI's web_search_core registry (see Search Providers and the issue #5 compatibility map).
Descriptor-driven catalog: Engines are declared as data (URL, request kind, parser) and run through one shared GenericProvider, so adding an engine in one place adds it everywhere.
web-capture component: JavaScript can lazily load @link-assistant/web-capture, and Rust delegates wc:* providers to the published web-capture crate.
Result merging: Combine results using RRF, weighted scoring, or interleaving.
Configurable weights: Adjust provider weights for custom reranking.
URL deduplication: Automatic normalization and deduplication across providers.
Typed provider registry: A single source of truth powering provider discovery (CLI --list-providers, HTTP /providers, /categories) and provider instantiation.
Dual language parity: Identical behavior and an extensive shared test suite across JavaScript and Rust.
Multi-runtime support: The JavaScript build works with Bun, Node.js, and Deno.
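
To make the URL deduplication feature concrete, here is a minimal sketch of one way normalization and deduplication could work. The rules shown (dropping the fragment, a leading `www.`, `utm_*` parameters, and trailing slashes) and the names `normalizeUrl` and `dedupeByUrl` are assumptions for illustration; this README does not document the library's actual rules.

```javascript
// Hypothetical normalization rules; the library's real rules may differ.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl); // the URL parser already lowercases the host
  url.hostname = url.hostname.replace(/^www\./, '');
  // Drop tracking parameters (assumed list: anything starting with utm_).
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_')) url.searchParams.delete(key);
  }
  url.searchParams.sort(); // make parameter order irrelevant
  // Treat "/path/" and "/path" as the same resource; the fragment is omitted.
  const path = url.pathname.replace(/\/+$/, '') || '/';
  const query = url.searchParams.toString();
  return `${url.protocol}//${url.host}${path}${query ? `?${query}` : ''}`;
}

// Keep the first result seen for each normalized URL.
function dedupeByUrl(results) {
  const seen = new Set();
  const unique = [];
  for (const result of results) {
    let key;
    try {
      key = normalizeUrl(result.url);
    } catch {
      key = result.url; // unparsable URLs are compared verbatim
    }
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(result);
  }
  return unique;
}
```

Keeping the first occurrence means the ordering produced by the merge step is preserved.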

Entry points

web-search ships the same three entry points in both languages. Every entry point is available from the published packages — no Git checkout or path dependency required.

| Entry point | JavaScript (`@link-assistant/web-search`) | Rust (`web-search` crate) |
| --- | --- | --- |
| Library | `import { createSearchEngine } from '@link-assistant/web-search'` | `use web_search::WebSearchEngine;` |
| CLI | `npx web-search "query"` (bin `web-search`) | `web-search "query"` (binary `web-search`) |
| HTTP server | `npx web-search serve --port 3000` | `web-search serve --port 3000` |

npm package: @link-assistant/web-search
crates.io crate: web-search
GitHub releases: js-v* (JavaScript) and rust-v* (Rust)

Installation

JavaScript (npm)

# Install the latest published version
npm install @link-assistant/web-search        # npm
bun add @link-assistant/web-search            # bun
yarn add @link-assistant/web-search           # yarn

# Pin a specific published version (recommended for CI / reproducible builds)
npm install @link-assistant/web-search@0.9.0

Rust (cargo)

# Add the latest published crate
cargo add web-search

# Pin a specific published version (recommended for CI / reproducible builds)
cargo add web-search@0.2.0

# Or install the CLI/server binary directly from crates.io
cargo install web-search            # latest
cargo install web-search@0.2.0      # pinned

The pinned versions above match the current published baseline (npm @link-assistant/web-search@0.9.0, crates.io web-search@0.2.0). Replace them with the latest tags shown on the badges at the top of this README.

Quick Start

As a Library

import {
  WebSearchEngine,
  createSearchEngine,
} from '@link-assistant/web-search';

// Create a search engine
const engine = createSearchEngine();

// Search across all providers
const results = await engine.search('artificial intelligence');

// Search with options
const filteredResults = await engine.search('machine learning', {
  limit: 20,
  providers: ['google', 'duckduckgo'],
  strategy: 'rrf',
  weights: { google: 1.5, duckduckgo: 1.0 },
});

// Search single provider
const googleResults = await engine.searchSingle('deep learning', 'google');

As a REST API Server

# Start the server
npx web-search serve --port 3000

# Or with bun
bunx web-search serve --port 3000

API Endpoints:

GET /search?q=<query> - Search all providers
POST /search - Search with options in body
GET /search/:provider?q=<query> - Search single provider
GET /providers - List available providers and the typed registry (filter with ?category=<search|knowledge|papers|code>)
GET /categories - List provider ids grouped by category
GET /health - Health check

Example:

curl "http://localhost:3000/search?q=rust+programming&limit=10&strategy=rrf"

# Only the scholarly-paper providers
curl "http://localhost:3000/providers?category=papers"

# Provider ids per category
curl "http://localhost:3000/categories"

As a CLI Tool

# Search from command line
npx web-search "artificial intelligence"

# With options
npx web-search "machine learning" --limit 20 --providers google,bing --format json

# Search category-specific providers
npx web-search "transformer architecture" --providers arxiv,crossref,openalex

# Output just URLs
npx web-search "deep learning" --format urls

# Discover every available provider, grouped by category
npx web-search --list-providers

Merge Strategies

Reciprocal Rank Fusion (RRF)

Default strategy. Combines results by their rank positions across providers.

const results = await engine.search(query, { strategy: 'rrf' });
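
As a rough illustration of the idea, the standard RRF formula gives each result a score of 1 / (k + rank) per provider list and sums those scores per URL. The sketch below is self-contained and is not the library's code; the function name `rrfMerge` and the input shape are assumptions, and whether the library also applies provider weights inside RRF is not stated here. The default `k = 60` matches the `rrfK: 60` value shown in the API reference.

```javascript
// resultsByProvider: { providerId: [{ url, title }, ...] }, each list in rank order.
function rrfMerge(resultsByProvider, k = 60) {
  const scored = new Map();
  for (const results of Object.values(resultsByProvider)) {
    results.forEach((result, index) => {
      const rank = index + 1; // ranks are 1-based
      const entry = scored.get(result.url) ?? { ...result, score: 0 };
      entry.score += 1 / (k + rank);
      scored.set(result.url, entry);
    });
  }
  // Highest combined score first.
  return [...scored.values()].sort((a, b) => b.score - a.score);
}

const merged = rrfMerge({
  google: [{ url: 'https://a.example' }, { url: 'https://b.example' }],
  duckduckgo: [{ url: 'https://b.example' }, { url: 'https://c.example' }],
});
console.log(merged.map((r) => r.url));
```

A result that appears in several lists accumulates score from each, so `https://b.example` ranks first here even though no provider placed it alone at the top of both lists.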

Weighted Scoring

Score results based on provider weights and rank positions.

const results = await engine.search(query, {
  strategy: 'weighted',
  weights: { google: 2.0, duckduckgo: 1.0, bing: 0.5 },
});
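
The exact scoring formula is not spelled out in this README. One plausible reading, shown here purely as an assumption, is that each occurrence contributes `weight / rank` and contributions are summed per URL; `weightedMerge` is a hypothetical name.

```javascript
// Assumed formula: score(url) = sum over providers of weight / rank.
function weightedMerge(resultsByProvider, weights = {}) {
  const scored = new Map();
  for (const [provider, results] of Object.entries(resultsByProvider)) {
    const weight = weights[provider] ?? 1; // unlisted providers count as 1.0
    results.forEach((result, index) => {
      const entry = scored.get(result.url) ?? { ...result, score: 0 };
      entry.score += weight / (index + 1);
      scored.set(result.url, entry);
    });
  }
  return [...scored.values()].sort((a, b) => b.score - a.score);
}

const weighted = weightedMerge(
  {
    google: [{ url: 'https://a.example' }, { url: 'https://b.example' }],
    bing: [{ url: 'https://b.example' }, { url: 'https://c.example' }],
  },
  { google: 2.0, bing: 0.5 }
);
console.log(weighted.map((r) => `${r.url} ${r.score}`));
```

Under this formula a high weight lets one provider's top hit outrank a result that several lower-weighted providers agree on.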

Interleaving

Round-robin style interleaving of results from each provider.

const results = await engine.search(query, { strategy: 'interleave' });
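
A minimal sketch of round-robin interleaving, assuming each provider returns a ranked list and duplicates are skipped by exact URL. The function name `interleave` and the input shape are illustrative, not the library's API.

```javascript
// Take the 1st result of every provider, then the 2nd of every provider, and so on.
function interleave(resultsByProvider) {
  const lists = Object.values(resultsByProvider);
  const longest = Math.max(0, ...lists.map((list) => list.length));
  const seen = new Set();
  const merged = [];
  for (let position = 0; position < longest; position++) {
    for (const list of lists) {
      const result = list[position];
      if (!result || seen.has(result.url)) continue; // exhausted list or duplicate
      seen.add(result.url);
      merged.push(result);
    }
  }
  return merged;
}

const interleaved = interleave({
  google: [{ url: 'https://a.example' }, { url: 'https://b.example' }, { url: 'https://c.example' }],
  duckduckgo: [{ url: 'https://b.example' }, { url: 'https://d.example' }],
});
console.log(interleaved.map((r) => r.url));
```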

Search Providers

Providers are organized into the four categories formal-ai consumes, and the catalog is a superset of FormalAI's web_search_core registry (issue #5 — see the compatibility map). Run npx web-search --list-providers (or cargo run -- --list-providers from rust/) to print the live catalog; both languages report the same 40 providers.

| Category | Providers | Access |
| --- | --- | --- |
| `search` | google, bing, duckduckgo, searx, brave, mojeek, ecosia, startpage, yahoo, yandex, lite (DuckDuckGo Lite), `wc:*` | API / hybrid / HTML / component |
| `knowledge` | wikipedia, wikidata, wiktionary, wikinews, internet-archive, dbpedia, openlibrary, semantic-scholar, openalex, crossref, cambridge-dictionary, merriam-webster, dictionary-com, collins-dictionary | API / HTML |
| `papers` | arxiv, europepmc, doaj | API (CORS-readable) |
| `code` | github, hackernews, gitlab, codeberg, gitee, bitbucket, gitflic | API (CORS-readable) |

Native search providers are listed above. The optional wc:* providers are the same engines delegated through the web-capture component.

api providers call a JSON/Atom endpoint directly.
html providers scrape a search-results page with a per-engine regex through the shared anchor-list parser (the search engines, plus the dictionary knowledge providers, which resolve a headword to its canonical entry page).
hybrid providers (google, bing) use an official API when credentials are configured and fall back to scraping otherwise.
component providers (wc:*) are backed by the optional @link-assistant/web-capture library — see web-capture component.
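
The hybrid behavior described above can be sketched as a small selector: use the API path when a credential is configured, otherwise scrape. Everything here (`createHybridSearch`, `searchViaApi`, `searchViaScrape`) is a hypothetical stand-in with injected functions, not the library's actual implementation.

```javascript
// Hypothetical helper: picks the API path only when a credential is present.
function createHybridSearch({ apiKey, searchViaApi, searchViaScrape }) {
  return async function search(query) {
    if (apiKey) {
      return searchViaApi(query, apiKey); // official API, credentials configured
    }
    return searchViaScrape(query); // no credentials: fall back to scraping
  };
}

// Usage with stub implementations standing in for real network calls.
const usedPaths = [];
const stubApi = async (query) => { usedPaths.push('api'); return [{ url: 'https://api.example', title: query }]; };
const stubScrape = async (query) => { usedPaths.push('scrape'); return [{ url: 'https://html.example', title: query }]; };

const withKey = createHybridSearch({ apiKey: 'your-api-key', searchViaApi: stubApi, searchViaScrape: stubScrape });
const withoutKey = createHybridSearch({ searchViaApi: stubApi, searchViaScrape: stubScrape });
```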

The category defaults follow FormalAI's DuckDuckGo-first plan: duckduckgo (search), wikipedia (knowledge), arxiv (papers), and github (code). When no providers are requested, the live default plan is duckduckgo, internet-archive, wikipedia, wikidata, wiktionary, wikinews.

GITHUB_TOKEN is optional but raises the GitHub search rate limit when set.

Class-based providers (google, bing, duckduckgo)

import {
  GoogleProvider,
  BingProvider,
  DuckDuckGoProvider,
} from '@link-assistant/web-search';

// Google: Custom Search API when configured, scraping fallback otherwise
const google = new GoogleProvider({
  apiKey: 'your-api-key',
  searchEngineId: 'your-cx-id',
});

// Bing: Web Search API when configured, scraping fallback otherwise
const bing = new BingProvider({ apiKey: 'your-bing-api-key' });

// DuckDuckGo: HTML scraping, no API key required
const duckduckgo = new DuckDuckGoProvider();

Descriptor-driven providers

Every other engine in the table is declared as a descriptor (id, request kind, parser) and instantiated through a single GenericProvider. The registry can build the whole catalog so you can pick any provider by id:

import { buildProviders, API_ENGINES } from '@link-assistant/web-search';

// Instantiate the full catalog (Map<id, provider>) and select one
const arxiv = buildProviders().get('arxiv');
const results = await arxiv.search('graph neural networks', { limit: 5 });

// Or build directly from a descriptor
import { createGenericProvider } from '@link-assistant/web-search';
const crossref = createGenericProvider(
  API_ENGINES.find((d) => d.id === 'crossref')
);
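
To show why declaring engines as data keeps the catalog small, here is a self-contained sketch of the pattern. The descriptor fields (`kind`, `buildUrl`, `parse`), the `api.example.invalid` endpoint, and `createProviderFromDescriptor` are all hypothetical; the real descriptor shape used by `API_ENGINES` and `createGenericProvider` is not shown in this README.

```javascript
// A hypothetical descriptor: pure data plus two small functions.
const exampleDescriptor = {
  id: 'example-api',
  kind: 'api', // request kind; 'html' would imply scraping instead
  buildUrl: (query, limit) =>
    `https://api.example.invalid/search?q=${encodeURIComponent(query)}&rows=${limit}`,
  parse: (payload) =>
    payload.items.map((item) => ({
      title: item.title,
      url: item.link,
      snippet: item.summary ?? '',
    })),
};

// One generic provider runs any descriptor; fetchJson is injected so the
// sketch needs no network access.
function createProviderFromDescriptor(descriptor, fetchJson) {
  return {
    id: descriptor.id,
    async search(query, { limit = 10 } = {}) {
      const payload = await fetchJson(descriptor.buildUrl(query, limit));
      return descriptor.parse(payload).slice(0, limit);
    },
  };
}
```

Adding an engine then means adding one more descriptor object; the request and result-shaping logic stays in the single generic provider.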

web-capture component

Any provider can be backed by the optional @link-assistant/web-capture component library, exposed through the wc:* provider ids (wc:wikipedia, wc:duckduckgo, wc:google, wc:bing, wc:brave). The dependency is loaded lazily; when it is not installed the provider warns once and returns an empty result set so the rest of the aggregation keeps working. You can also inject a custom implementation for testing:

import { createWebCaptureProvider } from '@link-assistant/web-search';

const provider = createWebCaptureProvider({
  engine: 'wikipedia',
  // Optional: inject a fetch/search implementation (defaults to @link-assistant/web-capture)
  searchImpl: async (query, options) => [
    /* { title, url, snippet } */
  ],
});
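
The "warn once, return an empty result set" behavior for a missing optional dependency can be sketched as follows. `createLazyProvider` and `loadImpl` are hypothetical names; in real use `loadImpl` would presumably wrap a dynamic `import()` of the optional package, whose API is not documented here.

```javascript
// loadImpl: function that returns (or resolves to) a search function,
// or throws/rejects when the optional dependency is unavailable.
function createLazyProvider({ loadImpl, warn = console.warn }) {
  let implPromise; // cached so loading is attempted at most once
  let warned = false;
  return async function search(query, options = {}) {
    implPromise ??= Promise.resolve().then(loadImpl).catch(() => null);
    const impl = await implPromise;
    if (typeof impl !== 'function') {
      if (!warned) {
        warned = true;
        warn('optional search component is not installed; returning no results');
      }
      return []; // keep the rest of the aggregation working
    }
    return impl(query, options);
  };
}
```

Caching the promise rather than the resolved value means concurrent first calls share a single load attempt.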

Provider registry

A typed registry is the single source of truth for discovery and instantiation:

import {
  CATEGORIES, // ['search', 'knowledge', 'papers', 'code']
  getRegistry, // full provider metadata
  getProviderIds, // ids, optionally filtered by category
  getDefaultProviderIds, // ids used when none are specified
  buildProviders, // instantiate the whole catalog
} from '@link-assistant/web-search';

getProviderIds('papers'); // ['arxiv', 'europepmc', 'doaj']
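
To illustrate how one registry can serve both category filtering and the grouped `/categories` view, here is a tiny self-contained sketch. The entries are a small subset taken from the provider table above; `listIds` and `groupByCategory` are hypothetical names, not the library's exports.

```javascript
const CATEGORY_ORDER = ['search', 'knowledge', 'papers', 'code'];

// Illustrative subset of the catalog; the real registry holds more metadata.
const registry = [
  { id: 'duckduckgo', category: 'search' },
  { id: 'wikipedia', category: 'knowledge' },
  { id: 'arxiv', category: 'papers' },
  { id: 'doaj', category: 'papers' },
  { id: 'github', category: 'code' },
];

// All ids, or only the ids in one category.
function listIds(category) {
  return registry
    .filter((entry) => !category || entry.category === category)
    .map((entry) => entry.id);
}

// The same data grouped per category.
function groupByCategory() {
  return Object.fromEntries(CATEGORY_ORDER.map((category) => [category, listIds(category)]));
}

console.log(listIds('papers'));
console.log(groupByCategory());
```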

API Reference

WebSearchEngine

const engine = new WebSearchEngine(config);

// Search methods
await engine.search(query, options);
await engine.searchSingle(query, providerName, options);

// Provider management
engine.getAvailableProviders();
engine.getProviderStatus();
engine.setProviderWeight(name, weight);
engine.setProviderEnabled(name, enabled);
engine.getProvider(name);

Merge Functions

import {
  mergeResults,
  mergeWithRRF,
  mergeWithWeights,
  mergeWithInterleave,
} from '@link-assistant/web-search';

// Merge results from multiple providers
const merged = mergeResults(resultsByProvider, {
  strategy: 'rrf',
  weights: { google: 1.5 },
  rrfK: 60,
  removeDuplicates: true,
});

Rust Library

A first-class Rust implementation lives in the rust/ directory (crate web-search). It mirrors the JavaScript library: the same descriptor-driven catalog, the same typed registry, the same four categories, and the same 40 providers, verified by a shared test suite (cargo test).

cd rust
cargo build --release

Rust CLI

# Search
./target/release/web-search "artificial intelligence" --limit 10

# Category-specific providers
./target/release/web-search "graph neural networks" --providers arxiv,crossref

# List every available provider, grouped by category (matches the JS CLI)
./target/release/web-search --list-providers

# Start server (GET /search, /providers, /categories, /health)
./target/release/web-search serve --port 3000

Rust Library Usage

use web_search::{MergeOptions, MergeStrategy, SearchOptions, WebSearchEngine};

let engine = WebSearchEngine::new();

let results = engine.search_with_options(
    "machine learning",
    SearchOptions { limit: Some(10), ..Default::default() },
    None,
    Some(MergeOptions { strategy: MergeStrategy::Rrf, ..Default::default() })
).await?;

Development

Language-specific project files live under js/ and rust/; repository-level documentation and workflow metadata stay at the root. CI/CD helper scripts live with their language: js/scripts/ and rust/scripts/.

cd js

# Install dependencies
bun install

# Run tests
bun test

# Run with other runtimes
npm test
deno test --allow-read --allow-env --allow-net

# Lint code
bun run lint

# Format code
bun run format

# Verify JavaScript/Rust layout and provider parity
cd ..
node js/scripts/check-js-rust-parity.mjs

Rust Development

cd rust

# Run tests
cargo test

# Run clippy
cargo clippy

# Format code
cargo fmt

# Run Rust CI/CD guard scripts from the repository root
cd ..
rust-script rust/scripts/check-file-size.rs --rust-root rust
rust-script rust/scripts/check-crate-size.rs --rust-root rust

Environment Variables

GOOGLE_API_KEY - Google Custom Search API key
GOOGLE_SEARCH_ENGINE_ID - Google Custom Search Engine ID
BING_API_KEY - Bing Web Search API key
GITHUB_TOKEN - Optional GitHub token; raises the GitHub search rate limit when set

License

Unlicense - Public Domain
