Omni-Search-Skill

English | 简体中文

Omni-Search-Skill is a full-stack search and retrieval skill for agentic workflows.

Its vision is simple:

no-blind-spot, high-speed web search and fetching across the public web

It combines search, fetch, search-then-fetch, and crawl into one skill with a unified output shape and provider routing layer.

What it does

  • Searches the live web through multiple providers
  • Fetches a specific page as clean Markdown
  • Resolves a query into top search hits and fetches the best page(s)
  • Crawls a site for relevant pages when a docs map or content graph is needed
  • Routes automatically between local, free, and paid providers
  • Detects junk content (captcha, JS-required pages) and falls back automatically
  • Skips known-blocked domains to avoid wasted attempts

Built-in providers

Search (12 providers)

| Provider | Type | Free tier |
|---|---|---|
| Jina Search | API (key optional) | Generous free tier |
| DuckDuckGo (ddgs) | Library | Unlimited |
| CN free (DDG + Bing CN) | HTML scraping | Unlimited |
| Brave Search | API | 2,000/month |
| Serper.dev | API | 2,500/month |
| Google CSE | API | 100/day |
| Bing Web Search | API (Azure) | 1,000/month |
| Tavily Search | API | 1,000/month |
| Baidu AI Search | API | With key |
| Exa | via mcporter | With key |

Fetch (4 providers)

| Provider | Type | Notes |
|---|---|---|
| Local Scrapling | Local browser | Fast + stealth (Camoufox) auto-fallback |
| Jina Reader | API | Good for JS-heavy sites |
| Tavily Extract | API | Paid fallback |
| Firecrawl Scrape | API | Paid fallback |

Crawl

  • Tavily Crawl

Routing model

Smart provider selection

The router uses a tiered, cost-optimized strategy:

Search routing:

  1. Tier 1 — Free: Jina (with key) → DuckDuckGo (ddgs library) → CN free HTML
  2. Tier 2 — Freemium: Tavily → Brave → Serper → Google CSE → Bing
  3. Tier 3 — Specialized: Baidu (Chinese), Exa

Fetch routing:

  1. Local Scrapling (fast mode, then stealth auto-fallback)
  2. Jina Reader
  3. Tavily Extract / Firecrawl (when paid allowed)

Domain-aware optimization: Sites known to block local fetching (x.com, zhihu, weibo, bloomberg, wsj, etc.) skip straight to API providers — saves time and avoids flaky failures.
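
The fetch routing and domain-aware skip described above can be sketched roughly as follows. This is an illustration only, not the code in router.py: the names `BLOCKED_LOCAL_DOMAINS`, `FETCH_ORDER`, and `plan_fetch_route` are hypothetical, and the `.com` suffixes on the blocked domains are an assumption, since the list above gives only site names.

```python
from urllib.parse import urlparse

# Hypothetical constants; the real router may store these differently.
# The ".com" suffixes are assumed from the site names listed above.
BLOCKED_LOCAL_DOMAINS = {"x.com", "zhihu.com", "weibo.com", "bloomberg.com", "wsj.com"}
FETCH_ORDER = ["scrapling", "jina_reader", "tavily_extract", "firecrawl"]
LOCAL_PROVIDERS = {"scrapling"}
PAID_PROVIDERS = {"tavily_extract", "firecrawl"}


def plan_fetch_route(url, allow_paid=False):
    """Return the ordered list of fetch providers to try for a URL."""
    host = (urlparse(url).hostname or "").lower()
    # Match the domain itself or any subdomain, but not lookalikes such as "notx.com".
    skip_local = any(host == d or host.endswith("." + d) for d in BLOCKED_LOCAL_DOMAINS)
    route = []
    for provider in FETCH_ORDER:
        if provider in LOCAL_PROVIDERS and skip_local:
            continue  # known to block local fetching: go straight to API providers
        if provider in PAID_PROVIDERS and not allow_paid:
            continue  # paid fallbacks are used only when explicitly allowed
        route.append(provider)
    return route
```

For example, `plan_fetch_route("https://www.zhihu.com/question/1")` drops the local provider and leaves only the API route.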

Resilience features

  • Request-level retry with backoff for transient HTTP errors (429, 5xx)
  • Stealth fetch retry for Camoufox browser crashes
  • Junk content detection: captcha pages, JS-required shells → auto fallback
  • Graceful degradation: returns best available result instead of failing
  • Content quality threshold: minimum 500 chars of usable content before accepting
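
As a rough sketch of how junk detection, the 500-character threshold, and graceful degradation could fit together: the marker strings, the function names, and the choice of "longest rejected result" as the best available fallback are all assumptions made for illustration, not the project's actual heuristics.

```python
MIN_USABLE_CHARS = 500  # threshold stated in the list above

# Assumed marker phrases; the real detector may use different signals.
JUNK_MARKERS = ("captcha", "enable javascript", "verify you are human")


def looks_like_junk(text):
    """Heuristic: too short, or the top of the page reads like a block page."""
    body = text.strip()
    if len(body) < MIN_USABLE_CHARS:
        return True
    head = body[:2000].lower()
    return any(marker in head for marker in JUNK_MARKERS)


def fetch_with_fallback(url, fetchers):
    """Try (name, callable) pairs in order.

    Returns (name, text) for the first acceptable result. If every provider
    returns junk, returns the longest rejected result (assumed meaning of
    "best available"). Returns None if every provider raised.
    """
    best = None
    for name, fetch in fetchers:
        try:
            text = fetch(url)
        except Exception:
            continue  # fall through to the next provider
        if not looks_like_junk(text):
            return name, text
        if best is None or len(text) > len(best[1]):
            best = (name, text)
    return best
```

Checking markers only near the top of the page is a deliberate simplification: it keeps long articles that merely mention a captcha further down from being rejected.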

Repository layout

omni-search-skill/
  SKILL.md
  README.md
  README.zh-CN.md
  requirements.txt
  .env.example
  scripts/
    omni_search.py
    eval_benchmark.py
  omni_search_skill/
    cli.py
    models.py
    providers.py
    router.py
    utils.py

Installation

git clone https://github.com/d-wwei/omni-search-skill.git
cd omni-search-skill
python3 -m pip install -r requirements.txt

For stealth fetching (JS-heavy sites), also install the Camoufox browser:

python3 -m camoufox fetch

API keys (all optional)

The system works with zero API keys (using ddgs + local fetch), but adding keys unlocks more providers and better coverage:

| Key | Provider | How to get |
|---|---|---|
| JINA_API_KEY | Jina Search + Reader | jina.ai |
| BRAVE_API_KEY | Brave Search | brave.com/search/api |
| SERPER_API_KEY | Serper.dev (Google SERP) | serper.dev |
| TAVILY_API_KEY | Tavily Search/Extract/Crawl | tavily.com |
| GOOGLE_CSE_API_KEY + GOOGLE_CSE_CX | Google Custom Search | developers.google.com |
| BING_API_KEY | Bing Web Search (Azure) | azure.microsoft.com |
| BAIDU_API_KEY | Baidu AI Search | cloud.baidu.com |
| FIRECRAWL_API_KEY | Firecrawl Scrape | firecrawl.dev |

Place them in .env based on .env.example.
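
To see at a glance which keyed providers the current environment would enable, a small check along these lines may help. The variable names come from the table above; the `PROVIDER_KEYS` mapping and `configured_providers` helper are hypothetical, and the built-in `providers` subcommand shown under Quick start is the authoritative check.

```python
import os

# Environment variable names are taken from the table above;
# the mapping itself is only an illustration.
PROVIDER_KEYS = {
    "Jina Search + Reader": ["JINA_API_KEY"],
    "Brave Search": ["BRAVE_API_KEY"],
    "Serper.dev": ["SERPER_API_KEY"],
    "Tavily Search/Extract/Crawl": ["TAVILY_API_KEY"],
    "Google Custom Search": ["GOOGLE_CSE_API_KEY", "GOOGLE_CSE_CX"],
    "Bing Web Search": ["BING_API_KEY"],
    "Baidu AI Search": ["BAIDU_API_KEY"],
    "Firecrawl Scrape": ["FIRECRAWL_API_KEY"],
}


def configured_providers(env=None):
    """Return the providers whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return sorted(
        name
        for name, keys in PROVIDER_KEYS.items()
        if all(env.get(key) for key in keys)
    )


if __name__ == "__main__":
    print(configured_providers())
```

Note that Google Custom Search needs both variables, so setting only `GOOGLE_CSE_API_KEY` does not enable it in this sketch.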

Quick start

# Check what is available in the current environment
python3 scripts/omni_search.py providers

# Search the web
python3 scripts/omni_search.py search "latest AI news"

# Fetch a page
python3 scripts/omni_search.py fetch "https://openai.com/news"

# Search first, then fetch top result(s)
python3 scripts/omni_search.py resolve "Tavily extract docs" --fetch-top 2

# Crawl a docs site
python3 scripts/omni_search.py crawl "https://docs.tavily.com"

Benchmark

The project includes a comprehensive benchmark (scripts/eval_benchmark.py) that tests against 35 fetch targets and 22 search queries across:

  • Social media (X, Reddit, Instagram, TikTok, Xiaohongshu)
  • Finance (Seeking Alpha, Yahoo Finance, Bloomberg, WSJ, FT)
  • Chinese web (Douban, Zhihu, 36kr, Weibo, Bilibili)
  • Tech (HN, GitHub, arXiv, StackOverflow, OpenAI docs)
  • News (Wikipedia, BBC, NYT, whitehouse.gov, WHO)
  • Hard targets (LinkedIn, Medium, Pinterest, Amazon, Google Scholar)
  • Multilingual search (English, Chinese, Japanese, French, Korean)

python3 scripts/eval_benchmark.py

Safety model

  • Blocks localhost and private-network fetch targets by default
  • Prefers local and lower-cost routes first
  • Uses paid providers only when they unlock better quality or coverage
  • Retries the same route only on transient HTTP errors (429, 5xx), with backoff
  • On any other failure, falls through to the next provider instead of retrying
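
Two of the rules above lend themselves to a short sketch: blocking localhost and private-network targets, and limiting retries to transient HTTP errors. The function names, the fail-closed handling of unresolvable hosts, and the backoff parameters are assumptions for illustration; the project's own checks may differ.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_private_target(url, resolve=socket.getaddrinfo):
    """Return True if the URL points at localhost or a non-public address."""
    host = urlparse(url).hostname
    if not host:
        return True  # no host to validate: refuse
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addrs = [ipaddress.ip_address(host)]  # literal IPv4/IPv6 address
    except ValueError:
        try:
            infos = resolve(host, None)
        except OSError:
            return True  # assumed policy: fail closed when DNS lookup fails
        # Drop any IPv6 scope suffix ("%eth0") before parsing.
        addrs = [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]
    return any(
        a.is_private or a.is_loopback or a.is_link_local or a.is_reserved
        for a in addrs
    )


def is_transient(status):
    """Only 429 and 5xx responses are worth retrying on the same route."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt, base=0.5, cap=8.0):
    """Exponential backoff in seconds; base and cap are assumed values."""
    return min(cap, base * 2 ** attempt)
```

The `resolve` parameter exists so the DNS lookup can be replaced in tests; by default it uses `socket.getaddrinfo`.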

License

MIT
