Wikidata CLI

A command-line tool that converts natural-language prompts into SPARQL scripts, executes them against the Wikidata Query Service, and prints the results.

Command line interface
     ↓
Natural language
     ↓
LLM generator
     ↓
SPARQL script
     ↓
Wikidata execution
     ↓
Formatted answer

Quickstart

Initialize the Project

uv sync

Set an API Key

Create a .env file and save your OpenAI API key:

OPENAI_API_KEY=your_key_here

To use LLMs on NVIDIA NIM, specify your NIM API key as follows:

NIM_API_KEY=your_key_here

Run the CLI:

uv run python -m src.cli

After initializing the CLI, the user has three options:

Ask a query
Change model config
Exit the CLI

When a query is submitted, the system shows the generated SPARQL script if generation succeeds. The user then chooses whether to execute the script on Wikidata. If it is executed, the returned results are printed in the CLI.
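The confirm-before-execute flow described above can be sketched roughly as follows. This is an illustration only, not the code in src.cli: `handle_query`, `generate`, `execute`, and `confirm` are hypothetical names, and the generator and executor are passed in as plain callables so the control flow can be read in isolation.

```python
from typing import Callable, Optional


def handle_query(
    prompt: str,
    generate: Callable[[str], Optional[str]],
    execute: Callable[[str], list],
    confirm: Callable[[str], bool],
) -> Optional[list]:
    """Generate SPARQL, show it, and execute only if the user agrees."""
    sparql = generate(prompt)
    if not sparql:
        # Nothing to show or run when generation fails.
        print("No SPARQL script could be generated.")
        return None
    print("Generated SPARQL:")
    print(sparql)
    if not confirm("Execute this script on Wikidata? [y/N] "):
        return None
    return execute(sparql)
```

In an interactive session, `confirm` could be something like `lambda msg: input(msg).strip().lower() == "y"`.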

Example

Query: List 5 cities in Taiwan

Generated SPARQL:

SELECT ?city ?cityLabel WHERE {
  ?city wdt:P31 wd:Q515;
        wdt:P17 wd:Q865.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5

Retrieved Results:

Results (5 shown):

1. Taipei
   QID: Q1867
   URI: http://www.wikidata.org/entity/Q1867
2. New Taipei
   QID: Q244898
   URI: http://www.wikidata.org/entity/Q244898
3. Taichung
   QID: Q245023
   URI: http://www.wikidata.org/entity/Q245023
4. Hsinchu City
   QID: Q249994
   URI: http://www.wikidata.org/entity/Q249994
5. Keelung
   QID: Q249996
   URI: http://www.wikidata.org/entity/Q249996
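
For readers who want to reproduce the example outside the CLI, the sketch below sends a query to the public Wikidata Query Service endpoint and renders the JSON bindings in the same numbered layout. It is a standalone approximation, not this project's implementation: `run_sparql`, `format_results`, and the User-Agent string are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def run_sparql(query: str, timeout: float = 30.0) -> dict:
    """Send a query to the Wikidata Query Service and return the parsed JSON."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder value; replace with a User-Agent that identifies your tool.
            "User-Agent": "wikidata-cli-example/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def format_results(payload: dict, var: str = "city") -> str:
    """Render SPARQL JSON bindings as a numbered list with QID and URI."""
    bindings = payload["results"]["bindings"]
    lines = [f"Results ({len(bindings)} shown):", ""]
    for i, row in enumerate(bindings, start=1):
        uri = row[var]["value"]
        qid = uri.rsplit("/", 1)[-1]  # the QID is the last path segment of the entity URI
        label = row.get(f"{var}Label", {}).get("value", qid)
        lines += [f"{i}. {label}", f"   QID: {qid}", f"   URI: {uri}"]
    return "\n".join(lines)
```

Calling `print(format_results(run_sparql(query)))` with the generated query would print a list in the format shown above; the rows returned may differ because the query has no ORDER BY clause.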

Case Studies

data/case_studies.jsonl contains queries with ambiguous inputs, conflicting constraints, typos, negation, and non-English content.

In general, the LLMs handled typos, negation, French, and Chinese better than expected. The main failures were not in intent extraction, but in downstream validation and entity resolution.

Category	Query	Result	Root Cause	Planned Fix
Baseline	Find scientists from France born after 1900	Success	—	—
Baseline	Child has father Johann Sebastian Bach and mother Maria Barbara Bach	Success	—	—
Ambiguous entity	Find people from Georgia	Returned Georgia the U.S. state; missed Georgia the country	Resolver selected top-1 entity without ambiguity detection	Expose top candidates or require clarification
Ambiguous entity	Find works by Bach	Returned no results	Resolver treated “Bach” as a last-name entity rather than resolving intended person/composer	Add ambiguity handling and candidate display
Conflicting constraints	Find scientists born after 1900 and before 1800	Generated valid SPARQL but impossible filters	No semantic validation for date ranges	Add constraint conflict validator
Conflicting constraints	Find French scientists born after 1900 and born before 1800	Generated valid SPARQL but impossible filters	No semantic validation for date ranges	Add constraint conflict validator
Typo	Find scinetists from Frnace born after 1900	Success	Model/resolver recovered likely intent	Keep as robustness win
Typo	Find child whose fther is Johann Sebastian Bach	Success	Model/resolver recovered father=P22	Keep as robustness win
Non-English	Trouve les scientifiques français nés après 1900	Success	Model translated query into English semantic intent	Keep multilingual prompt examples
Non-English	找出1900年後出生的法國科學家	Success	Model translated query into English semantic intent	Keep multilingual prompt examples
Negation	Find French scientists not born after 1900	Success	Model handled negation correctly	Add regression test

Key Findings

The strongest parts of the system were intent extraction and typo recovery. GPT-5.5 successfully converted misspelled and non-English queries into usable semantic intents.

The weakest parts were ambiguity handling and constraint validation. For example, “Georgia” can refer to either the country or the U.S. state, but the LLM selected one candidate automatically. Similarly, impossible date constraints such as “born after 1900 and before 1800” compiled into SPARQL instead of being rejected earlier.

This suggests the next improvements should focus less on prompt engineering and more on deterministic safeguards:

ambiguity detection in entity/property resolution
conflict detection in date and numeric constraints
clearer user-facing error messages

Hardening and Fixes

The strongest models tested, GPT-5.5 and GPT-5.4, produced similar failures. This showed that the core issues were not only LLM parsing errors but also deterministic pipeline issues in entity resolution and validation.

Fix 1: Ambiguity Detection

The original resolver selected the first Wikidata search result automatically. This failed for ambiguous labels such as "Georgia" and "Bach". guardrails.py was introduced to address this: it compares the top candidates and raises an ambiguity error when multiple plausible candidates exist, which prevents the system from returning confidently wrong results.

Example:

Input: Find people from Georgia

Before: The system silently resolved Georgia to the U.S. state.

After: The system reports that "Georgia" is an ambiguous entity label and asks the user to make the query more specific or to provide a QID. It also lists the top candidates for the user's reference.
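
A minimal sketch of this kind of check is shown below. It is not the contents of guardrails.py: the `Candidate` type, `AmbiguousEntityError`, `resolve_entity`, and the rule that a candidate is "plausible" when its label equals the mention (ignoring case) are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    qid: str
    label: str
    description: str


class AmbiguousEntityError(ValueError):
    """Raised when a mention matches more than one plausible entity."""

    def __init__(self, mention: str, candidates: list[Candidate]):
        self.mention = mention
        self.candidates = candidates
        options = "; ".join(f"{c.qid} ({c.description})" for c in candidates)
        super().__init__(
            f'"{mention}" is ambiguous. Add detail or provide a QID. Candidates: {options}'
        )


def resolve_entity(mention: str, candidates: list[Candidate]) -> Candidate:
    """Return the single plausible candidate, or raise if several match.

    Assumption: "plausible" means the label equals the mention ignoring case.
    A real resolver might use search scores or entity types instead.
    """
    if not candidates:
        raise LookupError(f'No Wikidata entity found for "{mention}"')
    exact = [c for c in candidates if c.label.casefold() == mention.casefold()]
    if len(exact) > 1:
        raise AmbiguousEntityError(mention, exact)
    # Fall back to the top search hit when no label matches exactly.
    return exact[0] if exact else candidates[0]
```

With candidates for both Q1428 (the U.S. state) and Q230 (the country) labeled "Georgia", `resolve_entity("Georgia", ...)` raises instead of picking the first hit.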

Fix 2: Constraint Conflict Validation

The original system allowed contradictory date filters, such as:

born after 1900 and before 1800

This generated valid SPARQL but always returned no results. I added validation before compilation to detect impossible date ranges.

Before: The compiler generated SPARQL with both filters.

After: The validator rejects the query with a clear error explaining that the date constraints conflict.
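
One way such a pre-compilation check could look is sketched below. The names `YearRange`, `ConstraintConflictError`, and `validate_year_range` are hypothetical, and the project's intermediate representation of constraints is assumed rather than known. The check is deliberately conservative: it only rejects ranges where the "after" year is not earlier than the "before" year.

```python
from dataclasses import dataclass
from typing import Optional


class ConstraintConflictError(ValueError):
    """Raised when date constraints cannot be satisfied together."""


@dataclass(frozen=True)
class YearRange:
    after: Optional[int] = None   # lower bound, e.g. "born after 1900"
    before: Optional[int] = None  # upper bound, e.g. "born before 1800"


def validate_year_range(field: str, rng: YearRange) -> None:
    """Reject ranges where no date can satisfy both bounds."""
    if rng.after is None or rng.before is None:
        return  # an open-ended range can always be satisfied
    if rng.after >= rng.before:
        raise ConstraintConflictError(
            f"Conflicting constraints on {field}: "
            f"after {rng.after} and before {rng.before} cannot both hold."
        )
```

For the example above, `validate_year_range("date of birth", YearRange(after=1900, before=1800))` raises before any SPARQL is compiled.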

Remaining Hard Cases

Some failures are fundamentally difficult because natural language queries often omit information that is required for deterministic Wikidata resolution.

Ambiguous Entity Names: Names such as "Georgia", "Bach", "Washington", or "Apple" can refer to many different Wikidata entities. Without additional context, there may be no single correct QID. A search API can return candidates, but it cannot always know which one the user intended. The safest behavior is to expose ambiguity instead of guessing.
Wikidata Modeling Complexity: Wikidata is not always modeled uniformly. A concept that sounds simple in natural language may require different properties in different domains. For example, "French scientist" might use country of citizenship, country of origin, or nationality-like descriptions to match "France". The system currently handles direct property-value constraints and may select an incorrect property whose label is semantically similar to the correct one.
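
To make the modeling issue concrete, the sketch below builds the same "French scientist" query with two different properties. The identifiers are Wikidata IDs (P106 occupation, Q901 scientist, Q142 France, P27 country of citizenship, P495 country of origin); `build_query` is a hypothetical helper, and which property is "correct" for a given prompt is exactly the open question.

```python
FRANCE = "Q142"
SCIENTIST = "Q901"
OCCUPATION = "P106"

# Two properties whose labels both sound like "from France".
CANDIDATE_PROPERTIES = {
    "P27": "country of citizenship",
    "P495": "country of origin",
}


def build_query(property_id: str, limit: int = 5) -> str:
    """Build a 'scientist linked to France' query using the given property."""
    return (
        "SELECT ?person ?personLabel WHERE {\n"
        f"  ?person wdt:{OCCUPATION} wd:{SCIENTIST};\n"
        f"          wdt:{property_id} wd:{FRANCE}.\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


for pid, label in CANDIDATE_PROPERTIES.items():
    print(f"# {pid} ({label})")
    print(build_query(pid))
```

Both queries are syntactically valid SPARQL, so a syntax check alone cannot tell which one matches the user's intent.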

Cross-model Evaluation

To run the evaluation on the evaluation dataset, use the following command.

uv run python -m src.run_eval

The default models are GPT-5.5, GPT-5.4, and Devstral 2 123B Instruct 2512. These models were selected based on two main criteria:

Strong structured reasoning + coding ability: The LLMs are asked to generate SPARQL scripts, which is closer to code generation than free-form text generation.
Large context windows: All three LLMs have large context windows (1.05M, 1.05M, and 256K, respectively). Larger context windows reduce truncation risk in prompts and improve consistency in outputs.

All three models were capable of hitting the accuracy threshold because they reliably produced valid SPARQL scripts that match ground truth. However, they also exhibited consistent failure patterns.

Wrong entity/property selections: All models occasionally failed at entity/property selection. For example, when asked about films directed by Steven Spielberg, Devstral 2 123B Instruct 2512 chose Q534 (Ordinance on Industrial Safety and Health) instead of Q8877 (Steven Spielberg) as the target entity. A similar issue showed up when smaller models (e.g., NVIDIA Nemotron 3 Nano Omni 30B) were used as the SPARQL generator. Even the strongest model, GPT-5.5, failed to select the correct property label when other semantically similar candidates were present. These failures suggest that LLMs sometimes map a mention to an identifier that is semantically unrelated but token-similar.
Complex queries: Models struggled when queries required aggregation or calculations. Additionally, variations in SPARQL syntax formatting increased the probability of failures.
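
A cheap deterministic guard against the first failure pattern is to compare the label of the selected entity with the mention in the query before executing. The sketch below is one possible form of that check, not part of this project: `label_matches_mention` is a hypothetical helper, and token overlap is a crude heuristic that would need aliases to avoid rejecting valid matches.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.casefold()))


def label_matches_mention(mention: str, label: str, aliases: tuple[str, ...] = ()) -> bool:
    """Return True if the label or any alias shares a token with the mention."""
    mention_tokens = _tokens(mention)
    return any(mention_tokens & _tokens(name) for name in (label, *aliases))


# The mismatch reported above would be flagged before execution.
print(label_matches_mention("Steven Spielberg", "Steven Spielberg"))
print(label_matches_mention("Steven Spielberg", "Ordinance on Industrial Safety and Health"))
```

The first call prints True and the second prints False, so the Q534 selection could be rejected or sent back for re-resolution.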

In conclusion, larger, stronger models improve SPARQL generation reliability. However, most real failures stem from ambiguity and validation gaps, rather than raw model capability. Building a robust system therefore requires careful evaluation design, explicit handling of ambiguity, and deterministic validation layers. The current evaluation focuses on relatively simple queries. As tasks become more complex, a more carefully designed dataset and evaluation pipeline will be necessary.
