A command-line tool that converts natural-language prompts into a SPARQL script, executes it against the Wikidata Query Service, and prints the results.
Command line interface
↓
Natural language
↓
LLM generator
↓
SPARQL script
↓
Wikidata execution
↓
Formatted answer
uv sync
Create a .env file and save your OpenAI API key:
OPENAI_API_KEY=your_key_here
To use LLMs on Nvidia's NIM, specify your NIM API key as follows.
NIM_API_KEY=your_key_here
uv run python -m src.cli
After initializing the CLI, the user has three options:
- Ask a query
- Change model config
- Exit the CLI
When a query is given to the system, the user sees the SPARQL script if one is generated successfully. The user can then choose whether the script should be executed on Wikidata. If it is executed, the returned results are printed in the CLI.
Query: List 5 cities in Taiwan
Generated SPARQL:
SELECT ?city ?cityLabel WHERE {
  ?city wdt:P31 wd:Q515;
        wdt:P17 wd:Q865.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
Retrieved Results:
Results (5 shown):
1. Taipei
QID: Q1867
URI: http://www.wikidata.org/entity/Q1867
2. New Taipei
QID: Q244898
URI: http://www.wikidata.org/entity/Q244898
3. Taichung
QID: Q245023
URI: http://www.wikidata.org/entity/Q245023
4. Hsinchu City
QID: Q249994
URI: http://www.wikidata.org/entity/Q249994
5. Keelung
QID: Q249996
URI: http://www.wikidata.org/entity/Q249996
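The execution and formatting steps shown above could look roughly like the following. This is a minimal sketch, not the code in this repository: the function names (`build_request`, `run_query`, `format_results`) and the User-Agent string are assumptions made for illustration. It assumes the public endpoint `https://query.wikidata.org/sparql` and the standard SPARQL JSON results layout (`results.bindings`).

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(sparql: str) -> urllib.request.Request:
    """Build a GET request that asks the query service for JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; use one that describes your own tool.
        "User-Agent": "nl2sparql-example/0.1 (illustrative sketch)",
    }
    return urllib.request.Request(f"{WDQS_ENDPOINT}?{params}", headers=headers)


def run_query(sparql: str, timeout: float = 30.0) -> dict:
    """Send the query over the network and parse the JSON response."""
    with urllib.request.urlopen(build_request(sparql), timeout=timeout) as resp:
        return json.load(resp)


def format_results(payload: dict, item_var: str) -> str:
    """Render bindings as numbered label / QID / URI entries."""
    bindings = payload["results"]["bindings"]
    lines = [f"Results ({len(bindings)} shown):"]
    for i, row in enumerate(bindings, start=1):
        uri = row[item_var]["value"]
        qid = uri.rsplit("/", 1)[-1]  # the last path segment is the QID
        # Fall back to the QID when the label service returned no label.
        label = row.get(f"{item_var}Label", {}).get("value", qid)
        lines += [f"{i}. {label}", f"QID: {qid}", f"URI: {uri}"]
    return "\n".join(lines)
```

For the query above, `print(format_results(run_query(sparql), "city"))` would produce a listing in the same shape as the example, although the cities returned depend on the live Wikidata data.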
data/case_studies.jsonl contains queries with ambiguous inputs, conflicting constraints, typos, negation, and non-English content.
In general, LLMs are capable of handling typos, negation, French, and Chinese better than expected. The main failures were not in intent extraction, but in downstream validation and entity resolution.
| Category | Query | Result | Root Cause | Planned Fix |
|---|---|---|---|---|
| Baseline | Find scientists from France born after 1900 | Success | — | — |
| Baseline | Child has father Johann Sebastian Bach and mother Maria Barbara Bach | Success | — | — |
| Ambiguous entity | Find people from Georgia | Returned Georgia the U.S. state; missed Georgia the country | Resolver selected top-1 entity without ambiguity detection | Expose top candidates or require clarification |
| Ambiguous entity | Find works by Bach | Returned no results | Resolver treated “Bach” as a last-name entity rather than resolving intended person/composer | Add ambiguity handling and candidate display |
| Conflicting constraints | Find scientists born after 1900 and before 1800 | Generated valid SPARQL but impossible filters | No semantic validation for date ranges | Add constraint conflict validator |
| Conflicting constraints | Find French scientists born after 1900 and born before 1800 | Generated valid SPARQL but impossible filters | No semantic validation for date ranges | Add constraint conflict validator |
| Typo | Find scinetists from Frnace born after 1900 | Success | Model/resolver recovered likely intent | Keep as robustness win |
| Typo | Find child whose fther is Johann Sebastian Bach | Success | Model/resolver recovered father=P22 | Keep as robustness win |
| Non-English | Trouve les scientifiques français nés après 1900 | Success | Model translated query into English semantic intent | Keep multilingual prompt examples |
| Non-English | 找出1900年後出生的法國科學家 | Success | Model translated query into English semantic intent | Keep multilingual prompt examples |
| Negation | Find French scientists not born after 1900 | Success | Model handled negation correctly | Add regression test |
The strongest parts of the system were intent extraction and typo recovery. GPT-5.5 successfully converted misspelled and non-English queries into usable semantic intents.
The weakest parts were ambiguity handling and constraint validation. For example, “Georgia” can refer to either the country or the U.S. state, but the resolver selected one candidate automatically. Similarly, impossible date constraints such as “born after 1900 and before 1800” were compiled into SPARQL instead of being rejected earlier.
This suggests the next improvements should focus less on prompt engineering and more on deterministic safeguards:
- ambiguity detection in entity/property resolution
- conflict detection in date and numeric constraints
- clearer user-facing error messages
The strongest models tested, GPT-5.5 and GPT-5.4, produced similar failures. This showed that the core issues were not only LLM parsing errors but also deterministic pipeline issues in entity resolution and validation.
The original resolver selected the first Wikidata search result automatically. This failed for ambiguous labels such as "Georgia" and "Bach".
guardrails.py is introduced to solve this issue.
It compares top candidates and raises an ambiguity error when multiple plausible candidates exist.
This prevents the system from returning confidently wrong results.
Input:
Find people from Georgia
Before: The system silently resolved Georgia to the U.S. state.
After: The system reports that "Georgia" is an ambiguous entity label and asks the user for a more specific query or a QID. It also lists the top candidates for the user's reference.
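The ambiguity check can be sketched as follows. This is an illustration of the idea, not the contents of guardrails.py: the `Candidate` type, the `score` field, the `min_margin` threshold, and the scores in the usage note are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    qid: str
    label: str
    description: str
    score: float  # hypothetical relevance score, higher is better


class AmbiguousEntityError(ValueError):
    """Raised when several candidates are about equally plausible."""

    def __init__(self, mention: str, candidates: list[Candidate]):
        self.mention = mention
        self.candidates = candidates
        options = "; ".join(
            f"{c.qid} ({c.label}: {c.description})" for c in candidates
        )
        super().__init__(
            f'"{mention}" is ambiguous. Use a more specific query or '
            f"provide a QID. Top candidates: {options}"
        )


def resolve_entity(
    mention: str,
    candidates: list[Candidate],
    min_margin: float = 0.2,  # assumed threshold, tune on real data
    top_k: int = 3,
) -> Candidate:
    """Return the best candidate only if it clearly beats the runner-up."""
    if not candidates:
        raise LookupError(f'No Wikidata candidates found for "{mention}"')
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best = ranked[0]
    # Every candidate scoring within min_margin of the best is plausible.
    plausible = [c for c in ranked if best.score - c.score < min_margin]
    if len(plausible) > 1:
        raise AmbiguousEntityError(mention, plausible[:top_k])
    return best
```

With made-up scores such as 0.92 for Q1428 (Georgia, the U.S. state) and 0.90 for Q230 (Georgia, the country), `resolve_entity("Georgia", ...)` raises `AmbiguousEntityError` and lists both candidates instead of silently picking the first.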
The original system allowed contradictory date filters, such as:
born after 1900 and before 1800
This generated valid SPARQL but always returned no results. I added validation before compilation to detect impossible date ranges.
Before: The compiler generated SPARQL with both filters.
After: The validator rejects the query with a clear error explaining that the date constraints conflict.
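A validator of this kind can be small. The sketch below is an assumption about its shape rather than the project's actual code: `DateConstraint`, `ConflictingConstraintsError`, and `validate_date_range` are hypothetical names, and constraints are simplified to year-level "after" and "before" bounds.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


class ConflictingConstraintsError(ValueError):
    """Raised when no date can satisfy all constraints at once."""


@dataclass(frozen=True)
class DateConstraint:
    op: str  # "after" or "before"
    year: int


def validate_date_range(
    constraints: Iterable[DateConstraint],
) -> tuple[Optional[int], Optional[int]]:
    """Collapse constraints into (lower, upper) bounds or raise on conflict."""
    lower: Optional[int] = None  # tightest "after" bound seen so far
    upper: Optional[int] = None  # tightest "before" bound seen so far
    for c in constraints:
        if c.op == "after":
            lower = c.year if lower is None else max(lower, c.year)
        elif c.op == "before":
            upper = c.year if upper is None else min(upper, c.year)
        else:
            raise ValueError(f"Unknown date operator: {c.op!r}")
    # "After X" and "before Y" cannot both hold when X >= Y.
    if lower is not None and upper is not None and lower >= upper:
        raise ConflictingConstraintsError(
            f"Date constraints conflict: after {lower} and before {upper} "
            "cannot both be true."
        )
    return lower, upper
```

Running this before compilation means "born after 1900 and before 1800" fails with a readable message, while a satisfiable pair such as after 1800 and before 1900 passes through as `(1800, 1900)`.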
Some failures are fundamentally difficult because natural language queries often omit information that is required for deterministic Wikidata resolution.
- Ambiguous Entity Names: Names such as "Georgia", "Bach", "Washington", or "Apple" can refer to many different Wikidata entities. Without additional context, there may be no single correct QID. A search API can return candidates, but it cannot always know which one the user intended. The safest behavior is to expose ambiguity instead of guessing.
- Wikidata Modeling Complexity: Wikidata is not always modeled uniformly. A concept that sounds simple in natural language may require different properties in different domains. For example, "French scientist" might rely on country of citizenship, country of origin, or nationality-like descriptions to match "France". The system currently handles direct property-value constraints and may select an incorrect property whose label is semantically similar to the correct one.
To run the evaluation on the evaluation dataset, use the following command.
uv run python -m src.run_eval
The default models are GPT-5.5, GPT-5.4, and Devstral 2 123B Instruct 2512.
These models were selected based on two main criteria:
- Strong structured reasoning and coding ability: The LLMs are asked to generate SPARQL scripts, which is closer to code generation than to free-form text generation.
- Large context windows: All three LLMs have large context windows (1.05M, 1.05M, and 256K tokens, respectively). Larger context windows reduce the risk of prompt truncation and improve output consistency.
All three models were capable of hitting the accuracy threshold because they reliably produced valid SPARQL scripts that match ground truth. However, they also exhibited consistent failure patterns.
- Wrong entity/property selections: All models occasionally failed at entity/property selection. For example, when asked about "Films directed by Steven Spielberg", Devstral 2 123B Instruct 2512 chose Q534 (Ordinance on Industrial Safety and Health) instead of Q8877 (Steven Spielberg) as the target entity. A similar issue appeared when smaller models (e.g., NVIDIA Nemotron 3 Nano Omni 30B) were used as the SPARQL generator. The strongest model, GPT-5.5 in this case, also failed to select the correct property label when other semantically similar candidates were present. This failure suggests that LLMs sometimes map to semantically unrelated but token-similar candidates.
- Complex query: Models struggled when queries required aggregation or calculations. Additionally, variations in SPARQL syntax formatting increased the probability of failures.
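This README does not say how generated scripts are matched against the ground truth. One option that is insensitive to SPARQL formatting differences is to compare the sets of returned QIDs rather than the query text. The sketch below illustrates that idea only; `qids_from_bindings` and `result_set_accuracy` are hypothetical names and are not taken from src.run_eval.

```python
from typing import Iterable


def qids_from_bindings(payload: dict, var: str) -> frozenset[str]:
    """Collect the QIDs bound to `var` in a SPARQL JSON result."""
    return frozenset(
        row[var]["value"].rsplit("/", 1)[-1]
        for row in payload["results"]["bindings"]
        if var in row
    )


def result_set_accuracy(
    pairs: Iterable[tuple[frozenset[str], frozenset[str]]],
) -> float:
    """Fraction of (predicted, expected) pairs whose QID sets are equal."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for predicted, expected in pairs if predicted == expected)
    return hits / len(pairs)
```

Exact set equality is strict: a query with `LIMIT` and no `ORDER BY` can legitimately return different rows on each run, so an evaluation built this way would need deterministic ground-truth queries or a looser overlap measure.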
In conclusion, larger, stronger models improve SPARQL generation reliability. However, most real failures stem from ambiguity and validation gaps, rather than raw model capability. Building a robust system therefore requires careful evaluation design, explicit handling of ambiguity, and deterministic validation layers. The current evaluation focuses on relatively simple queries. As tasks become more complex, a more carefully designed dataset and evaluation pipeline will be necessary.