SciTrace-RL

SciTrace-RL is a submission-ready demo for AI4S Infra + scientific agents. It turns a scientific-agent run into a reproducible execution trace, validates the trace, and converts the run into a reward-labeled training sample.

The project is designed for DP Technology's "追光计划" (Chasing Light Program) direction:

Primary direction: AI4S Infra
Secondary fit: agent-empowered scientific discovery
Core idea: scientific agents should not only answer; they should leave auditable, replayable, and trainable execution evidence.

What This Demo Shows

The demo runs a lightweight dry-lab workflow for lithium-ion battery electrolyte additive screening:

Retrieve evidence from a local scientific corpus.
Screen candidate additives with deterministic chemistry proxy features.
Generate a citation-grounded report.
Validate citations, replayability, constraints, and claim-evidence alignment.
Optionally run an AI judge over report claims and retrieved evidence.
Export a trace-to-reward sample for future planner/tool-router training.
Export a post-training bundle with SFT, DPO, process-reward, credit-assignment, and tool-router records.
Export an escalation packet for high-fidelity computation, wet-lab validation, expert review, and feedback ingestion.
Export an interoperability bundle aligned with W3C PROV, Workflow Run RO-Crate, and OpenTelemetry-style agent spans.
Run a 15-case eval suite with deterministic gates, optional DeepSeek/OpenAI-compatible judging, and explicit expert-review boundaries.
Run deep evaluation for trajectory quality, citation support precision, multi-run stability, cost/latency, and external-result ingestion.

This is intentionally not presented as a high-fidelity chemistry model. The value is the infrastructure pattern: trace schema, tool adapter boundaries, validation gates, artifact replay, and reward generation.
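As a rough illustration of the trace-schema and artifact-replay idea, the sketch below records one tool call together with a hash of its output, then re-runs the tool to confirm the hash still matches. The `TraceStep` fields, the `content_hash` helper, and the toy `screen` tool are all assumptions made for this example; the project's actual schema lives in `src/scitrace_rl/schema.py` and may differ.

```python
import hashlib
import json
from dataclasses import dataclass, field


def content_hash(payload) -> str:
    """Hash a JSON-serializable payload in canonical form (sorted keys)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class TraceStep:
    # Hypothetical fields, chosen for illustration only.
    tool: str
    arguments: dict
    output: dict
    output_hash: str = field(init=False)

    def __post_init__(self) -> None:
        # Fingerprint the output at record time so a later replay can be compared.
        self.output_hash = content_hash(self.output)


def replay_matches(step: TraceStep, tool_fn) -> bool:
    """Re-run the tool with the recorded arguments and compare output hashes."""
    return content_hash(tool_fn(**step.arguments)) == step.output_hash


def screen(candidates):
    # Toy deterministic "screening" tool: order names by length, then alphabetically.
    return {"ranked": sorted(candidates, key=lambda name: (len(name), name))}


args = {"candidates": ["VC", "FEC", "LiBOB"]}
step = TraceStep(tool="screen", arguments=args, output=screen(**args))
print(replay_matches(step, screen))  # True
```

Hashing a canonical JSON form rather than the raw object keeps the comparison independent of dictionary key order.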

Why It Fits DP Technology

DP Technology's Bohrium + SciMaster stack frames the bottleneck of agentic science as an infrastructure problem: workflows must become executable, observable, reproducible, governed, and continuously improvable. SciTrace-RL directly targets that bottleneck.

The demo mirrors the same platform logic:

Reading: evidence retrieval from scientific sources.
Computing: candidate scoring through a callable tool.
Validation: trace-backed checks before promoting a result.
Feedback: execution trace converted into reward data.
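The validation and feedback steps above can be pictured as gate results feeding a scalar reward. The following is a minimal sketch under stated assumptions: the scoring rule (fraction of non-skipped gates that passed) and the `constraint_check` gate name are hypothetical, and this is not the formula that produces the demo's reported reward.

```python
from typing import Dict

PASS, FAIL, SKIP = "pass", "fail", "skip"


def reward_from_gates(results: Dict[str, str]) -> float:
    """Illustrative rule: share of non-skipped gates that passed."""
    scored = [status for status in results.values() if status != SKIP]
    if not scored:
        return 0.0  # nothing was checked, so nothing is rewarded
    return sum(status == PASS for status in scored) / len(scored)


def promote(results: Dict[str, str], threshold: float = 1.0) -> bool:
    """Promote a result only when the reward reaches the threshold."""
    return reward_from_gates(results) >= threshold


scorecard = {
    "citation_integrity": PASS,
    "artifact_replay": PASS,
    "constraint_check": FAIL,   # hypothetical gate name
    "ai_claim_review": SKIP,    # skipped gates do not count for or against
}
print(reward_from_gates(scorecard))  # 2 of 3 scored gates passed
print(promote(scorecard))            # False
```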

Architecture

flowchart LR
    A["Scientific Goal"] --> B["Agent Runtime"]
    B --> C["Literature Search Adapter"]
    B --> D["Molecule Screening Adapter"]
    B --> E["Report Writer"]
    C --> F["Trace Store"]
    D --> F
    E --> F
    F --> G["Validation Gates"]
    G --> H["Reward Label"]
    H --> I["Planner / Tool-Router Training Data"]
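One way to read the diagram is that every adapter call passes through the trace store before its output is used. The sketch below wraps plain functions so each call is recorded with its arguments, output, and duration. `TraceStore`, `traced`, and the three stand-in tools are hypothetical names for this example and are not taken from the repository.

```python
import time
from typing import Any, Callable, Dict, List


class TraceStore:
    """Collects one record per tool call, in call order."""

    def __init__(self) -> None:
        self.steps: List[Dict[str, Any]] = []

    def traced(self, name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Return a wrapper that records each keyword-argument call to fn."""
        def wrapper(**kwargs: Any) -> Any:
            start = time.perf_counter()
            output = fn(**kwargs)
            self.steps.append({
                "tool": name,
                "arguments": kwargs,
                "output": output,
                "duration_s": time.perf_counter() - start,
            })
            return output
        return wrapper


store = TraceStore()
# Stand-in tools; real adapters would query a corpus or score molecules.
search = store.traced("literature_search", lambda query: [f"doc about {query}"])
screen = store.traced("molecule_screening", lambda candidates: sorted(candidates))
write = store.traced(
    "report_writer",
    lambda sources, ranked: f"{ranked[0]} supported by {len(sources)} source(s)",
)

sources = search(query="electrolyte additive")
ranked = screen(candidates=["VC", "FEC"])
report = write(sources=sources, ranked=ranked)
print([step["tool"] for step in store.steps])
# ['literature_search', 'molecule_screening', 'report_writer']
```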

Repository Structure

.
├── data/
│   ├── corpus/scientific_sources.json
│   └── tasks/electrolyte_additive_screen.json
├── docs/
│   ├── ai_api_validation.md
│   ├── architecture.md
│   ├── demo_guide.md
│   ├── project_proposal_scitrace_rl.pdf
│   ├── project_proposal_scitrace_rl.tex
│   └── research_basis.md
├── outputs/
│   ├── demo_dashboard.html
│   ├── demo_report.md
│   ├── demo_trace.json
│   ├── escalation_packet.json
│   ├── post_training_bundle.json
│   ├── provenance_bundle.json
│   ├── eval/eval_report.md
│   ├── deep_eval/deep_eval_report.md
│   ├── eval_deepseek_v7/eval_report.md
│   ├── ranked_candidates.json
│   ├── retrieved_sources.json
│   ├── trace_to_reward_sample.json
│   └── validation_scorecard.json
├── src/scitrace_rl/
│   ├── ai_judge.py
│   ├── chemistry.py
│   ├── cli.py
│   ├── dashboard.py
│   ├── deep_eval.py
│   ├── escalation.py
│   ├── eval_suite.py
│   ├── external_feedback.py
│   ├── learning_signal.py
│   ├── runner.py
│   ├── schema.py
│   ├── tools.py
│   ├── utils.py
│   └── validators.py
└── tests/test_runner.py

Run

No external Python dependencies are required.

PYTHONPATH=src python3 -m scitrace_rl.cli --out outputs

Expected output:

trace_id=trace_...
reward=0.97
dashboard=outputs/demo_dashboard.html

Open the dashboard (open is the macOS command; on most Linux desktops, xdg-open is the equivalent):

open outputs/demo_dashboard.html

Run tests:

PYTHONPATH=src python3 -m unittest discover -s tests

Run the eval suite:

PYTHONPATH=src python3 -m scitrace_rl.eval_suite --out outputs/eval

Run the deep eval suite:

PYTHONPATH=src python3 -m scitrace_rl.deep_eval --out outputs/deep_eval --stability-runs 5

Optional DeepSeek AI judge:

export SCITRACE_AI_JUDGE=1
export DEEPSEEK_API_KEY="your_api_key"
export DEEPSEEK_MODEL="deepseek-v4-flash"
export DEEPSEEK_BASE_URL="https://api.deepseek.com"
PYTHONPATH=src python3 -m scitrace_rl.cli --out outputs
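For orientation, the environment variables above could be read into a small config object, with the judge treated as disabled whenever the switch or the key is missing. This is a sketch, not the code in `ai_judge.py`: `JudgeConfig`, `load_judge_config`, and the fallback defaults for model and base URL are assumptions chosen to mirror the values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class JudgeConfig:
    api_key: str
    model: str
    base_url: str


def load_judge_config(env: Mapping[str, str] = os.environ) -> Optional[JudgeConfig]:
    """Return a config only when the judge is switched on and a key is present."""
    if env.get("SCITRACE_AI_JUDGE") != "1":
        return None
    api_key = env.get("DEEPSEEK_API_KEY", "")
    if not api_key:
        return None
    return JudgeConfig(
        api_key=api_key,
        # Assumed defaults, copied from the example exports above.
        model=env.get("DEEPSEEK_MODEL", "deepseek-v4-flash"),
        base_url=env.get("DEEPSEEK_BASE_URL", "https://api.deepseek.com"),
    )


def gate_status(config: Optional[JudgeConfig]) -> str:
    """Without a config the gate is reported as skipped instead of failing."""
    return "skip" if config is None else "enabled"


print(gate_status(load_judge_config({})))  # skip
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.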

Run DeepSeek-backed evaluation:

PYTHONPATH=src python3 -m scitrace_rl.eval_suite --out outputs/eval_deepseek

Benchmark-Aligned Evaluation

Recent science-agent and research-agent benchmarks suggest that a credible demo should test more than final-answer quality. SciTrace-RL therefore maps the public benchmark landscape into local, reproducible checks:

| Benchmark | What it stresses | SciTrace-RL coverage |
| --- | --- | --- |
| CORE-Bench | Computational reproducibility across paper-based tasks | `artifact_replay`, deterministic screening hashes, `deep_eval` stability runs |
| PaperBench | Paper-to-code replication, experiment execution, rubric grading | trace artifacts, replay gates, report artifacts, structured validation scorecard |
| MLR-Bench | Open-ended ML research agents and fabricated experiment risk | adversarial cases for invented computation, unsupported quantitative claims, and premature deployment |
| DeepResearch Bench | Research-report quality, effective citations, citation accuracy | `citation_integrity`, `citation_support_precision`, claim metadata checks |
| TRAJECT-Bench | Tool selection, argument correctness, dependency/order satisfaction | `trajectory_quality` validates tool order, output contracts, artifact links, and duration sanity |
| AIRS-Bench | Full research lifecycle: ideas, experiments, analysis, iteration | `post_training_bundle`, `escalation_packet`, external feedback ingestion |
| SPOT | Verification of scientific errors in manuscripts | negative cases for wrong mechanism, unsupported transfer, overconfident safety, and citation mismatch |
| FIRE-Bench | Full-cycle scientific insight rediscovery | current coverage is partial: trace/reward/escalation infrastructure; real rediscovery tasks are future work |
| MMDeepResearch-Bench | Multimodal evidence grounding and citation integrity | current text-only trace design can ingest Uni-Parser/OmniScience artifacts, but multimodal support is future work |
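To make the tool-order part of `trajectory_quality` concrete, here is a minimal sketch of a dependency check over the sequence of tool calls in a trace. The `DEPENDENCIES` map and the tool names are hypothetical; the actual gate in `validators.py` may use different rules and also covers output contracts, artifact links, and durations, which are not shown here.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each tool lists the tools that must run before it.
DEPENDENCIES: Dict[str, Set[str]] = {
    "literature_search": set(),
    "molecule_screening": set(),
    "report_writer": {"literature_search", "molecule_screening"},
}


def order_violations(
    calls: List[str], dependencies: Dict[str, Set[str]] = DEPENDENCIES
) -> List[str]:
    """Return a human-readable problem for every call made before its prerequisites."""
    seen: Set[str] = set()
    problems: List[str] = []
    for tool in calls:
        if tool not in dependencies:
            problems.append(f"unknown tool: {tool}")
            continue
        for missing in sorted(dependencies[tool] - seen):
            problems.append(f"{tool} ran before {missing}")
        seen.add(tool)
    return problems


good = ["literature_search", "molecule_screening", "report_writer"]
bad = ["report_writer", "literature_search", "molecule_screening"]
print(order_violations(good))  # []
print(order_violations(bad))
```

An empty list means the trajectory satisfies every declared dependency; a gate could map that to pass and anything else to fail.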

Current comprehensive local run:

unit tests: 4/4 passed
demo reward: 0.97
validation gates: 8
trajectory_quality: pass
citation_support_precision: pass
15-case adversarial eval: pass
deterministic_detection_rate: 1.0
citation_support_detection_rate: 1.0
semantic_or_support_detection_rate: 1.0
supported_case_pass_rate: 1.0
auto_resolvable_coverage: 0.8
expert_required_case_share: 0.2
deep_eval overall_status: pass
multi-run stability: 5 runs, 0 drift
external-result ingestion: pass
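The multi-run stability figure can be understood as hashing the output of repeated runs and counting how many differ from the first. The sketch below assumes that definition of drift, which may not match the one used in `deep_eval.py`; `stability_report` and the two toy pipelines are illustrative names only.

```python
import hashlib
import itertools
import json
from typing import Callable, Dict


def output_hash(payload) -> str:
    """Stable fingerprint of a JSON-serializable run output."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stability_report(run: Callable[[], object], runs: int = 5) -> Dict[str, object]:
    """Execute run() several times; drift counts runs whose hash differs from the first."""
    hashes = [output_hash(run()) for _ in range(runs)]
    drift = sum(h != hashes[0] for h in hashes[1:])
    return {"runs": runs, "drift": drift, "status": "pass" if drift == 0 else "fail"}


def deterministic_pipeline():
    # Same input, same output on every call.
    return {"ranked": ["FEC", "VC"], "score": 0.5}


counter = itertools.count()


def drifting_pipeline():
    # Output changes on every call, so every later run differs from the first.
    return {"ranked": ["FEC", "VC"], "call": next(counter)}


print(stability_report(deterministic_pipeline))
print(stability_report(drifting_pipeline))
```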

Key Artifacts

docs/project_proposal_scitrace_rl.pdf: concise proposal PDF.
docs/project_proposal_scitrace_rl.tex: LaTeX source for the proposal.
docs/ai_api_validation.md: optional AI API judge setup and rationale.
docs/demo_guide.md: what the reviewer should inspect in the demo.
docs/research_basis.md: source-backed rationale for the direction.
outputs/demo_trace.json: full trace with tool calls, artifacts, validation results, and reward.
outputs/provenance_bundle.json: W3C PROV, Workflow Run RO-Crate, and OpenTelemetry span views of the same run.
outputs/demo_report.md: generated scientific-agent report.
outputs/validation_scorecard.json: machine-readable validation gates.
outputs/trace_to_reward_sample.json: one reward-labeled sample for future agent training.
outputs/post_training_bundle.json: concrete SFT, DPO, process-reward, credit-assignment, and tool-router examples.
outputs/escalation_packet.json: structured next-step handoff to computation, lab, expert review, and feedback ingestion.
outputs/eval/eval_report.md: offline 15-case validation report.
outputs/deep_eval/deep_eval_report.md: trajectory, citation-support, multi-run stability, and external-ingestion evaluation.
outputs/eval_deepseek_v7/eval_report.md: DeepSeek-backed 15-case validation report from the latest real API run.

By default the ai_claim_review gate is marked skip, so the demo remains reproducible without external API access. When enabled, the AI judge reviews whether generated claims are supported by retrieved evidence.
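Independently of the AI judge, a deterministic citation check can already catch two simple problems: citations that point at sources never retrieved, and sentences that make claims without any citation. The sketch below assumes citation markers of the form `[S1]` and a naive sentence split; both are assumptions for illustration, and the project's `citation_integrity` gate may work differently.

```python
import re
from typing import Dict, List

CITATION = re.compile(r"\[(S\d+)\]")  # assumed marker format, e.g. [S1]


def citation_issues(report: str, sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Flag citations with no matching retrieved source and sentences with no citation."""
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    cited = set(CITATION.findall(report))
    return {
        "unknown_citations": sorted(cited - set(sources)),
        "uncited_sentences": [s for s in sentences if not CITATION.search(s)],
    }


retrieved = {"S1": "FEC and SEI formation", "S2": "VC additive review"}
text = (
    "FEC forms a LiF-rich SEI [S1]. "
    "VC improves cycle life [S3]. "
    "This additive is safe."
)
issues = citation_issues(text, retrieved)
print(issues["unknown_citations"])   # ['S3']
print(issues["uncited_sentences"])   # ['This additive is safe.']
```

A check like this only establishes that a citation exists and resolves; whether the cited source actually supports the claim is the part left to the optional judge or to expert review.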

Extension Plan

The current adapters are intentionally local and deterministic. In a production Bohrium/SciMaster setting, the same interfaces can be replaced by:

OpenAlex / PubMed / Uni-Parser / OmniScience evidence ingestion.
Bohrium / Lebesgue compute jobs.
Uni-Mol / DPA / Uni-Fold model calls.
Uni-Lab-OS wet-lab execution hooks.
Human/expert review queues for claim promotion and unresolved scientific boundary conditions.
W3C PROV / Workflow Run RO-Crate export for FAIR scientific workflow records.
OpenTelemetry GenAI spans for production observability.
Offline RL, process-reward modeling, preference data, and tool-router training over validated traces.
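The swap described above relies on callers depending on an interface and not on a concrete adapter. A minimal sketch using `typing.Protocol` follows; `EvidenceAdapter`, `LocalCorpusAdapter`, and `retrieve` are hypothetical names, and the repository's adapter signatures in `tools.py` may differ.

```python
from typing import Dict, List, Protocol


class EvidenceAdapter(Protocol):
    """Anything with this search method can serve as an evidence backend."""

    def search(self, query: str, limit: int) -> List[Dict[str, str]]: ...


class LocalCorpusAdapter:
    """Deterministic keyword match over an in-memory corpus."""

    def __init__(self, corpus: List[Dict[str, str]]) -> None:
        self._corpus = corpus

    def search(self, query: str, limit: int) -> List[Dict[str, str]]:
        terms = query.lower().split()
        hits = [
            doc for doc in self._corpus
            if any(term in doc["text"].lower() for term in terms)
        ]
        return hits[:limit]


def retrieve(adapter: EvidenceAdapter, query: str, limit: int = 3) -> List[Dict[str, str]]:
    # The caller sees only the protocol, so a remote backend could be
    # substituted later without changing this function.
    return adapter.search(query, limit)


corpus = [
    {"id": "S1", "text": "Fluoroethylene carbonate as an electrolyte additive"},
    {"id": "S2", "text": "Graphite anode expansion"},
]
print([doc["id"] for doc in retrieve(LocalCorpusAdapter(corpus), "electrolyte additive")])
# ['S1']
```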
