SciTrace-RL is a submission-ready demo for AI4S Infra + scientific agents. It turns a scientific-agent run into a reproducible execution trace, validates the trace, and converts the run into a reward-labeled training sample.
The project is designed for DP Technology's "追光计划" (Light-Chasing Program) direction:
- Primary direction: AI4S Infra
- Secondary fit: agent-empowered scientific discovery (智能体赋能科学发现)
- Core idea: scientific agents should not only answer; they should leave auditable, replayable, and trainable execution evidence.
The demo runs a lightweight dry-lab workflow for lithium-ion battery electrolyte additive screening:
- Retrieve evidence from a local scientific corpus.
- Screen candidate additives with deterministic chemistry proxy features.
- Generate a citation-grounded report.
- Validate citations, replayability, constraints, and claim-evidence alignment.
- Optionally run an AI judge over report claims and retrieved evidence.
- Export a trace-to-reward sample for future planner/tool-router training.
- Export a post-training bundle with SFT, DPO, process-reward, credit-assignment, and tool-router records.
- Export an escalation packet for high-fidelity computation, wet-lab validation, expert review, and feedback ingestion.
- Export an interoperability bundle aligned with W3C PROV, Workflow Run RO-Crate, and OpenTelemetry-style agent spans.
- Run a 15-case eval suite with deterministic gates, optional DeepSeek/OpenAI-compatible judging, and explicit expert-review boundaries.
- Run deep evaluation for trajectory quality, citation support precision, multi-run stability, cost/latency, and external-result ingestion.
This is intentionally not presented as a high-fidelity chemistry model. The value is the infrastructure pattern: trace schema, tool adapter boundaries, validation gates, artifact replay, and reward generation.
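To make the pattern concrete, here is a minimal sketch of a trace record and an artifact-replay check. It is not the schema in `src/scitrace_rl/schema.py`: the class names, field names, and the `replays_identically` helper are all hypothetical, and the only assumption is that tool outputs are JSON-serializable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToolCall:
    """One adapter invocation recorded in the trace (hypothetical fields)."""
    tool: str
    arguments: dict
    output: dict


@dataclass
class Trace:
    """A run-level record: the goal plus every tool call, in order."""
    trace_id: str
    goal: str
    tool_calls: List[ToolCall] = field(default_factory=list)

    def record(self, tool: str, arguments: dict, output: dict) -> None:
        self.tool_calls.append(ToolCall(tool, arguments, output))


def content_hash(payload) -> str:
    """Hash a JSON-serializable payload independently of dict key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replays_identically(original: Trace, replayed: Trace) -> bool:
    """Compare tool calls, arguments, and outputs while ignoring trace IDs."""
    first = [asdict(call) for call in original.tool_calls]
    second = [asdict(call) for call in replayed.tool_calls]
    return content_hash(first) == content_hash(second)
```

Hashing a canonical JSON form rather than the raw file bytes keeps the check insensitive to key ordering, which is one plausible way to make replay comparisons deterministic.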
DP Technology's Bohrium + SciMaster stack frames the bottleneck of agentic science as an infrastructure problem: workflows must become executable, observable, reproducible, governed, and continuously improvable. SciTrace-RL directly targets that bottleneck.
The demo mirrors the same platform logic:
- Reading: evidence retrieval from scientific sources.
- Computing: candidate scoring through a callable tool.
- Validation: trace-backed checks before promoting a result.
- Feedback: execution trace converted into reward data.
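The validation-to-feedback step can be sketched as a function that folds gate results into a scalar reward. This is an illustrative assumption, not the demo's scoring formula: the gate names are hypothetical, and the rule used here (share of executed gates that passed, with skipped gates ignored) is only one simple choice.

```python
from typing import Dict

# Statuses assumed for illustration; "skip" models an optional gate that did not run.
VALID_STATUSES = {"pass", "fail", "skip"}


def reward_from_gates(gates: Dict[str, str]) -> float:
    """Return the share of executed gates that passed, ignoring skipped gates."""
    unknown = {status for status in gates.values() if status not in VALID_STATUSES}
    if unknown:
        raise ValueError(f"unknown gate status: {sorted(unknown)}")
    executed = [status for status in gates.values() if status != "skip"]
    if not executed:
        return 0.0  # nothing was checked, so no evidence supports a reward
    return sum(status == "pass" for status in executed) / len(executed)


def to_training_sample(trace_id: str, gates: Dict[str, str]) -> dict:
    """Attach the reward to the trace ID so the run becomes a labeled sample."""
    return {
        "trace_id": trace_id,
        "gates": dict(gates),
        "reward": reward_from_gates(gates),
    }
```

A weighted scheme, where some gates count more than others, would fit the same interface.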
flowchart LR
A["Scientific Goal"] --> B["Agent Runtime"]
B --> C["Literature Search Adapter"]
B --> D["Molecule Screening Adapter"]
B --> E["Report Writer"]
C --> F["Trace Store"]
D --> F
E --> F
F --> G["Validation Gates"]
G --> H["Reward Label"]
H --> I["Planner / Tool-Router Training Data"]
.
├── data/
│ ├── corpus/scientific_sources.json
│ └── tasks/electrolyte_additive_screen.json
├── docs/
│ ├── ai_api_validation.md
│ ├── architecture.md
│ ├── demo_guide.md
│ ├── project_proposal_scitrace_rl.pdf
│ ├── project_proposal_scitrace_rl.tex
│ └── research_basis.md
├── outputs/
│ ├── demo_dashboard.html
│ ├── demo_report.md
│ ├── demo_trace.json
│ ├── escalation_packet.json
│ ├── post_training_bundle.json
│ ├── provenance_bundle.json
│ ├── eval/eval_report.md
│ ├── deep_eval/deep_eval_report.md
│ ├── eval_deepseek_v7/eval_report.md
│ ├── ranked_candidates.json
│ ├── retrieved_sources.json
│ ├── trace_to_reward_sample.json
│ └── validation_scorecard.json
├── src/scitrace_rl/
│ ├── ai_judge.py
│ ├── chemistry.py
│ ├── cli.py
│ ├── dashboard.py
│ ├── deep_eval.py
│ ├── escalation.py
│ ├── eval_suite.py
│ ├── external_feedback.py
│ ├── learning_signal.py
│ ├── runner.py
│ ├── schema.py
│ ├── tools.py
│ ├── utils.py
│ └── validators.py
└── tests/test_runner.py
No external Python dependencies are required.

PYTHONPATH=src python3 -m scitrace_rl.cli --out outputs

Expected output:
trace_id=trace_...
reward=0.97
dashboard=outputs/demo_dashboard.html
Open the dashboard:
open outputs/demo_dashboard.html

Run tests:

PYTHONPATH=src python3 -m unittest discover -s tests

Run the eval suite:

PYTHONPATH=src python3 -m scitrace_rl.eval_suite --out outputs/eval

Run the deep eval suite:

PYTHONPATH=src python3 -m scitrace_rl.deep_eval --out outputs/deep_eval --stability-runs 5

Optional DeepSeek AI judge:
export SCITRACE_AI_JUDGE=1
export DEEPSEEK_API_KEY="your_api_key"
export DEEPSEEK_MODEL="deepseek-v4-flash"
export DEEPSEEK_BASE_URL="https://api.deepseek.com"
PYTHONPATH=src python3 -m scitrace_rl.cli --out outputs

Run DeepSeek-backed evaluation:

PYTHONPATH=src python3 -m scitrace_rl.eval_suite --out outputs/eval_deepseek

Recent science-agent and research-agent benchmarks suggest that a credible demo should test more than final-answer quality. SciTrace-RL therefore maps the public benchmark landscape into local, reproducible checks:
| Benchmark | What it stresses | SciTrace-RL coverage |
|---|---|---|
| CORE-Bench | Computational reproducibility across paper-based tasks | artifact_replay, deterministic screening hashes, deep_eval stability runs |
| PaperBench | Paper-to-code replication, experiment execution, rubric grading | trace artifacts, replay gates, report artifacts, structured validation scorecard |
| MLR-Bench | Open-ended ML research agents and fabricated experiment risk | adversarial cases for invented computation, unsupported quantitative claims, and premature deployment |
| DeepResearch Bench | Research-report quality, effective citations, citation accuracy | citation_integrity, citation_support_precision, claim metadata checks |
| TRAJECT-Bench | Tool selection, argument correctness, dependency/order satisfaction | trajectory_quality validates tool order, output contracts, artifact links, and duration sanity |
| AIRS-Bench | Full research lifecycle: ideas, experiments, analysis, iteration | post_training_bundle, escalation_packet, external feedback ingestion |
| SPOT | Verification of scientific errors in manuscripts | negative cases for wrong mechanism, unsupported transfer, overconfident safety, and citation mismatch |
| FIRE-Bench | Full-cycle scientific insight rediscovery | current coverage is partial: trace/reward/escalation infrastructure; real rediscovery tasks are future work |
| MMDeepResearch-Bench | Multimodal evidence grounding and citation integrity | current text-only trace design can ingest Uni-Parser/OmniScience artifacts, but multimodal support is future work |
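As an illustration of the tool-order part of a trajectory check like the one in the TRAJECT-Bench row, the sketch below flags required tools that are missing or that first ran before a prerequisite. The function and the tool names in the usage note are hypothetical and do not reflect the actual `trajectory_quality` implementation.

```python
from typing import List, Sequence


def tool_order_violations(calls: Sequence[str], required_order: Sequence[str]) -> List[str]:
    """List required tools that are missing or first ran before a prerequisite."""
    violations = []
    last_index = -1
    for tool in required_order:
        if tool not in calls:
            violations.append(f"missing tool call: {tool}")
            continue
        index = calls.index(tool)  # position of the first occurrence
        if index < last_index:
            violations.append(f"{tool} ran before its prerequisite")
        last_index = max(last_index, index)
    return violations
```

For example, with a required order of `literature_search`, `molecule_screening`, `report_writer`, a run that writes the report first yields one violation, while a run in the expected order yields an empty list.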
Current comprehensive local run:
unit tests: 4/4 passed
demo reward: 0.97
validation gates: 8
trajectory_quality: pass
citation_support_precision: pass
15-case adversarial eval: pass
deterministic_detection_rate: 1.0
citation_support_detection_rate: 1.0
semantic_or_support_detection_rate: 1.0
supported_case_pass_rate: 1.0
auto_resolvable_coverage: 0.8
expert_required_case_share: 0.2
deep_eval overall_status: pass
multi-run stability: 5 runs, 0 drift
external-result ingestion: pass
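A line such as "5 runs, 0 drift" can be read as: run the same workflow several times and count the runs whose output differs from the first. The sketch below shows one way to measure that. It assumes results are JSON-serializable, and the `fingerprint` and `count_drift` helpers are hypothetical rather than taken from `deep_eval.py`.

```python
import hashlib
import json
from typing import Callable


def fingerprint(result) -> str:
    """Hash a JSON-serializable result in canonical (sorted-key) form."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def count_drift(run: Callable[[], object], runs: int = 5) -> int:
    """Execute `run` repeatedly and count outputs that differ from the first run."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    prints = [fingerprint(run()) for _ in range(runs)]
    return sum(fp != prints[0] for fp in prints[1:])
```

A fully deterministic workflow returns 0 from `count_drift`. Any timestamp or random seed that leaks into the result shows up as drift, so such fields would need to be excluded before fingerprinting.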
- docs/project_proposal_scitrace_rl.pdf: concise proposal PDF.
- docs/project_proposal_scitrace_rl.tex: LaTeX source for the proposal.
- docs/ai_api_validation.md: optional AI API judge setup and rationale.
- docs/demo_guide.md: what the reviewer should inspect in the demo.
- docs/research_basis.md: source-backed rationale for the direction.
- outputs/demo_trace.json: full trace with tool calls, artifacts, validation results, and reward.
- outputs/provenance_bundle.json: W3C PROV, Workflow Run RO-Crate, and OpenTelemetry span views of the same run.
- outputs/demo_report.md: generated scientific-agent report.
- outputs/validation_scorecard.json: machine-readable validation gates.
- outputs/trace_to_reward_sample.json: one reward-labeled sample for future agent training.
- outputs/post_training_bundle.json: concrete SFT, DPO, process-reward, credit-assignment, and tool-router examples.
- outputs/escalation_packet.json: structured next-step handoff to computation, lab, expert review, and feedback ingestion.
- outputs/eval/eval_report.md: offline 15-case validation report.
- outputs/deep_eval/deep_eval_report.md: trajectory, citation-support, multi-run stability, and external-ingestion evaluation.
- outputs/eval_deepseek_v7/eval_report.md: DeepSeek-backed 15-case validation report from the latest real API run.
By default the ai_claim_review gate is marked skip, so the demo remains reproducible without external API access. When enabled, the AI judge reviews whether generated claims are supported by retrieved evidence.
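The default-skip behavior can be sketched as a small function over the environment. The variable names `SCITRACE_AI_JUDGE` and `DEEPSEEK_API_KEY` come from the setup commands in this README. The function name, the `"enabled"` status, and the choice to skip when the key is missing are assumptions for illustration, not the logic in `ai_judge.py`.

```python
import os
from typing import Mapping, Optional


def ai_claim_review_status(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "skip" unless the judge is switched on and an API key is present."""
    env = os.environ if env is None else env
    if env.get("SCITRACE_AI_JUDGE") != "1":
        return "skip"  # default: no external API call, run stays reproducible
    if not env.get("DEEPSEEK_API_KEY"):
        return "skip"  # assumption: a missing key degrades to skip, not to an error
    return "enabled"
```

Passing the environment as a parameter keeps the decision testable without mutating the process environment.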
The current adapters are intentionally local and deterministic. In a production Bohrium/SciMaster setting, the local implementations behind the same interfaces can be replaced by:
- OpenAlex / PubMed / Uni-Parser / OmniScience evidence ingestion.
- Bohrium / Lebesgue compute jobs.
- Uni-Mol / DPA / Uni-Fold model calls.
- Uni-Lab-OS wet-lab execution hooks.
- Human/expert review queues for claim promotion and unresolved scientific boundary conditions.
- W3C PROV / Workflow Run RO-Crate export for FAIR scientific workflow records.
- OpenTelemetry GenAI spans for production observability.
- Offline RL, process-reward modeling, preference data, and tool-router training over validated traces.
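The adapter boundary that makes such swaps possible can be sketched with a structural interface. Everything named here (`EvidenceAdapter`, `LocalCorpusAdapter`, `gather_evidence`, and the `text` field) is hypothetical and not taken from `tools.py`. A production adapter would implement the same `search` method against a remote service.

```python
from typing import Dict, Iterable, List, Protocol


class EvidenceAdapter(Protocol):
    """Anything with this method can serve as an evidence source."""

    def search(self, query: str, limit: int) -> List[Dict[str, str]]:
        ...


class LocalCorpusAdapter:
    """Deterministic keyword match over an in-memory corpus."""

    def __init__(self, corpus: Iterable[Dict[str, str]]) -> None:
        self._corpus = list(corpus)

    def search(self, query: str, limit: int) -> List[Dict[str, str]]:
        terms = query.lower().split()
        hits = [
            doc for doc in self._corpus
            if any(term in doc["text"].lower() for term in terms)
        ]
        return hits[:limit]  # corpus order is preserved, so results are stable


def gather_evidence(adapter: EvidenceAdapter, query: str, limit: int = 3) -> List[Dict[str, str]]:
    """Runtime code depends only on the interface, not on the backend."""
    return adapter.search(query, limit)
```

Because `gather_evidence` accepts any object that matches the protocol, the trace and validation layers would not need to change when the backend does.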