This release introduces the NewResearcher components and a completely redesigned 3-phase workflow in the Streamlit UI.
- File: `src/chunking/token_chunker.py`
- Features:
- Sentence boundary preservation using NLTK
- Token counting with tiktoken (OpenAI's tokenizer)
- Configurable chunk size (default: 800 tokens) and overlap (default: 100 tokens)
- Metadata for each chunk (token count, sentence count, character positions)
- Usage:
```python
from chunking.token_chunker import TokenChunker

chunker = TokenChunker(chunk_size=800, chunk_overlap=100)
chunks = chunker.chunk_text(long_text)
```
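The packing strategy — fill each chunk with whole sentences up to the token budget, then carry a few trailing sentences into the next chunk as overlap — can be sketched as follows. This is a simplified standalone illustration (whitespace word counts stand in for real tiktoken counts), not the actual `TokenChunker` implementation:

```python
# Simplified sketch of token-aware chunking with sentence preservation.
# A whitespace word count stands in for a real tokenizer.

def chunk_sentences(sentences, chunk_size=800, chunk_overlap=100):
    """Greedily pack whole sentences into chunks of at most chunk_size tokens."""
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # approximate token count
        if current and current_tokens + n > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is met.
            overlap, kept = 0, []
            for prev in reversed(current):
                overlap += len(prev.split())
                kept.insert(0, prev)
                if overlap >= chunk_overlap:
                    break
            current, current_tokens = kept, overlap
        current.append(sent)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks always break between sentences, no sentence is ever split across two chunks, at the cost of chunks occasionally overshooting the budget by at most one sentence's worth of tokens.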
- File: `src/validation/source_validator.py`
- Features:
- Multi-dimensional scoring (credibility, recency, technical depth)
- Venue reputation scoring (Tier 1: NeurIPS, ICML, CVPR, etc.)
- Peer review detection
- Top-N filtering
- Scoring:
- Credibility (0-10): Based on venue reputation and indicators
- Recency (0-10): Based on publication year with decay
- Technical Depth (0-10): Based on equation density and technical terms
- Usage:
```python
from validation.source_validator import validate_sources

result = validate_sources(sources, top_n=5)
```
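As an illustration of how three 0-10 dimensions with recency decay might combine into one score, here is a hedged sketch; the half-life, weights, and function names are invented for the example and are not the actual `source_validator` values:

```python
from datetime import date

# Illustrative scoring sketch; weights and half-life are assumptions,
# not the values used by source_validator.

def recency_score(year, half_life=5):
    """0-10 score that decays exponentially with paper age in years."""
    age = max(0, date.today().year - year)
    return 10 * 0.5 ** (age / half_life)

def combined_score(credibility, recency, depth, weights=(0.4, 0.3, 0.3)):
    """Weighted average of the three 0-10 dimensions, itself on a 0-10 scale."""
    return weights[0] * credibility + weights[1] * recency + weights[2] * depth
```

With a 5-year half-life, a paper published this year scores 10 on recency and one published five years ago scores 5.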
- File: `src/search/academic_search.py`
- Features:
- Multi-source search (arXiv, Semantic Scholar, Exa, Tavily)
- Year filtering
- Relevance scoring
- PDF link extraction
- APIs Supported:
- arXiv (no API key required)
- Semantic Scholar (no API key required)
- Exa (requires EXA_API_KEY)
- Tavily (requires TAVILY_API_KEY)
- Usage:
```python
from search.academic_search import search_papers

result = search_papers("transformer architecture", max_results=10)
```
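When several back ends (arXiv, Semantic Scholar, Exa, Tavily) return overlapping hits, the results have to be merged and de-duplicated. A minimal sketch of one way to do that — the `merge_results` helper and the dict result shape are assumptions for illustration, not the module's API:

```python
# Hypothetical sketch: merge per-source result lists, de-duplicating by
# normalized title and keeping the first occurrence of each paper.

def merge_results(*result_lists):
    """Merge result lists from multiple search back ends."""
    seen, merged = set(), []
    for results in result_lists:
        for paper in results:
            key = paper["title"].lower().strip()  # crude normalization
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged
```

Ordering the input lists by source priority means the highest-priority source "wins" when the same paper appears in several back ends.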
- `process_paper_to_knowledge_graph()` - Phase 1: Runs ingestion through knowledge graph construction
  - Returns method_struct, equations, datasets, knowledge_graph
  - Sets the `ready_for_code_gen` flag
- `generate_code()` - Phase 2: Generates PyTorch code from extracted method information
  - Returns generated files
- `run_sandbox_validation()` - Phase 3: Creates Windows Sandbox and executes code
  - Returns validation results with execution output
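The gating between phases — each manual trigger is only valid once the previous phase has set its readiness flag — can be sketched with stubbed results. The class below is illustrative only; the real orchestrator's internals and return values will differ:

```python
# Illustrative sketch of phase gating with readiness flags.
# Results are stubbed; the real orchestrator runs the actual pipeline.

class PhasedOrchestrator:
    def __init__(self):
        self.ready_for_code_gen = False
        self.ready_for_sandbox = False

    def process_paper_to_knowledge_graph(self, pdf_path):
        # Phase 1: ingestion through knowledge graph construction (stubbed).
        result = {"method_struct": {}, "equations": [],
                  "datasets": [], "knowledge_graph": {}}
        self.ready_for_code_gen = True
        return result

    def generate_code(self):
        # Phase 2: only valid after Phase 1 has completed.
        if not self.ready_for_code_gen:
            raise RuntimeError("Run Phase 1 first")
        self.ready_for_sandbox = True
        return ["model.py", "train.py"]  # stand-in file names

    def run_sandbox_validation(self):
        # Phase 3: only valid after Phase 2 has completed.
        if not self.ready_for_sandbox:
            raise RuntimeError("Run Phase 2 first")
        return {"status": "ok", "stdout": "", "stderr": ""}
```

The flags are what let the UI disable the Phase 2 and Phase 3 buttons until their prerequisites have run.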
- RAG Search - Original RAG functionality
- Pipeline (v2) - New 3-phase workflow
- NewResearcher - New tools (chunking, validation, search)
- Testing - Testing and validation suite
- Dashboard - Project overview
Phase 1 stages:
- Document Ingestion
- Vision Processing
- Chunking
- Method Extraction
- Equation Processing
- Dataset Processing
- Knowledge Graph Construction
UI: Shows progress for each stage, extracted method summary, confidence scores
Phase 2:
- Trigger: Manual button press
- Action: Generate PyTorch code from extracted method
- Output: View generated code files before proceeding
Phase 3:
- Trigger: Manual button press
- Action: Create Windows Sandbox, execute code
- Output: Live execution results, stdout/stderr, logs
Token-Aware Chunking:
- Select paper
- Configure chunk size and overlap
- View chunk metrics (total tokens, avg tokens/chunk)
- Preview individual chunks
Source Validation:
- Manual entry or load from paper
- Multi-dimensional scoring
- Top sources display with detailed metrics
Academic Search:
- Multi-source search
- Year filtering
- Results with abstracts, citations, PDF links
- Run coverage evaluation
- Generate calibration data
- Test with sample sources
- View scoring breakdown
- Test sandbox creation
- Execute sample code
- View results
- Validate pipeline results for selected paper
- Check for required output files
New files:
- `src/chunking/__init__.py`
- `src/chunking/token_chunker.py`
- `src/validation/__init__.py`
- `src/validation/source_validator.py`
- `src/search/__init__.py`
- `src/search/academic_search.py`
- `src/app_streamlit_v2.py` (new UI)
- `PHASE6_SUMMARY.md` (this file)
Modified files:
- `src/agents/orchestrator.py` - Added 3-phase workflow methods
- `src/app_streamlit.py` - Replaced with v2
New optional dependencies:
```
tiktoken>=0.5.0   # For token counting
nltk>=3.8.0       # For sentence tokenization
requests>=2.31.0  # For API calls (already required)
```
Install with:
```
pip install tiktoken nltk
```

For academic search APIs, set:

```
EXA_API_KEY=your_exa_key
TAVILY_API_KEY=your_tavily_key
```

Launch the UI with `streamlit run src/app_streamlit.py`, then:

- Upload a PDF in the "Pipeline (v2)" tab
- Click "Start Research Phase" (Phase 1)
- Review extracted method and knowledge graph
- Click "Generate Code" (Phase 2)
- Review generated code
- Click "Create Sandbox & Run" (Phase 3)
- View execution results
- Go to "NewResearcher" tab
- Select tool: Token-Aware Chunking, Source Validation, or Academic Search
- Configure parameters
- Run and view results
- Token-aware chunking with sentence preservation
- Source validation with multi-dimensional scoring
- Academic search across multiple sources
- 3-phase workflow with manual triggers
- Code preview before sandbox execution
- Sandbox execution with live output
- Testing & Validation tab
- Modern UI with phase cards
- Test the new workflow end-to-end
- Add more academic search sources (Google Scholar, PubMed)
- Implement caching for search results
- Add export functionality for generated code
- Integrate conformal prediction confidence scores into UI
- The old `process_paper()` method is still available for backward compatibility
- NewResearcher components are optional - the system works without API keys
- Sandbox execution requires Windows Sandbox to be enabled on Windows
- Token chunking falls back to approximate counting if tiktoken is not installed
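That fallback can be as simple as a characters-per-token heuristic. The sketch below is illustrative of the pattern (try tiktoken, degrade gracefully); it is not the project's actual fallback code:

```python
def count_tokens(text):
    """Count tokens with tiktoken when available, else approximate.

    The 4-characters-per-token heuristic is a common rule of thumb for
    English text under OpenAI tokenizers; treat it as a rough estimate.
    """
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return max(1, len(text) // 4)
```

Because the fallback only overestimates or underestimates by a modest factor, chunk sizes stay in a usable range even without tiktoken installed.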