Research2Text: Development Analysis & Recommendations

Executive Summary

This document analyzes the current state of Research2Text, identifies critical issues, and gives actionable recommendations for improving the project. It covers the NewResearcher integration, code execution isolation alternatives, knowledge graph quality issues, embedding strategies, and architectural improvements.

1. NewResearcher Integration Analysis

What NewResearcher Brings (CrewAI-based)

Architecture:

Uses CrewAI framework (higher-level abstraction than custom orchestrator)
5 specialized agents: Planner → Search → Validator → Extractor → Synthesizer
Token-aware pipeline with strict limits (800 tokens/chunk, 3000 chars/source)
External search integration (Exa, Tavily) for live research

Key Strengths:

Better chunking strategy: Sentence-aware with paragraph boundary preservation
Source validation: Explicit credibility scoring (1-10) before extraction
Token safety: Hard limits at every layer prevent LLM overflow
Citation-focused: Built for generating reports with verified references
Production patterns: Proper logging, env-based config, retry logic

Integration Opportunities:

Component	Current Research2Text	NewResearcher	Integration Strategy
Orchestration	Custom Orchestrator	CrewAI	Keep custom - more control over paper-to-code flow
Chunking	Word-based (700 words)	Token-based (800 tokens)	Adopt NewResearcher - sentence boundaries are better
Validation	Post-hoc AST check	Pre-extraction source scoring	Add source validation before method extraction
Search	Local ChromaDB only	Exa + Tavily APIs	Add as optional for live paper discovery
Token Management	None	Strict limits everywhere	Implement throughout

Recommendation: Integrate NewResearcher's chunking, token management, and source validation patterns into Research2Text. Keep the custom orchestrator for the paper-to-code pipeline (more control than CrewAI's black box).
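
The token-safety pattern recommended here can be sketched as a small guard applied before any text reaches the LLM. This is a minimal sketch, not NewResearcher's actual code: the function names and the four-characters-per-token estimate are assumptions, and only the 800-token and 3,000-character limits come from the text above. A real implementation would count tokens with the model's tokenizer.

```python
MAX_SOURCE_CHARS = 3000   # per-source character cap quoted above
MAX_CHUNK_TOKENS = 800    # per-chunk token cap quoted above


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return (len(text) + 3) // 4


def clamp_source(text: str, max_chars: int = MAX_SOURCE_CHARS) -> str:
    """Cut a source down to max_chars, preferring to stop at a sentence end."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_stop = cut.rfind(". ")
    # Only back up to a sentence end if that keeps at least half the budget
    if last_stop >= max_chars // 2:
        return cut[: last_stop + 1]
    return cut


def fits_chunk_budget(text: str, max_tokens: int = MAX_CHUNK_TOKENS) -> bool:
    """True if the estimated token count is within the chunk budget."""
    return estimate_tokens(text) <= max_tokens
```

Calling `clamp_source` on every retrieved source and `fits_chunk_budget` on every chunk before prompting gives one place to enforce the limits, rather than scattering length checks across agents.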

2. Code Execution Isolation: Docker Alternatives

The Problem

Current validator.py:78 runs generated code directly:

res = run_and_capture(["python", "train.py"], cwd=sandbox, timeout=90)

Risks:

No isolation from host system
Generated code can access filesystem, network
Malicious paper → malicious code execution
Dependency conflicts with host Python

Why Docker is Heavyweight

Concern	Reality
Startup time	About a second or more per container with a cached image; far longer if the image must be pulled or built first
Resource overhead	~100MB+ per container
Complexity	Requires Docker daemon, image management
Cross-platform	Docker Desktop issues on Windows/Mac
Cleanup	Orphaned containers/volumes

Better Alternatives (In Order of Preference)

Option 1: Firejail (Recommended for Linux)

# Install: sudo apt install firejail
firejail --noprofile --private=/tmp/sandbox python train.py

Pros:

Native Linux namespaces (no VM overhead)
Filesystem sandboxing (--private)
Network isolation (--net=none)
Resource limits (--rlimit-cpu, --rlimit-as, and the other --rlimit-* options)
Sub-second startup

Cons:

Linux only
Requires root for installation

Option 2: subprocess with RestrictedPython

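from RestrictedPython import compile_restricted
from RestrictedPython.Guards import safe_builtins

# Compile the source under RestrictedPython's policy, then run it with
# only the safe builtins exposed (no open, eval, exec, or __import__)
byte_code = compile_restricted(user_code, filename="<generated>", mode="exec")
gl = {'__builtins__': safe_builtins}
exec(byte_code, gl)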

Pros:

Pure Python (a pip-installable package, no system-level dependencies)
Prevents dangerous builtins (eval, exec, open)
Fast execution

Cons:

Can be bypassed by determined attackers (it restricts the language and is not a full sandbox)
Limited protection against infinite loops

Option 3: PyPy Sandbox (Experimental)

True sandboxing at interpreter level
Very secure but limited library support

Option 4: Systemd-nspawn (Linux)

systemd-nspawn --directory=/var/lib/sandbox --private-network python train.py

Lightweight container (shares kernel)
Better than Docker for single-shot execution

Option 5: Cloud Functions (AWS Lambda, etc.)

True isolation
Pay-per-execution
Requires internet + cloud account

Recommended Implementation

Hybrid Approach:

# validator.py - improved isolation
import shutil
from pathlib import Path

# RunResult and run_and_capture are assumed to be the existing helpers in validator.py

def run_isolated(code_dir: Path, timeout: int = 90) -> RunResult:
    """Run code with best available isolation."""

    # Try Firejail first (Linux production)
    if shutil.which("firejail"):
        return _run_firejail(code_dir, timeout)

    # Fallback to RestrictedPython (cross-platform dev)
    return _run_restricted(code_dir, timeout)

def _run_firejail(code_dir: Path, timeout: int) -> RunResult:
    cmd = [
        "firejail",
        "--noprofile",
        f"--private={code_dir}",  # Mount the code directory as the sandboxed home
        "--net=none",  # No network
        f"--rlimit-cpu={timeout}",  # CPU-time limit in seconds
        "--rlimit-as=1073741824",  # Address-space limit: 1 GiB, given in bytes
        "python", "train.py"
    ]
    return run_and_capture(cmd, cwd=code_dir, timeout=timeout)

def _run_restricted(code_dir: Path, timeout: int) -> RunResult:
    # Use RestrictedPython for the entry point
    # Fall back to subprocess with timeout for imports
    raise NotImplementedError("RestrictedPython fallback is not implemented yet")
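
The `_run_restricted` stub leaves open how the subprocess fallback should bound runaway code. One hedged possibility, using only the standard library, is to apply POSIX resource limits in the child process before it starts. The names `LimitedRun`, `_cap`, and `run_with_limits` are hypothetical, the approach works on POSIX systems only, and it limits CPU, memory, and wall-clock time without providing any filesystem or network isolation.

```python
import resource
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LimitedRun:
    returncode: int      # -1 when the wall-clock timeout fired
    stdout: str
    stderr: str
    timed_out: bool


def _cap(kind: int, value: int) -> None:
    """Set the soft limit to value without exceeding the existing hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, hard))


def run_with_limits(script: Path, timeout: int = 90,
                    mem_bytes: int = 1 << 30) -> LimitedRun:
    """Run a script in a child process with CPU, memory, and time limits."""

    def apply_limits() -> None:
        # Runs in the child just before exec (POSIX only)
        _cap(resource.RLIMIT_CPU, timeout)
        _cap(resource.RLIMIT_AS, mem_bytes)

    try:
        proc = subprocess.run(
            [sys.executable, script.name],
            cwd=script.parent,
            capture_output=True,
            text=True,
            timeout=timeout,
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return LimitedRun(-1, "", "", True)
    return LimitedRun(proc.returncode, proc.stdout, proc.stderr, False)
```

A call such as `run_with_limits(code_dir / "train.py", timeout=90)` would slot in where the stub currently sits. The memory default mirrors the 1 GiB figure used for Firejail and may be too tight for real training code.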

Verdict: Use Firejail for production Linux + RestrictedPython fallback for cross-platform development. Avoid Docker for this use case.

3. Knowledge Graph Quality Issues

Current Implementation Analysis

Looking at graph_builder_agent.py:20-121:

# Current issues:
1. Nodes created only from method_struct (limited fields)
2. No section/chunk-level node extraction
3. No relationship inference (only hardcoded "contains", "uses")
4. No entity linking (datasets not canonicalized)
5. Properties are empty dictionaries

Why Quality is Low

Issue	Current Behavior	Expected Behavior
Limited node types	Only Paper, Algorithm, Dataset, Equation, Citation	Should include: Method, Architecture, Hyperparameter, Metric, Baseline, Ablation
No section context	Single paper node	Should have: Abstract, Introduction, Method, Experiments, Results nodes
Flat structure	Paper → Algorithm (1 hop)	Should have: Paper → Section → Subsection → Method → Component
No semantic edges	Only "contains", "uses"	Should have: "implements", "extends", "compares_to", "achieves", "outperforms"
Empty properties	`{}`	Should have: confidence scores, source text, page numbers

Root Causes

Input data is too coarse: Only uses method_struct (high-level) not chunks (detailed)
No LLM involvement: Graph is built with simple dictionary lookups
No entity resolution: "CIFAR-10" and "CIFAR10" become separate nodes
Static schema: Node types hardcoded, not inferred from content

Recommendations for High-Quality Knowledge Graphs

Approach 1: LLM-Assisted Graph Extraction (Recommended)

Add a new extraction step using the LLM:

# New: graph_extraction_agent.py
GRAPH_EXTRACTION_PROMPT = """
Analyze the following research paper text and extract a detailed knowledge graph.

TEXT:
{text}

Extract:
1. **Entities** (with type, label, and properties)
   - Paper sections (Abstract, Introduction, Method, etc.)
   - Algorithms and models
   - Datasets and benchmarks
   - Hyperparameters (with values)
   - Metrics and results
   - Claims and contributions

2. **Relationships** (with type and evidence)
   - "achieves" (model → metric)
   - "uses" (model → dataset)
   - "outperforms" (model → baseline)
   - "implements" (code → algorithm)
   - "depends_on" (component → component)

Return JSON:
{
  "entities": [
    {"id": "...", "type": "...", "label": "...", "properties": {...}}
  ],
  "relationships": [
    {"source": "...", "target": "...", "type": "...", "evidence": "..."}
  ]
}
"""

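Because GRAPH_EXTRACTION_PROMPT contains literal JSON braces, filling it with `str.format` would fail on them, and the LLM reply cannot be trusted to be clean JSON. The sketch below shows one way to handle both. The helper names `fill_prompt` and `parse_graph_response` are hypothetical, and the validation rules (required keys, dropping edges that point at unknown ids) are assumptions rather than project behavior.

```python
import json
from typing import Any, Dict


def fill_prompt(template: str, text: str) -> str:
    """Insert the paper text without str.format, which would trip on the JSON braces."""
    return template.replace("{text}", text)


def parse_graph_response(raw: str) -> Dict[str, Any]:
    """Parse the LLM reply and keep only well-formed entities and relationships."""
    empty: Dict[str, Any] = {"entities": [], "relationships": []}
    # Tolerate replies that wrap the JSON object in extra prose
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return empty

    raw_entities = data.get("entities")
    raw_rels = data.get("relationships")
    if not isinstance(raw_entities, list):
        raw_entities = []
    if not isinstance(raw_rels, list):
        raw_rels = []

    # An entity needs a string id plus type and label
    entities = [
        e for e in raw_entities
        if isinstance(e, dict)
        and isinstance(e.get("id"), str)
        and "type" in e and "label" in e
    ]
    known_ids = {e["id"] for e in entities}

    # A relationship must connect two known entities
    relationships = [
        r for r in raw_rels
        if isinstance(r, dict)
        and isinstance(r.get("source"), str) and r["source"] in known_ids
        and isinstance(r.get("target"), str) and r["target"] in known_ids
        and "type" in r
    ]
    return {"entities": entities, "relationships": relationships}
```

Returning an empty graph on malformed output lets the pipeline continue with the rule-based graph instead of failing on one bad reply.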
Approach 2: Chunk-Level Graph Building

Process each chunk individually, then merge:

# Modified graph_builder_agent.py
def process(self, message: AgentMessage) -> AgentResponse:
    payload = message.payload  # Assumption: AgentMessage exposes its data as a dict named payload
    chunks = payload.get("chunks", [])

    # Build subgraph for each chunk
    subgraphs = []
    for chunk in chunks:
        subgraph = self._extract_subgraph(chunk)  # LLM call per chunk
        subgraphs.append(subgraph)

    # Merge subgraphs with entity resolution
    merged = self._merge_subgraphs(subgraphs)

    return AgentResponse(data=merged)
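
The merge step with entity resolution could look roughly like the sketch below. It assumes each subgraph uses the entities/relationships JSON shape from the Approach 1 prompt. The normalization rule (lowercase, strip everything except letters and digits) is a deliberately simple assumption that collapses "CIFAR-10" and "CIFAR10"; it is not the fuzzy matching in dataset_loader_agent.

```python
import re
from typing import Any, Dict, List


def canonical_key(label: str) -> str:
    """Normalize a label so spelling variants collapse to one key."""
    return re.sub(r"[^a-z0-9]+", "", label.lower())


def merge_subgraphs(subgraphs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-chunk subgraphs, unifying entities that share type and canonical label."""
    entities: Dict[str, Dict[str, Any]] = {}   # merged id -> entity
    edges = set()                               # (source, type, target) triples

    for graph in subgraphs:
        id_map = {}  # chunk-local id -> merged id
        for ent in graph.get("entities", []):
            merged_id = f'{ent["type"].lower()}:{canonical_key(ent["label"])}'
            id_map[ent["id"]] = merged_id
            # The first spelling seen is kept as the display label
            entities.setdefault(
                merged_id,
                {"id": merged_id, "type": ent["type"], "label": ent["label"]},
            )
        for rel in graph.get("relationships", []):
            src = id_map.get(rel["source"])
            dst = id_map.get(rel["target"])
            if src and dst:
                edges.add((src, rel["type"], dst))

    return {
        "entities": list(entities.values()),
        "relationships": [
            {"source": s, "type": t, "target": d} for s, t, d in sorted(edges)
        ],
    }
```

Keying on type plus normalized label keeps a dataset and a model with the same name apart, and storing edges in a set removes duplicates that several chunks report independently.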

Approach 3: Hybrid (Current + LLM)

Keep current structure for speed, add LLM enrichment for detail:

# Two-phase graph building
phase1 = self._build_base_graph(method_struct)  # Fast, rule-based
phase2 = self._enrich_with_llm(chunks)  # Detailed, LLM-based
return merge(phase1, phase2)

Quick Wins

Add section nodes: Parse paper structure ("Abstract", "1. Introduction", etc.)
Canonicalize entities: Use dataset_loader_agent's fuzzy matching for datasets
Add confidence scores: Copy from method_struct.confidence
Add source evidence: Store which chunk each node came from
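
The section-node quick win can be prototyped with a regular expression over the extracted text. The sketch below is a heuristic under stated assumptions: headings sit on their own line, are either "Abstract", "References", or a numbered title such as "1. Introduction" or "2.1 Architecture", and are at most about 80 characters long. The name `find_sections` is hypothetical, and body lines that happen to start with a number and a capitalized word will be misread as headings.

```python
import re
from typing import List, Tuple

# "Abstract" / "References", or numbered headings like "1. Introduction" or "3.2 Training Setup"
HEADING_RE = re.compile(
    r"^(?:(?P<num>\d+(?:\.\d+)*)\.?[ \t]+(?P<title>[A-Z][^\n]{0,80})"
    r"|(?P<plain>Abstract|References))[ \t]*$",
    re.MULTILINE,
)


def find_sections(text: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs in document order."""
    matches = list(HEADING_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section body runs until the next heading or the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        if m.group("plain"):
            heading = m.group("plain")
        else:
            heading = f'{m.group("num")} {m.group("title").strip()}'
        sections.append((heading, text[m.end():end].strip()))
    return sections
```

Each returned pair can become a Section node linked to the Paper node, with the body retained as source evidence for nodes extracted from it.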

4. Content Chunking Strategy Analysis

Current Implementation

Research2Text (utils.py:18-73):

def chunk_text_by_words(text, chunk_size_words=700, overlap_words=100):
    words = text.split()
    # Simple word splitting - no sentence awareness

NewResearcher (text_chunker.py:35-116):

def chunk_text(text, max_tokens=800, overlap_tokens=100):
    sentences = _split_into_sentences(text)
    # Sentence-aware with token counting

Comparison

Aspect	Research2Text	NewResearcher	Winner
Unit	Words (700)	Tokens (800)	Tokens (LLM-native)
Boundary	Arbitrary word count	Sentence boundaries	Sentences
Overlap	Words	Tokens	Tokens
Hard limit	None	3000 chars	NewResearcher
Token counting	None	tiktoken	NewResearcher

Problems with Current Chunking

Sentence breaking: "The attention mechanism (see Section 3) works by..." might split mid-sentence
Context loss: Mathematical expressions spanning chunks get broken
Inefficient LLM usage: 700 words ≈ 1000+ tokens, above the 800-token chunk limit that NewResearcher enforces
No semantic chunking: Related paragraphs may be separated

Recommended Chunking Strategy

# Enhanced chunker combining both approaches
# parse_sections (assumed to return one text block per section), split_sentences,
# count_tokens, create_chunk, and get_overlap_sentences are helpers still to be
# written; Chunk is the chunk record type.
from typing import List

def chunk_text_enhanced(
    text: str,
    max_tokens: int = 800,
    overlap_tokens: int = 100,
    respect_sections: bool = True
) -> List[Chunk]:
    """
    Advanced chunking with:
    - Sentence boundaries
    - Section awareness
    - Semantic coherence
    """

    # Step 1: Identify section boundaries
    sections = parse_sections(text) if respect_sections else [text]

    chunks = []
    for section in sections:
        # Step 2: Split into sentences
        sentences = split_sentences(section)

        # Step 3: Group sentences respecting:
        #   - Token limit
        #   - Section boundaries (a chunk never crosses one)
        #   - Semantic similarity (optional, not shown)
        current_chunk = []
        current_tokens = 0

        for sentence in sentences:
            sent_tokens = count_tokens(sentence)

            # Flush when adding this sentence would exceed the limit
            if current_chunk and current_tokens + sent_tokens > max_tokens:
                # Save current chunk
                chunks.append(create_chunk(current_chunk, overlap_tokens))

                # Start new chunk with overlap
                current_chunk = get_overlap_sentences(current_chunk, overlap_tokens)
                current_tokens = sum(count_tokens(s) for s in current_chunk)

            current_chunk.append(sentence)
            current_tokens += sent_tokens

        # Save the final, partially filled chunk of this section
        if current_chunk:
            chunks.append(create_chunk(current_chunk, overlap_tokens))

    return chunks
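
For experimentation, the sentence-packing loop can be made runnable with stand-in helpers. This is a minimal sketch, not NewResearcher's text_chunker.py: `split_sentences` is a naive regex splitter, `approx_tokens` counts whitespace-separated words as a placeholder for a real tokenizer such as tiktoken, and `chunk_sentences` is a hypothetical name. A single sentence longer than `max_tokens` still becomes its own oversized chunk.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def approx_tokens(text: str) -> int:
    """Stand-in token counter (assumption: one token per whitespace-separated word)."""
    return len(text.split())


def chunk_sentences(
    text: str,
    max_tokens: int = 800,
    overlap_tokens: int = 100,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Pack whole sentences into chunks, carrying trailing sentences over as overlap."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget
            tail: List[str] = []
            tail_tokens = 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if tail_tokens + t > overlap_tokens:
                    break
                tail.insert(0, prev)
                tail_tokens += t
            # Drop the overlap if it would push the new chunk over the limit
            if tail_tokens + n > max_tokens:
                tail, tail_tokens = [], 0
            current, current_tokens = tail, tail_tokens
        current.append(sentence)
        current_tokens += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Passing a tokenizer-backed function as `count_tokens` swaps the word estimate for real token counts without touching the packing logic.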

Advanced: Semantic Chunking

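# Use embeddings to keep semantically similar content together
from collections import defaultdict
from typing import List

from sklearn.cluster import KMeans

def semantic_chunking(sentences: List[str], n_chunks: int, model) -> List[List[str]]:
    # model: any encoder with an encode() method, e.g. a SentenceTransformer
    embeddings = model.encode(sentences)
    clusters = KMeans(n_clusters=n_chunks, n_init=10, random_state=0).fit_predict(embeddings)

    # Group sentences by cluster (a cluster may mix sentences from distant parts of the text)
    chunks = defaultdict(list)
    for sent, cluster in zip(sentences, clusters):
        chunks[cluster].append(sent)

    return list(chunks.values())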
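
KMeans groups sentences by similarity alone, so a chunk can interleave sentences from unrelated parts of the paper. A different technique that keeps reading order is to cut wherever the similarity between adjacent sentences drops. The sketch below takes precomputed embeddings so it does not depend on a particular model. The function names and the 0.5 threshold are assumptions that would need tuning on real papers.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_topic_shift(
    sentences: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return chunks
```

Unlike the clustering version, the number of chunks is not fixed in advance, so a token limit still has to be enforced on each resulting chunk.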

Recommendation: Adopt NewResearcher's token-based, sentence-aware chunking. Add section boundary detection for academic papers.

5. Embedding Strategy: Gemini 2.0 vs Current

Current Setup

# index_documents.py:61
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
# 384 dimensions, 22M parameters

Gemini 2.0 Text Embedding

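Specs (from Google's announcement; confirm against the current API documentation before relying on them):

Model: models/text-embedding-004 (or similar; this ID predates Gemini 2.0, so confirm which embedding model is actually intended)
Dimensions: 768 (configurable)
Context window: 8,192 tokens as announced (vs MiniLM's 512). Input limits differ between Google's embedding models, and text-embedding-004 is documented with a 2,048-token limit; all-MiniLM-L6-v2 itself truncates input at 256 word pieces by default
Multilingual: Yes
MTEB Score: ~68+ (vs MiniLM's ~62); the two figures may come from different MTEB variants, so treat the gap as indicative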

Comparison

Feature	MiniLM-L6-v2	Gemini 2.0	Advantage
Size	22M params	Unknown (likely larger)	MiniLM (local)
Speed	Fast (CPU)	API latency	MiniLM
Quality	MTEB ~62	MTEB ~68+	Gemini
Context	512 tokens	8,192 tokens	Gemini (16x!)
Cost	Free (local)	API pricing	MiniLM
Privacy	Local	Cloud	MiniLM
Multilingual	Limited	Strong	Gemini
Offline	Yes	No	MiniLM

Critical Consideration: Context Window

If the selected model really accepts 8K tokens of input, the context window is a major advantage for research papers:

# Current: 512 tokens ≈ 350 words
# Can fit: 1-2 paragraphs

# Gemini: 8,192 tokens ≈ 6,000 words
# Can fit: Entire paper section or short paper

This means:

No more chunking needed for short papers
Section-level embeddings instead of paragraph-level
Better semantic coherence in retrieval

Hybrid Recommendation

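# config.py - embedding configuration
import os

# GeminiEmbedder and LocalEmbedder are assumed to be importable from src/embedders.py

class EmbeddingConfig:
    # Primary: Gemini for quality (if API available)
    GEMINI_MODEL = "models/text-embedding-004"
    GEMINI_DIMENSIONS = 768
    GEMINI_CONTEXT = 8192  # As announced; verify the input limit of the chosen model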

    # Fallback: Local MiniLM (offline/privacy)
    LOCAL_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
    LOCAL_DIMENSIONS = 384
    LOCAL_CONTEXT = 512

    # Strategy
    @classmethod
    def get_embedder(cls):
        if os.getenv("GEMINI_API_KEY"):
            return GeminiEmbedder(cls.GEMINI_MODEL)
        return LocalEmbedder(cls.LOCAL_MODEL)

Implementation Plan

Phase 1: Add Gemini as Optional

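# new file: src/embedders.py
import os
from typing import List

class GeminiEmbedder:
    def __init__(self, model_name="models/text-embedding-004"):
        # Imported lazily so the dependency stays optional
        # (this is the legacy google-generativeai SDK)
        import google.generativeai as genai
        genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
        self._genai = genai  # Keep a handle: the import above is local to __init__
        self.model = model_name

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Batch embedding: a list of texts returns a list of vectors
        result = self._genai.embed_content(
            model=self.model,
            content=texts,
            task_type="retrieval_document"
        )
        return result['embedding']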

Phase 2: Adaptive Chunking

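If using Gemini: Larger chunks (2000-4000 tokens), provided they stay within the input limit of the selected model
If using MiniLM: Keep the current 700 words for now, but note that this is about 1,000 tokens, so anything past MiniLM's 512-token limit is truncated before embedding; smaller chunks fit the model better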
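
One way to make the chunk size follow the embedder is to derive it from the model's input limit. The sketch below is an assumption-laden illustration: `chunk_budget` is a hypothetical helper, the 10% safety margin is arbitrary, and the 4,000-token ceiling is taken from the upper end of the range suggested for Gemini.

```python
def chunk_budget(context_tokens: int, margin_pct: int = 10,
                 ceiling: int = 4000) -> int:
    """Pick a chunk size in tokens that fits within the embedder's input limit."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    # Leave headroom for tokenizer differences, then cap very large windows
    usable = context_tokens * (100 - margin_pct) // 100
    return max(1, min(usable, ceiling))
```

With the limits quoted in this section, `chunk_budget(512)` gives 460 tokens for MiniLM and `chunk_budget(8192)` gives 4,000 for an 8K-context model.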

Phase 3: Unified Interface

# Both embedders implement same interface
embedder = EmbeddingConfig.get_embedder()
embeddings = embedder.embed(chunks)  # Works with either
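
The shared interface can be written down explicitly so that indexing code never depends on a concrete embedder. The sketch below uses `typing.Protocol`. The `Embedder` protocol, the `HashingEmbedder` test double, and `index_chunks` are hypothetical names introduced for illustration and are not part of the existing codebase.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Anything with an embed() method that maps texts to vectors."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        ...


class HashingEmbedder:
    """Toy stand-in for tests: deterministic vectors, no model or API needed."""

    def __init__(self, dimensions: int = 8):
        self.dimensions = dimensions

    def embed(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            vec = [0.0] * self.dimensions
            for word in text.lower().split():
                # Bucket each word by the sum of its character codes
                vec[sum(map(ord, word)) % self.dimensions] += 1.0
            vectors.append(vec)
        return vectors


def index_chunks(embedder: Embedder, chunks: List[str]) -> List[List[float]]:
    """Caller code depends only on the Embedder protocol, not a concrete class."""
    return embedder.embed(chunks)
```

A test double like this lets the indexing and retrieval pipeline be unit tested offline, while GeminiEmbedder and a local MiniLM wrapper satisfy the same protocol in production.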

Verdict on Gemini 2.0

Yes, integrate it, but as an optional enhancement, not a replacement:

Keep MiniLM as default (local, free, offline)
Add Gemini as premium option (better quality, larger context)
Let users choose based on their needs

If the chosen model does offer the announced 8K context window, that alone justifies the integration for research papers.

6. Comprehensive Improvement Roadmap

Phase 1: Foundation (Weeks 1-2)

Code Execution Safety

Implement Firejail sandbox (Linux)
Add RestrictedPython fallback (cross-platform)
Add resource limits (CPU, memory, time)
Security audit of generated code patterns

Chunking Improvements

Adopt token-based chunking from NewResearcher
Add sentence boundary detection
Add section-aware splitting
Benchmark: chunk quality vs retrieval accuracy

Phase 2: Quality (Weeks 3-4)

Knowledge Graph Enhancement

Add LLM-assisted graph extraction
Implement entity canonicalization
Add relationship inference ("outperforms", "implements")
Add confidence scores to nodes/edges
Visual graph export (NetworkX + Matplotlib)

Embedding Upgrade

Add Gemini 2.0 embedder (optional)
Implement adaptive chunking based on embedder
Add embedding quality evaluation
Support for multimodal embeddings (future)

Phase 3: Integration (Weeks 5-6)

NewResearcher Integration

Port token management system
Add source validation agent
Integrate search tools (Exa/Tavily) as optional
Unified logging format

Performance & Scale

Async agent execution (where possible)
Caching layer for LLM responses
Parallel chunk processing
Database migration (ChromaDB → PostgreSQL + pgvector for scale)

Phase 4: Polish (Weeks 7-8)

Testing & Reliability

Unit tests for each agent
Integration tests for full pipeline
Benchmark suite (extraction accuracy, code correctness)
CI/CD pipeline

Documentation

API documentation
Architecture decision records (ADRs)
Deployment guide
Contributing guidelines

7. Quick Wins (Do These First)

Fix chunking (1 day)
- Copy NewResearcher's text_chunker.py
- Replace chunk_text_by_words with token-based version
Add Firejail (1 day)
- Install on Linux dev machine
- Add _run_firejail() to validator.py
- Test with generated PyTorch code
Improve knowledge graph (2 days)
- Add section parsing (regex for "1. Introduction")
- Add properties (confidence, source chunk)
- Canonicalize dataset names
Add Gemini embeddings (1 day)
- Get API key
- Implement GeminiEmbedder class
- Add toggle in config

8. Architecture Decision Summary

Decision	Current	Recommended	Rationale
Sandbox	None	Firejail + RestrictedPython	Security without Docker overhead
Chunking	Word-based	Token-based, sentence-aware	Better LLM compatibility
Embeddings	MiniLM only	MiniLM + Gemini (optional)	Quality vs cost flexibility
Knowledge Graph	Rule-based	LLM-assisted + rule-based	Richer, more accurate graphs
Orchestration	Custom	Keep custom	More control than CrewAI
NewResearcher	Separate	Integrate components	Best of both worlds

9. Risk Assessment

Risk	Likelihood	Impact	Mitigation
Firejail Linux-only	High	Medium	Provide RestrictedPython fallback
Gemini API costs	Medium	Low	Make optional, keep MiniLM default
LLM graph extraction slow	High	Low	Cache results, make async
Token-based chunking breaks existing behavior	Low	High	Test with existing papers
Integration complexity	Medium	Medium	Incremental integration, not rewrite

10. Success Metrics

Define measurable targets:

Metric	Current	Target	Measurement
Knowledge graph nodes (avg)	10-15	40-60	Nodes per paper
Code validation success	~60%	>85%	% of papers with runnable code
RAG retrieval accuracy	Unknown	>80%	Human evaluation on 50 queries
Pipeline execution time	~120s	<60s	End-to-end paper processing
Token efficiency	Unknown	<4000 tokens/paper	Total LLM tokens used

Conclusion

Research2Text has a solid foundation but needs focused improvements in:

Security (sandboxing)
Quality (knowledge graphs, chunking)
Flexibility (embedding options)
Integration (NewResearcher components)

The recommended path is incremental enhancement, not a rewrite. Keep the custom orchestrator (it's working well), adopt NewResearcher's best practices (chunking, token management), and add Gemini as an optional upgrade.

Next immediate actions:

Implement Firejail sandbox
Port NewResearcher's chunking
Add Gemini embedder
Enhance knowledge graph with LLM assistance

Document Version: 1.0
Date: 2026-03-17
Author: Claude Code Analysis
