DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation
Website • Interactive Demo • Leaderboard • Dataset
- Overview
- Quick Start
- Project Components
- Web Demo
- Dataset
- Evaluation Framework
- Leaderboard
- Documentation
- Contributing
- Citation
- License
DramaBench is a comprehensive benchmark for evaluating drama script continuation capabilities of large language models. It provides:
- Project Website - Interactive showcase with evaluation results and case studies
- Interactive Demo - Try script continuation with multiple LLMs (user-provided API key)
- Large-Scale Dataset - 1,103 drama scripts with human annotations
- Evaluation Framework - 6 independent dimensions with rigorous metrics
- Model Leaderboard - Compare 8 SOTA language models
- Case Studies - 24 curated examples with detailed analysis
- Evaluation Prompts - LLM-based labeling templates for all 6 dimensions
- Format Standards (Rule-based) - Screenplay format compliance
- Narrative Efficiency (LLM-labeled) - Story progression effectiveness
- Character Consistency (LLM-labeled) - Character voice and behavior
- Emotional Depth (LLM-labeled) - Emotional arc development
- Logic Consistency (LLM-labeled) - Factual coherence and continuity
- Conflict Handling (LLM-labeled) - Conflict development quality
- 1,103 unique drama scripts
- 8,824 total evaluations (1,103 scripts Γ 8 models)
- 8 state-of-the-art language models
- 6 independent evaluation dimensions
- 252 statistical significance tests (65.9% significant)
- 24 curated case studies
- Python 3.10+
- Web browser (Chrome, Safari, Firefox, or Edge)
Method 1: One-Click Start (Easiest)
cd DramaBench
./start_demo.sh

This will automatically:
- Start a local HTTP server on port 8000
- Open the demo in your default browser
- Navigate to http://localhost:8000
Method 2: Manual Server Start
cd DramaBench
# Using uv (if available)
uv run python -m http.server 8000
# Or using Python 3 directly
python3 -m http.server 8000
# Then open http://localhost:8000 in your browserDue to browser CORS restrictions, you must use a local HTTP server to view the demo. Opening HTML files directly (file:// protocol) will cause data loading errors.
An interactive, Apple-inspired web interface for exploring evaluation results and trying script continuation.
Website Features:
- Interactive leaderboard with dimension filters
- Case studies explorer with 24 examples
- Premium dark gradient design
- Fully responsive (mobile/tablet/desktop)
- Pure HTML/CSS/JavaScript (no frameworks)
Interactive Demo Features:
- Try script continuation with 4 SOTA models (GPT-5.2, Gemini 3, GLM-4.7, MiniMax M2.1)
- User-provided OpenRouter API key (stored locally)
- 500 drama scripts from the DramaBench dataset
- Official prompt template for generation
- Compare AI-generated vs. ground-truth continuations
- Matching Apple-style design
Pages:
- `index.html` - Main landing page
- `web/leaderboard.html` - Model rankings
- `web/cases.html` - Case studies browser
- `web/demo.html` - Interactive script continuation demo
View Live Website | Try Interactive Demo
Now Available on Hugging Face!
The DramaBench dataset is being released progressively to ensure quality and gather community feedback.
Current Release (v2.0):
- 500 Drama Scripts - Available now on Hugging Face
- Download: FutureMa/DramaBench
- Format: JSONL with structured metadata
- License: MIT License
- Usage: Load with the `datasets` library
Quick Start:
from datasets import load_dataset
# Load dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")
# Access samples
sample = dataset[0]
print(sample['title'])
print(sample['context'])
print(sample['continuation'])

Release Roadmap:
| Version | Samples | Status | Expected Release |
|---|---|---|---|
| v1.0 | 100 | Released | 2025-12-23 |
| v2.0 | 500 | Available | 2026-01-01 |
| v3.0 (Full) | 1,103 | Planned | Q2 2026 |
Full Dataset Contents (v3.0):
- 1,103 drama script contexts and continuations
- Model-generated continuations (8 SOTA models)
- Human annotations and quality assessments
- Multi-dimensional evaluation metrics
- Error taxonomy and classification
Now Available: LLM-based evaluation prompt templates for all 6 dimensions.
Location: prompts/ directory
Contents:
- `narrative_efficiency_prompt.txt` - Story progression effectiveness
- `character_consistency_prompt.txt` - Character voice and behavior consistency
- `emotional_depth_prompt.txt` - Emotional arc development
- `logic_consistency_prompt.txt` - Factual coherence and continuity
- `conflict_handling_prompt.txt` - Conflict development and resolution
- `dialogue_quality_prompt.txt` - Dialogue naturalness and purpose
Quick Start:
import json

# Load a prompt template
with open('prompts/narrative_efficiency_prompt.txt', 'r') as f:
    prompt = f.read()
# Fill placeholders
prompt = prompt.replace('{CONTEXT}', script_context)
prompt = prompt.replace('{CONTINUATION}', generated_continuation)
prompt = prompt.replace('{MODEL}', 'GPT-4')
prompt = prompt.replace('{SCRIPT_ID}', 'script_001')
# Send to LLM and get structured JSON output
response = llm_api_call(prompt)
evaluation = json.loads(response)

See prompts/README.md for detailed usage instructions.
Coming Soon: Full evaluation pipeline including:
- Statistical analysis scripts
- Visualization generation tools
- Reproducibility automation scripts
Visit dramabench.pages.dev to explore:
- Homepage - Project overview and statistics
- Leaderboard - Compare 8 SOTA models across 6 dimensions
- Case Studies - Browse 24 curated examples with detailed analysis
- Interactive Demo - Try script continuation yourself
Try it now: dramabench.pages.dev/web/demo.html
Experience drama script continuation with state-of-the-art language models:
Features:
- 500 Drama Scripts - Select from the DramaBench v2.0 dataset
- 4 SOTA Models - GPT-5.2, Gemini 3 Flash, GLM-4.7, MiniMax M2.1
- Your API Key - Uses the OpenRouter API (bring your own key)
- Compare Results - View AI-generated vs. ground-truth continuations side by side
- Apple Design - Beautiful, responsive interface
How to Use:
- Get your free API key from OpenRouter
- Visit the demo page
- Enter your API key (stored locally in your browser)
- Select a script from 500 options
- Choose your preferred model
- Generate and compare continuations
Cost: Pay-as-you-go through OpenRouter (typically $0.01-0.10 per generation)
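For reference, the same generation flow can be scripted outside the browser. The sketch below calls OpenRouter's OpenAI-compatible chat completions endpoint; the model ID and prompt wording are illustrative assumptions, not the demo's exact internals:

```python
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = "sk-or-..."  # your own OpenRouter key

def generate_continuation(context: str, model: str = "openai/gpt-4o") -> str:
    # Hypothetical prompt assembly; the demo uses its official template.
    prompt = f"Continue the following drama script:\n\n{context}"
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```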
Interactive Leaderboard
- Filter by dimension (overall + 6 dimensions)
- Expandable model details with per-dimension scores
- Rank badges (gold/silver/bronze)
- Real-time filtering and sorting
Case Studies Explorer
- 24 curated success/failure examples
- Filter by dimension and type
- Script excerpts with metrics
- Analysis insights and takeaways
Design
- Apple-inspired UI with premium dark gradients
- SF Pro font family (system fonts)
- Glassmorphism effects
- Smooth animations and transitions
- Fully responsive layout
- Pure HTML/CSS/JavaScript (no frameworks)
- Apple Design Language principles
- CSS Grid & Flexbox layouts
- Backdrop filters for glassmorphism
- CSS animations for smooth transitions
Regenerate web demo data from source:
cd DramaBench
uv run python web/scripts/process_data.py

This processes:
- 6 dimension metrics CSV files (8,824 evaluations)
- 24 case studies with detailed analysis
- Generates web-friendly JSON in `web/data/`
Hugging Face Dataset: FutureMa/DramaBench
Current Release: v2.0 (500 samples) - Available Now!
Load with Datasets Library:
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")
# Access a sample
sample = dataset[0]
print(f"Title: {sample['title']}")
print(f"Context: {sample['context'][:200]}...")
print(f"Continuation: {sample['continuation'][:200]}...")
print(f"Stats: {sample['stats']}")Analyze Dataset:
cd DramaBench
python scripts/load_dataset.py

Current Release (v2.0 - 500 samples):
- 500 high-quality drama scripts with context-continuation pairs
- Average context length: ~1,601 characters (~400 tokens)
- Average continuation length: ~1,600 characters (~400 tokens)
- Split types: 73% scene boundary, 27% middle
- Format: Fountain screenplay format (industry standard; see the excerpt below)
- Fields: `id`, `title`, `description`, `context`, `continuation`, `stats`
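For readers unfamiliar with Fountain, here is a short invented excerpt (not taken from the dataset) embedded in a Python string, showing the plain-text conventions the format relies on:

```python
# An invented Fountain-style excerpt (not from the dataset): scene heading,
# action line, character cue, parenthetical, and dialogue are all plain text.
fountain_sample = """INT. KITCHEN - NIGHT

MARA stands at the sink, her back to the door.

MARA
You were going to tell me when, exactly?

JONAS
(quietly)
Tonight. I was going to tell you tonight.
"""
```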
Release Roadmap:
| Version | Samples | Status | Release Date |
|---|---|---|---|
| v1.0 | 100 | Released | 2025-12-23 |
| v2.0 | 500 | Available | 2026-01-01 |
| v3.0 (Full) | 1,103 | Planned | Q2 2026 |
Full Benchmark (v3.0 - Planned):
- 1,103 complete drama scripts
- Model-generated continuations from 8 SOTA models
- Human annotations and quality assessments
- Multi-dimensional evaluation metrics (8,824 evaluations)
- Error taxonomy and classification
- Statistical significance test results
Format: JSONL with structured metadata
License: MIT License
DramaBench uses a hybrid evaluation system:
- Rule-Based Analysis (Format Standards)
  - 100% reproducible
  - Zero cost
  - Fountain syntax validation
- LLM-Based Labeling (5 content dimensions)
  - Structured feature extraction
  - Statistical metric calculation
  - Not direct scoring
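To make the rule-based side concrete, here is a toy sketch that flags scene-heading lines that are not fully uppercase; the real Format Standards checker applies a much richer Fountain rule set, so both the rule and the metric here are simplifications:

```python
def format_error_rate(script: str) -> float:
    """Toy Format Error Rate: fraction of non-empty lines that look like
    scene headings but violate the uppercase convention (simplified rule)."""
    lines = [l.strip() for l in script.splitlines() if l.strip()]
    errors = sum(
        1
        for line in lines
        if line.upper().startswith(("INT.", "EXT.", "I/E")) and line != line.upper()
    )
    return errors / len(lines) if lines else 0.0

print(format_error_rate("int. kitchen - night\n\nMARA\nHello."))  # ~0.33
```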
| Dimension | Type | Key Metrics | Description |
|---|---|---|---|
| Format Standards | Rule-based | Format Error Rate, Novelization Index, Dialogue-Action Ratio | Screenplay format compliance |
| Narrative Efficiency | LLM-labeled | Effective Narrative Rate (ENR), Beats Per Page | Story progression effectiveness |
| Character Consistency | LLM-labeled | Out-of-Character Rate, Voice Distinctiveness | Character voice and behavior consistency |
| Emotional Depth | LLM-labeled | Arc Score, Complexity Ratio | Emotional arc development |
| Logic Consistency | LLM-labeled | Logic Break Rate, Context Coherence | Factual coherence and logical continuity |
| Conflict Handling | LLM-labeled | Conflict Score, Drop Rate | Conflict development and resolution |
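As a sketch of the LLM-labeled pipeline, per-unit labels extracted by the judge model can be aggregated into the statistical metrics above; the field names below are assumptions modeled on the prompt outputs, not the exact schema:

```python
from statistics import mean

def aggregate_metrics(labels: list[dict]) -> dict:
    """Aggregate per-line judge labels into dimension metrics (illustrative).

    Each label is assumed to look like:
    {"is_effective_beat": True, "is_ooc": False, "has_logic_break": False}
    """
    if not labels:
        return {}
    return {
        "effective_narrative_rate": mean(l["is_effective_beat"] for l in labels),
        "ooc_rate": mean(l["is_ooc"] for l in labels),
        "logic_break_rate": mean(l["has_logic_break"] for l in labels),
    }
```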
Statistical Significance:
- 252 Mann-Whitney U tests performed
- 166/252 comparisons significant (65.9% with FDR correction)
- Beats Per Page: Most differentiating (26/28 significant)
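The testing protocol can be reproduced along these lines: for each metric, run pairwise Mann-Whitney U tests across the 8 models and apply Benjamini-Hochberg FDR correction. A sketch using SciPy and statsmodels, with the exact pooling across metrics left as an assumption:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def significant_pairs(scores_by_model: dict[str, list[float]], alpha: float = 0.05):
    """Pairwise Mann-Whitney U tests with BH-FDR correction (sketch)."""
    pairs = list(combinations(scores_by_model, 2))
    pvals = [
        mannwhitneyu(scores_by_model[a], scores_by_model[b],
                     alternative="two-sided").pvalue
        for a, b in pairs
    ]
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return [pair for pair, sig in zip(pairs, reject) if sig]
```

With 8 models this yields 28 comparisons per metric, consistent with the 26/28 figure reported for Beats Per Page.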
Dimension Independence:
- Mean absolute correlation: |r| = 0.020 (extremely low)
- Max correlation: |r| = 0.068 (Format ↔ Narrative)
- All dimensions capture distinct quality aspects
Human-LLM Agreement:
- Strong agreement on 3/5 dimensions
- Logic: r = 0.48*** (Pearson correlation)
- Emotional Depth: κ = 0.53 (Cohen's kappa)
- Conflict: κ = 0.42 (Cohen's kappa)
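These agreement figures correspond to standard estimators; here is a sketch with SciPy and scikit-learn on toy paired annotations (the real study uses the human annotation set):

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Toy paired annotations standing in for the real human/LLM labels.
human_scores = [0.8, 0.6, 0.9, 0.4, 0.7]
llm_scores = [0.7, 0.5, 0.8, 0.5, 0.6]
human_labels = ["high", "low", "high", "low", "mid"]
llm_labels = ["high", "low", "mid", "low", "mid"]

r, p = pearsonr(human_scores, llm_scores)            # continuous dimensions (e.g., Logic)
kappa = cohen_kappa_score(human_labels, llm_labels)  # categorical dimensions
print(f"r={r:.2f} (p={p:.3f}), kappa={kappa:.2f}")
```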
Available Now: All LLM-based evaluation prompts are available in the prompts/ directory.
Quick Start:
- Navigate to the `prompts/` folder
- Select a dimension template (e.g., `narrative_efficiency_prompt.txt`)
- Replace the placeholders: `{CONTEXT}`, `{CONTINUATION}`, `{MODEL}`, `{SCRIPT_ID}`
- Send to your preferred LLM (Claude Sonnet 4.5, GPT-4, etc.)
- Parse the structured JSON response
Example:
import json
import openai
# Load template
with open('prompts/character_consistency_prompt.txt', 'r') as f:
    template = f.read()
# Fill with your data
prompt = template.replace('{CONTEXT}', context_text)
prompt = prompt.replace('{CONTINUATION}', continuation_text)
prompt = prompt.replace('{MODEL}', 'GPT-4')
prompt = prompt.replace('{SCRIPT_ID}', 'script_042')
# Call LLM (example with OpenAI)
response = openai.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,
)
# Parse results
results = json.loads(response.choices[0].message.content)
print(f"OOC Rate: {results['statistics']['ooc_rate']}")Detailed Documentation: See prompts/README.md for:
- Detailed usage instructions
- Batch evaluation examples (see the sketch below)
- Output format specifications
- Quality guidelines
- Common issues and solutions
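Pending the official batch examples, a minimal batch loop over the v2.0 release could look like this; `llm_api_call` is the same stand-in used in the snippet above, and the choice of dimension and `{MODEL}` value are illustrative:

```python
import json
from datasets import load_dataset

dataset = load_dataset("FutureMa/DramaBench", split="train")

with open('prompts/logic_consistency_prompt.txt', 'r') as f:
    template = f.read()

results = []
for sample in dataset:
    prompt = (template
              .replace('{CONTEXT}', sample['context'])
              .replace('{CONTINUATION}', sample['continuation'])
              .replace('{MODEL}', 'ground_truth')        # illustrative value
              .replace('{SCRIPT_ID}', str(sample['id'])))
    response = llm_api_call(prompt)  # stand-in for your LLM client
    results.append(json.loads(response))
```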
| Rank | Model | Provider | Overall Score |
|---|---|---|---|
| 🥇 1 | GPT-5.2 | OpenAI | 0.960 |
| 🥈 2 | GLM-4.6 | Zhipu AI | 0.930 |
| 🥉 3 | Qwen3-Max | Alibaba Cloud | 0.917 |
| 4 | Claude Opus 4.5 | Anthropic | 0.888 |
| 5 | MiniMax M2 | MiniMax | 0.869 |
| 6 | DeepSeek V3.2 | DeepSeek | 0.856 |
| 7 | Gemini 3 Pro | Google DeepMind | 0.843 |
| 8 | Kimi K2 Thinking | Moonshot AI | 0.815 |
Note: Rankings may vary by dimension. See web demo for detailed per-dimension scores.
DramaBench/
├── index.html                      # Main landing page
├── README.md                       # This file
├── start_demo.sh                   # One-click demo launcher
├── assets/
│   └── DramaBench_cover.png        # Project cover image
├── prompts/                        # Evaluation prompt templates
│   ├── README.md                   # Prompts usage guide
│   ├── narrative_efficiency_prompt.txt   # Narrative efficiency evaluation
│   ├── character_consistency_prompt.txt  # Character consistency evaluation
│   ├── emotional_depth_prompt.txt        # Emotional depth evaluation
│   ├── logic_consistency_prompt.txt      # Logic consistency evaluation
│   ├── conflict_handling_prompt.txt      # Conflict handling evaluation
│   └── dialogue_quality_prompt.txt       # Dialogue quality evaluation
├── web/                            # Web application
│   ├── leaderboard.html            # Model rankings page
│   ├── cases.html                  # Case studies page
│   ├── demo.html                   # Interactive script continuation demo
│   ├── css/
│   │   └── apple-style.css         # Apple-inspired CSS framework
│   ├── data/                       # Data files
│   │   ├── leaderboard.json        # Model rankings (14KB)
│   │   ├── case_studies.json       # 24 case studies (262KB)
│   │   ├── statistics.json         # Overall statistics (3KB)
│   │   └── demo/                   # Demo-specific data
│   │       ├── dramabench_continuation_500.jsonl  # 500 scripts dataset (v2.0)
│   │       ├── dramabench_continuation_100.jsonl  # 100 scripts dataset (v1.0)
│   │       └── drama_continuation_prompt_template.txt  # Official prompt
│   └── scripts/
│       ├── process_data.py         # Data processing script
│       └── demo.js                 # Interactive demo logic
├── dataset/                        # [Coming Soon] Dataset files
├── evaluation/                     # [Coming Soon] Evaluation code
└── docs/                           # [Coming Soon] Additional documentation
Tested and optimized for:
- Chrome 90+
- Safari 14+
- Firefox 88+
- Edge 90+
Issue: "Error loading data"
- Cause: Opening HTML files directly without HTTP server
- Solution: Use `./start_demo.sh` or `python3 -m http.server 8000`
Issue: "Port 8000 already in use"
- Cause: Another process is using port 8000
- Solution: Use a different port: `python3 -m http.server 8001`
We welcome contributions to DramaBench! Areas for contribution:
- Bug reports and fixes
- Documentation improvements
- UI/UX enhancements
- New visualizations
- Evaluation tools
- Dataset improvements
How to Contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
If you use DramaBench in your research, please cite our paper:
@misc{ma2025dramabenchsixdimensionalevaluationframework,
title={DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation},
author={Shijian Ma and Yunqi Huang and Yan Lin},
year={2025},
eprint={2512.19012},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.19012}
}

This project is licensed under the MIT License - see the LICENSE file for details.
- Apple Design Team - Design inspiration
- ACL Community - Research support
- Model Providers - OpenAI, Anthropic, Google DeepMind, Alibaba Cloud, DeepSeek, MiniMax, Moonshot AI, Zhipu AI
For questions, feedback, or collaboration opportunities:
- Issues: GitHub Issues
- Email: mas8069@foxmail.com
- Twitter: @mashijiann
Last Updated: 2025-12-30 • Version: 1.0.0 • Status: Active
Made with ❤️ by the DramaBench Team
