DramaBench

DramaBench Cover

A Six-Dimensional Evaluation Framework for Drama Script Continuation

Status License Paper

🌐 Website • ✨ Interactive Demo • 📊 Leaderboard • 🤗 Dataset



🎯 Overview

DramaBench is a comprehensive benchmark for evaluating drama script continuation capabilities of large language models. It provides:

Core Components

  • 🌐 Project Website - Interactive showcase with evaluation results and case studies
  • ✨ Interactive Demo - Try script continuation with multiple LLM models (user-provided API key)
  • 💾 Large-Scale Dataset - 1,103 drama scripts with human annotations
  • 📊 Evaluation Framework - 6 independent dimensions with rigorous metrics
  • 🏆 Model Leaderboard - Compare 8 SOTA language models
  • 📝 Case Studies - 24 curated examples with detailed analysis
  • 🔧 Evaluation Prompts - LLM-based labeling templates for all 6 dimensions

Six Evaluation Dimensions

  1. Format Standards (Rule-based) - Screenplay format compliance
  2. Narrative Efficiency (LLM-labeled) - Story progression effectiveness
  3. Character Consistency (LLM-labeled) - Character voice and behavior
  4. Emotional Depth (LLM-labeled) - Emotional arc development
  5. Logic Consistency (LLM-labeled) - Factual coherence and continuity
  6. Conflict Handling (LLM-labeled) - Conflict development quality

Key Statistics

  • 1,103 unique drama scripts
  • 8,824 total evaluations (1,103 scripts × 8 models)
  • 8 state-of-the-art language models
  • 6 independent evaluation dimensions
  • 252 statistical significance tests (65.9% significant)
  • 24 curated case studies

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Web browser (Chrome, Safari, Firefox, or Edge)

Launch Web Demo

Method 1: One-Click Start (Easiest)

cd DramaBench
./start_demo.sh

This will automatically:

  • ✅ Start a local HTTP server on port 8000
  • ✅ Open the demo in your default browser
  • ✅ Navigate to http://localhost:8000

Method 2: Manual Server Start

cd DramaBench

# Using uv (if available)
uv run python -m http.server 8000

# Or using Python 3 directly
python3 -m http.server 8000

# Then open http://localhost:8000 in your browser

⚠️ Important Note

Due to browser CORS restrictions, you must use a local HTTP server to view the demo. Opening HTML files directly (file:// protocol) will cause data loading errors.


🧩 Project Components

1. Project Website & Interactive Demo

An interactive, Apple-inspired web interface for exploring evaluation results and trying script continuation.

Website Features:

  • 📊 Interactive leaderboard with dimension filters
  • 📝 Case studies explorer with 24 examples
  • 🎨 Premium dark gradient design
  • 📱 Fully responsive (mobile/tablet/desktop)
  • ⚡ Pure HTML/CSS/JavaScript (no frameworks)

Interactive Demo Features:

  • ✨ Try script continuation with 4 SOTA models (GPT-5.2, Gemini 3, GLM-4.7, MiniMax M2.1)
  • 🔑 User-provided OpenRouter API key (stored locally)
  • 📜 500 drama scripts from DramaBench dataset
  • 🎭 Official prompt template for generation
  • 📊 Compare AI-generated vs ground truth continuations
  • 🎨 Matching Apple-style design

Pages:

  • index.html - Main landing page
  • web/leaderboard.html - Model rankings
  • web/cases.html - Case studies browser
  • web/demo.html - Interactive script continuation demo

→ View Live Website | → Try Interactive Demo

2. Dataset

🎉 Now Available on Hugging Face!

The DramaBench dataset is being released progressively to ensure quality and gather community feedback.

Current Release (v2.0):

  • ✅ 500 Drama Scripts - Available now on Hugging Face
  • 📥 Download: FutureMa/DramaBench
  • 📄 Format: JSONL with structured metadata
  • 🔓 License: MIT License
  • 📊 Usage: Load with the datasets library

Quick Start:

from datasets import load_dataset

# Load dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")

# Access samples
sample = dataset[0]
print(sample['title'])
print(sample['context'])
print(sample['continuation'])

Release Roadmap:

| Version | Samples | Status | Expected Release |
|---------|---------|--------|------------------|
| v1.0 | 100 | ✅ Released | 2025-12-23 |
| v2.0 | 500 | ✅ Available | 2026-01-01 |
| v3.0 (Full) | 1,103 | 📋 Planned | Q2 2026 |

Full Dataset Contents (v3.0):

  • 1,103 drama script contexts and continuations
  • Model-generated continuations (8 SOTA models)
  • Human annotations and quality assessments
  • Multi-dimensional evaluation metrics
  • Error taxonomy and classification

3. Evaluation Prompts

✅ Now Available: LLM-based evaluation prompt templates for all 6 dimensions.

Location: prompts/ directory

Contents:

  • narrative_efficiency_prompt.txt - Story progression effectiveness
  • character_consistency_prompt.txt - Character voice and behavior consistency
  • emotional_depth_prompt.txt - Emotional arc development
  • logic_consistency_prompt.txt - Factual coherence and continuity
  • conflict_handling_prompt.txt - Conflict development and resolution
  • dialogue_quality_prompt.txt - Dialogue naturalness and purpose

Quick Start:

import json

# Load a prompt template
with open('prompts/narrative_efficiency_prompt.txt', 'r') as f:
    prompt = f.read()

# Fill placeholders with your own script context and generated continuation
prompt = prompt.replace('{CONTEXT}', script_context)
prompt = prompt.replace('{CONTINUATION}', generated_continuation)
prompt = prompt.replace('{MODEL}', 'GPT-4')
prompt = prompt.replace('{SCRIPT_ID}', 'script_001')

# Send to your LLM (llm_api_call is a placeholder) and parse the structured JSON output
response = llm_api_call(prompt)
evaluation = json.loads(response)

See prompts/README.md for detailed usage instructions.

Coming Soon: Full evaluation pipeline including:

  • Statistical analysis scripts
  • Visualization generation tools
  • Reproducibility automation scripts

🌐 Website & Interactive Demo

Live Website

Visit dramabench.pages.dev to explore:

  • Homepage - Project overview and statistics
  • Leaderboard - Compare 8 SOTA models across 6 dimensions
  • Case Studies - Browse 24 curated examples with detailed analysis
  • Interactive Demo - Try script continuation yourself

Interactive Demo

Try it now: dramabench.pages.dev/web/demo.html

Experience drama script continuation with state-of-the-art language models:

Features:

  • 🎭 500 Drama Scripts - Select from DramaBench v2.0 dataset
  • 🤖 4 SOTA Models - GPT-5.2, Gemini 3 Flash, GLM-4.7, MiniMax M2.1
  • 🔑 Your API Key - Uses OpenRouter API (bring your own key)
  • 📊 Compare Results - View AI-generated vs ground truth side-by-side
  • 🎨 Apple Design - Beautiful, responsive interface

How to Use:

  1. Get your free API key from OpenRouter
  2. Visit the demo page
  3. Enter your API key (stored locally in your browser)
  4. Select a script from 500 options
  5. Choose your preferred model
  6. Generate and compare continuations

Cost: Pay-as-you-go through OpenRouter (typically $0.01-0.10 per generation)
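
Outside the browser demo, OpenRouter exposes an OpenAI-compatible API, so a continuation request can also be issued from a script. The snippet below is a minimal sketch only: the model ID and the inline prompt are illustrative assumptions (the demo itself builds its prompt from the official template in web/data/demo/), and the key placeholder is your own OpenRouter key.

from openai import OpenAI

# Minimal sketch of an OpenRouter call (illustrative; not the demo's code).
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # bring your own key
)

context = "INT. KITCHEN - NIGHT\n..."  # a script excerpt from the dataset

response = client.chat.completions.create(
    model="openai/gpt-4o",  # replace with any model ID OpenRouter offers
    messages=[{"role": "user", "content": f"Continue this drama script:\n\n{context}"}],
    temperature=0.7,
)
print(response.choices[0].message.content)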

Website Features

Interactive Leaderboard

  • Filter by dimension (overall + 6 dimensions)
  • Expandable model details with per-dimension scores
  • Rank badges (gold/silver/bronze)
  • Real-time filtering and sorting

Case Studies Explorer

  • 24 curated success/failure examples
  • Filter by dimension and type
  • Script excerpts with metrics
  • Analysis insights and takeaways

Design

  • Apple-inspired UI with premium dark gradients
  • SF Pro font family (system fonts)
  • Glassmorphism effects
  • Smooth animations and transitions
  • Fully responsive layout

Technologies

  • Pure HTML/CSS/JavaScript (no frameworks)
  • Apple Design Language principles
  • CSS Grid & Flexbox layouts
  • Backdrop filters for glassmorphism
  • CSS animations for smooth transitions

Local Development

Regenerate web demo data from source:

cd DramaBench
uv run python web/scripts/process_data.py

This processes:

  • 6 dimension metrics CSV files (8,824 evaluations)
  • 24 case studies with detailed analysis
  • Generates web-friendly JSON in web/data/

💾 Dataset

Dataset Access

🤗 Hugging Face Dataset: FutureMa/DramaBench

Current Release: v2.0 (500 samples) - Available Now!

Quick Start

Load with Datasets Library:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("FutureMa/DramaBench", split="train")

# Access a sample
sample = dataset[0]
print(f"Title: {sample['title']}")
print(f"Context: {sample['context'][:200]}...")
print(f"Continuation: {sample['continuation'][:200]}...")
print(f"Stats: {sample['stats']}")

Analyze Dataset:

cd DramaBench
python scripts/load_dataset.py

Dataset Overview

Current Release (v2.0 - 500 samples):

  • 500 high-quality drama scripts with context-continuation pairs
  • Average context length: ~1,601 characters (~400 tokens)
  • Average continuation length: ~1,600 characters (~400 tokens)
  • Split types: 73% scene boundary, 27% middle
  • Format: Fountain screenplay format (industry standard)
  • Fields: id, title, description, context, continuation, stats
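
For orientation, a single record with the fields above might look roughly like the following. All values are placeholders rather than actual dataset content, and the stats sub-fields are assumed for illustration only.

# Hypothetical record shape (placeholder values, assumed stats keys)
sample = {
    "id": "script_0001",
    "title": "Example Title",
    "description": "One-line synopsis of the script",
    "context": "INT. KITCHEN - NIGHT\n\nMARIA stands at the sink...",
    "continuation": "MARIA\nYou never told me.\n...",
    "stats": {"context_chars": 1601, "continuation_chars": 1600},
}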

Release Roadmap:

| Version | Samples | Status | Release Date |
|---------|---------|--------|--------------|
| v1.0 | 100 | ✅ Released | 2025-12-23 |
| v2.0 | 500 | ✅ Available | 2026-01-01 |
| v3.0 (Full) | 1,103 | 📋 Planned | Q2 2026 |

Full Benchmark (v3.0 - Planned):

  • 1,103 complete drama scripts
  • Model-generated continuations from 8 SOTA models
  • Human annotations and quality assessments
  • Multi-dimensional evaluation metrics (8,824 evaluations)
  • Error taxonomy and classification
  • Statistical significance test results

Format: JSONL with structured metadata

License: MIT License


📊 Evaluation Framework

Methodology

DramaBench uses a hybrid evaluation system:

  1. Rule-Based Analysis (Format Standards)

    • 100% reproducible
    • Zero cost
    • Fountain syntax validation (a simplified example check is sketched below)
  2. LLM-Based Labeling (5 content dimensions)

    • Structured feature extraction
    • Statistical metric calculation
    • Features are labeled and aggregated into metrics, not scored directly by the LLM
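
For flavor, the sketch below shows what a simple rule-based check over Fountain text could look like: it flags scene-heading-like lines that lack a standard INT./EXT. prefix. This is a simplified illustration under assumed heuristics, not DramaBench's actual Format Standards implementation or its published metrics.

import re

# Simplified illustration of a rule-based Fountain check (not the actual
# DramaBench implementation): among lines that look like scene headings,
# count those missing a standard prefix.
HEADING_PREFIX = re.compile(r"^(INT|EXT|EST|I/E)[. /]", re.IGNORECASE)

def rough_format_error_rate(script_text: str) -> float:
    lines = [ln.strip() for ln in script_text.splitlines() if ln.strip()]
    # Heuristic: all-caps lines containing '.' or '-' are treated as headings.
    headings = [ln for ln in lines if ln.isupper() and ("." in ln or "-" in ln)]
    if not headings:
        return 0.0
    bad = [h for h in headings if not HEADING_PREFIX.match(h)]
    return len(bad) / len(headings)

print(rough_format_error_rate("INT. KITCHEN - NIGHT\nMARIA\nHello.\nKITCHEN - LATER"))
# -> 0.5 (one of two heading-like lines lacks an INT./EXT. prefix)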

Six Dimensions

| Dimension | Type | Key Metrics | Description |
|-----------|------|-------------|-------------|
| Format Standards | Rule-based | Format Error Rate, Novelization Index, Dialogue-Action Ratio | Screenplay format compliance |
| Narrative Efficiency | LLM-labeled | Effective Narrative Rate (ENR), Beats Per Page | Story progression effectiveness |
| Character Consistency | LLM-labeled | Out-of-Character Rate, Voice Distinctiveness | Character voice and behavior consistency |
| Emotional Depth | LLM-labeled | Arc Score, Complexity Ratio | Emotional arc development |
| Logic Consistency | LLM-labeled | Logic Break Rate, Context Coherence | Factual coherence and logical continuity |
| Conflict Handling | LLM-labeled | Conflict Score, Drop Rate | Conflict development and resolution |

Validation

Statistical Significance:

  • 252 Mann-Whitney U tests performed (a minimal illustration of the test setup follows below)
  • 166/252 comparisons significant (65.9% after FDR correction)
  • Beats Per Page: most differentiating metric (26/28 comparisons significant)
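
A minimal sketch of this kind of test setup, assuming per-script metric values for two models are already available as arrays. The data below are placeholders and this is not the benchmark's analysis code.

import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Placeholder per-script metric values for two models (assumed data).
rng = np.random.default_rng(0)
model_a = rng.normal(0.60, 0.10, size=500)
model_b = rng.normal(0.55, 0.10, size=500)

# One pairwise comparison; in practice this is repeated across model pairs
# and dimensions, and the resulting p-values are corrected together.
stat, p = mannwhitneyu(model_a, model_b, alternative="two-sided")

# Benjamini-Hochberg FDR correction over the collected p-values.
p_values = [p]  # append every pairwise p-value here
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(stat, p_adjusted[0], bool(reject[0]))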

Dimension Independence:

  • Mean absolute correlation: |r| = 0.020 (extremely low)
  • Max correlation: |r| = 0.068 (Format ↔ Narrative)
  • All dimensions capture distinct quality aspects

Human-LLM Agreement:

  • Strong agreement on 3/5 dimensions
  • Logic: r=0.48*** (Pearson correlation)
  • Emotional Depth: κ=0.53 (Cohen's Kappa)
  • Conflict: κ=0.42 (Cohen's Kappa)

Using Evaluation Prompts

Available Now: All LLM-based evaluation prompts are available in the prompts/ directory.

Quick Start:

  1. Navigate to prompts/ folder
  2. Select a dimension template (e.g., narrative_efficiency_prompt.txt)
  3. Replace placeholders: {CONTEXT}, {CONTINUATION}, {MODEL}, {SCRIPT_ID}
  4. Send to your preferred LLM (Claude Sonnet 4.5, GPT-4, etc.)
  5. Parse the structured JSON response

Example:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Load template
with open('prompts/character_consistency_prompt.txt', 'r') as f:
    template = f.read()

# Fill with your data (context_text / continuation_text are your own strings)
prompt = template.replace('{CONTEXT}', context_text)
prompt = prompt.replace('{CONTINUATION}', continuation_text)
prompt = prompt.replace('{MODEL}', 'GPT-4')
prompt = prompt.replace('{SCRIPT_ID}', 'script_042')

# Call the LLM (example with the OpenAI SDK)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3
)

# Parse the structured JSON results
results = json.loads(response.choices[0].message.content)
print(f"OOC Rate: {results['statistics']['ooc_rate']}")

Detailed Documentation: See prompts/README.md for:

  • Detailed usage instructions
  • Batch evaluation examples
  • Output format specifications
  • Quality guidelines
  • Common issues and solutions
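
prompts/README.md covers batch evaluation in detail; as a rough illustration of the loop structure (not the repository's pipeline), the sketch below fills one template for a few dataset samples. The llm_api_call helper is a placeholder you would implement against your LLM provider.

import json
from pathlib import Path
from datasets import load_dataset

def llm_api_call(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM and return its raw JSON reply."""
    raise NotImplementedError

template = Path("prompts/logic_consistency_prompt.txt").read_text()
dataset = load_dataset("FutureMa/DramaBench", split="train")

results = []
for sample in dataset.select(range(3)):  # small batch for illustration
    prompt = (template
              .replace("{CONTEXT}", sample["context"])
              .replace("{CONTINUATION}", sample["continuation"])
              .replace("{MODEL}", "your-model-name")
              .replace("{SCRIPT_ID}", sample["id"]))
    results.append(json.loads(llm_api_call(prompt)))

Path("logic_consistency_results.json").write_text(json.dumps(results, indent=2))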

πŸ† Leaderboard

Top 8 Models Evaluated

| Rank | Model | Provider | Overall Score |
|------|-------|----------|---------------|
| 🥇 1 | GPT-5.2 | OpenAI | 0.960 |
| 🥈 2 | GLM-4.6 | Zhipu AI | 0.930 |
| 🥉 3 | Qwen3-Max | Alibaba Cloud | 0.917 |
| 4 | Claude Opus 4.5 | Anthropic | 0.888 |
| 5 | MiniMax M2 | MiniMax | 0.869 |
| 6 | DeepSeek V3.2 | DeepSeek | 0.856 |
| 7 | Gemini 3 Pro | Google DeepMind | 0.843 |
| 8 | Kimi K2 Thinking | Moonshot AI | 0.815 |

Note: Rankings may vary by dimension. See web demo for detailed per-dimension scores.


📚 Documentation

Project Structure

DramaBench/
├── index.html                    # Main landing page
├── README.md                     # This file
├── start_demo.sh                 # One-click demo launcher
├── assets/
│   └── DramaBench_cover.png      # Project cover image
├── prompts/                      # Evaluation prompt templates
│   ├── README.md                 # Prompts usage guide
│   ├── narrative_efficiency_prompt.txt      # Narrative efficiency evaluation
│   ├── character_consistency_prompt.txt     # Character consistency evaluation
│   ├── emotional_depth_prompt.txt           # Emotional depth evaluation
│   ├── logic_consistency_prompt.txt         # Logic consistency evaluation
│   ├── conflict_handling_prompt.txt         # Conflict handling evaluation
│   └── dialogue_quality_prompt.txt          # Dialogue quality evaluation
├── web/                          # Web application
│   ├── leaderboard.html          # Model rankings page
│   ├── cases.html                # Case studies page
│   ├── demo.html                 # Interactive script continuation demo
│   ├── css/
│   │   └── apple-style.css       # Apple-inspired CSS framework
│   ├── data/                     # Data files
│   │   ├── leaderboard.json      # Model rankings (14KB)
│   │   ├── case_studies.json     # 24 case studies (262KB)
│   │   ├── statistics.json       # Overall statistics (3KB)
│   │   └── demo/                 # Demo-specific data
│   │       ├── dramabench_continuation_500.jsonl  # 500 scripts dataset (v2.0)
│   │       ├── dramabench_continuation_100.jsonl  # 100 scripts dataset (v1.0)
│   │       └── drama_continuation_prompt_template.txt  # Official prompt
│   └── scripts/
│       ├── process_data.py       # Data processing script
│       └── demo.js               # Interactive demo logic
├── dataset/                      # [Coming Soon] Dataset files
├── evaluation/                   # [Coming Soon] Evaluation code
└── docs/                         # [Coming Soon] Additional documentation

Browser Compatibility

Tested and optimized for:

  • ✅ Chrome 90+
  • ✅ Safari 14+
  • ✅ Firefox 88+
  • ✅ Edge 90+

Common Issues

Issue: "Error loading data"

  • Cause: Opening HTML files directly without HTTP server
  • Solution: Use ./start_demo.sh or python3 -m http.server 8000

Issue: "Port 8000 already in use"

  • Cause: Another process is using port 8000
  • Solution: Use a different port: python3 -m http.server 8001

🤝 Contributing

We welcome contributions to DramaBench! Areas for contribution:

  • πŸ› Bug reports and fixes
  • πŸ“ Documentation improvements
  • 🎨 UI/UX enhancements
  • πŸ“Š New visualizations
  • πŸ”§ Evaluation tools
  • πŸ’Ύ Dataset improvements

How to Contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📖 Citation

If you use DramaBench in your research, please cite our paper:

@misc{ma2025dramabenchsixdimensionalevaluationframework,
  title={DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation},
  author={Shijian Ma and Yunqi Huang and Yan Lin},
  year={2025},
  eprint={2512.19012},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.19012}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Apple Design Team - Design inspiration
  • ACL Community - Research support
  • Model Providers - OpenAI, Anthropic, Google DeepMind, Alibaba Cloud, DeepSeek, MiniMax, Moonshot AI, Zhipu AI

📧 Contact

For questions, feedback, or collaboration opportunities:


Last Updated: 2025-12-30 • Version: 1.0.0 • Status: ✅ Active

Made with ❤️ by the DramaBench Team
