A Python tool for automatically extracting and analyzing review process timelines from academic journal articles (PDF format). This tool helps researchers understand the typical review duration of different journals by extracting submission, revision, and acceptance dates from PDF files.
- 📄 Automatic PDF Text Extraction: Uses PyMuPDF (fitz) to extract text from PDF files
- 🔍 Multi-Format Support: Compatible with multiple journal formats including:
  - Elsevier journals (e.g., "Received in revised form" format)
  - IEEE journals (e.g., "revised" format)
- 📊 Comprehensive Statistics: Calculates mean, median, minimum, and maximum review times
- ⏱️ Dual Time Units: Displays results in both days and months (30 days/month)
- 🎯 Smart Date Recognition: Handles various date formats and cross-line date information
- 📁 Batch Processing: Processes all PDF files in a specified directory
- Python 3.6+
- PyMuPDF (fitz)
- Clone this repository:
git clone https://github.com/yourusername/journal-review-time-statistics.git
cd journal-review-time-statistics
- Install required dependencies:
pip install PyMuPDF
Or if using conda:
conda install -c conda-forge pymupdf
- Organize your PDF files in the following structure:
journal article archive/
├── IEEE Sens J/
│ ├── article1.pdf
│ ├── article2.pdf
│ └── ...
├── Another Journal/
│ └── ...
└── ...
- Modify the journal_name in main.py to specify which journal to analyze:
journal_name = "IEEE Sens J" # Change this to your target journal
pdf_folder = rf"journal article archive/{journal_name}"
- Run the script:
python main.py
The tool recognizes the following date patterns commonly used in academic journals:
Received 8 April 2024; Received in revised form 23 August 2024; Accepted 15 September 2024
Received 17 January 2025; revised 31 March 2025; accepted 24 April 2025.
Date of publication 15 May 2025; date of current version 30 May 2025.
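The patterns above can be matched with a few regular expressions. The following is a minimal sketch, not the code in main.py: the names DATE, PATTERNS, and find_date_strings are hypothetical, and it assumes day-month-year dates with English month names. Whitespace is collapsed first so that a date split across two lines still matches.

```python
import re

# A day-month-year date such as "8 April 2024" (assumed format).
DATE = r"(\d{1,2}\s+[A-Za-z]+\.?\s+\d{4})"

# Hypothetical patterns, one per milestone. "revised(?:\s+form)?" covers both
# "Received in revised form <date>" (Elsevier) and "revised <date>" (IEEE).
PATTERNS = {
    "received": re.compile(r"\breceived\s+" + DATE, re.IGNORECASE),
    "revised": re.compile(r"\brevised(?:\s+form)?\s+" + DATE, re.IGNORECASE),
    "accepted": re.compile(r"\baccepted\s+" + DATE, re.IGNORECASE),
}


def find_date_strings(text):
    """Return the first date string found for each milestone, or None."""
    # Collapse newlines and repeated spaces so cross-line dates become one run.
    flat = re.sub(r"\s+", " ", text)
    found = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(flat)
        found[label] = match.group(1) if match else None
    return found
```

The function returns raw date strings only; converting them to datetime objects is a separate step.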
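Sample output of python main.py (the script prints its report in Chinese; 天 = days, 个月 = months, 中位数 = median, 最短 = shortest, 最长 = longest):
================================================================================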
找到 10 个PDF文件
================================================================================
处理文件: article1.pdf
Received: 2024-04-08
Revised: 2024-08-23
Accepted: 2024-09-15
>> Received -> Revised: 137 天 (4.6 个月)
>> Received -> Accepted: 160 天 (5.3 个月)
--------------------------------------------------------------------------------
...
================================================================================
【统计结果】
================================================================================
处理IEEE Sens J的PDF文件总数: 10
成功提取Received→Revised时间的文件数: 10
成功提取Received→Accepted时间的文件数: 10
【Received -> Revised 平均时间】: 220.2 天 (7.3 个月)
中位数: 183.5 天 (6.1 个月)
最短: 54 天 (1.8 个月)
最长: 440 天 (14.7 个月)
【Received -> Accepted 平均时间】: 242.6 天 (8.1 个月)
中位数: 201.0 天 (6.7 个月)
最短: 80 天 (2.7 个月)
最长: 449 天 (15.0 个月)
================================================================================
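The figures in a report like this can be reproduced with the standard library. The sketch below is an illustration under stated assumptions, not the script's own calculate_statistics: the names DAYS_PER_MONTH, to_months, and summarize are hypothetical, and the month conversion uses the 30 days/month convention stated in the feature list.

```python
from statistics import mean, median

DAYS_PER_MONTH = 30  # convention stated in this README


def to_months(days):
    """Convert a day count to months, rounded to one decimal place."""
    return round(days / DAYS_PER_MONTH, 1)


def summarize(day_counts):
    """Return count, mean, median, min, and max for a list of day counts."""
    if not day_counts:
        return None  # nothing was extracted, so there is nothing to report
    return {
        "count": len(day_counts),
        "mean": mean(day_counts),
        "median": median(day_counts),
        "min": min(day_counts),
        "max": max(day_counts),
    }
```

For the article1.pdf entry above, to_months(137) gives 4.6 and to_months(160) gives 5.3, matching the printed values.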
- PDF Text Extraction: The script reads the first 3 pages of each PDF file to locate date information
- Pattern Matching: Uses regular expressions to identify and extract dates with keywords:
  - "Received" - Initial submission date
  - "Revised" or "Received in revised form" - Revision submission date
  - "Accepted" - Final acceptance date
- Date Parsing: Converts various date formats into standardized datetime objects
- Time Calculation: Computes the number of days between key milestones
- Statistical Analysis: Calculates mean, median, min, and max values across all processed papers
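The date parsing and time calculation steps can be sketched as follows. This is an assumption about one reasonable approach, not the actual parse_date or calculate_days_difference in main.py: the names CANDIDATE_FORMATS, parse_date_sketch, and days_between are hypothetical, the format list is illustrative, and month-name parsing with strptime assumes an English (or C) locale.

```python
from datetime import datetime

# Illustrative formats only; the real script may support others.
CANDIDATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%B %d, %Y")


def parse_date_sketch(date_string):
    """Try each candidate format in turn; return a datetime or None."""
    # Join text split across lines and drop trailing punctuation.
    cleaned = " ".join(date_string.split()).rstrip(".;,")
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


def days_between(start, end):
    """Whole days from start to end (negative if end precedes start)."""
    return (end - start).days
```

For example, days_between(parse_date_sketch("8 April 2024"), parse_date_sketch("23 August 2024")) returns 137, the Received -> Revised interval shown for article1.pdf.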
.
├── main.py # Main script
├── README.md # This file
└── journal article archive/ # Directory containing PDF files
    ├── IEEE Sens J/
    ├── Elsevier Journal/
    └── ...
- extract_text_from_pdf(pdf_path): Extracts text from PDF files
- parse_date(date_string): Parses various date formats into datetime objects
- extract_dates_from_text(text): Identifies and extracts received/revised/accepted dates
- calculate_days_difference(date1, date2): Calculates the difference in days
- process_pdf_folder(folder_path): Processes all PDFs in a directory
- calculate_statistics(results, journal_name): Computes and displays statistics
- 📖 Journal Selection: Help researchers choose journals with faster review times
- 📈 Trend Analysis: Analyze how review times change over different periods
- 🔬 Research Planning: Better estimate publication timelines for grant applications
- 📊 Comparative Studies: Compare review efficiency across different journals
- The script assumes date information appears in the first 3 pages of the PDF
- Dates are expected to follow common academic journal formats
- For best results, ensure PDFs are text-based (not scanned images)
- Review times are calculated from the received date to revision/acceptance dates
Issue: Chinese characters display as garbled text in Windows PowerShell
Solution:
- Run the script in a Python IDE (PyCharm, VS Code, etc.) for proper UTF-8 display
- Or execute chcp 65001 in PowerShell before running the script
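A further option, offered here as a suggestion rather than something main.py is known to do, is to switch Python's own output streams to UTF-8 at the top of the script. The helper name force_utf8_output is hypothetical. The reconfigure method exists on text streams from Python 3.7 onward, so on Python 3.6 this sketch does nothing.

```python
import sys


def force_utf8_output():
    """Switch stdout and stderr to UTF-8 where the stream supports it."""
    for stream in (sys.stdout, sys.stderr):
        # reconfigure() is available on io.TextIOWrapper from Python 3.7;
        # redirected or replaced streams may not provide it.
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8")


force_utf8_output()
```

The console font must still be able to render Chinese characters, so this may need to be combined with chcp 65001.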
Issue: No dates extracted from PDFs
Solution:
- Verify that the PDFs contain text (not scanned images)
- Check if the date format matches supported patterns
- The date information should be within the first 3 pages
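These checks can be scripted. The sketch below is a hypothetical diagnostic, not part of main.py: first_pages_text and diagnose are made-up names, the 200-character threshold is an arbitrary assumption, and the PyMuPDF calls (fitz.open, page.get_text) assume a reasonably recent PyMuPDF release.

```python
def first_pages_text(pdf_path, max_pages=3):
    """Return the text of the first max_pages pages of a PDF."""
    import fitz  # PyMuPDF; imported here so diagnose() works without it

    with fitz.open(pdf_path) as doc:
        page_count = min(max_pages, len(doc))
        return "\n".join(doc[i].get_text() for i in range(page_count))


def diagnose(text, min_chars=200):
    """Classify extracted text to suggest why no dates were found."""
    if len(text.strip()) < min_chars:
        return "no_text"  # likely a scanned image without a text layer
    lowered = text.lower()
    if not any(word in lowered for word in ("received", "revised", "accepted")):
        return "no_keywords"  # dates may be missing or beyond these pages
    return "keywords_found"  # keywords present, so check the date format
```

Calling diagnose(first_pages_text("article1.pdf")) on a problem file indicates which of the three checks above to look at first.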
Contributions are welcome! Feel free to:
- Report bugs
- Suggest new features
- Submit pull requests to support additional journal formats
- Improve date pattern recognition
This project is licensed under the MIT License - see the LICENSE file for details.
Created for academic research purposes to help researchers make informed decisions about journal submissions.
- PyMuPDF team for the excellent PDF processing library
- The academic community for inspiring this tool
Note: This tool is designed for personal research and analysis purposes. Please respect copyright laws when processing PDF files.