📊 PDF Report Generation — Automated Data Extraction & Excel Reporting

📌 Overview

A Python automation tool built to extract structured data from operational PDF reports and export it to formatted Excel files — eliminating the need for manual copy-paste work across large, multi-page server performance reports.

This project was developed in October 2020 at Capgemini Technology Services as part of an internal automation initiative within the Major Incident Management and Reporting team. It was recognised with a Spot Award for reducing manual reporting effort and improving data accuracy for global stakeholders.

The tool supports two report types via a simple menu-driven interface:

Server Report — Extracts server names and performance metrics (CPU/memory interval data)
Disk/Drive Report — Extracts server names, logical disk/drive identifiers, and associated performance metrics

🚀 Key Features

Automated PDF parsing — Extracts server names and logical drive identifiers using regex pattern matching across multi-page PDFs
Table extraction — Converts embedded PDF tables to structured CSV using tabula-py, then processes with pandas
Intelligent row mapping — Matches extracted server/drive names to their corresponding data rows using index-based alignment
Excel export — Outputs a clean, structured .xlsx file using openpyxl via pandas ExcelWriter
Automatic cleanup — Removes all temporary buffer files (Buffer.csv, Server names.txt, Drive names.txt) after execution
Performance timing — Reports total execution time in minutes
Menu-driven interface — Single entry point (Main.py) with user-selectable report type

🗂️ Project Structure

PDF_ReportGeneration/
│
├── Main.py                  # Entry point — menu to select report type
├── ServerReportGenerator.py # Generates server-level performance reports
├── ReportGenerator.py       # Generates disk/drive-level performance reports
└── README.md                # Project documentation

⚙️ How It Works

ServerReportGenerator.py

Accepts a PDF file path as input
Reads all pages and extracts text using PyPDF2
Applies regex pattern (SCOM) to identify and extract server names
Writes server names to a temporary Server names.txt file
Converts all PDF tables to a CSV buffer using tabula-py
Loads the CSV into a pandas DataFrame and retains only relevant columns (Interval, Min Value, Max Value, Average Value)
Identifies section boundaries by locating rows where Interval == "Interval" (header repeat rows)
Maps each server name to its corresponding data rows using sorted index alignment
Exports the final labelled DataFrame to a user-named .xlsx file
Cleans up all temporary files

ReportGenerator.py

Follows the same pipeline as above, with the addition of:

Logical disk/drive name extraction using the (Logical Disk:) regex pattern
A second column DriveName inserted into the output DataFrame
Parallel mapping of both server names and drive names to their respective data rows

Main.py

*****Menu*****
1. Server Details
2. Disk/Drive Details
Select 1 for generating Server details file or 2 for generating a Disk/drive file
-->

Calls the appropriate script via os.system() based on user selection.

🛠️ Tech Stack

Library	Purpose
`PyPDF2`	PDF file reading and text extraction
`tabula-py`	PDF table detection and CSV conversion
`pandas`	DataFrame manipulation, filtering, and column management
`openpyxl`	Excel file creation and writing via pandas ExcelWriter
`xlsxwriter`	Excel formatting support
`re`	Regex-based pattern matching for server/drive name extraction
`io`	In-memory string stream processing for line-by-line text parsing
`os`	File system operations and subprocess execution
`tabulate`	Tabular data display in terminal (utility)
`time`	Execution time measurement

📋 Requirements

PyPDF2
tabula-py
tabulate
xlsxwriter
openpyxl
pandas

Note: tabula-py requires Java (JRE 8+) to be installed on the machine. Ensure Java is available in your system PATH before running.

Install dependencies:

pip install PyPDF2 tabula-py tabulate xlsxwriter openpyxl pandas

▶️ Usage

Clone or download the repository
Install the required dependencies (see above)
Run the entry point:

python Main.py

Select your report type from the menu:
- Enter 1 for Server performance report
- Enter 2 for Disk/Drive performance report
When prompted, enter the PDF file name (including .pdf extension)
When prompted, enter the output Excel file name (including .xlsx extension)
The tool will process the PDF, display progress in the terminal, and save the final .xlsx report in the working directory.

Example terminal session:

*****Menu*****
1.Server Details
2.Disk/Drive Details
Select 1 for generating Server details file or 2 for generating a Disk/drive file
--> 1
Enter file name
--> SCOM_Report_Oct2020.pdf
SERVER01
SERVER02
SERVER03
...
Enter final report name with .xlsx extension
--> Final_Server_Report_Oct2020.xlsx
Total time taken: 2.34 minutes

💡 Background & Impact

This tool was built in response to a recurring manual reporting task within the Major Incident Management team at Capgemini. Prior to automation, team members spent approximately 60 minutes per week manually extracting server and disk performance data from multi-page SCOM PDF reports and reformatting it into Excel for stakeholder distribution.

After deployment:

⏱️ Report generation time reduced by ~75% (from ~60 mins to 15–20 mins)
👥 Manual effort eliminated for 14–16 team members
✅ Reporting accuracy and consistency significantly improved
🏆 Recognised with a Spot Award from the AXA Service Control Lead, Capgemini (May 2020 & February 2021)

👤 Author

Vishal Petkar

GitHub: @VishPetkar13
LinkedIn: vishalpetkar
Created: October 2020

📜 License

This project was developed for internal operational use at Capgemini Technology Services. Code is shared publicly for portfolio and demonstration purposes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

📊 PDF Report Generation — Automated Data Extraction & Excel Reporting

📌 Overview

🚀 Key Features

🗂️ Project Structure

⚙️ How It Works

ServerReportGenerator.py

ReportGenerator.py

Main.py

🛠️ Tech Stack

📋 Requirements

▶️ Usage

💡 Background & Impact

👤 Author

📜 License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

📊 PDF Report Generation — Automated Data Extraction & Excel Reporting

📌 Overview

🚀 Key Features

🗂️ Project Structure

⚙️ How It Works

ServerReportGenerator.py

ReportGenerator.py

Main.py

🛠️ Tech Stack

📋 Requirements

▶️ Usage

💡 Background & Impact

👤 Author

📜 License