A Python automation tool built to extract structured data from operational PDF reports and export it to formatted Excel files — eliminating the need for manual copy-paste work across large, multi-page server performance reports.
This project was developed in October 2020 at Capgemini Technology Services as part of an internal automation initiative within the Major Incident Management and Reporting team. It was recognised with a Spot Award for reducing manual reporting effort and improving data accuracy for global stakeholders.
The tool supports two report types via a simple menu-driven interface:
- Server Report — Extracts server names and performance metrics (CPU/memory interval data)
- Disk/Drive Report — Extracts server names, logical disk/drive identifiers, and associated performance metrics
Key features:
- Automated PDF parsing: Extracts server names and logical drive identifiers using regex pattern matching across multi-page PDFs
- Table extraction: Converts embedded PDF tables to structured CSV using `tabula-py`, then processes them with `pandas`
- Intelligent row mapping: Matches extracted server/drive names to their corresponding data rows using index-based alignment
- Excel export: Outputs a clean, structured `.xlsx` file using `openpyxl` via `pandas` `ExcelWriter`
- Automatic cleanup: Removes all temporary buffer files (`Buffer.csv`, `Server names.txt`, `Drive names.txt`) after execution
- Performance timing: Reports total execution time in minutes
- Menu-driven interface: Single entry point (`Main.py`) with user-selectable report type
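The cleanup and timing features can be illustrated with a minimal sketch. The file names come from the list above; the helper names `remove_temp_files` and `elapsed_minutes` are hypothetical and are not taken from the project's source.

```python
import os
import tempfile
import time

# File names taken from the feature list above.
TEMP_FILES = ("Buffer.csv", "Server names.txt", "Drive names.txt")


def remove_temp_files(directory, names=TEMP_FILES):
    """Delete whichever buffer files exist in `directory`; return the names removed."""
    removed = []
    for name in names:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed


def elapsed_minutes(start, end):
    """Convert two time.time() readings (seconds) into minutes, rounded to 2 places."""
    return round((end - start) / 60, 2)


# Demo in a throwaway directory so no real files are touched.
start = time.time()
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "Buffer.csv"), "w").close()
print("Removed:", remove_temp_files(workdir))
print(f"Total time taken: {elapsed_minutes(start, time.time())} minutes")
```

Checking that each file exists before deleting it keeps the cleanup safe for the server report, which presumably never creates `Drive names.txt`.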
PDF_ReportGeneration/
│
├── Main.py # Entry point — menu to select report type
├── ServerReportGenerator.py # Generates server-level performance reports
├── ReportGenerator.py # Generates disk/drive-level performance reports
└── README.md # Project documentation
`ServerReportGenerator.py` processes a server report as follows:
- Accepts a PDF file path as input
- Reads all pages and extracts text using `PyPDF2`
- Applies regex pattern `(SCOM)` to identify and extract server names
- Writes server names to a temporary `Server names.txt` file
- Converts all PDF tables to a CSV buffer using `tabula-py`
- Loads the CSV into a `pandas` DataFrame and retains only relevant columns (`Interval`, `Min Value`, `Max Value`, `Average Value`)
- Identifies section boundaries by locating rows where `Interval == "Interval"` (header repeat rows)
- Maps each server name to its corresponding data rows using sorted index alignment
- Exports the final labelled DataFrame to a user-named `.xlsx` file
- Cleans up all temporary files
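The boundary-detection and mapping steps are the least obvious part of the pipeline, so here is a rough, library-free sketch of the idea. It is not the tool's code: it uses the standard `csv` module and a single-pass section counter instead of `pandas` and sorted index alignment. It assumes the first header row is consumed as the CSV header, so each later repeated header row marks the start of the next server's table. The `ServerName` column and the sample data are hypothetical.

```python
import csv
import io

# Columns the pipeline retains, as listed above.
COLUMNS = ["Interval", "Min Value", "Max Value", "Average Value"]


def label_rows(rows, server_names):
    """Attach a server name to each data row.

    A row whose Interval cell is the literal text "Interval" is a repeated
    header: it is dropped and the following rows belong to the next server.
    """
    labelled = []
    section = 0
    for row in rows:
        if row["Interval"] == "Interval":
            section += 1
            continue
        if section >= len(server_names):
            raise ValueError("more table sections than server names")
        labelled.append(
            {"ServerName": server_names[section], **{c: row[c] for c in COLUMNS}}
        )
    return labelled


# Hypothetical buffer contents: two tables, the second introduced by a repeated header.
SAMPLE_CSV = """Interval,Min Value,Max Value,Average Value
00:00,1,5,3
01:00,2,6,4
Interval,Min Value,Max Value,Average Value
00:00,10,50,30
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labelled = label_rows(rows, ["SERVER01", "SERVER02"])
for row in labelled:
    print(row)
```

With `pandas`, the same boundaries would be the index values of the rows where the `Interval` column equals `"Interval"`.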
`ReportGenerator.py` follows the same pipeline as above, with the addition of:
- Logical disk/drive name extraction using the `(Logical Disk:)` regex pattern
- A second column `DriveName` inserted into the output DataFrame
- Parallel mapping of both server names and drive names to their respective data rows
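As an illustration of the name-extraction step, the sketch below scans extracted text line by line with `io.StringIO` and the two regex patterns named above. The text layout is an assumption. This README does not show how a name is derived from a matching line, so the sketch keeps the whole line for a server match and the text after the marker for a drive match. It also assumes the i-th server and i-th drive describe the same table section.

```python
import io
import re

# Patterns quoted in this README.
SERVER_MARKER = re.compile(r"(SCOM)")
DRIVE_MARKER = re.compile(r"(Logical Disk:)")


def extract_names(page_text):
    """Return (server_lines, drive_names) found in text extracted from the PDF."""
    servers, drives = [], []
    for line in io.StringIO(page_text):
        line = line.strip()
        if SERVER_MARKER.search(line):
            # Assumption: the whole matching line identifies the server.
            servers.append(line)
        match = DRIVE_MARKER.search(line)
        if match:
            # Assumption: the drive identifier is whatever follows the marker.
            drives.append(line[match.end():].strip())
    return servers, drives


# Hypothetical extracted text; real SCOM reports may be laid out differently.
SAMPLE_TEXT = (
    "SCOM SERVER01\n"
    "Logical Disk: C:\n"
    "unrelated line\n"
    "SCOM SERVER02\n"
    "Logical Disk: D:\n"
)

servers, drives = extract_names(SAMPLE_TEXT)
# Parallel mapping: pair names by position, one pair per table section.
pairs = list(zip(servers, drives))
print(pairs)
```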
*****Menu*****
1. Server Details
2. Disk/Drive Details
Select 1 for generating Server details file or 2 for generating a Disk/drive file
-->
`Main.py` then calls the appropriate script via `os.system()` based on the user's selection.
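A minimal sketch of such a dispatcher is shown below. The script names and menu text come from this README. Using `subprocess.run` with `sys.executable` in place of `os.system()` is a substitution made for this example, and the `command_for` helper and the "Invalid selection" message are hypothetical.

```python
import subprocess
import sys

# Menu choice -> script, following the project structure listed above.
SCRIPTS = {
    "1": "ServerReportGenerator.py",
    "2": "ReportGenerator.py",
}


def command_for(choice):
    """Return the command list for a menu choice, or None if the choice is invalid."""
    script = SCRIPTS.get(choice.strip())
    if script is None:
        return None
    # sys.executable runs the script with the same interpreter as the menu.
    return [sys.executable, script]


def main():
    """Show the menu, read the selection, and run the matching script."""
    print("*****Menu*****")
    print("1. Server Details")
    print("2. Disk/Drive Details")
    print("Select 1 for generating Server details file or 2 for generating a Disk/drive file")
    command = command_for(input("--> "))
    if command is None:
        print("Invalid selection")
        return 1
    return subprocess.run(command).returncode
```

Calling `main()` starts the interactive menu. Passing the command as a list avoids shell quoting issues, which is one reason to prefer `subprocess.run` over `os.system()` in new code.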
| Library | Purpose |
|---|---|
| `PyPDF2` | PDF file reading and text extraction |
| `tabula-py` | PDF table detection and CSV conversion |
| `pandas` | DataFrame manipulation, filtering, and column management |
| `openpyxl` | Excel file creation and writing via `pandas` `ExcelWriter` |
| `xlsxwriter` | Excel formatting support |
| `re` | Regex-based pattern matching for server/drive name extraction |
| `io` | In-memory string stream processing for line-by-line text parsing |
| `os` | File system operations and subprocess execution |
| `tabulate` | Tabular data display in terminal (utility) |
| `time` | Execution time measurement |
Required packages:
- `PyPDF2`
- `tabula-py`
- `tabulate`
- `xlsxwriter`
- `openpyxl`
- `pandas`
Note: `tabula-py` requires Java (JRE 8+) to be installed on the machine. Ensure Java is available in your system PATH before running.
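One quick way to confirm the Java prerequisite before running the tool is to look for the `java` executable on PATH. The check below is a small sketch and not part of the project. It only confirms that an executable named `java` can be found, and it does not check the version.

```python
import shutil


def java_on_path():
    """Return the full path of the `java` executable, or None if it is not on PATH."""
    return shutil.which("java")


java = java_on_path()
if java is None:
    print("Java not found on PATH; tabula-py will not be able to extract tables.")
else:
    print(f"Java found at {java}")
```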
Install dependencies: `pip install PyPDF2 tabula-py tabulate xlsxwriter openpyxl pandas`

To run the tool:
- Clone or download the repository
- Install the required dependencies (see above)
- Run the entry point: `python Main.py`
- Select your report type from the menu:
  - Enter `1` for Server performance report
  - Enter `2` for Disk/Drive performance report
- When prompted, enter the PDF file name (including `.pdf` extension)
- When prompted, enter the output Excel file name (including `.xlsx` extension)
- The tool will process the PDF, display progress in the terminal, and save the final `.xlsx` report in the working directory.
Example terminal session:
*****Menu*****
1.Server Details
2.Disk/Drive Details
Select 1 for generating Server details file or 2 for generating a Disk/drive file
--> 1
Enter file name
--> SCOM_Report_Oct2020.pdf
SERVER01
SERVER02
SERVER03
...
Enter final report name with .xlsx extension
--> Final_Server_Report_Oct2020.xlsx
Total time taken: 2.34 minutes
This tool was built in response to a recurring manual reporting task within the Major Incident Management team at Capgemini. Prior to automation, team members spent approximately 60 minutes per week manually extracting server and disk performance data from multi-page SCOM PDF reports and reformatting it into Excel for stakeholder distribution.
After deployment:
- ⏱️ Report generation time reduced by roughly 67–75% (from ~60 minutes to 15–20 minutes)
- 👥 Manual effort eliminated for 14–16 team members
- ✅ Reporting accuracy and consistency significantly improved
- 🏆 Recognised with a Spot Award from the AXA Service Control Lead, Capgemini (May 2020 & February 2021)
Vishal Petkar
- GitHub: @VishPetkar13
- LinkedIn: vishalpetkar
- Created: October 2020
This project was developed for internal operational use at Capgemini Technology Services. Code is shared publicly for portfolio and demonstration purposes.