Skip to content

Latest commit

 

History

History
183 lines (138 loc) · 6.92 KB

File metadata and controls

183 lines (138 loc) · 6.92 KB

📊 PDF Report Generation — Automated Data Extraction & Excel Reporting

Python Pandas OpenPyXL PyPDF2 Tabula Status Year


📌 Overview

A Python automation tool built to extract structured data from operational PDF reports and export it to formatted Excel files — eliminating the need for manual copy-paste work across large, multi-page server performance reports.

This project was developed in October 2020 at Capgemini Technology Services as part of an internal automation initiative within the Major Incident Management and Reporting team. It was recognised with a Spot Award for reducing manual reporting effort and improving data accuracy for global stakeholders.

The tool supports two report types via a simple menu-driven interface:

  • Server Report — Extracts server names and performance metrics (CPU/memory interval data)
  • Disk/Drive Report — Extracts server names, logical disk/drive identifiers, and associated performance metrics

🚀 Key Features

  • Automated PDF parsing — Extracts server names and logical drive identifiers using regex pattern matching across multi-page PDFs
  • Table extraction — Converts embedded PDF tables to structured CSV using tabula-py, then processes with pandas
  • Intelligent row mapping — Matches extracted server/drive names to their corresponding data rows using index-based alignment
  • Excel export — Outputs a clean, structured .xlsx file using openpyxl via pandas ExcelWriter
  • Automatic cleanup — Removes all temporary buffer files (Buffer.csv, Server names.txt, Drive names.txt) after execution
  • Performance timing — Reports total execution time in minutes
  • Menu-driven interface — Single entry point (Main.py) with user-selectable report type

🗂️ Project Structure

PDF_ReportGeneration/
│
├── Main.py                  # Entry point — menu to select report type
├── ServerReportGenerator.py # Generates server-level performance reports
├── ReportGenerator.py       # Generates disk/drive-level performance reports
└── README.md                # Project documentation

⚙️ How It Works

ServerReportGenerator.py

  1. Accepts a PDF file path as input
  2. Reads all pages and extracts text using PyPDF2
  3. Applies regex pattern (SCOM) to identify and extract server names
  4. Writes server names to a temporary Server names.txt file
  5. Converts all PDF tables to a CSV buffer using tabula-py
  6. Loads the CSV into a pandas DataFrame and retains only relevant columns (Interval, Min Value, Max Value, Average Value)
  7. Identifies section boundaries by locating rows where Interval == "Interval" (header repeat rows)
  8. Maps each server name to its corresponding data rows using sorted index alignment
  9. Exports the final labelled DataFrame to a user-named .xlsx file
  10. Cleans up all temporary files

ReportGenerator.py

Follows the same pipeline as above, with the addition of:

  • Logical disk/drive name extraction using the (Logical Disk:) regex pattern
  • A second column DriveName inserted into the output DataFrame
  • Parallel mapping of both server names and drive names to their respective data rows

Main.py

*****Menu*****
1. Server Details
2. Disk/Drive Details
Select 1 for generating Server details file or 2 for generating a Disk/drive file
-->

Calls the appropriate script via os.system() based on user selection.


🛠️ Tech Stack

Library Purpose
PyPDF2 PDF file reading and text extraction
tabula-py PDF table detection and CSV conversion
pandas DataFrame manipulation, filtering, and column management
openpyxl Excel file creation and writing via pandas ExcelWriter
xlsxwriter Excel formatting support
re Regex-based pattern matching for server/drive name extraction
io In-memory string stream processing for line-by-line text parsing
os File system operations and subprocess execution
tabulate Tabular data display in terminal (utility)
time Execution time measurement

📋 Requirements

PyPDF2
tabula-py
tabulate
xlsxwriter
openpyxl
pandas

Note: tabula-py requires Java (JRE 8+) to be installed on the machine. Ensure Java is available in your system PATH before running.

Install dependencies:

pip install PyPDF2 tabula-py tabulate xlsxwriter openpyxl pandas

▶️ Usage

  1. Clone or download the repository
  2. Install the required dependencies (see above)
  3. Run the entry point:
python Main.py
  1. Select your report type from the menu:

    • Enter 1 for Server performance report
    • Enter 2 for Disk/Drive performance report
  2. When prompted, enter the PDF file name (including .pdf extension)

  3. When prompted, enter the output Excel file name (including .xlsx extension)

  4. The tool will process the PDF, display progress in the terminal, and save the final .xlsx report in the working directory.

Example terminal session:

*****Menu*****
1.Server Details
2.Disk/Drive Details
Select 1 for generating Server details file or 2 for generating a Disk/drive file
--> 1
Enter file name
--> SCOM_Report_Oct2020.pdf
SERVER01
SERVER02
SERVER03
...
Enter final report name with .xlsx extension
--> Final_Server_Report_Oct2020.xlsx
Total time taken: 2.34 minutes

💡 Background & Impact

This tool was built in response to a recurring manual reporting task within the Major Incident Management team at Capgemini. Prior to automation, team members spent approximately 60 minutes per week manually extracting server and disk performance data from multi-page SCOM PDF reports and reformatting it into Excel for stakeholder distribution.

After deployment:

  • ⏱️ Report generation time reduced by ~75% (from ~60 mins to 15–20 mins)
  • 👥 Manual effort eliminated for 14–16 team members
  • ✅ Reporting accuracy and consistency significantly improved
  • 🏆 Recognised with a Spot Award from the AXA Service Control Lead, Capgemini (May 2020 & February 2021)

👤 Author

Vishal Petkar


📜 License

This project was developed for internal operational use at Capgemini Technology Services. Code is shared publicly for portfolio and demonstration purposes.