⚠️ This repository presents the project and its technical documentation.
The production version is not publicly distributed.
The data shown in the video demonstration is fictitious. For confidentiality reasons, it was not possible to display the actual data collected by the system.
A modular data collection, processing, and export platform.
It allows you to collect information from one or multiple sources, clean and validate the data, and export it in usable formats.
The system is built around a simple principle: each component has a single responsibility.
Collectors gather data, processors clean and prepare it, exporters generate the final files, and the main engine orchestrates the workflow.
This approach keeps the system readable, maintainable, and extensible, whether it runs a small collection project or a full prospecting and data-gathering platform.
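As an illustration of this division of responsibilities, here is a minimal sketch in which each stage is a plain function over a list of dictionaries and the engine only chains them. The names `run_pipeline`, `strip_fields`, and `drop_nameless` are hypothetical and are not taken from the project code.

```python
from typing import Callable, Iterable

Record = dict[str, str]
Stage = Callable[[list[Record]], list[Record]]


def strip_fields(records: list[Record]) -> list[Record]:
    """Processor: trim surrounding whitespace from every value."""
    return [{k: v.strip() for k, v in r.items()} for r in records]


def drop_nameless(records: list[Record]) -> list[Record]:
    """Processor: keep only records that have a non-empty name."""
    return [r for r in records if r.get("name")]


def run_pipeline(records: list[Record], stages: Iterable[Stage]) -> list[Record]:
    """Engine: apply each stage in order without knowing its internals."""
    for stage in stages:
        records = stage(records)
    return records


raw = [{"name": "  Example Company "}, {"name": "   "}]
result = run_pipeline(raw, [strip_fields, drop_nameless])
```

Because the engine only sees a list of stages, a stage can be added, removed, or reordered without touching the others.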
Key features:
- Search management
- Multi-collector architecture
- Data cleaning and normalization
- Data validation
- Duplicate removal
- CSV and JSON export
- Logging and error handling
- Automatic retries
- Tracking contacted prospects
- Web interface for managing searches and results
The system workflow is built on a sequential processing chain to ensure data quality at every stage.
Load configuration
↓
Initialize environment
↓
Load queries
↓
Execute collectors
↓
Clean data
↓
Normalize data
↓
Validate data
↓
Remove duplicates
↓
Generate exports
↓
Log execution
Each component remains independent to facilitate maintenance, updates, and the addition of new features.
New collectors, processors, exporters, or interfaces can be integrated without impacting the existing architecture.
The current version serves as the technical foundation of the system and will support future features for data collection, enrichment, and use.
data-collection-system/
│
├── main.py → System entry point, orchestrates the entire workflow
├── requirements.txt → List of external Python dependencies
├── run.bat → Starts the local application server
│
├── config/
│ ├── __init__.py → Declares the directory as a Python package
│ └── settings.py → Centralizes the system's technical configuration
│
├── collectors/
│ ├── __init__.py → Declares the directory as a Python package
│ ├── duckduckgo_collector.py → Collects data from DuckDuckGo
│ ├── bing_collector.py → Collects data from Bing
│ ├── qwant_collector.py → Collects data from Qwant
│ ├── source_registry.py → Registers and returns the system's active collectors
│ └── base_collector.py → Defines the common contract all collectors must follow
│
├── processors/
│ ├── __init__.py → Declares the directory as a Python package
│ ├── cleaner.py → Cleans collected data before processing
│ ├── validator.py → Validates data integrity and required fields
│ ├── deduplicator.py → Removes duplicate records from the dataset
│ ├── email_extractor.py → Extracts email addresses from visited websites
│ ├── phone_extractor.py → Extracts phone numbers from visited websites
│ └── normalizer.py → Standardizes formats and normalizes values
│
├── exports/
│ ├── __init__.py → Declares the directory as a Python package
│ ├── csv_exporter.py → Exports processed data to CSV format
│ └── json_exporter.py → Exports processed data to JSON format
│
├── core/
│ ├── __init__.py → Declares the directory as a Python package
│ ├── logger.py → Records system events, errors, and execution logs
│ ├── retry_handler.py → Automatically retries failed operations
│ ├── error_handler.py → Centralizes error handling and processing
│ └── folder_initializer.py → Automatically creates required directories at startup
│
├── models/
│ ├── __init__.py → Declares the directory as a Python package
│ └── company.py → Defines the company data structure used throughout the system
│
├── data/
│ ├── prospected.json → History of previously contacted prospects and associated prospecting dates
│ └── processed/ → Stores cleaned, validated, and export-ready data
│
├── logs/
│ └── app.log → System log containing events, errors, and execution records
│
├── searches/
│ ├── __init__.py → Declares the directory as a Python package
│ ├── search_manager.py → Manages the loading, storage, and execution of searches
│ └── searches.json → Stores user-defined search configurations
│
├── web/
│ ├── __init__.py → Declares the directory as a Python package
│ ├── app.py → Main Flask application, provides routes for searches, execution, results, and the user interface
│ │
│ └── templates/
│ ├── edit_searches.html → Edit the settings of an existing search
│ ├── results.html → Displays collected data, filtering options, and prospect tracking
│ ├── running.html → Progress screen displayed while a search is running
│ ├── searches.html → List of saved searches
│ └── new_search.html → Create a new search
│
├── LICENSE.md → Terms of use and legal framework
└── docs/
  ├── INSTALL.md → Provides step-by-step instructions for setting up and running the system
  ├── GUIDE.md → User guide
  └── README.md → General system documentation
The system allows you to create, save, and execute custom searches.
Each search can include different criteria such as keywords, geographic areas, or parameters specific to a data source.
Searches are stored centrally and can be reused in future runs.
The collection engine retrieves information from one or more sources.
The architecture relies on independent collectors, making it easy to add new sources without modifying the rest of the application.
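The common contract mentioned for `base_collector.py` could look like the sketch below. This is an assumption about its shape, not the project's actual code: the `collect(keyword, city)` signature, the `name` attribute, and the `StaticCollector` class are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Hypothetical contract: a collector has a name and a collect() method."""

    name: str = "base"

    @abstractmethod
    def collect(self, keyword: str, city: str) -> list[dict]:
        """Return raw records for one search."""


class StaticCollector(BaseCollector):
    """Toy collector returning fixed data; it stands in for a real source."""

    name = "static"

    def collect(self, keyword: str, city: str) -> list[dict]:
        return [{"name": f"{keyword.title()} Example", "city": city, "source": self.name}]


records = StaticCollector().collect("plumber", "Paris")
```

With an abstract base class, a collector that forgets to implement `collect` fails at instantiation rather than in the middle of a run.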
Collected data is automatically cleaned before processing.
This step removes unnecessary spaces, corrects certain formats, and prepares the data for subsequent stages.
The system standardizes values to ensure overall data consistency.
This facilitates comparisons, searches, and deduplication operations.
Each record is checked to verify the presence and consistency of required information.
Invalid data can be rejected before export.
The engine automatically detects and removes duplicates from collected results.
This step improves export quality and reduces noise in datasets.
Results can be exported in various formats for use with other tools or systems.
Currently supported formats:
- CSV
- JSON
All important system events are recorded in log files.
This feature simplifies error diagnosis and execution monitoring.
The system centralizes error handling to ensure consistent behavior when incidents occur.
Certain operations can be automatically retried when a temporary failure is detected.
This feature improves the overall reliability of the system.
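A retry mechanism of this kind can be sketched as follows. The function name `retry`, the choice of `ConnectionError` and `TimeoutError` as "temporary" failures, and the linear backoff are assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call operation until it succeeds or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # give up and let the caller handle the final failure
            time.sleep(delay * attempt)  # linear backoff between tries
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky_fetch() -> str:
    """Fails twice with a temporary error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = retry(flaky_fetch, attempts=3)
```

Only errors considered temporary are retried; any other exception propagates immediately so that real bugs are not hidden.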
The system maintains a list of previously contacted prospects to prevent duplicate outreach and facilitate activity tracking.
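The sketch below shows one way such a history could be kept. The layout of `prospected.json` is not documented here, so the mapping from website to ISO date, like the function names, is an assumption.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def load_prospected(path: Path) -> dict[str, str]:
    """Return the history, or an empty one if the file does not exist yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def mark_contacted(path: Path, website: str, when: date) -> None:
    """Record the contact date for a prospect, keyed by website (assumed key)."""
    history = load_prospected(path)
    history[website] = when.isoformat()
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")


def not_yet_contacted(records: list[dict], history: dict[str, str]) -> list[dict]:
    """Filter out records whose website already appears in the history."""
    return [r for r in records if r.get("website") not in history]


with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "prospected.json"
    mark_contacted(history_file, "https://example.com", date(2024, 1, 15))
    history = load_prospected(history_file)

fresh = not_yet_contacted(
    [{"website": "https://example.com"}, {"website": "https://example.org"}],
    history,
)
```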
The architecture already includes the components required for a web interface through which users manage searches, data collection, and exports.
The system loads all technical and functional parameters required for execution.
This includes:
- Storage paths
- Export settings
- Timeout settings
- Execution limits
- Logging options
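The parameter families listed above could be grouped in a single immutable object, as in this sketch. All field names and default values are illustrative assumptions; only the paths `data/processed` and `logs/app.log` and the two export formats come from this README.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names and defaults are illustrative."""

    processed_dir: Path = Path("data/processed")   # storage path
    log_file: Path = Path("logs/app.log")          # logging option
    export_formats: tuple[str, ...] = ("csv", "json")
    request_timeout_seconds: float = 10.0          # timeout setting
    max_results_per_search: int = 100              # execution limit
    log_level: str = "INFO"


settings = Settings()
```

Freezing the dataclass prevents any component from changing the configuration after startup.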
Required directories are automatically created if needed.
This step ensures the environment is ready before any collection operations.
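A minimal sketch of this initialization step, assuming the required directories are `data/processed` and `logs` as shown in the project tree. The function name `initialize_folders` is hypothetical.

```python
import tempfile
from pathlib import Path

REQUIRED_DIRS = ("data/processed", "logs")  # assumed from the project tree


def initialize_folders(root: Path) -> list[Path]:
    """Make sure each required directory exists under root and return them."""
    ensured = []
    for relative in REQUIRED_DIRS:
        target = root / relative
        target.mkdir(parents=True, exist_ok=True)  # no error if already present
        ensured.append(target)
    return ensured


with tempfile.TemporaryDirectory() as tmp:
    first = initialize_folders(Path(tmp))
    second = initialize_folders(Path(tmp))  # safe to call again
    all_exist = all(p.is_dir() for p in second)
```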
Saved searches are loaded from the search manager.
Each search contains the criteria that will be used by the collectors.
The collector registry provides the list of active sources to use during execution.
This architecture allows adding new sources without modifying the main engine.
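One common way to build such a registry is a decorator that records each collector under a name. The sketch below is an assumption about how `source_registry.py` might work; `register`, `get_active_collectors`, and the two toy collectors are hypothetical.

```python
from typing import Callable

Collector = Callable[[str, str], list[dict]]

_REGISTRY: dict[str, Collector] = {}


def register(name: str) -> Callable[[Collector], Collector]:
    """Decorator that adds a collector function to the registry under name."""
    def decorator(func: Collector) -> Collector:
        _REGISTRY[name] = func
        return func
    return decorator


def get_active_collectors(enabled: set[str]) -> list[Collector]:
    """Return registered collectors whose name is enabled, in registration order."""
    return [func for name, func in _REGISTRY.items() if name in enabled]


@register("alpha")
def alpha_collector(keyword: str, city: str) -> list[dict]:
    return [{"name": f"{keyword} in {city}", "source": "alpha"}]


@register("beta")
def beta_collector(keyword: str, city: str) -> list[dict]:
    return [{"name": f"{keyword} in {city}", "source": "beta"}]


active = get_active_collectors({"beta"})
```

The engine only calls `get_active_collectors`, so registering a new source never requires a change to the engine itself.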
Collectors retrieve data from the configured sources.
Results are converted into a structured format compatible with the rest of the system.
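The structured format could be expressed as a dataclass along these lines. The field names follow the sample record given later in this README; the default values and the `from_raw` helper are assumptions, not the actual content of `models/company.py`.

```python
from dataclasses import asdict, dataclass


@dataclass
class Company:
    """Fields mirror the README's sample record; defaults are assumptions."""

    name: str
    website: str = ""
    email: str = ""
    phone: str = ""
    city: str = ""
    country: str = ""
    sector: str = ""
    source: str = ""
    processed: bool = False

    @classmethod
    def from_raw(cls, raw: dict) -> "Company":
        """Build a Company from a raw dict, ignoring unknown keys."""
        known = {k: v for k, v in raw.items() if k in cls.__dataclass_fields__}
        return cls(**known)


company = Company.from_raw(
    {"name": "Example Company", "website": "https://example.com", "rank": 3}
)
as_dict = asdict(company)
```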
Collected data is prepared for subsequent steps.
This phase removes unnecessary spaces and corrects simple inconsistencies.
Values are harmonized to ensure a consistent format across the entire dataset.
Records are checked to ensure required information is present and usable.
Invalid data can be rejected.
The system detects and removes identical or redundant records to improve final data quality.
Validated data is exported using the configured formats.
Currently supported formats:
- CSV
- JSON
Important events, errors, and execution details are recorded in the system logs.
The workflow ends after file generation and the recording of execution information.
Results are then available in the project's export directories.
The system includes a web interface to manage searches, run collections, and view results in a browser.
The interface allows:
- Creating new searches
- Editing existing searches
- Deleting searches
- Manually running collections
- Viewing results
- Tracking previously contacted prospects
The data flow describes the complete path of information from collection to final export.
Each step occurs at a precise point to ensure quality, consistency, and usability of the results.
User Search
↓
Collector
↓
Collected Data
↓
Cleaning
↓
Normalization
↓
Validation
↓
Deduplication
↓
Export
The process begins with a search saved in the system.
Example:
{
"id": 1,
"name": "Paris Plumbers",
"keyword": "plumber",
"city": "Paris",
"enabled": true
}
This search defines the criteria that will be used by the collectors.
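Searches in this format could be loaded and filtered as sketched below. It is assumed here that `searches.json` holds a list of such objects and that `enabled` decides whether a search runs; the second search and the function name `load_enabled_searches` are invented for the example.

```python
import json

# Inline stand-in for searches.json; the second entry is illustrative only.
SEARCHES_JSON = """
[
  {"id": 1, "name": "Paris Plumbers", "keyword": "plumber", "city": "Paris", "enabled": true},
  {"id": 2, "name": "Lyon Electricians", "keyword": "electrician", "city": "Lyon", "enabled": false}
]
"""


def load_enabled_searches(text: str) -> list[dict]:
    """Parse the stored searches and keep those flagged as enabled."""
    searches = json.loads(text)
    return [s for s in searches if s.get("enabled", False)]


enabled = load_enabled_searches(SEARCHES_JSON)
```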
The collector retrieves information from one or more sources.
The data is converted into a structured format shared across the entire system.
Example:
{
"name": "Example Company",
"website": "https://example.com",
"email": "contact@example.com",
"phone": "+33123456789",
"city": "Paris",
"country": "France",
"sector": "keyword",
"source": "Directory",
"processed": false
}
Data is cleaned to remove unnecessary or inconsistent elements.
Examples:
- Remove unnecessary spaces
- Convert empty values
- Prepare string fields
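The cleaning examples above can be sketched as follows. Mapping empty values to `None` is an assumption, since this README does not say what empty values are converted to; the function names are hypothetical.

```python
from typing import Optional


def clean_value(value: object) -> Optional[str]:
    """Trim and collapse whitespace; map empty values to None (an assumption)."""
    if value is None:
        return None
    text = " ".join(str(value).split())
    return text or None


def clean_record(record: dict) -> dict:
    """Apply clean_value to every field except booleans, which are kept as-is."""
    return {
        key: value if isinstance(value, bool) else clean_value(value)
        for key, value in record.items()
    }


cleaned = clean_record(
    {"name": "  Example   Company ", "email": "", "city": "Paris\n", "processed": False}
)
```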
Values are standardized to ensure a consistent format.
Examples:
PARIS
Paris
paris
become:
Paris
This step improves the overall consistency of the dataset.
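A minimal sketch of the city example, using Python's `str.title()`. Lower-casing email addresses is an additional assumption added for illustration, and both function names are hypothetical.

```python
def normalize_city(city: str) -> str:
    """Title-case a city name so that case variants compare equal."""
    return city.strip().title()


def normalize_record(record: dict) -> dict:
    """Return a copy with city title-cased and email lower-cased, when present."""
    normalized = dict(record)
    if normalized.get("city"):
        normalized["city"] = normalize_city(normalized["city"])
    if normalized.get("email"):
        normalized["email"] = normalized["email"].strip().lower()
    return normalized


variants = [normalize_city(c) for c in ("PARIS", "Paris", "paris")]
```

Normalizing before deduplication matters: two records that differ only by letter case would otherwise be treated as distinct.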
Each record is checked to ensure that required information is present.
Example of required fields:
- Name
- Website
Incomplete or invalid records can be rejected.
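The check could be sketched like this, taking name and website as the required fields from the example above. The URL check with `urllib.parse.urlparse` and the names `validation_errors` and `split_valid` are assumptions made for illustration.

```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("name", "website")  # taken from the example above


def validation_errors(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = [f"missing {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    website = record.get("website")
    if website:
        parsed = urlparse(website)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("invalid website")
    return errors


def split_valid(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate valid records from rejected ones."""
    valid = [r for r in records if not validation_errors(r)]
    rejected = [r for r in records if validation_errors(r)]
    return valid, rejected


valid, rejected = split_valid([
    {"name": "Example Company", "website": "https://example.com"},
    {"name": "", "website": "https://example.org"},
    {"name": "No Scheme", "website": "example.net"},
])
```

Returning the list of problems rather than a boolean makes it possible to log why a record was rejected.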
The system searches for potential duplicates and keeps only unique records.
This step improves export quality and reduces redundant data.
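Deduplication depends on what counts as "the same record", which this README does not specify. The sketch below assumes a key made of the lower-cased name and the website without its trailing slash; `dedup_key` and `deduplicate` are hypothetical names.

```python
def dedup_key(record: dict) -> tuple[str, str]:
    """Assumed identity: case-insensitive name plus website without trailing slash."""
    name = (record.get("name") or "").strip().lower()
    website = (record.get("website") or "").strip().lower().rstrip("/")
    return name, website


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each key, preserving input order."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


unique = deduplicate([
    {"name": "Example Company", "website": "https://example.com"},
    {"name": "example company", "website": "https://example.com/"},
    {"name": "Other Company", "website": "https://example.org"},
])
```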
Validated data is exported in the configured formats.
Currently supported formats:
- CSV
- JSON
Exports are generated in:
data/processed/
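Both export formats can be produced with the standard library, as in this sketch. The file names `companies.csv` and `companies.json` and the function names are hypothetical, and the example writes to a temporary directory rather than the real `data/processed/`.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_csv(records: list[dict], path: Path) -> None:
    """Write records as CSV, using the union of all keys as the header."""
    fieldnames = list(dict.fromkeys(k for r in records for k in r))
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def export_json(records: list[dict], path: Path) -> None:
    """Write records as a JSON array, keeping non-ASCII characters readable."""
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")


rows = [{"name": "Example Company", "city": "Paris"}]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "processed"
    out_dir.mkdir(parents=True)
    export_csv(rows, out_dir / "companies.csv")
    export_json(rows, out_dir / "companies.json")
    csv_lines = (out_dir / "companies.csv").read_text(encoding="utf-8").splitlines()
    json_back = json.loads((out_dir / "companies.json").read_text(encoding="utf-8"))
```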
Important events are recorded in:
logs/app.log
Examples:
Application started
Collection started
250 records collected
Export completed
This step allows tracking system behavior and facilitates diagnostics in case of incidents.
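Log lines like the examples above can be written with Python's standard `logging` module. In this sketch the logger name `data_collection`, the line format, and the function `build_logger` are assumptions; the example writes to a temporary file rather than the real `logs/app.log`.

```python
import logging
import tempfile
from pathlib import Path


def build_logger(log_file: Path) -> logging.Logger:
    """Return a logger that appends timestamped lines to log_file."""
    logger = logging.getLogger("data_collection")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(log_file, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "app.log"
    logger = build_logger(log_path)
    logger.info("Application started")
    logger.info("%d records collected", 250)
    for handler in logger.handlers:
        handler.close()  # flush and release the file before reading it
    log_text = log_path.read_text(encoding="utf-8")
```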
The web interface architecture is already integrated into the project.
The goal is to enable creation, modification, and execution of searches from a dedicated web interface.
Available features:
- Create a search
- Edit an existing search
- Delete a search
- Select collectors
- Manually run collections
- View results
- Track previously contacted prospects
Create a search
↓
Save
↓
Start collection
↓
Process data
↓
Generate exports
↓
View results
The system was built around several core principles to ensure robustness, maintainability, and scalability.
Each component has a clearly defined responsibility.
This approach simplifies development, testing, and future enhancements.
Different modules can evolve independently.
Adding a new collector or exporter does not require changes to the system core.
The architecture allows for easy addition of new data sources, processing steps, or export formats.
The project structure is designed to facilitate code understanding and long-term maintenance.
The system includes validation, logging, error handling, and automatic retry mechanisms to improve reliability.
Data security and integrity are key aspects of the architecture.
Currently implemented features:
- Validation of collected data
- Centralized error handling
- Event logging
- Automatic retries
- Component isolation
- Controlled processing flows
The architecture also allows for the future addition of supplementary security mechanisms as needed.
© Palks Studio — see LICENSE.md
