API Sales Data Pipeline

Project Overview

This project implements a complete data pipeline for fetching, processing, and visualizing stock market data. It demonstrates modern data engineering practices using containerization and workflow orchestration.

The pipeline automates the ETL (Extract, Transform, Load) process for multiple stock tickers, storing data in CSV files and providing an interactive dashboard for analysis.

Key features:

  • Automated data fetching from a public API
  • Data transformation and validation
  • Scheduled pipeline execution with Apache Airflow
  • Interactive dashboard with Streamlit and Plotly
  • Containerized deployment with Docker

Tech Stack

  • Python 3.9+
  • Apache Airflow - Workflow orchestration
  • Docker & Docker Compose - Containerization
  • PostgreSQL - Database for Airflow metadata
  • Streamlit - Web dashboard framework
  • Plotly - Interactive charts
  • Pandas - Data manipulation
  • Requests - HTTP API calls

Dependencies

The following Python packages are required:

  • certifi==2026.1.4
  • charset-normalizer==3.4.4
  • idna==3.11
  • python-dotenv==1.2.1
  • requests==2.32.5
  • urllib3==2.6.3
  • streamlit
  • plotly
  • pandas

Additionally, the project uses:

  • Apache Airflow (containerized)
  • PostgreSQL (containerized)

Project Structure

api_sales_data_pipeline/
│
├── README.md
├── requirements.txt
├── docker-compose.yml
├── Dockerfile
├── Dockerfile.airflow
│
├── app.py                    # Main ETL pipeline script
├── fetch_api_data.py         # Data fetching utility
├── dashboard.py              # Streamlit dashboard
├── test.py                   # Unit tests
├── api_data.csv              # Sample output CSV
│
├── airflow/
│   ├── dags/
│   │   └── api_pipeline_dag.py  # Airflow DAG definition
│   ├── config/
│   ├── logs/
│   └── plugins/
│
├── data/                     # Processed stock data CSVs
│   ├── AAPL_data.csv
│   ├── AMZN_data.csv
│   ├── FUN_data.csv
│   ├── GOOGL_data.csv
│   ├── MSFT_data.csv
│   ├── NVDA_data.csv
│   └── TSLA_data.csv
│
└── logs/                     # Application logs

Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • Git

Installation & Setup

  1. Clone the repository:

    git clone https://github.com/amglng/API_Sales_Data_Pipeline.git
    cd API_Sales_Data_Pipeline
  2. Start the services:

    docker-compose up --build

    This will:

    • Build the Airflow containers
    • Start PostgreSQL database
    • Initialize Airflow database
    • Start Airflow scheduler and webserver
    • Run the data pipeline automatically
  3. Access the services:

    • Airflow Web UI: http://localhost:8080 (username: airflow, password: airflow)
    • Streamlit Dashboard: Run locally after data is processed (see below)

Running the Dashboard Locally

After the pipeline has run and generated data (a minimal dashboard sketch follows these steps):

  1. Install Python dependencies:

    pip install -r requirements.txt
  2. Run the dashboard:

    streamlit run dashboard.py
  3. Access the dashboard: http://localhost:8501
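
For reference, here is a minimal sketch of what such a dashboard could look like. It is an assumption based on the stack (Streamlit, Plotly, Pandas) and the CSV columns shown under Example Output, not the project's actual dashboard.py:

    import pandas as pd
    import plotly.express as px
    import streamlit as st

    st.title("Stock Data Dashboard")

    # Ticker choices mirror the CSVs listed under data/ in Project Structure
    ticker = st.selectbox(
        "Ticker", ["AAPL", "AMZN", "FUN", "GOOGL", "MSFT", "NVDA", "TSLA"]
    )

    # Load the processed CSV for the selected ticker
    df = pd.read_csv(f"data/{ticker}_data.csv", parse_dates=["date"])

    # Interactive closing-price chart rendered with Plotly
    fig = px.line(df, x="date", y="close", title=f"{ticker} closing price")
    st.plotly_chart(fig, use_container_width=True)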


Pipeline Workflow

The pipeline is orchestrated by Apache Airflow and runs the following steps (a minimal code sketch follows the list):

  1. Extract: Fetch stock data from the PocketPortfolio API for configured tickers
  2. Transform: Clean and validate the JSON data, convert to CSV format
  3. Load: Save processed data to individual CSV files in the data/ directory
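
The sketch below illustrates these three steps. It assumes a JSON endpoint of the form BASE_URL/stocks/<ticker> and per-record fields matching the Example Output section; the real app.py may differ:

    import os

    import pandas as pd
    import requests

    BASE_URL = os.getenv("BASE_URL", "https://api.example.com")  # illustrative default
    TICKERS = os.getenv("TICKERS", "AAPL,MSFT").split(",")

    def run_pipeline():
        for ticker in TICKERS:
            # Extract: fetch raw JSON for one ticker (endpoint path is an assumption)
            response = requests.get(f"{BASE_URL}/stocks/{ticker}", timeout=30)
            response.raise_for_status()

            # Transform: normalize into a DataFrame and drop incomplete rows
            df = pd.DataFrame(response.json())
            df = df.dropna(subset=["date", "close"])

            # Load: write one CSV per ticker into data/
            df.to_csv(f"data/{ticker}_data.csv", index=False)

    if __name__ == "__main__":
        run_pipeline()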

Airflow DAG

The DAG (api_pipeline_dag.py) is scheduled to run daily but can also be triggered manually through the Airflow UI.
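
A DAG along these lines would give that behavior; the dag_id, task layout, and imported callable are illustrative rather than copied from api_pipeline_dag.py:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from app import run_pipeline  # assumed entry point in the ETL script

    with DAG(
        dag_id="api_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # Airflow 2.4+ spelling; older versions use schedule_interval
        catchup=False,
    ) as dag:
        run_etl = PythonOperator(
            task_id="run_etl",
            python_callable=run_pipeline,
        )

With catchup disabled, the DAG runs once per day going forward and can still be triggered manually from the Airflow UI.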

Data Sources

Stock market data is fetched from the PocketPortfolio API (see Pipeline Workflow above). One CSV is produced per ticker; the repository ships sample data for AAPL, AMZN, FUN, GOOGL, MSFT, NVDA, and TSLA.

Configuration

Environment variables can be modified in docker-compose.yml (a sketch of reading them in code follows the list):

  • BASE_URL: API endpoint URL
  • TICKERS: Comma-separated list of stock tickers
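
Inside the pipeline code, these variables can be read with os.getenv, optionally loading a local .env file via python-dotenv (already in the dependency list). The defaults below are placeholders, not the project's real values:

    import os

    from dotenv import load_dotenv

    load_dotenv()  # pick up a local .env file, if one exists

    BASE_URL = os.getenv("BASE_URL", "https://api.example.com")
    TICKERS = [t.strip() for t in os.getenv("TICKERS", "AAPL,MSFT").split(",")]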

Development

Running Tests

python test.py

Manual Pipeline Execution

python app.py

Building Docker Images

docker-compose build

Example Output

The pipeline generates CSV files with the following structure:

symbol  date        open    high    low     close   volume
AAPL    2026-02-14  150.25  152.10  149.80  151.75  52847392
AAPL    2026-02-13  148.90  150.50  148.50  150.20  48273928
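
Loading one of these files for ad-hoc analysis takes a couple of lines of Pandas:

    import pandas as pd

    # Parse dates so time-series operations work out of the box
    df = pd.read_csv("data/AAPL_data.csv", parse_dates=["date"])
    print(df.sort_values("date").tail())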

Troubleshooting

Common Issues

  1. Port conflicts: Ensure ports 8080 (Airflow) and 5432 (PostgreSQL) are available
  2. Docker build failures: Check Docker Desktop is running
  3. Data not loading: Verify API connectivity and ticker symbols

Logs

  • Application logs: logs/pipeline.log
  • Airflow logs: airflow/logs/
  • Docker logs: docker-compose logs

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is for educational purposes. Please check the API terms of service for commercial use.


Standalone Script Usage

The fetch_api_data.py utility can also be run on its own, without Docker or Airflow:

python fetch_api_data.py

The script will (a condensed sketch follows this list):

  1. Fetch data from the API
  2. Transform and clean the data
  3. Output the results to api_data.csv
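
A condensed sketch of that flow, using only the standard library and requests; the endpoint URL and field names are assumptions:

    import csv

    import requests

    FIELDS = ["symbol", "date", "open", "high", "low", "close", "volume"]

    # Fetch raw JSON from the API (URL is illustrative)
    response = requests.get("https://api.example.com/stocks/AAPL", timeout=30)
    response.raise_for_status()

    # Write the cleaned records to a single CSV for analysis
    with open("api_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for record in response.json():
            writer.writerow({field: record.get(field) for field in FIELDS})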

Future Enhancements

  • Add error handling for API failures
  • Implement retry logic for failed requests
  • Add data validation tests
  • Store data in a SQL database instead of CSV


Author

Amogelang Ngene
