This project implements a complete data pipeline for fetching, processing, and visualizing stock market data. It demonstrates modern data engineering practices using containerization and workflow orchestration.
The pipeline automates the ETL (Extract, Transform, Load) process for multiple stock tickers, storing data in CSV files and providing an interactive dashboard for analysis.
Key features:
- Automated data fetching from a public API
- Data transformation and validation
- Scheduled pipeline execution with Apache Airflow
- Interactive dashboard with Streamlit and Plotly
- Containerized deployment with Docker
- Python 3.9+
- Apache Airflow - Workflow orchestration
- Docker & Docker Compose - Containerization
- PostgreSQL - Database for Airflow metadata
- Streamlit - Web dashboard framework
- Plotly - Interactive charts
- Pandas - Data manipulation
- Requests - HTTP API calls
The following Python packages are required:
certifi==2026.1.4
charset-normalizer==3.4.4
idna==3.11
python-dotenv==1.2.1
requests==2.32.5
urllib3==2.6.3
streamlit
plotly
pandas
Additionally, the project uses:
- Apache Airflow (containerized)
- PostgreSQL (containerized)
api_sales_data_pipeline/
│
├── README.md
├── requirements.txt
├── docker-compose.yml
├── Dockerfile
├── Dockerfile.airflow
│
├── app.py # Main ETL pipeline script
├── fetch_api_data.py # Data fetching utility
├── dashboard.py # Streamlit dashboard
├── test.py # Unit tests
├── api_data.csv # Sample output CSV
│
├── airflow/
│ ├── dags/
│ │ └── api_pipeline_dag.py # Airflow DAG definition
│ ├── config/
│ ├── logs/
│ └── plugins/
│
├── data/ # Processed stock data CSVs
│ ├── AAPL_data.csv
│ ├── AMZN_data.csv
│ ├── FUN_data.csv
│ ├── GOOGL_data.csv
│ ├── MSFT_data.csv
│ ├── NVDA_data.csv
│ └── TSLA_data.csv
│
└── logs/ # Application logs
- Docker and Docker Compose installed
- Git
- Clone the repository:

  git clone https://github.com/yourusername/api-sales-data-pipeline.git
  cd api-sales-data-pipeline

- Start the services:

  docker-compose up --build
This will:
- Build the Airflow containers
- Start PostgreSQL database
- Initialize Airflow database
- Start Airflow scheduler and webserver
- Run the data pipeline automatically
- Access the services:
  - Airflow Web UI: http://localhost:8080 (username: airflow, password: airflow)
  - Streamlit Dashboard: run locally after the data has been processed (see below)
After the pipeline has run and generated data:
- Install Python dependencies:

  pip install -r requirements.txt

- Run the dashboard:

  streamlit run dashboard.py

- Access the dashboard at http://localhost:8501
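For orientation, here is a minimal sketch of what a Streamlit/Plotly dashboard over the generated CSVs can look like. It assumes the data/<TICKER>_data.csv files and the column layout shown in the output table further below, and is not necessarily identical to the project's dashboard.py.

```python
# Illustrative sketch only; the real dashboard.py may differ.
import glob
import os

import pandas as pd
import plotly.express as px
import streamlit as st

st.title("Stock Data Dashboard")

# Discover available tickers from the CSV files produced by the pipeline.
csv_files = glob.glob("data/*_data.csv")
tickers = sorted(os.path.basename(f).replace("_data.csv", "") for f in csv_files)

if not tickers:
    st.warning("No CSV files found in data/. Run the pipeline first.")
    st.stop()

ticker = st.selectbox("Select a ticker", tickers)

# Columns assumed: symbol, date, open, high, low, close, volume.
df = pd.read_csv(f"data/{ticker}_data.csv", parse_dates=["date"]).sort_values("date")

# Interactive closing-price chart.
fig = px.line(df, x="date", y="close", title=f"{ticker} closing price")
st.plotly_chart(fig, use_container_width=True)

st.dataframe(df.tail(10))
```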
The pipeline is orchestrated by Apache Airflow and runs the following steps:
- Extract: Fetch stock data from the PocketPortfolio API for configured tickers
- Transform: Clean and validate the JSON data, convert to CSV format
- Load: Save processed data to individual CSV files in the data/ directory
The DAG (api_pipeline_dag.py) is scheduled to run daily but can also be triggered manually through the Airflow UI.
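The actual task layout lives in airflow/dags/api_pipeline_dag.py; the sketch below only illustrates what a daily extract, transform, load DAG of this shape typically looks like. The task names and callables here are hypothetical stand-ins, not the project's real ones.

```python
# Illustrative sketch only; see airflow/dags/api_pipeline_dag.py for the real DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stand-ins for the project's ETL functions.
def extract(**context):
    ...  # fetch JSON from the API for each configured ticker

def transform(**context):
    ...  # clean/validate the JSON and convert it to tabular form

def load(**context):
    ...  # write one CSV per ticker into data/

with DAG(
    dag_id="api_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # runs daily; can also be triggered manually in the UI
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```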
- API: https://pocketportfolio.app/api/tickers
- Tickers: FUN, AAPL, GOOGL, AMZN, MSFT, TSLA, NVDA
Environment variables can be modified in docker-compose.yml:
- BASE_URL: API endpoint URL
- TICKERS: Comma-separated list of stock tickers
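As a sketch of how these settings might be consumed in the Python code (variable names and defaults as documented above; python-dotenv from requirements.txt can additionally load them from a local .env file when running outside Docker):

```python
import os

from dotenv import load_dotenv  # python-dotenv; optional when running under Docker

load_dotenv()  # picks up a local .env file if present; harmless otherwise

# Defaults mirror the values documented above.
BASE_URL = os.getenv("BASE_URL", "https://pocketportfolio.app/api/tickers")
TICKERS = [
    t.strip()
    for t in os.getenv("TICKERS", "FUN,AAPL,GOOGL,AMZN,MSFT,TSLA,NVDA").split(",")
]

print(BASE_URL, TICKERS)
```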
Other useful commands:
- Run the unit tests: python test.py
- Run the ETL script locally: python app.py
- Rebuild the Docker images: docker-compose build

The pipeline generates CSV files with the following structure:
| symbol | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| AAPL | 2026-02-14 | 150.25 | 152.10 | 149.80 | 151.75 | 52847392 |
| AAPL | 2026-02-13 | 148.90 | 150.50 | 148.50 | 150.20 | 48273928 |
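For a quick sanity check of the output, the CSVs can be loaded with pandas. The snippet below assumes the column layout shown in the table above and the data/AAPL_data.csv path from the project structure.

```python
import pandas as pd

# Load one ticker's output and verify the expected columns are present.
df = pd.read_csv("data/AAPL_data.csv", parse_dates=["date"])
expected = {"symbol", "date", "open", "high", "low", "close", "volume"}
missing = expected - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Example derived metric: day-over-day percentage change in the closing price.
df = df.sort_values("date")
df["daily_return"] = df["close"].pct_change()
print(df[["date", "close", "daily_return"]].tail())
```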
- Port conflicts: Ensure ports 8080 (Airflow) and 5432 (PostgreSQL) are available
- Docker build failures: Check Docker Desktop is running
- Data not loading: Verify API connectivity and ticker symbols
- Application logs: logs/pipeline.log
- Airflow logs: airflow/logs/
- Docker logs: docker-compose logs
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is for educational purposes. Please check the API terms of service for commercial use.
Run the pipeline:
python fetch_api_data.py

The script will:
- Fetch data from the API
- Transform and clean the data
- Output the results to api_data.csv
The standalone script requires:
- Python 3.6+
- The requests library
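As a rough sketch of what such a standalone fetch script can look like (the endpoint is the one documented above; the shape of the API response and the field handling are assumptions, so the real fetch_api_data.py may differ):

```python
# Illustrative sketch; the real fetch_api_data.py may differ.
import csv

import requests

API_URL = "https://pocketportfolio.app/api/tickers"

def main():
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()  # assumed to be a list of dicts; adjust to the real payload

    if not records:
        print("No data returned from the API")
        return

    # Write all records to a single CSV, one column per field in the first record.
    with open("api_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

    print(f"Wrote {len(records)} rows to api_data.csv")

if __name__ == "__main__":
    main()
```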
- Add error handling for API failures
- Implement retry logic for failed requests
- Add data validation tests
- Schedule the pipeline to run automatically
- Store data in a SQL database instead of CSV
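For the retry item above, one common approach (not yet part of this codebase) is to mount urllib3's Retry policy onto a requests session:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3) -> requests.Session:
    """Return a requests session that retries transient failures with backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=1,  # 1s, 2s, 4s, ... between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

# Usage:
# session = make_session()
# response = session.get("https://pocketportfolio.app/api/tickers", timeout=30)
```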
This project is for educational purposes.
Amogelang Ngene