In this project, I explored a real-world supply chain dataset to uncover what drives market performance, how delivery efficiency interacts with revenue, and why some shipping modes struggle despite strong sales.
The company operates globally in the e-commerce industry, shipping products across markets like Europe, LATAM, Africa, Pacific Asia, and USCA. As a data analyst on the operations team, my goal was to investigate delivery delays, identify underperforming shipping modes, and understand the relationship between on-time delivery (OTD) and business revenue.
The analysis began with a key hypothesis:
“If we improve on-time delivery (OTD), total revenue and order volume will also increase.”
That assumption didn’t hold but what I found instead revealed how operational design and delivery trade-offs shape performance more than timing ever could.
The dataset includes both structured and unstructured data sources:
- DataCoSupplyChainDataset.csv → contains all sales, delivery, and order-level data across multiple markets.
- tokenized_access_logs.csv (Clickstream) → captures user activity and browsing behavior.
- DescriptionDataCoSupplyChain.csv → provides definitions for all columns in the structured dataset.
Before diving into analysis, I ensured the data was clean and consistent:
- Standardized date formats for order and delivery timestamps.
- Verified that product categories, markets, and regional hierarchies were aligned.
- Checked for missing delivery data and duplicates in key transaction records.
- Created derived fields for OTD, delay days, and YoY performance.
From 2015 to 2017, the company generated roughly $20.2M in revenue across five major markets. Revenue peaked in 2015 (around $7M), fluctuated through 2016–2017, and ended slightly lower but more stable.
Europe emerged as the top-performing market, while Africa despite having 48 active countries contributed the least to overall revenue. Interestingly, the US (within the USCA market) recorded exceptionally high sales in short bursts, reaching about $250K in a few months between April and August.
The most significant dip occurred in April 2016, primarily driven by a slowdown in Europe’s order volume. Since Europe represents the largest market, any fluctuation there had an outsized effect on overall performance.
Through deeper analysis, I found that while order volume fluctuations drove revenue peaks and dips, delivery efficiency (OTD) wasn’t strongly correlated with revenue. Shipping faster didn’t necessarily mean selling moremeaning something deeper was shaping the company’s growth.
At this stage, I shifted focus from markets to shipping modes, aiming to understand whether the delays were linked to transportation type rather than geography.
The scatter plot comparing revenue vs. average delay revealed a clear trade-off: higher-revenue shipping modes often showed longer delays, especially Second Class shipments. This shifted the narrative the challenge wasn’t customer demand, it was operational execution.
- Second Class shipping generated one of the highest average delays (~2.5 days) yet ranked second in total revenue. Customers kept choosing it, suggesting price or availability outweighed punctuality.
- Standard Class emerged as the most reliable performer. It had balanced revenue, strong OTD, and interestingly it was the only mode that delivered orders in advance.
- First Class, despite being the premium option, consistently showed a high rate of late or cancelled deliveries, making it the weakest in delivery reliability.
- Across all modes, Late Delivery dominated the delivery status counts. However, when orders did arrive early (“Advanced Delivery”), they tended to align with higher revenue and smoother operations again, tied exclusively to Standard Class.
When testing the hypothesis “Does improving OTD increase total revenue?” the data said no.
The Pearson correlation between OTD and revenue was weak, meaning that improving on-time delivery alone wouldn’t directly boost sales. Profitability leaned more on order volume and shipping mode efficiency than delivery speed.
Key takeaway:
Customers tolerate delay when the trade-off is convenience or lower cost. Standard Class quietly proved operational excellence that should be scaled across other modes.
After identifying the underperforming modes, I focused on understanding why the delays occurred.
The problem didn’t lie in the shipping mode itself but in where the shipments originated. Two states repeatedly surfaced as pain points: Puerto Rico (PR) and California (CA).
- Both states handled the largest order volumes and contributed the highest revenue totals, especially among delayed shipments.
- When plotting average delay days vs. total sales, PR and CA consistently appeared as high-revenue, high-delay clusters.
- The departments most affected were Footwear, Apparel, Tech, Outdoors, and Golf categories that also had the highest cross-market demand.
- These findings suggest that warehouse congestion and extended transport distances from these states were the main drivers of delay.
In plain terms:
The company’s most profitable markets rely too heavily on two overburdened states for fulfillment. As a result, even a strong sales performance is undermined by recurring delivery inefficiencies.
Recommendations:
- Redistribute order processing and fulfillment load from PR and CA to less saturated warehouses.
- Streamline Second Class routes or merge overlapping paths with Standard Class to reduce redundancy.
- Enhance the existing delay-risk flag by incorporating dynamic congestion indicators and operationalize it earlier in the order allocation process.
Short technical tour for data analysts and hiring managers.
- Place raw data files in
datasets/raw/. - Load the data into MySQL schema
supply_chain. - Run scripts in
01_data_preparation/to clean and prepare the dataset. - Execute scripts in numbered order (
02_exploratory_analysis/→04_root_cause_analysis/). - Requires MySQL with support for CTEs, window functions, and date calculations.
| Folder | Purpose |
|---|---|
01_data_preparation/ |
Data cleaning, standardization, and preparation of delay & order data |
02_exploratory_analysis/ |
Initial KPI calculations and delay distribution |
03_delay_analysis/ |
Hypothesis testing (OTD rate vs Revenue correlation) |
04_root_cause_analysis/ |
Warehouse bottleneck and shipping mode root cause investigation |
05_shipping_mode_analysis/ |
Performance comparison across shipping classes |
06_reporting/ |
Summary reports and performance tables |
1. Delay Rate Calculation
Core metric used for hypothesis testing (On-Time Delivery rate).
Full script: scripts/03_delay_analysis/otd_rate_calculation.sql
SELECT
shipping_mode,
COUNT(*) AS total_orders,
SUM(CASE WHEN on_time = 1 THEN 1 ELSE 0 END) AS on_time_orders,
ROUND(SUM(CASE WHEN on_time = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS otd_rate
FROM supply_chain.orders
GROUP BY shipping_mode
ORDER BY otd_rate DESC;
2. Revenue vs Delay Correlation
Tests the main hypothesis (does better delivery performance drive higher revenue?).
Full script: scripts/03_delay_analysis/revenue_vs_delay_correlation.sql
SELECT
state,
ROUND(AVG(delay_days), 2) AS avg_delay_days,
ROUND(SUM(revenue), 2) AS total_revenue,
ROUND(CORR(delay_days, revenue), 4) AS correlation_coefficient
FROM supply_chain.orders
GROUP BY state
ORDER BY avg_delay_days ASC; 3. Root Cause-Warehouse Bottleneck Analysis
Identifies PR and CA as the main sources of severe delays.
Full script: scripts/04_root_cause_analysis/warehouse_bottleneck.sql
SELECT
warehouse_state,
COUNT(*) AS total_orders,
SUM(CASE WHEN delay_days > 5 THEN 1 ELSE 0 END) AS severe_delays,
ROUND(SUM(CASE WHEN delay_days > 5 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS severe_delay_rate
FROM supply_chain.orders
GROUP BY warehouse_state
ORDER BY severe_delay_rate DESC;4. Shipping Mood Performace
Compares revenue and delay trade-offs across shipping classes.
Full script: scripts/05_shipping_mode_analysis/shipping_mode_performance.sql
SELECT
shipping_mode,
ROUND(AVG(delay_days), 2) AS avg_delay,
ROUND(SUM(revenue), 2) AS total_revenue,
ROUND(SUM(revenue) / COUNT(*), 2) AS revenue_per_order
FROM supply_chain.orders
GROUP BY shipping_mode
ORDER BY avg_delay ASC;