Cleaning, reformatting, and splitting bank marketing campaign data for PostgreSQL ingestion.
Personal loans are a lucrative revenue stream for banks. The typical interest rate of a two-year loan in the United Kingdom is around 10%. In September 2022 alone, UK consumers borrowed around £1.5 billion — approximately £300 million in interest generated by banks over two years.
This project cleans data collected from a bank's recent marketing campaign aimed at getting customers to take out a personal loan. The cleaned data is structured to be imported into a PostgreSQL database, enabling future campaigns to reuse the same schema seamlessly.
├── bank_marketing.csv # Raw input data
├── notebook.ipynb # Main Jupyter Notebook
├── client.csv # Output: client demographic data
├── campaign.csv # Output: campaign interaction data
└── economics.csv # Output: economic indicator data
Clean, reformat, and split bank_marketing.csv into three separate CSV files:
| File | Description |
|---|---|
client.csv |
Client demographic and financial profile |
campaign.csv |
Campaign contact history and outcomes |
economics.csv |
Macroeconomic indicators at time of contact |
| Column | Type | Description | Cleaning |
|---|---|---|---|
client_id |
integer |
Client ID | — |
age |
integer |
Age in years | — |
job |
object |
Type of job | Replace "." → "_" |
marital |
object |
Marital status | — |
education |
object |
Education level | Replace "." → "_", "unknown" → NaN |
credit_default |
bool |
Credit in default? | 1 if "yes", else 0 |
mortgage |
bool |
Has housing loan? | 1 if "yes", else 0 |
| Column | Type | Description | Cleaning |
|---|---|---|---|
client_id |
integer |
Client ID | — |
number_contacts |
integer |
Contacts in current campaign | — |
contact_duration |
integer |
Last contact duration (seconds) | — |
previous_campaign_contacts |
integer |
Contacts in previous campaign | — |
previous_outcome |
bool |
Previous campaign outcome | 1 if "success", else 0 |
campaign_outcome |
bool |
Current campaign outcome | 1 if "yes", else 0 |
last_contact_date |
datetime |
Last contact date | Built from day + month + 2022, format: YYYY-MM-DD |
| Column | Type | Description | Cleaning |
|---|---|---|---|
client_id |
integer |
Client ID | — |
cons_price_idx |
float |
Consumer price index | — |
euribor_three_months |
float |
Euribor 3-month rate | — |
import pandas as pd
import numpy as np
df = pd.read_csv("bank_marketing.csv")
# ── client.csv ────────────────────────────────────────────────────────────────
client = df[["client_id", "age", "job", "marital", "education",
"credit_default", "mortgage"]].copy()
client["job"] = client["job"].str.replace(".", "_", regex=False)
client["education"] = (client["education"]
.str.replace(".", "_", regex=False)
.replace("unknown", np.NaN))
client["credit_default"] = client["credit_default"].apply(lambda x: 1 if x == "yes" else 0).astype(bool)
client["mortgage"] = client["mortgage"].apply(lambda x: 1 if x == "yes" else 0).astype(bool)
client.to_csv("client.csv", index=False)
# ── campaign.csv ──────────────────────────────────────────────────────────────
campaign = df[["client_id", "number_contacts", "contact_duration",
"previous_campaign_contacts", "previous_outcome",
"campaign_outcome", "day", "month"]].copy()
campaign["previous_outcome"] = campaign["previous_outcome"].apply(lambda x: 1 if x == "success" else 0).astype(bool)
campaign["campaign_outcome"] = campaign["campaign_outcome"].apply(lambda x: 1 if x == "yes" else 0).astype(bool)
campaign["last_contact_date"] = pd.to_datetime(
campaign["day"].astype(str) + "-" + campaign["month"] + "-2022",
format="%d-%b-%Y"
).dt.strftime("%Y-%m-%d")
campaign = campaign.drop(columns=["day", "month"])
campaign.to_csv("campaign.csv", index=False)
# ── economics.csv ─────────────────────────────────────────────────────────────
economics = df[["client_id", "cons_price_idx", "euribor_three_months"]].copy()
economics.to_csv("economics.csv", index=False)-
Clone the repository
git clone https://github.com/your-username/bank-marketing-cleaning.git cd bank-marketing-cleaning -
Install dependencies
pip install pandas numpy
-
Run the notebook or script
jupyter notebook notebook.ipynb # or python solution.py -
Check outputs —
client.csv,campaign.csv, andeconomics.csvwill be generated in the project root.
- Boolean encoding —
"yes"/"success"→True (1), all others →False (0) - String cleaning — dots replaced with underscores in
jobandeducationfields - Null handling —
"unknown"education values replaced withNaN - Date construction —
last_contact_dateassembled from separateday,monthcolumns with year fixed at2022
- Python 3.8+
- pandas — data manipulation and CSV I/O
- NumPy — NaN handling
This project is licensed under the MIT License.