Task 1: Data Cleaning and Preprocessing – Data Analyst Internship
Objective
Clean and preprocess a raw marketing dataset by:
- Handling missing values
- Removing duplicate rows
- Standardizing text and date formats
- Renaming columns
- Correcting data types
This process prepares the dataset for analysis and ensures data quality.
Dataset Used
Name: Customer Personality Analysis
Filename: marketing_campaign.csv
Source: Kaggle Dataset
Description: The dataset contains customer demographics, spending habits, and campaign response data. It is useful for customer segmentation and marketing analysis.
Data Cleaning Steps Performed
- Removed Duplicates
  - Used drop_duplicates() to eliminate any duplicate entries.
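A minimal sketch of the duplicate-removal step, run on a few made-up rows rather than the real file. The column values and the rows_before / rows_removed names are illustrative assumptions, not taken from data_cleaning.py.

```python
import pandas as pd

# Toy rows standing in for the raw data (values are illustrative only).
df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Income": [58000.0, 46000.0, 46000.0, 71000.0],
})

rows_before = len(df)

# Drop rows that are identical across every column, keeping the first copy.
df = df.drop_duplicates().reset_index(drop=True)

rows_removed = rows_before - len(df)
print(f"Removed {rows_removed} duplicate row(s)")
```

Logging the number of removed rows is optional, but it makes it easy to confirm that the step did what was expected.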
- Handled Missing Values
  - Filled missing values in the Income column with the median.
  - Dropped remaining rows with missing values using dropna().
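The two missing-value steps could look roughly like this. The sample frame is invented for illustration, and income_median is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Toy data with one missing Income and one missing Education value.
df = pd.DataFrame({
    "Income": [30000.0, np.nan, 50000.0, 70000.0],
    "Education": ["PhD", "Master", None, "Basic"],
})

# median() skips NaN by default, so it is computed from the known incomes.
income_median = df["Income"].median()
df["Income"] = df["Income"].fillna(income_median)

# Any row that still has a missing value in another column is dropped.
df = df.dropna().reset_index(drop=True)
```

The median is less sensitive to a few very high incomes than the mean, which is a common reason to prefer it for this kind of column.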
- Standardized Text Fields
  - Converted Education and Marital_Status to lowercase and removed extra spaces using .str.lower().str.strip().
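As a small illustration of the text standardization, the same chain of string methods can be applied to both columns in a loop. The sample values below are made up to show mixed case and stray spaces.

```python
import pandas as pd

# Toy values with inconsistent casing and leading/trailing spaces.
df = pd.DataFrame({
    "Education": [" Graduation", "PhD ", "MASTER"],
    "Marital_Status": ["Married ", " single", "Together"],
})

# Lowercase each text column, then trim surrounding whitespace.
for col in ["Education", "Marital_Status"]:
    df[col] = df[col].str.lower().str.strip()
```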
- Converted Date Formats
  - Converted the Dt_Customer column to a consistent datetime format (DD-MM-YYYY).
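One way to do the date conversion, assuming the raw Dt_Customer values are DD-MM-YYYY strings as the README states. The sample dates and the as_text name are illustrative.

```python
import pandas as pd

# Toy dates written as DD-MM-YYYY strings.
df = pd.DataFrame({"Dt_Customer": ["04-09-2012", "21-08-2013", "10-02-2014"]})

# An explicit format avoids day/month ambiguity (04-09 is 4 September here).
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# If a DD-MM-YYYY text version is needed (e.g. for export), format it back.
as_text = df["Dt_Customer"].dt.strftime("%d-%m-%Y")
```

Keeping the column as a datetime type, rather than as formatted text, allows later filtering and grouping by year or month.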
- Renamed Columns
  - Renamed all columns to snake_case using string methods to ensure consistency and readability.
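The README does not say which string methods were used, so the following is only one plausible sketch: trim, lowercase, and replace spaces with underscores. The " Dt Customer " header is a hypothetical example of a messy name.

```python
import pandas as pd

# Empty frame with sample headers; " Dt Customer " is a hypothetical messy name.
df = pd.DataFrame(columns=["ID", "Year_Birth", "Marital_Status", " Dt Customer "])

# Trim whitespace, lowercase, and turn inner spaces into underscores.
df.columns = (
    df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```

Note that this simple chain only lowercases a CamelCase name; it does not insert underscores between the words, which would need a regular expression.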
- Corrected Data Types
  - Created a new age column from Year_Birth (2025 - Year_Birth).
  - Ensured age is an integer and Dt_Customer is in datetime format.
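A rough sketch of the age calculation and type checks. Column names follow the spelling used in this README (the actual script may use the renamed snake_case versions), REFERENCE_YEAR is a hypothetical constant holding the 2025 from the formula, and the sample rows are invented.

```python
import pandas as pd

REFERENCE_YEAR = 2025  # the year used in the README's formula

# Toy rows; Year_Birth is stored as float to mimic a loosely typed column.
df = pd.DataFrame({
    "Year_Birth": [1957.0, 1984.0, 1965.0],
    "Dt_Customer": ["04-09-2012", "21-08-2013", "10-02-2014"],
})

# Derive age and force an integer dtype.
df["age"] = (REFERENCE_YEAR - df["Year_Birth"]).astype(int)

# Make sure the date column is a real datetime, not text.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

print(df.dtypes)
```

Because the reference year is fixed, the age values describe customers as of 2025 and will not update if the script is run in a later year.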
Tools Used
- Python 3.x
- Pandas
- VS Code
Output Files
- cleaned_marketing_campaign.csv: Final cleaned dataset
- marketing_campaign.csv: Original (raw) dataset
- data_cleaning.py: Python script used for cleaning
- README.md: Documentation file (this file)
Kaggle Datasets Suitable for Task 1
- ✅ Customer Personality Analysis (used)