Beauty Wizard

Cosmetic Ingredient Transparency and Regulatory Indicators

A data analysis capstone project exploring cosmetic product formulations, ingredient usage patterns, and regulatory reporting signals through a relational database and reproducible analytics workflow.

Project Overview

Beauty Wizard integrates retail cosmetic product data with government chemical reporting to examine how ingredients are used across products, brands, and categories, and how those usage patterns intersect with regulatory signals.

Rather than labeling products or brands as “safe” or “unsafe,” this project focuses on transparency, formulation complexity, and reporting prevalence, giving analytical context to how ingredients appear in the marketplace and in public regulatory datasets.

All analyses are fully reproducible via a single Jupyter Notebook and a SQLite database.

Core Questions

Which ingredients are most prevalent across cosmetic products?
How complex are typical cosmetic formulations?
Do higher-priced or higher-ranked products differ in ingredient diversity?
Which ingredients and brands appear most frequently in regulatory reporting datasets?
How does regulatory exposure differ when measured at the ingredient, product, or brand level?

Setup and Environment

This project uses a Python virtual environment to ensure reproducible execution across systems. All required dependencies are listed in requirements.txt. Users can recreate the environment locally by creating and activating a virtual environment and installing dependencies before running the notebook. A dedicated Jupyter kernel is registered to ensure consistency between the execution environment and installed packages.

git clone https://github.com/angelakberry/beauty_wizard.git
cd beauty_wizard

python3 -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows (Git Bash)
source venv/Scripts/activate

pip install -r requirements.txt

python -m ipykernel install --user --name beauty_wizard --display-name "Python (beauty_wizard)"

jupyter notebook

Note: If jupyter notebook is not found, use python -m jupyter notebook to launch Jupyter from the active virtual environment.

Select the kernel Python (beauty_wizard) and run all cells.

Note: The venv/ directory is intentionally excluded from version control. All paths are relative to the project root for portability. No external configuration or credentials are required.

Running the Analysis

The primary analysis is contained in beauty_wizard.ipynb at the repository root.

Data Sources

This project combines three independent datasets:

Dataset	Description
Sephora Skincare Product Ingredients (Kaggle)	Retail product listings, prices, rankings, and ingredient text
BeautyFeeds Skincare & Haircare Dataset	Supplemental ingredient and product metadata
California Chemicals in Cosmetics	Government chemical reporting data, including reporting timelines and counts

All datasets were cleaned, standardized, and integrated into a unified schema for analysis.

Methodology

Data Cleaning and Standardization

Normalized column names and text fields (case, whitespace, characters)
Standardized ingredient strings and tokenized ingredient lists
Applied dataset-specific missing data strategies
Preserved real-world variability by flagging, not removing, outliers
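
The normalization step above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the helper names `normalize_column_name` and `normalize_text` are hypothetical, and the exact rules (ASCII folding, underscores as separators) are assumptions.

```python
import re
import unicodedata


def normalize_column_name(name: str) -> str:
    """Fold to ASCII, lowercase, and collapse non-alphanumeric runs to '_'."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.strip().lower()).strip("_")


def normalize_text(value: str) -> str:
    """Lowercase a text field and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", value).strip().lower()


print(normalize_column_name("  Product Name "))  # product_name
print(normalize_text("  Night   Cream\t"))       # night cream
```

Applying the same two helpers to every dataset before joining keeps column and text matching consistent across sources.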

Ingredient Normalization

Ingredients were cleaned and lowercased for consistent matching
Products were expanded to ingredient-level granularity for frequency analysis
Ingredient names were mapped across datasets prior to database insertion
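
As a rough sketch of the expansion to ingredient-level rows, the example below splits a comma-separated ingredient string and emits one (product, ingredient) pair per token. The function names and the sample product are hypothetical, and a plain comma split is an assumption: names that contain commas themselves (for example, 1,2-hexanediol) would need extra handling.

```python
import re


def tokenize_ingredients(raw: str) -> list[str]:
    """Split on commas, clean each token, and drop duplicates in order."""
    tokens: list[str] = []
    for part in raw.split(","):
        token = re.sub(r"\s+", " ", part).strip(" .").lower()
        if token and token not in tokens:
            tokens.append(token)
    return tokens


def explode(products: dict[str, str]) -> list[tuple[str, str]]:
    """Expand {product_name: ingredient_text} into (product, ingredient) rows."""
    return [
        (name, ingredient)
        for name, text in products.items()
        for ingredient in tokenize_ingredients(text)
    ]


# Hypothetical sample row; the repeated "glycerin" collapses to one entry.
rows = explode({"Example Cream": "Water, Glycerin,  Dimethicone, glycerin."})
for row in rows:
    print(row)
```

Rows in this shape map directly onto a product-to-ingredient junction table.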

Outlier Handling

Extreme price values identified using the IQR method
Luxury-priced products retained and transparently flagged
Outliers included in EDA to reflect real market conditions
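
A minimal sketch of IQR-based flagging follows, assuming the conventional 1.5 × IQR fences (the multiplier is not stated in this README) and hypothetical prices. Quartile conventions differ between libraries, so the exact bounds may not match those computed in the notebook.

```python
from statistics import quantiles


def iqr_bounds(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(prices: list[float]) -> list[tuple[float, bool]]:
    """Pair each price with an outlier flag; no rows are removed."""
    low, high = iqr_bounds(prices)
    return [(price, price < low or price > high) for price in prices]


# Hypothetical prices in dollars, with one luxury-priced product.
prices = [20, 25, 30, 35, 40, 45, 50, 300]
flagged = flag_outliers(prices)
print([price for price, is_outlier in flagged if is_outlier])
```

Returning a flag rather than filtering mirrors the approach described above: extreme prices stay in the data and can be included or excluded per analysis.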

Database Design

The project uses SQLite to enforce relational integrity and support SQL-driven analysis.

Core Tables

Products — brand, product name, price, rank, and product type
Ingredients — normalized ingredient master list
ProductIngredients — many-to-many junction table
ChemicalReports — regulatory reporting records linked at the ingredient level

erDiagram
    PRODUCTS {
        INTEGER product_id PK
        TEXT brand
        TEXT product_name
        REAL price
        REAL rank
        TEXT product_type
    }

    INGREDIENTS {
        INTEGER ingredient_id PK
        TEXT ingredient_name
    }

    PRODUCT_INGREDIENTS {
        INTEGER product_id FK
        INTEGER ingredient_id FK
    }

    CHEMICAL_REPORTS {
        INTEGER report_id PK
        TEXT brand
        TEXT product_name
        INTEGER ingredient_id FK
        INTEGER chemicalid
        TEXT chemicalname
        TEXT initialdatereported
        TEXT mostrecentdatereported
        TEXT discontinueddate
        INTEGER chemicalcount
    }

    PRODUCTS ||--o{ PRODUCT_INGREDIENTS : contains
    INGREDIENTS ||--o{ PRODUCT_INGREDIENTS : included_in
    INGREDIENTS ||--o{ CHEMICAL_REPORTS : reported_in

Foreign key constraints are enforced (PRAGMA foreign_keys = ON) to ensure data consistency.
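
The sketch below shows one way the diagram could translate into SQLite DDL using Python's built-in sqlite3 module. It is an illustration under assumptions, not the project's actual schema script: the composite primary key, NOT NULL, and UNIQUE constraints are additions for the example, and an in-memory database stands in for the generated file.

```python
import sqlite3

SCHEMA = """
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    brand        TEXT,
    product_name TEXT,
    price        REAL,
    "rank"       REAL,
    product_type TEXT
);
CREATE TABLE ingredients (
    ingredient_id   INTEGER PRIMARY KEY,
    ingredient_name TEXT UNIQUE
);
CREATE TABLE product_ingredients (
    product_id    INTEGER NOT NULL REFERENCES products(product_id),
    ingredient_id INTEGER NOT NULL REFERENCES ingredients(ingredient_id),
    PRIMARY KEY (product_id, ingredient_id)
);
CREATE TABLE chemical_reports (
    report_id              INTEGER PRIMARY KEY,
    brand                  TEXT,
    product_name           TEXT,
    ingredient_id          INTEGER REFERENCES ingredients(ingredient_id),
    chemicalid             INTEGER,
    chemicalname           TEXT,
    initialdatereported    TEXT,
    mostrecentdatereported TEXT,
    discontinueddate       TEXT,
    chemicalcount          INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the generated .db file
conn.execute("PRAGMA foreign_keys = ON")  # off by default; must be set per connection
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO products (product_id, brand, product_name) "
    "VALUES (1, 'Example Brand', 'Example Cream')"
)

# Ingredient 999 does not exist, so the junction row should be rejected.
rejected = False
try:
    conn.execute(
        "INSERT INTO product_ingredients (product_id, ingredient_id) VALUES (1, 999)"
    )
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)
```

Because SQLite leaves foreign key enforcement off unless the pragma is set on each connection, any script or notebook cell that opens the database needs to issue it again.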

A static ER diagram (.png) is included in the repository under /schema.

Exploratory Data Analysis (EDA)

EDA focuses on understanding the shape of the data before applying relational queries:

Ingredient frequency distributions
Ingredient count distributions by product type
Product price distribution and outliers
Price vs. ranking relationships

These views provide context for interpreting later SQL-based analyses.
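
As an illustration of the ingredient-count view by product type, the sketch below summarizes counts per type with the standard library. The records and the function name are hypothetical; the notebook may use a different toolchain and plot the distributions instead.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (product_type, ingredient_count) pairs.
records = [
    ("Moisturizer", 32),
    ("Moisturizer", 28),
    ("Cleanser", 18),
    ("Cleanser", 22),
    ("Cleanser", 41),
]


def summarize_by_type(pairs):
    """Group ingredient counts by product type and summarize each group."""
    grouped = defaultdict(list)
    for product_type, count in pairs:
        grouped[product_type].append(count)
    return {
        product_type: {
            "n": len(counts),
            "min": min(counts),
            "median": median(counts),
            "max": max(counts),
        }
        for product_type, counts in grouped.items()
    }


summary = summarize_by_type(records)
for product_type, stats in summary.items():
    print(product_type, stats)
```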

Advanced SQL Analyses

Three primary SQL-driven analyses anchor the project:

1. Formulation Complexity by Brand

Measures average number of ingredients per product by brand
Highlights differences in formulation strategies
Clarifies that this metric is not ingredient prevalence
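
A query along these lines could compute the brand-level average. It is a sketch against a tiny in-memory dataset with hypothetical brands and IDs, not the notebook's query, and it assumes the junction table holds at most one row per product and ingredient pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE product_ingredients (product_id INTEGER, ingredient_id INTEGER);
INSERT INTO products VALUES (1, 'Brand A'), (2, 'Brand A'), (3, 'Brand B');
INSERT INTO product_ingredients VALUES
    (1, 10), (1, 11), (1, 12),
    (2, 10),
    (3, 10), (3, 11), (3, 12);
""")

# Total ingredient rows per brand divided by that brand's product count.
QUERY = """
SELECT p.brand,
       COUNT(DISTINCT p.product_id) AS product_count,
       ROUND(COUNT(pi.ingredient_id) * 1.0
             / COUNT(DISTINCT p.product_id), 2) AS avg_ingredients
FROM products AS p
LEFT JOIN product_ingredients AS pi ON pi.product_id = p.product_id
GROUP BY p.brand
ORDER BY avg_ingredients DESC;
"""

rows = conn.execute(QUERY).fetchall()
for brand, product_count, avg_ingredients in rows:
    print(brand, product_count, avg_ingredients)
```

The LEFT JOIN keeps products with no parsed ingredients in the denominator, which lowers a brand's average instead of silently dropping those products.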

2. Ingredient Prevalence Across Products

Identifies ingredients appearing most frequently across products
Demonstrates that a small subset of ingredients dominates formulations
Distinct from brand-level complexity analysis
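
Prevalence can be expressed as the number (and share) of distinct products that contain each ingredient. The query below is a sketch with hypothetical sample rows; the table and column names follow the schema described in this README.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingredients (ingredient_id INTEGER PRIMARY KEY, ingredient_name TEXT);
CREATE TABLE product_ingredients (product_id INTEGER, ingredient_id INTEGER);
INSERT INTO ingredients VALUES (10, 'water'), (11, 'glycerin'), (12, 'dimethicone');
INSERT INTO product_ingredients VALUES
    (1, 10), (1, 11),
    (2, 10),
    (3, 10), (3, 11), (3, 12);
""")

# Count distinct products per ingredient and express it as a share of all products.
QUERY = """
SELECT i.ingredient_name,
       COUNT(DISTINCT pi.product_id) AS product_count,
       ROUND(100.0 * COUNT(DISTINCT pi.product_id)
             / (SELECT COUNT(DISTINCT product_id) FROM product_ingredients), 1)
           AS pct_of_products
FROM ingredients AS i
JOIN product_ingredients AS pi ON pi.ingredient_id = i.ingredient_id
GROUP BY i.ingredient_id, i.ingredient_name
ORDER BY product_count DESC, i.ingredient_name
LIMIT 10;
"""

rows = conn.execute(QUERY).fetchall()
for name, product_count, pct in rows:
    print(name, product_count, pct)
```

Unlike the brand-level average, this metric groups by ingredient, so the two analyses answer different questions even though they read from the same junction table.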

3. Regulatory Reporting Exposure

Evaluated at both product and brand levels
Counts unique ingredients appearing in California Safe Cosmetics Program (CSCP) reports
Emphasizes that reporting frequency ≠ product safety risk
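
One possible formulation of the exposure count, at both brand and product level, is sketched below with hypothetical rows. It assumes reports link to products only through shared ingredient IDs, as the schema in this README describes; the notebook's actual queries may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE product_ingredients (product_id INTEGER, ingredient_id INTEGER);
CREATE TABLE chemical_reports (report_id INTEGER PRIMARY KEY, ingredient_id INTEGER);
INSERT INTO products VALUES (1, 'Brand A'), (2, 'Brand A'), (3, 'Brand B');
INSERT INTO product_ingredients VALUES (1, 10), (1, 11), (2, 11), (2, 12), (3, 10);
INSERT INTO chemical_reports VALUES (100, 11), (101, 11), (102, 12);
""")

# Brand level: unique ingredients used by the brand that appear in any report.
BRAND_QUERY = """
SELECT p.brand,
       COUNT(DISTINCT pi.ingredient_id) AS reported_ingredients
FROM products AS p
JOIN product_ingredients AS pi ON pi.product_id = p.product_id
WHERE pi.ingredient_id IN (SELECT ingredient_id FROM chemical_reports)
GROUP BY p.brand
ORDER BY reported_ingredients DESC, p.brand;
"""

# Product level: the same count, grouped by product instead of brand.
PRODUCT_QUERY = """
SELECT pi.product_id,
       COUNT(DISTINCT pi.ingredient_id) AS reported_ingredients
FROM product_ingredients AS pi
WHERE pi.ingredient_id IN (SELECT ingredient_id FROM chemical_reports)
GROUP BY pi.product_id
ORDER BY reported_ingredients DESC, pi.product_id;
"""

brand_rows = conn.execute(BRAND_QUERY).fetchall()
product_rows = conn.execute(PRODUCT_QUERY).fetchall()
print(brand_rows)
print(product_rows)
```

Brands and products with no reported ingredients drop out of these results because of the WHERE filter. Counting distinct ingredients rather than report rows keeps a heavily reported ingredient from being counted more than once.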

Key Findings

Cosmetic formulations rely heavily on a small set of common ingredients, followed by a long tail of less frequent components.
Most products contain 20 to 40 ingredients, indicating moderate formulation complexity.
Ingredient diversity shows no strong correlation with price or product rank.
Regulatory reporting is concentrated among a relatively small subset of ingredients and brands.
High reporting counts typically reflect widely used ingredients, not necessarily elevated safety concerns.

Limitations

Ingredient presence does not account for concentration or exposure level.
Regulatory datasets reflect reporting activity, not enforcement actions or health outcomes.
Dataset coverage varies by brand and product category.
Results should be interpreted as analytical signals, not consumer safety claims.

Future Extensions

Potential next steps include:

Automated data refresh pipelines
API-driven product lookups
Integration of consumer review sentiment
Expanded regulatory datasets and longitudinal analysis

Reproducibility

All analysis is contained in a single Jupyter Notebook
SQLite database generated programmatically
Command-line Git used throughout development
Notebook structured for portfolio review and PDF export, if needed

Note: Run all notebook cells from top to bottom to fully reproduce results.

Repository Structure

beauty_wizard/
    data/                  .csv dataset files
    schema/                ER diagram image and Mermaid Markdown script
    beauty_wizard.ipynb    Jupyter notebook
    BeautyWiz.db           (generated)
    README.md
    requirements.txt

Project Vision

Beauty Wizard advances ingredient transparency and data-driven beauty research, with the goal of empowering consumers to make smarter, safer, more sustainable choices.

Contributing

Contributions, dataset suggestions, and methodology feedback are welcome. Please open an issue or submit a pull request.

AI Usage Disclosure

AI-assisted tools were used as a support aid for project organization, additional dataset sourcing, requirements checklisting, repository structure review, Git troubleshooting, and documentation formatting (e.g., Markdown and Mermaid diagrams). All data cleaning, analysis, database design, SQL queries, visualizations, and conclusions were independently developed, reviewed, and validated by the author in accordance with program AI usage guidelines.
