Beauty Wizard

Cosmetic Ingredient Transparency and Regulatory Indicators

A data analysis capstone project exploring cosmetic product formulations, ingredient usage patterns, and regulatory reporting signals through a relational database and reproducible analytics workflow.


Project Overview

Beauty Wizard integrates retail cosmetic product data with government chemical reporting to examine how ingredients are used across products, brands, and categories, and how those patterns intersect with regulatory signals.

Rather than labeling products or brands as “safe” or “unsafe,” this project focuses on transparency, formulation complexity, and reporting prevalence, giving analytical context to how ingredients appear in the marketplace and in public regulatory datasets.

All analyses are fully reproducible via a single Jupyter Notebook and a SQLite database.


Core Questions

  • Which ingredients are most prevalent across cosmetic products?
  • How complex are typical cosmetic formulations?
  • Do higher-priced or higher-ranked products differ in ingredient diversity?
  • Which ingredients and brands appear most frequently in regulatory reporting datasets?
  • How does regulatory exposure differ when measured at the ingredient, product, or brand level?

Setup and Environment

This project uses a Python virtual environment to ensure reproducible execution across systems. All required dependencies are listed in requirements.txt. Users can recreate the environment locally by creating and activating a virtual environment and installing dependencies before running the notebook. A dedicated Jupyter kernel is registered to ensure consistency between the execution environment and installed packages.

git clone https://github.com/angelakberry/beauty_wizard.git
cd beauty_wizard

python3 -m venv venv

# macOS / Linux
source venv/bin/activate

# Windows (Git Bash)
source venv/Scripts/activate

pip install -r requirements.txt

python -m ipykernel install --user --name beauty_wizard --display-name "Python (beauty_wizard)"

jupyter notebook

Note: If jupyter notebook is not found, use python -m jupyter notebook to launch Jupyter from the active virtual environment.

Select the kernel Python (beauty_wizard) and run all cells.

Note: The venv/ directory is intentionally excluded from version control. All paths are relative to the project root for portability. No external configuration or credentials are required.

Running the Analysis

The primary analysis is contained in BeautyWizard.ipynb at the repository root.


Data Sources

This project combines three independent datasets:

| Dataset | Description |
| --- | --- |
| Sephora Skincare Product Ingredients (Kaggle) | Retail product listings, prices, rankings, and ingredient text |
| BeautyFeeds Skincare & Haircare Dataset | Supplemental ingredient and product metadata |
| California Chemicals in Cosmetics | Government chemical reporting data, including reporting timelines and counts |

All datasets were cleaned, standardized, and integrated into a unified schema for analysis.


Methodology

Data Cleaning and Standardization

  • Normalized column names and text fields (case, whitespace, characters)
  • Standardized ingredient strings and tokenized ingredient lists
  • Applied dataset-specific missing data strategies
  • Preserved real-world variability by flagging, not removing, outliers
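The normalization steps listed above could look roughly like the sketch below. It is illustrative only: the helper names `normalize_column_name` and `normalize_text` and the sample headers are hypothetical, and the notebook may implement these steps differently (for example with pandas string methods).

```python
import re
import unicodedata


def normalize_column_name(name: str) -> str:
    """Lowercase a column header and reduce it to snake_case ASCII."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace any run of non-alphanumeric characters with a single underscore.
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name.strip())
    return snake.strip("_").lower()


def normalize_text(value: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase a text field."""
    return re.sub(r"\s+", " ", value).strip().lower()


# Hypothetical raw headers and a raw text value, for illustration only.
columns = [" Product Name ", "Price ($)", "Ingredients"]
clean_columns = [normalize_column_name(c) for c in columns]
clean_value = normalize_text("  Aqua   (Water),  GLYCERIN ")

print(clean_columns)
print(clean_value)
```

Keeping the two helpers separate means header cleanup and value cleanup can be applied independently to each dataset.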

Ingredient Normalization

  • Ingredients were cleaned and lowercased for consistent matching
  • Products were expanded to ingredient-level granularity for frequency analysis
  • Ingredient names were mapped across datasets prior to database insertion
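As a minimal sketch of the expansion to ingredient-level granularity, the snippet below splits each ingredient string on commas and produces one (product, ingredient) pair per row. The sample rows and the `tokenize_ingredients` helper are hypothetical. A plain comma split is an assumption that breaks on names containing commas (such as "1,2-hexanediol"), so a real pipeline would likely need extra handling.

```python
from collections import Counter


def tokenize_ingredients(raw: str) -> list[str]:
    """Split a comma-separated ingredient string into cleaned, lowercased tokens."""
    tokens = (part.strip().lower() for part in raw.split(","))
    # Drop empty tokens and duplicates while preserving label order.
    return list(dict.fromkeys(t for t in tokens if t))


# Hypothetical sample rows: (product_id, raw ingredient text).
products = [
    (1, "Water, Glycerin, Niacinamide"),
    (2, "water,  GLYCERIN , Squalane,"),
]

# One (product_id, ingredient) pair per row, the shape of a junction table.
pairs = [(pid, ing) for pid, raw in products for ing in tokenize_ingredients(raw)]

# Number of products in which each ingredient appears.
frequency = Counter(ing for _, ing in pairs)
print(frequency.most_common(2))
```

Because tokens are de-duplicated per product, the counter measures how many products contain an ingredient rather than how many times its name is printed on labels.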

Outlier Handling

  • Extreme price values identified using the IQR method
  • Luxury-priced products retained and transparently flagged
  • Outliers included in EDA to reflect real market conditions
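The flag-rather-than-remove approach can be sketched as follows. The prices are made up, and the 1.5 multiplier and the "inclusive" quantile method are assumptions (the conventional defaults), not values confirmed by the project.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the conventional k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical prices in dollars; the last one stands in for a luxury product.
prices = [12.0, 18.0, 24.0, 30.0, 38.0, 45.0, 52.0, 370.0]
lower, upper = iqr_bounds(prices)

# Keep every row and attach a boolean flag instead of dropping outliers.
flagged = [(price, not (lower <= price <= upper)) for price in prices]
outliers = [price for price, is_outlier in flagged if is_outlier]
print(outliers)
```

Every price stays in `flagged`, so downstream plots can include or highlight the outliers as needed.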

Database Design

The project uses SQLite to enforce relational integrity and support SQL-driven analysis.

Core Tables

  • Products — brand, product name, price, rank, and product type
  • Ingredients — normalized ingredient master list
  • ProductIngredients — many-to-many junction table
  • ChemicalReports — regulatory reporting records linked at the ingredient level
erDiagram
    PRODUCTS {
        INTEGER product_id PK
        TEXT brand
        TEXT product_name
        REAL price
        REAL rank
        TEXT product_type
    }

    INGREDIENTS {
        INTEGER ingredient_id PK
        TEXT ingredient_name
    }

    PRODUCT_INGREDIENTS {
        INTEGER product_id FK
        INTEGER ingredient_id FK
    }

    CHEMICAL_REPORTS {
        INTEGER report_id PK
        TEXT brand
        TEXT product_name
        INTEGER ingredient_id FK
        INTEGER chemicalid
        TEXT chemicalname
        TEXT initialdatereported
        TEXT mostrecentdatereported
        TEXT discontinueddate
        INTEGER chemicalcount
    }

    PRODUCTS ||--o{ PRODUCT_INGREDIENTS : contains
    INGREDIENTS ||--o{ PRODUCT_INGREDIENTS : included_in
    INGREDIENTS ||--o{ CHEMICAL_REPORTS : reported_in


Foreign key constraints are enforced (PRAGMA foreign_keys = ON) to ensure data consistency.
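SQLite does not enforce foreign keys unless the pragma is enabled on each connection. The sketch below, using an in-memory database, shows the effect. Table and column names follow the ER diagram, but the `UNIQUE` constraint, the composite primary key, and the sample rows are assumptions added for illustration, and ChemicalReports is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default and must be enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    brand        TEXT,
    product_name TEXT,
    price        REAL,
    rank         REAL,
    product_type TEXT
);
CREATE TABLE ingredients (
    ingredient_id   INTEGER PRIMARY KEY,
    ingredient_name TEXT UNIQUE
);
CREATE TABLE product_ingredients (
    product_id    INTEGER NOT NULL REFERENCES products(product_id),
    ingredient_id INTEGER NOT NULL REFERENCES ingredients(ingredient_id),
    PRIMARY KEY (product_id, ingredient_id)
);
""")

# Hypothetical rows for illustration only.
conn.execute(
    "INSERT INTO products (product_id, brand, product_name) "
    "VALUES (1, 'Example Brand', 'Example Cream')"
)
conn.execute("INSERT INTO ingredients VALUES (1, 'glycerin')")
conn.execute("INSERT INTO product_ingredients VALUES (1, 1)")  # valid link

try:
    # ingredient_id 999 does not exist, so the constraint rejects the row.
    conn.execute("INSERT INTO product_ingredients VALUES (1, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM product_ingredients").fetchone()[0]
print("orphan row rejected:", rejected, "| links stored:", link_count)
```

Without the pragma, the second insert would succeed silently and leave an orphaned junction row.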

A static ER diagram .png is included in the repository under /schema.


Exploratory Data Analysis (EDA)

EDA focuses on understanding the shape of the data before applying relational queries:

  • Ingredient frequency distributions
  • Ingredient count distributions by product type
  • Product price distribution and outliers
  • Price vs. ranking relationships

These views provide context for interpreting later SQL-based analyses.


Advanced SQL Analyses

Three primary SQL-driven analyses anchor the project:

1. Formulation Complexity by Brand

  • Measures average number of ingredients per product by brand
  • Highlights differences in formulation strategies
  • Clarifies that this metric is not ingredient prevalence

2. Ingredient Prevalence Across Products

  • Identifies ingredients appearing most frequently across products
  • Demonstrates that a small subset of ingredients dominates formulations
  • Distinct from brand-level complexity analysis

3. Regulatory Reporting Exposure

  • Evaluated at both product and brand levels
  • Counts unique ingredients appearing in California Safe Cosmetics Program (CSCP) reports
  • Emphasizes that reporting frequency ≠ product safety risk
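The three analyses above might be expressed as queries like the ones below. This is a sketch against a tiny in-memory database with invented rows. Table and column names follow the ER diagram, but the query text is an assumption and not the notebook's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, brand TEXT, product_name TEXT);
CREATE TABLE ingredients (ingredient_id INTEGER PRIMARY KEY, ingredient_name TEXT);
CREATE TABLE product_ingredients (product_id INTEGER, ingredient_id INTEGER);
CREATE TABLE chemical_reports (report_id INTEGER PRIMARY KEY, ingredient_id INTEGER);

-- Hypothetical rows for illustration only.
INSERT INTO products VALUES
    (1, 'Brand A', 'Serum'), (2, 'Brand A', 'Cream'), (3, 'Brand B', 'Toner');
INSERT INTO ingredients VALUES (1, 'water'), (2, 'glycerin'), (3, 'titanium dioxide');
INSERT INTO product_ingredients VALUES
    (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1), (3, 2);
INSERT INTO chemical_reports VALUES (1, 3), (2, 3);
""")

# 1. Formulation complexity: average ingredient count per product, by brand.
complexity = conn.execute("""
    SELECT brand, AVG(n) AS avg_ingredients
    FROM (
        SELECT p.brand, p.product_id, COUNT(pi.ingredient_id) AS n
        FROM products p
        LEFT JOIN product_ingredients pi ON pi.product_id = p.product_id
        GROUP BY p.brand, p.product_id
    )
    GROUP BY brand
    ORDER BY avg_ingredients DESC
""").fetchall()

# 2. Ingredient prevalence: number of distinct products containing each ingredient.
prevalence = conn.execute("""
    SELECT i.ingredient_name, COUNT(DISTINCT pi.product_id) AS product_count
    FROM ingredients i
    JOIN product_ingredients pi ON pi.ingredient_id = i.ingredient_id
    GROUP BY i.ingredient_id
    ORDER BY product_count DESC, i.ingredient_name
""").fetchall()

# 3. Regulatory exposure: unique reported ingredients used by each brand.
exposure = conn.execute("""
    SELECT p.brand, COUNT(DISTINCT pi.ingredient_id) AS reported_ingredients
    FROM products p
    JOIN product_ingredients pi ON pi.product_id = p.product_id
    WHERE pi.ingredient_id IN (SELECT ingredient_id FROM chemical_reports)
    GROUP BY p.brand
""").fetchall()

print(complexity)
print(prevalence)
print(exposure)
```

Counting distinct ingredient IDs in the third query keeps an ingredient with many reports from inflating a brand's figure, in line with the point that reporting frequency is not a safety score.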

Key Findings

  • Cosmetic formulations rely heavily on a small set of common ingredients, followed by a long tail of less frequent components.
  • Most products contain 20 to 40 ingredients, indicating moderate formulation complexity.
  • Ingredient diversity shows no strong correlation with price or product rank.
  • Regulatory reporting is concentrated among a relatively small subset of ingredients and brands.
  • High reporting counts typically reflect widely used ingredients, not necessarily elevated safety concerns.

Limitations

  • Ingredient presence does not account for concentration or exposure level.
  • Regulatory datasets reflect reporting activity, not enforcement actions or health outcomes.
  • Dataset coverage varies by brand and product category.
  • Results should be interpreted as analytical signals, not consumer safety claims.

Future Extensions

Potential next steps include:

  • Automated data refresh pipelines
  • API-driven product lookups
  • Integration of consumer review sentiment
  • Expanded regulatory datasets and longitudinal analysis

Reproducibility

  • All analysis is contained in a single Jupyter Notebook
  • SQLite database generated programmatically
  • Command-line Git used throughout development
  • Notebook structured for portfolio review and PDF export, if needed

Note: Run all notebook cells from top to bottom to fully reproduce results.

Repository Structure

beauty_wizard
  /data                  .csv dataset files
  /schema                ER diagram image and Mermaid Markdown script
  beauty_wizard.ipynb    Jupyter notebook
  BeautyWiz.db           (generated)
  README.md
  requirements.txt


Project Vision

Beauty Wizard advances ingredient transparency and data-driven beauty research, with the goal of helping consumers make smarter, safer, more sustainable choices.


Contributing

Contributions, dataset suggestions, and methodology feedback are welcome. Please open an issue or submit a pull request.


AI Usage Disclosure

AI-assisted tools were used as a support aid for project organization, additional dataset sourcing, requirements checklisting, repository structure review, Git troubleshooting, and documentation formatting (e.g., Markdown and Mermaid diagrams). All data cleaning, analysis, database design, SQL queries, visualizations, and conclusions were independently developed, reviewed, and validated by the author in accordance with program AI usage guidelines.


About

Python and SQL data analysis of cosmetic products, ingredient patterns, and chemical risk indicators built on a relational SQLite database.
