Cosmetic Ingredient Transparency and Regulatory Indicators
A data analysis capstone project exploring cosmetic product formulations, ingredient usage patterns, and regulatory reporting signals through a relational database and reproducible analytics workflow.
Beauty Wizard integrates retail cosmetic product data with government chemical reporting to examine how ingredients are used across products, brands, and categories—and how those patterns intersect with regulatory signals.
Rather than labeling products or brands as “safe” or “unsafe,” this project focuses on transparency, formulation complexity, and reporting prevalence, giving analytical context to how ingredients appear in the marketplace and in public regulatory datasets.
All analyses are fully reproducible via a single Jupyter Notebook and a SQLite database.
- Which ingredients are most prevalent across cosmetic products?
- How complex are typical cosmetic formulations?
- Do higher-priced or higher-ranked products differ in ingredient diversity?
- Which ingredients and brands appear most frequently in regulatory reporting datasets?
- How does regulatory exposure differ when measured at the ingredient, product, or brand level?
This project uses a Python virtual environment to ensure reproducible execution across systems. All required dependencies are listed in requirements.txt. Users can recreate the environment locally by creating and activating a virtual environment and installing dependencies before running the notebook. A dedicated Jupyter kernel is registered to ensure consistency between the execution environment and installed packages.
git clone https://github.com/angelakberry/beauty_wizard.git
cd beauty_wizard
python3 -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows (Git Bash)
source venv/Scripts/activate
pip install -r requirements.txt
python -m ipykernel install --user --name beauty_wizard --display-name "Python (beauty_wizard)"
jupyter notebook
Note: If jupyter notebook is not found, use python -m jupyter notebook to launch Jupyter from the active virtual environment.
Select the kernel Python (beauty_wizard) and run all cells.
Note: The venv/ directory is intentionally excluded from version control. All paths are relative to the project root for portability. No external configuration or credentials are required.
The primary analysis is contained in BeautyWizard.ipynb at the repository root.
This project combines three independent datasets:
| Dataset | Description |
|---|---|
| Sephora Skincare Product Ingredients (Kaggle) | Retail product listings, prices, rankings, and ingredient text |
| BeautyFeeds Skincare & Haircare Dataset | Supplemental ingredient and product metadata |
| California Chemicals in Cosmetics | Government chemical reporting data, including reporting timelines and counts |
All datasets were cleaned, standardized, and integrated into a unified schema for analysis.
- Normalized column names and text fields (case, whitespace, characters)
- Standardized ingredient strings and tokenized ingredient lists
- Applied dataset-specific missing data strategies
- Preserved real-world variability by flagging, not removing, outliers
- Ingredients were cleaned and lowercased for consistent matching
- Products were expanded to ingredient-level granularity for frequency analysis
- Ingredient names were mapped across datasets prior to database insertion
- Extreme price values identified using the IQR method
- Luxury-priced products retained and transparently flagged
- Outliers included in EDA to reflect real market conditions
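The ingredient tokenization and IQR flagging steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function names `tokenize_ingredients` and `flag_price_outliers`, the comma delimiter, the conventional 1.5 × IQR fences, and the sample values are all assumptions.

```python
import statistics


def tokenize_ingredients(raw: str) -> list[str]:
    """Split a raw ingredient string into cleaned, lowercased tokens.

    Assumes ingredients are comma-separated (hypothetical; real listings may
    need extra handling for parentheses or other delimiters).
    """
    tokens = []
    for part in raw.split(","):
        # Collapse internal whitespace, lowercase, and drop a trailing period.
        name = " ".join(part.split()).lower().rstrip(".")
        if name and name not in tokens:  # skip blanks and duplicates
            tokens.append(name)
    return tokens


def flag_price_outliers(prices: list[float], k: float = 1.5) -> list[bool]:
    """Return one flag per price: True if outside [Q1 - k*IQR, Q3 + k*IQR].

    Values are flagged, not removed, so luxury-priced products stay in the data.
    """
    q1, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [p < lower or p > upper for p in prices]


# Made-up sample values for illustration only.
ingredients = tokenize_ingredients("Water,  GLYCERIN , Niacinamide, glycerin.")
flags = flag_price_outliers([12.0, 18.0, 22.0, 25.0, 30.0, 38.0, 320.0])
```

Returning a parallel list of flags, instead of filtering, mirrors the choice described above to keep extreme prices in the dataset while marking them.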
The project uses SQLite to enforce relational integrity and support SQL-driven analysis.
- Products — brand, product name, price, rank, and product type
- Ingredients — normalized ingredient master list
- ProductIngredients — many-to-many junction table
- ChemicalReports — regulatory reporting records linked at the ingredient level
erDiagram
    PRODUCTS {
        INTEGER product_id PK
        TEXT brand
        TEXT product_name
        REAL price
        REAL rank
        TEXT product_type
    }
    INGREDIENTS {
        INTEGER ingredient_id PK
        TEXT ingredient_name
    }
    PRODUCT_INGREDIENTS {
        INTEGER product_id FK
        INTEGER ingredient_id FK
    }
    CHEMICAL_REPORTS {
        INTEGER report_id PK
        TEXT brand
        TEXT product_name
        INTEGER ingredient_id FK
        INTEGER chemicalid
        TEXT chemicalname
        TEXT initialdatereported
        TEXT mostrecentdatereported
        TEXT discontinueddate
        INTEGER chemicalcount
    }
    PRODUCTS ||--o{ PRODUCT_INGREDIENTS : contains
    INGREDIENTS ||--o{ PRODUCT_INGREDIENTS : included_in
    INGREDIENTS ||--o{ CHEMICAL_REPORTS : reported_in
Foreign key constraints are enforced (PRAGMA foreign_keys = ON) to ensure data consistency.
A static ER diagram (.png) is included in the repository under /schema.
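As a rough sketch of how the schema and foreign key enforcement described above might be set up with Python's built-in sqlite3 module: the column names follow the ER diagram, but the UNIQUE constraint, the composite primary key on the junction table, the in-memory database, and the sample rows are assumptions for illustration, not the project's actual DDL.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Products (
    product_id   INTEGER PRIMARY KEY,
    brand        TEXT,
    product_name TEXT,
    price        REAL,
    rank         REAL,
    product_type TEXT
);
CREATE TABLE Ingredients (
    ingredient_id   INTEGER PRIMARY KEY,
    ingredient_name TEXT UNIQUE              -- assumption: one row per name
);
CREATE TABLE ProductIngredients (
    product_id    INTEGER REFERENCES Products(product_id),
    ingredient_id INTEGER REFERENCES Ingredients(ingredient_id),
    PRIMARY KEY (product_id, ingredient_id)  -- assumption: no duplicate links
);
CREATE TABLE ChemicalReports (
    report_id              INTEGER PRIMARY KEY,
    brand                  TEXT,
    product_name           TEXT,
    ingredient_id          INTEGER REFERENCES Ingredients(ingredient_id),
    chemicalid             INTEGER,
    chemicalname           TEXT,
    initialdatereported    TEXT,
    mostrecentdatereported TEXT,
    discontinueddate       TEXT,
    chemicalcount          INTEGER
);
"""

conn = sqlite3.connect(":memory:")        # in-memory stand-in for the .db file
conn.execute("PRAGMA foreign_keys = ON")  # off by default; set per connection
conn.executescript(SCHEMA)

# Made-up sample rows.
conn.execute(
    "INSERT INTO Products (product_id, brand, product_name) "
    "VALUES (1, 'Brand A', 'Cream')"
)
conn.execute(
    "INSERT INTO Ingredients (ingredient_id, ingredient_name) "
    "VALUES (1, 'glycerin')"
)
conn.execute("INSERT INTO ProductIngredients VALUES (1, 1)")  # valid link

# A link to a nonexistent ingredient should be rejected by the constraint.
try:
    conn.execute("INSERT INTO ProductIngredients VALUES (1, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

SQLite applies the foreign_keys pragma per connection, so it has to be issued each time the database is opened, not once at creation time.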
EDA focuses on understanding the shape of the data before applying relational queries:
- Ingredient frequency distributions
- Ingredient count distributions by product type
- Product price distribution and outliers
- Price vs. ranking relationships
These views provide context for interpreting later SQL-based analyses.
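Two of the EDA views listed above, ingredient counts by product type and the price vs. ranking relationship, could be approximated along these lines. This is a standard-library sketch under stated assumptions: the rows are invented, the `pearson` helper is hypothetical, and the notebook may well use different tooling or statistics.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (product_type, price, rank, ingredient_count)
products = [
    ("moisturizer", 20.0, 4.1, 28),
    ("moisturizer", 68.0, 4.4, 35),
    ("cleanser", 12.0, 3.9, 18),
    ("cleanser", 34.0, 4.5, 22),
    ("serum", 90.0, 4.2, 31),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Ingredient count distribution, summarized as a mean per product type.
counts_by_type = defaultdict(list)
for product_type, _price, _rank, n_ingredients in products:
    counts_by_type[product_type].append(n_ingredients)
mean_count_by_type = {t: mean(c) for t, c in counts_by_type.items()}

# Linear association between price and rank (always between -1 and 1).
price_rank_r = pearson([p[1] for p in products], [p[2] for p in products])
```

A coefficient near zero would be consistent with a weak linear relationship, though it says nothing about nonlinear patterns, which is one reason to plot the data as well.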
Three primary SQL-driven analyses anchor the project:
- Brand-level formulation complexity
  - Measures average number of ingredients per product by brand
  - Highlights differences in formulation strategies
  - Clarifies that this metric is not ingredient prevalence
- Ingredient prevalence
  - Identifies ingredients appearing most frequently across products
  - Demonstrates that a small subset of ingredients dominates formulations
  - Distinct from brand-level complexity analysis
- Regulatory exposure
  - Evaluated at both product and brand levels
  - Counts unique ingredients appearing in California Safe Cosmetics Program (CSCP) reports
  - Emphasizes that reporting frequency ≠ product safety risk
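The three analyses above might look something like the following queries, shown here against a tiny in-memory database. The table and column names follow the schema section, but the query text, the trimmed-down tables, and the sample rows are assumptions for illustration and not the notebook's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY, brand TEXT, product_name TEXT);
CREATE TABLE Ingredients (ingredient_id INTEGER PRIMARY KEY, ingredient_name TEXT);
CREATE TABLE ProductIngredients (product_id INTEGER, ingredient_id INTEGER);
CREATE TABLE ChemicalReports (report_id INTEGER PRIMARY KEY, ingredient_id INTEGER);

-- Made-up sample rows
INSERT INTO Products VALUES (1, 'Brand A', 'Cream'), (2, 'Brand A', 'Serum'),
                            (3, 'Brand B', 'Cleanser');
INSERT INTO Ingredients VALUES (1, 'water'), (2, 'glycerin'),
                               (3, 'titanium dioxide'), (4, 'retinol');
INSERT INTO ProductIngredients VALUES (1, 1), (1, 2), (1, 3),
                                      (2, 1), (2, 4),
                                      (3, 1), (3, 2), (3, 3);
INSERT INTO ChemicalReports VALUES (1, 3), (2, 3), (3, 4);
""")

# 1. Formulation complexity: average ingredients per product, by brand.
complexity = conn.execute("""
    SELECT brand, AVG(n) AS avg_ingredients
    FROM (SELECT p.product_id, p.brand, COUNT(pi.ingredient_id) AS n
          FROM Products p
          LEFT JOIN ProductIngredients pi ON pi.product_id = p.product_id
          GROUP BY p.product_id)
    GROUP BY brand
    ORDER BY avg_ingredients DESC
""").fetchall()

# 2. Ingredient prevalence: number of distinct products containing each one.
prevalence = conn.execute("""
    SELECT i.ingredient_name, COUNT(DISTINCT pi.product_id) AS product_count
    FROM Ingredients i
    JOIN ProductIngredients pi ON pi.ingredient_id = i.ingredient_id
    GROUP BY i.ingredient_id
    ORDER BY product_count DESC, i.ingredient_name
""").fetchall()

# 3. Regulatory exposure: unique reported ingredients used by each brand.
exposure = conn.execute("""
    SELECT p.brand, COUNT(DISTINCT pi.ingredient_id) AS reported_ingredients
    FROM Products p
    JOIN ProductIngredients pi ON pi.product_id = p.product_id
    WHERE pi.ingredient_id IN (SELECT ingredient_id FROM ChemicalReports)
    GROUP BY p.brand
    ORDER BY reported_ingredients DESC, p.brand
""").fetchall()
```

In this toy data, Brand B has the higher average ingredient count while Brand A has more reported ingredients, which illustrates why complexity, prevalence, and reporting exposure are kept as separate measures.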
- Cosmetic formulations rely heavily on a small set of common ingredients, followed by a long tail of less frequent components.
- Most products contain 20 to 40 ingredients, indicating moderate formulation complexity.
- Ingredient diversity shows no strong correlation with price or product rank.
- Regulatory reporting is concentrated among a relatively small subset of ingredients and brands.
- High reporting counts typically reflect widely used ingredients, not necessarily elevated safety concerns.
- Ingredient presence does not account for concentration or exposure level.
- Regulatory datasets reflect reporting activity, not enforcement actions or health outcomes.
- Dataset coverage varies by brand and product category.
- Results should be interpreted as analytical signals, not consumer safety claims.
Potential next steps include:
- Automated data refresh pipelines
- API-driven product lookups
- Integration of consumer review sentiment
- Expanded regulatory datasets and longitudinal analysis
- All analysis is contained in a single Jupyter Notebook
- SQLite database generated programmatically
- Command-line Git used throughout development
- Notebook structured for portfolio review and PDF export, if needed
Note: Run all notebook cells from top to bottom to fully reproduce results.
beauty_wizard
- /data: .csv dataset files
- /schema: ER diagram image and Mermaid Markdown script
- beauty_wizard.ipynb: Jupyter notebook
- BeautyWiz.db (generated)
- README.md
- requirements.txt
Beauty Wizard advances ingredient transparency and data-driven beauty research, empowering consumers to make smarter, safer, more sustainable choices.
Contributions, dataset suggestions, and methodology feedback are welcome. Please open an issue or submit a pull request.
AI-assisted tools were used as a support aid for project organization, additional dataset sourcing, requirements checklisting, repository structure review, Git troubleshooting, and documentation formatting (e.g., Markdown and Mermaid diagrams). All data cleaning, analysis, database design, SQL queries, visualizations, and conclusions were independently developed, reviewed, and validated by the author in accordance with program AI usage guidelines.