This project generates synthetic data from scratch, using constraints and simple rules, without requiring an original dataset.
I built this because in analytics, market research, and capstone projects, you often need realistic-looking data to prototype analysis, dashboards, or workflows, but can’t use real or proprietary data.
- Generates CSV datasets from a YAML specification
- Supports numeric ranges, categories, dates, and IDs
- Allows rule-based dependencies between fields
- Works for any topic (not tied to a specific domain)
The generator doesn’t assume anything about the data’s meaning; it just follows the structure you define.
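As a rough illustration of the four supported field types, here is a minimal sketch using only the standard library. The helper names (`gen_numeric`, `gen_category`, `gen_date`, `gen_id`) and their parameters are hypothetical and chosen for this example; `synthgen.py` may structure this differently.

```python
import random
from datetime import date, timedelta

# Hypothetical helpers for illustration; not necessarily how synthgen.py does it.

def gen_numeric(rng, low, high, integer=False):
    """Draw a value uniformly between low and high (inclusive for integers)."""
    if integer:
        return rng.randint(low, high)
    return round(rng.uniform(low, high), 2)

def gen_category(rng, choices):
    """Pick one label from a fixed list, each equally likely."""
    return rng.choice(choices)

def gen_date(rng, start, end):
    """Pick a calendar date between start and end, both inclusive."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))

def gen_id(index, prefix="ID", width=5):
    """Build a zero-padded sequential identifier such as C00001."""
    return f"{prefix}{index:0{width}d}"

rng = random.Random(42)  # fixed seed so repeated runs give the same row
row = {
    "customer_id": gen_id(1, prefix="C"),
    "age": gen_numeric(rng, 18, 80, integer=True),
    "plan": gen_category(rng, ["free", "basic", "pro"]),
    "signup_date": gen_date(rng, date(2023, 1, 1), date(2023, 12, 31)),
}
print(row)
```

Nothing in these helpers knows what a "customer" or a "plan" is; the column names and bounds carry all of the domain meaning.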
- You describe the dataset shape in a YAML file (columns, bounds, rules)
- The generator creates base values within those bounds
- Rules override values where conditions are met
- The result is written to a CSV file
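The four steps above can be sketched as follows. This is an assumption-laden illustration, not the actual implementation: the `SPEC` dictionary stands in for a parsed YAML file, and its key names (`rows`, `columns`, `rules`, `if`, `set`) as well as the function names are made up for this example.

```python
import csv
import io
import operator
import random

# Stand-in for a parsed YAML spec; the schema shown here is hypothetical.
SPEC = {
    "rows": 5,
    "columns": {
        "age": {"type": "int", "min": 18, "max": 80},
        "income": {"type": "int", "min": 20000, "max": 120000},
        "segment": {"type": "category", "values": ["standard", "premium"]},
    },
    "rules": [
        # When income is at least 90000, force segment to "premium".
        {"if": {"column": "income", "op": ">=", "value": 90000},
         "set": {"segment": "premium"}},
    ],
}

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}

def base_row(columns, rng):
    """Step 2: create base values within the bounds of each column."""
    row = {}
    for name, col in columns.items():
        if col["type"] == "int":
            row[name] = rng.randint(col["min"], col["max"])
        elif col["type"] == "category":
            row[name] = rng.choice(col["values"])
        else:
            raise ValueError(f"unsupported column type: {col['type']}")
    return row

def apply_rules(row, rules):
    """Step 3: override values wherever a rule's condition is met."""
    for rule in rules:
        cond = rule["if"]
        if OPS[cond["op"]](row[cond["column"]], cond["value"]):
            row.update(rule["set"])
    return row

def generate(spec, seed=0):
    """Produce all rows for a spec, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [apply_rules(base_row(spec["columns"], rng), spec["rules"])
            for _ in range(spec["rows"])]

def to_csv(rows, fieldnames):
    """Step 4: serialize rows as CSV text (kept in memory for this sketch)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = generate(SPEC)
print(to_csv(rows, list(SPEC["columns"])))
```

Because rules run after base generation, a rule always wins over the random draw for the columns it sets. With several rules touching the same column, the last matching rule in this sketch takes effect.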
YAML keeps the data definition readable and easy to change without editing Python code. Most changes happen in the spec, not the generator.
- Rule-based dependencies (not statistical modeling)
- No guarantee of real-world distributions
- Optimized for clarity and flexibility, not scale
- Capstone projects
- Analytics prototyping
- Market research simulations
- Synthetic datasets for dashboards or demos
python synthgen.py