🧠 Model-driven synthetic test data for CI/CD and analytics - deterministic, privacy-preserving, and domain-aware. Includes Python APIs, XML pipelines, and MCP/IDE integration to orchestrate realistic datasets for finance, healthcare, and other regulated environments.

rapiddweller/datamimic

DATAMIMIC: Deterministic Synthetic Test Data That Makes Sense

Generate realistic, interconnected, and reproducible test data for finance, healthcare, and beyond.

Faker gives you random data. DATAMIMIC gives you consistent, explainable datasets that respect business logic and domain constraints.

  • 🧬 Patient medical histories that match age and demographics
  • 💳 Bank transactions that obey balance constraints
  • 🛡 Insurance policies aligned with real risk profiles

CI • Coverage • Maintainability • Python • License: MIT • MCP Ready


✨ Why DATAMIMIC?

Typical data generators produce isolated random values. That's fine for unit tests, but meaningless for system, analytics, or compliance testing.

# Faker: broken relationships
from faker import Faker

fake = Faker()
patient_name = fake.name()
patient_age = fake.random_int(1, 99)
conditions = [fake.word()]
# "25-year-old with Alzheimer's": nonsense data

# DATAMIMIC: contextual realism
from datamimic_ce.domains.healthcare.services import PatientService

patient = PatientService().generate()
print(f"{patient.full_name}, {patient.age}, {patient.conditions}")
# "Shirley Thompson, 72, ['Diabetes', 'Hypertension']"

⚙️ Quickstart (Community Edition)

Install and run:

pip install datamimic-ce

Deterministic Generation

DATAMIMIC produces the same data for the same request, across machines and CI runs. Seeds, clocks, and UUIDv5 namespaces enforce reproducibility.

from datamimic_ce.domains.facade import generate_domain

request = {
    "domain": "person",
    "version": "v1",
    "count": 1,
    "seed": "docs-demo",                # identical seed → identical output
    "locale": "en_US",
    "clock": "2025-01-01T00:00:00Z"     # fixed clock = stable time context
}

response = generate_domain(request)
print(response["items"][0]["id"])
# Same input → same output

Determinism Contract

  • Inputs: {seed, clock, uuidv5-namespace, request body}
  • Guarantees: byte-identical payloads + stable determinism_proof.content_hash
  • Scope: all CE domains (see docs for domain-specific caveats)
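
The contract above can be illustrated with a small stdlib-only sketch. This is not DATAMIMIC's implementation: `generate_items`, `content_hash`, the namespace string, and the record fields are hypothetical names chosen for the example. It only shows how a seed, a fixed clock, and a UUIDv5 namespace can combine to give byte-identical payloads and a stable hash.

```python
import hashlib
import json
import random
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid/datamimic-sketch")


def generate_items(request: dict) -> list:
    """Derive every value from the request alone (no wall clock, no global RNG)."""
    rng = random.Random(request["seed"])  # seeded, isolated RNG
    items = []
    for index in range(request["count"]):
        age = rng.randint(18, 90)
        # UUIDv5 is a pure function of (namespace, name), so IDs are reproducible.
        item_id = uuid.uuid5(NAMESPACE, f"{request['seed']}:{index}")
        items.append({"id": str(item_id), "age": age, "created_at": request["clock"]})
    return items


def content_hash(items: list) -> str:
    """Hash a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(items, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


request = {"seed": "docs-demo", "count": 3, "clock": "2025-01-01T00:00:00Z"}
first = generate_items(request)
second = generate_items(request)
print(content_hash(first) == content_hash(second))
```

Canonical encoding matters here: hashing a payload whose key order or whitespace can vary would break the "stable hash" guarantee even when the values are identical.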

⚡ MCP (Model Context Protocol)

Run DATAMIMIC as an MCP server so Claude / Cursor (and agents) can call deterministic data tools.

Install

# Quote the extras so shells such as zsh do not expand the brackets
pip install "datamimic-ce[mcp]"
# Development
pip install -e ".[mcp]"

Run (SSE transport)

export DATAMIMIC_MCP_HOST=127.0.0.1
export DATAMIMIC_MCP_PORT=8765
# Optional auth; clients must send the same token via Authorization: Bearer or X-API-Key
export DATAMIMIC_MCP_API_KEY=changeme
datamimic-mcp

In-proc example (determinism proof)

import json

import anyio
from fastmcp.client import Client
from datamimic_ce.mcp.models import GenerateArgs
from datamimic_ce.mcp.server import create_server


async def main():
    # Identical arguments (including the seed) are sent in both calls
    args = GenerateArgs(domain="person", locale="en_US", seed=42, count=2)
    payload = args.model_dump(mode="python")
    # Passing the server object to the client runs it in-process
    async with Client(create_server()) as c:
        a = await c.call_tool("generate", {"args": payload})
        b = await c.call_tool("generate", {"args": payload})
        # Matching content hashes show that both responses are identical
        print(json.loads(a[0].text)["determinism_proof"]["content_hash"]
              == json.loads(b[0].text)["determinism_proof"]["content_hash"])  # True


anyio.run(main)

Config keys

  • DATAMIMIC_MCP_HOST (default 127.0.0.1)
  • DATAMIMIC_MCP_PORT (default 8765)
  • DATAMIMIC_MCP_API_KEY (unset = no auth)
  • Requests over cap (count > 10_000) are rejected with 422.
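
As a rough illustration of how these keys and the request cap fit together, here is a stdlib-only sketch. The helper names (`load_mcp_config`, `auth_headers`, `exceeds_cap`) are hypothetical and are not part of the DATAMIMIC API; only the environment variable names, defaults, and the 10_000 cap come from the list above.

```python
import os
from typing import Optional

MAX_COUNT = 10_000  # cap stated in the config notes above


def load_mcp_config(env: dict) -> dict:
    """Read the three documented keys, falling back to the documented defaults."""
    return {
        "host": env.get("DATAMIMIC_MCP_HOST", "127.0.0.1"),
        "port": int(env.get("DATAMIMIC_MCP_PORT", "8765")),
        "api_key": env.get("DATAMIMIC_MCP_API_KEY"),  # None means no auth
    }


def auth_headers(api_key: Optional[str]) -> dict:
    """Build client headers; the bearer form is one of the two accepted options."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}


def exceeds_cap(count: int) -> bool:
    """True when a request is large enough to be rejected (422 per the notes above)."""
    return count > MAX_COUNT


config = load_mcp_config(dict(os.environ))
print(config["host"], config["port"], auth_headers(config["api_key"]))
```

A client could call `exceeds_cap` before sending a request to avoid a round trip that is known to fail.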

➡️ Full guide, IDE configs (Claude/Cursor), transports, errors: docs/mcp_quickstart.md


🧩 Domains & Examples

🏥 Healthcare

from datamimic_ce.domains.healthcare.services import PatientService
patient = PatientService().generate()
print(patient.full_name, patient.conditions)

  • Demographically realistic patients
  • Doctor specialties match conditions
  • Hospital capacities and types
  • Longitudinal medical records
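
To make "demographically realistic" concrete, here is a minimal sketch of age-conditioned sampling. It does not reflect how `PatientService` works internally; the `PREVALENCE` table, its probabilities, and `sample_conditions` are invented for illustration and are not medical statistics.

```python
import random

# Hypothetical prevalence table: (min_age, max_age) -> {condition: probability}.
# The numbers are made up for illustration only.
PREVALENCE = {
    (0, 39): {"Asthma": 0.10, "Hypertension": 0.03},
    (40, 64): {"Hypertension": 0.30, "Diabetes": 0.12},
    (65, 120): {"Hypertension": 0.60, "Diabetes": 0.25, "Alzheimer's": 0.10},
}


def sample_conditions(age: int, rng: random.Random) -> list:
    """Pick conditions using only the probabilities defined for this age band."""
    for (low, high), table in PREVALENCE.items():
        if low <= age <= high:
            return [name for name, p in table.items() if rng.random() < p]
    raise ValueError(f"age out of range: {age}")


rng = random.Random(1337)
for age in (25, 50, 72):
    print(age, sample_conditions(age, rng))
```

Because conditions are drawn from the band that contains the patient's age, a 25-year-old can never be assigned a condition that only appears in the 65+ band.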

💰 Finance

from datamimic_ce.domains.finance.services import BankAccountService
account = BankAccountService().generate()
print(account.account_number, account.balance)

  • Balances respect transaction histories
  • Card/IBAN formats per locale
  • Distributions tuned for fraud/reconciliation tests
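
The idea that balances respect transaction histories can be sketched as follows. This is an illustrative stand-in, not the logic inside `BankAccountService`: `generate_transactions`, the amount range, and the clamping rule are assumptions made for the example.

```python
import random
from decimal import Decimal


def generate_transactions(opening: Decimal, count: int, rng: random.Random) -> list:
    """Generate signed amounts while never letting the running balance go negative."""
    balance = opening
    history = []
    for _ in range(count):
        amount = Decimal(rng.randint(-50_000, 50_000)) / 100  # cents to currency units
        if balance + amount < 0:
            amount = -balance  # clamp the withdrawal to the available funds
        balance += amount
        history.append({"amount": amount, "balance_after": balance})
    return history


history = generate_transactions(Decimal("100.00"), 5, random.Random(42))
closing = history[-1]["balance_after"]
# The closing balance equals the opening balance plus the sum of all amounts.
print(closing == Decimal("100.00") + sum(t["amount"] for t in history))
```

Using `Decimal` instead of floats keeps the reconciliation check exact, which matters when the generated data feeds reconciliation tests.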

🌍 Demographics

  • PersonService with locale packs (DE / US / VN), versioned and auditable

🔒 Deterministic by Design

  • Frozen clocks + canonical hashing → reproducible IDs
  • Seeded RNG → identical outputs across runs
  • Schema validation (XSD/JSONSchema) → structural integrity
  • Provenance hashing → audit-ready lineage
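
Two of the ideas in this list, frozen clocks and provenance hashing, can be sketched with the standard library alone. `FrozenClock`, `provenance_hash`, the step names, and the all-zero root hash are hypothetical; DATAMIMIC's actual determinism kit may be structured quite differently.

```python
import hashlib
import json
from datetime import datetime, timedelta


class FrozenClock:
    """Hypothetical clock that starts at a fixed instant and only moves when told to."""

    def __init__(self, start_iso: str) -> None:
        self._now = datetime.fromisoformat(start_iso.replace("Z", "+00:00"))

    def now(self) -> datetime:
        return self._now

    def advance(self, seconds: int) -> None:
        self._now += timedelta(seconds=seconds)


def provenance_hash(parent_hash: str, step: dict) -> str:
    """Chain each step's canonical JSON onto its parent's hash."""
    canonical = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((parent_hash + canonical).encode("utf-8")).hexdigest()


clock = FrozenClock("2025-01-01T00:00:00Z")
lineage = "0" * 64  # hypothetical root hash
for name in ("generate", "validate", "export"):
    step = {"step": name, "at": clock.now().isoformat()}
    lineage = provenance_hash(lineage, step)
    clock.advance(60)
print(lineage)
```

Since timestamps come from the frozen clock and not the wall clock, the final lineage hash is the same on every run, and any change to an earlier step changes every hash after it.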

📘 See Developer Guide


🧮 XML / Python Parity

Python:

from random import Random
from datamimic_ce.domains.common.models.demographic_config import DemographicConfig
from datamimic_ce.domains.healthcare.services import PatientService

cfg = DemographicConfig(age_min=70, age_max=75)
svc = PatientService(dataset="US", demographic_config=cfg, rng=Random(1337))
print(svc.generate().to_dict())

Equivalent XML:

<setup>
  <generate name="seeded_seniors" count="3" target="CSV">
    <variable name="patient" entity="Patient" dataset="US" ageMin="70" ageMax="75" rngSeed="1337" />
    <key name="full_name" script="patient.full_name" />
    <key name="age" script="patient.age" />
    <array name="conditions" script="patient.conditions" />
  </generate>
</setup>
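
To see how the two forms line up, the sketch below reads the XML with Python's standard parser and maps its attributes onto the keyword names used in the Python example. It does not run DATAMIMIC; the mapping (`ageMin` to `age_min`, `rngSeed` to a seed for `Random`, and so on) is inferred from the two snippets above, and the `rng_seed` key is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

XML = """
<setup>
  <generate name="seeded_seniors" count="3" target="CSV">
    <variable name="patient" entity="Patient" dataset="US" ageMin="70" ageMax="75" rngSeed="1337" />
    <key name="full_name" script="patient.full_name" />
    <key name="age" script="patient.age" />
    <array name="conditions" script="patient.conditions" />
  </generate>
</setup>
"""

root = ET.fromstring(XML)
generate = root.find("./generate")
variable = generate.find("./variable")

# Map the XML attributes onto the keyword arguments used in the Python example.
python_kwargs = {
    "dataset": variable.get("dataset"),
    "age_min": int(variable.get("ageMin")),
    "age_max": int(variable.get("ageMax")),
    "rng_seed": int(variable.get("rngSeed")),
}
# Output columns are the <key> and <array> children, in document order.
columns = [child.get("name") for child in generate if child.tag in ("key", "array")]
print(python_kwargs)
print(columns)
```

Note that the XML requests three records (`count="3"`), while the Python snippet calls `generate()` once, so the XML run yields three rows with the same configuration.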

🧰 CLI

# Run instant healthcare demo
datamimic demo create healthcare-example
datamimic run ./healthcare-example/datamimic.xml

# Verify version
datamimic version

Quality gates (repo):

make typecheck   # mypy --strict
make lint        # pylint (≥9.0 score target)
make coverage    # target ≥ 90%

🧭 Architecture Snapshot

  • Core pipeline: Determinism kit • Domain services • Schema validators
  • Governance layer: Group tables • Linkage audits • Provenance hashing
  • Execution layer: CLI • API • XML runners • MCP server

⚖️ CE vs EE

| Feature                               | Community (CE) | Enterprise (EE) |
|---------------------------------------|----------------|-----------------|
| Deterministic domain generation       | ✅             | ✅              |
| XML + Python pipelines                | ✅             | ✅              |
| Healthcare & Finance domains          | ✅             | ✅              |
| Multi-user collaboration              | ❌             | ✅              |
| Governance & lineage dashboards       | ❌             | ✅              |
| ML engines (Mostly AI, Synthcity, …)  | ❌             | ✅              |
| RBAC & audit logging (HIPAA/GDPR/PCI) | ❌             | ✅              |
| EDIFACT / SWIFT adapters              | ❌             | ✅              |

👉 Compare editions • Book a strategy call


📚 Documentation & Community


🚀 Get Started

pip install datamimic-ce

Generate data that makes sense, deterministically. ⭐ Star us on GitHub if DATAMIMIC improves your testing workflow.
