Bottleneck in MarkovRandomField.synthetic_data in large-scale regimes #98

@ryan112358

Calling synthetic_data fails in the following setting:

Reproducing:

import mbi

# domain and cliques are the objects listed under "Domain" and "Cliques" below.
potentials = mbi.CliqueVector.zeros(domain, cliques)
marginals = mbi.marginal_oracles.message_passing_stable(potentials)
model = mbi.MarkovRandomField(potentials=potentials, marginals=marginals)
data = model.synthetic_data(rows=10_000_000)  # this call fails

Analysis

I believe the issue might stem from the need to materialize an array of shape (rows, n_i), where n_i is the domain size of column i. With some columns having domains of roughly 1000 values in this setting, I believe the current implementation requires about 80 GB if the array is represented in float64 format (10,000,000 rows × 1000 values × 8 bytes; specifically, row_cdfs on line 119). Technically, we should only need to compute row_cdfs for the unique values in current_proj_data, rather than for every row, which would save significant memory.
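To make the idea concrete, here is a minimal NumPy sketch, not the actual mbi code. It assumes the conditioning columns of each row have already been encoded as a single integer code, and the names `dense_bytes`, `sample_conditional_grouped`, `parent_codes`, and `cond_probs` are hypothetical. It builds one CDF per unique parent value instead of one per row, so the temporary array has shape (n_unique, n_i) rather than (rows, n_i).

```python
import numpy as np


def dense_bytes(rows, n_i, itemsize=8):
    """Bytes needed for a dense (rows, n_i) array, float64 by default."""
    return rows * n_i * itemsize


def sample_conditional_grouped(parent_codes, cond_probs, rng):
    """Sample one child value per row given an integer parent code.

    parent_codes: int array of shape (rows,), values in [0, n_parent).
    cond_probs:   array of shape (n_parent, n_i); row p holds the
                  (possibly unnormalized) distribution of the child given p.
    Only CDFs for parent values that actually occur are computed.
    """
    rows = parent_codes.shape[0]
    out = np.empty(rows, dtype=np.int64)

    uniq, inverse, counts = np.unique(
        parent_codes, return_inverse=True, return_counts=True
    )
    cdfs = np.cumsum(cond_probs[uniq], axis=1)  # (n_unique, n_i)
    cdfs = cdfs / cdfs[:, -1:]                  # last entry becomes exactly 1

    u = rng.random(rows)                        # uniform draws in [0, 1)
    order = np.argsort(inverse, kind="stable")  # row indices grouped by parent
    start = 0
    for k, count in enumerate(counts):
        idx = order[start:start + count]
        # Inverse-CDF sampling against the single CDF shared by this group.
        out[idx] = np.searchsorted(cdfs[k], u[idx], side="right")
        start += count
    return out


# The dense array from the report: 10M rows, 1000 values, float64.
print(dense_bytes(10_000_000, 1000) / 1e9, "GB")
```

With this grouping the peak temporary memory scales with the number of distinct parent values rather than with the number of rows, at the cost of a Python loop over the groups.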

Minimal Example

The domain and cliques provided below are by no means minimal. It may be easier to fix and test the issue on a smaller example first (with a handful of attributes, at least one of them large-cardinality).
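One way to derive such a smaller example is to restrict the full domain and clique list to a few attributes. The helper below is a hypothetical sketch in plain Python (`restrict` is not part of mbi). The attribute names, sizes, and cliques in the usage lines are a small excerpt of the full lists below.

```python
def restrict(attributes, shape, cliques, keep):
    """Keep only the attributes in `keep` and the cliques fully inside them."""
    sizes = dict(zip(attributes, shape))
    keep_set = set(keep)
    sub_attrs = tuple(a for a in attributes if a in keep_set)
    sub_shape = tuple(sizes[a] for a in sub_attrs)
    sub_cliques = [cl for cl in cliques if set(cl) <= keep_set]
    return sub_attrs, sub_shape, sub_cliques


# Excerpt of the full domain and cliques (sizes as reported in the issue).
attributes = ("city", "slrec", "respondt", "sex", "ssenroll")
shape = (1164, 3, 3, 4, 3)
cliques = [
    ("slrec", "ssenroll"),
    ("slrec", "respondt"),
    ("respondt", "sex"),
    ("slrec", "sex"),
    ("city", "slrec"),
    ("city", "ssenroll"),
]

# Keep one large-cardinality attribute ("city") and two small ones.
small_attrs, small_shape, small_cliques = restrict(
    attributes, shape, cliques, keep=("city", "slrec", "ssenroll")
)
print(small_attrs, small_shape, small_cliques)
```

The resulting tuples could then be used to build the domain and cliques for the reproduction snippet above.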

Domain

Domain(attributes=('metro', 'metarea', 'metaread', 'city', 'sizepl', 'urban', 'sea', 'gq', 'gqtype', 'gqtyped', 'gqfunds', 'farm', 'ownershp', 'ownershpd', 'rent', 'valueh', 'split', 'slrec', 'respondt', 'famsize', 'nchlt5', 'sex', 'age', 'agemonth', 'marst', 'marrno', 'agemarr', 'chborn', 'race', 'hispan', 'hispand', 'bpl', 'bpld', 'mbpl', 'mbpld', 'fbpl', 'fbpld', 'nativity', 'citizen', 'mtongue', 'mtongued', 'spanname', 'hisprule', 'school', 'higrade', 'higraded', 'educ', 'educd', 'empstat', 'empstatd', 'labforce', 'classwkr', 'classwkrd', 'occ', 'occ1950', 'ind', 'ind1950', 'wkswork2', 'hrswork2', 'uocc95', 'uclasswk', 'incwage', 'incnonwg', 'occscore', 'sei', 'presgl', 'erscor50', 'edscor50', 'npboss50', 'migrate5', 'migrate5d', 'migplac5', 'migcity5', 'samesea5', 'vetstat', 'vetstatd', 'vet1940', 'vetwwi', 'vetper', 'vetchild', 'ssenroll'), shape=(5, 334, 378, 1164, 19, 3, 472, 7, 10, 93, 13, 4, 3, 8, 157, 67, 2, 3, 3, 59, 10, 4, 138, 15, 8, 10, 89, 64, 10, 6, 55, 165, 545, 164, 543, 165, 545, 6, 8, 92, 489, 4, 9, 5, 25, 69, 13, 44, 5, 16, 4, 4, 18, 1000, 283, 365, 162, 7, 9, 279, 8, 103, 4, 58, 87, 197, 1002, 1002, 1002, 7, 15, 199, 197, 6, 4, 10, 4, 3, 8, 5, 3), labels=None)

Cliques

[('ownershp', 'split'), ('split', 'classwkr'), ('split', 'vet1940'), ('respondt', 'vetwwi'), ('slrec', 'vetwwi'), ('ownershp', 'ssenroll'), ('slrec', 'ssenroll'), ('slrec', 'respondt'), ('urban', 'ssenroll'), ('urban', 'slrec'), ('slrec', 'incnonwg'), ('sex', 'vetwwi'), ('slrec', 'spanname'), ('respondt', 'sex'), ('labforce', 'ssenroll'), ('ownershp', 'vetstat'), ('respondt', 'incnonwg'), ('slrec', 'labforce'), ('farm', 'slrec'), ('vetstat', 'ssenroll'), ('farm', 'respondt'), ('urban', 'labforce'), ('farm', 'vetwwi'), ('slrec', 'sex'), ('incnonwg', 'vetwwi'), ('spanname', 'vetwwi'), ('split', 'samesea5'), ('ownershp', 'vet1940'), ('respondt', 'spanname'), ('split', 'wkswork2'), ('ownershp', 'school'), ('empstat', 'ssenroll'), ('metro', 'ownershp'), ('ownershp', 'empstat'), ('sex', 'spanname'), ('vetstat', 'vet1940'), ('sex', 'incnonwg'), ('spanname', 'incnonwg'), ('farm', 'spanname'), ('farm', 'sex'), ('farm', 'incnonwg'), ('slrec', 'hispan'), ('samesea5', 'ssenroll'), ('ownershp', 'samesea5'), ('ownershp', 'nativity'), ('empstat', 'vet1940'), ('split', 'race'), ('split', 'nchlt5'), ('empstat', 'vetstat'), ('metro', 'vetstat'), ('split', 'vetstatd'), ('school', 'vetstat'), ('split', 'marrno'), ('metro', 'vet1940'), ('ownershp', 'wkswork2'), ('gq', 'ssenroll'), ('ownershp', 'migrate5'), ('urban', 'vetper'), ('ownershpd', 'respondt'), ('nativity', 'vetstat'), ('samesea5', 'vetstat'), ('nativity', 'vet1940'), ('hispan', 'spanname'), ('slrec', 'vetper'), ('sex', 'hispan'), ('ownershpd', 'slrec'), ('marst', 'ssenroll'), ('hispan', 'incnonwg'), ('vetper', 'ssenroll'), ('classwkr', 'samesea5'), ('uclasswk', 'ssenroll'), ('ownershpd', 'vetwwi'), ('samesea5', 'vet1940'), ('farm', 'hispan'), ('school', 'empstat'), ('metro', 'empstat'), ('split', 'educ'), ('hrswork2', 'vetwwi'), ('slrec', 'hrswork2'), ('respondt', 'hrswork2'), ('slrec', 'hisprule'), ('migrate5', 'vetstat'), ('classwkr', 'wkswork2'), ('ownershp', 'vetstatd'), ('metro', 'nativity'), ('urban', 'gqtype'), 
('nativity', 'empstat'), ('gqtype', 'slrec'), ('empstat', 'samesea5'), ('school', 'samesea5'), ('ownershp', 'race'), ('ownershp', 'marrno'), ('gqtype', 'ssenroll'), ('vetstatd', 'ssenroll'), ('ownershp', 'nchlt5'), ('farm', 'ownershpd'), ('labforce', 'vetper'), ('ownershpd', 'spanname'), ('ownershpd', 'sex'), ('ownershpd', 'incnonwg'), ('school', 'migrate5'), ('empstat', 'migrate5'), ('hisprule', 'incnonwg'), ('hrswork2', 'incnonwg'), ('spanname', 'hrswork2'), ('farm', 'hisprule'), ('sex', 'hrswork2'), ('spanname', 'hisprule'), ('farm', 'hrswork2'), ('sex', 'hisprule'), ('gqfunds', 'ssenroll'), ('ownershp', 'educ'), ('gqtype', 'labforce'), ('marrno', 'vet1940'), ('vetstat', 'vetstatd'), ('race', 'vetstat'), ('race', 'vet1940'), ('classwkr', 'vetstatd'), ('vetstatd', 'vet1940'), ('nchlt5', 'vet1940'), ('migrate5', 'samesea5'), ('wkswork2', 'samesea5'), ('migrate5d', 'ssenroll'), ('slrec', 'agemonth'), ('ownershp', 'migrate5d'), ('agemonth', 'ssenroll'), ('slrec', 'empstatd'), ('split', 'higrade'), ('school', 'vetstatd'), ('metro', 'vetstatd'), ('race', 'empstat'), ('metro', 'race'), ('empstat', 'vetstatd'), ('hispan', 'hrswork2'), ('classwkrd', 'ssenroll'), ('slrec', 'classwkrd'), ('hispan', 'hisprule'), ('gq', 'marst'), ('nchlt5', 'samesea5'), ('race', 'nativity'), ('migrate5d', 'vetstat'), ('race', 'samesea5'), ('nativity', 'vetstatd'), ('marrno', 'samesea5'), ('samesea5', 'vetstatd'), ('marst', 'uclasswk'), ('spanname', 'empstatd'), ('farm', 'empstatd'), ('sex', 'empstatd'), ('empstatd', 'incnonwg'), ('wkswork2', 'vetstatd'), ('migrate5', 'vetstatd'), ('ownershpd', 'hrswork2'), ('ownershp', 'higrade'), ('empstat', 'migrate5d'), ('educ', 'samesea5'), ('gqtype', 'vetper'), ('hisprule', 'hrswork2'), ('migrate5d', 'samesea5'), ('educ', 'wkswork2'), ('gq', 'gqfunds'), ('sizepl', 'vetchild'), ('hispan', 'empstatd'), ('marrno', 'race'), ('higrade', 'classwkr'), ('marrno', 'vetstatd'), ('race', 'vetstatd'), ('nchlt5', 'vetstatd'), ('nchlt5', 'race'), ('gqfunds', 
'marst'), ('educ', 'vetstatd'), ('educd', 'ssenroll'), ('ownershp', 'educd'), ('valueh', 'split'), ('split', 'higraded'), ('empstatd', 'hrswork2'), ('migrate5d', 'vetstatd'), ('higrade', 'samesea5'), ('sizepl', 'citizen'), ('hispand', 'ssenroll'), ('slrec', 'hispand'), ('ownershp', 'occscore'), ('split', 'sei'), ('occscore', 'ssenroll'), ('higrade', 'wkswork2'), ('educd', 'vetstat'), ('slrec', 'famsize'), ('famsize', 'ssenroll'), ('respondt', 'chborn'), ('slrec', 'chborn'), ('chborn', 'vetwwi'), ('ownershp', 'valueh'), ('split', 'incwage'), ('ownershp', 'higraded'), ('educd', 'empstat'), ('occscore', 'vetstat'), ('marrno', 'higrade'), ('higrade', 'vetstatd'), ('sex', 'chborn'), ('farm', 'chborn'), ('chborn', 'incnonwg'), ('chborn', 'spanname'), ('ownershp', 'sei'), ('educd', 'samesea5'), ('slrec', 'agemarr'), ('agemarr', 'ssenroll'), ('valueh', 'vet1940'), ('valueh', 'vetstat'), ('mtongue', 'ssenroll'), ('split', 'ind1950'), ('higrade', 'educ'), ('valueh', 'school'), ('valueh', 'empstat'), ('occscore', 'samesea5'), ('valueh', 'samesea5'), ('valueh', 'nativity'), ('classwkr', 'incwage'), ('higraded', 'samesea5'), ('age', 'ssenroll'), ('slrec', 'age'), ('educd', 'vetstatd'), ('valueh', 'migrate5'), ('higraded', 'wkswork2'), ('ownershp', 'ind1950'), ('slrec', 'mbpl'), ('mbpl', 'ssenroll'), ('bpl', 'ssenroll'), ('slrec', 'fbpl'), ('fbpl', 'ssenroll'), ('sei', 'samesea5'), ('chborn', 'hrswork2'), ('slrec', 'migcity5'), ('slrec', 'presgl'), ('presgl', 'ssenroll'), ('migplac5', 'ssenroll'), ('wkswork2', 'sei'), ('incwage', 'samesea5'), ('educd', 'migrate5d'), ('valueh', 'marrno'), ('valueh', 'race'), ('valueh', 'vetstatd'), ('valueh', 'nchlt5'), ('higraded', 'vetstatd'), ('wkswork2', 'incwage'), ('marst', 'mtongue'), ('farm', 'migcity5'), ('spanname', 'migcity5'), ('incnonwg', 'migcity5'), ('sex', 'migcity5'), ('slrec', 'uocc95'), ('uocc95', 'ssenroll'), ('occ1950', 'ssenroll'), ('slrec', 'occ1950'), ('sei', 'vetstatd'), ('ind1950', 'samesea5'), ('metarea', 'slrec'), 
('metarea', 'ssenroll'), ('valueh', 'migrate5d'), ('incwage', 'vetstatd'), ('ind', 'ssenroll'), ('slrec', 'ind'), ('metaread', 'ssenroll'), ('hispan', 'migcity5'), ('marst', 'bpl'), ('agemonth', 'agemarr'), ('sea', 'slrec'), ('sea', 'ssenroll'), ('mtongued', 'ssenroll'), ('slrec', 'mtongued'), ('marst', 'migplac5'), ('ind1950', 'vetstatd'), ('mbpld', 'ssenroll'), ('slrec', 'mbpld'), ('slrec', 'fbpld'), ('bpld', 'ssenroll'), ('fbpld', 'ssenroll'), ('valueh', 'higrade'), ('higrade', 'higraded'), ('sizepl', 'gqtyped'), ('hrswork2', 'migcity5'), ('mtongued', 'spanname'), ('sex', 'mtongued'), ('higrade', 'sei'), ('mtongued', 'vetchild'), ('educd', 'occscore'), ('higrade', 'incwage'), ('mtongued', 'samesea5'), ('sizepl', 'rent'), ('occ', 'ssenroll'), ('slrec', 'occ'), ('urban', 'occ'), ('edscor50', 'ssenroll'), ('npboss50', 'ssenroll'), ('erscor50', 'ssenroll'), ('slrec', 'erscor50'), ('ownershp', 'npboss50'), ('slrec', 'edscor50'), ('metaread', 'marst'), ('famsize', 'hispand'), ('gq', 'mtongued'), ('city', 'slrec'), ('city', 'ssenroll'), ('marst', 'mtongued'), ('mtongued', 'uclasswk'), ('labforce', 'occ'), ('npboss50', 'vetstat'), ('higrade', 'ind1950'), ('marst', 'bpld'), ('bpld', 'citizen'), ('classwkrd', 'uocc95'), ('npboss50', 'samesea5'), ('gqfunds', 'mtongued'), ('agemonth', 'mtongued'), ('occ', 'vetper'), ('famsize', 'age'), ('sizepl', 'mtongued'), ('famsize', 'mbpl'), ('famsize', 'fbpl'), ('gqtype', 'occ'), ('sizepl', 'bpld'), ('famsize', 'mtongued'), ('agemarr', 'mtongued'), ('educd', 'npboss50'), ('mtongue', 'mtongued'), ('gqtyped', 'mtongued'), ('rent', 'mtongued'), ('bpl', 'mtongued'), ('mtongued', 'migcity5'), ('mtongued', 'migplac5'), ('sea', 'uocc95'), ('mtongued', 'uocc95'), ('metarea', 'mtongued'), ('mtongued', 'ind'), ('metaread', 'mtongued'), ('occ', 'presgl'), ('mbpld', 'mtongued'), ('fbpld', 'mtongued'), ('bpld', 'mtongued'), ('occ1950', 'erscor50'), ('mtongued', 'occ'), ('mtongued', 'edscor50'), ('mtongued', 'erscor50'), ('mtongued', 'npboss50'), 
('city', 'mtongued')]
