diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..f4a5beaf --- /dev/null +++ b/.gitignore @@ -0,0 +1,15 @@ +#intellij +.idea/ +*.iml +#local spark context data from unit tests +spark-warehouse/ +#Build directory for maven/sbt +target/ +project/project/ +project/target/ +./testData +*.DS_Store +/dependency-reduced-pom.xml +/bin/ +/python/dist/ +/python/pyAutoML.egg-info/ diff --git a/APIDOCS.md b/APIDOCS.md new file mode 100644 index 00000000..5411f595 --- /dev/null +++ b/APIDOCS.md @@ -0,0 +1,2672 @@ +# AutoML-Toolkit + +The AutoML-Toolkit is an automated ML solution for Apache Spark. It provides common data cleansing and feature engineering support, automated hyper-parameter tuning through distributed genetic algorithms, and model tracking integration with MLFlow. +It currently supports Supervised Learning algorithms that are provided as part of Spark MLlib, as well as a few Spark-supported open source distributed +ML packages. +> NOTE: There are a number of features and modules within this code base that do not exist in core Spark and these APIs ARE +exposed for use (although not documented in this APIDOC). For further information on core functionality of these modules, and to explore +the scaladocs for these modules, feel free to clone the repo, load into an IDE, and auto-generate the scaladoc for +viewing on your browser through the HTML generator for scaladocs feature. + +## General Overview + +The AutoML-Toolkit is a multi-layer API that can be used in several different ways: +1. Full Automation (high-level API) by using the FamilyRunner object, Configuration Generator and utilizing the .executeWithPipeline() main public method. +2. Mid-level Automation through use of individual component APIs (DataPrep / AutomationRunner / FeatureImportances objects and classes) +3. 
Low-level APIs for Hyper-parameter tuning and other independent functionality + +## Full Automation + +At the highest level of the API, the FamilyRunner, using defaults, requires only a Spark Dataframe and an Array of InstanceConfigurations to be supplied to the object instantiation. + +With the recommended PipelineAPI method .executeWithPipeline(), a SparkML Pipeline that wraps all transformations used in building the feature vector, +the hyperparameter tuning, and the best selected model will be returned. + +### Family Runner API +This example shows configuring 3 separate tuning runs (RandomForest Classifier, Logistic Regression, and XGBoost Classifier): +```scala +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.executor.FamilyRunner + +val runName = "Automated-Model-Run-1" + +val configurationOverrides = Map( + "labelCol" -> "my_label", + "tunerParallelism" -> 6, + "tunerKFold" -> 3, + "tunerTrainSplitMethod" -> "stratified", + "scoringMetric" -> "areaUnderROC", + "tunerNumberOfGenerations" -> 6, + "tunerNumberOfMutationsPerGeneration" -> 8, + "tunerInitialGenerationMode" -> "permutations", + "tunerInitialGenerationPermutationCount" -> 18, + "tunerFirstGenerationGenePool" -> 18, + "mlFlowModelSaveDirectory" -> s"dbfs:/automl/$runName", + "inferenceConfigSaveLocation" -> s"dbfs:/automl/inference/$runName", + "mlFlowExperimentName" -> s"/Users/benjamin.wilson@databricks.com/AutoMLDemo/MLFlow/$runName" +) + +val runConfiguration = Array("RandomForest", "LogisticRegression", "XGBoost") + .map(x => ConfigurationGenerator.generateConfigFromMap(x, "classifier", configurationOverrides)) + +val pipelineRunner = FamilyRunner(spark.table("mlDatabase.featureData"), runConfiguration).executeWithPipeline() +``` + +Shown above are **typical override values** for the configuration overrides. 
Each value does have a default assigned to it, +but the logging paths for default values may not be where you would like the artifacts to be stored. +The `mlFlowExperimentName` in particular will just log the mlflow data to the parent directory of the notebook being used +to execute this code from. This may not be desired, so it is recommended to override this value to a central location. + +#### Automation Return Values + +##### The Run Results, consisting of type: `FamilyFinalOutputWithPipeline` + +Which are of type: +```scala +case class FamilyFinalOutputWithPipeline( + familyFinalOutput: FamilyFinalOutput, + bestPipelineModel: Map[String, PipelineModel], + bestMlFlowRunId: Map[String, String] = Map.empty +) +``` +These return types are: +* familyFinalOutput: `FamilyFinalOutput` +```scala +case class FamilyFinalOutput(modelReport: Array[GroupedModelReturn], + generationReport: Array[GenerationalReport], + modelReportDataFrame: DataFrame, + generationReportDataFrame: DataFrame, + mlFlowReport: Array[MLFlowReportStructure]) +``` +* bestPipelineModel -> The SparkML Pipeline (custom flavor, but still serializable) for each of the model families that have been tested. + This artifact pipeline can be directly used for inference in a chained tune-per-run approach, or retrieved via the mlflow accessor API + and used to perform inference on a data set with at least the same columns as the original data set (additional columns CAN be present for inference, but missing columns are not permitted) +* bestMlFlowRunId -> Map for the best result for each of the model families. This ID can be externally stored, if desired, so that a simple log-store interface can be used +to retrieve the appropriate pipeline from the tuning run for inference purposes. 
+ + + +The data within the familyFinalOutput consists of: +* modelReport -> Array[GroupedModelReturn] +```scala +case class GroupedModelReturn(modelFamily: String, + hyperParams: Map[String, Any], + model: Any, + score: Double, + metrics: Map[String, Double], + generation: Int) +``` +* generationReport -> A Spark Dataframe representation of the Generational Average Scores, consisting of 2 columns: + * `Generation[Int]` + * `Score[Double]` + +* modelReportDataFrame -> A Spark Dataframe representation of the Generational Run Results, consisting of 5 columns: + + * model_family[String] + * model_type[String] + * generation[Int] + * generation_mean_score[Double] + * generation_std_dev_score[Double] + +* mlFlowReport -> Array[MLFlowReportStructure] that contains the following structures: +```scala +case class MLFlowReturn(client: MlflowClient, + experimentId: String, + runIdPayload: Array[(String, Double)]) + +case class MLFlowReportStructure(fullLog: MLFlowReturn, bestLog: MLFlowReturn) +``` + +```text +NOTE: If using MLFlow integration, all of this data, in raw format, will be recorded and stored automatically. +``` + +## Configuration Generator API + +The purpose of the configuration generator is to provide a means of overriding the default values of the automl toolkit. +The full configuration for the application utilizes a grouped configuration approach, isolating separate similar stage +configs in groups of similar relevance. Due to the sheer number of tasks that are being performed, the complexity of +the internal processes, and the demoralizing idea of setting nested case class configurations, this interface allows for +a Map to be configured with individual overrides to change this structure internally. +In the following section, the configuration parameters will be shown and explained. + + +### Model Family + +Setter: `.setModelingFamily()` + +```text +Default: "RandomForest" + +Sets the modeling family (Spark Mllib) to be used to train / validate. 
+``` + +> For model families that support both Regression and Classification, the parameter value +`.setModelDistinctThreshold()` is used to determine which to use. Distinct values in the label column, +if below the `modelDistinctThreshold`, will use a Classifier flavor of the Model Family. Otherwise, it will use the Regression Type. + +Currently supported models: +* "XGBoost" - [XGBoost Classifier](https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#) or [XGBoost Regressor](https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#) +* "RandomForest" - [Random Forest Classifier](http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) or [Random Forest Regressor](http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression) +* "GBT" - [Gradient Boosted Trees Classifier](http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier) or [Gradient Boosted Trees Regressor](http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression) +* "Trees" - [Decision Tree Classifier](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier) or [Decision Tree Regressor](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression) +* "LinearRegression" - [Linear Regressor](http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression) +* "LogisticRegression" - [Logistic Regressor](http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression) (supports both Binomial and Multinomial) +* "MLPC" - [Multi-Layer Perceptron Classifier](http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier) +* "SVM" - [Linear Support Vector 
Machines](http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-support-vector-machine) + +* "LightGBM" (currently suspended, pending library improvements to LightGBM) [LightGBM](https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md) +> NOTE: The automl-toolkit has the interfaces available for tuning LightGBM, but due to limitations in the underlying +>thread management in LightGBM, asynchronous instantiations of models causes extreme instability to Spark. If fixed in the future, +>this model will be moved to an [Accessible-Experimental] state. + +### Generic Config + +```scala +case class GenericConfig(var labelCol: String, + var featuresCol: String, + var dateTimeConversionType: String, + var fieldsToIgnoreInVector: Array[String], + var scoringMetric: String, + var scoringOptimizationStrategy: String) +``` + +#### Label Column Name + +Setter: `.setLabelCol()` Map Name: `'labelCol'` + +```text +Default: "label" + +This is the 'predicted value' for use in supervised learning. + If this field does not exist within the dataframe supplied, an assertion exception will be thrown + once a method (other than setters/getters) is called on the AutomationRunner() object. +``` +> NOTE : this value should always be defined by the end user. +#### Feature Column Name + +Setter: `.setFeaturesCol()` Map Name: `'featuresCol'` + +```text +Default: "features" + +Purely cosmetic setting that ensures consistency throughout all of the modules within AutoML-Toolkit. + +[Future Feature] In a future planned release, new accessor methods will make this setting more relevant, + as a validated prediction data set will be returned along with run statistics. +``` + +#### Date Time Conversion Type + +Setter: `.setDateTimeConversionType()` Map Name: `'dateTimeConversionType'` + +```text +Default: "split" + +Available options: "split", "unix" +``` +This setting determines how to handle DateTime type fields. 
+* In the "unix" setting mode, the datetime type is converted to `[Double]` type of the `[Long]` Unix timestamp in seconds. + +* In the "split" setting mode, the date is split into its constituent parts and each part is added to a separate field. +> By default, extraction is at maximum precision -> (year, month, day, hour, minute, second) + +#### Fields to ignore in vector + +Setter: `.setFieldsToIgnoreInVector()` +Map Name `'fieldsToIgnoreInVector'` + +```text +Default: Array.empty[String] + +Provides a means for ignoring specific fields in the Dataframe from being included in any DataPrep feature +engineering tasks, filtering, and exclusion from the feature vector for model tuning. + +This is particularly useful if there is a need to perform follow-on joins to other data sets after model +training and prediction is complete. +``` + +#### Scoring Metric + +Setter: `.setScoringMetric()` +Map Name `'scoringMetric'` + +```text +Default is set dynamically by the prediction type (either regressor or classifier). + +Available values: + +Regressor -> rmse, mse, r2, mae + +MultiClass Classifier -> f1, accuracy, weightedPrecision, weightedRecall +Binary Classifier -> f1, accuracy, weightedPrecision, weightedRecall, areaUnderROC, areaUnderPR +``` + +For Regressor modeling, the default is [rmse](https://en.wikipedia.org/wiki/Root-mean-square_deviation) + +For Classifier modeling, the default is [f1](https://en.wikipedia.org/wiki/F1_score) + +> NOTE: It is **highly advised** to override this value for the type of problem that is being solved. No one scoring methodology works best for every situation. 
+ +> NOTE: although f1 score is a valid metric for binary classification problems, it is highly advised to use a Binary Classification scoring method that is more accurate for these problems (areaUnderROC, areaUnderPR) + + +#### Scoring Optimization Strategy + +Setter: `.setScoringOptimizationStrategy()` +Map Name `'scoringOptimizationStrategy'` + +```text +The optimization strategy is a measure of the direction for which future candidates within the algorithm will be considered 'good'. +At the conclusion of each tuning round, the evaluation of 'best as of time 'x'' is determined by sorting the results in either a maximal or minimal way, +and as such, this setting is critical to getting a good result from the tuning portion of the automl toolkit + +Default: Determined by prediction type (Classifier is 'maximize', Regressor is 'minimize') +Available values: 'maximize' or 'minimize' +``` +> NOTE take care when selecting which direction to optimize, particularly with r2 optimization. The intent is typically to ***maximize*** this value, but the default +>configuration will set this to minimize. Override this value if attempting to tune with r2 optimization. + + +### Switch Config +```scala +case class SwitchConfig(var naFillFlag: Boolean, + var varianceFilterFlag: Boolean, + var outlierFilterFlag: Boolean, + var pearsonFilterFlag: Boolean, + var covarianceFilterFlag: Boolean, + var oneHotEncodeFlag: Boolean, + var scalingFlag: Boolean, + var featureInteractionFlag: Boolean, + var dataPrepCachingFlag: Boolean, + var autoStoppingFlag: Boolean, + var pipelineDebugFlag: Boolean) +``` + + +### Fill Null Values (naFillFlag) +```text +Default: ON +Turned off via setter .naFillOff() +``` +> NOTE: It is **HIGHLY recommended** to leave this turned on. If there are Null values in the feature vector, exceptions may be thrown. + +This module allows for filling both numeric values and categorical & string values. 
+ +#### Available Overrides +* [Numeric Fill Stat](#numeric-fill-stat) +* [Character Fill Stat](#character-fill-stat) +* [Cardinality Check Mode](#fill-config-cardinality-check-mode) +* [Cardinality Switch](#fill-config-cardinality-switch) +* [Cardinality Type](#fill-config-cardinality-type) +* [Cardinality Limit](#fill-config-cardinality-limit) +* [Cardinality Precision](#fill-config-cardinality-precision) +* [Character NA Blanket Fill Value](#fill-config-character-na-blanket-fill-value) +* [Filter Precision](#fill-config-filter-precision) +* [NA Fill Mode](#fill-config-na-fill-mode) +* [Numeric Na Blanket Fill Value](#fill-config-numeric-na-blanket-fill-value) +* [Numeric NA Fill Map](#fill-config-numeric-na-fill-map) +* [Categorical NA Fill Map](#fill-config-categorical-na-fill-map) + + +### Filter Zero Variance Features (varianceFilterFlag) +```text +Default: ON +Turned off via setter .varianceFilterOff() + +NOTE: It is HIGHLY recommended to leave this turned on. +Feature fields with zero information gain increase overall processing time and provide no real value to the model. +``` +There are no options associated with this module. + +### Filter Outliers (outlierFilterFlag) +```text +Default: OFF +Turned on via setter .outlierFilterOn() +``` + +This module allows for detecting outliers from within each field, setting filter thresholds +(either automatically or manually), and allowing for either tail reduction or two-sided reduction. +Including outliers in some families of machine learning models will result in dramatic overfitting to those values. + +> NOTE: It is recommended to only use this feature when doing basic exploratory analysis. Filtering outliers should be +> conducted externally on a feature data set once the problem statement and analysis of data is completed. 
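To make the tail options concrete, here is a minimal pure-Scala sketch of quantile-based row filtering on a single field. It is not the toolkit's implementation: the `OutlierSketch` object, its method names, and the nearest-rank quantile are assumptions made for illustration (the toolkit works on Spark DataFrames and approximate n-tiles). The `bounds` values mirror the "lower", "both", and "upper" modes described under Outlier Filter Bounds, and the default n-tiles mirror the documented 0.02 and 0.98.

```scala
// Hypothetical helper, not part of the AutoML-Toolkit API.
object OutlierSketch {

  // Nearest-rank quantile of an ascending-sorted, non-empty sequence
  // (an illustrative choice of quantile definition).
  def quantile(sorted: Seq[Double], q: Double): Double = {
    require(sorted.nonEmpty && q >= 0.0 && q <= 1.0)
    val rank = math.ceil(q * sorted.size).toInt
    sorted(math.min(math.max(rank, 1), sorted.size) - 1)
  }

  // Keeps only the values that fall inside the requested tail bounds.
  def filterOutliers(values: Seq[Double],
                     bounds: String,
                     lowerNTile: Double = 0.02,
                     upperNTile: Double = 0.98): Seq[Double] = {
    val sorted = values.sorted
    val low = quantile(sorted, lowerNTile)
    val high = quantile(sorted, upperNTile)
    bounds match {
      case "lower" => values.filter(_ >= low)
      case "upper" => values.filter(_ <= high)
      case "both"  => values.filter(v => v >= low && v <= high)
      case other   => throw new IllegalArgumentException(s"Unknown bounds: $other")
    }
  }
}
```

Calling `OutlierSketch.filterOutliers(Seq(47.0, 54.0, 9999.0, 7.0, 1.0), "upper", upperNTile = 0.8)` drops only the 9999.0 entry, since every other value sits at or below the 0.8 quantile under this definition.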
+ +#### Available Overrides +* [Continuous Data Threshold](#outlier-continuous-data-threshold) +* [Fields To Ignore](#outlier-fields-to-ignore) +* [Filter Bounds](#outlier-filter-bounds) +* [Filter Precision](#outlier-filter-precision) + +### Pearson Filtering (pearsonFilterFlag) +```text +Default: OFF +Turned on via setter .pearsonFilterOn() +``` +This module will perform validation of each field of the data set (excluding fields that have been added to +`.setFieldsToIgnoreInVector()` and any fields that have been culled by any previous optional DataPrep +feature engineering module) to the label column that has been set (`.setLabelCol()`). + +The mechanism for comparison is a ChiSquareTest that utilizes one of three currently supported modes : +- pValue +- pearsonStat +- degreesFreedom + +For further reading on Pearson's chi-squared test -> +[Spark Doc - ChiSquaredTest](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.stat.ChiSquareTest$) +[Pearson's Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) + +#### Available Overrides +* [Pearson Filter Statistic](#pearson-filter-statistic) +* [Pearson Filter Direction](#pearson-filter-direction) +* [Pearson Filter Manual Value](#pearson-filter-manual-value) +* [Pearson Filter Mode](#pearson-filter-mode) +* [Pearson Auto Filter N Tile](#pearson-auto-filter-n-tile) + +### Covariance Filtering (covarianceFilterFlag) +```text +Default: OFF +Turned on via setter .covarianceFilterOn() +``` + +Covariance Filtering is a Data Prep module that iterates through each element of the feature space +(fields that are intended to be part of the feature vector), calculates the pearson correlation coefficient between each +feature to every other feature, and provides for the ability to filter out highly positive or negatively correlated +features to prevent fitting errors. 
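The pairwise comparison described above can be sketched in plain Scala. This is an illustration under stated assumptions, not the toolkit's code: `CorrelationFilterSketch`, `pearson`, and `retainedFields` are hypothetical names, the data is held in local sequences rather than a Spark DataFrame, and the rule of dropping the later field of a correlated pair is an assumption about the ordering.

```scala
// Hypothetical helper, not part of the AutoML-Toolkit API.
object CorrelationFilterSketch {

  // Pearson correlation coefficient of two equal-length series.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.size == y.size && x.nonEmpty)
    val meanX = x.sum / x.size
    val meanY = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
    val varX = x.map(a => math.pow(a - meanX, 2)).sum
    val varY = y.map(b => math.pow(b - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }

  // Visits each pair of fields once. When a pair is correlated at or beyond
  // a cutoff, the later field is dropped and never compared again.
  def retainedFields(data: Seq[(String, Seq[Double])],
                     cutoffLow: Double,
                     cutoffHigh: Double): Seq[String] = {
    val dropped = scala.collection.mutable.Set.empty[String]
    for (i <- data.indices; j <- (i + 1) until data.size) {
      val (left, leftValues) = data(i)
      val (right, rightValues) = data(j)
      if (!dropped(left) && !dropped(right)) {
        val r = pearson(leftValues, rightValues)
        if (r <= cutoffLow || r >= cutoffHigh) dropped += right
      }
    }
    data.map(_._1).filterNot(dropped)
  }
}
```

With fields A = (1, 2, 3, 4, 5), B = (94, 1, 22, 5, 5), C = (5, 4, 3, 2, 1), and D = (10, 20, 30, 40, 50), and illustrative cutoffs of -0.99 and 0.99, C and D are dropped because they are perfectly correlated with A, leaving A and B.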
+ +Further Reading: [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) + +> NOTE: This algorithm, although operating within a concurrent thread pool, can be costly to execute. +In sequential mode (parallelism = 1), it is O(n * log(n)) and should be turned on only for an initial exploratory phase of determining predictive power. + +> There are no settings for determining left / right / both sided filtering. Instead, the cutoff values can be set to achieve this. +> > i.e. to only filter positively correlated values, apply the setting: `.setCovarianceCutoffLow(-1.0)` which would only filter +> > fields that are **exactly** negatively correlated (linear negative correlation) + +#### Available Overrides +* [Cutoff-Low](#correlation-covariance-cutoff-low) +* [Cutoff-High](#correlation-covariance-cutoff-high) + +##### General Algorithm example + +Given a data set: + +| A | B | C | D | label | +|:----: |:----: |:----: |:----: |:----: | +| 1 | 94 | 5 | 10 | 1 | +| 2 | 1 | 4 | 20 | 0 | +| 3 | 22 | 3 | 30 | 1 | +| 4 | 5 | 2 | 40 | 0 | +| 5 | 5 | 1 | 50 | 0 | + +Each of the fields A:B:C:D would be compared to one another, the pearson value would be calculated, and a filter will occur. + +* A->B +* A->C +* A->D +* B->C +* B->D +* C->D + +There is a perfect linear negative correlation present between A->C, a perfect positive linear correlation between +A->D, and a perfect linear negative correlation between C->D. +However, evaluation of C->D will not occur, as both C and D will be filtered out due to the correlation coefficient +threshold. +The resultant data set from this module would be, after filtering: + +| A | B | label | +|:----: |:----: |:----: | +| 1 | 94 | 1 | +| 2 | 1 | 0 | +| 3 | 22 | 1 | +| 4 | 5 | 0 | +| 5 | 5 | 0 | + +### OneHotEncoding (oneHotEncodingFlag) +```text +Default: OFF +Turned on via setter .oneHotEncodingOn() +``` +Useful for all non-tree-based models (e.g. 
LinearRegression) for converting categorical features into a vector boolean +space. + +Details: [OneHotEncoderEstimator](http://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator) + +Implementation here follows the general recommendation: Categorical (text) fields will be StringIndexed first, then +OneHotEncoded. + +> There are no options for this setting. It either encodes the categorical features, or leaves them StringIndexed. +> NOTE: NOT RECOMMENDED for tree based models. If a tree-based model is selected, a warning will appear. + +### Scaling (scalingFlag) + +The Scaling Module provides for an automated way to set scaling on the feature vector prior to modeling. +There are a number of ML algorithms that dramatically benefit from scaling of the features to prevent overfitting. +Although tree-based algorithms are generally resilient to this issue, if using any other family of model, it is +**highly recommended** to use some form of scaling. + +The available scaling modes are: +* [minMax](http://spark.apache.org/docs/latest/ml-features.html#minmaxscaler) + * Scales the feature vector to the specified min and max. + * Creates a Dense Vector. +* [standard](http://spark.apache.org/docs/latest/ml-features.html#standardscaler) + * Scales the feature vector to the unit standard deviation of the feature (has options for centering around mean) + * If centering around mean, creates a Dense Vector. Otherwise, it can maintain sparsity. +* [normalize](http://spark.apache.org/docs/latest/ml-features.html#normalizer) + * Scales the feature vector to a p-norm normalization value. +* [maxAbs](http://spark.apache.org/docs/latest/ml-features.html#maxabsscaler) + * Scales the feature vector to a range of {-1, 1} by dividing each value by the Max Absolute Value of the feature. + * Retains Vector type. 
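As a rough illustration of what the four scaling modes compute, the sketch below applies each one to local Scala sequences. The `ScalingSketch` object and its function names are hypothetical and exist only for this example. In the toolkit the work is delegated to the Spark ML scalers linked above, which operate on vector columns. Note that `minMax`, `standard`, and `maxAbs` act on the values of one feature across rows, while `normalize` acts on one row's feature vector.

```scala
// Hypothetical helper, not part of the AutoML-Toolkit or Spark API.
object ScalingSketch {

  // minMax: rescales one feature's values into [min, max].
  def minMax(xs: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
    val (lo, hi) = (xs.min, xs.max)
    // A constant feature has no range, so map it to the midpoint of the target range.
    if (hi == lo) xs.map(_ => 0.5 * (min + max))
    else xs.map(x => (x - lo) / (hi - lo) * (max - min) + min)
  }

  // standard: scales one feature to unit sample standard deviation (n - 1),
  // optionally centering on the mean first.
  def standard(xs: Seq[Double], withMean: Boolean = false): Seq[Double] = {
    require(xs.size > 1)
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1))
    val centered = if (withMean) xs.map(_ - mean) else xs
    if (std == 0.0) centered.map(_ => 0.0) else centered.map(_ / std)
  }

  // normalize: rescales one row (feature vector) to unit p-norm.
  def normalize(row: Seq[Double], p: Double = 2.0): Seq[Double] = {
    val norm = math.pow(row.map(v => math.pow(math.abs(v), p)).sum, 1.0 / p)
    if (norm == 0.0) row else row.map(_ / norm)
  }

  // maxAbs: divides one feature's values by its largest absolute value,
  // giving a range of [-1, 1] and leaving zeros as zeros.
  def maxAbs(xs: Seq[Double]): Seq[Double] = {
    val m = xs.map(x => math.abs(x)).max
    if (m == 0.0) xs else xs.map(_ / m)
  }
}
```

The sparsity remarks in the list follow from these formulas: centering on the mean and shifting to a minimum both turn zeros into non-zero values, whereas dividing by a standard deviation, a norm, or a maximum absolute value keeps zeros at zero.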
+ +#### Available Overrides +* [Scaling Type](#scaling-type) +* [Scaling P-Norm](#scaling-p-norm) + +### Feature Interaction (featureInteractionFlag) + +The Feature Interaction module allows for creation of pair-wise products (Interactions) between feature fields. +The default mode (optimistic) will calculate [Information Gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees) +from Entropy and Differential Entropy calculations done on the parents of each interaction candidate, the offspring candidate, +and make a decision to keep or discard based on the relative ratio of the interacted child to its parents. +There are 4 main modes here: +flag off - no interactions (default) +flag on - 'all' - all potential children candidates will be created with no Information Gain calculations being done +(fastest, but potentially risky as some interactions may create a poorly fit model) +flag on - 'optimistic' - potential children are interacted, but only retained in the final feature vector if the resulting +child's information gain metric is at least n% of at least one parent. See below sections for the setters / map value to get +full explanation on what the configuration attributes are and what they do. +flag on - 'strict' - potential children are interacted, but are only retained if they are n% of BOTH PARENTS. + +DEFAULT: OFF + +> NOTE strict is 'safest', but in general does not create additional fields. It is recommended to use 'optimistic' for most use cases +>and 'all' for testing purposes or in aiding feature engineering iterative tasks to inform where to create more information for future experiments. 
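The retention rule for the three 'flag on' modes can be written down as a small decision function. This is only a sketch of the rule as worded above: `InteractionRetentionSketch`, `retain`, and the `targetRatio` parameter (the "n%" expressed as a fraction) are hypothetical names, and the information gain values are assumed to be computed elsewhere.

```scala
// Hypothetical helper, not part of the AutoML-Toolkit API.
object InteractionRetentionSketch {

  // Decides whether an interacted child column is kept.
  // targetRatio stands in for the "n%" in the text, as a fraction (0.5 = 50%).
  def retain(mode: String,
             childGain: Double,
             parentGains: (Double, Double),
             targetRatio: Double): Boolean = {
    val (left, right) = parentGains
    val meetsLeft = childGain >= targetRatio * left
    val meetsRight = childGain >= targetRatio * right
    mode match {
      case "all"        => true                     // no information gain check
      case "optimistic" => meetsLeft || meetsRight  // at least one parent
      case "strict"     => meetsLeft && meetsRight  // both parents
      case other        => throw new IllegalArgumentException(s"Unknown retention mode: $other")
    }
  }
}
```

For a child with gain 0.3, parents with gains 0.2 and 0.8, and a target ratio of 0.5, the child is kept in 'optimistic' mode (it reaches half of the first parent's gain) but discarded in 'strict' mode (it falls short of half of the second parent's gain).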
+ +#### Available Overrides +* [Continuous Discretizer Bucket Count](#feature-interaction-continuous-discretizer-bucket-count) +* [Parallelism](#feature-interaction-parallelism) +* [Retention Mode](#feature-interaction-retention-mode) +* [Target Interaction Percentage](#feature-interaction-target-intercation-percentage) + +### DataPrepCachingFlag + +This setting will determine whether caching of the feature engineering stages happen prior to splitting. It can be +useful for performance aspects of extremely large data sets or moderate size data sets with high numbers of columns in +the feature set. + +Default: OFF + +### Auto Stopping Flag + +If set to on, when coupled with a stopping criteria score, will stop initializing new Futures and once all async +currently running models are complete, will stop the tuning phase. + +> NOTE: this feature WILL NOT TERMINATE CURRENTLY RUNNING TASKS. Each model is run asynchronously, and once committed +>to execution, will run to completion. This feature only prevents ADDITIONAL Futures from being executed. + +#### Available Overrides +* [Auto-Stopping Score](#tuner-auto-stopping-score) + +### Pipeline Debug Flag + +When turned on, will report debug statements for each pipeline stage into stdout, giving insight into what settings +are being used, what the status is, and information about what is dynamically being done at runtime. + +DEFAULT: OFF + +> NOTE: debug reporting is VERBOSE. Recommend to only run when needed to validate stage execution. 
+ +### Feature Engineering Config +```scala +case class FeatureEngineeringConfig( + var dataPrepParallelism: Int, + var numericFillStat: String, + var characterFillStat: String, + var modelSelectionDistinctThreshold: Int, + var outlierFilterBounds: String, + var outlierLowerFilterNTile: Double, + var outlierUpperFilterNTile: Double, + var outlierFilterPrecision: Double, + var outlierContinuousDataThreshold: Int, + var outlierFieldsToIgnore: Array[String], + var pearsonFilterStatistic: String, + var pearsonFilterDirection: String, + var pearsonFilterManualValue: Double, + var pearsonFilterMode: String, + var pearsonAutoFilterNTile: Double, + var covarianceCorrelationCutoffLow: Double, + var covarianceCorrelationCutoffHigh: Double, + var scalingType: String, + var scalingMin: Double, + var scalingMax: Double, + var scalingStandardMeanFlag: Boolean, + var scalingStdDevFlag: Boolean, + var scalingPNorm: Double, + var featureImportanceCutoffType: String, + var featureImportanceCutoffValue: Double, + var dataReductionFactor: Double, + var cardinalitySwitch: Boolean, + var cardinalityType: String, + var cardinalityLimit: Int, + var cardinalityPrecision: Double, + var cardinalityCheckMode: String, + var filterPrecision: Double, + var categoricalNAFillMap: Map[String, String], + var numericNAFillMap: Map[String, AnyVal], + var characterNABlanketFillValue: String, + var numericNABlanketFillValue: Double, + var naFillMode: String, + var featureInteractionRetentionMode: String, + var featureInteractionContinuousDiscretizerBucketCount: Int, + var featureInteractionParallelism: Int, + var featureInteractionTargetInteractionPercentage: Double +) +``` + +#### Data Prep Parallelism + +Setter: `.setDataPrepParallelism()` +Map Name: `'dataPrepParallelism'` + +```text +Default: 10 +``` + +This setting is used to set the maximum number of asynchronous Futures that are utilized for several stages within the +Data Preparation modules for feature vector creation. 
Some of the checks and validations that are performed are +inherently parallelizable, and as such, the cluster can be utilized to asynchronously process these calculations. + +> NOTE - WARNING - setting this value to too high of a value based on a cluster not large enough to handle the GC +> involved in multiple copies of individual field data series may introduce instability or long pauses while the Heap +> is cleared. Perform testing on your data set to determine if the cluster being used +> for the tuning run is sufficiently large to support overriding this value in a higher direction. + + +#### Numeric Fill Stat + +Setter: `.setFillConfigNumericFillStat()` +Map Name: `'fillConfigNumericFillStat'` + +```text +Specifies the behavior of the naFill algorithm for numeric (continuous) fields. +Values that are generated as potential fill candidates are set according to the available statistics that are +calculated from a df.summary() method. + +Default: "mean" +``` +* For all numeric types (or date/time types that have been cast to numeric types) +* Allowable fill statistics: +1. "min" - minimum sorted value from all distinct values of the field +2. "25p" - 25th percentile (Q1 / Lower IQR value) of the ascending sorted data field +3. "mean" - the mean (average) value of the data field +4. "median" - median (50th percentile / Q2) value of the ascending sorted data field +5. "75p" - 75th percentile (Q3 / Upper IQR value) of the ascending sorted data field +6. "max" - maximum sorted value from all distinct values of the field + +#### Character Fill Stat + +Setter: `.setFillConfigCharacterFillStat()` +Map Name: `'fillConfigCharacterFillStat'` + +```text +Specifies the behavior of the naFill algorithm for character (String, Char, Boolean, Byte, etc.) fields. 
+Generated through a df.summary() method +Available options are: +"min" (least frequently occurring value) +or +"max" (most frequently occurring value) + +Default: "max" +``` + +#### Model Selection Distinct Threshold + +Setter: `.setFillConfigModelSelectionDistinctThreshold` +Map Name: `'fillConfigModelSelectionDistinctThreshold'` + +```text +Default: 50 +``` +The threshold value that is used to detect, based on the supplied labelCol, the cardinality of the label through +a .distinct().count() being issued to the label column. Values from this cardinality determination that are +above this setter's value will be considered to be a Regression Task, those below will be considered a +Classification Task. + +> NOTE: In the case of exceptions being thrown for incorrect type (detected a classifier, but intended usage is for +> a regression), lower this value. Conversely, if a classification problem has a significant number of +> classes, above the default threshold of this setting (50), increase this value. + +#### Outlier Filter Bounds + +Setter: `.setOutlierFilterBounds()` +Map Name: `'outlierFilterBounds'` + +```text +Default: "both" +``` +Filtering both 'tails' is only recommended if the data set's input features follow a normal distribution. +If there is skew in the distribution of the data, a left or right-tailed filter should be employed. +The allowable modes are: +1. "lower" - useful for left-tailed distributions (rare) +2. "both" - useful for normally distributed data (common) +3. "upper" - useful for right-tailed distributions (common) + +For Further reading: [Distributions](https://en.wikipedia.org/wiki/List_of_probability_distributions#With_infinite_support) + +> NOTE: If you don't know what the distribution of your fields are, it is not advised to use outlier filtering. +> This feature is also not recommended for production runs using the toolkit. 
+> This feature is designed for a first-pass evaluation of completely uncleaned data for experimental purposes. + +#### Lower Filter NTile + +Setter: `.setOutlierLowerFilterNTile()` +Map Name: `'outlierLowerFilterNTile'` + +```text +Default: 0.02 + +Restrictions are set on this configuration - the value must be between 0 and 1. +``` +Filters out values (rows) that are below the specified quantile level based on a sort of the field's data in ascending order. +> Only applies to modes "both" and "lower" + +#### Upper Filter NTile + +Setter: `.setOutlierUpperFilterNTile()` +Map Name: `'outlierUpperFilterNTile'` + +```text +Default: 0.98 +``` +Filters out values (rows) that are above the specified quantile threshold based on the ascending sort of the field's data. +> Only applies to modes "both" and "upper" + +#### Outlier Filter Precision + +Setter: `.setOutlierFilterPrecision()` +Map Name: `'outlierFilterPrecision'` + +```text +Default: 0.01 +``` +Determines the level of precision in the calculation of the N-tile values of each field. +Setting this number to a lower value will result in additional shuffling and computation. +The algorithm that uses the filter precision is `approx_count_distinct(columnName: String, rsd: Double)`. +The lower that this value is set, the more accurate it is (setting it to 0.0 will be an exact count), but the more +shuffling (computationally expensive) will be required to calculate the value. +> NOTE: **Restricted Value** range: 0.0 -> 1.0 + +#### Outlier Continuous Data Threshold + +Setter: `.setOutlierContinuousDataThreshold()` +Map Name: `'outlierContinuousDataThreshold'` + +```text +Default: 50 +``` +Determines an exclusion filter of unique values that will be ignored if the unique count of the field's values is below the specified threshold. 
+> Example: + +| Col1 | Col2 | Col3 | Col4 | Col5 | +|:----: |:----: |:----: |:----: |:----: | +| 1 | 47 | 3 | 4 | 1 | +| 1 | 54 | 1 | 0 | 11 | +| 1 | 9999 | 0 | 0 | 0 | +| 0 | 7 | 3 | 0 | 11 | +| 1 | 1 | 0 | 0 | 1 | + +> In this example data set, if the continuousDataThreshold value were to be set at 4, the ignored fields would be: Col1, Col3, Col4, and Col5. +> > Col2, having 5 unique entries, would be evaluated by the outlier filtering methodology and, provided that upper range filtering is being done, Row #3 (value entry 9999) would be filtered out with an UpperFilterNTile setting of 0.8 or lower. + +#### Outlier Fields To Ignore + +Setter: `.setOutlierFieldsToIgnore()` +Map Name: `'outlierFieldsToIgnore'` + +```text +Default: Array("") +``` +Optional configuration that allows certain fields to be exempt (ignored) by the outlier filtering processing. +Any column names that are supplied to this setter will not be used for row filtering. + +> NOTE: it is **highly advised** to populate this to control for 'blind filtering' of too much data. 
+>> Typical use cases for this + + +#### Pearson Filter Statistic + +Setter: `.pearsonFilterStatistic()` +Map Name: `'pearsonFilterStatistic'` + +```text +Default: "pearsonStat" + +allowable values: "pvalue", "pearsonStat", or "degreesFreedom" +``` + +Correlation Detection between a feature value and the label value is capable of 3 supported modes: +* [Pearson Correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) ("pearsonStat") +> > Calculates the Pearson Correlation Coefficient in range of {-1.0, 1.0} +* [Degrees Freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)#In_analysis_of_variance_(ANOVA)) ("degreesFreedom") + + Additional Reading: + + [Reduced Chi-squared statistic](https://en.wikipedia.org/wiki/Reduced_chi-squared_statistic) + + [Generalized Chi-squared distribution](https://en.wikipedia.org/wiki/Generalized_chi-squared_distribution) + + [Degrees Of Freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)#Of_random_vectors) + +> > Calculates the Degrees of Freedom of the underlying linear subspace, in unbounded range {0, n} +> > where n is the feature vector size. + +*Before Overriding this value, ensure that a thorough understanding of this statistic is achieved.* + +* p-value ("pvalue") + +> > Calculates the p-value of independence from the Pearson chi-squared test in range of {0.0, 1.0} + +#### Pearson Filter Direction + +Setter: `.setPearsonFilterDirection()` +Map Name: `'pearsonFilterDirection'` + +```text +Default: "greater" + +allowable values: "greater" or "lesser" +``` + +Specifies whether to filter values out that are higher or lower than the target cutoff value. + +#### Pearson Filter Manual Value + +Setter: `.setPearsonFilterManualValue()` +Map Name: `'pearsonFilterManualValue'` + +```text +Default: 0.0 + +(placeholder value) +``` + +Allows for manually filtering based on a hard-defined limit. 
+ +Example: with .setPearsonFilterMode("manual") and .setPearsonFilterDirection("greater"), +fields (columns) that have a Pearson correlation coefficient above this value will be dropped from the feature vector for modeling runs. + +> Note: if using "manual" mode on this module, it is *imperative* to provide a valid value through this setter. The default placeholder value will +> provide poor results, which is intended to flag that a valid value must be provided. + +#### Pearson Filter Mode + +Setter: `.setPearsonFilterMode()` +Map Name: `'pearsonFilterMode'` + +```text +Default: "auto" + +allowable values: "auto" or "manual" +``` +Determines whether a manual filter value is used as a threshold, or whether a quantile-based approach (automated) +based on the distribution of Chi-squared test results will be used to determine the threshold. + +> The automated approach (using a specified NTile) will adapt to more general problems and is recommended for getting +measures of modeling feasibility (exploratory phase of modeling). However, if utilizing the high-level API with a +well-understood data set, it is recommended to override the mode, setting it to manual, and to supply a +threshold value that is deemed acceptable based on analysis of the feature fields and the predictor (label) column. + +#### Pearson Auto Filter N Tile + +Setter: `.setPearsonAutoFilterNTile()` +Map Name: `'pearsonAutoFilterNTile'` + +```text +Default: 0.75 + +allowable range: 0.0 < x < 1.0 + +Provides the ntile threshold above or below which (depending on PearsonFilterDirection setting) fields will +be removed, depending on the distribution of pearson statistics from all feature columns.
+``` +([Q3 / Upper IQR value](https://en.wikipedia.org/wiki/Interquartile_range)) + + +When in "auto" mode, this will reduce the feature vector by 75% of its total size, retaining only the 25% most +important predictive-power features of the vector. + +#### Correlation (Covariance) Cutoff Low + +Setter: `.setCovarianceCutoffLow()` +Map Name: `'covarianceCutoffLow'` + +```text +Default: -0.8 + +Value must be set > -1.0 +``` +The lower (negative) correlation filter level. The right-hand comparison field will be filtered out of the data set when the +Pearson correlation coefficient between left->right fields is below this threshold. + +> NOTE: Min supported value for `.setCovarianceCutoffLow` is -1.0 + +#### Correlation (Covariance) Cutoff High + +Setter: `.setCovarianceCutoffHigh()` +Map Name: `'covarianceCutoffHigh'` + +```text +Default: 0.8 + +Value must be set < 1.0 +``` +The upper positive correlation filter level. Fields whose correlation coefficients are above this level will be removed from the data set. + +> NOTE: Max supported value for `.setCovarianceCutoffHigh` is 1.0 + +#### Scaling Type + +Setter: `.setScalingType()` +Map Name: `'scalingType'` + +```text +Default: "minMax" + +allowable values: "minMax", "standard", "normalize", or "maxAbs" +``` + +Sets the scaling library to be employed in scaling the feature vector.
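
As a rough illustration of how the Pearson, covariance, and scaling settings above fit together, the sketch below collects their Map Names into an override map of the kind passed to `ConfigurationGenerator.generateConfigFromMap` in the Family Runner example. The values simply restate the documented defaults and are not recommendations; the `Map[String, Any]` type and the variable name `filteringOverrides` are assumptions made for this example.

```scala
// Hypothetical override map built from the Map Names documented above.
// Values restate the documented defaults; adjust them for a real data set.
val filteringOverrides: Map[String, Any] = Map(
  "pearsonFilterStatistic" -> "pearsonStat",
  "pearsonFilterDirection" -> "greater",
  "pearsonFilterMode"      -> "auto",
  "pearsonAutoFilterNTile" -> 0.75,
  "covarianceCutoffLow"    -> -0.8,
  "covarianceCutoffHigh"   -> 0.8,
  "scalingType"            -> "minMax"
)

// Assumed usage, mirroring the Family Runner example earlier in this document:
// ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", filteringOverrides)
```

Keeping these overrides in a single map makes it easier to apply the same filtering choices to every model family in a run.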
+ +#### Scaler Min + +Setter: `.setScalingMin()` +Map Name: `'scalingMin'` + +```text +Default: 0.0 + +Only used in "minMax" mode +``` + +Used to set the scaling lower threshold for MinMax Scaler +(normalizes all features in the vector to set the minimum post-processed value specified in this setter) + +#### Scaler Max + +Setter: `.setScalingMax()` +Map Name: `'scalingMax'` + +```text +Default: 1.0 + +Only used in "minMax" mode +``` + +Used to set the scaling upper threshold for MinMax Scaler +(normalizes all features in the vector to set the maximum post-processed value specified in this setter) + +#### Scaling p-norm + +Setter: `.setScalingPNorm()` +Map Name: `'scalingPNorm'` + +```text +Default: 2.0 +``` +> NOTE: Only used in "normalize" mode. + +> NOTE: value must be >=1.0 for proper functionality in a finite vector space. + +Sets the level of "smoothing" for scaling the noise out of the vector. + +Further Reading: [P-Norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm), [Lp Space](https://en.wikipedia.org/wiki/Lp_space) + +#### Standard Scaler Mean Flag + +Setter: `.setScalingStandardMeanFlag()` +Map Name: `'scalingStandardMeanFlag'` + +```text +Default: false + +Only used in "standard" mode +``` + +With this flag set to `true`, The features within the vector are centered around mean (0 adjusted) before scaling. +> Read the [docs](http://spark.apache.org/docs/latest/ml-features.html#standardscaler) before switching this on. +> > Setting to 'on' will create a dense vector, which will increase memory footprint of the data set. +> +#### Standard Scaler StdDev Flag + +Setter: `.setScalingStdDevFlag()` +Map Name: `'scalingStdDevFlag'` + +```text +Default: true for Linear Models, false for tree-based models (gets set AT RUNTIME unless explicitly turned off for linear models) +``` +> NOTE: Only used in "standard" mode + +Scales the data to the unit standard deviation. 
[Explanation](https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation) + + +#### Feature Interaction Retention Mode + +Setter: `.setFeatureInteractionRetentionMode()` +Map Name: `'featureInteractionRetentionMode'` + +```text + Setter for determining the mode of operation for inclusion of interacted features. + Modes are: + - all -> Includes all interactions between all features (after string indexing of categorical values) + - optimistic -> If the Information Gain / Variance, as compared to at least ONE of the parents of the interaction + is above the threshold set by featureInteractionTargetInteractionPercentage + (e.g. if IG of left parent is 0.5 and right parent is 0.9, with threshold set at 10, if the interaction + between these two parents has an IG of 0.42, it would be rejected, but if it was 0.46, it would be kept) + - strict -> the threshold percentage must be met for BOTH parents. + (in the above example, the IG for the interaction would have to be > 0.81 in order to be included in + the feature vector). + +Default: 'optimistic' +``` + +#### Feature Interaction Continuous Discretizer Bucket Count + +Setter: `.setFeatureInteractionContinuousDiscretizerBucketCount()` +Map Name: `'featureInteractionContinuousDiscretizerBucketCount'` + +```text +Setter for determining the behavior of continuous feature columns. In order to calculate Entropy for a continuous +variable, the distribution must be converted to nominal values for estimation of per-split information gain. +This setting defines how many nominal categorical values to create out of a continuously distributed feature +in order to calculate Entropy. + +Default: 10 +``` + +> Note: must be greater than 1 + +#### Feature Interaction Parallelism + +Setter: `.setFeatureInteractionParallelism()` +Map Name: `'featureInteractionParallelism'` + +```text +Setter for configuring the concurrent count for scoring of feature interaction candidates. 
+Due to the nature of these operations, the configuration here may need to be set differently to that of +the modeling and general feature engineering phases of the toolkit. This is highly dependent on the row +count of the data set being submitted. + +Default: 12 +``` +> NOTE: must be greater than 0 +>> It is recommended to decrease this value for larger data sets to avoid overwhelming the executor thread pools or filling the Heap too quickly. + +#### Feature Interaction Target Interaction Percentage + +Setter: `.setFeatureInteractionTargetInteractionPercentage()` +Map Name: `'featureInteractionTargetInteractionPercentage'` + +```text +Establishes the minimum acceptable InformationGain or Variance allowed for an interaction +candidate based on comparison to the scores of its parents. This value is a 'reduction from matched' in that +a value of 0.1 here would mean that children candidates that are at least 90% of the IG value of parents would be +included in the final feature vector. + +Default: 10.0 +``` + +#### Feature Importance Cutoff Type + +Setter: `.setFeatureImportanceCutoffType()` +Map Name: `'featureImportanceCutoffType'` + +```text +Setting for determining where to limit the feature vector after completing a feature importances run in order to return +either the top n most important features, or the top features above a specific relevance score cutoff. + +modes: 'none', 'value', or 'count' + +Default: 'count' +``` + +#### Feature Importance Cutoff Value + +Setter: `.setFeatureImportanceCutoffValue()` +Map Name: `'featureImportanceCutoffValue'` + +```text +Restrictive filtering limit on either counts of fields (if feature importance cutoff type is in 'count' mode) ranked, +or direct value of feature importance. + +WARNING: depending on the algorithm used to calculate feature importances, operating in 'value' mode is different for +XGBoost vs. RandomForest since their scoring methodologies are different. 
Please see respective API docs for +XGBoost and Spark RandomForest to get an understanding of how these are calculated before attempting 'value' mode. + +Default: 15.0 +``` + +#### Data Reduction Factor + +Setter: `.setDataReductionFactor()` +Map Name: `'dataReductionFactor'` + +```text +Testing feature for validating large runs on a smaller subset of data (DEV API ONLY) +Will reduce the size of the data set by the value provided, if set. + +Default: 0.5 (will drop half of the rows) +``` + +> NOTE: must be in range 0 to 1. +> WARNING: not recommended for use in actual training runs or in production. This removes data from the training data +>set indiscriminately, and as such, cannot guarantee effective preservation of underlying distributions or class balance. + + +#### Fill Config Cardinality Switch + +Setter: `.setFillConfigCardinalitySwitch()` +Map Name: `'fillConfigCardinalitySwitch'` + +```text +Toggles the checking for whether to treat a field as nominal or continuous based on the distinct counts. +This is important for nominal data that, even though numeric, should be handled as a categorical-like value. +Fields that fall below the cardinality limit will be handled in the same way as StringType fields +(utilizing max or min fill rather than mean or quantile fill) + +Default: true (on) +``` +> Note: it is recommended to leave this feature ON unless it is absolutely known that all numeric values in the data set +>should be handled as though they were continuous values. + +#### Fill Config Cardinality Type + +Setter: `.setfFillConfigCardinalityType()` +Map Name: `'fillConfigCardinalityType'` + +```text +Configuration for how cardinality is calculated, either 'approx' or 'exact' + +Default: 'exact' +``` +> NOTE: setting 'exact' on extremely large data sets will incur large waits as data is serialized to get counts.
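
To make the cardinality switch concrete, the sketch below shows the kind of decision it describes: a numeric field whose distinct count falls below a limit is treated as nominal (and filled like a StringType field), otherwise as continuous. This is an illustration under stated assumptions, not the toolkit's implementation. `FillTreatment`, `classifyField`, and the strict less-than comparison are hypothetical, and the limit of 200 is taken from the documented `fillConfigCardinalityLimit` default.

```scala
// Hypothetical types for illustration; these are not part of the toolkit's API.
sealed trait FillTreatment
case object NominalFill extends FillTreatment    // filled with max / min occurring value
case object ContinuousFill extends FillTreatment // filled with mean / quantile statistics

// Assumption: "below the cardinality limit" is modeled as a strict less-than check.
def classifyField(distinctCount: Long,
                  cardinalityLimit: Long = 200L,
                  cardinalitySwitch: Boolean = true): FillTreatment =
  if (cardinalitySwitch && distinctCount < cardinalityLimit) NominalFill
  else ContinuousFill
```

With the switch turned off, every numeric field in this sketch is treated as continuous regardless of its distinct count.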
+ +#### Fill Config Cardinality Limit + +Setter: `.setFillConfigCardinalityLimit()` +Map Name: `'fillConfigCardinalityLimit'` + +```text +The cardinality threshold used when the fill config cardinality switch is turned on. Fields whose distinct +counts fall below this value will be considered to be 'nominal' and handled as a categorical fill value. + +Default: 200 +``` + +#### Fill Config Cardinality Precision + +Setter: `.setFillConfigCardinalityPrecision()` +Map Name: `'fillConfigCardinalityPrecision'` + +```text +Precision value for 'approx' mode on fill config cardinality type + +Must be in range >0 to 1 + +Default: 0.05 +``` + +#### Fill Config Cardinality Check Mode + +Setter: `.setFillConfigCardinalityCheckMode()` +Map Name: `'fillConfigCardinalityCheckMode'` + +```text +Setter for the cardinality check mode to be used. Available modes are "warn" and "silent". +- In "warn" mode, an exception will be thrown if the cardinality for a categorical column is above the threshold. +- In "silent" mode, the field will be ignored from processing and will not be included in the feature vector. + +Default: 'silent' +``` + +#### Fill Config Filter Precision + +Setter: `.setFillConfigFilterPrecision()` +Map Name: `'fillConfigFilterPrecision'` + +```text +Setter for defining the precision for calculating the model type as per the label column + +Must be in range 0 to 1 +``` +> NOTE: setting this value to zero (0) for a large regression problem will incur a long processing time and an expensive shuffle. + +#### Fill Config Categorical NA Fill Map + +Setter: `.setFillConfigCategoricalNAFillMap()` +Map Name: `'fillConfigCategoricalNAFillMap'` + +```text +A means of directly controlling at a column-level distinct overrides to columns for na fill of categorical data for StringType columns.
The structure is of +Column Name -> fill value +This will function only on non-numeric value type columns and the data will be cast as a String, regardless of the input +data type that is applied in the Map. + +Default: Map.empty[String, String] (empty Map) +``` + +#### Fill Config Numeric NA Fill Map + +Setter: `.setFillConfigNumericNAFillMap()` +Map Name: `'fillConfigNumericNAFillMap'` + +```text +A means of directly controlling at a column-level distinct overrides to columns for na fill for numeric Type columns +(all columns get cast to DoubleType throughout modeling anyway). The structure is of +Column Name -> fill value +This will function only numeric value type columns and the data will be cast as a Double, regardless of the input +data type that is applied in the Map. i.e. to fill with Int 1, simply write as a Double 1.0 + +Default: Map.empty[String, Double] (empty Map) +``` + +#### Fill Config Character NA Blanket Fill Value + +Setter: `.setFillConfigCharacterNABlanketFillValue()` +Map Name: `'fillConfigCharacterNABlanketFillValue'` + +```text +Sets the ability to fill all categorical (StringType) columns in the data set to the same na fill replacement value. + +Default: "" +``` +> Note - only recommended for certain ML applications. Not advised for most. + +#### Fill Config Numeric NA Blanket Fill Value + +Setter: `.setFillConfigNumericNABlanketFillValue()` + Map Name: `'fillConfigNumericNABlanketFillValue'` + +```text +Sets the ability to fill all numeric columns in the data set to the same na fill value. + +Default: 0.0 +``` +> Note - not recommended, but included as a feature for certain older application needs from legacy migrations. + +#### Fill Config NA Fill Mode + +Setter: `.setFillConfigNAFillMode()` +Map Name: `'fillConfigNAFillMode'` + +```text + Mode for na fill + Available modes: + - auto: Stats-based na fill for fields. Usage of .setNumericFillStat and + .setCharacterFillStat will inform the type of statistics that will be used to fill. 
+ - mapFill: Custom by-column overrides to 'blanket fill' na values on a per-column + basis. The categorical (string) fields are set via .setFillConfigCategoricalNAFillMap while the + numeric fields are set via .setFillConfigNumericNAFillMap.
+ - blanketFillAll: Fills all fields based on the values specified by + .setCharacterNABlanketFillValue and .setNumericNABlanketFillValue. All NA's for the + appropriate types will be filled in accordingly throughout all columns. + - blanketFillCharOnly: Will use statistics to fill in numeric fields, but will replace + all categorical character fields na values with a blanket fill value. + - blanketFillNumOnly: Will use statistics to fill in character fields, but will replace + all numeric fields na values with a blanket value. + +Default: 'auto' +``` + + + + +### Tuner Config +```scala +case class TunerConfig(var tunerAutoStoppingScore: Double, + var tunerParallelism: Int, + var tunerKFold: Int, + var tunerTrainPortion: Double, + var tunerTrainSplitMethod: String, + var tunerKSampleSyntheticCol: String, + var tunerKSampleKGroups: Int, + var tunerKSampleKMeansMaxIter: Int, + var tunerKSampleKMeansTolerance: Double, + var tunerKSampleKMeansDistanceMeasurement: String, + var tunerKSampleKMeansSeed: Long, + var tunerKSampleKMeansPredictionCol: String, + var tunerKSampleLSHHashTables: Int, + var tunerKSampleLSHSeed: Long, + var tunerKSampleLSHOutputCol: String, + var tunerKSampleQuorumCount: Int, + var tunerKSampleMinimumVectorCountToMutate: Int, + var tunerKSampleVectorMutationMethod: String, + var tunerKSampleMutationMode: String, + var tunerKSampleMutationValue: Double, + var tunerKSampleLabelBalanceMode: String, + var tunerKSampleCardinalityThreshold: Int, + var tunerKSampleNumericRatio: Double, + var tunerKSampleNumericTarget: Int, + var tunerTrainSplitChronologicalColumn: String, + var tunerTrainSplitChronologicalRandomPercentage: Double, + var tunerSeed: Long, + var tunerFirstGenerationGenePool: Int, + var tunerNumberOfGenerations: Int, + var tunerNumberOfParentsToRetain: Int, + var tunerNumberOfMutationsPerGeneration: Int, + var tunerGeneticMixing: Double, + var tunerGenerationalMutationStrategy: String, + var tunerFixedMutationValue: Int, + var 
tunerMutationMagnitudeMode: String, + var tunerEvolutionStrategy: String, + var tunerGeneticMBORegressorType: String, + var tunerGeneticMBOCandidateFactor: Int, + var tunerContinuousEvolutionImprovementThreshold: Int, + var tunerContinuousEvolutionMaxIterations: Int, + var tunerContinuousEvolutionStoppingScore: Double, + var tunerContinuousEvolutionParallelism: Int, + var tunerContinuousEvolutionMutationAggressiveness: Int, + var tunerContinuousEvolutionGeneticMixing: Double, + var tunerContinuousEvolutionRollingImprovingCount: Int, + var tunerModelSeed: Map[String, Any], + var tunerHyperSpaceInference: Boolean, + var tunerHyperSpaceInferenceCount: Int, + var tunerHyperSpaceModelCount: Int, + var tunerHyperSpaceModelType: String, + var tunerInitialGenerationMode: String, + var tunerInitialGenerationPermutationCount: Int, + var tunerInitialGenerationIndexMixingMode: String, + var tunerInitialGenerationArraySeed: Long, + var tunerOutputDfRepartitionScaleFactor: Int, + var tunerDeltaCacheBackingDirectory: String, + var tunerDeltaCacheBackingDirectoryRemovalFlag: Boolean, + var splitCachingStrategy: String) +``` + + +#### Tuner Auto Stopping Score + +Setter: `.setTunerAutoStoppingScore()` +Map Name: `'tunerAutoStoppingScore'` + +```text +Setting for specifying the early stopping value. + +Default: 0.95 +``` +> NOTE: Ensure that the value specified matches the optimization score set in `.setScoringMetric()` +>> i.e. if using f1 score for a classification problem, an appropriate early stopping score might be in the range of {0.92, 0.98} + +[WARNING] This value is set as a placeholder for a default classification problem. If using regression, this will ***need to be changed*** + +#### Tuner Parallelism + +Setter: `.setTunerParallelism()` +Map Name: `'tunerParallelism'` + +```text +Means for setting the number of asynchronous models that are executed concurrently within the generational genetic algorithm. 
+Feeds into the equations for determining appropriate repartitions based on cluster size and available executor CPUs to tune the run appropriately. + +Default: 20 +``` + +Sets the number of concurrent models that will be evaluated in parallel through Futures. +This creates a new [ForkJoinPool](https://java-8-tips.readthedocs.io/en/stable/forkjoin.html), and as such, it is important to not set it too high, in order to prevent overloading +the driver JVM with too many elements in the `Array[DEqueue[task]]` ForkJoinPool. +NOTE: There is a global limit of 32767 threads on the JVM, as well. +However, in practice, running too many models in parallel will likely OOM the workers anyway. + +> NOTE: highly recommended to override this value as the default value is simply a placeholder. +> NOTE: values set above 30 will receive a WARNING stating that the number of concurrent tasks is likely beyond an efficient ratio of +>available cluster resources and the benefits of asynchronous tuning. +> NOTE: different models have different characteristics and behavior. It is recommended that tree-based (non XGBoost) models use a relatively low value here +> (~4-6), while linear models can run at much higher levels and benefit from the higher concurrency. + +#### Tuner KFold + +Setter: `.setTunerKFold()` +Map Name: `'tunerKFold'` + +```text +Sets the number of different splits that are happening on the pre-modeled data set for train and test, allowing for +testing of different splits of data to ensure that the hyper parameters under test are being evaluated for different mixes of the data. +This value dictates the number of copies of the data that will exist either cached, persisted, or written to temporary delta tables during the +modeling phase. + +Default: 5 +``` + +> Note: a warning will appear if the kFold count is lower than recommended, to prevent selecting hyper parameters that may overfit to one particular split of +>the data set into train / test.
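
As a hedged sketch of how the tuner settings above might be overridden together, the map below uses the documented Map Names with illustrative values. The parallelism of 6 follows the note above about tree-based models and the Family Runner example; the other values restate the documented defaults. The name `tunerOverrides` and the `Map[String, Any]` type are assumptions made for this example.

```scala
// Hypothetical tuner overrides for a tree-based classifier run.
val tunerOverrides: Map[String, Any] = Map(
  "tunerAutoStoppingScore" -> 0.95, // placeholder default; must match the scoring metric in use
  "tunerParallelism"       -> 6,    // low value, as suggested above for tree-based models
  "tunerKFold"             -> 5     // documented default; each fold implies another copy of the data
)

// Assumed usage, mirroring the Family Runner example earlier in this document:
// ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", tunerOverrides)
```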
+ +#### Tuner Train Portion + +Setter: `.setTunerTrainPortion()` +Map Name: `'tunerTrainPortion'` + +```text +Sets the proportion of the input DataFrame to be used for Train (the value of this variable) and Test +(1 - the value of this variable) + +Default: 0.8 +``` +> NOTE: restricted to between 0 and 1 (highly recommended to stay in the range of 0.5 to 0.9) + + +#### Train Split Method + +Setter: `.setTunerTrainSplitMethod()` +Map Name: `'tunerTrainSplitMethod'` + +This setting allows for specifying how to split the provided data set into test/training sets for scoring validation. +Some ML use cases are highly time-dependent, even in traditional ML algorithms (e.g. predicting customer churn). +As such, it is important to be able to train on prior data and synthetically 'predict the future' by doing validation +testing on a holdout data set that is more recent than the training data used to build the model. + +Additional reading: [Sampling in Machine Learning](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) + +Setter: `.setTrainSplitMethod()` +```text +Available options: "random" or "chronological" or "stratified" or "underSample" or "overSample" or "kSample" + +Default: "random" + +``` +Chronological Split Mode +- Splits train / test between a sortable field (date, datetime, unixtime, etc.) +> Chronological split method **does not require** a date type or datetime type field. Any sortable / continuously distributed field will work. + +Random Split Mode +- Leaving the default value of "random" will randomly shuffle the train and test data sets each k-fold iteration. + +***This is only recommended for classification data sets where there are relatively balanced counts of unique classes in the label column*** +> [NOTE]: attempting to run any mode other than "random" or "chronological" on a regression problem will not work. Default behavior +will reset the trainSplitMethod to "random" if any other mode is selected on a regression problem.
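
The fallback described in the note above can be sketched as a small helper. This is an illustration of the documented behavior only, not the toolkit's code; `resolveSplitMethod`, the `isRegression` flag, and the two sets are hypothetical names introduced for this example.

```scala
// Split methods listed in the "Available options" block above.
val supportedSplitMethods: Set[String] =
  Set("random", "chronological", "stratified", "underSample", "overSample", "kSample")

// Per the note above, only these two are usable on regression problems.
val regressionSplitMethods: Set[String] = Set("random", "chronological")

// Returns the split method that would effectively be used.
def resolveSplitMethod(requested: String, isRegression: Boolean): String = {
  require(supportedSplitMethods.contains(requested), s"Unsupported split method: $requested")
  if (isRegression && !regressionSplitMethods.contains(requested)) "random"
  else requested
}
```

For example, requesting "stratified" on a regression problem resolves to "random" in this sketch, while "chronological" is kept as requested.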
+ +Stratified Mode + +- Stratified mode will balance all of the values present in the label column of a classification algorithm so that there +is adequate coverage of all available labels in both train and test for each kfold split step. + +***It is HIGHLY RECOMMENDED to use this mode if there is a large skew in your label column (class imbalance) and there is a need for +training on the full, unmanipulated data set.*** + +UnderSampling Mode + +- Under sampling will evaluate the classes within the label column and sample all classes within each kfold / model run +to target the row count of the smallest class. (Only recommended for moderate skew cases) + +***If using this mode, ensure that the smallest class has sufficient row counts to make training effective!*** + +OverSampling Mode [NOT RECOMMENDED FOR GENERAL USE] + +- Over sampling will evaluate the class with the highest count of entries and during splitting, will replicate all +other classes' data to *generally match* the counts of the most frequent class. (**this is not exact and can vary from +run to run**) + +***WARNING*** - using this mode will dramatically increase the training and test data set sizes. +***Small count classes' data will be copied multiple times to eliminate skew*** + +KSampling Mode +- Uses a distributed implementation of [SMOTE](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis#SMOTE) applied to +the minority class(es) for the training split only (test is not augmented). +At its core, the algorithm builds k clusters based on the feature vector. From these cluster centroids, +a MinHashLSH model is built to perform distance calculations from the centroids to each of the members of the cluster +centroid. The collection of candidate points is then sorted based on the distance metric, at which point random +numbers of vector index values are mutated between the adjacent points near the centroid and the centroid value +itself.
These synthetic feature elements are then vectorized, labeled as 'synthetic' and used to augment the +minority classes present in the unbalanced classification training set. +> NOTE: The synthetic data IS NOT INCLUDED IN THE TEST SETS for scoring of models. + +> NOTE: as this is an ML-based augmentation system, there is considerable time and resources involved in creating 'intelligent' +>over-sampling of the minority class(es), which will occur prior to the modeling phase. + +#### Tuner KSample Synthetic Col + +Setter: `.setTunerKSampleSyntheticCol()` +Map Name: `'tunerKSampleSyntheticCol'` + +```text +Internal temporary column name denoting normal vs synthetic rows of data to ensure that a train/test split will not +include synthetic data in the test data set (which would invalidate the model's scoring) + +Default: 'synthetic_ksample' +``` +> NOTE: the name can be anything, provided it isn't the same as a name in the data set already, or a reserved field name (i.e. 'features' or 'label') + +#### Tuner KSample K Groups + +Setter: `.setTunerKSampleKGroups()` +Map Name: `'tunerKSampleKGroups'` + +```text +Specifies the number of K clusters to generate for the synthetic data generation for minority classes + +Default: 25 +``` +> NOTE: placeholder value derived at based on general testing. Feel free to override as needed. + +#### Tuner KSample KMeans MaxIter + +Setter: `.setTunerKSampleKMeansMaxIter()` +Map Name: `'tunerKSampleKMeansMaxIter'` + +```text +Specifies the maximum number of iterations for the KMeans model to attempt to converge + +Default: 100 +``` +> NOTE: not recommended to override unless needed based on cardinality complexity of the features involved. (DEV API) + +#### Tuner KSample KMeans Tolerance + +Setter: `.setTunerKSampleKMeansTolerance()` +Map Name: `'tunerKSampleKMeansTolerance'` + +```text +KMeans setting for determining the tolerance value for convergence. + +Default: 1e-6 +``` + +> NOTE Must be greater than 0. 
+ +See [DOC](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for further details. + +#### Tuner KSample KMeans Distance Measurement + +Setter: `.setTunerKSampleKMeansDistanceMeasurement()` +Map Name: `'tunerKSampleKMeansDistanceMeasurement'` + +```text +Which distance measurement to use for generation of synthetic data. + +Either 'euclidean' or 'cosine' + +Default: 'euclidean' +``` + +#### Tuner KSample KMeans Seed + +Setter: `.setTunerKSampleKMeansSeed()` +Map Name: `'tunerKSampleKMeansSeed'` + +```text +The seed for KMeans to attempt to create a somewhat repeatable convergence for a particular data set. + +Default: 42L +``` + +#### Tuner KSample KMeans Prediction Col + +Setter: `.setTunerKSampleKMeansPredictionCol()` +Map Name: `'tunerKSampleKMeansPredictionCol'` + +```text +Internal use only column name for KSampling internal processes. + +Default: 'kGroups_ksample' +``` +> NOTE: ensure that reserved field names are not used, nor a field name that is present in the raw data set. + +#### Tuner KSample LSH Hash Tables + +Setter: `.setTunerKSampleLSHHashTables()` +Map Name: `'tunerKSampleLSHHashTables'` + +```text +Sets the number of hash tables involved in the jaccard distance calculations in MinHashLSH + +Default: 10 +``` + +For further reading, review the [docs and links](http://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance) + +#### Tuner KSample LSH Seed + +Setter: `.setTunerKSampleLSHSeed()` +Map Name: `'tunerKSampleLSHSeed'` + +```text +Seed for the MinHashLSH algorithm for repeatability. + +Default: 42L +``` + +#### Tuner KSample LSH Output Col + +Setter: `.setTunerKSampleLSHOutputCol()` +Map Name: `'tunerKSampleLSHOutputCol'` + +```text +Internal use only column name for MinHashLSH processes. + +Default: 'hashes_ksample' +``` +> NOTE: ensure that reserved field names are not used, nor any name of a field in the raw data set.
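
The KMeans and MinHashLSH settings documented so far for KSampling can be gathered into one override map, sketched below. The values restate the documented defaults, so the map only becomes useful once some of them are changed; the name `kSampleOverrides` and the `Map[String, Any]` type are assumptions made for this example.

```scala
// Hypothetical override map enabling the "kSample" split method with documented defaults.
val kSampleOverrides: Map[String, Any] = Map(
  "tunerTrainSplitMethod"                 -> "kSample",
  "tunerKSampleSyntheticCol"              -> "synthetic_ksample",
  "tunerKSampleKGroups"                   -> 25,
  "tunerKSampleKMeansMaxIter"             -> 100,
  "tunerKSampleKMeansTolerance"           -> 1e-6,
  "tunerKSampleKMeansDistanceMeasurement" -> "euclidean",
  "tunerKSampleKMeansSeed"                -> 42L,
  "tunerKSampleKMeansPredictionCol"       -> "kGroups_ksample",
  "tunerKSampleLSHHashTables"             -> 10,
  "tunerKSampleLSHSeed"                   -> 42L,
  "tunerKSampleLSHOutputCol"              -> "hashes_ksample"
)

// Assumed usage, mirroring the Family Runner example earlier in this document:
// ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", kSampleOverrides)
```

The three column names are internal working columns, so they only need changing if they collide with a field in the raw data set.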
+ +#### Tuner KSample Quorum Count + +Setter: `.setTunerKSampleQuorumCount()` +Map Name: `'tunerKSampleQuorumCount'` + +```text +Setting for how many vectors to include in adjacency calculations to the centroid position of the K-cluster +for the generation of synthetic data. + +Larger values will produce more dynamic synthetic data, but will incur additional runtime processing. + +Default: 7 +``` + +#### Tuner KSample Minimum Vector Count to Mutate + +Setter: `.setTunerKSampleMinimumVectorCountToMutate()` +Map Name: `'tunerKSampleMinimumVectorCountToMutate'` + +```text +Minimum threshold value for vector indices to mutate within the feature vector during synthetic data generation. + +Higher values will result in more data variation, but potentially will create more of a challenge to converge during training. + +Default: 1 +``` + +> NOTE: random selection between this number and the total number of features within the vector will be used, depending on the mutation mode. +> Highly recommended to override if using 'fixed' mode on vector mutation method. + +#### Tuner KSample Vector Mutation Method + +Setter: `.setTunerKSampleVectorMutationMethod()` +Map Name: `'tunerKSampleVectorMutationMethod'` + +```text +One of three modes: +"fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes. +"random" - will use this number as a lower bound on a random selection of indexes between this and the vector length. +"all" - will mutate all of the vectors. + +Default: 'random' +``` + +#### Tuner KSample Mutation Mode + +Setter: `.setTunerKSampleMutationMode()` +Map Name: `'tunerKSampleMutationMode'` + +```text +Defines the method of mixing of the Vector positions selected to the centroid position.
+Available modes: +"weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors +"random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors +"ratio" - uses a ratio between the values of the centroid vector and the mutation vector + +Default: 'weighted' +``` + +#### Tuner KSample Mutation Value + +Setter: `.setTunerKSampleMutationValue()` +Map Name: `'tunerKSampleMutationValue'` + +```text +Specifies the magnitude of mixing in the 'weighted' or 'ratio' mutation modes. + +Default: 0.5 +``` +> NOTE: must be between 0 and 1 + +#### Tuner KSample Label Balance Mode + +Setter: `.setTunerKSampleLabelBalanceMode()` +Map Name: `'tunerKSampleLabelBalanceMode'` + +```text +Label balancing methodology for KSample. +Available Modes: +'match': Will match all smaller class counts to largest class count. [WARNING] - May significantly increase memory pressure! +'percentage': Will adjust smaller classes to a percentage value of the largest class count. +'target': Will increase smaller class counts to a fixed numeric target of rows. + +Default: 'match' +``` + +#### Tuner KSample Cardinality Threshold + +Setter: `.setTunerKSampleCardinalityThreshold()` +Map Name: `'tunerKSampleCardinalityThreshold'` + +```text +Cardinality check threshold for determining if synthetic data should be resolved on an ordinal scale (rounded / floored) +or left as a continuous data point + +Default: 20 +``` + +#### Tuner KSample Numeric Ratio + +Setter: `.setTunerKSampleNumericRatio()` +Map Name: `'tunerKSampleNumericRatio'` + +```text +For Percentage mode on setTunerKSampleLabelBalanceMode() +Defines the percentage target to match to based on the majority class. 
+ +Default: 0.2 +``` +> NOTE: must be between 0 and 1 (recommend setting this based on analysis of the class imbalance and the total data size) + +#### Tuner KSample Numeric Target + +Setter: `.setTunerKSampleNumericTarget()` +Map Name: `'tunerKSampleNumericTarget'` + +```text +For Target mode on setTunerKSampleLabelBalanceMode() +Defines a target number of rows for the minority class(es) to reach (real and synthetic data together) + +Default: 500 +``` +> NOTE: highly recommended to use if the data set is small. If large, use ratio or match settings. + + +#### Train Split Chronological Column + +Specifies the field used to order the data for the train / test split; the split is taken by sort order and a percentage of the data set. +> As specified above, there is no requirement that this field be a date or datetime type. However, it *is recommended*. + +Setter: `.setTunerTrainSplitChronologicalColumn()` +Map Name: `'tunerTrainSplitChronologicalColumn'` + +```text +Default: "datetime" + +This is a placeholder value. +Validation will occur when modeling begins (post data-prep) to ensure that this field exists in the data set. +``` +> It is ***imperative*** that this field exists in the raw DataFrame being supplied to the main class. ***CASE SENSITIVE MATCH*** +> > Failing to ensure this setting is correctly applied could result in an exception being thrown mid-run, wasting time and resources. + +#### Train Split Chronological Random Percentage + +Because a chronological split is performed by sorting the DataFrame and taking a percentage of it, each k-fold +generation would extract an identical train and test data set if the split boundary were left static. This +setting adds a 'jitter' to the train / test boundary so that k-fold validation provides more useful results. 
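The jitter idea described above can be sketched on a plain Scala collection. This is an illustration only, not the toolkit's implementation: the object name, the `split` signature, and the choice to shrink the train portion by a random amount of up to `randomPercentage` percent are all assumptions.

```scala
import scala.util.Random

// Hypothetical sketch of a chronological split with a jittered boundary.
object ChronologicalSplitSketch {

  // rows:             records to split
  // sortKey:          extracts the ordering value (e.g. an epoch timestamp)
  // trainPortion:     nominal fraction of rows used for training (e.g. 0.8)
  // randomPercentage: maximum jitter as a percentage value (fraction * 100)
  def split[A](rows: Seq[A],
               sortKey: A => Long,
               trainPortion: Double,
               randomPercentage: Double,
               rng: Random): (Seq[A], Seq[A]) = {
    val sorted = rows.sortBy(sortKey)
    // With a percentage of 0.0 the boundary never moves, so every fold is identical.
    val jitter =
      if (randomPercentage <= 0.0) 0.0
      else rng.nextDouble() * randomPercentage / 100.0
    val rawBoundary = math.round((trainPortion - jitter) * sorted.size).toInt
    // Keep at least one row on each side of the boundary when possible.
    val boundary = math.max(1, math.min(sorted.size - 1, rawBoundary))
    sorted.splitAt(boundary)
  }
}
```

Calling `split` repeatedly with the same `Random` instance and a non-zero percentage yields slightly different boundaries per fold, which is the effect the setting is described as providing.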
+ +Setter: `.setTunerTrainSplitChronologicalRandomPercentage()` +Map Name: `'tunerTrainSplitChronologicalRandomPercentage'` + +```text + +The size of the jitter, expressed as a percentage value (fractional * 100) +Default: 0.0 + +This is a placeholder value. +``` +> [WARNING] Failing to override this value if using "chronological" mode on `.setTunerTrainSplitMethod()` is equivalent to setting +`.setTunerKFold(1)` for efficacy purposes, and will simply waste resources by fitting multiple copies of the same hyper +parameters on the exact same data set. + +#### Tuner Seed + +Setter: `.setTunerSeed()` +Map Name: `'tunerSeed'` + +```text +Sets the seed both for generating the initial random pool of values and for initializing +a separate randomizer used for random train/test splits. + +Default: 42L +``` + +#### Tuner First Generation Gene Pool + +Setter: `.setTunerFirstGenerationGenePool()` +Map Name: `'tunerFirstGenerationGenePool'` + +```text +Determines the random search seed pool for the genetic algorithm to operate from. +The search space constraints on numeric, character, and boolean hyper parameters are distinct for each modeling +family and model type. +Setting this value higher increases the chances of minimizing convergence, at the expense of a longer run time. + +Default: 20 +``` +> NOTE: Setting this value below 10 is ***not recommended***. Values less than 6 are not permitted and will throw an assertion exception. + +#### Tuner Number of Generations + +Setter: `.setTunerNumberOfGenerations()` +Map Name: `'tunerNumberOfGenerations'` + +```text +This setting, applied only to batch processing mode, sets the number of mutation generations that will occur. + +The higher this number, the more thoroughly the hyper parameter space will be explored, although it comes at the +expense of longer run-time. This is a *sequential blocking* setting. Parallelism does not affect this. + +Default: 10 +``` +> NOTE: batch mode only! 
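The gene pool limits stated in the NOTE above (values below 6 rejected, values below 10 discouraged) can be checked before a run is launched. The sketch below is plain Scala with hypothetical type and function names; it mirrors the documented thresholds and does not reproduce the toolkit's own assertion logic.

```scala
// Hypothetical pre-flight check for 'tunerFirstGenerationGenePool'.
object GenePoolCheckSketch {

  // Possible outcomes of checking a proposed gene pool size.
  sealed trait PoolCheck
  case object Accepted extends PoolCheck
  final case class Discouraged(reason: String) extends PoolCheck
  final case class Rejected(reason: String) extends PoolCheck

  // Applies the documented thresholds: < 6 is not permitted, < 10 is not recommended.
  def checkGenePool(size: Int): PoolCheck =
    if (size < 6) Rejected(s"Gene pool of $size is below the permitted minimum of 6")
    else if (size < 10) Discouraged(s"Gene pool of $size is below the recommended minimum of 10")
    else Accepted

  // Documented defaults for the tuner settings in this part of the document.
  val tunerOverrides: Map[String, Any] = Map(
    "tunerSeed" -> 42L,
    "tunerFirstGenerationGenePool" -> 20,
    "tunerNumberOfGenerations" -> 10
  )
}
```

Running the check on a candidate value before building the override map avoids discovering the assertion failure only after a job has been submitted.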
+ +#### Tuner Number of Parents To Retain + +Setter: `.setTunerNumberOfParentsToRetain()` +Map Name: `tunerNumberOfParentsToRetain` + +```text +This setting will restrict the number of candidate 'best' results of the previous generation of hyper parameter tuning, +using these result's configuration to mutate the next generation of attempts. + +Default: 3 +``` + +> NOTE: The higher this setting, the more 'space exploration' that will occur. However, it may slow the possibility of + converging to an optimal condition. + + +#### Tuner Number of Mutations Per Generation + +Setter: `.setTunerNumberOfMutationsPerGeneration()` +Map Name: `tunerNumberOfMutationsPerGeneration` + +```text +This setting specifies the size of each evolution batch pool per generation (other than the first seed generation). + +Default: 10 +``` +> The higher this setting is set, the more alternative spaces are checked, however, if this value is higher than +what is set by `.setTunerParallelism()`, it will add to the run-time. + +#### Tuner Genetic Mixing + +Setter: `.setTunerGeneticMixing()` +Map Name: `'tunerGeneticMixing'` + +```text +This setting defines the ratio of impact that the 'best parent' that is used to mutate with a new randomly generated +child will have upon the mixed-inheritance hyper parameter. The higher this number, the more effect from the parent the parameter will have. + +Default: 0.7 + +Recommended range: {0.3, 0.8} +``` + +> NOTE: Setting this value < 0.1 is effectively using random parameter replacement. +Conversely, setting the value > 0.9 will not mutate the next generation strongly enough to effectively search the parameter space. + + +#### Tuner Generational Mutation Strategy + +Setter: `.setTunerGenerationalMutationStrategy()` +Map Name: `'tunerGenerationalMutationStrategy'` + +```text +Provides for one of two modes: +* Linear +> This mode will decrease the number of selected hyper parameters that will be mutated each generation. 
+It uses the fixed mutation value as the decrement. + + Example: + + A model family is selected that has 10 total hyper parameters. + + In "linear" mode for the generational mutation strategy, with a fixed mutation value of 1, + the number of available mutation parameters for the first mutation generation would be + set to a maximum value of 9 (randomly selected in range of {1, 9}). + At generation #2, the maximum mutation count for hyper parameters in the vector space + will decrement, leaving a randomly selected count of hyper parameters in the range {1, 8}. + This behavior continues until either the decrement value is 1 or the generations are exhausted. + +* Fixed +> This mode sets a static mutation count for each generation. The setting of Fixed Mutation Value +determines how many of the hyper parameters will be mutated each generation. There is no decrementing. + +Default: "linear" + +Available options: "linear" or "fixed" +``` +> NOTE: fixed mode may introduce very long training times in order to explore the hyper parameter space effectively. +>Only use 'fixed' mode when fine-tuning an already existing model with small amounts of new training data. + +#### Tuner Fixed Mutation Value + +Setter: `.setTunerFixedMutationValue()` +Map Name: `'tunerFixedMutationValue'` + +```text +Allows for restricting the number of hyper parameters to mutate per generational epoch. + +Default: 1 +``` +> NOTE: using this setting and keeping the value low will DRAMATICALLY increase the time to convergence. +> Only recommended for fine-tuning of an already existing use-case in which a small exploration of fine tuning +> is desired. + +#### Tuner Mutation Magnitude Mode + +Setter: `.setTunerMutationMagnitudeMode()` +Map Name: `'tunerMutationMagnitudeMode'` + +```text + +This setting determines the number of hyper parameter values that will be mutated during each mutation iteration. 
+ +There are two modes: +* "random" + +> In random mode, the setting of `.setGenerationalMutationStrategy()` is used, in conjunction with +the current generation count, to provide a bounded restriction on the number of hyper parameters +per model configuration that will be mutated. A random number of indices will be selected for +mutation in this range. + +* "fixed" + +> In fixed mode, a constant count of hyper parameters will be mutated, used in conjunction with +the setting of `.setGenerationalMutationStrategy()`. +>> i.e. With fixed mode, and a generational mutation strategy of "fixed", each mutation generation +would be static (e.g. fixedMutationValue of 3 would mean that each model of each generation would +always mutate 3 hyper parameters). Variations of the mixing of these configurations will result in +varying degrees of mutation aggressiveness. + +Default: "fixed" + +Available options: "fixed" or "random" +``` + +#### Tuner Evolution Strategy + +Setter: `.setTunerEvolutionStrategy()` +Map Name: `'tunerEvolutionStrategy'` + +```text +Determines the mode (batch vs. continuous) + +In batch mode, the hyper parameter space is explored with an initial seed pool (based on Random Search with constraints). + +After this initial pool is evaluated (in parallel), the best n parents from this seed generation are used to 'sire' a new generation. + +This continues for as many generations as are specified through the config `.setTunerNumberOfGenerations()`, +*or until a stopping threshold is reached at the conclusion of a concurrent generation batch run.* + + +Continuous mode uses the concept of micro-batching of hyper parameter tuning, running *n* models +in parallel. When each Future returns, its performance is evaluated and compared to +the current best results, and a new hyper parameter run is constructed. Since this is effectively +a queue/dequeue process utilizing concurrency, there is a certain degree of out-of-order processing +and return. 
Even if an early stopping criterion is met, the thread pool will wait for all committed +Futures to return their values before exiting the parallel concurrent execution context. + + +Default: "batch" (HIGHLY recommended) + +Available options: "batch" or "continuous" +``` +> Sets the mutation methodology used for optimizing the hyper parameters for the model. +>> NOTE! Continuous mode is 'experimental' and may be slightly unstable depending on the data set used in training. + +#### Tuner Genetic MBO Regressor Type + +Setter: `.setTunerGeneticMBORegressorType()` +Map Name: `'tunerGeneticMBORegressorType'` + +```text +The post-genetic algorithm stage consists of running MBO (Model-based optimization) on a priori hyper parameters +to score associations. + +This setting allows for a choice of XGBoost, RandomForest, or LinearRegression + +Default: 'XGBoost' +``` + +#### Tuner Genetic MBO Candidate Factor + +Setter: `.setTunerGeneticMBOCandidateFactor()` +Map Name: `'tunerGeneticMBOCandidateFactor'` + +```text +Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through +mutation for each generation other than the initial and post-modeling optimization phases. The larger this +value (default: 10), the more potential space can be searched. There is not a large performance hit to this, +and as such, values in excess of 100 are viable. + +Default: 10 +``` + +#### Tuner Continuous Evolution Improvement Threshold [EXPERIMENTAL] + +Setter: `.setTunerContinuousEvolutionImprovementThreshold()` +Map Name: `'tunerContinuousEvolutionImprovementThreshold'` + +```text +Setter for defining the secondary stopping criterion for continuous training mode (the number of consecutive +non-improving runs after which the learning algorithm terminates due to diminishing returns). +(an improvement to a priori will reset the counter and subsequent non-improvements +will decrement a mutable counter. 
If the counter reaches the limit specified by this value, the continuous +mode algorithm will stop). + +Default: -10 +``` +> NOTE: must be a negative Integer! + +#### Tuner Continuous Evolution Max Iterations [EXPERIMENTAL] + +Setter: `.setTunerContinuousEvolutionMaxIterations()` +Map Name: `'tunerContinuousEvolutionMaxIterations'` + +```text +This parameter sets the maximum cap on the number of hyper parameter tuning models that are +created and set to run. The higher the value, the better the chances for convergence to optimum +tuning criteria, at the expense of runtime. + + +Default: 200 +``` + +#### Tuner Continuous Evolution Stopping Score [EXPERIMENTAL] + +[OVERRIDE WARNING] + +Setter: `.setTunerContinuousEvolutionStoppingScore()` +Map Name: `'tunerContinuousEvolutionStoppingScore'` + +**NOTE**: ***This value MUST be overridden in regression problems*** + +```text +Setting for early stopping. When matched with the score type, this target is used to terminate the hyper parameter +tuning run so that when the threshold has been passed, no additional Future runs will be submitted to the concurrent +queue for parallel processing. + +Default: 1.0 + +This is a placeholder value. Ensure it is overridden for early stopping to function in classification problems. +``` +> NOTE: Because of the asynchronous nature of this algorithm, additional results may return after a stopping +criterion is met, since Futures may have been submitted before the result of a 'winning' run has returned. +> > This is intentional by design and does not constitute a bug. + +#### Tuner Continuous Evolution Parallelism [EXPERIMENTAL] + +Setter: `.setTunerContinuousEvolutionParallelism()` +Map Name: `'tunerContinuousEvolutionParallelism'` + +```text +This setting defines the number of concurrent Futures that are submitted in continuous mode. Setting this number too +high (i.e. 
> 5) will minimize / remove the functionality of continuous processing mode, as it will begin to behave +more like a batch mode operation. + +Default: 4 +``` +> TIP: **Recommended value range is {2, 5}** to see the greatest exploration benefit of the n-dimensional hyper +parameter space, with the benefit of run-time optimization by parallel async execution. + + +#### Tuner Continuous Evolution Mutation Aggressiveness [EXPERIMENTAL] + +Setter: `.setTunerContinousEvolutionMutationAggressiveness()` +Map Name: `'tunerContinuousEvolutionMutationAggressiveness'` + +```text +Similar to the batch mode setting `.setFixedMutationValue()`; however, there is no concept of a 'linear' vs 'fixed' +setting. There is only a fixed mode for continuous processing. This sets the number of hyper parameters that will +be mutated during each async model execution. + +Default: 3 +``` +> The higher the setting of this value, the more the feature space will be explored; however, the longer it may take to +converge to a 'best' tuned parameter set. + +> The recommendation is, for **exploration of a modeling task**, to set this value ***higher***. If trying to fine-tune a model, +or to automate the **re-tuning of a production model** on a scheduled basis, setting this value ***lower*** is preferred. + +#### Tuner Continuous Evolution Genetic Mixing [EXPERIMENTAL] + +Setter: `.setTunerContinuousEvolutionGeneticMixing()` +Map Name: `'tunerContinuousEvolutionGeneticMixing'` + +```text +This mirrors the batch mode genetic mixing parameter. Refer to description above. + +Default: 0.7 + +Restricted to range {0, 1} +``` + +#### Tuner Continuous Evolution Rolling Improvement Count [EXPERIMENTAL] + +Setter: `.setTunerContinuousEvolutionRollingImprovementCount()` +Map Name: `'tunerContinuousEvolutionRollingImprovementCount'` + +```text +[EXPERIMENTAL] +This is an early stopping criteria that measures the cumulative gain of the score as the job is running. 
+If improvements ***stop happening***, then the continuous iteration of tuning will stop to prevent useless continuation. + +Default: 20 +``` + +#### Tuner Model Seed + +Setters: `.setTunerModelSeed()` +Map Name: `'tunerModelSeed'` + +```text +Allows for 'jump-starting' a model tuning run, primarily for the purpose of +fine-tuning after a previous run, or for retraining a previously trained model. + +Provides a forced hyper parameter configuration for resumption of tuning, or to re-evaluate a model that has already +been tuned prior. Useful in production use cases where a solid baseline exists, but updated training data may +make a better model. +``` +> ***CAUTION*** : using a model Seed from an initial starting point on a data set (during exploratory phase) **may** +result in poor hyper parameter tuning performance. *It is always best to not seed a new model when exploring new use cases*. + +#### Tuner Hyper Space Inference Flag + +Setters: `.setTunerHyperSpaceInferenceFlag()` +Map Name: `'tunerHyperSpaceInferenceFlag'` + +```text +Whether or not to run a hyper space inference run at the conclusion of the genetic algorithm. + +Default: ON +``` +> It is HIGHLY ADVISED to leave this setting on. + +#### Tuner Hyper Space Inference Count + +Setters: `.setTunerHyperSpaceInferenceCount()` +Map Name: `'tunerHyperSpaceInferenceCount'` + +```text +Count of synthetic rows of permutations to generate for the MBO-based post genetic tuning runs. +Default: 200000 +``` +> NOTE: Maximum limit is 1,000,000 (1 Million). Values above that do not provide noticeable results more than 500k +>and simply put more stress on the driver. + +#### Tuner Hyper Space Model Count +Setters: `.setTunerHyperSpaceModelCount()` +Map Name: `'tunerHyperSpaceModelCount'` + +```text +Number of models to generate from the predicted 'best hyper parameters' from the MBO stage. +Default: 10 +``` +> NOTE: a Warning will be issued if this setting is applied >50. 
The likelihood that a better setting is arrived at +>over this threshold is slim to none. Most users set this in the range of 5 - 20 + +#### Tuner Hyper Space Model Type + +Setter: `.setTunerHyperSpaceModelType()` +Map Name: `'tunerHyperSpaceModelType'` + +```text +The type of Regressor to use in the MBO phase (RandomForest or XGBoost or LinearRegression) +Default: RandomForest +``` + +#### Tuner Initial Generation Mode + +Setter: `.setTunerInitialGenerationMode()` +Map Name: `'tunerInitialGenerationMode'` + +```text +Whether to use a random search (default) or a permutations-based search space. +Options: 'random' or 'permutations' + +Default: 'random' +``` +> The 'permutations' mode is recommended to ensure that full search is conducted throughout the algorithm. However, +>it is guaranteed to generate some bad results, and as such, is not turned on by default until users become familiar with +>the performance of the toolkit. + +#### Tuner Initial Generation Permutation Count + +Setter: `.setTunerInitialGenerationPermutationCount()` +Map Name: `'tunerInitialGenerationPermutationCount'` + +```text +Sets the number of hyper parameter combinations to generate and utilize when in 'permutations' mode for the initial +search space prior to the genetic algorithm being activated on the 2nd generation and onwards. +Default: 10 +``` + +#### Tuner Initial Generation Index Mixing Mode + +Setter: `.setTunerInitialGenerationIndexMixingMode()` +Map Name: `'tunerInitialGenerationIndexMixingMode'` + +```text +Sets the method in which the hyper parameter permutations are configured. +Linear will combine the generated min to max series of individual hyperparameters in order, exploring the n-dimensional space +in a linear manner. +Random will mix them in random combinations to achieve a more 'random search' across the n-dimensional hyper plane. 
+Options: 'random' or 'linear' +Default: 'linear' +``` + +#### Tuner Initial Generation Array Seed + +Setters: `.setTunerInitialGenerationArraySeed()` +Map Name: `'tunerInitialGenerationArraySeed'` + +```text +Sets the seed for the generation of permutations to make the samplers for the random mode first-selection repeatable. +Default: 42L +``` + +#### Tuner Output Df Repartition Scale Factor + +Setters: `.setTunerOutputDfRepartitionScaleFactor()` +Map Name: `'tunerOutputDfRepartitionScaleFactor'` + +```text +Sets the degree of repartitioning factor that is done on the output Dataframes coming from the toolkit. +Default: 3 +``` + +### Algorithm Config +```scala +case class AlgorithmConfig(var stringBoundaries: Map[String, List[String]], + var numericBoundaries: Map[String, (Double, Double)]) +``` + + +Setters (for all numeric boundaries): `.setNumericBoundaries()` + +Setters (for all string boundaries): `.setStringBoundaries()` + +> To override any of the features space exploration constraints, pick the correct Map configuration for the family +that is being used, define the Map values, and override with the common setters. 
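As an illustration of preparing a boundary override, the sketch below narrows two Random Forest ranges in plain Scala. The object name and the `withOverrides` helper are hypothetical; the default values are copied from the Random Forest numeric boundaries listed in this document, and whether the setters accept a partial map is not stated here, so the sketch always produces a complete map.

```scala
// Hypothetical helper for building a complete numeric boundary map before
// handing it to the documented .setNumericBoundaries() setter.
object BoundaryOverrideSketch {

  type NumericBoundaries = Map[String, (Double, Double)]

  // Random Forest defaults as listed in this document.
  val rfDefaults: NumericBoundaries = Map(
    "numTrees" -> Tuple2(50.0, 1000.0),
    "maxBins" -> Tuple2(10.0, 100.0),
    "maxDepth" -> Tuple2(2.0, 20.0),
    "minInfoGain" -> Tuple2(0.0, 1.0),
    "subSamplingRate" -> Tuple2(0.5, 1.0)
  )

  // Replaces selected ranges, rejecting unknown names and inverted ranges.
  def withOverrides(defaults: NumericBoundaries,
                    overrides: NumericBoundaries): NumericBoundaries = {
    overrides.foreach { case (name, (lower, upper)) =>
      require(defaults.contains(name), s"Unknown hyper parameter: $name")
      require(lower <= upper, s"Lower bound exceeds upper bound for $name")
    }
    defaults ++ overrides
  }

  // Example: restrict the forest size and depth, keep the other defaults.
  val narrowed: NumericBoundaries = withOverrides(
    rfDefaults,
    Map("numTrees" -> Tuple2(100.0, 400.0), "maxDepth" -> Tuple2(4.0, 12.0))
  )
}
```

Starting from the family's defaults and replacing only selected entries keeps every expected key present, which seems the safer choice when the required shape of the map is uncertain.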
+ +#### XGBoost + +###### Default Numeric Boundaries +```scala + def _xgboostDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "alpha" -> Tuple2(0.0, 1.0), + "eta" -> Tuple2(0.1, 0.5), + "gamma" -> Tuple2(0.0, 10.0), + "lambda" -> Tuple2(0.1, 10.0), + "maxDepth" -> Tuple2(3.0, 10.0), + "subSample" -> Tuple2(0.4, 0.6), + "minChildWeight" -> Tuple2(0.1, 10.0), + "numRound" -> Tuple2(25.0, 250.0), + "maxBins" -> Tuple2(25.0, 512.0), + "trainTestRatio" -> Tuple2(0.2, 0.8) + ) +``` + +#### Random Forest + +###### Default Numeric Boundaries +```scala +def _rfDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "numTrees" -> Tuple2(50.0, 1000.0), + "maxBins" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ) +``` +###### Default String Boundaries +```scala + def _rfDefaultStringBoundaries = Map( + "impurity" -> List("gini", "entropy"), + "featureSubsetStrategy" -> List("auto") + ) +``` +#### Gradient Boosted Trees + +###### Default Numeric Boundaries +```scala + def _gbtDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "maxBins" -> Tuple2(10.0, 100.0), + "maxIter" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "minInstancesPerNode" -> Tuple2(1.0, 50.0), + "stepSize" -> Tuple2(1E-4, 1.0) + ) +``` +###### Default String Boundaries +```scala + def _gbtDefaultStringBoundaries: Map[String, List[String]] = Map( + "impurity" -> List("gini", "entropy"), + "lossType" -> List("logistic") + ) +``` +#### Decision Trees + +###### Default Numeric Boundaries +```scala + def _treesDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "maxBins" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "minInstancesPerNode" -> Tuple2(1.0, 50.0) + ) +``` +###### Default String Boundaries +```scala + def _treesDefaultStringBoundaries: Map[String, List[String]] = Map( + 
"impurity" -> List("gini", "entropy") + ) +``` +#### Linear Regression + +###### Default Numeric Boundaries +```scala + def _linearRegressionDefaultNumBoundaries: Map[String, (Double, Double)] = Map ( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) +``` +###### Default String Boundaries +```scala + def _linearRegressionDefaultStringBoundaries: Map[String, List[String]] = Map ( + "loss" -> List("squaredError", "huber") + ) +``` +#### Logistic Regression + +###### Default Numeric Boundaries +```scala + def _logisticRegressionDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) +``` +###### Default String Boundaries +```scala + def _logisticRegressionDefaultStringBoundaries: Map[String, List[String]] = Map( + "" -> List("") + ) +``` + +> NOTE: ***DO NOT OVERRIDE THIS*** + +#### Multilayer Perceptron Classifier + +###### Default Numeric Boundaries +```scala + def _mlpcDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "layers" -> Tuple2(1.0, 10.0), + "maxIter" -> Tuple2(10.0, 100.0), + "stepSize" -> Tuple2(0.01, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5), + "hiddenLayerSizeAdjust" -> Tuple2(0.0, 50.0) + ) +``` +###### Default String Boundaries +```scala + def _mlpcDefaultStringBoundaries: Map[String, List[String]] = Map( + "solver" -> List("gd", "l-bfgs") + ) +``` +#### Linear Support Vector Machines + +###### Default Numeric Boundaries +```scala + def _svmDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) +``` +###### Default String Boundaries +```scala + def _svmDefaultStringBoundaries: Map[String, List[String]] = Map( + "" -> List("") + ) +``` +> NOTE: ***DO NOT OVERRIDE 
THIS*** + +#### LightGBM Families [Coming Soon when Concurrency works with LightGBM] + +###### Default Numeric Boundaries +```scala + def _lightGBMDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "baggingFraction" -> Tuple2(0.5, 1.0), + "baggingFreq" -> Tuple2(0.0, 1.0), + "featureFraction" -> Tuple2(0.6, 1.0), + "learningRate" -> Tuple2(1E-8, 1.0), + "maxBin" -> Tuple2(50, 1000), + "maxDepth" -> Tuple2(3.0, 20.0), + "minSumHessianInLeaf" -> Tuple2(1e-5, 50.0), + "numIterations" -> Tuple2(25.0, 250.0), + "numLeaves" -> Tuple2(10.0, 50.0), + "lambdaL1" -> Tuple2(0.0, 1.0), + "lambdaL2" -> Tuple2(0.0, 1.0), + "alpha" -> Tuple2(0.0, 1.0) + ) +``` + +###### Default String Boundaries +```scala + def _lightGBMDefaultStringBoundaries: Map[String, List[String]] = Map( + "boostingType" -> List("gbdt", "rf", "dart", "goss") + ) +``` + + +### Logging Config +```scala +case class LoggingConfig(var mlFlowLoggingFlag: Boolean, + var mlFlowLogArtifactsFlag: Boolean, + var mlFlowTrackingURI: String, + var mlFlowExperimentName: String, + var mlFlowAPIToken: String, + var mlFlowModelSaveDirectory: String, + var mlFlowLoggingMode: String, + var mlFlowBestSuffix: String, + var inferenceConfigSaveLocation: String, + var mlFlowCustomRunTags: Map[String, String] +) +``` + +### MLFlow Settings + +MLFlow integration in this toolkit allows for logging and tracking of not only the best model returned by a particular run, +but also a tracked history of all hyper parameters, scoring results for validation, and a location path to the actual +model artifacts that are generated for each iteration. + +More information: [MLFlow](https://mlflow.org/docs/latest/index.html), [API Docs](https://mlflow.org/docs/latest/java_api/index.html) + +The implementation here leverages the JavaAPI and can support both remote and Databricks-hosted MLFlow deployments. 
+ + + +#### MLFlow Logging Flag + +Setters: `.mlFlowLoggingOn()` and `.mlFlowLoggingOff()` +Map Name: `'mlFlowLoggingFlag'` + +```text +Provides for either logging the results of the hyper parameter tuning run to MLFlow or not. + +Default: on +``` +> NOTE: it is HIGHLY ADVISED to leave this on. Most algorithms produce far too much text in stdout to follow the results. + +#### MLFlow Log Artifacts Flag [DEPRECATED API] + +```text +Deprecated for non-pipeline runs. +Not recommended to use the older APIs. + +This will be removed in future releases. +``` + +#### MLFlow Tracking URI [FOR EXTERNAL MLFLOW TRACKING SERVERS ONLY] + +Setter: `.setMlFlowTrackingURI()` +Map Name: `'mlFlowTrackingURI'` + +```text +If using a non-Databricks hosted MLFlow instance, this is the address of your tracking server. + +If running on Databricks, this information is automatically pulled for you and used to do user-based authentication. +``` + +#### MLFlow Experiment Name [CRITICAL TO OVERRIDE FOR MOST USE CASES] + +Setter: `.setMlFlowExperimentName()` +Map Name: `'mlFlowExperimentName'` + +```text +The Workspace-resolved path within your shard that you would like the model's results to be logged to. + +If this value is not specified, it will log to the same Workspace directory path that the current Notebook is running within. +``` +> Highly recommended to override this value!! + +#### MLFlow API Token [FOR EXTERNAL MLFLOW TRACKING SERVERS ONLY] + +Setter: `.setMlFlowAPIToken()` +Map Name: `'mlFlowAPIToken'` + +```text +If using a non-Databricks hosted MLFlow instance, this is the API Token for your tracking server. + +If running on Databricks, this information is automatically pulled for you and used to do user-based authentication. 
+``` + + +#### MLFlow Model Save Directory [CRITICAL TO OVERRIDE FOR MOST USE CASES] + +Setter: `.setMlFlowModelSaveDirectory()` +Map Name: `'mlFlowModelSaveDirectory'` + +```text +The path root to store all of the models that are generated with each hyper parameter optimization iteration. +These will be prefixed and labeled with the UUID that is generated for the run (also logged in mlflow model location parameter) + +This will infer a dbfs location for writing the model artifacts to. +``` +> NOTE: it is HIGHLY ADVISED to override the default!! + +#### MLFlow Logging Mode + +Setter: `.setMlFlowLoggingMode()` +Map Name: `'mlFlowLoggingMode'` + +```text +Sets whether to log all results (default), just the best run's results, or just the tuning results. + +Options: 'full', 'bestOnly', 'tuningOnly' + +Default: 'full' +``` + + +#### MLFlow Best Suffix + +Setter: `.setMlFlowBestSuffix()` +Map Name: `'mlFlowBestSuffix'` + +```text +The suffix used for the separate MLFlow log entry that holds the best results for each of the Families tested. + +Default: '_best' +``` + +#### Inference Config Save Location [DEPRECATED API] + +Setter: `.setInferenceConfigSaveLocation()` +Map Name: `'inferenceConfigSaveLocation'` + +```text +Can be set to a location on dbfs for now, but this API is deprecated in favor of the PipelineAPI and will be +removed in a future version. +``` + +#### MLFlow Custom Run Tags + +Setter: `.setMlFlowCustomRunTags()` +Map Name: `'mlFlowCustomRunTags'` + +```text +Additional data that can be logged about the run in mlflow. +``` + +#### Tuner Delta Cache Backing Directory + +Setter: `.setTunerDeltaCacheBackingDirectory()` +Map Name: `'tunerDeltaCacheBackingDirectory'` + +```text +If using 'delta' mode on the split caching strategy, this is the dbfs root path to use to temporarily (or persistently) +write the delta train / test split tables to. 
+``` +> NOTE: ensure you have permissions to write to this dbfs location prior to attempting to conduct an automl toolkit run. 
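The logging settings that the headings above mark as critical to override can be grouped into one override map. The following is a minimal sketch in plain Scala: the object and function names are hypothetical, the user path is a placeholder, and only the map keys, the logging mode options, and the `'_best'` default come from the settings documented above.

```scala
// Hypothetical builder for the MLFlow-related entries of a configuration override map.
object LoggingOverrideSketch {

  // Logging modes listed as valid for 'mlFlowLoggingMode'.
  val allowedLoggingModes: Set[String] = Set("full", "bestOnly", "tuningOnly")

  // Builds the overrides for one named run; paths are illustrative placeholders.
  def loggingOverrides(runName: String, loggingMode: String): Map[String, Any] = {
    require(
      allowedLoggingModes.contains(loggingMode),
      s"Unsupported logging mode '$loggingMode'; expected one of ${allowedLoggingModes.mkString(", ")}"
    )
    Map(
      "mlFlowExperimentName" -> s"/Users/someone@example.com/AutoML/$runName",
      "mlFlowModelSaveDirectory" -> s"dbfs:/automl/$runName",
      "mlFlowLoggingMode" -> loggingMode,
      "mlFlowBestSuffix" -> "_best"
    )
  }
}
```

The resulting map can be merged with other overrides (for example with `++`) before being passed to the configuration generator shown at the top of this document.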
+ +#### Tuner Delta Cache Backing Directory Removal Flag + +Setter: `.setTunerDeltaCacheBackingDirectoryRemovalFlag()` +Map Name: `'tunerDeltaCacheBackingDirectoryRemovalFlag'` + +```text +Specifies whether to 'clean up' the delta dbfs location for 'delta mode' split caching strategy + +Some users may want to have a persistent copy of their train/test splits for particular runs to do evaluation later on +or to keep for auditing / replayability reasons. If this is the case, set this flag to FALSE. + +Default: True, will remove files +``` + +#### Split Caching Strategy + +Setter: `.setSplitCachingStrategy()` +Map Name: `'splitCachingStrategy'` + +```text +Mode to use to store the train / test splits before model tuning begins. + +Options: 'cache' or 'persist' or 'delta' + +cache - memory caches the splits +persist - local fs disk persists the splits +delta - writes the splits to delta, then returns a reference reader from that location + +Default: cache +``` +> NOTE: take note of the size of the modeling data set and set the appropriate mode based on the memory pressure of the cluster +>and the algorithms chosen to run. + +### Instance Config +This is the main wrapper for all of the above grouped configuration modules. +```scala +case class InstanceConfig( + var modelFamily: String, + var predictionType: String, + var genericConfig: GenericConfig, + var switchConfig: SwitchConfig, + var featureEngineeringConfig: FeatureEngineeringConfig, + var algorithmConfig: AlgorithmConfig, + var tunerConfig: TunerConfig, + var loggingConfig: LoggingConfig +) +``` + +# Inference + +### Pipeline API +To use the Pipeline API, see [here](PIPELINE_API_DOCS.md) for example usage. 
+ +## Batch Inference Mode [DEPRECATED API, Pipeline API is now the official main access paradigm] + +Batch Inference Mode is a feature that allows for a particular run's settings to be preserved, recalled, and used to +apply the precise feature engineering actions that occurred during a model training run. This is useful in order to
This is useful in order to +utilize the built model for batch inference, or to understand how to write real-time transformation logic in order to +create a feature vector that can be sent through a model in a model-as-a-service mode. + +***This mode REQUIRES MLFlow in order to function*** + +Each individual model will get two forms of the configuration: +1. A compact JSON data structure that is logged as a tagged element to the mlflow run that can be copied and +pasted (or retrieved through the mlflow API) for any particular run that would be used to do batch inference +2. A Dataframe, saved to a specified location, that contains the same configuration. + +There are two primary entry points for an inference run. One simply requires the path of the DataFrame (preferred), +while the other requires the json string itself. Everything needed to execute the inference prediction is +contained within this data structure (either the json or the Dataframe). + + +## Feature Importance + +To utilize the Feature Importance functionality and the associated API, the settings are similar to what is listed +above in the main APIs.
+ +The total config is: + +```scala +case class FeatureImportanceConfig( + labelCol: String, + featuresCol: String, + dataPrepParallelism: Int, + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]], + scoringMetric: String, + trainPortion: Double, + trainSplitMethod: String, + trainSplitChronologicalColumn: String, + trainSplitChronlogicalRandomPercentage: Double, + parallelism: Int, + kFold: Int, + seed: Long, + scoringOptimizationStrategy: String, + firstGenerationGenePool: Int, + numberOfGenerations: Int, + numberOfMutationsPerGeneration: Int, + numberOfParentsToRetain: Int, + geneticMixing: Double, + generationalMutationStrategy: String, + mutationMagnitudeMode: String, + fixedMutationValue: Int, + autoStoppingScore: Double, + autoStoppingFlag: Boolean, + evolutionStrategy: String, + continuousEvolutionMaxIterations: Int, + continuousEvolutionStoppingScore: Double, + continuousEvolutionParallelism: Int, + continuousEvolutionMutationAggressiveness: Int, + continuousEvolutionGeneticMixing: Double, + continuousEvolutionRollingImprovementCount: Int, + dataReductionFactor: Double, + firstGenMode: String, + firstGenPermutations: Int, + firstGenIndexMixingMode: String, + firstGenArraySeed: Long, + fieldsToIgnore: Array[String], + numericFillStat: String, + characterFillStat: String, + modelSelectionDistinctThreshold: Int, + dateTimeConversionType: String, + modelType: String, + featureImportanceModelFamily: String, + featureInteractionFlag: Boolean, + featureInteractionRetentionMode: String, + featureInteractionContinuousDiscretizerBucketCount: Int, + featureInteractionParallelism: Int, + featureInteractionTargetInteractionPercentage: Double, + deltaCacheBackingDirectory: String, + deltaCacheBackingDirectoryRemovalFlag: Boolean, + splitCachingStrategy: String +) +``` + +These settings are applied through using the Configuration Generator, with identical map overrides as specified in +the previous sections. 
+ +The main API for feature importances is used as follows: + +```scala +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.exploration.FeatureImportances + +val configurationOverrides = Map( + "labelCol" -> "my_label", + "tunerParallelism" -> 6, + "tunerKFold" -> 3, + "tunerTrainSplitMethod" -> "stratified", + "scoringMetric" -> "areaUnderROC", + "tunerNumberOfGenerations" -> 3, + "tunerNumberOfMutationsPerGeneration" -> 8, + "tunerInitialGenerationMode" -> "permutations", + "tunerInitialGenerationPermutationCount" -> 18, + "tunerFirstGenerationGenePool" -> 18 +) + +val config = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", configurationOverrides) +val featConfig = ConfigurationGenerator.generateFeatureImportanceConfig(config) + +val featureImportances = FeatureImportances(df, featConfig, "count", 20).generateFeatureImportances() +``` + + diff --git a/LICENSE.txt b/LICENSE.txt new file mode 100644 index 00000000..756a852c --- /dev/null +++ b/LICENSE.txt @@ -0,0 +1,20 @@ +Databricks Labs - AutoML Toolkit (the "Software") + +This library (the "Software") may not be used except in connection with the Licensee's use of the Databricks Platform Services +pursuant to an Agreement (defined below) between Licensee (defined below) and Databricks, Inc. ("Databricks"). This Software +shall be deemed part of the “Downloadable Services” under the Agreement if such term is defined therein. If the Agreement does +not define Downloadable Services but does defined Subscription Services, then this Softare shall be deemed "Subscription Services". 
+If neither term is defined in such Agreement, than the term that refers to the applicable Databricks Platform Services (as defined below) +shall be substituted herein for “Downloadable Services.” Licensee's use of the Software must comply at all times with any restrictions +applicable to the Downloadable (or if not defined, Subscription) Services, generally, and must be used in accordance with any applicable +documentation. If you have not agreed to an Agreement or otherwise do not agree to these terms, you may not use the Software. This +license terminates automatically upon the termination of the Agreement or Licensee's breach of these terms. + +Agreement: the agreement between Databricks and Licensee governing the use of the Databricks Platform Services, which shall be, with +respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks +Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless Licensee +has entered into a separate written agreement with Databricks governing the use of the applicable Databricks Platform Services. + +Databricks Platform Services: the Databricks services or the Databricks Community Edition services, according to where the Software is used. + +Licensee: the user of the Software, or, if the Software is being used on behalf of a company, the company. \ No newline at end of file diff --git a/PIPELINE_API_DOCS.md b/PIPELINE_API_DOCS.md new file mode 100644 index 00000000..257720d1 --- /dev/null +++ b/PIPELINE_API_DOCS.md @@ -0,0 +1,185 @@ +# Pipeline API for the AutoML-Toolkit + +The AutoML-Toolkit is an automated ML solution for Apache Spark. It provides common data cleansing and feature +engineering support, automated hyper-parameter tuning through distributed genetic algorithms, and model tracking +integration with MLFlow. 
It currently supports Supervised Learning algorithms that are provided as part of Spark Mllib. + +## General Overview + +The AutoML toolkit exposes the following pipeline-related APIs via [FamilyRunner](src/main/scala/com/databricks/labs/automl/executor/FamilyRunner.scala) + +#### [Inference using PipelineModel](#full-predict-pipeline-api) | [Inference using MLflow Run ID](#running-inference-pipeline-directly-against-an-mlflow-run-id-since-v061) + +### Full Predict pipeline API: +```text +executeWithPipeline() +``` +This pipeline API works with the existing configuration object (and overrides) as listed [here](APIDOCS.md), +but it returns the following output +```text +FamilyFinalOutputWithPipeline( + familyFinalOutput: FamilyFinalOutput, + bestPipelineModel: Map[String, PipelineModel] +) +``` +As noted, ```bestPipelineModel``` contains a key, value pair of a model family +and the best pipeline model (based on the selected ```scoringOptimizationStrategy```) + +Example: +```scala +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.executor.FamilyRunner +import org.apache.spark.ml.PipelineModel + +val data = spark.table("ben_demo.adult_data") +val overrides = Map( + "labelCol" -> "label", "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, "oneHotEncodeFlag" -> true, + "pipelineDebugFlag" -> true +) +val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) +val runner = FamilyRunner(data, Array(randomForestConfig)) + .executeWithPipeline() + +runner.bestPipelineModel("RandomForest").transform(data) + +//Serialize it +runner.bestPipelineModel("RandomForest").write.overwrite().save("tmp/predict-pipeline-1") + +// Load it for running inference +val pipelineModel = PipelineModel.load("tmp/predict-pipeline-1") +val predictDf = pipelineModel.transform(data) +``` + + +### Feature engineering pipeline API: +```text +generateFeatureEngineeredPipeline(verbose: 
Boolean = false) +``` +@param ```verbose```: If set to true, any dataset transformed with this feature engineered pipeline will include all + input columns for the vector assembler stage + +This API builds a feature engineering pipeline based on the existing configuration object (and overrides) +as listed [here](APIDOCS.md). It returns back the output of type ```Map[String, PipelineModel]``` where ```(key -> value)``` are +```(modelFamilyName -> featureEngPipelineModel)``` + +Example: +```scala +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.executor.FamilyRunner +import org.apache.spark.ml.PipelineModel + +val data = spark.table("ben_demo.adult_data") +val overrides = Map( + "labelCol" -> "label", "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, "oneHotEncodeFlag" -> true, + "pipelineDebugFlag" -> true +) +val randomForestConfig = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", overrides) +val runner = FamilyRunner(data, Array(randomForestConfig)) + .generateFeatureEngineeredPipeline(verbose = true) + +runner("RandomForest") +.write +.overwrite() +.save("tmp/feat-eng-pipeline-1") + +val featEngDf = PipelineModel +.load("tmp/feat-eng-pipeline-1") +.transform(data) +``` + +### Running Inference Pipeline directly against an MLflow RUN ID since v0.6.1: +With this release, it is now possible to run inference given a Mlflow RUN ID, +since pipeline API now automatically registers inference pipeline model with Mlflow along with +a bunch of other useful information, such as pipeline execution progress and each Pipeline +stage transformation. This can come very handy to view the train pipeline's progress +as well as troubleshooting. + +
+ Example of Pipeline Tags registered with Mlflow + + ## Heading + An example of pipeline tags in Mlflow + ![Alt text](images/mlflow-1.png) + + And one of the transformations in a pipeline + ![Alt text](images/mlflow-2.png) +
+ +#### Example (As of 0.7.1) +##### Model Pipeline MUST be trained/tuned using 0.7.1+ +As of 0.7.1, the API ensures that data scientists can very easily send model to data engineering for production with +only an MLFlow Run ID. This is made possible by the addition of the full main config tracked in MLFlow. +This greatly simplifies the Inference Pipeline but it also enables config tracking and verification much easier. +When `mlFlowLoggingFlag` is `true` the config is tracked on every model tracked as per +[mlFlowLoggingMode](APIDOCS.md#mlflow-logging-mode). +![Alt text](images/MLFLow_Config_Tracking.png) + +Most teams follow the process: +* Data Science + * Iterative training, testing, validation, review, tracking + * Identification of model to move to production + * Submit ticket to Data Engineering with MLFlow RunID to productionize model + +* Data Engineering + * Productionize Model + +Below is a full pipeline example +```scala +// Data Science +val data = spark.table("my_database.myTrain_Data") +val overrides = Map( + "labelCol" -> "label", "mlFlowLoggingFlag" -> true, + "scalingFlag" -> true, "oneHotEncodeFlag" -> true +) +val randomForestConfig = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", overrides) +val runner = FamilyRunner(data, Array(randomForestConfig)).executeWithPipeline() + +// Data Engineering +val pipelineBest = PipelineModelInference.getPipelineModelByMlFlowRunId("111825a540544443b9db14e5b9a6006b") +val prediction = pipelineBest.transform(spark.read.format("delta").load("dbfs:/.../myDataForInference")) + .withColumn("priceReal", exp(col("price"))).withColumn("prediction", exp(col("prediction"))) +prediction.write.format("delta").saveAsTable("my_database.newestPredictions") +``` + +MLFLow_Config_Tracking.png + +#### Example (Deprecated as of 0.7.1): +```scala +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.executor.FamilyRunner +import 
org.apache.spark.ml.PipelineModel +import com.databricks.labs.automl.pipeline.inference.PipelineModelInference + +val data = spark.table("ben_demo.adult_data") +val overrides = Map( + "labelCol" -> "label", "mlFlowLoggingFlag" -> true, + "scalingFlag" -> true, "oneHotEncodeFlag" -> true, + "pipelineDebugFlag" -> true +) +val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) +val runner = FamilyRunner(data, Array(randomForestConfig)) + .executeWithPipeline() + +val mlFlowRunId = runner.bestMlFlowRunId("RandomForest") + +val loggingConfig = randomForestConfig.loggingConfig +val pipelineModel = PipelineModelInference.getPipelineModelByMlFlowRunId(mlFlowRunId, loggingConfig) +pipelineModel.transform(data.drop("label")).drop("features").show(10) +``` + +### Pipeline Configurations +As noted above, all the pipeline APIs will work with the existing configuration objects. In addition to those, the pipeline API +exposes the following configurations: + +```text +default: false +pipelineDebugFlag: A Boolean flag for pipeline logging purposes. When turned on, each stage in a pipeline execution +will print and log out a lot of useful information that can be used to track transformations for debugging/troubleshooting +purposes. Since v0.6.1, when this flag is turned on, the pipeline reports all of these transformations to Mlflow as Run tags.
+``` + + diff --git a/README.md b/README.md index fb3a228d..258b76f0 100644 --- a/README.md +++ b/README.md @@ -1 +1,181 @@ -# providentia +# Databricks Labs AutoML +[Release Notes](RELEASE_NOTES.md) | +[Python API Docs](python/docs/APIDOCs.md) | +[Python Artifact](python/dist/pyAutoML-0.1.0-py3-none-any.whl) | +[Developer Docs](APIDOCS.md) | +[Python Docs](python/docs/APIDOCs.md) | +[Demo](demos) | +[Release Artifacts](bin) | +[Contributors](#core-contribution-team) + + +This Databricks Labs project is a non-supported end-to-end supervised learning solution for automating: +* Feature clean-up + * Advanced NA fill, covariance calculations, collinearity determination, outlier filtering, and data casting +* Feature Importance calculation suite + * RandomForest or XGBoost determinations +* Feature Interaction with [Information Gain selection](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) +* Feature vectorization +* Advanced train/test split techniques (including Distributed [SMOTE](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis#SMOTE) (KSample)) +* Model selection and training +* Hyper parameter optimization and selection + * Hyperspace, Genetic, and MBO-based selection +* Batch Prediction through serialized [SparkML Pipelines](https://spark.apache.org/docs/latest/ml-pipeline.html) +* Logging of model results and training runs (using [MLFlow](https://mlflow.org)) + +This package utilizes Apache Spark ML and currently supports the following model family types: + +* Decision Trees (Regressor and Classifier) +* Gradient Boosted Trees (Regressor and Classifier) +* Random Forest (Regressor and Classifier) +* Linear Regression +* Logistic Regression +* Multi-Layer Perceptron Classifier +* Support Vector Machines +* XGBoost (Regressor and Classifier) + +> NOTE: LightGBM support is built-in, but is in Experimental mode and cannot be accessed from the FamilyRunner API +> while we are undergoing testing and evaluation of thread
concurrency issues with the LightGBM code base. + +## Documentation + +Scala API documentation can be found [here](APIDOCS.md) + +Python API documentation can be found [here](python/docs/APIDOCs.md) + + +## Building + +Databricks Labs AutoML can be built with either [SBT](https://www.scala-sbt.org/) or [Maven](https://maven.apache.org/). + +```text +This package requires Java 1.8.x and Scala 2.11.x to be installed on your system prior to building. +``` + +After cloning this repo onto your local system, navigate to the root directory and execute either: + +##### Maven Build +```sbtshell +mvn clean install -DskipTests +``` + +##### SBT Build +```sbtshell +sbt package +``` +If there is any StackOverflowError during the build, adjust the stack size on your computer's JVM. Example: +```sbtshell +#For Maven +export MAVEN_OPTS=-Xss2m +#For SBT +export SBT_OPTS="-Xss2M" +``` + + +The build commands above skip unit test execution. Running unit tests in local mode against this package is not recommended, as unit testing is asynchronous and incredibly CPU intensive for this code base. + + +## Setup + +Once the artifact has been built, attach to the Databricks Shard through either the [DBFS API](https://docs.databricks.com/api/latest/dbfs.html) or the GUI. Once loaded into the account, utilize either the [Libraries API](https://docs.databricks.com/api/latest/libraries.html#install) to attach to a cluster, or utilize the GUI to attach the .jar to the cluster. + +```text +NOTE: It is not recommended to attach this library to all clusters on the account. + +Use of an ML Runtime cluster configuration is highly advised to ensure that custom management of dependent +libraries and configurations are provided 'out of the box' + +``` + +Attach the following libraries to the cluster: +* The automl toolkit jar created above. (automatedml_2.11-((version)).jar) +* If using the PySpark API for the toolkit, the [.whl file](python/docs/APIDOCs.md#Setup) for the PySpark API.
+ +> IMPORTANT NOTE: as of release 0.7.1, the mlflow libraries in pypi and Maven are NO LONGER NEEDED. Attaching them +> to your cluster WILL prevent the run from logging and will throw an exception. DO NOT ATTACH EITHER OF THEM. + +## Getting Started + +This package provides a number of different levels of API interaction, from the highest-level "default only" FamilyRunner to low-level APIs that allow for highly customizable workflows to be created for automated ML tuning and Inference. + +For the purposes of a quick-start intro, the below example is of the highest-level API access point. + +```scala + +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.executor.FamilyRunner + +val data = spark.table("ben_demo.adult_data") + +val overrides = Map("labelCol" -> "income", +// "mlFlowExperimentName" is omitted here; it defaults to the current notebook directory +"mlFlowModelSaveDirectory" -> "dbfs:/ml/FirstAutoMLRun/", +"inferenceConfigSaveLocation" -> "dbfs:/ml/FirstAutoMLRun/inference" +) + +val randomForestConfig = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", overrides) +val gbtConfig = ConfigurationGenerator.generateConfigFromMap("GBT", "classifier", overrides) +val logConfig = ConfigurationGenerator.generateConfigFromMap("LogisticRegression", "classifier", overrides) + +val runner = FamilyRunner(data, Array(randomForestConfig, gbtConfig, logConfig)).execute() +``` +This example will take the default configuration for all of the application parameters (except for the parameters overridden in the `overrides` Map) and execute Data Preparation tasks, Feature Vectorization, and automatic tuning of all 3 specified model types. At the conclusion of each run, the results and model artifacts will be logged to the mlflow location that was specified in the configuration.
+ +For a listing of all available parameter overrides and their functionality, see the [Developer Docs](APIDOCS.md) + +## Pipeline API +### v0.6.0 +Starting with this release, AutoML now exposes an API to work with the pipeline semantics around +feature engineering steps and full predict pipelines. Example: + +```scala +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.executor.FamilyRunner +import org.apache.spark.ml.PipelineModel + +val data = spark.table("ben_demo.adult_data") +val overrides = Map( + "labelCol" -> "label", + "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, + "oneHotEncodeFlag" -> true, + "pipelineDebugFlag" -> true +) +val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) + +val runner = FamilyRunner(data, Array(randomForestConfig)).executeWithPipeline() + +runner.bestPipelineModel("RandomForest").transform(data) + +//Serialize it +runner.bestPipelineModel("RandomForest").write.overwrite().save("tmp/predict-pipeline-1") + +// Load it for running inference +val pipelineModel = PipelineModel.load("tmp/predict-pipeline-1") +val predictDf = pipelineModel.transform(data) +``` +### Inference via Mlflow Run ID +It is also possible to use an MlFlow Run ID for inference, if Mlflow logging is turned on during training. +For usage, see [this](PIPELINE_API_DOCS.md#running-inference-pipeline-directly-against-an-mlflow-run-id-since-v061) + +For all available pipeline APIs, please see [Developer Docs](PIPELINE_API_DOCS.md) + +## Feedback + +Issues with the application? Found a bug? Have a great idea for an addition? +Feel free to file an issue. + +## Contributing +Have a great idea that you want to add? Fork the repo and submit a PR! + +## Legal Information +This software is provided as-is and is not officially supported by Databricks through customer technical support channels.
+Support, questions, and feature requests can be communicated via email -> benjamin.wilson@databricks.com or through the Issues page of this repo. +Please see the [legal agreement](LICENSE.txt) and understand that issues with the use of this code will not be answered or investigated by Databricks Support. + +## Core Contribution team +* Lead Developer: Ben Wilson, Practice Leader, Databricks +* Developer: Daniel Tomes, RSA Practice Leader, Databricks +* Developer: Jas Bali, Sr. Solutions Consultant, Databricks +* Developer: Mary Grace Moesta, Customer Success Engineer, Databricks diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md new file mode 100644 index 00000000..0cbd208d --- /dev/null +++ b/RELEASE_NOTES.md @@ -0,0 +1,245 @@ +## Auto ML Toolkit Release Notes + +### Version 0.7.1 +#### Features +* Complete overhaul of train/test splitting and kFolding. Prior to this performance scaling improvement, +the train and test data sets would be calculated during each model's kfold stage, resulting in non-homogeneous +comparisons between hyper parameters, as well as performance degradation from constantly having to perform +splitting of the data. There are three new parameters to control the behavior of splitting: +```text +Configuration parameters: + +- "tunerDeltaCacheBackingDirectory" + This new setting will define a location on dbfs in order to write extremely large training and test split data sets + to. This is particularly recommended for use cases in which the volume of the data is so large that making even a + few copies of the raw data would exceed budgeted cluster size allowances (recommended for data sets that are in the + 100's of GB range - TB range) +- "tunerDeltaCacheBackingDirectoryRemovalFlag" + This new setting (Boolean flag) will determine, if using 'delta' mode on splitCachingStrategy, whether or not + to delete the delta data sets on dbfs after training is completed. By default, this is set to true (will delete and + clean up the directories).
If further evaluation or testing of the train test splits is needed after the run is + completed, or if investigation into the composition of the splits is desired, or if auditing of the training data + is required due to business rules, set this flag to false. + NOTE: directory pathing prevention of collisions is done through the generation of a UUID. The path on dbfs + will have this as part of the root bucket for the run. +- "splitCachingStrategy" DEFAULT: 'persist' + Options: 'cache', 'persist', or 'delta' + - delta mode: this will perform a train/test split for each kFold specified in the job definition, write the train + and test datasets to dbfs in delta format, and provide a reference to the delta source for the training run. + NOTE: this will incur overhead and is NOT recommended for data sets that can easily fit multiple copies into + memory on the cluster. + - persist mode: this will cache and persist the train and test kFold data sets onto local disk. This is recommended + for larger data sets that in order to fit n copies of the data in memory would require an extremely large or + expensive cluster. This is the default mode. + - cache mode: this will use standard caching (memory and disk) for the kFold sets of train and test. This mode + is only recommended if the data set is relatively small and k copies of the data set can comfortably reside in memory + on the cluster. +``` +* Main config is now written and tracked via MLFlow. Any pipeline trained as of 0.7.1 will provide a full config +in json format in MLFlow Artifacts and next to your saved models path. + +* Run Inference Pipelines with only a RunID. You no longer have to track and manage a LoggingConfig to pass into +the inference pipeline. That constructor has been deprecated, only use it for legacy pipelines. Old training pipelines +will not be able to run this way but all future pipelines created as of 0.7.1 will be able to run with only the +MLFlow runId. 
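A minimal sketch of how the three split-related parameters listed under Features above might be combined for a very large data set. The keys are the ones named in these release notes; the dbfs path is a hypothetical placeholder, and keeping the splits after the run (removal flag set to `false`) is only one possible choice.

```scala
val splitOverrides: Map[String, Any] = Map(
  // Write each kFold train/test split to Delta instead of holding copies in memory.
  "splitCachingStrategy" -> "delta",
  // Hypothetical dbfs root for the split tables; write permission is assumed.
  "tunerDeltaCacheBackingDirectory" -> "dbfs:/automl/splits",
  // Keep the split tables after training, e.g. for auditing.
  "tunerDeltaCacheBackingDirectoryRemovalFlag" -> false
)
```

These entries would be merged into the override map passed to `ConfigurationGenerator.generateConfigFromMap(...)`.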
+ +#### Bug Fixes / Improvements +* scoring metric can now support resolution of differently spelled metrics (upper case, camel case, etc.) + and will resolve to the standard naming conventions within SparkML for Binary, Multiclass, and Regression + evaluators. +* Model training was getting one additional fold than applied at configuration, this has been resolved. +* Type casting enabled from python API for complex nested types to config +* Minor changes to assertions to provide a better experience +* Minor internal function changes + +### Version 0.6.2 +#### Features +* Added support for PySpark! There is now a Python API for Databricks labs automl! +* Added FeatureInteraction module (configurable to interact either all features or only those that pass checks for +perceived gain of adding the interacted feature based on its parents) +```text +Configuration of feature interaction modes is through setting the configurations: + +- featureInteractionFlag -> true (turns the feature on. Default: false) +- featureInteractionRetentionMode -> one of: 'all', 'optimistic', or 'strict' + (all -> interacts all features and will include them in the feature vector + optimistic -> compares each interacted column to it's parents and if it is at least + it is at least 1 - featureInteractionTargetInteractionPercentage as good as EITHER parent + it will be retained as a feature. + strict -> it must be 1 - featureInteractionTargetInteractionPercentage as good as + BOTH parents to be included. + ) +- featureInteractionContinuousDiscretizerBucketCount -> Default 10 (sets the number of quantization buckets to use when + handling continuous features in order to properly calculate InformationGain for Classification Models. Increasing + this value may provide greater accuracy at the cost of runtime performance) +- featureInteractionParallelism -> Default 12 (the interacted features are created and scored + asynchronously. 
This parallelism setting is separated from the other two parallelism + values due to the distinctly different level of CPU consumption that is required to perform this + stage. Overriding the default is recommended and is intended to be set in accordance with + the size of the cluster executing the run.) +- featureInteractionTargetInteractionPercentage -> Default 10.0 (provides the threshold for the + retention modes 'strict' and 'optimistic' for determining whether to keep an interacted column based on the + relative percentage difference between the parents of the interacted column and the interacted column. + It is measured as a "must be at least 1 - x% as good at y" wherein x is the percentage to be included and y is + either Variance or Information Gain. e.g.: with this value set to 10 for a classification problem, + an interacted column would be included in the feature vector if its Information Gain was greater than + or equal to 80% of the Information Gain of its parents) +``` +* Added the ability to calculate Differential Entropy for Regression Tasks (supported in both FeatureInteraction and +in the new Pearson Filtering algorithm) +* Full refactor of Pearson Filtering and Feature Correlation to utilize DataFrames for column-wise comparisons. + Adjusted core looping algorithm to support exactly 1:1 checking of n x m column validation. Speed improvements are + dramatic for data sets utilizing higher numbers of features. + * Improved cardinality detection for data types and inspection of correlation detection based on appropriate + methodology (Information Gain / Entropy for Classifiers, Differential Entropy for Regressors) and handling of + nominal vs continuous numeric types correctly for such validators. + +#### Bug Fixes +* Adjusted Pipeline OneHotEncoder to ensure prevention of metadata loss from StringIndexers for inference .transforms() + through the use of additional StringIndexer stages immediately preceding the OneHotEncoder stages.
+* Stability improvements - creation of ~63 new unit tests to validate core functionality +* Over 200 issues solved due to new unit testing test suite (not going to mention them all here) + +### Version 0.6.1 +* Upgraded MlFLow to 1.3.0 +* Pipeline now registers with Mlflow (including Inference Pipeline Model and feature engineered original df) +* Added new Pipeline APIs to Run inference directly against MLFlow Run Ids +* Training Pipeline now automatically registers pipeline progress and each stages transformations with MLFlow + +### Version 0.6.0 + +#### Features +* New APIs around Spark ML pipeline semantics for fetching full inference as well as feature engineering pipelines. See [this](PIPELINE_API_DOCS.md) for the usage +* MainConfig settings are now pretty printed to stdout and logged as json strings to aid in readability. +* PostModelingOptimization will now search through a logspace based on euclidean distance of vector similarity to +minimize (not remove) the probability of too-similar hyper parameters from being tested in final phase. +[NOTE] - this feature is not supported for MLPC due to the complexity involved in layer estimation +for distance calculations. +* MLflow settings are now defaulted: api Key, uri are default configured to work with the current notebook context +that is calling the class. These can still be overridden. +* MLflow logging is now defaulted to the same parent directory of the notebook executing it through reflection +during runtime. This is to maintain parity with how hosted MLFlow works. This can be overridden if an alternate +Workspace path is desired for logging to. +* Binary Encoder stand-alone package (transformer) has been added and is compatible with the SparkML Pipeline API +This is intended to be used as an alternative to OneHotEncoding for high cardinality +nominal fields. It is an information loss algorithm, though. Integration options with the automl +toolkit will be coming in a future release. 
+* Added a new MBO algorithm on top of the Genetic Algorithm. In each generation, a larger count of potential +candidates are now generated, which are then fed, along with apriori hyperparameter + score information to a +new package (GenerationOptimizer.scala) which will train a Regressor (selectable) and return the best predicted +hyperparameter combinations. +* Added the following additional configuration options: +```text +dataPrepParallelism -> allows for setting a separate parallelism factor for the feature engineering phase +of the application (can be useful for extremely large data sets to have a lower parallelism value than the +tuning parallelism setting) + +tunerGeneticMBOCandidateFactor -> Integer that serves as a multiplicative adjustment to the number of candidates +that are mutated and generated from each genetic mutation epoch (only applies to stages other than first and last) + +tunerGeneticMBORegressorType -> One of "XGBoost", "RandomForest", or "LinearRegression" + +tunerContinuousEvolutionImprovementThreshold -> allows for an additional stopping criteria based on cumulative +gains of improvement. NOTE: must be negative and values less negative than -5 will likely cause early stopping +in continuous mode if parallelism is set too high. Adjust to values closer to 0 than -5 with caution! + +``` + +#### Bug Fixes +* If a setter(s) were used after the mainConfig was set on AutomationRunner, the default values would be applied +to the instance of the Automation or FamilyRunner. This behavior has been fixed and chained setters can be used +even after the mapped configuration has been applied. +* KSample (distributed SMOTE) bug fixes for scalability and reliability. +* Eliminated the scaling bug when using a model that doesn't have ksample as its trainSplitMethodology set has a +scaling task set. 
+* Enabled asynchronous support for variance filtering to reflect the dataPrepParallelism setting (previously hard-coded to 10) +* Changed the default logging location for MLflow to support Azure shards + + +### Version 0.5.2 + +#### Fix XGBoost 0.9.0 issue with classifiers +XGBoost implementation for Spark has a default override for missing value imputation that is not compatible with SparkML. +Modifying the default behavior of XGBoost allows the new version to work correctly. +#### Adding support for naFill manual override: +##### New Modes: +* "auto" - previously the only mode (uses statistical options to infer missing data). Usage of .setNumericFillStat and .setCharacterFillStat will inform the type of statistics that will be used to fill. +* "mapFill" - Custom by-column overrides. Column names specified in either of the two maps (numericNAFillMap and/or categoricalNAFillMap) MUST be present in the DataFrame schema. + All fields not specified in these maps will use the statistics-based approach to fill na's. +* "blanketFillAll" - Fills na's throughout the DataFrame with values specified in characterNABlanketFillValue and numericNABlanketFillValue. +* "blanketFillCharOnly" - will use the characterNABlanketFillValue for only categorical columns, while the stats method will be used for numerics. +* "blanketFillNumOnly" - will use the numericNABlanketFillValue for only numeric columns, while the stats method will be used for character columns. 
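The mode descriptions above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the toolkit's implementation: `resolve_fill_source` and its return labels (`"map"`, `"blanket"`, `"stats"`) are hypothetical names introduced for illustration, and it assumes that "blanketFillNumOnly" applies the numeric blanket value to numeric columns.

```python
# Mode names are taken from the release notes; everything else is illustrative.
FILL_MODES = {
    "auto",
    "mapFill",
    "blanketFillAll",
    "blanketFillCharOnly",
    "blanketFillNumOnly",
}


def resolve_fill_source(mode, column, is_numeric, numeric_map=None, categorical_map=None):
    """Return which source fills NAs for `column`: 'map', 'blanket' or 'stats'."""
    if mode not in FILL_MODES:
        raise ValueError(f"Unsupported naFill mode: {mode}")
    numeric_map = numeric_map or {}
    categorical_map = categorical_map or {}

    if mode == "mapFill":
        # Columns listed in the matching map get the explicit override;
        # all other columns fall back to the statistics-based fill.
        overrides = numeric_map if is_numeric else categorical_map
        return "map" if column in overrides else "stats"
    if mode == "blanketFillAll":
        return "blanket"
    if mode == "blanketFillCharOnly":
        return "stats" if is_numeric else "blanket"
    if mode == "blanketFillNumOnly":
        return "blanket" if is_numeric else "stats"
    # "auto": statistics-based fill for every column.
    return "stats"
```

For instance, under "mapFill" a numeric column that is absent from the numeric map would still be filled from statistics, which matches the fallback described for that mode.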
+###### Example: +```scala +val overrides = Map( + "labelCol" -> "myLabel", + "fillConfigCategoricalNAFillMap" -> Map("native_country" -> "us"), + "fillConfigNumericNAFillMap" -> Map("education_years" -> 12.0, "capital_loss" -> 0.0), + "fillConfigCharacterNABlanketFillValue" -> "missing", + "fillConfigNumericNABlanketFillValue" -> 0.0, + "fillConfigNAFillMode" -> "mapFill" +) +``` +#### Safety Checks for High Cardinality Non-numeric columns +To protect against extreme feature vector size 'explosion' with improperly defined high cardinality feature fields (e.g. a userID or telephone number in the data set), cardinality checks are now enabled during feature vector creation. The behavior of these checks is controlled with the following setters: +- setCardinalityCheck (default true) - disables or enables this feature +- setCardinalityCheckMode - either "silent" or "warn" + * Silent mode: Removes high cardinality non-numeric fields from the feature vector. + * Warn mode: If a non-numeric field is found that is above the specified threshold, an exception is thrown. +- setCardinalityLimit - Integer limit above which the field will be removed (silent mode) or an exception will be thrown (warn mode) +- setCardinalityPrecision - Optional override to the precision for checking the approx distinct cardinality nature of a non-numeric field. +- setCardinalityType - Whether to use distinct or approx_distinct (approx_distinct is highly recommended for large data sets) +###### Example: +```scala +val overrides = Map( + "labelCol" -> "myLabel", + "fillConfigCardinalitySwitch" -> true, + "fillConfigCardinalityType" -> "exact", + "fillConfigCardinalityPrecision" -> 0.9, + "fillConfigCardinalityCheckMode" -> "warn", + "fillConfigCardinalityLimit" -> 100 +) +``` +#### KSampling (Distributed SMOTE) +The AutoML Toolkit now supports 'intelligent minority class oversampling'. +- This is a new train / test split method for classification problems with heavy class imbalance. 
+- To use, simply specify in the configuration map: +###### Example: + +```scala +val overrides = Map( + "labelCol" -> "myLabel", + "tunerTrainSplitMethod" -> "kSample", + "tunerKSampleSyntheticCol" -> "synth_KSample", + "tunerKSampleKGroups" -> 25, + "tunerKSampleKMeansMaxIter" -> 200, + "tunerKSampleKMeansTolerance" -> 1E-6, + "tunerKSampleKMeansDistanceMeasurement" -> "euclidean", + "tunerKSampleKMeansSeed" -> 42L, + "tunerKSampleKMeansPredictionCol" -> "kGroup_sample", + "tunerKSampleLSHHashTables" -> 10, + "tunerKSampleLSHSeed" -> 42L, + "tunerKSampleLSHOutputCol" -> "hashes_ksample", + "tunerKSampleQuorumCount" -> 7, + "tunerKSampleMinimumVectorCountToMutate" -> 1, + "tunerKSampleVectorMutationMethod" -> "random", + "tunerKSampleMutationMode" -> "weighted", + "tunerKSampleMutationValue" -> 0.5, + "tunerKSampleLabelBalanceMode" -> "target", + "tunerKSampleCardinalityThreshold" -> 20, + "tunerKSampleNumericRatio" -> 0.2, + "tunerKSampleNumericTarget" -> 10000 +) + +val myConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", overrides) + +val runner = FamilyRunner(data, Array(myConfig)).execute + +``` +- Before turning on this mode, ensure that: + * The modeling type is classification + * The modes "target" and "percentage" are subject to RunTime checks. If the numeric ratio or target (for the + respective mode selected) are, for the smallest minority class count, bigger than these target values, a RunTimeException + will be thrown. + +Further details of the implementation, performance, and usage of this new feature will be extensively documented in an upcoming blog post by Databricks. 
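For Python users, the kSample overrides above could be assembled as a plain dictionary. The sketch below mirrors what the `py_auto_ml` `FamilyRunner.run_family_runner` wrapper in this change set does before calling into the JVM: it takes a dictionary keyed by model family and serializes it with `json.dumps`. The helper name `build_family_configs` is hypothetical, only a subset of the kSample keys is shown, and no Spark session is involved here.

```python
import json


def build_family_configs(model_family, overrides):
    """Wrap a dict of overrides under its model family key (hypothetical helper)."""
    return {model_family: dict(overrides)}


# Subset of the kSample settings from the Scala example, as a Python dict.
ksample_overrides = {
    "labelCol": "myLabel",
    "tunerTrainSplitMethod": "kSample",
    "tunerKSampleKGroups": 25,
    "tunerKSampleLabelBalanceMode": "target",
    "tunerKSampleNumericTarget": 10000,
}

family_configs = build_family_configs("XGBoost", ksample_overrides)

# The wrapper stringifies the configs to JSON before handing them to the JVM.
stringified_family_configs = json.dumps(family_configs)
```

With a live Spark session and the toolkit jar attached, `family_configs` would presumably be passed as the third argument of `run_family_runner(df, "classifier", family_configs)`; that call is not exercised in this sketch.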
\ No newline at end of file diff --git a/build.sbt b/build.sbt new file mode 100644 index 00000000..6b1ff040 --- /dev/null +++ b/build.sbt @@ -0,0 +1,50 @@ +name := "AutomatedML" + +organization := "com.databricks" + +version := "0.7.1" + +scalaVersion := "2.11.12" +scalacOptions ++= Seq("-Xmax-classfile-name", "78") + +libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" +libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.0" +libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" +libraryDependencies += "org.mlflow" % "mlflow-client" % "1.3.0" +libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.5.3" +libraryDependencies += "ml.dmlc" % "xgboost4j" % "0.90" +libraryDependencies += "ml.dmlc" % "xgboost4j-spark" % "0.90" +libraryDependencies += "junit" % "junit" % "4.8.1" % "test" +libraryDependencies += "org.scalatest" % "scalatest_2.11" % "3.0.6" +libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.4" % Provided +libraryDependencies += "ml.combust.mleap" %% "mleap-runtime" % "0.14.0" +libraryDependencies += "ml.combust.mleap" %% "mleap-spark" % "0.14.0" +libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.18.1" +libraryDependencies += "org.vegas-viz" %% "vegas" % "0.3.11" + +lazy val commonSettings = Seq( + version := "0.7.1", + organization := "com.databricks", + scalaVersion := "2.11.12" +) + +assemblyShadeRules in assembly := Seq( + ShadeRule.rename("org.json4s.**" -> "shadeio.@1").inAll +) + +assemblyMergeStrategy in assembly := { + case PathList("META-INF", xs @ _*) => MergeStrategy.discard + case x => MergeStrategy.first +} + +assemblyExcludedJars in assembly := { + val cp = (fullClasspath in assembly).value + cp filter { f => + f.data.getName.contains("spark-core") || + f.data.getName.contains("spark-mllib") || + f.data.getName.contains("spark-sql") || + f.data.getName.contains("com.databricks.backend") || + f.data.getName.contains("com.microsoft.ml.spark") || + 
f.data.getName.contains("com.databricks.dbutils-api_2.11") + } +} diff --git a/demos/AutoMLPresentationDemo.dbc b/demos/AutoMLPresentationDemo.dbc new file mode 100644 index 00000000..ad89f0a5 Binary files /dev/null and b/demos/AutoMLPresentationDemo.dbc differ diff --git a/demos/AutoMLPresentationDemo.html b/demos/AutoMLPresentationDemo.html new file mode 100644 index 00000000..c670ae16 --- /dev/null +++ b/demos/AutoMLPresentationDemo.html @@ -0,0 +1,42 @@ + + + + +AutoMLPresentationDemo - Databricks + + + + + + + + + + + + + + + + + + + diff --git a/demos/LoanRiskWithPipelineAPI.dbc b/demos/LoanRiskWithPipelineAPI.dbc new file mode 100644 index 00000000..6079e470 Binary files /dev/null and b/demos/LoanRiskWithPipelineAPI.dbc differ diff --git a/demos/LoanRiskWithPipelineAPI.html b/demos/LoanRiskWithPipelineAPI.html new file mode 100644 index 00000000..b90c89ff --- /dev/null +++ b/demos/LoanRiskWithPipelineAPI.html @@ -0,0 +1,42 @@ + + + + +Export Loan Risk With Pipeline API - Databricks + + + + + + + + + + + + + + + + + + + diff --git a/images/MLFLow_Config_Tracking.png b/images/MLFLow_Config_Tracking.png new file mode 100644 index 00000000..1b8e8a2e Binary files /dev/null and b/images/MLFLow_Config_Tracking.png differ diff --git a/images/mlflow-1.png b/images/mlflow-1.png new file mode 100644 index 00000000..69ac732d Binary files /dev/null and b/images/mlflow-1.png differ diff --git a/images/mlflow-2.png b/images/mlflow-2.png new file mode 100644 index 00000000..040a1bc9 Binary files /dev/null and b/images/mlflow-2.png differ diff --git a/pom.xml b/pom.xml new file mode 100644 index 00000000..7e946ff8 --- /dev/null +++ b/pom.xml @@ -0,0 +1,246 @@ + + 4.0.0 + com.databricks + automatedml + jar + 2_11-0.7.1 + + 2.11.12 + 2.4.0 + 1.3.0 + 3.5.3 + provided + 0.90 + 0.18.1 + + + + org.scala-lang + scala-library + ${scala.version} + ${dependency.scope} + + + org.apache.spark + spark-core_2.11 + ${spark.version} + ${dependency.scope} + + + org.apache.spark + 
spark-sql_2.11 + ${spark.version} + ${dependency.scope} + + + org.apache.spark + spark-mllib_2.11 + ${spark.version} + ${dependency.scope} + + + org.mlflow + mlflow-client + ${mlflow.version} + ${dependency.scope} + + + org.json4s + json4s-jackson_2.11 + ${json4s.version} + ${dependency.scope} + + + ml.dmlc + xgboost4j + ${dmlc.version} + ${dependency.scope} + + + ml.dmlc + xgboost4j-spark + ${dmlc.version} + ${dependency.scope} + + + junit + junit + 4.8.1 + test + + + com.microsoft.ml.spark + mmlspark_2.11 + ${mmlspark.version} + ${dependency.scope} + + + org.apache.spark + * + + + + + org.vegas-viz + vegas_2.11 + 0.3.11 + ${dependency.scope} + + + org.scalatest + scalatest_2.11 + 3.0.6 + test + + + com.databricks + dbutils-api_2.11 + 0.0.4 + ${dependency.scope} + + + ml.combust.mleap + mleap-runtime_2.11 + 0.14.0 + + + ml.combust.mleap + mleap-spark_2.11 + 0.14.0 + + + org.vegas-viz + vegas_2.11 + 0.3.11 + + + + + + org.apache.maven.plugins + maven-compiler-plugin + + 1.8 + 1.8 + + + + net.alchim31.maven + scala-maven-plugin + 3.2.0 + + + scala-compile-first + process-resources + + compile + + + + scala-test-compile-first + process-test-resources + + testCompile + + + + attach-scaladocs + verify + + doc-jar + + + + + + org.apache.maven.plugins + maven-shade-plugin + 1.7.1 + + + + *:* + + META-INF/*.SF + META-INF/*.DSA + META-INF/*.RSA + + + + + + + package + + shade + + + + + ${groupId}:${artifactId} + + + com/databricks/backend/common/rpc/** + + + + + + + + + reference.conf + + + + + + + + org.apache.maven.plugins + maven-surefire-plugin + 3.0.0-M3 + + 8 + true + -Xmx1024m -XX:MaxPermSize=256m + + + + org.apache.maven.plugins + maven-dependency-plugin + 3.1.1 + + + copy + package + + copy + + + + + + + com.databricks + automatedml + ${version} + jar + true + ${project.basedir}/bin + + + + **\/CommandContext.class + + ${project.basedir}/bin + true + true + + + + + diff --git a/project/assembly.sbt b/project/assembly.sbt new file mode 100644 index 
00000000..813ce170 --- /dev/null +++ b/project/assembly.sbt @@ -0,0 +1 @@ +addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9") \ No newline at end of file diff --git a/project/build.properties b/project/build.properties new file mode 100644 index 00000000..c0bab049 --- /dev/null +++ b/project/build.properties @@ -0,0 +1 @@ +sbt.version=1.2.8 diff --git a/python/.DS_Store b/python/.DS_Store new file mode 100644 index 00000000..a43d87eb Binary files /dev/null and b/python/.DS_Store differ diff --git a/python/__init__.py b/python/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/py_auto_ml/__init__.py b/python/build/lib/py_auto_ml/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/py_auto_ml/automation_runner.py b/python/build/lib/py_auto_ml/automation_runner.py new file mode 100644 index 00000000..a6cef528 --- /dev/null +++ b/python/build/lib/py_auto_ml/automation_runner.py @@ -0,0 +1,122 @@ +import json +from pyspark.sql.functions import DataFrame +from py_auto_ml.local_spark_singleton import SparkSingleton +from py_auto_ml.utilities.helpers import Helpers + + +class AutomationRunner: + + def __init__(self): + # Setup Spark singleton Instance + self.spark = SparkSingleton.get_instance() + + def run_automation_runner(self, + model_family: str, + prediction_type: str, + dataframe: DataFrame, + runner_type: str, + overrides=None): + """ + + :param model_family: str + One of the supported model types + + :param prediction_type: str + Either "classifier" or "regressor" + + :param dataframe: DataFrame + + :param runner_type: str + One of the following calls to the automation runner: "run", "confusion", "prediction" + + :param overrides: dict + Dictionary of configuration overrides + + :return: + """ + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + runner_type_lower = runner_type.lower() + 
Helpers.check_runner_types(runner_type_lower) + + + # Check if you need default instance config or generating from map of overrides + if overrides is not None: + default_flag = "false" + # Stringify overrides to JSON + stringified_overrides = json.dumps(overrides) + else: + default_flag = "true" + stringified_overrides = "" + + self.spark._jvm.com.databricks.labs.automl.pyspark.AutomationRunnerUtil.runAutomationRunner(model_family, + prediction_type, + stringified_overrides, + dataframe._jdf, + runner_type_lower, + default_flag) + self._automation_runner = True + + return self._get_returns(runner_type_lower) + + def _get_returns(self, + runner_type: str): + """ + + :param runner_type: + One of the following calls to the automation runner: "run", "confusion", "prediction" + :return: Dataframe depending on `runner_type` + `run` + generation_report dataframe + model_report dataframe + `confusion` + confusion_data: dataframe + prediction_data: dataframe + generation_report: dataframe + model_report: dataframe + `prediction` + data_with_predictions: dataframe + generation_report: dataframe + model_report: dataframe + """ + # Cache the returns + if self._automation_runner == True: + if runner_type == "run": + generation_report = self.spark.sql("select * from generationReport") + model_report = self.spark.sql("select * from modelReport") + return_dict = { + 'generation_report': generation_report, + "model_report": model_report + } + return return_dict + elif runner_type == "confusion": + confusion_data = self.spark.sql("select * from confusionData") + prediction_data = self.spark.sql("select * from predictionData") + generation_report = self.spark.sql("select * from generationReport") + model_report = self.spark.sql("select * from modelReport") + return_dict = { + 'confusion_data': confusion_data, + 'prediction_data': prediction_data, + 'generation_report': generation_report, + 'model_report': model_report + } + return return_dict + elif runner_type == "prediction": + 
data_with_predictions = self.spark.sql("select * from dataWithPredictions") + generation_report = self.spark.sql("select * from generationReport") + model_report = self.spark.sql("select * from modelReportData") + return_dict = { + 'data_with_predictions': data_with_predictions, + 'generation_report': generation_report, + 'model_report': model_report + } + return return_dict + else: + print("No returns were added - check your runner_type") + + else: + raise Exception ("In order to generate the proper returns for the automation runner, please first run the " + "automation runner with the `run_automation_runner`") + diff --git a/python/build/lib/py_auto_ml/executor/__init__.py b/python/build/lib/py_auto_ml/executor/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/py_auto_ml/executor/family_runner.py b/python/build/lib/py_auto_ml/executor/family_runner.py new file mode 100644 index 00000000..f76528f1 --- /dev/null +++ b/python/build/lib/py_auto_ml/executor/family_runner.py @@ -0,0 +1,166 @@ +import json +from pyspark.sql.functions import DataFrame +from py_auto_ml.local_spark_singleton import SparkSingleton +from py_auto_ml.utilities.helpers import Helpers + + +class FamilyRunner: + def __init__(self): + self.spark = SparkSingleton.get_instance() + + def run_family_runner(self, + df: DataFrame, + prediction_type: str, + family_configs: dict): + """ + + :param family_configs: dict + Supported model_family as a key, vaue is a dictionary of configuration overrides + :param prediction_type: str + "regressor" or "classifier" + :param df: dataframe + :param path: string + Path to writing to writing out the pipeline models + :return: + """ + # Checking for supported model families and types + Helpers.check_prediction_type(prediction_type) + + stringified_family_configs = json.dumps(family_configs) + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runFamilyRunner(df._jdf, + stringified_family_configs, + 
prediction_type) + + self._family_runner = True + + return self._get_returns() + + # Fetch the temp tables and bring them into python + + def _get_returns(self): + """ + + :return: dict of dataframes + 'model_report':model_report: dataframe + 'generation_report':generation_report: dataframe + 'best_mlflow_run_id':best_mlflow_run_id: dataframe + """ + if self._family_runner != True: + raise Exception("You must first run the family runner to generate the proper return dataframes") + else: + model_report = self.spark.sql("SELECT * FROM modelReportDataFrame") + generation_report = self.spark.sql("SELECT * FROM generationReportDataFrame") + best_mlflow_run_id = self.spark.sql("SELECT * FROM bestMlFlowRunId") + return_dict = { + 'model_report': model_report, + 'generation_report': generation_report, + 'best_mlflow_run_id': best_mlflow_run_id + } + return return_dict + + # Get the best mlflow run Id from DF + # Get pipeline path from tag in MLflow using tracking client + # Returns pipeline model + + # TO DO full inference on pipeline model + def mlflow_pipeline_inference(self, + mlflow_run_id: str, + model_family: str, + prediction_type: str, + dataframe: DataFrame, + configs = [], + label_col='label'): + """ + + :param mlflow_run_id: string + The Mflow Run Id (as generated by AutoML) which has the pipeline model of interest + :param model_family: string + Support model family + :parameter prediction_type: string + Supported prediction type + :param dataframe + The dataframe being used for inference + :param configs + Dictionary of configuration overrides, default is empty dictionary + :param label_col: string + Label column of dataset that will be used for inference. 
Default is "label" + :return inferred_df: dataframe + The dataframe with prediction + """ + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + stringified_configs = json.dumps(configs) + + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runMlFlowInference(mlflow_run_id, + model_family, + prediction_type, + label_col, + stringified_configs, + dataframe._jdf) + # Pull out the inference + inference_df = self.spark.sql("SELECT * FROM inferenceDF") + + return inference_df + + def path_pipeline_inference(self, + path: str, + dataframe: DataFrame): + """ + + :param path: + Path to the pipelined model created by AutoML + :param dataframe: + Spark dataframe that will be used for inference + :return: inference_df: Dataframe + Dataframe with predictions + """ + + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runPathInference(path, + dataframe._jdf) + inferred_df = self.spark.sql('SELECT * FROM pathInferenceDF') + + return inferred_df + + def feature_eng_pipeline(self, + df: DataFrame, + model_family: str, + prediction_type: str, + configs= {} + ): + """ + + :param df: Dataframe + Dataframe feature engineering pipeline will be applied to + :param model_family: string + Supported model family + :param prediction_type: string + Supported prediction type + :param configs: dict + Dictionary of overrides + :return: Dataframe + Feature engineered dataframe + """ + + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + + stringified_family_configs = json.dumps(configs) + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runFeatureEngPipeline(df._jdf, + model_family, + prediction_type, + stringified_family_configs) + feature_eng_df = self.spark.sql("SELECT * FROM featEngDf") + + return feature_eng_df + + + + + + + + diff --git 
a/python/build/lib/py_auto_ml/exploration/__init__.py b/python/build/lib/py_auto_ml/exploration/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/py_auto_ml/exploration/feature_importance.py b/python/build/lib/py_auto_ml/exploration/feature_importance.py new file mode 100644 index 00000000..bf1e2ed1 --- /dev/null +++ b/python/build/lib/py_auto_ml/exploration/feature_importance.py @@ -0,0 +1,84 @@ +import json +from pyspark.sql.functions import DataFrame +from py_auto_ml.local_spark_singleton import SparkSingleton +from py_auto_ml.utilities.helpers import Helpers + + +class FeatureImportance: + def __init__(self): + self.spark = SparkSingleton.get_instance() + + + def run_feature_importance(self, + model_family: str, + prediction_type: str, + dataframe: DataFrame, + cutoff_value: float, + cutoff_type: str, + overrides=None): + """ + + :param model_family: str + One of the supported model types + + :param prediction_type: str + Either "classifier" or "regressor" + + :param df: DataFrame + + :param cutoff_value: float + Threshold value for feature importance algorithm + + :param cutoff_type: str + Cutoff for the number features + + :param overrides: dict + Dictionary of configuration overrides + + :return: + """ + + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + + ## Set flag for default configs + if overrides is not None: + default_flag = "false" + # Convert the configs to JSON + stringified_overrides = json.dumps(overrides) + else: + stringified_overrides = "" + default_flag = "true" + + # Pass to JVM to run FI + self.spark._jvm.com.databricks.labs.automl.pyspark.FeatureImportanceUtil.runFeatureImportance(model_family, + prediction_type, + stringified_overrides, + dataframe._jdf, + cutoff_type, + cutoff_value, + default_flag) + self.feature_importance = True + return self._get_returns() + + def _get_returns(self): + """ + + 
:return: dict of dataframes: + 'importances': importances df + 'top_fields': top fields df + """ + if self.feature_importance != True: + raise Exception ("Please first generate feature importances by running `run_feature_importance`") + else: + importances = self.spark.sql("select * from importances") + top_fields = self.spark.sql("select feature from importances") + return_dict = { + 'importances': importances, + 'top_fields': top_fields + } + return return_dict + + diff --git a/python/build/lib/py_auto_ml/local_spark_singleton.py b/python/build/lib/py_auto_ml/local_spark_singleton.py new file mode 100644 index 00000000..1cb38c91 --- /dev/null +++ b/python/build/lib/py_auto_ml/local_spark_singleton.py @@ -0,0 +1,8 @@ +from pyspark.sql import SparkSession + + +class SparkSingleton: + + @classmethod + def get_instance(cls): + return SparkSession.builder.getOrCreate() diff --git a/python/build/lib/py_auto_ml/spark_singleton.py b/python/build/lib/py_auto_ml/spark_singleton.py new file mode 100644 index 00000000..3b2a9664 --- /dev/null +++ b/python/build/lib/py_auto_ml/spark_singleton.py @@ -0,0 +1,8 @@ +from pyspark.sql import SparkSession + + +class SparkSingleton: + + @classmethod + def get_instance(cls): + return SparkSession.builder.getOrCreate() \ No newline at end of file diff --git a/python/build/lib/py_auto_ml/test/__init__.py b/python/build/lib/py_auto_ml/test/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/py_auto_ml/test/local_spark_singleton.py b/python/build/lib/py_auto_ml/test/local_spark_singleton.py new file mode 100644 index 00000000..f4272588 --- /dev/null +++ b/python/build/lib/py_auto_ml/test/local_spark_singleton.py @@ -0,0 +1,23 @@ +from pyspark.sql import SparkSession +import os + + +class SparkSingleton: + """A singleton class on Datalib which returns one Spark instance""" + __instance = None + + @classmethod + def get_instance(cls): + """Create a Spark instance for Datalib. 
+ :return: A Spark instance + """ + return (SparkSession.builder + .getOrCreate()) + + @classmethod + def get_local_instance(cls): + return (SparkSession.builder + .master("local[*]") + .appName("automl") + .getOrCreate()) + diff --git a/python/build/lib/py_auto_ml/test/test_automation_runner.py b/python/build/lib/py_auto_ml/test/test_automation_runner.py new file mode 100644 index 00000000..edd7868d --- /dev/null +++ b/python/build/lib/py_auto_ml/test/test_automation_runner.py @@ -0,0 +1,53 @@ +import unittest +from py_auto_ml.test.local_spark_singleton import SparkSingleton +from py_auto_ml.automation_runner import AutomationRunner + + +class TestFamilyRunner(unittest.TestCase): + + def setup(self): + self.spark = SparkSingleton.get_instance() + + def test_get_returns(self): + self.setup() + automation_runner = AutomationRunner() + + model_report_data_frame = self.spark.createDataFrame([(1,2,3)],["col1", "col2", "col3"]) + model_report_data_frame.createOrReplaceTempView("modelReport") + model_report_data_frame.createOrReplaceTempView("modelReportData") + + generation_report_data_frame = self.spark.createDataFrame([(4, 5, 6, 7)], ["col1", "col2", "col3", "col4"]) + generation_report_data_frame.createOrReplaceTempView("generationReport") + + confusion_data = self.spark.createDataFrame([(7, 8)], ["col1", "col2"]) + confusion_data.createOrReplaceTempView("confusionData") + + prediction_data = self.spark.createDataFrame([(9,10,11,12,13)], ["col1", "col2", "col3", "col4", "col5"]) + prediction_data.createOrReplaceTempView("predictionData") + + data_with_preds = self.spark.createDataFrame([(14, 15)], ["col1", "col2"]) + data_with_preds.createOrReplaceTempView("dataWithPredictions") + + # Test with RUN + automation_runner._automation_runner = True + automation_runner.get_returns("run") + assert len(automation_runner.generation_report.columns) == 4 + assert len(automation_runner.model_report.columns) == 3 + + # Test with CONFUSION + 
automation_runner.get_returns("confusion") + assert len(automation_runner.confusion_data.columns) == 2 + assert len(automation_runner.prediction_data.columns) == 5 + assert len(automation_runner.generation_report.columns) == 4 + assert len(automation_runner.model_report.columns) == 3 + + # Test with PREDICTION + automation_runner.get_returns("prediction") + assert len(automation_runner.generation_report.columns) == 4 + assert len(automation_runner.model_report.columns) == 3 + assert len(automation_runner.data_with_predictions.columns) == 2 + + self.tear_down() + + def tear_down(self): + self.spark.stop() diff --git a/python/build/lib/py_auto_ml/test/test_class.py b/python/build/lib/py_auto_ml/test/test_class.py new file mode 100644 index 00000000..cfc73e83 --- /dev/null +++ b/python/build/lib/py_auto_ml/test/test_class.py @@ -0,0 +1,3 @@ +class TestClass: + def __init__(self): + print("this test is successfull :) ") \ No newline at end of file diff --git a/python/build/lib/py_auto_ml/test/test_family_runner.py b/python/build/lib/py_auto_ml/test/test_family_runner.py new file mode 100644 index 00000000..b3079af2 --- /dev/null +++ b/python/build/lib/py_auto_ml/test/test_family_runner.py @@ -0,0 +1,34 @@ +import unittest +from py_auto_ml.test.local_spark_singleton import SparkSingleton +from py_auto_ml.executor.family_runner import FamilyRunner + + +class TestFamilyRunner(unittest.TestCase): + + def setup(self): + self.spark = SparkSingleton.get_instance() + + def test_get_returns(self): + self.setup() + family_runner = FamilyRunner() + + model_report_data_frame = self.spark.createDataFrame([(1,2,3)],["col1", "col2", "col3"]) + model_report_data_frame.createOrReplaceTempView("modelReportDataFrame") + + generation_report_data_frame = self.spark.createDataFrame([(4, 5, 6, 7)], ["col1", "col2", "col3", "col4"]) + generation_report_data_frame.createOrReplaceTempView("generationReportDataFrame") + + best_mlflow_run_id = self.spark.createDataFrame([(7, 8)], ["col1", 
"col2"]) + best_mlflow_run_id.createOrReplaceTempView("bestMlFlowRunId") + + family_runner._family_runner = True + family_runner.get_returns() + + assert len(family_runner.model_report.columns) == 3 + assert len(family_runner.best_mlflow_run_id.columns) == 2 + assert len(family_runner.generation_report.columns) == 4 + + self.tear_down() + + def tear_down(self): + self.spark.stop() diff --git a/python/build/lib/py_auto_ml/test/test_feature_importance.py b/python/build/lib/py_auto_ml/test/test_feature_importance.py new file mode 100644 index 00000000..0d08bfe0 --- /dev/null +++ b/python/build/lib/py_auto_ml/test/test_feature_importance.py @@ -0,0 +1,75 @@ +import unittest +from python.py_auto_ml.test.local_spark_singleton import SparkSingleton +from python.py_auto_ml.exploration.feature_importance import FeatureImportance +from pyspark.sql import SparkSession + + +class TestFeatureImportance(unittest.TestCase): + + def setup(self): + self.spark = SparkSingleton.get_instance() + + def test_bring_in_returns(self): + self.setup() + + importances_data_frame = self.spark.createDataFrame([(1, 2, 3)], ["feature", "col2", "col3"]) + importances_data_frame.createOrReplaceTempView("importances") + + feat_imp = FeatureImportance() + feat_imp.feature_importance = True + feat_imp.bring_in_returns() + + assert len(feat_imp.importances.columns) == 3 + assert len(feat_imp.top_fields.columns) == 1 + + self.tear_down() + + def tear_down(self): + self.spark.stop() + + @staticmethod + def convert_csv_to_df(csv_path: str): + spark_session = SparkSession.builder.master('local[*]').appName("providentiaml-unit-tests").getOrCreate() + spark_session.sparkContext.setLogLevel("ERROR") + return spark_session.read.format('csv').option("header", "true").option("inferSchema", "true").load(csv_path) + + def test_loan_risk_xgboost(self): + self.setup() + loan_risk_df = self.convert_csv_to_df("Desktop/providenc/load_risk.csv") + generic_overrides = { + "labelCol": "label", + "scoringMetric": 
"areaUnderROC", + "dataPrepCachingFlag": False, + "autoStoppingFlag": True, + "tunerAutoStoppingScore": 0.91, + "tunerParallelism": 1*2, + "tunerKFold": 2, + "tunerSeed": 42, + "tunerInitialGenerationArraySeed": 42, + "tunerTrainPortion": 0.7, + "tunerTrainSplitMethod": "stratified", + "tunerInitialGenerationMode": "permutations", + "tunerInitialGenerationPermutationCount": 8, + "tunerInitialGenerationIndexMixingMode": "linear", + "tunerFirstGenerationGenePool": 16, + "tunerNumberOfGenerations": 3, + "tunerNumberOfParentsToRetain": 2, + "tunerNumberOfMutationsPerGeneration": 4, + "tunerGeneticMixing": 0.8, + "tunerGenerationalMutationStrategy": "fixed", + "tunerEvolutionStrategy": "batch", + "tunerHyperSpaceInferenceFlag": True, + "tunerHyperSpaceInferenceCount": 400000, + "tunerHyperSpaceModelType": "XGBoost", + "tunerHyperSpaceModelCount": 8, + "mlFlowLoggingFlag": False, + "mlFlowLogArtifactsFlag": False + } + # run_feature_importance is an instance method that returns a dict of dataframes + feat_imp = FeatureImportance().run_feature_importance("XGBoost", "classifier", loan_risk_df, 20.0, "count", generic_overrides) + + assert len(feat_imp["top_fields"].columns) != 0 + assert len(feat_imp["importances"].columns) != 0 + + + + self.tear_down() \ No newline at end of file diff --git a/python/build/lib/py_auto_ml/utilities/__init__.py b/python/build/lib/py_auto_ml/utilities/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/py_auto_ml/utilities/helpers.py b/python/build/lib/py_auto_ml/utilities/helpers.py new file mode 100644 index 00000000..0232fc09 --- /dev/null +++ b/python/build/lib/py_auto_ml/utilities/helpers.py @@ -0,0 +1,28 @@ +class Helpers: + + @staticmethod + def check_model_family(model_family: str): + supported_models = ["RandomForest","XGBoost", "LogisticRegression","Trees","GBT","LinearRegression", + "MLPC", "SVM"] + if model_family not in supported_models: + raise Exception("Your model family must be within any of the following supported model types:", + supported_models) + + @staticmethod + def 
check_prediction_type(prediction_type:str): + supported_prediction_types = ['regressor', 'classifier'] + if prediction_type not in supported_prediction_types: + raise Exception("Prediction type is not supported - it must be one of the following", + supported_prediction_types) + + @staticmethod + def check_runner_types(runner_type: str): + """ + + :param runner_type: str + Checking that the runner_type is a supported runner_type + :return: + """ + acceptable_strings = ["run", "confusion", "prediction"] + if runner_type not in acceptable_strings: + raise Exception("runner_type must be one of the following run, confusion, or prediction") \ No newline at end of file diff --git a/python/build/lib/utilities/__init__.py b/python/build/lib/utilities/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/build/lib/utilities/helpers.py b/python/build/lib/utilities/helpers.py new file mode 100644 index 00000000..0232fc09 --- /dev/null +++ b/python/build/lib/utilities/helpers.py @@ -0,0 +1,28 @@ +class Helpers: + + @staticmethod + def check_model_family(model_family: str): + supported_models = ["RandomForest","XGBoost", "LogisticRegression","Trees","GBT","LinearRegression", + "MLPC", "SVM"] + if model_family not in supported_models: + raise Exception("Your model family must be within any of the following supported model types:", + supported_models) + + @staticmethod + def check_prediction_type(prediction_type:str): + supported_prediction_types = ['regressor', 'classifier'] + if prediction_type not in supported_prediction_types: + raise Exception("Prediction type is not supported - it must be one of the following", + supported_prediction_types) + + @staticmethod + def check_runner_types(runner_type: str): + """ + + :param runner_type: str + Checking that the runner_type is a supported runner_type + :return: + """ + acceptable_strings = ["run", "confusion", "prediction"] + if runner_type not in acceptable_strings: + raise Exception("runner_type must be one 
of the following run, confusion, or prediction") \ No newline at end of file diff --git a/python/demo/0.2.0 - Simplify Loan Risk.dbc b/python/demo/0.2.0 - Simplify Loan Risk.dbc new file mode 100644 index 00000000..cf859cd8 Binary files /dev/null and b/python/demo/0.2.0 - Simplify Loan Risk.dbc differ diff --git a/python/demo/0.2.0 - Simplify Loan Risk.html b/python/demo/0.2.0 - Simplify Loan Risk.html new file mode 100644 index 00000000..429d7d9d --- /dev/null +++ b/python/demo/0.2.0 - Simplify Loan Risk.html @@ -0,0 +1,42 @@ + + + + +0.2.0 - Simplify Loan Risk - Databricks + + + + + + + + + + + + + + + + + + + diff --git a/python/demo/0.2.0 Simplify Loan Risk - Pipeline APIs.dbc b/python/demo/0.2.0 Simplify Loan Risk - Pipeline APIs.dbc new file mode 100644 index 00000000..7e67c595 Binary files /dev/null and b/python/demo/0.2.0 Simplify Loan Risk - Pipeline APIs.dbc differ diff --git a/python/demo/0.2.0 Simplify Loan Risk - Pipeline APIs.html b/python/demo/0.2.0 Simplify Loan Risk - Pipeline APIs.html new file mode 100644 index 00000000..a3abbd07 --- /dev/null +++ b/python/demo/0.2.0 Simplify Loan Risk - Pipeline APIs.html @@ -0,0 +1,42 @@ + + + + +0.2.0 Simplify Loan Risk - Pipeline APIs - Databricks + + + + + + + + + + + + + + + + + + + diff --git a/python/demo/Simplify Loan Risk Python NB.dbc b/python/demo/Simplify Loan Risk Python NB.dbc new file mode 100644 index 00000000..945e67ec Binary files /dev/null and b/python/demo/Simplify Loan Risk Python NB.dbc differ diff --git a/python/demo/Simplify Loan Risk Python NB.html b/python/demo/Simplify Loan Risk Python NB.html new file mode 100644 index 00000000..92bf121f --- /dev/null +++ b/python/demo/Simplify Loan Risk Python NB.html @@ -0,0 +1,42 @@ + + + + +Simplify Loan Risk Python NB - Databricks + + + + + + + + + + + + + + + + + + + diff --git a/python/dist/pyAutoML-0.1.0-py3-none-any.whl b/python/dist/pyAutoML-0.1.0-py3-none-any.whl new file mode 100644 index 00000000..afddc5da Binary files /dev/null and 
b/python/dist/pyAutoML-0.1.0-py3-none-any.whl differ diff --git a/python/dist/pyAutoML-0.2.0-py3-none-any.whl b/python/dist/pyAutoML-0.2.0-py3-none-any.whl new file mode 100644 index 00000000..815cb5ca Binary files /dev/null and b/python/dist/pyAutoML-0.2.0-py3-none-any.whl differ diff --git a/python/docs/APIDOCs.md b/python/docs/APIDOCs.md new file mode 100644 index 00000000..dcacc9ee --- /dev/null +++ b/python/docs/APIDOCs.md @@ -0,0 +1,372 @@ +# AutoML-Toolkit + +The AutoML-Toolkit is an automated ML solution for Apache Spark. It provides common data cleansing and feature +engineering support, automated hyper-parameter tuning through distributed genetic algorithms, and model tracking +integration with MLflow. It currently supports Supervised Learning algorithms that are provided as part of Spark MLlib. + +The Python APIs are a means of interfacing with the Scala library via PySpark. + +## Setup +Currently, this library exists as a `.whl` file in the `/dist` directory. You can also run the following in a terminal +to build the wheel locally: +``` +python setup.py bdist_wheel +``` + +## General Overview +The Python APIs currently support the following three classes: +1. `FeatureImportance` +2. `AutomationRunner` +3. `FamilyRunner` + + +### Feature Importance Class +For more information about the underlying algorithms please see [APIDOCS](https://github.com/databricks/providentia/blob/master/APIDOCS.md#automl-toolkit). +Feature importances are run via the `run_feature_importance` function within an instance of the `FeatureImportance` +class. 
+ +`model_family` - one of the supported model families listed [Here](https://github.com/databricks/providentia/blob/master/APIDOCS.md#automl-toolkit) + + +`prediction_type` - either "regressor" or "classifier" + + +`dataframe` - Dataframe that will be used for feature importance algorithm + +`cutoff_value` - threshold value for feature importance algorithm + +`cutoff_type` - cutoff for the feature algorithm + +`overrides` - dictionary of overrides for feature importance configuration + +Below is an example of using the `FeatureImportance` class on Databricks: +```python + +source_data = spark.read.parquet("/tmp/loan-risk-analysis/loan-risk-analysis-full-cleansed.parquet").withColumnRenamed("bad_loan", "label") + +## Generic configuration +experimentNamePrefix = "/Users/marygrace.moesta@databricks.com/AutoML" +RUNVERSION = "5" +labelColumn = "label" +runExperiment = "runRF_" + RUNVERSION +projectName = "mg_AutoML_Demo" +modelSaveFolder = "/tmp/mgm/ml/automl/" + +## This is the configuration of the hardware available (default of 4, 4, and 4) +nodeCount = 8 +coresPerNode = 16 +totalCores = nodeCount * coresPerNode +driverCores = 30 + +## Save locations +mlFlowModelSaveDirectory = "dbfs:" + modelSaveFolder + "models/" + projectName + "/" +inferenceConfigSaveLocation = "dbfs:" + modelSaveFolder + "inference/" + projectName + "/" +cntx = dbutils.entry_point.getDbutils().notebook().getContext() +api_token = cntx.apiToken().get() +api_url = cntx.apiUrl().get() +notebook_path = cntx.notebookPath().get() +generic_overrides = { + "labelCol": labelColumn, + "scoringMetric": "areaUnderROC", + "dataPrepCachingFlag": False, + "autoStoppingFlag": True, + "tunerAutoStoppingScore": 0.91, + "tunerParallelism": driverCores, + "tunerKFold": 1, ## normally should be >=5 + "tunerSeed": 42, ## for reproducibility + "tunerInitialGenerationArraySeed": 42, + "tunerTrainPortion": 0.7, + "tunerTrainSplitMethod": "stratified", + "tunerInitialGenerationMode": "permutations", + 
"tunerInitialGenerationPermutationCount": 8, + "tunerInitialGenerationIndexMixingMode": "linear", + "tunerFirstGenerationGenePool": 16, + "tunerNumberOfGenerations": 3, + "tunerNumberOfParentsToRetain": 2, + "tunerNumberOfMutationsPerGeneration": 4, + "tunerGeneticMixing": 0.8, + "tunerGenerationalMutationStrategy": "fixed", + "tunerEvolutionStrategy": "batch", + "tunerHyperSpaceInferenceFlag": True, + "tunerHyperSpaceInferenceCount": 400000, + "tunerHyperSpaceModelType": "XGBoost", + "tunerHyperSpaceModelCount": 8, + "mlFlowLoggingFlag": True, + "mlFlowLogArtifactsFlag": False, + "mlFlowTrackingURI": api_url, + "mlFlowExperimentName": experimentNamePrefix +"/" + projectName+ "/" + runExperiment, + "mlFlowAPIToken": api_token, + "mlFlowModelSaveDirectory": mlFlowModelSaveDirectory, + "mlFlowLoggingMode": "bestOnly", + "mlFlowBestSuffix": "_best", + "inferenceConfigSaveLocation": inferenceConfigSaveLocation + } + + ## Calculate Feature Importance +from py_auto_ml.exploration.feature_importance import FeatureImportance + +FI = FeatureImportance() + +fi_importances = FI.run_feature_importances("XGBoost", "classifier", dataframe,20.0,"count",generic_overrides) +``` +Once the feature importance algorithm has been run, there are two dataframes that remain as attributes of the instance +of the class. The first is the `importances` dataframe which lists the features and their importance value. The second +is the `top_fields` dataframe which consists only of the features themselves. Below is an example of retrieving these +dataframes once the feature importance algorithm has been run. + +```python +##Retrieving the importances DF +fi_importances['importances'] + +## Retrieving the top_fields DF +fi_importances['top_fields'] +``` + +### AutomationRunner Class +The `run_automation_runner` function invokes the `runAutomationRunner` Scala library via the JVM. 
This class has a few different +types of runs, which you can read more about [Here](https://github.com/databricks/providentia/blob/master/APIDOCS.md#full-automation). To call the function, pass it the following parameters: + +`model_family` - one of the supported model families listed HERE + +`prediction_type` - either "regressor" or "classifier" + +`dataframe` - Dataframe that will be used for the automation run + +`runner_type` - either "run", "confusion", or "prediction" + +`overrides` - dictionary of configuration overrides. If `None`, this will run with default configurations + +Below is an example of calling the `run_automation_runner` function with the overrides defined above: +```python +## Bring in the dataset +from pyspark.sql.functions import col,expr, when +dataframe = spark.read.parquet("/tmp/loan-risk-analysis/loan-risk-analysis-full-cleansed.parquet")\ + .withColumn("label", when((col("bad_loan") == "true"), 1).otherwise(0))\ + .drop(col("bad_loan"))\ + .drop(col("net"))\ + .sample(False, 0.025, 42)\ + .repartition(192) + + +#Splitting Train and Test +dataset_train = dataframe.where(expr("issue_year <= 2015")).cache() +dataset_valid = dataframe.where(expr("issue_year > 2015")).cache() +dataset_train.createOrReplaceTempView("dataset_train") +dataset_valid.createOrReplaceTempView("dataset_valid") + +model_family = "XGBoost" +prediction_type = "classifier" +run_type = "confusion" + +## Kickoff Automation runner +from py_auto_ml.automation_runner import AutomationRunner + + +runner = AutomationRunner().run_automation_runner(model_family, + prediction_type, + dataframe, + run_type, + generic_overrides) +``` + +Depending on the `run_type`, the function returns a dictionary of the following dataframes: + +| Run Type | Attributes | +|--------------|------------------------------------------------------------------| +| "run" | generation_report, model_report | +| "confusion" | confusion_data, prediction_data, generation_report, model_report | +| "prediction" | 
data_with_predictions, generation_report, model_report | + + +### Family Runner +The `run_family_runner` function, which lives in the `FamilyRunner` class, kicks off the `runFamilyRunner` equivalent in +the Scala library. This allows the user to run +several different model families in parallel. The `run_family_runner` function takes three required parameters: + +`dataframe` - Spark Dataframe + +`prediction_type` - either `regressor` or `classifier` + +`family_configs` - a dictionary that contains the `model_family` as the key and a dictionary of overrides as the value + + + + + + +Below is an example of calling the `run_family_runner` function: +```python +## Generic configuration +experimentNamePrefix = "/Users/marygrace.moesta@databricks.com/AutoML" +RUNVERSION = "1" +labelColumn = "label" +xgBoostExperiment = "runXG_" + RUNVERSION +logisticRegExperiment = "runLG_" + RUNVERSION +projectName = "MGM_AutoML_Demo" + +## This is the configuration of the hardware available +nodeCount = 4 +coresPerNode = 4 +totalCores = nodeCount * coresPerNode +driverCores = 4 + +cntx = dbutils.entry_point.getDbutils().notebook().getContext() +api_token = cntx.apiToken().get() +api_url = cntx.apiUrl().get() +notebook_path = cntx.notebookPath().get() +xg_boost_overrides = { + "labelCol": labelColumn, + "scoringMetric": "areaUnderROC", + "oneHotEncodeFlag": True, + "autoStoppingFlag": True, + "tunerAutoStoppingScore" : 0.91, + "tunerParallelism" : driverCores * 2, + "tunerKFold" : 2, + "tunerTrainPortion": 0.7, + "tunerTrainSplitMethod": "stratified", + "tunerInitialGenerationMode": "permutations", + "tunerInitialGenerationPermutationCount": 8, + "tunerInitialGenerationIndexMixingMode": "linear", + "tunerInitialGenerationArraySeed": 42, + "tunerFirstGenerationGenePool": 16, + "tunerNumberOfGenerations": 3, + "tunerNumberOfParentsToRetain": 2, + "tunerNumberOfMutationsPerGeneration": 4, + "tunerGeneticMixing": 0.8, + "tunerGenerationalMutationStrategy": "fixed", + "tunerEvolutionStrategy": 
"batch", + "tunerHyperSpaceInferenceFlag": True, + "tunerHyperSpaceInferenceCount": 400000, + "tunerHyperSpaceModelType": "XGBoost", + "tunerHyperSpaceModelCount": 8, + "mlFlowLoggingFlag": True, + "mlFlowLogArtifactsFlag": False, + "mlFlowTrackingURI": api_url, + "mlFlowExperimentName": experimentNamePrefix +"/" + projectName+ "/" + xgBoostExperiment, + "mlFlowAPIToken": api_token, + "mlFlowLoggingMode": "bestOnly", + "mlFlowBestSuffix": "_best", + "mlFlowModelSaveDirectory": "/dbfs/tmp/mgm/ml", + "pipelineDebugFlag": True +} + +logisticRegOverrides = { + "labelCol": labelColumn, + "scoringMetric" : "areaUnderROC", + "oneHotEncodeFlag": True, + "autoStoppingFlag": True, + "tunerAutoStoppingScore": 0.91, + "tunerParallelism": driverCores * 2, + "tunerKFold": 2, + "tunerTrainPortion": 0.7, + "tunerTrainSplitMethod": "stratified", + "tunerInitialGenerationMode": "permutations", + "tunerInitialGenerationPermutationCount": 8, + "tunerInitialGenerationIndexMixingMode": "linear", + "tunerInitialGenerationArraySeed": 42, + "tunerFirstGenerationGenePool": 16, + "tunerNumberOfGenerations": 3, + "tunerNumberOfParentsToRetain": 2, + "tunerNumberOfMutationsPerGeneration": 4, + "tunerGeneticMixing": 0.8, + "tunerGenerationalMutationStrategy": "fixed", + "tunerEvolutionStrategy": "batch", + "mlFlowLoggingFlag": True, + "mlFlowLogArtifactsFlag": False, + "mlFlowTrackingURI": api_url, + "mlFlowExperimentName": experimentNamePrefix +"/" + projectName+ "/" + logisticRegExperiment, + "mlFlowAPIToken": api_token, + "mlFlowLoggingMode": "bestOnly", + "mlFlowBestSuffix" : "_best", + "mlFlowModelSaveDirectory": "/dbfs/tmp/mgm/ml", + "pipelineDebugFlag": True +} +# Import the family runner +from py_auto_ml.executor.family_runner import FamilyRunner + +family_runner = FamilyRunner() +prediction_type = "classifier" +family_runner_configs = { + "XGBoost": xg_boost_overrides, + "LogisticRegression": logisticRegOverrides +} + +family_runner = family_runner.run_family_runner(dataframe, + 
prediction_type, + family_runner_configs) + +``` + +The `run_family_runner` function returns a dictionary of the following dataframes: +1. `model_report` +2. `generation_report` +3. `best_mlflow_run_id` + +## Using the Family Runner for Inference +The pipeline API is currently supported in PySpark. There are two ways to run inference on a modeling pipeline: +1. By MLflow run ID +2. By pipeline model path + +The `mlflow_pipeline_inference` function, which lives in the `FamilyRunner` class, takes the following parameters: + +`mlflow_run_id` - the MLflow run ID of interest + +`model_family` - a supported model family + +`prediction_type` - either `regressor` or `classifier` + +`dataframe` - a PySpark dataframe that will be used for inference + +`configs` - a dictionary of configs to override default values + +`label_col` - the name of the label column, i.e. the column the model is predicting + +Below is an example of running a full inference pipeline via the MLflow run ID from the `family_runner` above: +```python +mlflow_inference_df = family_runner.mlflow_pipeline_inference(run_id, + "XGBoost", + "classifier", + source_data, + xg_boost_overrides, + "label") +``` +The `mlflow_pipeline_inference` function returns a dataframe that includes the original dataframe used for inference plus +additional columns with the feature vector, raw prediction, probability (if applicable), and the prediction. + +Inference can be run directly against the path of a pipeline model already created by the AutoML Toolkit. 
The +`path_pipeline_inference` function (which lives in the `FamilyRunner` class) takes the following parameters: + +`path` - the path of the pipelined model + +`dataframe` - a PySpark dataframe that will be used for inference + +Below is an example of running inference directly against the path for a pipelined model created by AutoML: +```python +pipeline_save_path = "/dbfs/tmp/mgm/ml/BestRunclassifier_XGBoost_862a8ceacb534404b86f8bdae69c6449/BestPipeline" +path_df = family_runner.path_pipeline_inference(pipeline_save_path, + source_data) +``` +The `path_pipeline_inference` function returns a dataframe that includes the original dataframe passed as an argument +plus additional columns with the feature vector, raw prediction, probability (if applicable), and the prediction. + +The Family Runner APIs can also be used for feature engineering tasks based on a selected number of configs. The +`feature_eng_pipeline` function (which lives in the `FamilyRunner` class) takes the following parameters: + +`dataframe` - a PySpark dataframe that will be feature engineered + +`model_family` - a supported model family + +`prediction_type` - either `regressor` or `classifier` + +`configs` - the set of configs used for the Family Runner + +Below is +an example of using the family runner to generate a feature engineered dataframe: +```python +feat_eng_df = family_runner.feature_eng_pipeline(source_data, + "XGBoost", + "classifier", + family_runner_configs) +``` +The `feature_eng_pipeline` function returns a feature engineered dataframe based on the + diff --git a/python/pyAutoML.egg-info/PKG-INFO b/python/pyAutoML.egg-info/PKG-INFO new file mode 100644 index 00000000..d51458f8 --- /dev/null +++ b/python/pyAutoML.egg-info/PKG-INFO @@ -0,0 +1,10 @@ +Metadata-Version: 1.0 +Name: pyAutoML +Version: 0.2.0 +Summary: UNKNOWN +Home-page: UNKNOWN +Author: Databricks +Author-email: UNKNOWN +License: UNKNOWN +Description: UNKNOWN +Platform: UNKNOWN diff --git a/python/pyAutoML.egg-info/SOURCES.txt 
b/python/pyAutoML.egg-info/SOURCES.txt new file mode 100644 index 00000000..6c85d294 --- /dev/null +++ b/python/pyAutoML.egg-info/SOURCES.txt @@ -0,0 +1,19 @@ +setup.py +pyAutoML.egg-info/PKG-INFO +pyAutoML.egg-info/SOURCES.txt +pyAutoML.egg-info/dependency_links.txt +pyAutoML.egg-info/top_level.txt +py_auto_ml/__init__.py +py_auto_ml/automation_runner.py +py_auto_ml/local_spark_singleton.py +py_auto_ml/executor/__init__.py +py_auto_ml/executor/family_runner.py +py_auto_ml/exploration/__init__.py +py_auto_ml/exploration/feature_importance.py +py_auto_ml/test/__init__.py +py_auto_ml/test/local_spark_singleton.py +py_auto_ml/test/test_automation_runner.py +py_auto_ml/test/test_family_runner.py +py_auto_ml/test/test_feature_importance.py +py_auto_ml/utilities/__init__.py +py_auto_ml/utilities/helpers.py \ No newline at end of file diff --git a/python/py_auto_ml/__init__.py b/python/py_auto_ml/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/py_auto_ml/__pycache__/__init__.cpython-37.pyc b/python/py_auto_ml/__pycache__/__init__.cpython-37.pyc new file mode 100644 index 00000000..da1897e9 Binary files /dev/null and b/python/py_auto_ml/__pycache__/__init__.cpython-37.pyc differ diff --git a/python/py_auto_ml/__pycache__/local_spark_singleton.cpython-37.pyc b/python/py_auto_ml/__pycache__/local_spark_singleton.cpython-37.pyc new file mode 100644 index 00000000..193b01fb Binary files /dev/null and b/python/py_auto_ml/__pycache__/local_spark_singleton.cpython-37.pyc differ diff --git a/python/py_auto_ml/automation_runner.py b/python/py_auto_ml/automation_runner.py new file mode 100644 index 00000000..a6cef528 --- /dev/null +++ b/python/py_auto_ml/automation_runner.py @@ -0,0 +1,122 @@ +import json +from pyspark.sql.functions import DataFrame +from py_auto_ml.local_spark_singleton import SparkSingleton +from py_auto_ml.utilities.helpers import Helpers + + +class AutomationRunner: + + def __init__(self): + # Setup Spark singleton Instance + 
self.spark = SparkSingleton.get_instance() + + def run_automation_runner(self, + model_family: str, + prediction_type: str, + dataframe: DataFrame, + runner_type: str, + overrides=None): + """ + + :param model_family: str + One of the supported model types + + :param prediction_type: str + Either "classifier" or "regressor" + + :param dataframe: DataFrame + + :param runner_type: str + One of the following calls to the automation runner: "run", "confusion", "prediction" + + :param overrides: dict + Dictionary of configuration overrides + + :return: + """ + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + runner_type_lower = runner_type.lower() + Helpers.check_runner_types(runner_type_lower) + + + # Check if you need default instance config or generating from map of overrides + if overrides is not None: + default_flag = "false" + # Stringify overrides to JSON + stringified_overrides = json.dumps(overrides) + else: + default_flag = "true" + stringified_overrides = "" + + self.spark._jvm.com.databricks.labs.automl.pyspark.AutomationRunnerUtil.runAutomationRunner(model_family, + prediction_type, + stringified_overrides, + dataframe._jdf, + runner_type_lower, + default_flag) + self._automation_runner = True + + return self._get_returns(runner_type_lower) + + def _get_returns(self, + runner_type: str): + """ + + :param runner_type: + One of the following calls to the automation runner: "run", "confusion", "prediction" + :return: Dataframe depending on `runner_type` + `run` + generation_report dataframe + model_report dataframe + `confusion` + confusion_data: dataframe + prediction_data: dataframe + generation_report: dataframe + model_report: dataframe + `prediction` + data_with_predictions: dataframe + generation_report: dataframe + model_report: dataframe + """ + # Cache the returns + if self._automation_runner == True: + if runner_type == "run": + generation_report = 
self.spark.sql("select * from generationReport") + model_report = self.spark.sql("select * from modelReport") + return_dict = { + 'generation_report': generation_report, + "model_report": model_report + } + return return_dict + elif runner_type == "confusion": + confusion_data = self.spark.sql("select * from confusionData") + prediction_data = self.spark.sql("select * from predictionData") + generation_report = self.spark.sql("select * from generationReport") + model_report = self.spark.sql("select * from modelReport") + return_dict = { + 'confusion_data': confusion_data, + 'prediction_data': prediction_data, + 'generation_report': generation_report, + 'model_report': model_report + } + return return_dict + elif runner_type == "prediction": + data_with_predictions = self.spark.sql("select * from dataWithPredictions") + generation_report = self.spark.sql("select * from generationReport") + model_report = self.spark.sql("select * from modelReportData") + return_dict = { + 'data_with_predictions': data_with_predictions, + 'generation_report': generation_report, + 'model_report': model_report + } + return return_dict + else: + print("No returns were added - check your runner_type") + + else: + raise Exception ("In order to generate the proper returns for the automation runner, please first run the " + "automation runner with the `run_automation_runner`") + diff --git a/python/py_auto_ml/executor/__init__.py b/python/py_auto_ml/executor/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/py_auto_ml/executor/family_runner.py b/python/py_auto_ml/executor/family_runner.py new file mode 100644 index 00000000..843a5bcc --- /dev/null +++ b/python/py_auto_ml/executor/family_runner.py @@ -0,0 +1,166 @@ +import json +from pyspark.sql.functions import DataFrame +from py_auto_ml.local_spark_singleton import SparkSingleton +from py_auto_ml.utilities.helpers import Helpers + + +class FamilyRunner: + def __init__(self): + self.spark = 
SparkSingleton.get_instance() + + def run_family_runner(self, + df: DataFrame, + prediction_type: str, + family_configs: dict): + """ + + :param family_configs: dict + Supported model_family as the key; the value is a dictionary of configuration overrides + :param prediction_type: str + "regressor" or "classifier" + :param df: dataframe + :param path: string + Path for writing out the pipeline models + :return: + """ + # Checking for supported model families and types + Helpers.check_prediction_type(prediction_type) + + stringified_family_configs = json.dumps(family_configs) + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runFamilyRunner(df._jdf, + stringified_family_configs, + prediction_type) + + self._family_runner = True + + return self._get_returns() + + # Fetch the temp tables and bring them into python + + def _get_returns(self): + """ + + :return: dict of dataframes + 'model_report':model_report: dataframe + 'generation_report':generation_report: dataframe + 'best_mlflow_run_id':best_mlflow_run_id: dataframe + """ + if self._family_runner != True: + raise Exception("You must first run the family runner to generate the proper return dataframes") + else: + model_report = self.spark.sql("SELECT * FROM modelReportDataFrame") + generation_report = self.spark.sql("SELECT * FROM generationReportDataFrame") + best_mlflow_run_id = self.spark.sql("SELECT * FROM bestMlFlowRunId") + return_dict = { + 'model_report': model_report, + 'generation_report': generation_report, + 'best_mlflow_run_id': best_mlflow_run_id + } + return return_dict + + # Get the best mlflow run Id from DF + # Get pipeline path from tag in MLflow using tracking client + # Returns pipeline model + + # TODO: full inference on pipeline model + def mlflow_pipeline_inference(self, + mlflow_run_id: str, + model_family: str, + prediction_type: str, + dataframe: DataFrame, + configs = {}, + label_col='label'): + """ + + :param mlflow_run_id: string + The MLflow Run Id (as generated 
by AutoML) which has the pipeline model of interest + :param model_family: string + Supported model family + :param prediction_type: string + Supported prediction type + :param dataframe: DataFrame + The dataframe being used for inference + :param configs: dict + Dictionary of configuration overrides, default is empty dictionary + :param label_col: string + Label column of dataset that will be used for inference. Default is "label" + :return: inference_df: dataframe + The dataframe with predictions + """ + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + stringified_configs = json.dumps(configs) + + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runMlFlowInference(mlflow_run_id, + model_family, + prediction_type, + label_col, + stringified_configs, + dataframe._jdf) + # Pull out the inference + inference_df = self.spark.sql("SELECT * FROM inferenceDF") + + return inference_df + + def path_pipeline_inference(self, + path: str, + dataframe: DataFrame): + """ + + :param path: + Path to the pipelined model created by AutoML + :param dataframe: + Spark dataframe that will be used for inference + :return: inferred_df: Dataframe + Dataframe with predictions + """ + + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runPathInference(path, + dataframe._jdf) + inferred_df = self.spark.sql('SELECT * FROM pathInferenceDF') + + return inferred_df + + def feature_eng_pipeline(self, + df: DataFrame, + model_family: str, + prediction_type: str, + configs= {} + ): + """ + + :param df: Dataframe + Dataframe feature engineering pipeline will be applied to + :param model_family: string + Supported model family + :param prediction_type: string + Supported prediction type + :param configs: dict + Dictionary of overrides + :return: Dataframe + Feature engineered dataframe + """ + + # Checking for supported model families and types + 
Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + + stringified_family_configs = json.dumps(configs) + self.spark._jvm.com.databricks.labs.automl.pyspark.FamilyRunnerUtil.runFeatureEngPipeline(df._jdf, + model_family, + prediction_type, + stringified_family_configs) + feature_eng_df = self.spark.sql("SELECT * FROM featEngDf") + + return feature_eng_df + + + + + + + + diff --git a/python/py_auto_ml/exploration/__init__.py b/python/py_auto_ml/exploration/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/py_auto_ml/exploration/feature_importance.py b/python/py_auto_ml/exploration/feature_importance.py new file mode 100644 index 00000000..bf1e2ed1 --- /dev/null +++ b/python/py_auto_ml/exploration/feature_importance.py @@ -0,0 +1,84 @@ +import json +from pyspark.sql.functions import DataFrame +from py_auto_ml.local_spark_singleton import SparkSingleton +from py_auto_ml.utilities.helpers import Helpers + + +class FeatureImportance: + def __init__(self): + self.spark = SparkSingleton.get_instance() + + + def run_feature_importance(self, + model_family: str, + prediction_type: str, + dataframe: DataFrame, + cutoff_value: float, + cutoff_type: str, + overrides=None): + """ + + :param model_family: str + One of the supported model types + + :param prediction_type: str + Either "classifier" or "regressor" + + :param df: DataFrame + + :param cutoff_value: float + Threshold value for feature importance algorithm + + :param cutoff_type: str + Cutoff for the number features + + :param overrides: dict + Dictionary of configuration overrides + + :return: + """ + + # Checking for supported model families and types + Helpers.check_model_family(model_family) + Helpers.check_prediction_type(prediction_type) + + + ## Set flag for default configs + if overrides is not None: + default_flag = "false" + # Convert the configs to JSON + stringified_overrides = json.dumps(overrides) + else: + stringified_overrides = "" + 
default_flag = "true" + + # Pass to JVM to run FI + self.spark._jvm.com.databricks.labs.automl.pyspark.FeatureImportanceUtil.runFeatureImportance(model_family, + prediction_type, + stringified_overrides, + dataframe._jdf, + cutoff_type, + cutoff_value, + default_flag) + self.feature_importance = True + return self._get_returns() + + def _get_returns(self): + """ + + :return: dict of dataframes: + 'importances': importances df + 'top_fields': top fields df + """ + if self.feature_importance != True: + raise Exception ("Please first generate feature importances by running `run_feature_importance`") + else: + importances = self.spark.sql("select * from importances") + top_fields = self.spark.sql("select feature from importances") + return_dict = { + 'importances': importances, + 'top_fields': top_fields + } + return return_dict + + diff --git a/python/py_auto_ml/local_spark_singleton.py b/python/py_auto_ml/local_spark_singleton.py new file mode 100644 index 00000000..1cb38c91 --- /dev/null +++ b/python/py_auto_ml/local_spark_singleton.py @@ -0,0 +1,8 @@ +from pyspark.sql import SparkSession + + +class SparkSingleton: + + @classmethod + def get_instance(cls): + return SparkSession.builder.getOrCreate() diff --git a/python/py_auto_ml/test/__init__.py b/python/py_auto_ml/test/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/py_auto_ml/test/loan_risk.csv b/python/py_auto_ml/test/loan_risk.csv new file mode 100644 index 00000000..e8acb981 --- /dev/null +++ b/python/py_auto_ml/test/loan_risk.csv @@ -0,0 +1,1001 @@ +term,home_ownership,purpose,addr_state,verification_status,application_type,loan_amnt,emp_length,annual_inc,dti,delinq_2yrs,revol_util,total_acc,credit_length_in_years,label,int_rate,net,issue_year + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,10000,3,60000,16.18,0,88.7,12,9,tru,18.75,-4580.83,2012 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,8000,3,40950,23.92,0,60.4,13,15,fals,7.62,813.72,2012 
+ 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,3200,null,10728,19.91,0,91,5,9,tru,24.99,-2441.12,2016 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,27000,8,100000,12.53,0,85.9,37,20,fals,13.65,6045.97,2014 + 36 months,RENT,credit_card,NC,Not Verified,INDIVIDUAL,6000,0,40000,33.57,0,28.6,33,6,fals,9.16,400.09,2016 + 36 months,RENT,debt_consolidation,MO,Verified,INDIVIDUAL,12000,10,80000,22.38,1,65.8,25,11,tru,11.99,-5170.3,2014 + 36 months,MORTGAGE,credit_card,MA,Verified,INDIVIDUAL,18000,10,65000,14.57,1,55.3,20,13,fals,11.47,1923.34,2016 + 36 months,RENT,debt_consolidation,NC,Verified,INDIVIDUAL,7900,7,22000,30.66,0,94.8,12,14,fals,14.46,1087.14,2016 + 60 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,22975,10,125000,15.26,0,42,42,19,tru,14.65,-13542.43,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,20000,0,121656,13.11,13,42,47,21,fals,9.99,2191.74,2015 + 60 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,35000,5,181000,9.87,1,76.2,28,24,fals,15.88,12964.24,2013 + 36 months,MORTGAGE,major_purchase,TX,Not Verified,INDIVIDUAL,12000,10,110000,12.21,0,47.6,32,16,tru,7.62,-828.7,2012 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,3500,8,32000,20.97,0,71.9,15,17,tru,12.99,-2430.21,2014 + 60 months,MORTGAGE,home_improvement,NY,Verified,INDIVIDUAL,15800,4,59000,32.81,0,19.1,35,13,fals,15.61,2669.96,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,35000,6,180000,11.87,0,42.1,20,17,fals,9.76,3577.28,2012 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,15000,9,55000,25.18,0,80.6,13,13,fals,21.49,2533.81,2016 + 36 months,RENT,debt_consolidation,HI,Verified,INDIVIDUAL,11200,9,54600,5.8,0,36.9,13,11,fals,12.99,696.86,2015 + 36 months,MORTGAGE,major_purchase,CA,Verified,INDIVIDUAL,6800,9,70000,19.48,0,28.7,33,29,fals,18.24,1952.11,2014 + 36 months,MORTGAGE,debt_consolidation,MN,Verified,INDIVIDUAL,35000,10,110000,15,0,44.3,36,25,fals,14.65,6975.62,2015 + 36 
months,MORTGAGE,home_improvement,MO,Verified,INDIVIDUAL,3600,10,110000,21.31,0,86.5,28,16,fals,13.35,516.37,2014 + 36 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,15000,6,113000,5.21,0,35,31,30,tru,10.99,-13526.97,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,21000,6,85000,21.59,0,46.1,26,16,fals,8.18,897.43,2015 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,7500,9,55000,27.65,0,46.3,36,11,tru,16.59,-2332.48,2014 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,15600,6,70000,6.79,1,59.8,5,9,fals,16.2,2769.44,2013 + 36 months,RENT,credit_card,NY,Verified,INDIVIDUAL,20000,3,65000,10.51,1,48.1,24,9,fals,10.99,2215.59,2015 + 60 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,22000,null,49000,19.05,0,69.5,28,14,tru,17.77,2870.05,2013 + 60 months,MORTGAGE,credit_card,UT,Not Verified,INDIVIDUAL,14000,1,80000,25.64,0,82.2,24,11,fals,10.49,2921.16,2014 + 60 months,RENT,major_purchase,GA,Verified,INDIVIDUAL,5900,3,144000,1.24,0,26.8,29,15,fals,15.61,1180.95,2014 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,7025,2,30000,0,0,0,7,4,fals,18.99,1678.46,2014 + 60 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,21000,0,215000,13.04,0,94.2,50,17,fals,18.99,4634.81,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,25000,2,60000,9.76,0,72,7,7,fals,7.39,107.08,2016 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,10050,8,30000,14.96,0,45.1,8,8,fals,9.71,1575.06,2013 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,35000,10,80000,21.18,0,27.4,29,12,tru,15.61,-15492.3,2015 + 36 months,MORTGAGE,debt_consolidation,NV,Not Verified,INDIVIDUAL,2000,0,40000,15.03,0,51.8,21,16,fals,14.31,284.97,2015 + 60 months,RENT,credit_card,OR,Verified,INDIVIDUAL,16675,9,65000,1.88,0,15.4,27,22,fals,15.61,216.92,2014 + 36 months,MORTGAGE,credit_card,RI,Not Verified,INDIVIDUAL,15000,3,75000,27.15,0,84.6,39,13,tru,13.98,-8204.47,2014 + 36 
months,MORTGAGE,credit_card,CO,Verified,INDIVIDUAL,24000,2,80000,14.45,0,67.4,18,13,fals,6.99,2450.02,2014 + 60 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,25000,2,170000,9.37,0,33.7,48,13,fals,14.47,3491.63,2014 + 36 months,MORTGAGE,credit_card,GA,Not Verified,INDIVIDUAL,6700,8,33000,13.27,0,52.7,20,10,fals,9.67,1045.49,2013 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,27575,10,80880,29.35,1,40,45,33,tru,14.99,-12952.57,2015 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,18000,10,100000,7.91,0,44,20,14,tru,7.26,-13543.74,2015 + 60 months,MORTGAGE,debt_consolidation,NM,Verified,INDIVIDUAL,35000,10,110357.14,18.95,0,86.8,34,21,tru,13.99,-19106.16,2015 + 36 months,RENT,debt_consolidation,PA,Not Verified,INDIVIDUAL,5000,2,52000,14.81,2,55.2,12,28,fals,12.49,1021.26,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,15000,4,71000,22.01,0,64.8,15,12,tru,6.39,-8125.8,2015 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,10000,10,78000,20.69,0,99.6,18,20,fals,12.39,812.24,2014 + 36 months,RENT,credit_card,AL,Verified,INDIVIDUAL,16500,null,42445,20.07,0,96.6,6,20,fals,11.67,3094.75,2014 + 36 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,6000,10,78000,26.15,0,9.9,43,25,fals,12.29,958.73,2015 + 60 months,MORTGAGE,debt_consolidation,ME,Verified,INDIVIDUAL,35000,0,93000,22.04,0,63.2,18,13,fals,18.55,2923.93,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,20000,10,55000,26.78,1,61.8,39,17,fals,13.11,2997.82,2013 + 36 months,RENT,credit_card,NV,Verified,INDIVIDUAL,12000,10,53000,14.1,0,81.6,36,16,fals,9.76,1864.82,2012 + 60 months,RENT,credit_card,MD,Verified,INDIVIDUAL,24000,0,60000,18.28,1,67.7,28,18,fals,9.17,3550.64,2015 + 36 months,OWN,credit_card,WY,Verified,INDIVIDUAL,5400,3,55000,24,0,22,30,20,fals,8.9,775.36,2014 + 36 months,MORTGAGE,major_purchase,TX,Verified,INDIVIDUAL,1500,10,63000,21.18,0,35.3,48,25,fals,14.98,369.77,2014 + 36 months,MORTGAGE,home_improvement,IN,Not 
Verified,INDIVIDUAL,4500,4,61847,19.41,0,21.6,39,18,fals,7.12,517.1,2014 + 36 months,OWN,credit_card,NY,Verified,INDIVIDUAL,8000,10,73000,8.3,0,60.5,29,27,fals,11.14,1447.79,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,22000,7,55000,6.95,0,24.3,23,12,fals,9.17,1316.6,2015 + 36 months,RENT,debt_consolidation,OH,Not Verified,INDIVIDUAL,5000,2,38500,21.79,0,62.5,20,19,tru,14.64,551.61,2014 + 36 months,RENT,medical,TN,Verified,INDIVIDUAL,2000,9,40000,11.97,2,64.9,16,14,fals,9.49,86.22,2016 + 36 months,MORTGAGE,debt_consolidation,OK,Not Verified,INDIVIDUAL,8000,1,41000,6.5,6,79.6,17,13,fals,8.19,912.34,2015 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,12000,2,111000,9.37,1,39.6,17,10,fals,9.17,924.43,2015 + 36 months,RENT,debt_consolidation,OH,Not Verified,INDIVIDUAL,10000,10,40000,30.48,1,24.7,40,16,tru,8.18,-6234.14,2015 + 36 months,MORTGAGE,small_business,MD,Verified,INDIVIDUAL,10000,4,60000,28.04,0,42.1,13,7,fals,17.57,2937.24,2014 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,1600,3,35928,19.86,0,79.5,11,8,fals,17.1,456.4,2013 + 36 months,MORTGAGE,debt_consolidation,IN,Not Verified,INDIVIDUAL,16000,4,60000,24.78,2,46.6,45,12,fals,18.49,4295.77,2013 + 36 months,MORTGAGE,other,WI,Verified,INDIVIDUAL,24500,7,49000,14.6,1,54.3,31,24,tru,14.65,-8462.85,2015 + 36 months,RENT,major_purchase,PA,Verified,INDIVIDUAL,9000,10,89000,10.53,0,62.5,51,14,tru,18.99,-954,2014 + 36 months,RENT,medical,NY,Verified,INDIVIDUAL,4000,1,58000,26.06,0,1.3,30,16,tru,10.78,-2305.38,2016 + 36 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,18400,2,50000,6.81,0,22.9,18,11,fals,13.99,1777.07,2015 + 36 months,RENT,other,NY,Not Verified,INDIVIDUAL,10400,8,60000,16.44,0,37.7,9,8,fals,7.62,1113.12,2012 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,10000,2,78000,13.25,0,29.9,42,25,fals,7.69,767.58,2014 + 36 months,OWN,home_improvement,NC,Not Verified,INDIVIDUAL,4800,10,29070,24.69,0,33.6,12,10,fals,9.8,442.35,2016 + 36 
months,RENT,credit_card,AL,Verified,INDIVIDUAL,14000,7,52000,22.13,0,68.6,14,17,fals,12.79,1097.23,2016 + 60 months,OWN,debt_consolidation,KY,Not Verified,INDIVIDUAL,19200,6,75000,6.94,2,5,19,13,fals,7.89,1811.14,2016 + 36 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,11200,8,89000,10.5,1,59.2,22,16,fals,14.33,2434.28,2013 + 36 months,RENT,debt_consolidation,MO,Not Verified,INDIVIDUAL,2500,5,32000,18.38,0,37.7,27,15,fals,13.35,547.63,2014 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,23000,10,55500,12.15,1,59.6,18,37,fals,14.64,1634.83,2014 + 36 months,RENT,small_business,CA,Verified,INDIVIDUAL,19075,9,77000,20.2,1,68.2,22,9,fals,18.49,5919.91,2013 + 36 months,RENT,credit_card,FL,Verified,INDIVIDUAL,8225,1,21835,7.94,0,60.7,37,32,fals,17.57,2022.12,2014 + 36 months,RENT,other,NY,Verified,INDIVIDUAL,7000,10,65000,15.01,0,47.7,18,8,fals,10.74,1219.11,2012 + 60 months,MORTGAGE,home_improvement,WV,Verified,INDIVIDUAL,12000,5,55000,17.85,2,40.8,20,19,tru,19.52,-5385.66,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,5500,3,65000,17.3,0,26.2,22,17,fals,8.18,140.76,2015 + 36 months,OWN,other,NY,Not Verified,INDIVIDUAL,10100,7,33738,10.14,0,59.3,16,4,fals,24.5,4264.12,2014 + 60 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,24000,10,108000,17.35,0,81.4,25,9,tru,22.47,6118.05,2012 + 36 months,RENT,moving,CA,Verified,INDIVIDUAL,8100,2,75000,7.74,0,26.5,23,16,fals,18.25,2377.61,2014 + 36 months,MORTGAGE,car,AZ,Not Verified,INDIVIDUAL,4000,10,52000,23.7,0,52.8,30,16,fals,19.52,1264.16,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,25000,10,110000,21.58,0,70.2,21,17,fals,8.39,2729.4,2014 + 60 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,20000,1,48000,32.82,0,30.8,47,10,fals,12.29,2914.98,2015 + 36 months,RENT,debt_consolidation,LA,Verified,INDIVIDUAL,15000,3,77000,25.45,0,73.2,26,12,fals,15.8,3935.35,2013 + 36 months,MORTGAGE,home_improvement,MN,Not 
Verified,INDIVIDUAL,5000,null,60000,25.76,0,8.1,36,27,tru,7.49,-4694.18,2015 + 36 months,OWN,debt_consolidation,OH,Not Verified,INDIVIDUAL,3625,10,30000,34.48,0,93.7,11,18,tru,18.99,-2002.63,2016 + 36 months,OWN,credit_card,FL,Not Verified,INDIVIDUAL,10000,10,75858,9.84,0,63,28,11,tru,12.99,-8322.72,2014 + 36 months,MORTGAGE,credit_card,GA,Verified,INDIVIDUAL,12000,10,37000,22.09,0,49.3,20,19,fals,12.69,2089.45,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,10000,9,51000,11.11,0,73,9,9,tru,19.53,-6870.56,2016 + 60 months,MORTGAGE,credit_card,CT,Verified,INDIVIDUAL,28800,9,85000,27.49,0,62,30,12,tru,12.49,-19102.98,2014 + 36 months,OWN,home_improvement,NC,Not Verified,INDIVIDUAL,12000,6,150000,3.51,0,4.6,39,17,fals,6.03,493.06,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,25000,10,225000,13.63,3,61,32,18,fals,7.89,2300.94,2015 + 36 months,OWN,home_improvement,TX,Verified,INDIVIDUAL,8400,5,39000,18.06,0,90,6,16,fals,18.99,2684.83,2014 + 36 months,OWN,home_improvement,MI,Verified,INDIVIDUAL,7425,2,60000,9.98,0,80.5,42,32,fals,15.31,698.12,2013 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,19200,10,74767,29.13,0,47.5,21,21,tru,18.99,-10256.98,2014 + 60 months,MORTGAGE,debt_consolidation,LA,Verified,INDIVIDUAL,21000,10,92000,33.59,2,79.7,24,14,fals,17.57,5819.49,2015 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,18000,4,48000,31.05,0,66.6,18,29,fals,15.61,3379.48,2014 + 36 months,MORTGAGE,debt_consolidation,UT,Verified,INDIVIDUAL,15500,10,79000,28.42,3,71.2,39,18,fals,12.12,1397.84,2012 + 60 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,18000,10,90000,24.72,2,80,35,15,tru,22.2,-46.5,2013 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,7000,null,30000,18.88,0,21.8,36,20,fals,6.24,134.26,2015 + 36 months,RENT,credit_card,TX,Not Verified,INDIVIDUAL,5000,5,70000,10.89,0,37.3,25,15,fals,12.35,102.02,2013 + 36 
months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,17375,9,58000,20.79,0,40.8,27,14,fals,10.99,3100.06,2014 + 36 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,27850,4,75000,28.88,0,50.9,48,14,fals,13.98,5014.99,2014 + 36 months,RENT,debt_consolidation,WI,Verified,INDIVIDUAL,10000,null,39000,21.91,0,22.4,17,30,tru,15.61,-5804.2,2013 + 60 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,14900,3,58000,31.51,0,56.8,48,10,tru,19.52,-5932.21,2014 + 60 months,MORTGAGE,debt_consolidation,DC,Verified,INDIVIDUAL,18375,5,73000,22.13,0,95.1,30,12,tru,18.54,-8907.08,2015 + 36 months,MORTGAGE,moving,MI,Not Verified,INDIVIDUAL,5000,10,90000,36.27,0,82.6,46,28,tru,16.29,-4715.87,2014 + 36 months,RENT,debt_consolidation,NV,Verified,INDIVIDUAL,9600,10,39000,23.72,0,58.1,14,7,tru,10.99,-3739.86,2014 + 36 months,MORTGAGE,credit_card,MN,Verified,INDIVIDUAL,12200,10,30000,27.8,0,39.2,15,10,fals,7.89,1053.99,2015 + 36 months,MORTGAGE,debt_consolidation,MD,Not Verified,INDIVIDUAL,4100,1,50000,10.67,0,96.8,15,9,fals,16.99,1161.56,2014 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,12000,0,70000,27.33,0,45.5,27,10,tru,12.59,-6389.21,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,10000,10,56000,31.22,0,57.2,16,9,fals,7.26,407.34,2015 + 36 months,RENT,credit_card,TN,Verified,INDIVIDUAL,5000,1,32000,25.24,0,85.5,40,18,fals,8.19,634.96,2015 + 36 months,MORTGAGE,car,CO,Verified,INDIVIDUAL,12000,0,75000,18.36,0,28.6,49,18,fals,6.24,950,2015 + 60 months,MORTGAGE,debt_consolidation,WI,Verified,INDIVIDUAL,27275,10,62000,14.2,1,64.1,45,21,tru,22.99,-23500.97,2015 + 60 months,RENT,major_purchase,MO,Verified,INDIVIDUAL,12000,10,62000,3.77,0,8.6,13,7,tru,17.57,-5639,2015 + 36 months,RENT,major_purchase,CA,Not Verified,INDIVIDUAL,9000,10,101000,28.8,1,69.7,19,22,fals,12.99,1924.51,2014 + 36 months,MORTGAGE,credit_card,KY,Verified,INDIVIDUAL,24000,10,319374,10.91,0,59.8,39,12,fals,7.89,1336.13,2015 + 60 
months,MORTGAGE,credit_card,MD,Verified,INDIVIDUAL,18000,10,71000,7.81,0,75.5,21,26,fals,13.98,5385.81,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,13500,10,68500,21.67,0,71.1,19,9,fals,15.61,3313.52,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,3000,10,37000,24.91,1,47.1,24,14,fals,18.99,2.94,2016 + 60 months,MORTGAGE,major_purchase,WA,Verified,INDIVIDUAL,16000,7,45000,7.55,0,16.9,11,8,fals,17.99,944.36,2012 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,12000,3,50000,20.15,0,52.8,9,7,fals,9.99,1396.6,2015 + 36 months,MORTGAGE,credit_card,OR,Not Verified,INDIVIDUAL,25000,6,58000,11.45,0,51,14,12,fals,7.26,2266.28,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,8,59000,26.51,0,38,27,17,fals,9.99,1212.56,2015 + 60 months,MORTGAGE,debt_consolidation,OR,Verified,INDIVIDUAL,24000,4,60000,14.14,0,57.1,11,11,fals,14.49,5681.66,2014 + 36 months,MORTGAGE,credit_card,WI,Verified,INDIVIDUAL,10000,10,35000,20.57,0,68.6,26,21,fals,11.55,1879.92,2013 + 60 months,RENT,debt_consolidation,MD,Verified,INDIVIDUAL,22000,10,55000,14.2,0,50.4,33,15,tru,18.25,-6966.6,2014 + 36 months,RENT,debt_consolidation,TN,Verified,INDIVIDUAL,19000,1,50000,20.12,0,52.8,11,10,fals,11.44,2698.87,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,10000,2,60000,26.78,0,20.2,37,12,fals,11.53,1210.38,2015 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,28000,10,100000,22.42,0,89.1,30,13,fals,18.85,8719.27,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,17700,9,59200,32.37,0,75.4,29,37,fals,21.99,4426.97,2014 + 60 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,30000,3,80000,10.79,2,0,20,13,fals,21.99,4752.12,2015 + 60 months,RENT,debt_consolidation,AL,Verified,INDIVIDUAL,13750,0,41000,19.08,0,93.8,13,18,fals,18.92,4764.01,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,18900,10,45000,21.42,0,58.6,25,28,fals,12.69,1538.04,2015 + 36 
months,RENT,other,PA,Not Verified,INDIVIDUAL,1000,2,23556,16.35,0,54.3,12,5,fals,18.75,315.07,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,4750,1,26000,8.08,0,31.7,12,4,fals,18.24,387.94,2014 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,1,120000,7.39,0,27.4,17,15,fals,11.14,3971.44,2013 + 60 months,OWN,home_improvement,CA,Not Verified,INDIVIDUAL,25000,2,130000,7.07,1,11.8,42,23,fals,6.49,5.19,2016 + 36 months,RENT,debt_consolidation,NC,Verified,INDIVIDUAL,15000,2,92000,14.57,1,75.4,23,14,fals,18.85,4438.06,2013 + 60 months,MORTGAGE,major_purchase,NY,Not Verified,INDIVIDUAL,15000,2,70000,21.16,1,59.2,32,25,fals,14.65,3700.41,2015 + 36 months,MORTGAGE,credit_card,KS,Not Verified,INDIVIDUAL,16700,10,89000,9.4,0,37.4,25,15,fals,7.62,2042.75,2013 + 60 months,MORTGAGE,medical,CO,Verified,INDIVIDUAL,11200,0,45000,16,0,43.9,18,7,tru,18.54,-6500.7,2015 + 36 months,RENT,debt_consolidation,HI,Not Verified,INDIVIDUAL,8400,2,78000,19.34,1,63.5,33,21,fals,12.99,1618.41,2013 + 60 months,OWN,credit_card,NY,Verified,INDIVIDUAL,35000,7,120000,10.61,0,53.1,48,30,fals,15.31,9627.29,2014 + 36 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,6000,10,50000,13.44,0,50.3,18,17,fals,18.75,1890.41,2013 + 60 months,MORTGAGE,home_improvement,VA,Not Verified,INDIVIDUAL,14800,10,156800,9.4,0,20.6,28,15,fals,12.29,270.13,2015 + 60 months,OWN,debt_consolidation,OH,Verified,INDIVIDUAL,29175,9,65000,14.29,0,79.3,17,14,tru,19.72,-5610.6,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10800,6,80000,5.04,0,39.2,27,35,tru,18.84,-7192.88,2015 + 60 months,MORTGAGE,home_improvement,NC,Verified,INDIVIDUAL,16150,3,80000,6.44,0,61.3,15,10,fals,19.99,6324.65,2014 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,22500,9,51012,28.3,1,75.6,24,14,fals,22.95,11174.17,2013 + 36 months,MORTGAGE,credit_card,TX,Not Verified,INDIVIDUAL,6000,10,54000,33.82,0,15.4,28,22,fals,6.89,106.07,2015 + 36 months,RENT,other,NJ,Not 
Verified,INDIVIDUAL,8400,2,155000,10.06,0,20.4,49,13,fals,15.31,2128.72,2013 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,6000,2,70000,8.21,4,27.3,20,12,fals,9.49,472.91,2016 + 60 months,RENT,credit_card,TX,Verified,INDIVIDUAL,16000,6,47000,37.31,0,54.9,21,16,fals,13.33,1287.55,2015 + 36 months,RENT,credit_card,MN,Not Verified,INDIVIDUAL,8000,0,29000,21.52,0,57,24,13,fals,13.65,1794.21,2014 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,11200,0,47500,16.43,0,12.4,43,10,fals,6.49,778.01,2014 + 36 months,OWN,debt_consolidation,OR,Verified,INDIVIDUAL,7000,8,50000,19.47,0,65.6,7,6,fals,7.12,757.25,2014 + 36 months,MORTGAGE,medical,IN,Not Verified,INDIVIDUAL,14300,0,95000,9.11,1,21.6,46,12,fals,15.61,3699.82,2013 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,16000,1,78500,16.15,0,87.9,19,10,fals,23.43,1500.17,2014 + 36 months,MORTGAGE,credit_card,TX,Verified,INDIVIDUAL,16000,null,61576,19.72,0,90.9,19,16,tru,12.99,-5188.08,2014 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,35000,10,96012,27.81,0,7.8,21,20,fals,8.9,4543.3,2014 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,16000,6,50000,9.19,0,15.9,57,24,fals,6.49,256.8,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,29400,1,220000,18.59,0,96.9,22,16,fals,12.85,9526.63,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,15000,2,120000,18.87,0,14.2,33,20,fals,8.67,797.59,2015 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,16000,0,52000,29.29,1,77.2,15,7,fals,24.74,2142.55,2016 + 36 months,RENT,other,NM,Verified,INDIVIDUAL,11325,5,31475,14.8,0,45.9,10,13,tru,16.55,-9033.89,2015 + 60 months,RENT,debt_consolidation,IN,Verified,INDIVIDUAL,13500,4,31000,31.2,0,46.4,17,5,tru,27.34,-9102.3,2016 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,21425,10,89740,8.08,0,35,22,31,tru,17.14,-16124.9,2015 + 36 
months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,5300,2,85000,9.28,2,25.9,28,24,tru,13.67,-1841.96,2013 + 60 months,OWN,debt_consolidation,OH,Verified,INDIVIDUAL,19200,10,58000,20.11,0,59.7,8,8,fals,20.5,3096.33,2016 + 36 months,OWN,car,NC,Not Verified,INDIVIDUAL,6000,0,45000,11.41,0,2.5,16,11,fals,7.89,565.32,2015 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,20000,0,68000,20.77,0,68.5,29,16,tru,21.98,-10073.74,2012 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,21000,4,50000,23.36,0,54.2,30,10,fals,18.55,5491.18,2015 + 36 months,MORTGAGE,credit_card,NC,Not Verified,INDIVIDUAL,14000,10,72000,20.02,0,83.9,12,41,fals,7.69,1729.74,2014 + 36 months,RENT,credit_card,NC,Not Verified,INDIVIDUAL,15000,2,80000,9.07,0,24.8,9,24,tru,10.15,-9664.23,2014 + 60 months,OWN,home_improvement,CA,Verified,INDIVIDUAL,17400,10,66000,26.97,1,19.8,36,23,fals,14.33,2681.52,2012 + 36 months,RENT,credit_card,TN,Verified,INDIVIDUAL,10750,3,32000,33.56,0,50,27,14,fals,16.29,2628.7,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,17000,7,125000,9.83,0,39,45,12,fals,5.32,361.12,2016 + 60 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,2,120000,18.12,0,48,23,14,fals,13.35,9889.36,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,24150,10,72000,35.6,0,68,22,17,fals,18.25,4837.18,2015 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,12000,7,60000,18.28,0,34.4,29,12,tru,24.99,-8847.31,2015 + 36 months,RENT,credit_card,CO,Verified,INDIVIDUAL,9000,2,30000,24.08,0,71.4,16,7,fals,11.44,1463.55,2015 + 36 months,MORTGAGE,vacation,TX,Verified,INDIVIDUAL,4000,0,48000,16.73,0,60,18,14,fals,16.59,1091.36,2014 + 36 months,OWN,debt_consolidation,NY,Not Verified,INDIVIDUAL,24000,7,180000,12,0,25.4,45,29,fals,7.59,1354.15,2016 + 60 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,35000,8,130000,22.49,0,74.3,36,11,tru,23.99,-26947.38,2015 + 36 months,RENT,vacation,NY,Not 
Verified,INDIVIDUAL,10000,3,74000,8.09,0,20.8,5,10,fals,9.76,1310.07,2012 + 36 months,RENT,home_improvement,CA,Verified,INDIVIDUAL,8000,4,54000,28.39,0,33.4,37,11,fals,10.15,1282.44,2014 + 36 months,MORTGAGE,credit_card,OH,Verified,INDIVIDUAL,14000,10,55000,15.3,0,60,16,24,fals,6.49,974.16,2014 + 36 months,MORTGAGE,debt_consolidation,KY,Not Verified,INDIVIDUAL,7200,10,98000,16.04,0,93,20,21,fals,12.69,115.7,2015 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,21000,10,42000,23.57,0,74.2,23,18,tru,18.49,-14414.8,2015 + 36 months,RENT,debt_consolidation,MI,Not Verified,INDIVIDUAL,10000,0,48000,22.09,0,48.1,22,25,tru,12.99,-2803.1,2014 + 60 months,RENT,debt_consolidation,CT,Verified,INDIVIDUAL,25000,10,57987,31.89,1,47.3,31,18,tru,18.25,-16044.77,2015 + 60 months,MORTGAGE,debt_consolidation,KY,Verified,INDIVIDUAL,9925,10,28874,15.71,1,35.4,22,23,fals,16.29,3667.39,2013 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,6000,3,22000,35.84,0,62.8,8,4,fals,14.46,512.31,2016 + 36 months,OWN,small_business,MN,Verified,INDIVIDUAL,24000,4,71000,13.02,0,43.7,10,15,tru,17.27,-22328.25,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,13700,2,79000,20.22,0,39.3,51,11,fals,12.12,4147.44,2012 + 36 months,OWN,vacation,TX,Verified,INDIVIDUAL,12000,10,65000,13.9,0,19.7,50,17,fals,9.67,419.71,2014 + 36 months,RENT,medical,NV,Verified,INDIVIDUAL,17000,0,122000,13.89,0,37.9,52,24,fals,15.31,1968.29,2014 + 36 months,OWN,debt_consolidation,MN,Not Verified,INDIVIDUAL,17100,1,96000,34.23,0,63.4,20,20,tru,25.44,-12679.04,2016 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,8125,7,36000,21.51,0,39.5,13,12,fals,14.49,1945.25,2014 + 60 months,RENT,credit_card,TX,Verified,INDIVIDUAL,23500,2,55000,30.92,0,68,18,11,tru,14.65,-10085.03,2015 + 36 months,OWN,debt_consolidation,MA,Verified,INDIVIDUAL,6250,null,32000,16.17,0,36.5,21,19,fals,10.99,843.34,2014 + 36 months,RENT,debt_consolidation,VA,Not 
Verified,INDIVIDUAL,10000,4,45000,22.29,0,62.3,12,15,tru,13.68,-6257.47,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,35000,2,107000,1.17,0,15.3,33,17,fals,9.49,1888.3,2016 + 36 months,OWN,debt_consolidation,AL,Not Verified,INDIVIDUAL,7800,10,48000,20.03,0,28.6,24,13,fals,6.39,438.45,2015 + 36 months,MORTGAGE,debt_consolidation,UT,Verified,INDIVIDUAL,11500,7,119000,15.6,1,31.3,29,21,fals,10.99,2051.81,2014 + 36 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,10000,4,68000,10.71,1,86.2,21,33,fals,14.09,2319.63,2013 + 36 months,RENT,debt_consolidation,AL,Verified,INDIVIDUAL,24000,3,150000,12.05,1,54.1,52,21,fals,6.39,2089.21,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,1850,8,28000,10.33,0,70.8,10,13,fals,11.14,82.08,2012 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30000,2,200000,23.24,0,99.2,28,45,fals,19.72,8833.98,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,8450,10,28200,38.17,2,38.7,20,11,fals,18.25,1451.05,2016 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,12000,5,100000,17.84,0,35.5,29,14,fals,14.64,291.13,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,14675,9,66000,28.08,1,48.8,41,23,tru,18.25,-12500.85,2015 + 36 months,RENT,moving,NY,Not Verified,INDIVIDUAL,6000,5,62000,20.72,0,75.2,25,13,fals,18.25,1836.01,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,11300,10,40000,26.79,0,92.5,17,16,tru,17.57,-8441.87,2015 + 60 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,35000,10,150000,18.04,0,46.3,54,27,fals,17.57,1310.28,2014 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,9000,7,65200,17.14,0,42.2,12,5,fals,10.99,880.6,2015 + 36 months,RENT,credit_card,WI,Verified,INDIVIDUAL,4000,null,24220,9.34,0,43.6,18,15,fals,10.64,639.15,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,5875,7,25300,24.52,0,35.3,10,10,fals,10.99,1048.19,2013 + 36 
months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,10500,10,48000,13.55,0,49.8,27,17,fals,10.99,1874.89,2014 + 36 months,MORTGAGE,debt_consolidation,MA,Not Verified,INDIVIDUAL,10000,1,82000,3.81,0,64.2,21,25,fals,11.67,1603.49,2014 + 60 months,OWN,debt_consolidation,WV,Verified,INDIVIDUAL,30000,6,70000,27.07,0,29.9,15,19,tru,21.6,-22501.38,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,9000,8,42500,31.37,0,28.7,40,15,fals,16.99,1110.73,2016 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,10,58000,16.38,2,78,26,14,fals,17.57,3708.56,2014 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,20000,5,57000,15.79,0,60.3,15,10,tru,12.99,-11173.33,2016 + 36 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,6000,1,73000,19.6,0,19.1,45,14,fals,14.46,503,2016 + 36 months,MORTGAGE,home_improvement,VA,Verified,INDIVIDUAL,14300,4,38000,12.28,0,52.7,26,12,fals,13.11,1148.25,2013 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,25000,10,98000,14.39,1,76.1,16,20,fals,11.53,911.43,2015 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,5000,10,85000,6.58,0,59.2,17,9,fals,10.49,352.31,2016 + 36 months,MORTGAGE,small_business,GA,Verified,INDIVIDUAL,5000,2,70000,19.85,0,80.2,17,12,fals,13.49,271.02,2016 + 36 months,RENT,debt_consolidation,IL,Not Verified,INDIVIDUAL,6000,8,70000,17.95,0,47.5,28,12,fals,7.12,280.36,2014 + 36 months,RENT,other,NY,Verified,INDIVIDUAL,4800,0,95000,0.54,0,51,5,11,fals,20.2,1487.23,2013 + 36 months,MORTGAGE,medical,OH,Verified,INDIVIDUAL,4200,10,74839.91,19.91,0,89.2,34,37,fals,21.6,856.81,2013 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,1450,10,52400,15.3,0,74.3,16,14,fals,18.25,61.88,2015 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,21250,10,74000,9.2,0,49.3,26,10,fals,23.4,7363.56,2013 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,20000,2,96000,15.82,0,68.3,22,15,fals,12.35,3726.11,2013 + 36 
months,OWN,other,FL,Verified,INDIVIDUAL,5000,9,70000,14.74,1,34.8,15,11,tru,11.49,-4485.07,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,23550,10,65000,13.79,0,40.3,27,12,tru,16.99,-21231.59,2015 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,24000,0,110000,24.06,0,73.1,53,26,fals,10.49,3514.29,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,33425,5,75000,16.27,0,72.1,11,11,fals,15.8,8760.7,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,24000,10,115000,20.23,1,81.2,48,10,fals,9.49,374.42,2014 + 60 months,MORTGAGE,major_purchase,CA,Verified,INDIVIDUAL,21000,2,144000,18.35,0,90,33,16,fals,15.8,550.18,2013 + 36 months,MORTGAGE,debt_consolidation,NJ,Not Verified,INDIVIDUAL,18000,5,87500,11.34,0,27.9,31,12,fals,6.62,1895.95,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,8400,2,65000,10.43,0,31.1,49,14,fals,8.39,605.35,2016 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,11300,2,47500,27.01,0,66.2,22,12,fals,22.99,1280.89,2015 + 36 months,RENT,credit_card,NV,Not Verified,INDIVIDUAL,3000,null,24000,25.15,0,45.8,17,19,fals,9.49,154.28,2016 + 36 months,RENT,debt_consolidation,OH,Verified,INDIVIDUAL,20000,10,72000,13.47,1,23.9,29,21,fals,13.49,845.32,2016 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,10000,7,102700,14.43,0,22,27,17,fals,11.99,1627.7,2014 + 36 months,RENT,moving,TX,Not Verified,INDIVIDUAL,1450,2,24000,22.5,0,3.8,13,7,fals,14.65,311.96,2015 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,28000,10,89207,17.01,1,62.3,31,14,fals,26.3,1499.1,2017 + 60 months,MORTGAGE,credit_card,KY,Verified,INDIVIDUAL,30000,10,300000,14.6,0,83.8,51,24,fals,18.84,8207.86,2015 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,20700,10,45000,16.45,2,0.4,17,18,fals,19.99,4127.64,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,4000,10,92000,22.67,0,77.7,19,15,fals,10.99,449.89,2015 + 36 
months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,14400,3,60000,13.36,0,35.3,10,10,fals,9.17,1565.84,2014 + 60 months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,14400,2,55000,14.82,0,62.9,13,13,tru,15.61,-3213.99,2014 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,10000,10,90000,15.45,0,40.1,33,18,fals,5.32,126.62,2016 + 60 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,20000,10,79000,20.27,0,82.3,35,16,fals,13.99,3336.13,2016 + 36 months,RENT,other,NJ,Verified,INDIVIDUAL,7000,7,190000,4.5,0,75.1,14,13,fals,12.39,1150.58,2015 + 36 months,RENT,credit_card,FL,Verified,INDIVIDUAL,2100,0,34200,24.11,0,21.6,19,8,fals,7.62,255.79,2013 + 36 months,RENT,credit_card,AZ,Verified,INDIVIDUAL,14400,3,150000,6.49,1,96.6,16,15,fals,8.18,1501.27,2015 + 36 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,7475,6,57000,33.45,0,30,24,8,fals,21.18,1879.91,2014 + 60 months,RENT,debt_consolidation,KS,Verified,INDIVIDUAL,16300,2,45300,31.26,0,75.6,28,19,fals,18.75,7898.23,2012 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,20000,9,58000,20.09,0,65,20,20,tru,13.67,-13483.46,2016 + 60 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,24500,10,80000,22.71,0,45.6,28,13,fals,15.61,5225.29,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,7200,null,62000,32.64,0,46.4,40,34,fals,13.49,336.65,2016 + 36 months,RENT,debt_consolidation,LA,Verified,INDIVIDUAL,5300,10,39187,25.21,0,68.4,13,20,tru,16.29,-1138.98,2014 + 60 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,15000,10,59600,17.18,0,76.6,41,26,fals,19.97,5961.35,2014 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,10000,0,48200,22.31,0,42.1,15,8,fals,11.49,480.67,2016 + 36 months,MORTGAGE,home_improvement,OH,Not Verified,INDIVIDUAL,15000,10,107000,28.81,0,14.6,37,31,fals,8.99,282.48,2016 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,19300,4,46000,26.4,0,42.4,11,6,fals,17.57,1917.22,2015 + 60 
months,MORTGAGE,debt_consolidation,MS,Verified,INDIVIDUAL,19250,10,45838,26.68,0,96.5,26,15,fals,22.99,6632.72,2015 + 60 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,10200,10,44000,22.34,0,36,49,17,tru,16.29,-5580.88,2014 + 36 months,RENT,car,PA,Not Verified,INDIVIDUAL,3500,10,42000,12.43,0,36.4,29,10,fals,19.52,1151.83,2014 + 36 months,MORTGAGE,credit_card,TX,Verified,INDIVIDUAL,11000,6,34300,16.67,1,26,18,12,fals,10.16,1807.55,2013 + 36 months,MORTGAGE,other,NY,Not Verified,INDIVIDUAL,4000,6,65488,20.94,0,80.7,35,13,fals,18.85,1267.44,2013 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,7500,10,42000,18.91,0,85,18,23,fals,14.09,1365.57,2013 + 60 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,21600,9,54000,36.22,0,65,44,11,tru,22.99,-11145,2015 + 60 months,OWN,major_purchase,WA,Not Verified,INDIVIDUAL,22500,10,45000,22.61,0,8,16,20,fals,24.99,2120.11,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,18000,6,75000,22.96,0,91.2,20,18,fals,10.99,637.99,2016 + 36 months,OWN,credit_card,TX,Verified,INDIVIDUAL,10000,1,25000,29.67,0,42.3,26,12,fals,8.9,1431.12,2014 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,3500,0,32000,7.58,0,28.1,12,7,fals,6.62,368.64,2012 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,4500,8,31000,25.63,0,76.4,14,5,fals,12.12,889.95,2012 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,7550,10,26520,18.33,0,78.8,13,12,fals,14.33,1069.4,2013 + 60 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,17000,7,51150,22.64,0,62.9,14,8,fals,17.57,7599.8,2014 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,28000,0,130000,5.2,0,48.6,16,12,fals,6.89,2474.69,2015 + 36 months,OWN,other,PA,Verified,INDIVIDUAL,1500,4,65000,12.72,0,61.4,28,26,fals,8.18,120.69,2015 + 36 months,RENT,debt_consolidation,TN,Verified,INDIVIDUAL,16000,2,82000,18.48,0,38.8,33,21,tru,12.59,-6186.66,2015 + 36 months,MORTGAGE,debt_consolidation,NC,Not 
Verified,INDIVIDUAL,3500,10,86000,3.03,0,4.2,36,13,fals,6.62,360.64,2013 + 60 months,RENT,credit_card,IL,Verified,INDIVIDUAL,10000,3,42000,18.17,0,57.4,21,13,fals,17.76,1668.57,2013 + 36 months,RENT,other,MA,Not Verified,INDIVIDUAL,8000,4,46000,17.4,0,54.9,24,11,fals,9.76,885.7,2012 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,15000,10,42000,22.2,1,29.2,29,15,tru,10.49,-4988.93,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,11800,9,104288,24.39,0,64.3,23,11,fals,18.49,3662.09,2012 + 36 months,MORTGAGE,debt_consolidation,NY,Not Verified,INDIVIDUAL,16000,4,70000,17.61,0,16.7,25,20,fals,11.22,1509.32,2015 + 60 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,18250,9,41600,14.8,0,81.7,16,13,fals,15.61,6560.81,2014 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,20000,4,194000,13.6,0,71.4,10,28,fals,5.32,827.36,2016 + 60 months,RENT,house,CA,Verified,INDIVIDUAL,26050,8,102000,8.4,1,49.1,28,15,fals,25.89,542.83,2015 + 36 months,RENT,debt_consolidation,MT,Not Verified,INDIVIDUAL,24000,1,60000,32.13,1,86.4,31,38,tru,18.25,-21528.26,2016 + 36 months,OWN,medical,FL,Verified,INDIVIDUAL,3550,null,29604,25.01,0,74.4,47,64,tru,21.99,-1298.05,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,10400,9,90000,24.21,0,49.1,21,10,tru,16.29,-6250.78,2016 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,16000,0,130000,16.42,1,83.4,13,10,tru,18.2,-10248.79,2015 + 60 months,OWN,credit_card,NY,Verified,INDIVIDUAL,12000,2,45000,8.5,0,32.8,32,11,fals,10.64,610.81,2015 + 36 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,2500,8,36000,16.03,0,85.3,24,8,tru,9.49,-419.24,2015 + 36 months,OWN,credit_card,FL,Verified,INDIVIDUAL,20000,3,126000,7.05,0,72.4,14,8,fals,13.99,4554.38,2012 + 60 months,OWN,credit_card,NJ,Verified,INDIVIDUAL,19200,10,71400,12.05,0,21,52,16,tru,8.67,-13276.75,2014 + 36 
months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,9425,8,30000,27.48,0,51.3,18,10,tru,15.77,-6556.85,2015 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,10000,10,50000,0.74,0,4.4,14,8,fals,7.26,717.51,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,10000,9,38000,6.73,2,97.8,8,9,fals,10.49,979.02,2015 + 36 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,25375,4,67000,18.95,1,96.9,15,19,fals,18.25,7282.77,2015 + 36 months,MORTGAGE,debt_consolidation,OR,Verified,INDIVIDUAL,9800,7,42000,31,0,59.4,29,18,fals,18.25,976.32,2013 + 60 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,18000,null,68000,15.04,0,37.1,16,12,fals,15.99,1081.98,2017 + 60 months,OWN,debt_consolidation,GA,Verified,INDIVIDUAL,15125,2,50000,10.87,0,60.3,10,7,tru,24.99,-8327.56,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,20000,10,50000,12.82,0,90.9,11,15,fals,8.39,2691.91,2014 + 60 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,33000,0,92000,21.99,0,66.1,40,8,tru,15.61,-8333.92,2014 + 36 months,OWN,credit_card,PA,Verified,INDIVIDUAL,6700,0,29760,21.9,0,16.2,21,6,fals,12.74,500.6,2016 + 60 months,RENT,other,UT,Verified,INDIVIDUAL,18000,10,180000,35.18,0,76,44,15,tru,22.45,-16539.62,2016 + 36 months,OWN,credit_card,NV,Verified,INDIVIDUAL,9000,null,33156,10.39,0,85.2,11,15,fals,13.11,1933.96,2012 + 36 months,MORTGAGE,credit_card,OH,Verified,INDIVIDUAL,22750,10,85000,15.47,0,65.7,22,16,fals,8.9,3021.19,2012 + 36 months,OWN,vacation,WA,Not Verified,INDIVIDUAL,6000,10,52000,11.69,4,37.7,21,23,fals,19.72,289.7,2013 + 36 months,MORTGAGE,credit_card,NY,Not Verified,INDIVIDUAL,8000,10,60000,12.56,0,66.6,21,19,fals,15.31,2027.38,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,6000,5,55000,8.18,0,58.3,9,6,fals,12.12,1095.39,2012 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,25000,0,110000,22.8,0,45.7,21,34,fals,10.78,2797.02,2016 + 60 
months,MORTGAGE,credit_card,UT,Verified,INDIVIDUAL,30000,1,120000,24.54,0,87.5,29,27,fals,14.65,3313.26,2015 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,20000,1,51250,25.69,0,47.4,25,8,tru,19.52,-4961.92,2014 + 60 months,RENT,debt_consolidation,CO,Not Verified,INDIVIDUAL,10500,9,127000,15.05,0,39.1,39,13,fals,17.57,4188.24,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,15000,3,112000,18.96,0,94.1,19,12,fals,11.49,388.37,2017 + 36 months,MORTGAGE,debt_consolidation,LA,Not Verified,INDIVIDUAL,10000,2,68000,11.6,1,45.8,14,14,fals,12.99,2070.48,2014 + 36 months,MORTGAGE,home_improvement,NY,Verified,INDIVIDUAL,17600,10,149000,13.24,2,67.4,36,26,fals,12.49,3425.6,2014 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,1,137000,11.89,0,14,23,15,fals,12.99,7089.1,2015 + 36 months,RENT,debt_consolidation,KY,Verified,INDIVIDUAL,6825,8,19000,33.89,0,91,35,16,fals,16.99,1139.29,2015 + 60 months,RENT,other,TX,Verified,INDIVIDUAL,12800,10,85000,17.13,0,33.3,27,11,fals,16.55,3258.26,2015 + 60 months,RENT,debt_consolidation,MD,Verified,INDIVIDUAL,32350,6,73000,16.69,0,90.2,19,15,fals,21.15,18237.24,2013 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,12000,5,80000,9.8,0,55.6,19,23,fals,7.9,1517.36,2013 + 36 months,MORTGAGE,home_improvement,NC,Verified,INDIVIDUAL,5000,8,15600,6.46,0,29.1,11,10,fals,12.29,746.84,2015 + 36 months,RENT,debt_consolidation,TN,Not Verified,INDIVIDUAL,4000,0,40000,12.3,0,12.2,8,5,fals,6.89,359.01,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,4000,10,50000,14.57,0,28.8,11,13,fals,18.99,179.68,2016 + 60 months,MORTGAGE,credit_card,FL,Verified,INDIVIDUAL,35000,8,270000,7.1,0,73,20,19,tru,12.99,-29174.83,2016 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,7750,1,50000,14.64,3,54,14,25,tru,11.53,-4180.41,2015 + 36 months,MORTGAGE,credit_card,TX,Verified,INDIVIDUAL,7000,10,39000,17.57,2,34.6,20,19,fals,13.33,730.92,2015 + 60 
months,RENT,credit_card,IL,Verified,INDIVIDUAL,35000,10,104999,32.04,1,21.6,26,16,tru,18.49,-22498.09,2016 + 36 months,RENT,other,IL,Not Verified,INDIVIDUAL,7125,3,26000,24.52,0,1.4,63,11,fals,16.55,546.85,2015 + 60 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,25000,9,50000,5.09,1,31.4,16,11,fals,13.33,4652.41,2015 + 60 months,RENT,debt_consolidation,NC,Verified,INDIVIDUAL,10225,2,38900,16.01,0,25.7,46,12,fals,14.99,3400.57,2014 + 36 months,RENT,credit_card,MI,Not Verified,INDIVIDUAL,4500,2,35000,23.35,0,32.5,8,4,fals,12.99,957.57,2014 + 36 months,MORTGAGE,credit_card,OK,Verified,INDIVIDUAL,14675,5,70000,32.61,0,80.4,58,19,fals,15.22,3695.62,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,9000,10,180000,7.13,0,65.5,24,10,tru,13.65,88.43,2014 + 60 months,MORTGAGE,home_improvement,OH,Not Verified,INDIVIDUAL,35000,10,185000,17.7,0,69.3,29,33,fals,22.35,5157.3,2016 + 60 months,MORTGAGE,credit_card,GA,Verified,INDIVIDUAL,10000,5,67000,16.64,0,55.1,30,13,tru,13.33,-5193.58,2015 + 36 months,RENT,debt_consolidation,MN,Not Verified,INDIVIDUAL,1700,7,24000,37.6,3,19.3,37,22,fals,13.33,203.96,2015 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,12000,10,30000,19.44,0,25.6,39,20,fals,18.99,557.77,2014 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,30000,10,128000,21.7,0,58.1,32,14,fals,8.18,3089.87,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,15000,1,95000,14.59,0,74.7,25,10,fals,15.99,9.47,2016 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,5000,9,58000,11.05,0,33.9,13,11,fals,9.67,763.7,2014 + 36 months,RENT,major_purchase,OH,Verified,INDIVIDUAL,17000,5,50000,10.15,0,22.1,11,10,fals,7.69,1894.97,2014 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,8000,10,57000,7.28,0,71,28,24,fals,11.67,685.75,2014 + 36 months,OWN,home_improvement,NY,Verified,INDIVIDUAL,8000,10,156000,15.46,0,99.1,28,19,fals,11.47,314.7,2016 + 36 months,RENT,credit_card,NY,Not 
Verified,INDIVIDUAL,4800,7,42000,21.03,0,34.1,13,10,fals,12.99,507.72,2014 + 36 months,RENT,major_purchase,NY,Not Verified,INDIVIDUAL,8500,2,45500,12.19,0,37.9,30,10,tru,9.75,-6318.36,2016 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,18500,3,127000,14.77,0,48.4,45,14,fals,9.71,2342.93,2013 + 36 months,RENT,debt_consolidation,GA,Verified,INDIVIDUAL,20500,9,62000,11.46,0,54.5,37,14,fals,12.39,2935.32,2015 + 60 months,OWN,debt_consolidation,MO,Verified,INDIVIDUAL,14400,10,37000,5.38,0,25.2,32,11,fals,14.31,1888.71,2015 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,1000,10,34000,2.29,0,49,3,21,fals,17.77,297.3,2012 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,19425,5,44000,5.35,0,46.4,12,9,fals,16.29,8694.79,2013 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,12000,5,120000,25.06,0,37.9,23,14,fals,10.15,1854.06,2014 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,7975,0,24960,12.55,0,81.5,14,18,fals,18.99,780.64,2016 + 36 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,16000,10,50000,30.36,1,95.4,34,18,tru,13.33,-5220.65,2015 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,12000,6,73000,15.96,0,83.6,26,20,fals,10.75,1372.16,2016 + 60 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,21500,5,62000,29.97,1,64.4,55,18,fals,17.14,5690.5,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,17850,10,44695,23.52,0,71.1,9,13,tru,15.61,-5888.35,2015 + 60 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,35000,2,175000,23.44,0,8.2,21,14,fals,13.98,11315.24,2014 + 60 months,RENT,other,NY,Not Verified,INDIVIDUAL,15775,2,157000,28.88,0,32.8,44,19,fals,12.29,2931.99,2015 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,18000,1,58000,14.71,1,68.6,18,10,fals,15.61,6901.25,2013 + 36 months,MORTGAGE,home_improvement,PA,Verified,INDIVIDUAL,4700,10,48000,23.68,0,41.6,15,28,fals,16.55,620.64,2015 + 36 months,RENT,credit_card,MO,Not 
Verified,INDIVIDUAL,5000,7,70902,14.4,0,47,46,17,fals,5.32,268.63,2016 + 60 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12800,0,72500,22.73,0,36,20,12,tru,12.59,-12240.79,2015 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,15000,2,58000,20.57,0,74.1,15,13,fals,15.61,3953.92,2014 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,25000,2,90000,10.72,1,58.3,20,17,fals,17.57,728.24,2014 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,6150,5,35000,16.5,1,27.8,19,21,fals,12.99,257.1,2014 + 36 months,MORTGAGE,home_improvement,NM,Verified,INDIVIDUAL,5000,4,98000,8.76,0,98.2,14,19,fals,17.76,1268.4,2013 + 36 months,MORTGAGE,debt_consolidation,WY,Verified,INDIVIDUAL,27375,6,78000,29.4,4,62.5,56,15,fals,19.52,4268.02,2015 + 36 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,15000,10,62500,24.48,1,66.9,15,18,fals,16.2,3897.55,2013 + 36 months,MORTGAGE,house,FL,Verified,INDIVIDUAL,24250,0,55000,12.33,0,37.7,22,19,fals,11.14,2006.92,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,12000,2,29000,39.9,0,41.6,25,12,fals,21.97,1285.11,2016 + 36 months,RENT,credit_card,NV,Verified,INDIVIDUAL,11800,7,46225,6.67,0,36.2,12,13,fals,15.31,854.25,2013 + 60 months,MORTGAGE,debt_consolidation,CT,Verified,INDIVIDUAL,30000,10,75000,11.84,0,35.7,13,10,tru,14.47,-8838.6,2014 + 36 months,RENT,debt_consolidation,MD,Not Verified,INDIVIDUAL,15000,0,67000,3.87,3,87,51,20,tru,14.77,-8253.77,2016 + 36 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,6250,10,24000,26.2,0,14.6,35,18,fals,7.91,563.45,2016 + 36 months,RENT,other,MA,Verified,INDIVIDUAL,35000,10,400000,6.66,0,65.6,25,15,fals,19.72,11575.77,2012 + 36 months,MORTGAGE,home_improvement,AZ,Verified,INDIVIDUAL,28000,0,110000,8.07,1,13,33,11,fals,7.62,3070.47,2013 + 36 months,RENT,debt_consolidation,MD,Verified,INDIVIDUAL,9600,10,56862,11.65,0,52.9,19,13,fals,7.89,917.03,2015 + 36 
months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,21000,3,50028,8.88,0,46.5,7,4,tru,12.29,-12536.72,2015 + 36 months,MORTGAGE,credit_card,MA,Verified,INDIVIDUAL,24000,3,62000,22.09,0,71.6,21,24,fals,13.53,5332.59,2014 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,29975,3,75000,29.9,3,58.3,32,29,fals,14.64,4413.16,2014 + 36 months,RENT,other,FL,Not Verified,INDIVIDUAL,2500,2,120000,6.91,1,54.7,34,12,fals,10.49,86.78,2015 + 36 months,RENT,credit_card,MD,Verified,INDIVIDUAL,10000,4,72000,14.53,0,39.5,14,6,fals,9.49,533.57,2015 + 36 months,RENT,debt_consolidation,MN,Verified,INDIVIDUAL,20000,10,75000,31.34,0,62.3,42,22,fals,13.33,2043.51,2015 + 60 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,22125,10,75000,21.76,0,55,32,20,tru,12.69,-11413.4,2015 + 60 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,16175,4,100000,15,0,64.5,14,10,tru,19.99,-14925.58,2016 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,10450,null,38168,38.65,5,59.4,48,17,tru,17.86,-6824.59,2015 + 60 months,RENT,credit_card,CO,Verified,INDIVIDUAL,19200,4,70000,18.67,0,64.7,23,13,fals,15.8,3891.66,2013 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,25325,9,137500,16.19,4,3.2,24,14,tru,19.99,-2399.92,2014 + 36 months,MORTGAGE,credit_card,MD,Verified,INDIVIDUAL,15000,0,56270,26.85,1,65.7,30,14,tru,10.99,1203.33,2014 + 60 months,RENT,other,PA,Not Verified,INDIVIDUAL,23800,6,70000,5.78,1,49,27,8,tru,25.83,-14125.42,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,4500,10,53000,22.62,0,78.6,22,9,fals,15.31,516.39,2012 + 36 months,RENT,debt_consolidation,WA,Verified,INDIVIDUAL,20000,1,95000,22.64,0,48.1,32,14,fals,8.39,1749.15,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,25000,0,75000,34.62,0,30.1,40,16,fals,14.99,6036.62,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,7500,7,25000,25.01,0,82.8,31,18,fals,14.47,1186.44,2014 + 36 months,MORTGAGE,credit_card,AL,Not 
Verified,INDIVIDUAL,28000,10,125000,28.98,0,47.2,23,14,fals,8.18,1562.8,2015 + 36 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,12800,4,54000,15.09,0,66.4,16,13,fals,12.12,1610.62,2012 + 60 months,RENT,debt_consolidation,NV,Verified,INDIVIDUAL,10850,3,37000,25.11,0,9.6,17,7,tru,19.52,-5129.73,2014 + 60 months,MORTGAGE,credit_card,TN,Verified,INDIVIDUAL,35000,7,96600,8.86,0,54,11,10,fals,11.53,5486.08,2015 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,5800,10,23000,29.28,0,42.5,13,11,fals,13.65,1187.06,2014 + 36 months,MORTGAGE,home_improvement,VA,Not Verified,INDIVIDUAL,4000,1,81000,31.18,2,87.1,39,20,fals,12.99,837.69,2014 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,10000,1,90000,9.89,0,31.8,11,7,fals,11.53,265.93,2015 + 36 months,MORTGAGE,debt_consolidation,AR,Verified,INDIVIDUAL,12500,8,40000,7.98,0,38.9,10,14,fals,9.75,836.17,2016 + 60 months,MORTGAGE,debt_consolidation,WA,Not Verified,INDIVIDUAL,18000,10,50000,16.99,0,63.1,14,11,fals,13.66,3816.36,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,8400,1,67000,17.48,1,35.4,17,16,fals,7.62,1023.18,2012 + 36 months,RENT,debt_consolidation,CT,Verified,INDIVIDUAL,4800,8,150000,11.01,3,75.3,20,10,fals,9.49,256.37,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,14400,2,265000,17.4,1,46.2,26,15,fals,8.24,418.81,2017 + 36 months,RENT,major_purchase,CO,Verified,INDIVIDUAL,2600,10,87000,17.81,0,62.9,24,24,fals,12.79,219.61,2016 + 36 months,OWN,other,CA,Verified,INDIVIDUAL,10000,10,48000,10.9,0,96.8,10,10,tru,14.65,-8429.07,2012 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,4150,1,36500,18.04,2,32.7,27,12,fals,12.99,488.09,2015 + 36 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,26850,5,515000,8.94,0,76.5,33,19,fals,11.67,4324.1,2014 + 60 months,MORTGAGE,credit_card,UT,Verified,INDIVIDUAL,12000,8,110000,11.65,0,72.8,39,13,fals,13.67,137.92,2013 + 36 
months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,21000,null,120000,3.47,0,35.4,32,35,fals,15.1,5184.13,2013 + 60 months,RENT,debt_consolidation,MN,Verified,INDIVIDUAL,23275,10,74200,14.17,0,67,15,15,fals,14.09,8125.62,2012 + 60 months,MORTGAGE,debt_consolidation,RI,Not Verified,INDIVIDUAL,25000,7,72000,33.95,0,51.5,44,14,tru,15.31,-18455.4,2016 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,8000,10,73000,9.3,0,75.2,48,39,fals,10.99,1245.07,2014 + 60 months,OWN,debt_consolidation,AZ,Verified,INDIVIDUAL,35000,10,80000,20.79,0,11.3,34,14,tru,12.99,-20403.53,2014 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,15000,7,75000,17.42,0,56.8,32,16,fals,13.11,2248.36,2013 + 60 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,28000,6,83000,26.27,0,57.5,26,22,fals,15.31,11494.89,2013 + 36 months,MORTGAGE,debt_consolidation,NJ,Not Verified,INDIVIDUAL,15000,4,80000,13.56,0,44.4,16,12,fals,13.11,2598.86,2012 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,25000,1,80000,30.41,0,32.8,61,14,fals,17.27,4140.22,2016 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,25000,8,90000,11.97,0,90.4,18,12,fals,10.99,3218.92,2014 + 60 months,RENT,credit_card,CA,Verified,INDIVIDUAL,12000,null,38000,16.99,1,88,13,20,fals,16.55,527.72,2015 + 36 months,RENT,debt_consolidation,MA,Verified,INDIVIDUAL,19200,2,200000,5.33,0,51.7,34,13,fals,10.15,2124.95,2014 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,3000,null,11000,14.18,0,49.5,9,19,fals,13.05,641.5,2013 + 60 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,15875,10,65904,35.54,0,53.9,14,6,fals,19.99,3453.99,2015 + 60 months,MORTGAGE,credit_card,NJ,Verified,INDIVIDUAL,29000,null,58000,22.62,0,83.5,26,28,fals,20.99,7827.1,2015 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,16800,10,97000,19.37,0,95,18,12,tru,9.99,-11692.27,2015 + 36 months,RENT,debt_consolidation,OK,Not Verified,INDIVIDUAL,16000,3,82680,21.36,0,87.1,30,17,fals,12.12,3151.84,2013 + 
60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,17975,0,60000,11.67,0,51.4,20,11,fals,20.2,6565.63,2014 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,10000,null,31776,11.03,1,27.2,44,20,fals,17.99,764.19,2016 + 36 months,MORTGAGE,debt_consolidation,AL,Not Verified,INDIVIDUAL,3750,5,53000,24.08,0,92.9,28,42,fals,7.9,473.42,2012 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10000,7,58000,24.25,3,55.1,16,16,tru,14.46,-8335.53,2016 + 36 months,RENT,credit_card,NY,Verified,INDIVIDUAL,27350,10,60000,13.02,0,85.1,14,7,fals,15.41,1159.88,2015 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,10000,10,70000,12.24,1,38.8,46,15,fals,12.79,849.06,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,15000,10,70000,16.66,0,17.5,54,20,fals,12.29,2531.18,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,8000,10,45000,13.6,4,65.4,26,13,fals,12.99,965.55,2015 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,13150,10,68052,16.45,3,24.8,31,17,tru,12.49,-7616.91,2014 + 36 months,MORTGAGE,other,IL,Verified,INDIVIDUAL,1000,10,69000,30.44,0,17.4,50,22,fals,13.99,37.39,2016 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,10000,3,75000,15.23,0,39.8,11,9,tru,20.99,-6507.08,2015 + 36 months,RENT,debt_consolidation,UT,Not Verified,INDIVIDUAL,5075,0,60000,2.6,0,47.2,8,4,tru,13.65,-4384.6,2014 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,24000,1,80000,17.54,0,54.2,36,14,fals,8.9,3434.76,2013 + 36 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,16000,4,100000,13.69,0,89,10,7,fals,11.99,3140.95,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,11975,3,72000,10.83,0,61.1,12,5,fals,16.99,1258.63,2015 + 60 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,22250,6,50000,25.4,0,81.6,27,14,tru,15.31,-5291.88,2012 + 36 months,RENT,credit_card,FL,Not Verified,INDIVIDUAL,14700,10,42000,27.6,0,63,30,11,tru,11.99,-2882.97,2014 + 36 
months,RENT,other,PA,Verified,INDIVIDUAL,5000,10,52000,16.8,0,0,18,14,fals,16.29,526.47,2014 + 36 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,12400,0,48000,12.5,0,17.9,34,16,fals,6.03,1186.39,2013 + 60 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,35000,10,90000,34.37,0,95.5,43,20,fals,18.25,2177.85,2015 + 36 months,RENT,credit_card,PA,Verified,INDIVIDUAL,20000,10,59000,37.49,0,47.9,39,14,fals,13.66,2927.38,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,35000,10,116000,12.45,0,66.5,35,17,fals,14.46,1000.01,2016 + 60 months,RENT,other,MA,Verified,INDIVIDUAL,10000,10,60000,20.9,0,71.9,20,9,tru,18.99,-4631.61,2014 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,6000,6,39000,16.74,0,80.3,9,28,fals,8.9,858.7,2012 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,7000,10,67000,24.04,0,74,28,22,fals,9.99,742.13,2015 + 36 months,RENT,debt_consolidation,AL,Verified,INDIVIDUAL,6500,3,38000,16.93,0,46.9,7,11,fals,10.99,762.77,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10500,10,75000,8.4,0,97.2,15,23,fals,21,3741.15,2013 + 36 months,RENT,debt_consolidation,NV,Verified,INDIVIDUAL,6500,10,36000,18.3,1,50.4,14,15,tru,17.27,-2209.33,2013 + 36 months,OWN,debt_consolidation,MI,Not Verified,INDIVIDUAL,10000,10,52000,14.63,0,54.3,10,12,fals,10.64,1724.63,2013 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,35000,10,120000,15.48,0,85.7,27,15,fals,18.99,10498.86,2014 + 60 months,MORTGAGE,debt_consolidation,KS,Verified,INDIVIDUAL,22250,1,50000,21.65,0,59.7,14,6,tru,20.2,-12329.19,2013 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,7000,5,32000,22.01,0,27.6,20,26,fals,18.55,2179.99,2013 + 36 months,RENT,moving,RI,Verified,INDIVIDUAL,5000,7,24000,6.45,0,56.3,11,8,fals,13.99,806.73,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,0,40000,9.45,0,47.4,19,11,tru,14.09,88.15,2013 + 36 months,OWN,debt_consolidation,MI,Not 
Verified,INDIVIDUAL,10000,1,68000,16.25,0,31.3,24,23,tru,8.19,-1212.38,2014 + 60 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,26400,8,102913,15.94,0,31.4,37,13,fals,9.17,1167.92,2015 + 60 months,MORTGAGE,other,NY,Verified,INDIVIDUAL,24000,8,180000,7.39,0,38.1,38,19,fals,25.8,16484.11,2013 + 60 months,MORTGAGE,home_improvement,IL,Verified,INDIVIDUAL,16000,7,41600,10.99,0,55,15,7,fals,21.98,10294.58,2012 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,20000,6,115000,17.39,0,73.5,26,10,fals,12.39,764.24,2014 + 36 months,RENT,credit_card,UT,Not Verified,INDIVIDUAL,9600,3,41364,14.27,0,92.3,9,9,tru,13.98,-705.78,2013 + 60 months,RENT,debt_consolidation,OH,Verified,INDIVIDUAL,10000,10,52000,19.92,1,47.7,34,15,tru,15.31,-7905.91,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,5500,10,20000,21.25,0,87.2,12,30,fals,14.47,432.49,2013 + 36 months,MORTGAGE,debt_consolidation,IN,Not Verified,INDIVIDUAL,4000,8,42000,23.89,0,50.7,28,13,fals,18.24,410,2014 + 36 months,MORTGAGE,credit_card,MD,Verified,INDIVIDUAL,13000,3,75000,13.01,0,81,25,15,fals,7.49,419.88,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,13850,6,45000,34.21,0,22.5,40,29,fals,6.03,70.48,2012 + 36 months,OWN,other,OH,Verified,INDIVIDUAL,5000,2,70695,15.04,0,39.3,11,5,fals,13.49,2.22,2016 + 36 months,OWN,credit_card,CA,Not Verified,INDIVIDUAL,12000,0,36000,11.67,0,29.1,20,17,tru,7.26,-12000,2015 + 36 months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,10000,9,45000,28.59,0,41.2,47,18,fals,6.89,759.46,2015 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,35000,6,88000,21.7,0,54.9,37,11,fals,19.19,8643.97,2015 + 36 months,RENT,debt_consolidation,DE,Verified,INDIVIDUAL,24000,10,115000,24.64,0,73.7,19,27,tru,12.69,-7046.66,2015 + 60 months,MORTGAGE,home_improvement,MI,Verified,INDIVIDUAL,15000,8,35000,15.71,0,59.6,28,14,fals,17.14,3531.9,2014 + 36 
months,OWN,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,null,150000,20.58,0,42.2,53,17,fals,14.46,2204.48,2016 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,3000,10,108000,7.12,3,49.7,24,14,fals,17.1,694.11,2013 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,15950,7,137000,10.63,0,32.7,44,13,fals,6.03,929.33,2014 + 60 months,MORTGAGE,credit_card,CT,Verified,INDIVIDUAL,18600,4,40000,16.71,0,97.6,11,7,fals,20.99,11585.18,2012 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,15200,2,50000,16.39,0,72.4,26,15,fals,11.99,2972.22,2013 + 36 months,MORTGAGE,debt_consolidation,NJ,Not Verified,INDIVIDUAL,12000,10,108000,15.36,0,58.8,30,28,fals,6.49,1237.85,2014 + 60 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,10500,3,45000,11.04,0,19.2,16,27,tru,18.84,-5081.79,2015 + 36 months,OWN,credit_card,TX,Verified,INDIVIDUAL,15000,6,65000,18.49,0,65,21,15,fals,11.48,1417.25,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,19200,10,250000,2.25,1,53.3,31,39,fals,8.9,2728.72,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,6000,5,65000,8.25,0,21.1,12,12,fals,13.33,1101.78,2015 + 36 months,MORTGAGE,car,OK,Not Verified,INDIVIDUAL,6000,0,48000,9.74,0,2.1,8,9,fals,6.03,404.24,2014 + 36 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,20000,9,52000,18.07,0,49.1,25,11,fals,10.75,1377.43,2016 + 36 months,RENT,credit_card,OR,Not Verified,INDIVIDUAL,7500,1,32000,27.61,0,91.8,25,11,fals,10.99,1295.99,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,31000,10,115000,25.21,0,70.5,26,17,fals,14.16,6578.98,2014 + 36 months,OWN,credit_card,CA,Verified,INDIVIDUAL,15000,3,45000,14.09,0,39.5,13,8,fals,10.49,579.91,2017 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,10800,0,55000,38.14,0,62.1,32,21,tru,10.99,-5156.71,2015 + 36 months,RENT,credit_card,NJ,Verified,INDIVIDUAL,15000,2,46675,23.89,0,60.3,24,9,fals,13.33,2239.52,2015 + 60 months,MORTGAGE,credit_card,MO,Not 
Verified,INDIVIDUAL,18000,1,70054,26.01,0,62.8,40,16,fals,17.57,3614.96,2015 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,15000,9,34000,23.9,0,77.6,11,10,fals,16.29,767.8,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,25000,4,65200,24.94,0,57.5,28,19,fals,12.85,4900.65,2014 + 36 months,MORTGAGE,debt_consolidation,WV,Verified,INDIVIDUAL,12000,10,120000,22.34,0,59.8,40,9,fals,11.67,1855.92,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,6250,9,35655,23.58,2,55.4,15,6,tru,14.3,-1583.26,2013 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,10000,10,51000,20.05,0,37.1,20,12,fals,14.99,1974.25,2014 + 36 months,MORTGAGE,credit_card,KS,Verified,INDIVIDUAL,28000,10,120000,20.98,0,99,18,16,fals,7.26,2868.86,2015 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,11000,10,40000,23.52,0,63.1,20,14,fals,9.17,1624.04,2014 + 36 months,MORTGAGE,debt_consolidation,MD,Not Verified,INDIVIDUAL,11000,3,42000,21.17,0,82,49,25,fals,15.59,2160.27,2015 + 36 months,MORTGAGE,debt_consolidation,MT,Not Verified,INDIVIDUAL,25000,10,94000,19.65,0,47.4,31,29,fals,7.62,2053.04,2013 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,2500,null,70000,23.97,0,78.8,18,25,fals,15.61,284.88,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12000,6,250000,8.49,0,62.5,26,20,fals,6.99,949.52,2014 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,3000,null,21000,17.71,0,54,16,15,fals,11.14,542.91,2012 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,31200,10,99000,21.88,0,62.1,23,11,tru,18.55,-16171.6,2015 + 60 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,35000,7,150000,14.76,0,83.4,38,11,fals,24.5,18017.77,2014 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,28000,7,115000,3.11,0,22,22,12,fals,7.12,143.99,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,7700,10,124000,13.42,0,33.1,29,12,fals,14.99,791.54,2016 + 36 
months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,10200,2,60000,19.78,0,56,31,12,fals,13.49,528.97,2017 + 36 months,OWN,other,CA,Verified,INDIVIDUAL,5000,0,90000,9.98,0,52,29,19,fals,18.25,1522.18,2013 + 36 months,RENT,debt_consolidation,MA,Not Verified,INDIVIDUAL,7175,4,30000,20.8,0,49.7,12,5,fals,20.5,122.58,2014 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,20000,10,79364,33.63,0,94.2,16,19,tru,14.46,-12184.47,2016 + 36 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,11500,null,35000,32.81,0,78,16,29,fals,15.22,2895.35,2013 + 36 months,RENT,debt_consolidation,SC,Verified,INDIVIDUAL,20000,10,64000,15.19,0,80.4,30,23,fals,13.11,3978.86,2012 + 36 months,MORTGAGE,credit_card,MT,Not Verified,INDIVIDUAL,10000,10,108000,8.16,1,49,23,17,fals,15.61,130.08,2014 + 36 months,RENT,house,WA,Verified,INDIVIDUAL,15000,10,108000,4.4,1,31.9,22,13,fals,17.27,3255.3,2012 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,18000,10,100000,5.85,0,6.3,25,16,fals,7.26,761.53,2015 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,14000,2,98750,9.32,1,44.3,35,10,tru,10.99,-5462.53,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,20675,4,74620,9.01,1,54.3,20,19,fals,15.99,3105.14,2014 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,10000,0,50000,31.5,0,36.8,32,11,fals,7.89,548.27,2015 + 36 months,OWN,debt_consolidation,TX,Not Verified,INDIVIDUAL,13000,1,58000,17.28,5,71.7,26,8,fals,17.77,3668.96,2013 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,7800,10,72000,7.9,2,48.5,28,11,tru,20.2,-4242.58,2014 + 36 months,MORTGAGE,debt_consolidation,NE,Verified,INDIVIDUAL,8000,6,108000,25.11,2,75.9,16,21,fals,11.49,235.55,2017 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,8000,10,35000,13.51,1,34,49,21,fals,10.99,357.67,2015 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,32500,4,90012,24.6,1,74.1,18,11,fals,12.69,5162.03,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Not 
Verified,INDIVIDUAL,14000,8,83000,21.3,1,53.6,26,33,fals,10.99,2307.24,2014 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,12000,10,93384,29.27,0,65.9,35,18,fals,8.67,1491.81,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,7525,10,82000,18.47,0,73.2,21,15,fals,13.33,1011.23,2015 + 36 months,MORTGAGE,credit_card,PA,Verified,INDIVIDUAL,30000,8,73165,13.01,0,52.2,18,16,fals,12.49,4297.3,2014 + 36 months,RENT,credit_card,LA,Not Verified,INDIVIDUAL,5775,10,49400,18.68,0,46.2,47,20,fals,17.1,1660.4,2013 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,20000,0,75000,17.39,0,54.8,29,13,fals,19.05,6314.07,2013 + 60 months,MORTGAGE,debt_consolidation,MN,Verified,INDIVIDUAL,15250,10,55000,21.62,0,66.5,21,17,fals,10.64,4112.71,2013 + 36 months,MORTGAGE,credit_card,NY,Verified,INDIVIDUAL,7200,10,40000,15.72,0,88.5,18,15,fals,12.49,828.96,2014 + 36 months,MORTGAGE,other,FL,Verified,INDIVIDUAL,40000,10,102000,9.87,0,90.8,28,28,fals,8.24,1046.09,2017 + 36 months,RENT,other,NJ,Verified,INDIVIDUAL,12000,10,100000,10.29,4,20.5,36,17,fals,11.47,103.15,2016 + 36 months,MORTGAGE,other,MI,Verified,INDIVIDUAL,3500,null,42000,24.11,0,67,29,42,fals,17.57,1028.07,2014 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,7000,5,750000,2.87,1,59,15,20,fals,9.49,254.1,2016 + 36 months,OWN,credit_card,CO,Verified,INDIVIDUAL,9600,10,34000,33.18,1,30.6,17,21,fals,12.99,2042.93,2014 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,17000,4,60000,16.21,0,49.4,16,8,tru,14.49,-9401.9,2014 + 36 months,RENT,credit_card,NV,Verified,INDIVIDUAL,35000,0,100000,11.31,0,65,22,16,fals,14.46,2372.94,2016 + 36 months,RENT,debt_consolidation,SC,Not Verified,INDIVIDUAL,6000,8,64000,4.5,1,66.4,22,13,fals,15.61,582.69,2015 + 36 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,8400,2,55000,31.38,0,46,22,11,fals,8.18,1054.12,2015 + 36 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,35000,1,116000,16.55,0,72.1,28,16,fals,12.69,1255.24,2015 + 60 
months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,18825,4,51000,24.8,0,36.3,23,15,tru,22.99,-17258.93,2015 + 36 months,RENT,debt_consolidation,HI,Verified,INDIVIDUAL,35000,10,110000,15.36,0,46.8,17,20,tru,11.99,-10353.47,2013 + 60 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,20100,5,130000,33.23,0,59,37,20,fals,18.99,3811.08,2016 + 36 months,MORTGAGE,major_purchase,CA,Verified,INDIVIDUAL,3600,null,47000,26.38,5,10.6,41,22,fals,8.18,258.61,2015 + 36 months,MORTGAGE,credit_card,FL,Not Verified,INDIVIDUAL,5000,4,32000,5.96,0,47.8,15,11,fals,10.99,892.12,2013 + 60 months,MORTGAGE,debt_consolidation,ND,Not Verified,INDIVIDUAL,10000,7,50000,16.92,0,18.9,30,20,fals,12.74,172.42,2017 + 60 months,MORTGAGE,credit_card,MN,Verified,INDIVIDUAL,35000,5,267000,16.96,1,67.1,59,19,tru,23.5,-24035.75,2013 + 60 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,25000,2,76000,28.61,0,81.4,18,16,fals,14.09,8092.92,2012 + 36 months,MORTGAGE,credit_card,NC,Verified,INDIVIDUAL,33100,0,72000,26.45,0,76.1,29,16,tru,16.99,-29522.64,2016 + 36 months,RENT,other,OH,Not Verified,INDIVIDUAL,16000,5,65000,29.71,0,52.1,28,10,fals,16.2,4346.94,2013 + 60 months,MORTGAGE,home_improvement,MI,Not Verified,INDIVIDUAL,18000,10,66000,16.15,0,62.8,28,32,tru,19.99,-1312.35,2014 + 36 months,MORTGAGE,credit_card,WA,Verified,INDIVIDUAL,18000,8,48000,12.88,0,83,16,14,fals,10.99,930.09,2014 + 36 months,RENT,debt_consolidation,UT,Verified,INDIVIDUAL,5000,1,30000,25.08,0,91,10,6,fals,10.99,662.78,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10000,8,40000,1.5,0,2,16,18,fals,11.99,1955.4,2014 + 36 months,MORTGAGE,debt_consolidation,CT,Verified,INDIVIDUAL,35000,10,85000,22.47,0,30.5,39,25,fals,10.64,6036.29,2013 + 36 months,RENT,other,SC,Verified,INDIVIDUAL,6075,5,89000,9.76,0,78.2,9,11,fals,17.77,1802.03,2012 + 36 months,RENT,debt_consolidation,KY,Verified,INDIVIDUAL,16000,10,65000,11.74,0,50.3,43,20,fals,15.1,3690.56,2013 + 60 months,MORTGAGE,debt_consolidation,OH,Not 
Verified,INDIVIDUAL,29775,8,84000,29.7,0,24,33,14,fals,13.18,4070.8,2015 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,30000,10,100000,18.66,0,92.7,35,26,fals,12.49,1205.53,2014 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,19475,8,100000,20.76,0,59.7,32,9,tru,12.12,-12347.33,2012 + 60 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,15000,5,65000,29.01,2,21.4,52,11,tru,11.49,-12801.35,2016 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,16200,2,45000,15.65,1,51.9,21,20,tru,13.99,-7290.75,2015 + 36 months,RENT,major_purchase,WA,Not Verified,INDIVIDUAL,15000,3,80000,0.75,0,12.9,6,7,fals,12.69,1539.04,2015 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,12950,2,70000,13.1,2,20.8,31,11,fals,14.99,1975.15,2014 + 60 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,12600,4,35000,33.92,0,22.7,14,9,fals,18.99,1399.55,2015 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,1000,0,60000,25.18,0,101.9,9,10,fals,18.24,274.54,2014 + 60 months,MORTGAGE,debt_consolidation,SC,Verified,INDIVIDUAL,19050,10,65000,16.73,0,86.9,11,16,tru,24.99,-966.09,2014 + 36 months,MORTGAGE,debt_consolidation,NV,Not Verified,INDIVIDUAL,17500,10,48000,37.67,1,50.8,42,20,fals,16.29,4670.53,2014 + 36 months,RENT,other,NC,Verified,INDIVIDUAL,2400,null,62000,7.16,0,85.4,15,15,fals,17.1,688.77,2013 + 60 months,MORTGAGE,debt_consolidation,AR,Verified,INDIVIDUAL,30800,9,80000,21.19,1,41.5,51,30,fals,12.29,5190.06,2015 + 60 months,MORTGAGE,medical,VT,Verified,INDIVIDUAL,20000,10,75000,12.37,0,88.8,26,11,fals,20.99,5095.29,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,1000,null,11856,24.39,0,57.9,39,20,fals,18.25,305.98,2013 + 60 months,OWN,home_improvement,ME,Verified,INDIVIDUAL,32000,4,169000,16.74,0,51.1,32,15,fals,19.99,390.92,2016 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,16000,10,65000,20.77,0,66.8,21,14,tru,6.03,-4626.34,2014 + 60 
months,RENT,debt_consolidation,GA,Verified,INDIVIDUAL,35000,1,133000,16.77,2,49.2,50,19,tru,21.99,-26344.53,2015 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,5000,3,90688,13.75,0,95.9,26,14,fals,14.49,1043.69,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,16000,8,53000,13.72,0,51.8,29,20,fals,6.49,1019.06,2016 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,28100,10,74000,23.94,0,28.8,44,15,fals,19.24,6071.89,2015 + 36 months,MORTGAGE,other,FL,Verified,INDIVIDUAL,6000,10,40594,6.45,0,62.1,7,10,fals,11.99,182.47,2016 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,3000,2,92000,13.66,1,16.2,20,8,fals,6.99,140.76,2016 + 36 months,MORTGAGE,credit_card,KY,Verified,INDIVIDUAL,7000,10,80000,29.57,2,77,48,18,fals,11.53,748.59,2015 + 36 months,RENT,credit_card,NJ,Not Verified,INDIVIDUAL,24000,0,120000,27.24,0,21.8,13,5,fals,11.49,1195.95,2016 + 36 months,MORTGAGE,debt_consolidation,AL,Verified,INDIVIDUAL,8000,10,65000,17.02,0,32.5,45,15,fals,13.68,1461.76,2013 + 60 months,OWN,debt_consolidation,KY,Verified,INDIVIDUAL,19650,2,65000,13.96,0,38,36,14,fals,10.16,3853.1,2013 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,4650,3,71000,28.28,0,72.3,42,26,fals,14.64,166.38,2014 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,30000,10,72000,18.13,0,49.7,21,19,fals,14.99,2668.05,2014 + 36 months,MORTGAGE,debt_consolidation,MS,Verified,INDIVIDUAL,12000,1,55500,23.83,0,53.9,25,15,tru,9.75,-8521.51,2016 + 60 months,MORTGAGE,debt_consolidation,UT,Verified,INDIVIDUAL,35000,4,105000,16.94,0,84.2,28,14,fals,25.8,13001.09,2015 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,6000,6,60000,20.38,2,48.8,25,9,fals,12.99,477.11,2013 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,30400,6,186000,14.72,0,72,36,24,fals,20.99,10189.44,2014 + 36 months,OWN,debt_consolidation,MS,Not Verified,INDIVIDUAL,12000,4,33560,29.72,0,51,17,7,fals,7.69,1341.9,2014 + 60 
months,MORTGAGE,debt_consolidation,RI,Verified,INDIVIDUAL,16000,10,79500,22.22,0,74.4,37,15,tru,15.8,-2807.96,2013 + 36 months,MORTGAGE,debt_consolidation,WY,Verified,INDIVIDUAL,10000,null,72000,27.83,2,68.7,37,22,fals,13.53,1220.48,2013 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,5000,8,38000,19.33,0,69.6,13,12,fals,9.67,721.11,2013 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,16000,5,95000,18.85,3,65.5,27,26,fals,11.99,1455.35,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,12800,10,85000,5.56,0,95.2,7,16,fals,13.53,2844.05,2014 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12000,0,30000,18,0,67.1,11,7,fals,14.85,2198.81,2015 + 60 months,MORTGAGE,credit_card,WA,Verified,INDIVIDUAL,18950,10,53000,28.66,0,43.3,21,12,fals,21,331.95,2013 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,25000,1,60000,11.94,0,11.4,35,14,fals,13.33,4206.64,2015 + 60 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,32000,5,175000,27.45,0,84.1,32,15,tru,18.99,-21278.43,2015 + 36 months,RENT,credit_card,TX,Not Verified,INDIVIDUAL,10000,8,45000,23.58,0,62.8,19,8,fals,14.99,1004.94,2015 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,17625,10,110000,5.88,0,41.6,27,14,fals,18.25,3038.01,2013 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,10000,10,32000,13.35,2,40.7,13,12,fals,15.31,2377.11,2012 + 60 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,17325,9,76000,15.11,0,97.1,22,18,tru,25.57,-12188.24,2015 + 36 months,RENT,credit_card,MO,Not Verified,INDIVIDUAL,20000,10,40000,30.63,0,56.7,44,8,tru,10.99,-10082.88,2015 + 60 months,MORTGAGE,debt_consolidation,KY,Verified,INDIVIDUAL,25000,5,135456,31.45,0,35.7,50,25,tru,13.66,-13195.91,2015 + 36 months,OWN,credit_card,CA,Verified,INDIVIDUAL,27500,2,55000,23.14,0,62.8,19,15,fals,7.89,2575.88,2015 + 36 months,MORTGAGE,debt_consolidation,MD,Not Verified,INDIVIDUAL,10000,null,49000,24.2,0,30.8,29,11,fals,9.17,1024.71,2015 + 36 
months,MORTGAGE,debt_consolidation,KY,Verified,INDIVIDUAL,9950,2,30000,12.4,0,32.8,16,8,fals,8.9,957.96,2013 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,4500,5,35000,12.62,4,30.7,25,12,fals,11.55,845.96,2013 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,12275,4,60000,18.98,0,1.4,33,20,fals,12.99,1508.4,2014 + 36 months,MORTGAGE,debt_consolidation,MI,Not Verified,INDIVIDUAL,12000,10,75000,18.72,1,75.2,23,18,fals,13.67,2695.52,2012 + 60 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,21000,10,80000,20.09,0,60,9,24,fals,14.99,2130.68,2016 + 36 months,OWN,debt_consolidation,NJ,Verified,INDIVIDUAL,6800,3,20000,26.53,0,58.1,14,5,tru,17.99,-5086.13,2016 + 36 months,OWN,debt_consolidation,PA,Verified,INDIVIDUAL,9000,10,39000,18.15,0,43.9,17,11,fals,10.99,471.72,2015 + 60 months,MORTGAGE,credit_card,GA,Verified,INDIVIDUAL,32000,10,325000,10.44,0,59.9,38,20,fals,17.56,3171.84,2013 + 36 months,OWN,debt_consolidation,AZ,Verified,INDIVIDUAL,13250,1,60000,14.76,1,57.6,31,14,fals,19.52,3413.43,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,20000,3,75000,29.33,0,34,43,13,fals,11.44,2242.45,2015 + 36 months,RENT,debt_consolidation,MI,Not Verified,INDIVIDUAL,3825,6,38000,13.48,0,75,7,15,fals,18.85,842.7,2013 + 36 months,RENT,credit_card,GA,Verified,INDIVIDUAL,10075,6,35000,28.37,0,87.1,19,12,fals,17.77,2998.78,2012 + 60 months,MORTGAGE,other,IN,Verified,INDIVIDUAL,14275,7,42000,21.23,0,54.5,25,13,tru,28.99,-6495.88,2015 + 60 months,MORTGAGE,debt_consolidation,CT,Verified,INDIVIDUAL,18000,10,72000,13.63,0,45.8,30,14,fals,16.99,3319.61,2014 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,20000,9,96000,20.88,0,44.8,29,17,tru,15.31,-13738.27,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,15000,6,185000,10.1,0,28.7,33,15,fals,10.64,1516.94,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,27325,7,72000,19.35,0,81.1,21,13,fals,19.47,6823.38,2014 + 60 
months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,21600,8,71000,12.2,0,81.9,15,8,tru,17.57,-1479.61,2014 + 36 months,RENT,vacation,GA,Verified,INDIVIDUAL,4225,10,82000,19.05,0,55.1,28,30,fals,14.99,361.49,2016 + 36 months,RENT,debt_consolidation,MI,Not Verified,INDIVIDUAL,3000,2,30000,9.24,0,37,38,24,fals,17.57,872.18,2014 + 60 months,RENT,debt_consolidation,UT,Verified,INDIVIDUAL,22750,7,58000,30.63,0,92,24,22,tru,21.49,5902.54,2012 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,16000,2,37558,15.08,1,28.4,37,16,fals,11.55,4796.89,2013 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,35000,10,200000,8.23,1,35.8,17,27,fals,17.86,5099.06,2015 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,6000,0,40000,15.06,0,38.3,19,9,fals,9.67,521.6,2014 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,3000,3,18000,13.87,0,18.4,16,8,fals,10.99,516.93,2014 + 36 months,RENT,major_purchase,MO,Not Verified,INDIVIDUAL,30000,5,400000,16.78,0,38.5,48,15,fals,19.99,714.06,2016 + 60 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,15000,2,40000,6.54,0,72.1,17,16,tru,12.99,-7814.27,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,14825,9,92000,23.79,0,45.7,24,13,fals,16.29,1648.55,2014 + 60 months,MORTGAGE,credit_card,MI,Verified,INDIVIDUAL,12000,10,96200,19,0,111.1,29,20,fals,13.99,1161.51,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,15750,7,35000,12.77,1,26.6,23,12,fals,20.49,2345.41,2015 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,3000,10,95000,17.96,0,51,22,11,fals,11.49,101.14,2017 + 36 months,OWN,credit_card,NY,Not Verified,INDIVIDUAL,8925,1,40000,21.33,6,65.8,22,11,fals,7.69,323.09,2014 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,20000,6,100000,29.26,0,81.4,43,23,fals,14.33,2064.01,2013 + 36 months,OWN,debt_consolidation,SD,Verified,INDIVIDUAL,8000,2,30000,36.4,1,80.6,33,24,fals,12.99,430.16,2016 + 36 
months,OWN,credit_card,CA,Verified,INDIVIDUAL,9500,3,60000,14.62,0,84.5,9,13,fals,10.78,1251.11,2016 + 36 months,MORTGAGE,credit_card,IN,Not Verified,INDIVIDUAL,9500,2,66000,17.02,0,4.2,33,12,tru,6.03,-9210.86,2014 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,31350,2,145000,9.57,1,26.5,35,17,fals,17.57,6955.16,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,20000,10,85000,14.47,0,68.1,29,14,fals,12.49,4380.99,2014 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,15000,10,50000,13.08,0,54.9,25,15,fals,11.99,2794.91,2014 + 36 months,RENT,debt_consolidation,SC,Not Verified,INDIVIDUAL,7000,10,30000,10.96,0,76.2,10,12,fals,13.35,1460.69,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,5000,3,45000,15.18,3,30.2,33,13,fals,7.69,614.83,2014 + 60 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,30000,10,114200,7.85,0,84.4,20,30,fals,14.65,6868.08,2015 + 36 months,RENT,debt_consolidation,OR,Verified,INDIVIDUAL,12800,10,65000,29.6,0,42.5,49,11,tru,13.99,-3132,2015 + 36 months,MORTGAGE,home_improvement,VA,Verified,INDIVIDUAL,7450,2,99599,19.8,0,44.9,39,12,fals,12.74,168.46,2017 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,7625,1,58000,18.17,0,57,31,6,fals,16.99,2034.99,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,5600,6,60000,19.2,0,88,15,14,fals,13.99,726.17,2015 + 36 months,OWN,debt_consolidation,VA,Verified,INDIVIDUAL,4000,null,20000,6.84,0,30,9,23,fals,7.9,505.76,2013 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,5000,0,24000,28.9,0,54.3,20,11,fals,14.16,1041.88,2014 + 36 months,MORTGAGE,home_improvement,FL,Not Verified,INDIVIDUAL,5000,6,64000,8.63,0,25.1,18,11,fals,9.49,129.76,2016 + 36 months,OWN,debt_consolidation,TX,Not Verified,INDIVIDUAL,6350,10,40000,17.22,0,3.8,17,10,fals,12.12,709.44,2012 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,9000,0,60000,27.26,0,94.8,25,17,tru,12.29,-4722.9,2015 + 60 
months,RENT,major_purchase,TX,Verified,INDIVIDUAL,35000,5,75000,11.35,0,4,14,15,tru,21.97,-26686.83,2016 + 60 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,15000,7,51500,26.59,0,56,43,13,tru,22.7,2651.76,2013 + 36 months,MORTGAGE,debt_consolidation,NY,Not Verified,INDIVIDUAL,10125,10,50000,15.41,0,67.1,15,14,tru,15.59,-5895.5,2015 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,25000,10,73000,24.12,0,57.1,42,16,fals,15.1,6242.87,2013 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,30000,9,80000,33.25,1,51.2,27,10,tru,15.61,-12412.38,2015 + 36 months,MORTGAGE,credit_card,WI,Not Verified,INDIVIDUAL,9000,10,112801,11.05,0,29.8,27,32,fals,6.92,610.74,2015 + 36 months,OWN,debt_consolidation,MI,Verified,INDIVIDUAL,6400,null,16000,33.61,0,25.6,26,15,fals,11.99,1115.22,2015 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,16000,2,43000,13.76,0,53.5,34,12,fals,12.12,3164.45,2013 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,7000,3,47000,15.6,0,17.3,18,8,fals,6.03,254.2,2014 + 36 months,MORTGAGE,debt_consolidation,LA,Verified,INDIVIDUAL,35000,4,170000,21.26,0,62,42,20,fals,11.53,5272.58,2015 + 36 months,MORTGAGE,home_improvement,MI,Verified,INDIVIDUAL,9500,10,66000,7.92,0,22,30,17,fals,9.17,1181.95,2015 + 60 months,MORTGAGE,debt_consolidation,WV,Verified,INDIVIDUAL,20000,6,98800,24.43,0,81.6,44,19,fals,10.99,4799.45,2014 + 60 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,13000,0,36000,27.57,0,78.6,24,13,fals,12.39,3139.57,2015 + 36 months,MORTGAGE,debt_consolidation,SC,Not Verified,INDIVIDUAL,10000,6,40000,9.39,0,45.8,25,16,fals,10.15,541.85,2014 + 36 months,MORTGAGE,other,WA,Verified,INDIVIDUAL,6800,3,20000,24.07,0,28.2,23,10,fals,19.19,791.28,2015 + 60 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,11200,10,115000,15.8,0,61.4,70,19,fals,20.49,2877.52,2015 + 36 months,MORTGAGE,credit_card,RI,Verified,INDIVIDUAL,28000,8,94000,18.6,0,75.3,36,23,fals,12.99,5958.62,2013 + 36 
months,MORTGAGE,debt_consolidation,IL,Not Verified,INDIVIDUAL,10700,3,32160,16.12,0,56.7,24,11,fals,11.47,710.73,2016 + 36 months,MORTGAGE,credit_card,NY,Not Verified,INDIVIDUAL,8000,10,50000,37.06,0,21.9,25,12,fals,7.26,533.96,2015 + 60 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,16800,9,42000,19.37,0,67.1,23,21,tru,12.59,-12939.01,2015 + 60 months,MORTGAGE,other,NV,Verified,INDIVIDUAL,35000,2,100000,23.48,0,75.1,21,13,fals,21.99,4623.84,2014 + 36 months,MORTGAGE,other,IN,Verified,INDIVIDUAL,4200,9,92000,18.16,0,97.6,12,12,fals,16.55,776.66,2015 + 36 months,MORTGAGE,debt_consolidation,UT,Not Verified,INDIVIDUAL,12150,8,69192,11.9,0,65,40,18,fals,13.98,2794.96,2014 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,35000,0,163000,22.88,3,67,49,26,fals,18.25,597.45,2015 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,22800,10,60000,25.49,0,64.8,12,20,fals,14.85,1893.98,2016 + 36 months,RENT,debt_consolidation,HI,Verified,INDIVIDUAL,8000,6,67000,25.78,0,56.4,17,11,fals,11.99,698.9,2016 + 36 months,MORTGAGE,credit_card,NY,Not Verified,INDIVIDUAL,28000,7,122000,15.13,0,17.2,58,20,fals,5.32,1529.92,2015 + 36 months,MORTGAGE,credit_card,MN,Not Verified,INDIVIDUAL,15200,10,68000,18.04,0,53.4,27,17,fals,7.12,1651.14,2014 + 36 months,RENT,other,LA,Verified,INDIVIDUAL,7500,10,45000,5.11,0,38.2,6,23,fals,18.24,2323.03,2014 + 60 months,MORTGAGE,debt_consolidation,NH,Verified,INDIVIDUAL,27300,10,70000,18.05,0,82.4,29,12,tru,21.98,-11279.36,2012 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,10400,2,55000,33.84,6,89,31,11,fals,6.24,874.12,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,16000,10,98000,9.53,0,70.8,14,27,fals,10.99,429.2,2014 + 60 months,RENT,credit_card,CA,Verified,INDIVIDUAL,20000,10,70000,24.16,0,51,16,12,fals,10.99,814.39,2015 + 36 months,MORTGAGE,credit_card,NM,Verified,INDIVIDUAL,8000,10,36000,22.5,0,30.2,19,16,fals,8.18,641.32,2015 + 36 
months,MORTGAGE,credit_card,WI,Verified,INDIVIDUAL,10000,10,59101,31.47,1,3.2,30,13,fals,9.17,1001.29,2014 + 36 months,MORTGAGE,debt_consolidation,SC,Not Verified,INDIVIDUAL,18000,4,85000,13.65,0,56.7,27,14,fals,13.99,607.75,2015 + 36 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,12000,5,38000,11.56,0,41.4,11,10,fals,13.99,804.88,2016 + 60 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,19950,10,50000,17.52,0,60.2,14,14,tru,17.57,-10331.06,2015 + 60 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,18000,7,50000,21.29,1,40.7,36,16,tru,18.84,-13870.35,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,16700,10,67392,22.49,0,84.6,12,12,tru,19.99,-12700.14,2012 + 36 months,OWN,credit_card,CA,Not Verified,INDIVIDUAL,13625,5,36000,27.34,0,46.5,9,4,tru,13.98,-12399.8,2014 + 36 months,MORTGAGE,other,PA,Not Verified,INDIVIDUAL,7600,10,73580,17.37,0,37.7,29,15,fals,11.99,905.79,2016 + 36 months,RENT,debt_consolidation,WI,Verified,INDIVIDUAL,25000,6,160000,15.61,0,40.4,31,11,fals,6.49,1726.14,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,35000,10,170000,8.32,0,51,38,45,fals,9.17,979.87,2015 + 60 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,20000,3,73000,18.63,1,72.3,26,16,tru,15.61,-10007.39,2015 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,6000,null,24000,21.95,0,36.6,21,31,fals,8.18,647.98,2015 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,28000,10,189895,15.04,0,8.5,29,16,fals,5.93,2141.09,2015 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,24000,1,110000,10.82,0,58.7,31,12,fals,8.9,3434.76,2014 + 36 months,MORTGAGE,debt_consolidation,TN,Not Verified,INDIVIDUAL,11000,6,64000,17.96,0,5.3,39,15,fals,9.49,523.04,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,5000,10,37000,26.34,0,38.4,24,19,fals,13.99,199.62,2017 + 36 months,RENT,debt_consolidation,MO,Verified,INDIVIDUAL,6250,4,25000,28.08,0,75.7,17,6,fals,15.8,1630.34,2012 + 60 
months,OWN,credit_card,PA,Verified,INDIVIDUAL,24000,9,66900,31.2,0,82.9,43,15,fals,12.49,6347.57,2014 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,9600,1,50000,13.9,2,48,34,12,fals,13.11,1757.69,2013 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,6400,2,40000,18.21,0,68.5,22,11,tru,16.77,-4807.85,2012 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,12000,4,105000,19.41,0,71,28,18,fals,12.12,2373.35,2012 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,14000,10,123000,13.61,0,12.7,35,15,fals,10.99,3570.49,2014 + 36 months,MORTGAGE,debt_consolidation,MN,Verified,INDIVIDUAL,21750,0,110000,27.7,0,79.2,27,21,fals,18.25,209.5,2016 + 36 months,RENT,major_purchase,MN,Verified,INDIVIDUAL,15000,10,45240,25.17,0,83,15,11,fals,13.67,1938.36,2016 + 60 months,MORTGAGE,credit_card,MT,Verified,INDIVIDUAL,25000,10,75000,21.68,0,56.8,34,17,fals,14.49,2999.66,2014 + 36 months,RENT,debt_consolidation,NM,Not Verified,INDIVIDUAL,18050,3,85000,12.93,0,32,21,12,tru,13.49,-14503.81,2016 + 36 months,MORTGAGE,credit_card,MN,Not Verified,INDIVIDUAL,20000,3,115000,14.08,0,63.7,40,18,fals,7.62,127.46,2013 + 36 months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,19950,10,41600,33.73,0,51.3,22,11,fals,15.99,1047.77,2016 + 60 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,25200,5,60000,7.78,1,58.4,5,5,tru,21.18,-16983.83,2016 + 60 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,18325,10,115000,22.6,0,33.5,32,10,fals,30.84,1600.83,2016 + 36 months,OWN,home_improvement,NY,Verified,INDIVIDUAL,28000,10,205000,21.47,0,21,32,14,fals,15.99,410.41,2016 + 36 months,RENT,credit_card,UT,Verified,INDIVIDUAL,6000,7,40000,20.4,0,50,26,30,fals,7.69,330.41,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,15000,null,64000,9.43,0,63.1,9,11,fals,12.85,161.41,2014 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,8725,null,23000,17.95,0,74.4,28,47,tru,16.99,-2193.37,2014 + 36 
months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,5000,8,58000,29.3,0,33.3,21,10,fals,10.49,317.77,2014 + 36 months,MORTGAGE,debt_consolidation,MD,Verified,INDIVIDUAL,7000,2,132000,8.77,0,84.8,33,22,tru,12.74,-6752.61,2016 + 36 months,MORTGAGE,other,OR,Verified,INDIVIDUAL,3000,6,35000,32.47,1,29.2,26,11,tru,20.31,-1505.16,2013 + 60 months,OWN,debt_consolidation,KY,Verified,INDIVIDUAL,30000,8,225000,7.42,0,29.6,36,31,fals,7.39,1798.2,2016 + 36 months,OWN,medical,NJ,Verified,INDIVIDUAL,2000,10,65000,8.49,0,70.3,36,27,fals,11.44,0.88,2016 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,5500,10,42000,21.63,2,47.3,24,26,fals,10.64,29.27,2015 + 60 months,MORTGAGE,medical,FL,Verified,INDIVIDUAL,17000,9,65712,26.13,0,54.6,22,12,tru,23.63,962.34,2013 + 36 months,MORTGAGE,debt_consolidation,MD,Verified,INDIVIDUAL,5000,10,80000,9.17,3,97.4,30,24,fals,18.25,1470.13,2013 + 36 months,MORTGAGE,credit_card,AR,Not Verified,INDIVIDUAL,17000,5,90000,14.69,0,57.7,39,17,fals,14.49,1042.04,2016 + 36 months,RENT,debt_consolidation,MA,Verified,INDIVIDUAL,8000,4,75000,15.1,1,11.9,18,8,fals,13.99,313.17,2016 + 60 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,11000,8,47000,15.5,0,43.7,23,8,fals,20.99,1337.37,2015 + 36 months,MORTGAGE,credit_card,TN,Verified,INDIVIDUAL,13200,10,65000,26.2,0,93.9,18,11,fals,9.17,1379.05,2015 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,10000,3,64000,23.18,0,48.6,25,10,fals,11.39,431.12,2016 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,18600,6,78000,29.02,0,49,17,5,fals,19.48,311.71,2016 + 36 months,MORTGAGE,credit_card,MO,Verified,INDIVIDUAL,24000,10,84200,18.8,0,40.9,13,24,fals,7.26,2075.59,2015 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,7000,6,60000,9.42,0,23.7,14,20,fals,8.9,1001.77,2013 + 36 months,MORTGAGE,other,OK,Not Verified,INDIVIDUAL,2750,1,90260,9.09,0,88.8,26,27,fals,11.53,451.3,2015 + 36 
months,RENT,debt_consolidation,UT,Verified,INDIVIDUAL,7800,0,23000,18.94,0,77.9,28,11,fals,15.99,168.63,2017 + 36 months,MORTGAGE,home_improvement,IL,Not Verified,INDIVIDUAL,9000,0,99400,17.07,0,38.8,34,36,fals,14.64,1623.09,2014 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,30000,8,150000,5.35,0,62.5,29,12,tru,11.49,-26003.66,2016 + 36 months,RENT,moving,FL,Not Verified,INDIVIDUAL,7175,10,40000,18.72,0,54,10,10,tru,24.5,-5474.66,2014 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,16000,5,48800,13.92,0,51.3,30,11,fals,7.89,1092.43,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30000,0,86000,24.28,2,69.8,46,21,fals,16.99,3787.08,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,25000,3,140000,10.26,0,16.8,22,4,fals,9.17,2860.5,2015 + 60 months,RENT,debt_consolidation,MA,Verified,INDIVIDUAL,14000,2,60000,12.34,0,67.6,8,8,fals,13.33,953.31,2015 + 36 months,OWN,debt_consolidation,IL,Not Verified,INDIVIDUAL,3000,10,30000,26.12,0,63.6,25,8,tru,16.29,-1329.63,2012 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,1,48000,14.53,0,41.9,16,9,fals,12.69,2390.7,2015 + 60 months,RENT,debt_consolidation,TN,Verified,INDIVIDUAL,18700,1,52000,12.69,0,28.9,14,17,fals,21.99,4853.19,2014 + 36 months,OWN,credit_card,MI,Verified,INDIVIDUAL,17625,2,39187,31.38,0,71.6,9,9,fals,13.98,4054.54,2014 + 36 months,MORTGAGE,debt_consolidation,IL,Not Verified,INDIVIDUAL,6000,8,52000,13.85,0,63.2,26,13,fals,11.55,1127.95,2013 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,10600,10,103586,4.12,0,37.4,42,21,fals,16.99,3498.01,2013 + 36 months,MORTGAGE,credit_card,MA,Not Verified,INDIVIDUAL,14000,8,71000,13.56,0,66,16,16,fals,11.44,2507.46,2015 + 36 months,RENT,debt_consolidation,OR,Not Verified,INDIVIDUAL,16750,10,70000,17.88,1,85.6,40,13,fals,18.25,4643.65,2013 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,25000,null,100000,5.35,0,23.8,29,24,fals,5.32,14.93,2016 + 36 
months,RENT,credit_card,VA,Verified,INDIVIDUAL,5000,4,180000,11.39,1,60.8,12,18,fals,11.49,250.18,2017 + 36 months,MORTGAGE,home_improvement,MD,Verified,INDIVIDUAL,2500,10,58700,7.99,0,1.8,60,21,fals,14.46,20.09,2016 + 36 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,19000,4,38000,21.38,0,28,39,10,fals,16.29,1910.95,2016 + 36 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,16000,1,50000,31.04,0,24.1,22,10,fals,8.18,1094.68,2015 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,13500,6,65300,12.44,3,45.2,36,37,tru,14.16,-8509.39,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,12000,10,127000,16.69,0,63.7,30,16,fals,7.89,717.97,2015 + 36 months,OWN,debt_consolidation,SC,Verified,INDIVIDUAL,35000,7,100000,13.61,1,75.6,31,16,tru,12.99,-27587.58,2015 + 60 months,RENT,debt_consolidation,OK,Verified,INDIVIDUAL,20000,10,58000,18.36,1,52.1,17,14,tru,18.54,-13997.53,2015 + 36 months,OWN,home_improvement,CA,Verified,INDIVIDUAL,30000,10,211000,7.3,0,42.8,25,13,fals,14.99,7556.22,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,30000,10,125000,7.9,0,43.8,31,19,fals,14.3,3739.73,2013 + 36 months,MORTGAGE,credit_card,PA,Verified,INDIVIDUAL,4800,6,65000,22.88,0,63.8,15,7,fals,13.33,733.48,2015 + 36 months,MORTGAGE,credit_card,AL,Verified,INDIVIDUAL,25000,10,75000,25.38,0,59.8,36,14,fals,13.05,5074.38,2013 + 36 months,RENT,credit_card,SC,Not Verified,INDIVIDUAL,7000,2,45000,27.73,1,49,17,6,fals,10.15,756.11,2014 + 36 months,RENT,credit_card,IN,Not Verified,INDIVIDUAL,15000,1,52000,24.16,0,74.8,15,21,fals,18.75,4726.02,2013 + 36 months,RENT,small_business,NY,Verified,INDIVIDUAL,3075,2,30000,2,0,17.5,5,5,fals,15.1,155.19,2013 + 60 months,RENT,small_business,AZ,Verified,INDIVIDUAL,12000,2,110000,8.79,0,41.3,20,14,tru,21.67,-953.67,2015 + 36 months,OWN,debt_consolidation,NC,Verified,INDIVIDUAL,30000,4,105000,8.16,0,9.9,23,20,fals,7.99,1551.09,2016 + 36 
months,RENT,other,FL,Verified,INDIVIDUAL,2000,4,40000,15.6,0,56.3,12,8,tru,21.97,-1426.1,2016 + 60 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,15000,9,33000,2.98,0,35.3,13,8,fals,22.74,826,2017 + 60 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,28000,10,148000,13.06,0,83,25,9,tru,24.08,-332.24,2014 + 36 months,MORTGAGE,credit_card,CO,Verified,INDIVIDUAL,20000,0,105000,7.67,0,65,24,18,fals,7.9,2528.97,2014 + 36 months,OWN,debt_consolidation,AL,Verified,INDIVIDUAL,5600,10,27342,38.85,0,38.5,26,10,fals,11.22,440.88,2015 + 36 months,MORTGAGE,debt_consolidation,MA,Not Verified,INDIVIDUAL,10000,10,50000,23.07,1,40.5,30,21,fals,9.75,1034.47,2016 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,13725,10,97800,17.14,0,64,34,17,fals,8.9,1876.83,2012 + 36 months,RENT,debt_consolidation,MO,Verified,INDIVIDUAL,11850,10,35000,18.45,0,46.7,17,13,fals,19.52,3901.19,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,6000,10,96000,11.46,0,44.4,23,21,fals,5.32,328.9,2016 + 60 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,10375,4,45600,31.89,0,67.3,24,12,tru,17.86,-7927.74,2015 + 36 months,RENT,debt_consolidation,NC,Not Verified,INDIVIDUAL,10000,0,61000,27.86,0,85.4,28,22,tru,22.2,-7137.41,2013 + 36 months,RENT,debt_consolidation,OH,Verified,INDIVIDUAL,10000,0,105000,12.71,3,77.8,50,14,fals,16.99,1047.18,2016 + 36 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,18225,10,52000,28.45,0,77.5,18,15,tru,12.99,808.69,2013 + 60 months,MORTGAGE,major_purchase,NH,Verified,INDIVIDUAL,36000,10,92000,20.3,0,19.8,27,16,fals,13.99,2477.79,2016 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,15000,8,95000,15.3,0,38.7,44,17,fals,7.9,1677.16,2014 + 60 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,21200,2,75000,24.21,2,68.2,28,16,fals,15.41,4823.97,2015 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,7400,1,56000,10.82,0,87.5,12,7,fals,9.17,828.96,2015 + 36 
months,RENT,house,NJ,Verified,INDIVIDUAL,4575,10,65000,3.19,0,24.5,13,7,fals,12.99,962.89,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30225,8,128000,18.09,0,26,24,11,tru,16.29,-20588.96,2016 + 36 months,MORTGAGE,debt_consolidation,OR,Not Verified,INDIVIDUAL,13925,10,46000,27.92,0,58.2,27,21,fals,12.29,2675.65,2015 + 60 months,OWN,credit_card,FL,Verified,INDIVIDUAL,15000,10,65000,21.82,0,36.9,28,10,fals,14.49,1509.65,2014 + 36 months,OWN,debt_consolidation,CO,Verified,INDIVIDUAL,18000,10,126000,14.43,0,21.8,31,29,fals,7.12,1562.26,2014 + 60 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,12950,10,36000,18,0,71.4,17,13,tru,20.99,-10810.63,2015 + 36 months,OWN,credit_card,NC,Verified,INDIVIDUAL,10000,10,48000,15.38,0,53.6,17,24,fals,10.64,1724.63,2013 + 36 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,20000,4,75000,14.38,0,48.4,25,16,fals,10.74,2244.4,2012 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,24000,2,110000,15.87,3,50.8,38,18,tru,14.33,-13183.01,2013 + 36 months,RENT,medical,CA,Not Verified,INDIVIDUAL,2100,1,36000,4.4,0,4.8,4,6,tru,23.5,-1100.28,2013 + 36 months,OWN,credit_card,MN,Not Verified,INDIVIDUAL,8100,7,45000,28.75,0,77,46,19,fals,11.44,849.23,2015 + 36 months,RENT,small_business,VA,Verified,INDIVIDUAL,9500,0,55000,14.84,2,33,14,7,fals,22.95,3565.05,2013 + 60 months,MORTGAGE,other,NJ,Not Verified,INDIVIDUAL,10000,10,130000,17.08,0,35.1,30,13,fals,13.35,1826.01,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,5000,0,24000,25.15,0,55.3,10,6,fals,6.49,481.36,2014 + 60 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,35000,5,75000,30.01,0,14,42,20,fals,20.75,8360.95,2016 + 36 months,RENT,credit_card,OH,Verified,INDIVIDUAL,11875,10,34000,17.23,0,69.8,19,13,fals,14.64,2043.48,2014 + 60 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,15125,10,45000,35.95,0,80.9,14,10,fals,21.97,631.64,2016 + 36 months,MORTGAGE,home_improvement,TX,Not 
Verified,INDIVIDUAL,8000,2,128000,2.38,0,5.9,30,14,fals,15.31,1803.95,2013 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,35000,null,90000,24.76,0,75.6,33,32,tru,25.29,-25270.39,2016 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,24000,null,73000,18.79,0,63.5,32,39,fals,8.39,2240.08,2014 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,15000,8,90000,8.4,1,34.1,11,13,fals,8.59,875.29,2016 + 36 months,OWN,other,VA,Verified,INDIVIDUAL,20000,null,52384,10.29,1,73.6,17,47,tru,17.56,-15688.1,2013 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,16000,4,92000,14.92,0,69.3,42,18,tru,17.27,-13709.6,2012 + 36 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,12000,5,90000,12.85,0,23.6,9,6,fals,7.9,1422.66,2013 + 60 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,23400,4,93000,19.81,0,25,26,17,fals,9.75,762.94,2016 + 36 months,OWN,other,CA,Not Verified,INDIVIDUAL,10000,4,60000,34.34,0,19.2,37,14,fals,6.89,829.57,2015 + 60 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,12600,2,35000,31.93,0,67,14,7,tru,19.99,-7444.31,2015 + 36 months,MORTGAGE,credit_card,UT,Verified,INDIVIDUAL,10000,0,58000,9.41,0,82.2,9,8,fals,12.12,1977.77,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,7900,3,22000,37.32,0,80,13,11,fals,17.57,2097.61,2015 + 36 months,RENT,credit_card,GA,Verified,INDIVIDUAL,8400,5,66000,12.3,0,48,9,8,fals,8.9,1040.03,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,10000,10,85000,28.8,0,27.4,29,14,tru,6.39,-4170.82,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,10625,2,31157.85,16.87,0,83.7,10,11,fals,13.05,2272.11,2013 + 36 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,2950,10,30000,19.76,0,10.6,9,7,fals,17.86,542.23,2015 + 60 months,RENT,credit_card,DE,Verified,INDIVIDUAL,17600,3,65000,5.39,0,43.3,4,3,fals,18.55,4542.08,2015 + 60 
months,MORTGAGE,credit_card,NC,Verified,INDIVIDUAL,28000,10,210000,15.78,0,36.6,52,23,fals,13.11,7906.01,2013 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,20000,5,89000,17.5,3,68.4,39,17,fals,15.88,5285.42,2013 + 36 months,MORTGAGE,debt_consolidation,KS,Not Verified,INDIVIDUAL,8000,10,120000,30.98,0,85.2,21,22,tru,12.49,-512.75,2014 + 36 months,RENT,house,TX,Not Verified,INDIVIDUAL,14400,10,54000,18.02,9,43.6,33,20,fals,17.57,3082.6,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,14000,10,43000,11.86,0,78.7,10,21,fals,15.61,4593.15,2014 + 60 months,MORTGAGE,debt_consolidation,OR,Verified,INDIVIDUAL,27650,1,63000,20.5,0,48.8,17,13,fals,16.99,10952.56,2014 + 36 months,MORTGAGE,major_purchase,TN,Verified,INDIVIDUAL,8000,null,52476,27.19,0,29.5,33,18,fals,19.99,670.85,2016 + 36 months,RENT,major_purchase,NY,Verified,INDIVIDUAL,6000,6,200000,3.24,0,77,10,11,tru,11.14,912.3,2013 + 36 months,MORTGAGE,debt_consolidation,TN,Not Verified,INDIVIDUAL,12500,0,74000,22.77,0,44.9,30,13,fals,7.49,1366.57,2014 + 36 months,OWN,debt_consolidation,NM,Not Verified,INDIVIDUAL,7450,9,22000,19.47,0,65.1,12,7,fals,7.9,843.57,2012 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,10625,10,35000,16.91,0,82.7,15,38,tru,20.5,-2090.9,2013 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,14000,5,62000,9.63,0,50,19,30,fals,15.88,366.79,2013 + 36 months,OWN,credit_card,MI,Verified,INDIVIDUAL,15000,1,56000,12.69,0,16.5,51,15,fals,7.89,127.32,2015 + 36 months,MORTGAGE,car,MI,Verified,INDIVIDUAL,24500,null,64000,3.99,0,20.3,19,37,fals,6.03,243.1,2013 + 36 months,MORTGAGE,other,CO,Verified,INDIVIDUAL,2000,2,40000,24.72,0,80.4,27,10,fals,12.79,135.41,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,9600,9,27000,26.58,0,82.3,11,8,tru,13.99,-5992.93,2015 + 36 months,RENT,credit_card,NY,Verified,INDIVIDUAL,20050,3,45000,30.35,0,64.4,32,4,fals,19.05,6100.05,2013 + 36 
months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12000,6,135000,11.34,0,39.1,28,18,fals,10.49,806.18,2014 + 36 months,OWN,debt_consolidation,CT,Verified,INDIVIDUAL,4000,6,60000,14.06,1,56.4,9,21,fals,13.49,85.91,2017 + 36 months,RENT,credit_card,WA,Verified,INDIVIDUAL,19500,8,70000,20.45,0,70.7,13,8,fals,10.16,3204.31,2013 + 60 months,MORTGAGE,small_business,CA,Verified,INDIVIDUAL,26000,7,190000,12.96,0,46.4,32,23,fals,11.14,6458.1,2012 + 36 months,OWN,credit_card,SC,Verified,INDIVIDUAL,2800,9,32000,14.1,0,23.3,18,10,fals,9.16,156.34,2016 + 36 months,MORTGAGE,car,NY,Not Verified,INDIVIDUAL,7000,8,87000,7.74,0,7.7,24,12,fals,13.99,3.22,2016 + 60 months,OWN,major_purchase,PA,Verified,INDIVIDUAL,18000,10,97000,8.41,0,48.8,20,15,fals,17.27,1019.56,2013 + 36 months,MORTGAGE,credit_card,CT,Verified,INDIVIDUAL,10200,null,28751,11.98,1,81.5,33,16,tru,8.39,-5994.26,2014 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,17000,10,95000,6.42,2,19.5,41,19,fals,10.16,358.39,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,13600,4,38000,11.92,1,94.4,10,7,fals,19.47,4463.34,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,5000,10,78000,9.05,0,15.3,38,22,fals,9.16,427.92,2016 + 36 months,RENT,debt_consolidation,WA,Verified,INDIVIDUAL,7200,2,60000,25.74,0,51.2,37,10,fals,9.99,1162.44,2013 + 36 months,MORTGAGE,debt_consolidation,AR,Not Verified,INDIVIDUAL,18000,6,61000,18.67,0,60.5,41,19,fals,8.18,2079.43,2015 + 36 months,OWN,credit_card,NC,Verified,INDIVIDUAL,35000,10,220000,11.92,0,53.6,38,20,fals,9.17,3863,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,7200,10,87000,4.41,0,18.8,9,15,fals,14.33,1700.88,2012 + 36 months,RENT,debt_consolidation,IL,Not Verified,INDIVIDUAL,6000,1,28000,13.16,0,59.9,19,7,fals,12.85,758.28,2014 + 36 months,OWN,credit_card,NC,Not Verified,INDIVIDUAL,7000,10,50000,29.91,0,89.6,37,29,fals,10.15,606.47,2014 + 36 months,MORTGAGE,debt_consolidation,OH,Not 
Verified,INDIVIDUAL,8000,10,62000,21.85,0,17.9,39,16,fals,5.32,178.18,2016 + 36 months,MORTGAGE,debt_consolidation,IN,Not Verified,INDIVIDUAL,12575,9,35000,27.78,0,59.9,30,11,tru,16.99,-3621.27,2014 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,30000,3,85000,22.29,1,37.2,41,12,tru,16.99,-27889.44,2016 + 60 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,12000,2,29000,37.87,0,48.3,24,20,tru,14.33,-9460.59,2015 + 36 months,RENT,credit_card,FL,Verified,INDIVIDUAL,5600,4,47000,16.37,0,29.3,6,4,tru,18.75,-4075.08,2012 + 60 months,MORTGAGE,debt_consolidation,LA,Verified,INDIVIDUAL,14000,10,82000,20.04,0,55.2,31,24,tru,20.5,-8217.84,2016 + 36 months,OWN,home_improvement,IL,Verified,INDIVIDUAL,30000,10,100000,2.45,0,1.1,32,33,fals,6.92,69.21,2015 + 60 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,12000,10,90000,7.35,0,57.9,28,17,fals,16.99,2478.64,2015 + 36 months,MORTGAGE,debt_consolidation,WA,Not Verified,INDIVIDUAL,4000,5,80000,12.2,0,96.7,14,11,fals,11.14,37.5,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,30000,9,96000,17.65,0,21,26,14,fals,18.55,4672.03,2015 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,21925,10,56000,15.56,0,17.6,29,16,tru,18.92,-16081.63,2014 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,6600,1,40000,20.13,0,41.7,18,14,fals,8.39,888.34,2014 + 36 months,MORTGAGE,home_improvement,OH,Not Verified,INDIVIDUAL,5600,3,78000,13.03,0,37.4,16,15,fals,5.32,407.55,2015 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,8000,10,67000,22.41,0,85.8,32,12,fals,16.55,1519.25,2015 + 36 months,MORTGAGE,credit_card,NJ,Verified,INDIVIDUAL,9000,5,109000,4.8,1,46,24,29,fals,8.39,460.82,2016 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,12000,10,71000,16.35,0,91.5,9,17,tru,13.53,-3857.46,2014 + 36 months,RENT,debt_consolidation,KS,Not Verified,INDIVIDUAL,6000,0,40000,26.01,0,68.2,19,8,tru,9.99,-2181.33,2015 + 60 
months,RENT,credit_card,GA,Verified,INDIVIDUAL,10000,0,51000,4.52,0,63.2,8,6,tru,13.66,-5771.76,2014 + 36 months,RENT,debt_consolidation,LA,Verified,INDIVIDUAL,5000,3,36000,17.83,0,53.8,24,13,fals,11.99,668.75,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,25000,3,100000,12.7,0,58.6,9,5,fals,18.55,7785.92,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,8000,2,49000,12.22,0,63.3,16,28,fals,15.61,1123.26,2014 + 36 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,8000,10,56805,6.78,2,24,25,13,fals,13.53,1759.5,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,8000,3,30000,22.88,1,75.2,18,11,fals,14.64,1931.09,2014 + 36 months,MORTGAGE,credit_card,TN,Verified,INDIVIDUAL,21000,10,95000,23.82,0,77.5,32,13,fals,14.09,3391.8,2013 + 36 months,RENT,debt_consolidation,VA,Not Verified,INDIVIDUAL,10000,1,60000,18.74,0,53.2,35,8,fals,9.8,1153.17,2016 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,19350,7,55000,35.48,0,32.4,40,23,fals,12.99,1390.68,2016 + 36 months,MORTGAGE,credit_card,CT,Not Verified,INDIVIDUAL,15725,10,54000,18.91,0,18.4,38,37,fals,7.89,1724.58,2015 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,10000,2,52215,7.47,0,79.5,18,9,fals,13.67,325.26,2016 + 36 months,MORTGAGE,home_improvement,MO,Verified,INDIVIDUAL,3000,3,105000,33.48,0,53.7,32,19,fals,8.9,429.35,2014 + 36 months,RENT,debt_consolidation,PA,Not Verified,INDIVIDUAL,15000,10,45000,19.18,0,58,14,14,fals,9.71,2350.84,2013 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,6000,2,90000,8.31,0,79.1,27,15,fals,9.17,399.08,2015 + 36 months,RENT,home_improvement,RI,Verified,INDIVIDUAL,4000,7,40000,25.8,0,22.5,27,13,fals,15.8,695.42,2013 + 36 months,RENT,medical,NJ,Verified,INDIVIDUAL,9000,10,41600,28.3,0,31.6,17,12,fals,6.62,805.99,2012 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,35000,3,400000,18.85,1,85.3,41,33,tru,14.99,-24922.37,2014 + 36 months,RENT,credit_card,NJ,Not 
Verified,INDIVIDUAL,9000,7,51000,5.51,0,41.3,21,16,fals,6.24,633.32,2015 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,12000,10,75000,18.37,1,49.1,63,14,fals,9.99,1596.8,2015 + 60 months,MORTGAGE,home_improvement,AZ,Verified,INDIVIDUAL,10900,10,67000,11.52,1,35.7,15,10,tru,21.48,-4990.8,2016 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,8000,5,43000,6.81,0,64.8,10,10,fals,6.62,831.97,2012 + 60 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,21000,3,146086,10.68,2,71.7,27,13,fals,13.99,1642.74,2015 + 60 months,RENT,debt_consolidation,WA,Verified,INDIVIDUAL,12000,3,48000,8.35,0,81.6,8,6,fals,15.61,305.15,2014 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30700,1,80800,26.41,0,69.8,31,16,tru,25.99,-18591,2015 + 60 months,RENT,medical,AZ,Verified,INDIVIDUAL,12000,0,96000,26.56,1,80.2,45,21,tru,18.25,-8285.87,2015 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,4675,10,40000,19.11,0,31.7,11,11,fals,22.35,556.28,2016 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,30000,5,115000,10.16,0,41.8,35,11,fals,11.99,41.47,2017 + 36 months,MORTGAGE,medical,NC,Not Verified,INDIVIDUAL,8000,10,125000,10.12,0,29.8,39,28,fals,7.89,713.33,2015 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,7925,10,40000,22.29,0,34.5,11,14,fals,11.44,22.66,2017 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,3000,3,72000,12.08,1,32.8,34,25,fals,12.49,600.62,2014 + 36 months,MORTGAGE,debt_consolidation,AL,Not Verified,INDIVIDUAL,4000,6,74000,23.89,0,39.5,26,14,fals,14.3,942.54,2013 + 36 months,MORTGAGE,other,NY,Verified,INDIVIDUAL,5000,null,55000,19.03,0,54.2,34,13,tru,12.49,-1891.46,2014 + 36 months,MORTGAGE,credit_card,MD,Not Verified,INDIVIDUAL,7000,10,80000,18.9,0,83.5,23,17,fals,6.68,701.49,2015 + 36 months,OWN,debt_consolidation,HI,Verified,INDIVIDUAL,21000,1,50000,12.55,0,76.4,15,11,tru,15.31,-11647.39,2016 + 60 
months,RENT,credit_card,AZ,Verified,INDIVIDUAL,22000,10,50000,24.48,0,78,31,7,tru,22.4,-15342.13,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,12200,10,88000,16.46,0,72.6,17,17,fals,7.89,1106.47,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,2400,9,115000,10.75,0,17.6,26,24,fals,8.9,343.47,2014 + 36 months,RENT,other,CO,Verified,INDIVIDUAL,5000,0,25000,34.85,0,65.4,26,7,tru,17.86,-4102.86,2015 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,10000,4,40000,22.02,0,43.6,28,29,fals,12.74,4.6,2017 + 36 months,MORTGAGE,vacation,WV,Not Verified,INDIVIDUAL,7500,10,95000,9.16,1,57,43,23,fals,16.29,1799.28,2014 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,18250,null,42100,25.1,0,41.2,28,20,fals,9.99,2946.4,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,7000,10,130720,10.49,0,17.6,30,15,fals,6.92,705.41,2015 + 60 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,18000,1,64541,17.85,1,20,25,12,fals,22.74,159.18,2017 + 36 months,MORTGAGE,credit_card,NC,Verified,INDIVIDUAL,6000,5,32000,16.84,0,48.9,10,11,fals,12.69,906.42,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,30000,9,142660,13.42,0,92.5,25,24,tru,14.33,-10946.94,2013 + 36 months,RENT,other,CO,Verified,INDIVIDUAL,5000,0,38000,28.58,0,36.8,80,24,tru,7.49,-1425.13,2015 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,6000,0,20000,17.89,0,6,32,14,fals,8.18,453.04,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,6325,6,43000,7.76,0,74.7,13,5,fals,12.12,468.94,2012 + 36 months,MORTGAGE,home_improvement,TX,Verified,INDIVIDUAL,20700,10,46000,14.64,0,46.1,40,28,fals,13.35,4534.54,2014 + 36 months,MORTGAGE,other,TX,Verified,INDIVIDUAL,4000,2,58000,15.33,0,4.4,25,13,fals,12.99,232.93,2016 + 60 months,MORTGAGE,credit_card,IN,Verified,INDIVIDUAL,14900,null,35500,11.12,0,62.8,14,12,fals,18.25,4704.72,2015 + 36 
months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,24500,10,179200,12.16,0,52.9,25,17,fals,7.69,2989.41,2014 + 60 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,19800,2,91000,27.55,0,86.2,25,15,tru,21.49,-12062.04,2013 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,9000,0,50000,38.07,0,46.8,31,14,fals,9.16,512.35,2016 + 36 months,RENT,credit_card,KS,Verified,INDIVIDUAL,35000,8,200000,6.48,0,16.5,34,22,fals,7.39,2449.99,2016 + 60 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,13325,10,104000,26.03,0,14.3,35,22,tru,7.89,-8782.65,2015 + 60 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,18000,1,110000,10.81,0,82.1,49,16,fals,19.52,5820.02,2014 + 36 months,MORTGAGE,vacation,TX,Not Verified,INDIVIDUAL,10000,10,55000,2.23,0,26.6,10,15,tru,13.44,-8233.93,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,9600,1,51000,27.84,1,73.8,14,8,tru,15.31,-241,2013 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,6450,10,43576,25.27,0,41.6,29,13,fals,16.55,1437.95,2015 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,10000,10,74400,24.18,0,61,56,27,fals,9.99,863.75,2013 + 36 months,RENT,other,OH,Verified,INDIVIDUAL,4500,9,44000,33.14,0,49.7,23,11,fals,17.57,63.65,2015 + 60 months,MORTGAGE,major_purchase,MI,Verified,INDIVIDUAL,15000,10,63000,21.3,2,36.5,42,18,tru,16.29,-9797.57,2013 + 36 months,MORTGAGE,credit_card,NY,Verified,INDIVIDUAL,13000,1,75000,23.36,4,38.8,65,11,tru,14.65,-8619.75,2015 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,7500,10,79000,2.26,0,5,37,22,fals,13.65,1480.3,2014 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,7500,6,42720,16.52,0,45.5,18,12,fals,12.12,1450.39,2013 + 36 months,MORTGAGE,home_improvement,AL,Verified,INDIVIDUAL,15000,2,48000,12.03,0,35,47,28,fals,9.67,2340.72,2014 + 60 months,MORTGAGE,credit_card,RI,Not Verified,INDIVIDUAL,21600,10,140000,23.48,0,69,33,25,fals,11.99,2915.34,2016 + 36 
months,MORTGAGE,home_improvement,MA,Not Verified,INDIVIDUAL,10000,3,127000,9.13,0,66.1,13,14,fals,10.99,1492.1,2014 + 36 months,OWN,credit_card,CA,Verified,INDIVIDUAL,20000,8,100000,13.59,0,69.2,26,17,fals,12.69,3416.27,2015 + 36 months,RENT,other,FL,Verified,INDIVIDUAL,23400,1,65000,13.11,0,39,35,14,fals,20.99,8010,2015 + 60 months,MORTGAGE,major_purchase,NC,Verified,INDIVIDUAL,20000,8,62000,9.74,0,19,13,12,tru,22.99,-15406.72,2015 + 36 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,3950,10,52000,13.8,0,96.2,28,11,fals,21.18,341.59,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,16000,7,60000,6.63,0,86.1,20,21,fals,25.57,3715.25,2014 + 60 months,RENT,credit_card,FL,Verified,INDIVIDUAL,12000,3,48000,33.9,0,70.2,16,15,fals,12.69,1510.36,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,25000,4,105000,3.84,0,47.9,30,12,fals,7.89,1056.01,2015 + 60 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,13250,4,55000,19.66,0,74,18,18,fals,18.25,5145.01,2014 + 36 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,20000,8,75000,14.11,0,60.3,32,13,fals,15.8,5015.9,2012 + 36 months,MORTGAGE,credit_card,MO,Verified,INDIVIDUAL,10000,null,50000,29.94,0,16.7,37,33,fals,7.69,1229.66,2014 + 36 months,MORTGAGE,other,MO,Verified,INDIVIDUAL,3500,null,72000,7.97,0,8,42,19,fals,6.03,334.86,2012 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,8500,7,45000,16.43,0,36.4,30,8,fals,10.99,725.14,2014 + 60 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,10000,6,65000,21.95,0,49.3,27,13,fals,13.99,3073.97,2012 + 36 months,MORTGAGE,home_improvement,MD,Not Verified,INDIVIDUAL,4800,4,65000,12.39,0,17.1,9,18,fals,9.99,600,2015 + 36 months,RENT,debt_consolidation,WI,Verified,INDIVIDUAL,7825,1,30000,18.24,0,45.8,11,5,tru,18.84,-5098.89,2015 + 60 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,35000,10,85000,29.61,0,49,37,16,fals,17.57,3327.54,2014 + 36 
months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,3450,4,36000,4.97,0,43.6,7,11,fals,15.31,629.34,2016 + 36 months,OWN,debt_consolidation,CA,Not Verified,INDIVIDUAL,21000,2,64000,29.35,0,22.5,19,16,fals,7.89,2178.98,2015 + 36 months,MORTGAGE,credit_card,MO,Verified,INDIVIDUAL,15000,10,118450,11.22,1,91.9,21,44,fals,9.67,2305.07,2014 + 36 months,RENT,other,AZ,Verified,INDIVIDUAL,12700,5,56000,4.11,0,58.4,15,22,tru,17.14,-5918.99,2015 + 60 months,MORTGAGE,credit_card,NM,Verified,INDIVIDUAL,11050,10,68000,3.92,1,48.5,8,32,tru,14.65,-7916.47,2015 + 36 months,MORTGAGE,debt_consolidation,WI,Not Verified,INDIVIDUAL,4000,3,56000,11.44,0,11.5,20,13,fals,6.03,234.82,2012 + 60 months,MORTGAGE,debt_consolidation,OR,Not Verified,INDIVIDUAL,10000,9,58000,36.23,2,56.1,29,11,fals,17.57,2676.35,2015 + 36 months,OWN,debt_consolidation,OR,Verified,INDIVIDUAL,24450,10,49000,11.38,2,7.8,27,19,fals,7.12,1814.75,2014 + 36 months,MORTGAGE,credit_card,GA,Not Verified,INDIVIDUAL,15000,5,140000,8.4,0,35.8,26,24,fals,6.99,1501.16,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,17475,9,53778,16.74,1,89.8,25,11,tru,18.75,-7477.57,2013 diff --git a/python/py_auto_ml/test/local_spark_singleton.py b/python/py_auto_ml/test/local_spark_singleton.py new file mode 100644 index 00000000..f4272588 --- /dev/null +++ b/python/py_auto_ml/test/local_spark_singleton.py @@ -0,0 +1,23 @@ +from pyspark.sql import SparkSession +import os + + +class SparkSingleton: + """A singleton class on Datalib which returns one Spark instance""" + __instance = None + + @classmethod + def get_instance(cls): + """Create a Spark instance for Datalib. 
+ :return: A Spark instance + """ + return (SparkSession.builder + .getOrCreate()) + + @classmethod + def get_local_instance(cls): + return (SparkSession.builder + .master("local[*]") + .appName("automl") + .getOrCreate()) + diff --git a/python/py_auto_ml/test/test_automation_runner.py b/python/py_auto_ml/test/test_automation_runner.py new file mode 100644 index 00000000..edd7868d --- /dev/null +++ b/python/py_auto_ml/test/test_automation_runner.py @@ -0,0 +1,53 @@ +import unittest +from py_auto_ml.test.local_spark_singleton import SparkSingleton +from py_auto_ml.automation_runner import AutomationRunner + + +class TestAutomationRunner(unittest.TestCase): + + def setup(self): + self.spark = SparkSingleton.get_instance() + + def test_get_returns(self): + self.setup() + automation_runner = AutomationRunner() + + model_report_data_frame = self.spark.createDataFrame([(1,2,3)],["col1", "col2", "col3"]) + model_report_data_frame.createOrReplaceTempView("modelReport") + model_report_data_frame.createOrReplaceTempView("modelReportData") + + generation_report_data_frame = self.spark.createDataFrame([(4, 5, 6, 7)], ["col1", "col2", "col3", "col4"]) + generation_report_data_frame.createOrReplaceTempView("generationReport") + + confusion_data = self.spark.createDataFrame([(7, 8)], ["col1", "col2"]) + confusion_data.createOrReplaceTempView("confusionData") + + prediction_data = self.spark.createDataFrame([(9,10,11,12,13)], ["col1", "col2", "col3", "col4", "col5"]) + prediction_data.createOrReplaceTempView("predictionData") + + data_with_preds = self.spark.createDataFrame([(14, 15)], ["col1", "col2"]) + data_with_preds.createOrReplaceTempView("dataWithPredictions") + + # Test with RUN + automation_runner._automation_runner = True + automation_runner.get_returns("run") + assert len(automation_runner.generation_report.columns) == 4 + assert len(automation_runner.model_report.columns) == 3 + + # Test with CONFUSION + automation_runner.get_returns("confusion") + assert 
len(automation_runner.confusion_data.columns) == 2 + assert len(automation_runner.prediction_data.columns) == 5 + assert len(automation_runner.generation_report.columns) == 4 + assert len(automation_runner.model_report.columns) == 3 + + # Test with PREDICTION + automation_runner.get_returns("prediction") + assert len(automation_runner.generation_report.columns) == 4 + assert len(automation_runner.model_report.columns) == 3 + assert len(automation_runner.data_with_predictions.columns) == 2 + + self.tear_down() + + def tear_down(self): + self.spark.stop() diff --git a/python/py_auto_ml/test/test_family_runner.py b/python/py_auto_ml/test/test_family_runner.py new file mode 100644 index 00000000..b3079af2 --- /dev/null +++ b/python/py_auto_ml/test/test_family_runner.py @@ -0,0 +1,34 @@ +import unittest +from py_auto_ml.test.local_spark_singleton import SparkSingleton +from py_auto_ml.executor.family_runner import FamilyRunner + + +class TestFamilyRunner(unittest.TestCase): + + def setup(self): + self.spark = SparkSingleton.get_instance() + + def test_get_returns(self): + self.setup() + family_runner = FamilyRunner() + + model_report_data_frame = self.spark.createDataFrame([(1,2,3)],["col1", "col2", "col3"]) + model_report_data_frame.createOrReplaceTempView("modelReportDataFrame") + + generation_report_data_frame = self.spark.createDataFrame([(4, 5, 6, 7)], ["col1", "col2", "col3", "col4"]) + generation_report_data_frame.createOrReplaceTempView("generationReportDataFrame") + + best_mlflow_run_id = self.spark.createDataFrame([(7, 8)], ["col1", "col2"]) + best_mlflow_run_id.createOrReplaceTempView("bestMlFlowRunId") + + family_runner._family_runner = True + family_runner.get_returns() + + assert len(family_runner.model_report.columns) == 3 + assert len(family_runner.best_mlflow_run_id.columns) == 2 + assert len(family_runner.generation_report.columns) == 4 + + self.tear_down() + + def tear_down(self): + self.spark.stop() diff --git 
a/python/py_auto_ml/test/test_feature_importance.py b/python/py_auto_ml/test/test_feature_importance.py new file mode 100644 index 00000000..0d08bfe0 --- /dev/null +++ b/python/py_auto_ml/test/test_feature_importance.py @@ -0,0 +1,75 @@ +import unittest +from python.py_auto_ml.test.local_spark_singleton import SparkSingleton +from python.py_auto_ml.exploration.feature_importance import FeatureImportance +from pyspark.sql import SparkSession + + +class TestFeatureImportance(unittest.TestCase): + + def setup(self): + self.spark = SparkSingleton.get_instance() + + def test_bring_in_returns(self): + self.setup() + + importances_data_frame = self.spark.createDataFrame([(1, 2, 3)], ["feature", "col2", "col3"]) + importances_data_frame.createOrReplaceTempView("importances") + + feat_imp = FeatureImportance() + feat_imp.feature_importance = True + feat_imp.bring_in_returns() + + assert len(feat_imp.importances.columns) == 3 + assert len(feat_imp.top_fields.columns) == 1 + + self.tear_down() + + def tear_down(self): + self.spark.stop() + + @staticmethod + def convert_csv_to_df(csv_path: str): + spark_session = SparkSession.builder.master('local[*]').appName("providentiaml-unit-tests").getOrCreate() + spark_session.sparkContext.setLogLevel("ERROR") + return spark_session.read.format('csv').option("header", "true").option("inferSchema", "true").load(csv_path) + + def test_loan_risk_xgboost(self): + self.setup() + loan_risk_df = self.convert_csv_to_df("Desktop/providenc/load_risk.csv") + generic_overrides = { + "labelCol": "label", + "scoringMetric": "areaUnderROC", + "dataPrepCachingFlag": False, + "autoStoppingFlag": True, + "tunerAutoStoppingScore": 0.91, + "tunerParallelism": 1*2, + "tunerKFold": 2, + "tunerSeed": 42, + "tunerInitialGenerationArraySeed": 42, + "tunerTrainPortion": 0.7, + "tunerTrainSplitMethod": "stratified", + "tunerInitialGenerationMode": "permutations", + "tunerInitialGenerationPermutationCount": 8, + "tunerInitialGenerationIndexMixingMode": "linear", + 
"tunerFirstGenerationGenePool": 16, + "tunerNumberOfGenerations": 3, + "tunerNumberOfParentsToRetain": 2, + "tunerNumberOfMutationsPerGeneration": 4, + "tunerGeneticMixing": 0.8, + "tunerGenerationalMutationStrategy": "fixed", + "tunerEvolutionStrategy": "batch", + "tunerHyperSpaceInferenceFlag": True, + "tunerHyperSpaceInferenceCount": 400000, + "tunerHyperSpaceModelType": "XGBoost", + "tunerHyperSpaceModelCount": 8, + "mlFlowLoggingFlag": False, + "mlFlowLogArtifactsFlag": False + } + feat_imp = FeatureImportance.run_feature_importance("XGBoost", "classifier", loan_risk_df, 20.0, "count", generic_overrides) + + assert len(feat_imp.top_fields.columns) != 0 + assert len(feat_imp.importances.columns) !=0 + + + + self.tear_down() \ No newline at end of file diff --git a/python/py_auto_ml/utilities/__init__.py b/python/py_auto_ml/utilities/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/python/py_auto_ml/utilities/helpers.py b/python/py_auto_ml/utilities/helpers.py new file mode 100644 index 00000000..0232fc09 --- /dev/null +++ b/python/py_auto_ml/utilities/helpers.py @@ -0,0 +1,28 @@ +class Helpers: + + @staticmethod + def check_model_family(model_family: str): + supported_models = ["RandomForest","XGBoost", "LogisticRegresesion","Trees","GBT","LinearRegression", + "MLPC", "SVM"] + if model_family not in supported_models: + raise Exception("Your model family but be within any of the following supported model types:", + supported_models) + + @staticmethod + def check_prediction_type(prediction_type:str): + supported_prediction_types = ['regressor', 'classifier'] + if prediction_type not in supported_prediction_types: + raise Exception("Prediction type is not supported - it must be one of the following", + supported_prediction_types) + + @staticmethod + def check_runner_types(runner_type: str): + """ + + :param runner_type: str + Checking that the runner_type is a supported runner_type + :return: + """ + acceptable_strings = ["run", "confusion", 
"prediction"] + if runner_type not in acceptable_strings: + raise Exception("runner_type must be one of the following run, confusion, or prediction") \ No newline at end of file diff --git a/python/setup.py b/python/setup.py new file mode 100644 index 00000000..158904a4 --- /dev/null +++ b/python/setup.py @@ -0,0 +1,8 @@ +from setuptools import setup, find_packages + +setup( + name = "pyAutoML", + version= "0.2.0", + author="Databricks", + packages=find_packages() + ) \ No newline at end of file diff --git a/src/main/scala/com/databricks/backend/common/rpc/CommandContext.scala b/src/main/scala/com/databricks/backend/common/rpc/CommandContext.scala new file mode 100644 index 00000000..e7cd8220 --- /dev/null +++ b/src/main/scala/com/databricks/backend/common/rpc/CommandContext.scala @@ -0,0 +1,3 @@ +package com.databricks.backend.common.rpc + +trait CommandContext {} diff --git a/src/main/scala/com/databricks/labs/automl/AutomationRunner.scala b/src/main/scala/com/databricks/labs/automl/AutomationRunner.scala new file mode 100644 index 00000000..040898d6 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/AutomationRunner.scala @@ -0,0 +1,2649 @@ +package com.databricks.labs.automl + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.inference.{ + InferenceConfig, + InferenceModelConfig, + InferenceTools +} +import com.databricks.labs.automl.model._ +import com.databricks.labs.automl.model.tools.split.{ + DataSplitCustodial, + DataSplitUtility +} +import com.databricks.labs.automl.model.tools.{PostModelingOptimization} +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.reports.{ + DecisionTreeSplits, + RandomForestFeatureImportance +} +import com.databricks.labs.automl.tracking.{ + MLFlowReportStructure, + MLFlowReturn, + MLFlowTracker +} +import com.microsoft.ml.spark.lightgbm.{ + LightGBMClassificationModel, + LightGBMRegressionModel +} +import ml.dmlc.xgboost4j.scala.spark.{ + 
XGBoostClassificationModel, + XGBoostRegressionModel +} +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.log4j.{Level, Logger} +import org.apache.spark.ml.classification._ +import org.apache.spark.ml.regression.{ + DecisionTreeRegressionModel, + GBTRegressionModel, + LinearRegressionModel, + RandomForestRegressionModel +} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.storage.StorageLevel +import org.json4s.jackson.Serialization +import org.json4s.jackson.Serialization.writePretty +import org.json4s.{Formats, NoTypeHints} + +import scala.collection.mutable.ArrayBuffer + +class AutomationRunner(df: DataFrame) extends DataPrep(df) with InferenceTools { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private def runRandomForest( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[RandomForestModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + val initialize = new RandomForestTuner( + cachedData, + splitData, + payload.modelType, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + 
.setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setRandomForestNumericBoundaries(_mainConfig.numericBoundaries) + .setRandomForestStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + 
_mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + 
.setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(_mainConfig.geneticConfig.hyperSpaceModelCount) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("RandomForest") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getRandomForestNumericBoundaries) + .setStringBoundaries(initialize.getRandomForestStringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .randomForestPrediction( + genericResults.result.toArray, + 
_mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + } + + private def runLightGBM( + lightGBMType: String, + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[LightGBMModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + val initialize = new LightGBMTuner( + cachedData, + splitData, + payload.modelType, + lightGBMType, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setLGBMNumericBoundaries(_mainConfig.numericBoundaries) + .setLGBMStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + 
_mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + 
.setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + 
.setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(_mainConfig.geneticConfig.hyperSpaceModelCount) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily(lightGBMType) + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getLightGBMNumericBoundaries) + .setStringBoundaries(initialize.getLightGBMStringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .lightGBMPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + 
DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + } + + private def runXGBoost( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[XGBoostModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + val initialize = new XGBoostTuner( + cachedData, + splitData, + payload.modelType, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setXGBoostNumericBoundaries(_mainConfig.numericBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + 
_mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + 
.setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(_mainConfig.geneticConfig.hyperSpaceModelCount) + + if (_modelSeedSetStatus) + 
initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("XGBoost") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getXGBoostNumericBoundaries) + .setStringBoundaries(_mainConfig.stringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .xgBoostPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + } + + private def runMLPC( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[MLPCModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val 
splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + payload.modelType match { + case "classifier" => + val initialize = new MLPCTuner(cachedData, splitData, isPipeline) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setMlpcNumericBoundaries(_mainConfig.numericBoundaries) + .setMlpcStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter( + _mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter + ) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables( + _mainConfig.geneticConfig.kSampleConfig.lshHashTables + ) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + 
.setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue( + _mainConfig.geneticConfig.kSampleConfig.mutationValue + ) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget( + _mainConfig.geneticConfig.kSampleConfig.numericTarget + ) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + 
.setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount( + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + 
println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("MLPC") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getMlpcNumericBoundaries) + .setStringBoundaries(initialize.getMlpcStringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .mlpcPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount, + initialize.getFeatureInputSize, + initialize.getClassDistinctCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + case _ => + throw new UnsupportedOperationException( + s"Detected Model Type ${payload.modelType} is not supported by MultiLayer Perceptron Classifier" + ) + } + } + + private def runGBT( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[GBTModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + 
_mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + val initialize = new GBTreesTuner( + cachedData, + splitData, + payload.modelType, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setGBTNumericBoundaries(_mainConfig.numericBoundaries) + .setGBTStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + 
_mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + 
.setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(_mainConfig.geneticConfig.hyperSpaceModelCount) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + 
+ modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("GBT") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getGBTNumericBoundaries) + .setStringBoundaries(initialize.getGBTStringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .gbtPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + } + + private def runLinearRegression( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[LinearRegressionModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + 
_mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + payload.modelType match { + case "regressor" => + val initialize = new LinearRegressionTuner( + cachedData, + splitData, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setLinearRegressionNumericBoundaries(_mainConfig.numericBoundaries) + .setLinearRegressionStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter( + _mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter + ) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables( + _mainConfig.geneticConfig.kSampleConfig.lshHashTables + ) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue( + 
_mainConfig.geneticConfig.kSampleConfig.mutationValue + ) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget( + _mainConfig.geneticConfig.kSampleConfig.numericTarget + ) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + 
.setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount( + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new 
PostModelingOptimization() + .setModelFamily("LinearRegression") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries( + initialize.getLinearRegressionNumericBoundaries + ) + .setStringBoundaries(initialize.getLinearRegressionStringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .linearRegressionPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + case _ => + throw new UnsupportedOperationException( + s"Detected Model Type ${payload.modelType} is not supported by Linear Regression" + ) + } + + } + + private def runLogisticRegression( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[LogisticRegressionModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + 
_mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + payload.modelType match { + case "classifier" => + val initialize = new LogisticRegressionTuner( + cachedData, + splitData, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setLogisticRegressionNumericBoundaries(_mainConfig.numericBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter( + _mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter + ) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables( + _mainConfig.geneticConfig.kSampleConfig.lshHashTables + ) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue( + _mainConfig.geneticConfig.kSampleConfig.mutationValue + ) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + 
.setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget( + _mainConfig.geneticConfig.kSampleConfig.numericTarget + ) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + 
_mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount( + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("LogisticRegression") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + 
_mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries( + initialize.getLogisticRegressionNumericBoundaries + ) + .setStringBoundaries(_mainConfig.stringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .logisticRegressionPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + case _ => + throw new UnsupportedOperationException( + s"Detected Model Type ${payload.modelType} is not supported by Logistic Regression" + ) + } + + } + + private def runSVM( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[SVMModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + payload.modelType match { + case "classifier" => + val initialize 
= new SVMTuner(cachedData, splitData, isPipeline) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setSvmNumericBoundaries(_mainConfig.numericBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter( + _mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter + ) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables( + _mainConfig.geneticConfig.kSampleConfig.lshHashTables + ) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue( + _mainConfig.geneticConfig.kSampleConfig.mutationValue + ) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget( + 
_mainConfig.geneticConfig.kSampleConfig.numericTarget + ) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + 
_mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount( + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("SVM") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getSvmNumericBoundaries) + .setStringBoundaries(_mainConfig.stringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + 
.setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .svmPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + + case _ => + throw new UnsupportedOperationException( + s"Detected Model Type ${payload.modelType} is not supported by Support Vector Machines" + ) + } + } + + private def runTrees( + payload: DataGeneration, + isPipeline: Boolean = false + ): (Array[TreesModelsWithResults], DataFrame, String, DataFrame) = { + + val cachedData = if (_mainConfig.dataPrepCachingFlag) { + val data = payload.data.persist(StorageLevel.MEMORY_AND_DISK) + data.foreach(_ => ()) + data + } else { + payload.data + } + + val splitData = DataSplitUtility.split( + cachedData, + _mainConfig.geneticConfig.kFold, + _mainConfig.geneticConfig.trainSplitMethod, + _mainConfig.labelCol, + _mainConfig.geneticConfig.deltaCacheBackingDirectory, + _mainConfig.geneticConfig.splitCachingStrategy, + _mainConfig.modelFamily, + _mainConfig.geneticConfig.parallelism, + _mainConfig.geneticConfig.trainPortion, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.geneticConfig.trainSplitChronologicalColumn, + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + _mainConfig.dataReductionFactor + ) + + val initialize = new DecisionTreeTuner( + cachedData, + splitData, + payload.modelType, + isPipeline + ).setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setTreesNumericBoundaries(_mainConfig.numericBoundaries) + 
.setTreesStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric(_mainConfig.scoringMetric) + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod( + trainSplitValidation( + _mainConfig.geneticConfig.trainSplitMethod, + payload.modelType + ) + ) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + 
_mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(_mainConfig.geneticConfig.parallelism) + .setKFold(_mainConfig.geneticConfig.kFold) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + _mainConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + _mainConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + _mainConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + _mainConfig.geneticConfig.numberOfParentsToRetain + ) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setContinuousEvolutionMaxIterations( + _mainConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + _mainConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + _mainConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + _mainConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + _mainConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + 
.setContinuousEvolutionRollingImporvementCount( + _mainConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations( + _mainConfig.geneticConfig.initialGenerationConfig.permutationCount + ) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(_mainConfig.geneticConfig.hyperSpaceModelCount) + + if (_modelSeedSetStatus) + initialize.setModelSeed(_mainConfig.geneticConfig.modelSeed) + + val (modelResultsRaw, modelStatsRaw) = initialize.evolveWithScoringDF() + + val resultBuffer = modelResultsRaw.toBuffer + val statsBuffer = new ArrayBuffer[DataFrame]() + statsBuffer += modelStatsRaw + + if (_mainConfig.geneticConfig.hyperSpaceInference) { + + println("\n\t\tStarting Post Tuning Inference Run.\n") + + val genericResults = new ArrayBuffer[GenericModelReturn] + + modelResultsRaw.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + + val hyperSpaceRunCandidates = new PostModelingOptimization() + .setModelFamily("Trees") + .setModelType(payload.modelType) + .setHyperParameterSpaceCount( + _mainConfig.geneticConfig.hyperSpaceInferenceCount + ) + .setNumericBoundaries(initialize.getTreesNumericBoundaries) + .setStringBoundaries(initialize.getTreesStringBoundaries) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy(_mainConfig.scoringOptimizationStrategy) + .treesPrediction( + genericResults.result.toArray, + _mainConfig.geneticConfig.hyperSpaceModelType, + _mainConfig.geneticConfig.hyperSpaceModelCount + ) + + val (hyperResults, hyperDataFrame) = + 
initialize.postRunModeledHyperParams(hyperSpaceRunCandidates) + + hyperResults.foreach { x => + resultBuffer += x + } + statsBuffer += hyperDataFrame + + } + + DataSplitCustodial.cleanCachedInstances(splitData, _mainConfig) + + ( + resultBuffer.toArray, + statsBuffer.reduce(_ union _), + payload.modelType, + cachedData + ) + } + + private def logResultsToMlFlow(runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String): MLFlowReportStructure = { + + val mlFlowLogger = MLFlowTracker(_mainConfig) + + if (_mainConfig.mlFlowLogArtifactsFlag) mlFlowLogger.logArtifactsOn() + else mlFlowLogger.logArtifactsOff() + + mlFlowLogger.logMlFlowDataAndModels( + runData, + modelFamily, + modelType, + _mainConfig.inferenceConfigSaveLocation, + _mainConfig.scoringOptimizationStrategy + ) + } + + private def logPipelineResultsToMlFlow( + runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String + ): MLFlowReportStructure = { + + val mlFlowLogger = MLFlowTracker(_mainConfig) + mlFlowLogger.logMlFlowForPipeline( + AutoMlPipelineMlFlowUtils + .getMainConfigByPipelineId(_mainConfig.pipelineId) + .mlFlowRunId, + runData, + modelFamily, + modelType, + _mainConfig.scoringOptimizationStrategy + ) + } + + protected[automl] def executeTuning( + payload: DataGeneration, + isPipeline: Boolean = false + ): TunerOutput = { + + val genericResults = new ArrayBuffer[GenericModelReturn] + logger.log(Level.INFO, convertMainConfigToJson(_mainConfig)) + + val (resultArray, modelStats, modelSelection, dataframe) = + _mainConfig.modelFamily match { + case "RandomForest" => + val (results, stats, selection, data) = + runRandomForest(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "XGBoost" => + val (results, stats, selection, 
data) = + runXGBoost(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "gbmBinary" | "gbmMulti" | "gbmMultiOVA" | "gbmHuber" | "gbmFair" | + "gbmLasso" | "gbmRidge" | "gbmPoisson" | "gbmQuantile" | "gbmMape" | + "gbmTweedie" | "gbmGamma" => + val (results, stats, selection, data) = + runLightGBM(_mainConfig.modelFamily, payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "GBT" => + val (results, stats, selection, data) = runGBT(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "MLPC" => + val (results, stats, selection, data) = runMLPC(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractMLPCPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "LinearRegression" => + val (results, stats, selection, data) = + runLinearRegression(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "LogisticRegression" => + val (results, stats, selection, data) = + 
runLogisticRegression(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "SVM" => + val (results, stats, selection, data) = runSVM(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case "Trees" => + val (results, stats, selection, data) = runTrees(payload, isPipeline) + results.foreach { x => + genericResults += GenericModelReturn( + hyperParams = extractPayload(x.modelHyperParams), + model = x.model, + score = x.score, + metrics = x.evalMetrics, + generation = x.generation + ) + } + (genericResults, stats, selection, data) + case _ => + throw new UnsupportedOperationException( + s"Model Family ${_mainConfig.modelFamily} is not supported for tuning" + ) + } + + val genericResultData = genericResults.result.toArray + + val mlFlow = if (_mainConfig.mlFlowLoggingFlag && !isPipeline) { + + // set the Inference details in general for the run + // TODO - Remove this - It's here and in the tracker but the values are different and should be set equal + val inferenceModelConfig = InferenceModelConfig( + modelFamily = _mainConfig.modelFamily, + modelType = modelSelection, + modelLoadMethod = "path", + mlFlowConfig = _mainConfig.mlFlowConfig, + mlFlowRunId = "none", + modelPathLocation = "notDefined" + ) + + // Set the Inference Config + InferenceConfig.setInferenceModelConfig(inferenceModelConfig) + InferenceConfig.setInferenceConfigStorageLocation( + _mainConfig.inferenceConfigSaveLocation + ) + + // Write the Inference Payload out to the specified location + val outputInferencePayload = InferenceConfig.getInferenceConfig + + val inferenceConfigReadable = convertInferenceConfigToJson( + outputInferencePayload + ) + val inferenceLog = +
s"Inference Configuration: \n${inferenceConfigReadable.prettyJson}" + println(inferenceLog) + + logger.log(Level.INFO, inferenceLog) + + val mlFlowResult = try { + logResultsToMlFlow( + genericResultData, + _mainConfig.modelFamily, + modelSelection + ) + } catch { + case e: Exception => + println( + s"Failed to log to mlflow. Check configuration. \n ${e.printStackTrace()} " + + s"\n ${e.getStackTraceString}" + ) + logger.log(Level.FATAL, e.getStackTraceString) + generateDummyMLFlowReturn("error").get + } + + implicit val formats: Formats = Serialization.formats(hints = NoTypeHints) + val pretty = writePretty(mlFlowResult) + + logger.log(Level.INFO, pretty) + mlFlowResult + } else if (isPipeline && _mainConfig.mlFlowLoggingFlag) { + logPipelineResultsToMlFlow( + genericResultData, + _mainConfig.modelFamily, + modelSelection + ) + } else { + generateDummyMLFlowReturn("undefined").get + } + + val generationalData = extractGenerationalScores( + genericResultData, + _mainConfig.scoringOptimizationStrategy, + _mainConfig.modelFamily, + modelSelection + ) + + new TunerOutput( + rawData = dataframe, + modelSelection = modelSelection, + mlFlowOutput = mlFlow + ) { + override def modelReport: Array[GenericModelReturn] = genericResultData + override def generationReport: Array[GenerationalReport] = + generationalData + override def modelReportDataFrame: DataFrame = modelStats + override def generationReportDataFrame: DataFrame = + generationDataFrameReport( + generationalData, + _mainConfig.scoringOptimizationStrategy + ) + } + + } + + private def generateDummyMLFlowReturn( + msg: String + ): Option[MLFlowReportStructure] = { + try { + val genTracker = MLFlowTracker(_mainConfig) + val dummyLog = MLFlowReturn( + genTracker.getMLFlowClient, + msg, + Array((msg, 0.0)) + ) + Some(MLFlowReportStructure(dummyLog, dummyLog)) + } catch { + case ex: Exception => Some(MLFlowReportStructure(null, null)) + } + } + + protected[automl] def predictFromBestModel( + resultPayload: 
Array[GenericModelReturn], + rawData: DataFrame, + modelSelection: String + ): DataFrame = { + + val bestModel = resultPayload(0) + + _mainConfig.modelFamily match { + case "RandomForest" => + modelSelection match { + case "regressor" => + val model = + bestModel.model.asInstanceOf[RandomForestRegressionModel] + model.transform(rawData) + case "classifier" => + val model = + bestModel.model.asInstanceOf[RandomForestClassificationModel] + model.transform(rawData) + } + case "XGBoost" => + modelSelection match { + case "regressor" => + val model = bestModel.model.asInstanceOf[XGBoostRegressionModel] + model.transform(rawData) + case "classifier" => + val model = bestModel.model.asInstanceOf[XGBoostClassificationModel] + model.transform(rawData) + } + case "gbmBinary" | "gbmMulti" | "gbmMultiOVA" => + val model = bestModel.model.asInstanceOf[LightGBMClassificationModel] + model.transform(rawData) + case "gbmHuber" | "gbmFair" | "gbmLasso" | "gbmRidge" | "gbmPoisson" | + "gbmQuantile" | "gbmMape" | "gbmTweedie" | "gbmGamma" => + val model = bestModel.model.asInstanceOf[LightGBMRegressionModel] + model.transform(rawData) + case "GBT" => + modelSelection match { + case "regressor" => + val model = bestModel.model.asInstanceOf[GBTRegressionModel] + model.transform(rawData) + case "classifier" => + val model = bestModel.model.asInstanceOf[GBTClassificationModel] + model.transform(rawData) + } + case "MLPC" => + val model = + bestModel.model.asInstanceOf[MultilayerPerceptronClassificationModel] + model.transform(rawData) + case "LinearRegression" => + val model = bestModel.model.asInstanceOf[LinearRegressionModel] + model.transform(rawData) + case "LogisticRegression" => + val model = bestModel.model.asInstanceOf[LogisticRegressionModel] + model.transform(rawData) + case "SVM" => + val model = bestModel.model.asInstanceOf[LinearSVCModel] + model.transform(rawData) + case "Trees" => + modelSelection match { + case "classifier" => + val model = + 
bestModel.model.asInstanceOf[DecisionTreeClassificationModel] + model.transform(rawData) + case "regressor" => + val model = + bestModel.model.asInstanceOf[DecisionTreeRegressionModel] + model.transform(rawData) + } + } + + } + + @deprecated( + "This method will be removed and replaced with the standalone version in " + + "com.databricks.labs.automl.exploration.FeatureImportances in a future release." + ) + def exploreFeatureImportances(): FeatureImportanceReturn = { + + println( + "[DEPRECATION WARNING] - .exploreFeatureImportances() has been replaced by " + + "com.databricks.labs.automl.exploration.FeatureImportances . This method will be removed in the next release." + ) + + val payload = prepData() + + val cachedData = if (_featureImportancesConfig.dataPrepCachingFlag) { + payload.data.persist(StorageLevel.MEMORY_AND_DISK) + payload.data.count() + payload.data + } else { + payload.data + } + + if (_featureImportancesConfig.dataPrepCachingFlag) payload.data.count() + + val featureResults = new RandomForestFeatureImportance( + cachedData, + _featureImportancesConfig, + payload.modelType + ).setCutoffType(_featureImportancesConfig.featureImportanceCutoffType) + .setCutoffValue(_featureImportancesConfig.featureImportanceCutoffValue) + .runFeatureImportances(payload.fields) + + if (_featureImportancesConfig.dataPrepCachingFlag) cachedData.unpersist() + + FeatureImportanceReturn( + featureResults._1, + featureResults._2, + featureResults._3, + payload.modelType + ) + } + + @deprecated( + "This method will be removed and replaced with the standalone version in " + + "com.databricks.labs.automl.exploration.FeatureImportances in a future release." + ) + def runWithFeatureCulling(): FeatureImportanceOutput = { + + println( + "[DEPRECATION WARNING] - .runWithFeatureCulling() has been replaced by " + + "com.databricks.labs.automl.exploration.FeatureImportances . This method will be removed in the next release." 
+ ) + + // Get the Feature Importances + + val featureImportanceResults = exploreFeatureImportances() + + val selectableFields = featureImportanceResults.fields :+ _featureImportancesConfig.labelCol + + println( + s"Feature Selected: ${featureImportanceResults.fields.mkString(", ")}" + ) + + val dataSubset = df.select(selectableFields.map(col): _*) + + if (_featureImportancesConfig.dataPrepCachingFlag) { + dataSubset.persist(StorageLevel.MEMORY_AND_DISK) + dataSubset.count + } + + val runResults = + new AutomationRunner(dataSubset).setMainConfig(_mainConfig).run() + + if (_mainConfig.dataPrepCachingFlag) dataSubset.unpersist() + + new FeatureImportanceOutput( + featureImportanceResults.data, + mlFlowOutput = runResults.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + runResults.modelReport + override def generationReport: Array[GenerationalReport] = + runResults.generationReport + override def modelReportDataFrame: DataFrame = + runResults.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + runResults.generationReportDataFrame + } + + } + + @deprecated( + "This method will be removed and replaced with the standalone version in " + + "com.databricks.labs.automl.exploration.FeatureImportances in a future release." + ) + def runFeatureCullingWithPrediction(): FeatureImportancePredictionOutput = { + + println( + "[DEPRECATION WARNING] - .runFeatureCullingWithPrediction() has been replaced by " + + "com.databricks.labs.automl.exploration.FeatureImportances . This method will be removed in the next release." 
+ ) + + val featureImportanceResults = exploreFeatureImportances() + + val selectableFields = featureImportanceResults.fields :+ _mainConfig.labelCol + + println( + s"Features Selected: ${featureImportanceResults.fields.mkString(", ")}" + ) + + val dataSubset = df.select(selectableFields.map(col): _*) + + if (_mainConfig.dataPrepCachingFlag) { + dataSubset.persist(StorageLevel.MEMORY_AND_DISK) + dataSubset.count + } + + val runner = new AutomationRunner(dataSubset).setMainConfig(_mainConfig) + + val payload = runner.prepData() + + val runResults = runner.executeTuning(payload) + + if (_mainConfig.dataPrepCachingFlag) dataSubset.unpersist() + + val cleanedData = _mainConfig.geneticConfig.trainSplitMethod match { + case "kSample" => + runResults.rawData + .filter( + col(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) === false + ) + .drop(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + case _ => runResults.rawData + } + + val predictedData = predictFromBestModel( + runResults.modelReport, + cleanedData, + runResults.modelSelection + ) + + if (_mainConfig.dataPrepCachingFlag) runResults.rawData.unpersist() + + new FeatureImportancePredictionOutput( + featureImportances = featureImportanceResults.data, + predictionData = predictedData, + mlFlowOutput = runResults.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + runResults.modelReport + override def generationReport: Array[GenerationalReport] = + runResults.generationReport + override def modelReportDataFrame: DataFrame = + runResults.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + runResults.generationReportDataFrame + } + + } + + def generateDecisionSplits(): TreeSplitReport = { + + val payload = prepData() + + new DecisionTreeSplits(payload.data, _treeSplitsConfig, payload.modelType) + .runTreeSplitAnalysis(payload.fields) + + } + + def run(): AutomationOutput = { + + val tunerResult = executeTuning(prepData()) + + if 
(_mainConfig.dataPrepCachingFlag) tunerResult.rawData.unpersist() + + new AutomationOutput(mlFlowOutput = tunerResult.mlFlowOutput) { + override def modelReport: Array[GenericModelReturn] = + tunerResult.modelReport + override def generationReport: Array[GenerationalReport] = + tunerResult.generationReport + override def modelReportDataFrame: DataFrame = + tunerResult.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + tunerResult.generationReportDataFrame + } + + } + + def runWithPrediction(): PredictionOutput = { + + val tunerResult = executeTuning(prepData()) + + val cleanedData = _mainConfig.geneticConfig.trainSplitMethod match { + case "kSample" => + tunerResult.rawData + .filter( + col(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) === false + ) + .drop(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + case _ => tunerResult.rawData + } + + val predictedData = predictFromBestModel( + tunerResult.modelReport, + cleanedData, + tunerResult.modelSelection + ) + + if (_mainConfig.dataPrepCachingFlag) tunerResult.rawData.unpersist() + + new PredictionOutput( + dataWithPredictions = predictedData, + mlFlowOutput = tunerResult.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + tunerResult.modelReport + override def generationReport: Array[GenerationalReport] = + tunerResult.generationReport + override def modelReportDataFrame: DataFrame = + tunerResult.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + tunerResult.generationReportDataFrame + } + + } + + def runWithConfusionReport(): ConfusionOutput = { + val predictionPayload = runWithPrediction() + val confusionData = predictionPayload.dataWithPredictions + .select("prediction", _mainConfig.labelCol) + .groupBy("prediction", _mainConfig.labelCol) + .agg(count("*").alias("count")) + + new ConfusionOutput( + predictionData = predictionPayload.dataWithPredictions, + confusionData = confusionData, + mlFlowOutput = 
predictionPayload.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + predictionPayload.modelReport + override def generationReport: Array[GenerationalReport] = + predictionPayload.generationReport + override def modelReportDataFrame: DataFrame = + predictionPayload.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + predictionPayload.generationReportDataFrame + } + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/ManualRunner.scala b/src/main/scala/com/databricks/labs/automl/ManualRunner.scala new file mode 100644 index 00000000..0f0edb30 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/ManualRunner.scala @@ -0,0 +1,169 @@ +package com.databricks.labs.automl + +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.reports.{ + DecisionTreeSplits, + RandomForestFeatureImportance +} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, count} + +class ManualRunner(dataPayload: DataGeneration) + extends AutomationRunner(dataPayload.data) { + + override def exploreFeatureImportances(): FeatureImportanceReturn = { + val featureResults = new RandomForestFeatureImportance( + dataPayload.data, + _featureImportancesConfig, + dataPayload.modelType + ).setCutoffType(_mainConfig.featureImportanceCutoffType) + .setCutoffValue(_mainConfig.featureImportanceCutoffValue) + .runFeatureImportances(dataPayload.fields) + + params.FeatureImportanceReturn( + featureResults._1, + featureResults._2, + featureResults._3, + dataPayload.modelType + ) + } + + override def run(): AutomationOutput = { + + val tunerResult = executeTuning(dataPayload) + + new AutomationOutput(mlFlowOutput = tunerResult.mlFlowOutput) { + override def modelReport: Array[GenericModelReturn] = + tunerResult.modelReport + override def generationReport: Array[GenerationalReport] = + tunerResult.generationReport + override def modelReportDataFrame: DataFrame = + 
tunerResult.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + tunerResult.generationReportDataFrame + } + + } + + override def generateDecisionSplits(): TreeSplitReport = { + + new DecisionTreeSplits( + dataPayload.data, + _treeSplitsConfig, + dataPayload.modelType + ).runTreeSplitAnalysis(dataPayload.fields) + + } + + override def runWithFeatureCulling(): FeatureImportanceOutput = { + + val featureImportanceResults = exploreFeatureImportances() + val selectableFields = featureImportanceResults.fields :+ _mainConfig.labelCol + + val dataSubset = dataPayload.data.select(selectableFields.map(col): _*) + val runResults = + new AutomationRunner(dataSubset).setMainConfig(_mainConfig).run() + + new FeatureImportanceOutput( + featureImportanceResults.data, + mlFlowOutput = runResults.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + runResults.modelReport + override def generationReport: Array[GenerationalReport] = + runResults.generationReport + override def modelReportDataFrame: DataFrame = + runResults.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + runResults.generationReportDataFrame + } + } + + override def runFeatureCullingWithPrediction() + : FeatureImportancePredictionOutput = { + + val featureImportanceResults = exploreFeatureImportances() + val selectableFields = featureImportanceResults.fields :+ _mainConfig.labelCol + + val dataSubset = dataPayload.data.select(selectableFields.map(col): _*) + val payload = + DataGeneration(dataSubset, selectableFields, dataPayload.modelType) + + val runResults = new AutomationRunner(dataSubset) + .setMainConfig(_mainConfig) + .executeTuning(payload) + val predictedData = predictFromBestModel( + runResults.modelReport, + runResults.rawData, + runResults.modelSelection + ) + + new FeatureImportancePredictionOutput( + featureImportances = featureImportanceResults.data, + predictionData = predictedData, + mlFlowOutput = runResults.mlFlowOutput + ) 
{ + override def modelReport: Array[GenericModelReturn] = + runResults.modelReport + override def generationReport: Array[GenerationalReport] = + runResults.generationReport + override def modelReportDataFrame: DataFrame = + runResults.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + runResults.generationReportDataFrame + } + } + + override def runWithPrediction(): PredictionOutput = { + + val tunerResult = executeTuning(dataPayload) + + val predictedData = predictFromBestModel( + tunerResult.modelReport, + tunerResult.rawData, + tunerResult.modelSelection + ) + + new PredictionOutput( + dataWithPredictions = predictedData, + mlFlowOutput = tunerResult.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + tunerResult.modelReport + override def generationReport: Array[GenerationalReport] = + tunerResult.generationReport + override def modelReportDataFrame: DataFrame = + tunerResult.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + tunerResult.generationReportDataFrame + } + + } + + override def runWithConfusionReport(): ConfusionOutput = { + + val predictionPayload = runWithPrediction() + + val confusionData = predictionPayload.dataWithPredictions + .select("prediction", _labelCol) + .groupBy("prediction", _labelCol) + .agg(count("*").alias("count")) + + new ConfusionOutput( + predictionData = predictionPayload.dataWithPredictions, + confusionData = confusionData, + mlFlowOutput = predictionPayload.mlFlowOutput + ) { + override def modelReport: Array[GenericModelReturn] = + predictionPayload.modelReport + override def generationReport: Array[GenerationalReport] = + predictionPayload.generationReport + override def modelReportDataFrame: DataFrame = + predictionPayload.modelReportDataFrame + override def generationReportDataFrame: DataFrame = + predictionPayload.generationReportDataFrame + } + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/Runner.scala 
b/src/main/scala/com/databricks/labs/automl/Runner.scala new file mode 100644 index 00000000..b838b5e3 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/Runner.scala @@ -0,0 +1,50 @@ +package com.databricks.labs.automl + +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.pipeline.inference.PipelineModelInference +import com.databricks.labs.automl.params.MLFlowConfig +import org.apache.spark.ml.PipelineModel +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.functions._ + +object Runner extends App { + + val spark = SparkSession.builder() + .master("local") + .getOrCreate() + + spark.sparkContext.setLogLevel("ERROR") +// spark.conf.set("spark.databricks.service.client.checkDeps", false) + import spark.implicits._ + val sc = spark.sparkContext + sc.addJar("/Users/danieltomes/Dev/gitProjects/Databricks---AutoML--providentia/target/scala-2.11/automatedml_2.11-0.6.1.jar") + + val df = spark.table("tke_features.cost_features") + .withColumn("cancellation_term", when('cancellation_term.isNull, "NA").otherwise('cancellation_term)) + .withColumn("TARGET", lit("HOLD")) + .drop("ch_start_snap_yr") + .limit(200) + .na.fill("NA") + + val engOvrCgfs = Map( + "labelCol" -> "TARGET", + "fieldsToIgnoreInVector" -> Array("UnitID"), + "scoringMetric" -> "f1", + "naFillFlag" -> true, + "varianceFilterFlag" -> true, + "outlierFilterFlag" -> false, + "pearsonFilterFlag" -> false, + "covarianceFilterFlag" -> false, + "oneHotEncodeFlag" -> false, + "scalingFlag" -> false, + "dataPrepCachingFlag" -> false, + "mlFlowLoggingFlag" -> false, + "mlFlowLogArtifactsFlag" -> false + ) + + val engConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", engOvrCgfs) + val featureEngPipelineModel = FamilyRunner(df, Array(engConfig)).generateFeatureEngineeredPipeline(verbose=true)("XGBoost") + val featurizedDF = 
featureEngPipelineModel.transform(df) + featurizedDF.show() +} diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/BooleanFieldFillException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/BooleanFieldFillException.scala new file mode 100644 index 00000000..3a5b2a1d --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/BooleanFieldFillException.scala @@ -0,0 +1,13 @@ +package com.databricks.labs.automl.exceptions + +case class BooleanFieldFillException( + private val fieldName: String, + private val conversionMode: String, + private val allowableConversionModes: Array[String], + cause: Throwable = None.orNull +) extends RuntimeException( + s"The boolean fill type " + + s"specified: $conversionMode is not in the allowable list of supported modes: ${allowableConversionModes + .mkString(", ")} for field $fieldName", + cause + ) diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/DateFeatureConversionException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/DateFeatureConversionException.scala new file mode 100644 index 00000000..e961c1b6 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/DateFeatureConversionException.scala @@ -0,0 +1,6 @@ +package com.databricks.labs.automl.exceptions + +final case class DateFeatureConversionException( + private val dateFields: Array[String] = Array.empty, + private val cause: Throwable = None.orNull) + extends FeatureConversionException(dateFields, "Date") \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/FeatureConversionException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/FeatureConversionException.scala new file mode 100644 index 00000000..370a0095 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/FeatureConversionException.scala @@ -0,0 +1,6 @@ +package com.databricks.labs.automl.exceptions + +abstract class FeatureConversionException(private val 
fields: Array[String] = Array.empty, + private val fieldsType: String, + private val cause: Throwable = None.orNull) + extends RuntimeException(s"Not all $fieldsType features [[ ${fields.toList} ]] have been converted into vectorizable fields", cause) \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/FeatureCorrelationException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/FeatureCorrelationException.scala new file mode 100644 index 00000000..163f0888 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/FeatureCorrelationException.scala @@ -0,0 +1,11 @@ +package com.databricks.labs.automl.exceptions + +final case class FeatureCorrelationException( + private val originalFeatures: Array[String], + private val filteredFeatures: Array[String], + cause: Throwable = None.orNull +) extends RuntimeException( + s"Feature Correlation Detection has filtered out every field from the feature candidates (feature count: " + + s"${originalFeatures.length}). Filtered Fields: ${filteredFeatures.mkString(", ")}", + cause + ) diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/LightGBMModelTypeException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/LightGBMModelTypeException.scala new file mode 100644 index 00000000..45e5990e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/LightGBMModelTypeException.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.exceptions + +final case class LightGBMModelTypeException( + private val modelType: String, + private val lightGBMType: String, + lightGBMRegressorTypes: Array[String], + lightGBMClassifierTypes: Array[String], + cause: Throwable = None.orNull +) extends RuntimeException( + s"The model type: [$modelType] and light GBM type: [$lightGBMType] are not a supported combination. 
Supported " + + s"types for [$modelType] are: [${modelType match { + case "regressor" => lightGBMRegressorTypes.mkString(", ") + case "classifier" => lightGBMClassifierTypes.mkString(", ") + }}]", + cause + ) diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/MlFlowValidationException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/MlFlowValidationException.scala new file mode 100644 index 00000000..d2045b7c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/MlFlowValidationException.scala @@ -0,0 +1,7 @@ +package com.databricks.labs.automl.exceptions + + +final case class MlFlowValidationException(private val message: String = "", + private val cause: Throwable = None.orNull) extends RuntimeException(message, cause) + + diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/ModelingTypeException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/ModelingTypeException.scala new file mode 100644 index 00000000..be0783da --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/ModelingTypeException.scala @@ -0,0 +1,12 @@ +package com.databricks.labs.automl.exceptions + +final case class ModelingTypeException( + private val modelType: String, + private val allowableModelTypes: Array[String], + cause: Throwable = None.orNull +) extends RuntimeException( + s"The model type " + + s"specified: $modelType is not in the allowable list of supported models: ${allowableModelTypes + .mkString(", ")}", + cause + ) diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/PipelineExecutionException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/PipelineExecutionException.scala new file mode 100644 index 00000000..e5a7b8d7 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/PipelineExecutionException.scala @@ -0,0 +1,12 @@ +package com.databricks.labs.automl.exceptions + +/** + * @author Jas Bali + * @since 0.6.1 + * This exception is thrown when there 
is a failure in the execution of a train pipeline + * + * @param pipelineId: Unique identifier for a pipeline run + * @param cause: Root exception for pipeline execution failure + */ +final case class PipelineExecutionException(private val pipelineId: String, private val cause: Throwable) + extends RuntimeException(s"Pipeline with ID $pipelineId failed", cause) diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/StringFeatureConversionException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/StringFeatureConversionException.scala new file mode 100644 index 00000000..8d503066 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/StringFeatureConversionException.scala @@ -0,0 +1,6 @@ +package com.databricks.labs.automl.exceptions + +final case class StringFeatureConversionException( + private val stringFields: Array[String] = Array.empty, + private val cause: Throwable = None.orNull) + extends FeatureConversionException(stringFields, "String") \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/ThreadPoolsBySize.scala b/src/main/scala/com/databricks/labs/automl/exceptions/ThreadPoolsBySize.scala new file mode 100644 index 00000000..dd7a53e7 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/ThreadPoolsBySize.scala @@ -0,0 +1,31 @@ +package com.databricks.labs.automl.exceptions + +import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit} + +import scala.concurrent.{ExecutionContext, ExecutionContextExecutor} + +/** + * @author Jas Bali + * Provides thread pools by size and can be used with [[ExecutionContextExecutor]] to ensure a true Thread Pool + * is created, so that threads are reused and limited + */ +object ThreadPoolsBySize { + + private lazy val SMALL_RUNNING_TASKS_TP_CORE_SIZE = 2 + private lazy val SMALL_RUNNING_TASKS_TP_INITIAL_MAX_SIZE = 20 + + private lazy val SMALL_RUNNING_TASKS_TP = new ThreadPoolExecutor( + 
SMALL_RUNNING_TASKS_TP_CORE_SIZE, SMALL_RUNNING_TASKS_TP_INITIAL_MAX_SIZE, + 15, TimeUnit.MINUTES, new ArrayBlockingQueue[Runnable](100)) + + lazy val SMALL_RUNNING_TASKS_TP_EC: ExecutionContextExecutor = ExecutionContext.fromExecutor(SMALL_RUNNING_TASKS_TP) + + // Add more thread pools as needed + + def withScalaExecutionContext(parallelism: Int = SMALL_RUNNING_TASKS_TP_INITIAL_MAX_SIZE): ExecutionContextExecutor = { + if(parallelism > SMALL_RUNNING_TASKS_TP_INITIAL_MAX_SIZE ) { + SMALL_RUNNING_TASKS_TP.setMaximumPoolSize(parallelism) + } + ExecutionContext.fromExecutor(SMALL_RUNNING_TASKS_TP) + } +} diff --git a/src/main/scala/com/databricks/labs/automl/exceptions/TimeFeatureConversionException.scala b/src/main/scala/com/databricks/labs/automl/exceptions/TimeFeatureConversionException.scala new file mode 100644 index 00000000..fe358062 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exceptions/TimeFeatureConversionException.scala @@ -0,0 +1,6 @@ +package com.databricks.labs.automl.exceptions + +final case class TimeFeatureConversionException( + private val timeFields: Array[String] = Array.empty, + private val cause: Throwable = None.orNull) + extends FeatureConversionException(timeFields, "Time") \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/executor/AutomationConfig.scala b/src/main/scala/com/databricks/labs/automl/executor/AutomationConfig.scala new file mode 100644 index 00000000..d167f789 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/AutomationConfig.scala @@ -0,0 +1,2843 @@ +package com.databricks.labs.automl.executor + +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.sanitize.SanitizerDefaults + +trait AutomationConfig extends Defaults with SanitizerDefaults { + + var _modelingFamily: String = _defaultModelingFamily + + var _labelCol: String = _defaultLabelCol + + var _featuresCol: String = _defaultFeaturesCol + + var _naFillFlag: Boolean = 
_defaultNAFillFlag + + var _varianceFilterFlag: Boolean = _defaultVarianceFilterFlag + + var _outlierFilterFlag: Boolean = _defaultOutlierFilterFlag + + var _pearsonFilterFlag: Boolean = _defaultPearsonFilterFlag + + var _covarianceFilterFlag: Boolean = _defaultCovarianceFilterFlag + + var _oneHotEncodeFlag: Boolean = _defaultOneHotEncodeFlag + + var _scalingFlag: Boolean = _defaultScalingFlag + + var _featureInteractionFlag: Boolean = _defaultFeatureInteractionFlag + + var _dataPrepCachingFlag: Boolean = _defaultDataPrepCachingFlag + + var _dataPrepParallelism: Int = _defaultDataPrepParallelism + + var _numericBoundaries: Map[String, (Double, Double)] = + _rfDefaultNumBoundaries + + var _stringBoundaries: Map[String, List[String]] = _rfDefaultStringBoundaries + + var _scoringMetric: String = _scoringDefaultClassifier + + var _scoringOptimizationStrategy: String = + _scoringOptimizationStrategyClassifier + + var _numericFillStat: String = _fillConfigDefaults.numericFillStat + + var _characterFillStat: String = _fillConfigDefaults.characterFillStat + + var _dateTimeConversionType: String = _defaultDateTimeConversionType + + var _fieldsToIgnoreInVector: Array[String] = _defaultFieldsToIgnoreInVector + + var _naFillFilterPrecision: Double = _fillConfigDefaults.filterPrecision + + var _categoricalNAFillMap: Map[String, String] = + _fillConfigDefaults.categoricalNAFillMap + + var _numericNAFillMap: Map[String, AnyVal] = + _fillConfigDefaults.numericNAFillMap + + var _characterNABlanketFillValue: String = + _fillConfigDefaults.characterNABlanketFillValue + + var _numericNABlanketFillValue: Double = + _fillConfigDefaults.numericNABlanketFillValue + + var _naFillMode: String = _fillConfigDefaults.naFillMode + + var _cardinalitySwitchFlag: Boolean = _fillConfigDefaults.cardinalitySwitch + + var _cardinalityType: String = _fillConfigDefaults.cardinalityType + + var _cardinalityLimit: Int = _fillConfigDefaults.cardinalityLimit + + var _cardinalityPrecision: Double = 
_fillConfigDefaults.cardinalityPrecision + + var _cardinalityCheckMode: String = _fillConfigDefaults.cardinalityCheckMode + + var _modelSelectionDistinctThreshold: Int = + _fillConfigDefaults.modelSelectionDistinctThreshold + + var _fillConfig: FillConfig = _fillConfigDefaults + + var _filterBounds: String = _outlierConfigDefaults.filterBounds + + var _lowerFilterNTile: Double = _outlierConfigDefaults.lowerFilterNTile + + var _upperFilterNTile: Double = _outlierConfigDefaults.upperFilterNTile + + var _filterPrecision: Double = _outlierConfigDefaults.filterPrecision + + var _continuousDataThreshold: Int = + _outlierConfigDefaults.continuousDataThreshold + + var _fieldsToIgnore: Array[String] = _outlierConfigDefaults.fieldsToIgnore + + var _outlierConfig: OutlierConfig = _outlierConfigDefaults + + var _pearsonFilterStatistic: String = _pearsonConfigDefaults.filterStatistic + + var _pearsonFilterDirection: String = _pearsonConfigDefaults.filterDirection + + var _pearsonFilterManualValue: Double = + _pearsonConfigDefaults.filterManualValue + + var _pearsonFilterMode: String = _pearsonConfigDefaults.filterMode + + var _pearsonAutoFilterNTile: Double = _pearsonConfigDefaults.autoFilterNTile + + var _pearsonConfig: PearsonConfig = _pearsonConfigDefaults + + var _correlationCutoffLow: Double = + _covarianceConfigDefaults.correlationCutoffLow + + var _correlationCutoffHigh: Double = + _covarianceConfigDefaults.correlationCutoffHigh + + var _covarianceConfig: CovarianceConfig = _covarianceConfigDefaults + + var _scalerType: String = defaultScalerType + + var _scalerMin: Double = defaultScalerMin + + var _scalerMax: Double = defaultScalerMax + + var _standardScalerMeanFlag: Boolean = defaultStandardScalerMeanFlag + + var _standardScalerStdDevFlag: Boolean = defaultStandardScalerStdDevFlag + + var _pNorm: Double = defaultPNorm + + var _scalingConfig: ScalingConfig = _scalingConfigDefaults + + var _featureInteractionConfig: FeatureInteractionConfig = + 
_defaultFeatureInteractionConfig + + var _parallelism: Int = _geneticTunerDefaults.parallelism + + var _kFold: Int = _geneticTunerDefaults.kFold + + var _trainPortion: Double = _geneticTunerDefaults.trainPortion + + var _trainSplitMethod: String = _geneticTunerDefaults.trainSplitMethod + + var _kSampleConfig: KSampleConfig = _geneticTunerDefaults.kSampleConfig + + var _syntheticCol: String = _geneticTunerDefaults.kSampleConfig.syntheticCol + + var _kGroups: Int = _geneticTunerDefaults.kSampleConfig.kGroups + + var _kMeansMaxIter: Int = _geneticTunerDefaults.kSampleConfig.kMeansMaxIter + + var _kMeansTolerance: Double = + _geneticTunerDefaults.kSampleConfig.kMeansTolerance + + var _kMeansDistanceMeasurement: String = + _geneticTunerDefaults.kSampleConfig.kMeansDistanceMeasurement + + var _kMeansSeed: Long = _geneticTunerDefaults.kSampleConfig.kMeansSeed + + var _kMeansPredictionCol: String = + _geneticTunerDefaults.kSampleConfig.kMeansPredictionCol + + var _lshHashTables: Int = _geneticTunerDefaults.kSampleConfig.lshHashTables + + var _lshSeed: Long = _geneticTunerDefaults.kSampleConfig.lshSeed + + var _lshOutputCol: String = _geneticTunerDefaults.kSampleConfig.lshOutputCol + + var _quorumCount: Int = _geneticTunerDefaults.kSampleConfig.quorumCount + + var _minimumVectorCountToMutate: Int = + _geneticTunerDefaults.kSampleConfig.minimumVectorCountToMutate + + var _vectorMutationMethod: String = + _geneticTunerDefaults.kSampleConfig.vectorMutationMethod + + var _mutationMode: String = _geneticTunerDefaults.kSampleConfig.mutationMode + + var _mutationValue: Double = _geneticTunerDefaults.kSampleConfig.mutationValue + + var _labelBalanceMode: String = + _geneticTunerDefaults.kSampleConfig.labelBalanceMode + + var _cardinalityThreshold: Int = + _geneticTunerDefaults.kSampleConfig.cardinalityThreshold + + var _numericRatio: Double = _geneticTunerDefaults.kSampleConfig.numericRatio + + var _numericTarget: Int = _geneticTunerDefaults.kSampleConfig.numericTarget + + var 
_outputDfRepartitionScaleFactor: Int = + _geneticTunerDefaults.kSampleConfig.outputDfRepartitionScaleFactor + + var _trainSplitChronologicalColumn: String = + _geneticTunerDefaults.trainSplitChronologicalColumn + + var _trainSplitChronologicalRandomPercentage: Double = + _geneticTunerDefaults.trainSplitChronologicalRandomPercentage + + var _trainSplitColumnSet: Boolean = false + + var _seed: Long = _geneticTunerDefaults.seed + + var _firstGenerationGenePool: Int = + _geneticTunerDefaults.firstGenerationGenePool + + var _numberOfGenerations: Int = _geneticTunerDefaults.numberOfGenerations + + var _numberOfParentsToRetain: Int = + _geneticTunerDefaults.numberOfParentsToRetain + + var _numberOfMutationsPerGeneration: Int = + _geneticTunerDefaults.numberOfMutationsPerGeneration + + var _geneticMixing: Double = _geneticTunerDefaults.geneticMixing + + var _generationalMutationStrategy: String = + _geneticTunerDefaults.generationalMutationStrategy + + var _fixedMutationValue: Int = _geneticTunerDefaults.fixedMutationValue + + var _mutationMagnitudeMode: String = + _geneticTunerDefaults.mutationMagnitudeMode + + var _modelSeedMap: Map[String, Any] = Map.empty + + var _modelSeedSetStatus: Boolean = false + + var _firstGenerationConfig: FirstGenerationConfig = + _defaultFirstGenerationConfig + + var _firstGenerationPermutationCount: Int = + _geneticTunerDefaults.initialGenerationConfig.permutationCount + + var _firstGenerationIndexMixingMode: String = + _geneticTunerDefaults.initialGenerationConfig.indexMixingMode + + var _firstGenerationArraySeed: Long = + _geneticTunerDefaults.initialGenerationConfig.arraySeed + + var _hyperSpaceInference: Boolean = _defaultHyperSpaceInference + + var _hyperSpaceInferenceCount: Int = _defaultHyperSpaceInferenceCount + + var _hyperSpaceModelType: String = _defaultHyperSpaceModelType + + var _hyperSpaceModelCount: Int = _defaultHyperSpaceModelCount + + var _firstGenerationMode: String = _defaultInitialGenerationMode + + var 
_deltaCacheBackingDirectory: String = + _geneticTunerDefaults.deltaCacheBackingDirectory + + var _splitCachingStrategy: String = _geneticTunerDefaults.splitCachingStrategy + + var _deltaCacheBackingDirectoryRemovalFlag: Boolean = + _geneticTunerDefaults.deltaCacheBackingDirectoryRemovalFlag + + var _geneticConfig: GeneticConfig = _geneticTunerDefaults + + var _mainConfig: MainConfig = _mainConfigDefaults + + var _featureImportancesConfig: MainConfig = _featureImportancesDefaults + + var _treeSplitsConfig: MainConfig = _treeSplitDefaults + + var _mlFlowConfig: MLFlowConfig = _mlFlowConfigDefaults + + var _mlFlowLoggingFlag: Boolean = _defaultMlFlowLoggingFlag + + var _mlFlowArtifactsFlag: Boolean = _defaultMlFlowArtifactsFlag + + var _mlFlowTrackingURI: String = _mlFlowConfigDefaults.mlFlowTrackingURI + + var _mlFlowExperimentName: String = _mlFlowConfigDefaults.mlFlowExperimentName + + var _mlFlowAPIToken: String = _mlFlowConfigDefaults.mlFlowAPIToken + + var _mlFlowModelSaveDirectory: String = + _mlFlowConfigDefaults.mlFlowModelSaveDirectory + + var _mlFlowLoggingMode: String = _mlFlowConfigDefaults.mlFlowLoggingMode + + var _mlFlowBestSuffix: String = _mlFlowConfigDefaults.mlFlowBestSuffix + + var _mlFlowCustomRunTags: Map[String, String] = + _mlFlowConfigDefaults.mlFlowCustomRunTags + + var _autoStoppingFlag: Boolean = _defaultAutoStoppingFlag + + var _autoStoppingScore: Double = _defaultAutoStoppingScore + + var _featureImportanceCutoffType: String = _defaultFeatureImportanceCutoffType + + var _featureImportanceCutoffValue: Double = + _defaultFeatureImportanceCutoffValue + + var _evolutionStrategy: String = _geneticTunerDefaults.evolutionStrategy + + var _continuousEvolutionImprovementThreshold: Int = + _geneticTunerDefaults.continuousEvolutionImprovementThreshold + + var _geneticMBORegressorType: String = + _geneticTunerDefaults.geneticMBORegressorType + + var _geneticMBOCandidateFactor: Int = + _geneticTunerDefaults.geneticMBOCandidateFactor + + var 
_continuousEvolutionMaxIterations: Int = + _geneticTunerDefaults.continuousEvolutionMaxIterations + + var _continuousEvolutionStoppingScore: Double = + _geneticTunerDefaults.continuousEvolutionStoppingScore + + var _continuousEvolutionParallelism: Int = + _geneticTunerDefaults.continuousEvolutionParallelism + + var _continuousEvolutionMutationAggressiveness: Int = + _geneticTunerDefaults.continuousEvolutionMutationAggressiveness + + var _continuousEvolutionGeneticMixing: Double = + _geneticTunerDefaults.continuousEvolutionGeneticMixing + + var _continuousEvolutionRollingImprovementCount: Int = + _geneticTunerDefaults.continuousEvolutionRollingImprovementCount + + var _inferenceConfigSaveLocation: String = _inferenceConfigSaveLocationDefault + + var _dataReductionFactor: Double = _defaultDataReductionFactor + + var _pipelineDebugFlag: Boolean = _defaultPipelineDebugFlag + + var _featureInteractionRetentionMode: String = + _defaultFeatureInteractionConfig.retentionMode + var _featureInteractionContinuousDiscretizerBucketCount: Int = + _defaultFeatureInteractionConfig.continuousDiscretizerBucketCount + var _featureInteractionParallelism: Int = + _defaultFeatureInteractionConfig.parallelism + var _featureInteractionTargetInteractionPercentage: Double = + _defaultFeatureInteractionConfig.targetInteractionPercentage + + var _pipelineId: String = _defaultPipelineId + + def setPipelineId(value: String): this.type = { + _pipelineId = value + this + } + + private def setConfigs(): this.type = { + setMainConfig() + } + + def setModelingFamily(value: String): this.type = { + _modelingFamily = value + _numericBoundaries = value match { + case "RandomForest" => _rfDefaultNumBoundaries + case "MLPC" => _mlpcDefaultNumBoundaries + case "Trees" => _treesDefaultNumBoundaries + case "GBT" => _gbtDefaultNumBoundaries + case "LinearRegression" => _linearRegressionDefaultNumBoundaries + case "LogisticRegression" => _logisticRegressionDefaultNumBoundaries + case "SVM" => 
_svmDefaultNumBoundaries + case "XGBoost" => _xgboostDefaultNumBoundaries + case "gbmBinary" | "gbmMulti" | "gbmMultiOVA" | "gbmHuber" | "gbmFair" | + "gbmLasso" | "gbmRidge" | "gbmPoisson" | "gbmQuantile" | "gbmMape" | + "gbmTweedie" | "gbmGamma" => + _lightGBMDefaultNumBoundaries + case _ => + throw new IllegalArgumentException( + s"$value is an unsupported Model Type" + ) + } + _stringBoundaries = value match { + case "RandomForest" => _rfDefaultStringBoundaries + case "MLPC" => _mlpcDefaultStringBoundaries + case "Trees" => _treesDefaultStringBoundaries + case "GBT" => _gbtDefaultStringBoundaries + case "LinearRegression" => _linearRegressionDefaultStringBoundaries + case "LogisticRegression" => _logisticRegressionDefaultStringBoundaries + case "SVM" => _svmDefaultStringBoundaries + case "XGBoost" => Map() + case "gbmBinary" | "gbmMulti" | "gbmMultiOVA" | "gbmHuber" | "gbmFair" | + "gbmLasso" | "gbmRidge" | "gbmPoisson" | "gbmQuantile" | "gbmMape" | + "gbmTweedie" | "gbmGamma" => + _lightGBMDefaultStringBoundaries + case _ => + throw new IllegalArgumentException( + s"$value is an unsupported Model Type" + ) + } + setConfigs() + this + } + + def setLabelCol(value: String): this.type = { + _labelCol = value + setConfigs() + this + } + + def setFeaturesCol(value: String): this.type = { + _featuresCol = value + setConfigs() + this + } + + def naFillOn(): this.type = { + _naFillFlag = true + setConfigs() + this + } + + def naFillOff(): this.type = { + _naFillFlag = false + setConfigs() + this + } + + def varianceFilterOn(): this.type = { + _varianceFilterFlag = true + setConfigs() + this + } + + def varianceFilterOff(): this.type = { + _varianceFilterFlag = false + setConfigs() + this + } + + def outlierFilterOn(): this.type = { + _outlierFilterFlag = true + setConfigs() + this + } + + def outlierFilterOff(): this.type = { + _outlierFilterFlag = false + setConfigs() + this + } + + def pearsonFilterOn(): this.type = { + _pearsonFilterFlag = true + setConfigs() + this 
+ } + + def pearsonFilterOff(): this.type = { + _pearsonFilterFlag = false + setConfigs() + this + } + + def covarianceFilterOn(): this.type = { + _covarianceFilterFlag = true + setConfigs() + this + } + + def covarianceFilterOff(): this.type = { + _covarianceFilterFlag = false + setConfigs() + this + } + + def oneHotEncodingOn(): this.type = { + _oneHotEncodeFlag = true + setConfigs() + this + } + + def oneHotEncodingOff(): this.type = { + _oneHotEncodeFlag = false + setConfigs() + this + } + + def scalingOn(): this.type = { + _scalingFlag = true + setConfigs() + this + } + + def scalingOff(): this.type = { + _scalingFlag = false + setConfigs() + this + } + + def dataPrepCachingOn(): this.type = { + _dataPrepCachingFlag = true + setConfigs() + this + } + + def dataPrepCachingOff(): this.type = { + _dataPrepCachingFlag = false + setConfigs() + this + } + + def featureInteractionOn(): this.type = { + _featureInteractionFlag = true + setConfigs() + this + } + + def featureInteractionOff(): this.type = { + _featureInteractionFlag = false + setConfigs() + this + } + + /** + * Setter for defining the number of concurrent threads allocated to performing asynchronous data prep tasks within + * the feature engineering aspect of this application. + * @param value Int: A value that must be greater than zero. + * @note This value has an upper limit, depending on driver size, that will restrict the efficacy of the asynchronous + * tasks within the pool. Setting this too high may cause cluster instability. + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if a value less than or equal to zero is supplied. 
+ */ + @throws(classOf[IllegalArgumentException]) + def setDataPrepParallelism(value: Int): this.type = { + + require(value > 0, s"DataPrepParallelism must be greater than zero.") + _dataPrepParallelism = value + setConfigs() + this + } + + def setNumericBoundaries(value: Map[String, (Double, Double)]): this.type = { + _numericBoundaries = value + setConfigs() + this + } + + def setStringBoundaries(value: Map[String, List[String]]): this.type = { + _stringBoundaries = value + setConfigs() + this + } + + def setScoringMetric(value: String): this.type = { + val adjusted_value = value.toLowerCase + val matched_value = adjusted_value match { + case "f1" => "f1" + case "weightedprecision" => "weightedPrecision" + case "weightedrecall" => "weightedRecall" + case "accuracy" => "accuracy" + case "areaunderpr" => "areaUnderPR" + case "areaunderroc" => "areaUnderROC" + case "rmse" => "rmse" + case "mse" => "mse" + case "r2" => "r2" + case "mae" => "mae" + case _ => + throw new IllegalArgumentException( + s"Supplied Scoring Metric '${value}' is not supported. 
" + + s"Must be one of: f1, weightedPrecision, weightedRecall, accuracy, areaUnderPR, areaUnderROC, rmse, mse, r2, mae." + ) + } + _scoringMetric = matched_value + setConfigs() + this + } + + def setScoringOptimizationStrategy(value: String): this.type = { + require( + Array("minimize", "maximize").contains(value), + s"$value is not a member of allowed scoring optimizations: " + + s"'minimize' or 'maximize'" + ) + _scoringOptimizationStrategy = value + setConfigs() + this + } + + def setNumericFillStat(value: String): this.type = { + _numericFillStat = value + setFillConfig() + setConfigs() + this + } + + def setCharacterFillStat(value: String): this.type = { + _characterFillStat = value + setFillConfig() + setConfigs() + this + } + + def setDateTimeConversionType(value: String): this.type = { + _dateTimeConversionType = value + setConfigs() + this + } + + def setFieldsToIgnoreInVector(value: Array[String]): this.type = { + _fieldsToIgnoreInVector = value + if (_trainSplitColumnSet) + _fieldsToIgnoreInVector = _fieldsToIgnoreInVector :+ _trainSplitChronologicalColumn + setConfigs() + this + } + + /** + * Setter for defining the precision for calculating the model type as per the label column + * + * @note setting this value to zero (0) for a large regression problem will incur a long processing time and + * an expensive shuffle. + * @param value Double: Precision accuracy for approximate distinct calculation. + * @throws IllegalArgumentException If the value is outside of the allowable range of [0, 1] + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setNAFillFilterPrecision(value: Double): this.type = { + require( + value >= 0, + s"Filter Precision for NA Fill must be greater than or equal to 0." + ) + require( + value <= 1, + s"Filter Precision for NA Fill must be less than or equal to 1." 
+ ) + _naFillFilterPrecision = value + setFillConfig() + setConfigs() + this + } + + /** + * Setter for providing a map of [Column Name -> String Fill Value] for manual by-column overrides. Any non-specified + * fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values + * in non-numeric columns. + * + * @note if naFillMode is specified as using Map Fill modes, this setter or the numeric na fill map MUST be set. + * @note If fields are specified in here that are not part of the DataFrame's schema, an exception will be thrown. + * @param value Map[String, String]: Column Name as String -> Fill Value as String + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setCategoricalNAFillMap(value: Map[String, String]): this.type = { + _categoricalNAFillMap = value + setFillConfig() + setConfigs() + this + } + + /** + * Setter for providing a map of [Column Name -> AnyVal Fill Value] (must be numeric). Any non-specified + * fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values + * in numeric columns. + * + * @note if naFillMode is specified as using Map Fill modes, this setter or the categorical na fill map MUST be set. + * @note If fields are specified in here that are not part of the DataFrame's schema, an exception will be thrown. + * @param value Map[String, AnyVal]: Column Name as String -> Fill Numeric Type Value + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setNumericNAFillMap(value: Map[String, AnyVal]): this.type = { + _numericNAFillMap = value + setFillConfig() + setConfigs() + this + } + + /** + * Setter for providing a 'blanket override' value (fill all found categorical columns' missing values with this + * specified value). + * + * @param value String: A value to fill all categorical na values in the DataFrame with. 
+ * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setCharacterNABlanketFillValue(value: String): this.type = { + _characterNABlanketFillValue = value + setFillConfig() + setConfigs() + this + } + + /** + * Setter for providing a 'blanket override' value (fill all found numeric columns' missing values with this + * specified value) + * + * @param value Double: A value to fill all numeric na value in the DataFrame with. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setNumericNABlanketFillValue(value: Double): this.type = { + _numericNABlanketFillValue = value + setFillConfig() + setConfigs() + this + } + + /** + * Mode for na fill
+ * Available modes:
+ * auto : Stats-based na fill for fields. Usage of .setNumericFillStat and + * .setCharacterFillStat will inform the type of statistics that will be used to fill.
+ * mapFill : Custom by-column overrides to 'blanket fill' na values on a per-column + * basis. The categorical (string) fields are set via .setCategoricalNAFillMap while the + * numeric fields are set via .setNumericNAFillMap.
+ * blanketFillAll : Fills all fields based on the values specified by + * .setCharacterNABlanketFillValue and .setNumericNABlanketFillValue. All NA's for the + * appropriate types will be filled in accordingly throughout all columns.
+ * blanketFillCharOnly : Will use statistics to fill in numeric fields, but will replace + * all categorical character fields na values with a blanket fill value.
+ * blanketFillNumOnly : Will use statistics to fill in character fields, but will replace + * all numeric fields na values with a blanket value. + * + * @throws IllegalArgumentException if the mode specified is not supported. + * @param value String: Mode for NA Fill + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setNAFillMode(value: String): this.type = { + require( + _allowableNAFillModes.contains(value), + s"NA Fill Mode '$value' is not a supported mode. Must be one of: " + + s"${_allowableNAFillModes.mkString(", ")}" + ) + _naFillMode = value + setFillConfig() + setConfigs() + this + } + + def setModelSelectionDistinctThreshold(value: Int): this.type = { + _modelSelectionDistinctThreshold = value + setFillConfig() + setConfigs() + this + } + + def cardinalitySwitchOn(): this.type = { + _cardinalitySwitchFlag = true + setFillConfig() + setConfigs() + this + } + + def cardinalitySwitchOff(): this.type = { + _cardinalitySwitchFlag = false + setFillConfig() + setConfigs() + this + } + + def setCardinalitySwitch(value: Boolean): this.type = { + _cardinalitySwitchFlag = value + setFillConfig() + setConfigs() + this + } + + @throws(classOf[AssertionError]) + def setCardinalityType(value: String): this.type = { + assert( + allowableCardinalilties.contains(value), + s"Supplied CardinalityType '$value' is not in: " + + s"${allowableCardinalilties.mkString(", ")}" + ) + _cardinalityType = value + setFillConfig() + setConfigs() + this + } + + @throws(classOf[IllegalArgumentException]) + def setCardinalityLimit(value: Int): this.type = { + require(value > 0, s"Cardinality limit must be greater than 0") + _cardinalityLimit = value + setFillConfig() + setConfigs() + this + } + + @throws(classOf[IllegalArgumentException]) + def setCardinalityPrecision(value: Double): this.type = { + require(value >= 0.0, s"Precision must be greater than or equal to 0.") + require(value <= 1.0, s"Precision must be less than or equal to 
1.") + _cardinalityPrecision = value + setFillConfig() + setConfigs() + this + } + + @throws(classOf[AssertionError]) + def setCardinalityCheckMode(value: String): this.type = { + assert( + allowableCategoricalFilterModes.contains(value), + s"Supplied CardinalityCheckMode $value is not in: ${allowableCategoricalFilterModes.mkString(", ")}" + ) + _cardinalityCheckMode = value + setFillConfig() + setConfigs() + this + } + + private def setFillConfig(): this.type = { + _fillConfig = FillConfig( + numericFillStat = _numericFillStat, + characterFillStat = _characterFillStat, + modelSelectionDistinctThreshold = _modelSelectionDistinctThreshold, + cardinalitySwitch = _cardinalitySwitchFlag, + cardinalityType = _cardinalityType, + cardinalityLimit = _cardinalityLimit, + cardinalityPrecision = _cardinalityPrecision, + cardinalityCheckMode = _cardinalityCheckMode, + filterPrecision = _naFillFilterPrecision, + categoricalNAFillMap = _categoricalNAFillMap, + numericNAFillMap = _numericNAFillMap, + characterNABlanketFillValue = _characterNABlanketFillValue, + numericNABlanketFillValue = _numericNABlanketFillValue, + naFillMode = _naFillMode + ) + this + } + + def setFilterBounds(value: String): this.type = { + _filterBounds = value + setOutlierConfig() + setConfigs() + this + } + + def setLowerFilterNTile(value: Double): this.type = { + _lowerFilterNTile = value + setOutlierConfig() + setConfigs() + this + } + + def setUpperFilterNTile(value: Double): this.type = { + _upperFilterNTile = value + setOutlierConfig() + setConfigs() + this + } + + def setFilterPrecision(value: Double): this.type = { + _filterPrecision = value + setOutlierConfig() + setConfigs() + this + } + + def setContinuousDataThreshold(value: Int): this.type = { + _continuousDataThreshold = value + setOutlierConfig() + setConfigs() + this + } + + def setFieldsToIgnore(value: Array[String]): this.type = { + _fieldsToIgnore = value + setOutlierConfig() + setConfigs() + this + } + + private def setOutlierConfig(): 
this.type = { + _outlierConfig = OutlierConfig( + filterBounds = _filterBounds, + lowerFilterNTile = _lowerFilterNTile, + upperFilterNTile = _upperFilterNTile, + filterPrecision = _filterPrecision, + continuousDataThreshold = _continuousDataThreshold, + fieldsToIgnore = _fieldsToIgnore + ) + this + } + + def setPearsonFilterStatistic(value: String): this.type = { + _pearsonFilterStatistic = value + setPearsonConfig() + setConfigs() + this + } + + def setPearsonFilterDirection(value: String): this.type = { + _pearsonFilterDirection = value + setPearsonConfig() + setConfigs() + this + } + + def setPearsonFilterManualValue(value: Double): this.type = { + _pearsonFilterManualValue = value + setPearsonConfig() + setConfigs() + this + } + + def setPearsonFilterMode(value: String): this.type = { + _pearsonFilterMode = value + setPearsonConfig() + setConfigs() + this + } + + def setPearsonAutoFilterNTile(value: Double): this.type = { + _pearsonAutoFilterNTile = value + setPearsonConfig() + setConfigs() + this + } + + private def setPearsonConfig(): this.type = { + _pearsonConfig = PearsonConfig( + filterStatistic = _pearsonFilterStatistic, + filterDirection = _pearsonFilterDirection, + filterManualValue = _pearsonFilterManualValue, + filterMode = _pearsonFilterMode, + autoFilterNTile = _pearsonAutoFilterNTile + ) + this + } + + def setCorrelationCutoffLow(value: Double): this.type = { + _correlationCutoffLow = value + setCovarianceConfig() + setConfigs() + this + } + + def setCorrelationCutoffHigh(value: Double): this.type = { + _correlationCutoffHigh = value + setCovarianceConfig() + setConfigs() + this + } + + private def setCovarianceConfig(): this.type = { + _covarianceConfig = CovarianceConfig( + correlationCutoffLow = _correlationCutoffLow, + correlationCutoffHigh = _correlationCutoffHigh + ) + this + } + + def setScalerType(value: String): this.type = { + _scalerType = value + setScalerConfig() + setConfigs() + this + } + + def setScalerMin(value: Double): this.type 
= { + _scalerMin = value + setScalerConfig() + setConfigs() + this + } + + def setScalerMax(value: Double): this.type = { + _scalerMax = value + setScalerConfig() + setConfigs() + this + } + + def setStandardScalerMeanFlagOn(): this.type = { + _standardScalerMeanFlag = true + setScalerConfig() + setConfigs() + this + } + + def setStandardScalerMeanFlagOff(): this.type = { + _standardScalerMeanFlag = false + setScalerConfig() + setConfigs() + this + } + + def setStandardScalerStdDevFlagOn(): this.type = { + _standardScalerStdDevFlag = true + setScalerConfig() + setConfigs() + this + } + + def setStandardScalerStdDevFlagOff(): this.type = { + _standardScalerStdDevFlag = false + setScalerConfig() + setConfigs() + this + } + + def setPNorm(value: Double): this.type = { + _pNorm = value + setScalerConfig() + setConfigs() + this + } + + private def setScalerConfig(): this.type = { + _scalingConfig = ScalingConfig( + scalerType = _scalerType, + scalerMin = _scalerMin, + scalerMax = _scalerMax, + standardScalerMeanFlag = _standardScalerMeanFlag, + standardScalerStdDevFlag = _standardScalerStdDevFlag, + pNorm = _pNorm + ) + this + } + + /** + * Setter for determining the mode of operation for inclusion of interacted features. + * Modes are: + * - all -> Includes all interactions between all features (after string indexing of categorical values) + * - optimistic -> If the Information Gain / Variance, as compared to at least ONE of the parents of the interaction + * is above the threshold set by featureInteractionTargetInteractionPercentage + * (e.g. if IG of left parent is 0.5 and right parent is 0.9, with threshold set at 10, if the interaction + * between these two parents has an IG of 0.42, it would be rejected, but if it was 0.46, it would be kept) + * - strict -> the threshold percentage must be met for BOTH parents. + * (in the above example, the IG for the interaction would have to be > 0.81 in order to be included in + * the feature vector). 
+ * @param value String -> one of: 'all', 'optimistic', or 'strict' + * @throws IllegalArgumentException if the specified value submitted is not permitted + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFeatureInteractionRetentionMode(value: String): this.type = { + require( + allowableFeatureInteractionModes.contains(value), + s"FeatureInteractionRetentionMode is invalid. Must be one of: ${allowableFeatureInteractionModes + .mkString(", ")}" + ) + _featureInteractionRetentionMode = value + setFeatureInteractionConfig() + setConfigs() + this + } + + /** + * Setter for determining the behavior of continuous feature columns. In order to calculate Entropy for a continuous + * variable, the distribution must be converted to nominal values for estimation of per-split information gain. + * This setting defines how many nominal categorical values to create out of a continuously distributed feature + * in order to calculate Entropy. + * @param value Int -> must be greater than 1 + * @throws IllegalArgumentException if the value specified is <= 1 + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def setFeatureInteractionContinuousDiscretizerBucketCount( + value: Int + ): this.type = { + require( + value > 1, + s"FeatureInteractionContinuousDiscretizerBucketCount must be greater than 1." + ) + _featureInteractionContinuousDiscretizerBucketCount = value + setFeatureInteractionConfig() + setConfigs() + this + } + + /** + * Setter for configuring the concurrent count for scoring of feature interaction candidates. + * Due to the nature of these operations, the configuration here may need to be set differently to that of + * the modeling and general feature engineering phases of the toolkit. This is highly dependent on the row + * count of the data set being submitted. 
+ * @param value Int -> must be greater than 0 + * @since 0.6.2 + * @author Ben Wilson, Databricks + * @throws IllegalArgumentException if the value is < 1 + */ + @throws(classOf[IllegalArgumentException]) + def setFeatureInteractionParallelism(value: Int): this.type = { + require( + value >= 1, + s"FeatureInteractionParallelism must be set to a value >= 1." + ) + _featureInteractionParallelism = value + setFeatureInteractionConfig() + setConfigs() + this + } + + /** + * Setter for establishing the minimum acceptable InformationGain or Variance allowed for an interaction + * candidate based on comparison to the scores of its parents. + * @param value Double in range of -inf -> inf + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def setFeatureInteractionTargetInteractionPercentage( + value: Double + ): this.type = { + _featureInteractionTargetInteractionPercentage = value + setFeatureInteractionConfig() + setConfigs() + this + } + + /** + * Private setter for establishing the feature interaction configuration + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def setFeatureInteractionConfig(): this.type = { + _featureInteractionConfig = FeatureInteractionConfig( + retentionMode = _featureInteractionRetentionMode, + continuousDiscretizerBucketCount = + _featureInteractionContinuousDiscretizerBucketCount, + parallelism = _featureInteractionParallelism, + targetInteractionPercentage = + _featureInteractionTargetInteractionPercentage + ) + this + } + + def setParallelism(value: Int): this.type = { + require( + value < 100, + s"Parallelism above 100 will result in cluster instability." 
+ ) + _parallelism = value + setGeneticConfig() + setConfigs() + this + } + + def setKFold(value: Int): this.type = { + _kFold = value + setGeneticConfig() + setConfigs() + this + } + + def setTrainPortion(value: Double): this.type = { + _trainPortion = value + setGeneticConfig() + setConfigs() + this + } + + def setTrainSplitMethod(value: String): this.type = { + require( + trainSplitMethods.contains(value), + s"TrainSplitMethod $value must be one of: ${trainSplitMethods.mkString(", ")}" + ) + _trainSplitMethod = value + if (value == "chronological") + println( + "[WARNING] setTrainSplitMethod() -> Chronological splits is shuffle-intensive and will increase " + + "runtime significantly. Only use if necessary for modeling scenario!" + ) + setGeneticConfig() + setConfigs() + this + } + + def setKSampleConfig(): this.type = { + + _kSampleConfig = KSampleConfig( + syntheticCol = _syntheticCol, + kGroups = _kGroups, + kMeansMaxIter = _kMeansMaxIter, + kMeansTolerance = _kMeansTolerance, + kMeansDistanceMeasurement = _kMeansDistanceMeasurement, + kMeansSeed = _kMeansSeed, + kMeansPredictionCol = _kMeansPredictionCol, + lshHashTables = _lshHashTables, + lshSeed = _lshSeed, + lshOutputCol = _lshOutputCol, + quorumCount = _quorumCount, + minimumVectorCountToMutate = _minimumVectorCountToMutate, + vectorMutationMethod = _vectorMutationMethod, + mutationMode = _mutationMode, + mutationValue = _mutationValue, + labelBalanceMode = _labelBalanceMode, + cardinalityThreshold = _cardinalityThreshold, + numericRatio = _numericRatio, + numericTarget = _numericTarget, + outputDfRepartitionScaleFactor = _outputDfRepartitionScaleFactor + ) + this + } + + /** + * Setter - for setting the name of the Synthetic column name + * + * @param value String: A column name that is uniquely not part of the main DataFrame + * @since 0.5.1 + * @author Ben Wilson + */ + def setSyntheticCol(value: String): this.type = { + _syntheticCol = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + 
this + } + + /** + * Setter for specifying the number of K-Groups to generate in the KMeans model + * + * @param value Int: number of k groups to generate + * @return this + */ + def setKGroups(value: Int): this.type = { + _kGroups = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for specifying the maximum number of iterations for the KMeans model to go through to converge + * + * @param value Int: Maximum limit on iterations + * @return this + */ + def setKMeansMaxIter(value: Int): this.type = { + _kMeansMaxIter = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for the tolerance for KMeans (must be > 0) + * + * @param value The tolerance value setting for KMeans + * @see reference: [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans]] + * for further details. + * @return this + * @throws IllegalArgumentException if a value less than or equal to 0 is entered + */ + @throws(classOf[IllegalArgumentException]) + def setKMeansTolerance(value: Double): this.type = { + require( + value > 0, + s"KMeans tolerance value ${value.toString} is out of range. Must be > 0." + ) + _kMeansTolerance = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for which distance measurement to use to calculate the nearness of vectors to a centroid + * + * @param value String: Options -> "euclidean" or "cosine" Default: "euclidean" + * @return this + * @throws IllegalArgumentException if an invalid value is entered + */ + @throws(classOf[IllegalArgumentException]) + def setKMeansDistanceMeasurement(value: String): this.type = { + require( + allowableKMeansDistanceMeasurements.contains(value), + s"Kmeans Distance Measurement $value is not " + + s"a valid mode of operation. 
Must be one of: ${allowableKMeansDistanceMeasurements.mkString(", ")}" + ) + _kMeansDistanceMeasurement = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for a KMeans seed for the clustering algorithm + * + * @param value Long: Seed value + * @return this + */ + def setKMeansSeed(value: Long): this.type = { + _kMeansSeed = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for the internal KMeans column for cluster membership attribution + * + * @param value String: column name for internal algorithm column for group membership + * @return this + */ + def setKMeansPredictionCol(value: String): this.type = { + _kMeansPredictionCol = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for Configuring the number of Hash Tables to use for MinHashLSH + * + * @param value Int: Count of hash tables to use + * @see [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH]] + * for more information + * @return this + */ + def setLSHHashTables(value: Int): this.type = { + _lshHashTables = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for Configuring the Seed value for the LSH MinHash model + * + * @param value Long: A Seed value + * @since 0.5.1 + * @author Ben Wilson + */ + def setLSHSeed(value: Long): this.type = { + _lshSeed = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for the internal LSH output hash information column + * + * @param value String: column name for the internal MinHashLSH Model transformation value + * @return this + */ + def setLSHOutputCol(value: String): this.type = { + _lshOutputCol = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data + * + * @note the higher 
the value set here, the higher the variance in synthetic data generation + * @param value Int: Number of vectors to find nearest each centroid within the class + * @return this + */ + def setQuorumCount(value: Int): this.type = { + _quorumCount = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for minimum threshold for vector indexes to mutate within the feature vector. + * + * @note In vectorMutationMethod "fixed" this sets the fixed count of how many vector positions to mutate. + * In vectorMutationMethod "random" this sets the lower threshold for 'at least this many indexes will + * be mutated' + * @param value The minimum (or fixed) number of indexes to mutate. + * @return this + */ + def setMinimumVectorCountToMutate(value: Int): this.type = { + _minimumVectorCountToMutate = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for the Vector Mutation Method + * + * @note Options: + * "fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes. + * "random" - will use this number as a lower bound on a random selection of indexes between this and the vector length. + * "all" - will mutate all of the vectors. + * @param value String - the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setVectorMutationMethod(value: String): this.type = { + require( + allowableVectorMutationMethods.contains(value), + s"Vector Mutation Mode $value is not supported. 
" + + s"Must be one of: ${allowableVectorMutationMethods.mkString(", ")} " + ) + _vectorMutationMethod = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for the Mutation Mode of the feature vector individual values + * + * @note Options: + * "weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors + * "random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors + * "ratio" - uses a ratio between the values of the centroid vector and the mutation vector * + * @param value String: the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setMutationMode(value: String): this.type = { + require( + allowableMutationModes.contains(value), + s"Mutation Mode $value is not a valid mode of operation. " + + s"Must be one of: ${allowableMutationModes.mkString(", ")}" + ) + _mutationMode = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode + * + * @param value Double: value between 0 and 1 for mutation magnitude adjustment. + * @note the higher this value, the closer to the centroid vector vs. the candidate mutation vector the synthetic row data will be. + * @return this + * @throws IllegalArgumentException() if the value specified is outside of the range (0, 1) + */ + @throws(classOf[IllegalArgumentException]) + def setMutationValue(value: Double): this.type = { + require( + value > 0 & value < 1, + s"Mutation Value must be between 0 and 1. Value $value is not permitted." + ) + _mutationValue = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter - for determining the label balance approach mode. + * + * @note Available modes:
+ * 'match': Will match all smaller class counts to the largest class count. [WARNING] - May significantly increase memory pressure!
+ * 'percentage' Will adjust smaller classes to a percentage value of the largest class count. + * 'target' Will increase smaller class counts to a fixed numeric target of rows. + * @param value String: one of: 'match', 'percentage' or 'target' + * @note Default: "percentage" + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException() if the provided mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setLabelBalanceMode(value: String): this.type = { + require( + allowableLabelBalanceModes.contains(value), + s"Label Balance Mode $value is not supported. " + + s"Must be one of: ${allowableLabelBalanceModes.mkString(", ")}" + ) + _labelBalanceMode = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter - for overriding the cardinality exception threshold. [WARNING] increasing this value on + * a sufficiently large data set could incur, during runtime, excessive memory and CPU pressure on the cluster. + * + * @param value Int: the limit above which an exception will be thrown for a classification problem wherein the + * label distinct count is too large to successfully generate synthetic data. + * @note Default: 20 + * @since 0.5.1 + * @author Ben Wilson + */ + def setCardinalityThreshold(value: Int): this.type = { + _cardinalityThreshold = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter - for specifying the percentage ratio for the mode 'percentage' in setLabelBalanceMode() + * + * @param value Double: A fractional double in the range of 0.0 to 1.0. 
+ * @note Setting this value to 1.0 is equivalent to setting the label balance mode to 'match' + * @note Default: 0.2 + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException() if the provided value is outside of the range (0.0, 1.0] + */ + @throws(classOf[IllegalArgumentException]) + def setNumericRatio(value: Double): this.type = { + require( + value <= 1.0 && value > 0.0, + s"Invalid Numeric Ratio entered! Must be between 0 and 1. " + + s"${value.toString} is not valid." + ) + _numericRatio = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter - for specifying the target row count to generate for 'target' mode in setLabelBalanceMode() + * + * @param value Int: The desired final number of rows per minority class label + * @note [WARNING] Setting this value to too high of a number will greatly increase runtime and memory pressure. + * @since 0.5.1 + * @author Ben Wilson + */ + def setNumericTarget(value: Int): this.type = { + _numericTarget = value + setKSampleConfig() + setGeneticConfig() + setConfigs() + this + } + + def setTrainSplitChronologicalColumn(value: String): this.type = { + _trainSplitChronologicalColumn = value + val ignoredFields: Array[String] = _fieldsToIgnoreInVector ++ Array(value) + setFieldsToIgnoreInVector(ignoredFields) + _trainSplitColumnSet = true + setGeneticConfig() + setConfigs() + this + } + + def setTrainSplitChronologicalRandomPercentage(value: Double): this.type = { + _trainSplitChronologicalRandomPercentage = value + if (value > 10) + println( + "[WARNING] setTrainSplitChronologicalRandomPercentage() setting this value above 10 " + + "percent will cause significant per-run train/test skew and variability in row counts during training. " + + "Use higher values only if this is desired." 
+ ) + setGeneticConfig() + setConfigs() + this + } + + def setSeed(value: Long): this.type = { + _seed = value + setGeneticConfig() + setConfigs() + this + } + + def setFirstGenerationGenePool(value: Int): this.type = { + _firstGenerationGenePool = value + setGeneticConfig() + setConfigs() + this + } + + def setNumberOfGenerations(value: Int): this.type = { + _numberOfGenerations = value + setGeneticConfig() + setConfigs() + this + } + + def setNumberOfParentsToRetain(value: Int): this.type = { + _numberOfParentsToRetain = value + setGeneticConfig() + setConfigs() + this + } + + def setNumberOfMutationsPerGeneration(value: Int): this.type = { + _numberOfMutationsPerGeneration = value + setGeneticConfig() + setConfigs() + this + } + + def setGeneticMixing(value: Double): this.type = { + _geneticMixing = value + setGeneticConfig() + setConfigs() + this + } + + def setGenerationalMutationStrategy(value: String): this.type = { + _generationalMutationStrategy = value + setGeneticConfig() + setConfigs() + this + } + + def setFixedMutationValue(value: Int): this.type = { + _fixedMutationValue = value + setGeneticConfig() + setConfigs() + this + } + + def setMutationMagnitudeMode(value: String): this.type = { + _mutationMagnitudeMode = value + setGeneticConfig() + setConfigs() + this + } + + def setModelSeedString(value: String): this.type = { + _modelSeedMap = extractGenericModelReturnMap(value) + _modelSeedSetStatus = true + setGeneticConfig() + setConfigs() + this + } + + def setModelSeedMap(value: Map[String, Any]): this.type = { + _modelSeedMap = value + _modelSeedSetStatus = true + setGeneticConfig() + setConfigs() + this + } + + private def setFirstGenerationConfig(): this.type = { + _firstGenerationConfig = FirstGenerationConfig( + permutationCount = _firstGenerationPermutationCount, + indexMixingMode = _firstGenerationIndexMixingMode, + arraySeed = _firstGenerationArraySeed + ) + setGeneticConfig() + setConfigs() + this + } + + def 
setFirstGenerationPermutationCount(value: Int): this.type = { + _firstGenerationPermutationCount = value + setFirstGenerationConfig() + this + } + + def setFirstGenerationIndexMixingMode(value: String): this.type = { + require( + _allowableInitialGenerationIndexMixingModes.contains(value), + s"Invalid First Generation Index Mixing " + + s"Mode: $value . First Generation Index Mixing Mode must be one of: " + + s"${_allowableInitialGenerationIndexMixingModes.mkString(", ")}" + ) + _firstGenerationIndexMixingMode = value + setFirstGenerationConfig() + this + } + + def setFirstGenerationArraySeed(value: Long): this.type = { + _firstGenerationArraySeed = value + setFirstGenerationConfig() + this + } + + def hyperSpaceInferenceOn(): this.type = { + _hyperSpaceInference = true + setGeneticConfig() + setConfigs() + this + } + + def hyperSpaceInferenceOff(): this.type = { + _hyperSpaceInference = false + setGeneticConfig() + setConfigs() + this + } + + def setHyperSpaceInferenceCount(value: Int): this.type = { + if (value > 500000) + println( + "WARNING! Setting permutation counts above 500,000 will put stress on the driver." + ) + if (value > 1000000) + throw new UnsupportedOperationException( + s"Setting permutation above 1,000,000 is not supported" + + s" due to runtime considerations. $value is too large of a value." + ) + _hyperSpaceInferenceCount = value + setGeneticConfig() + setConfigs() + this + } + + def setHyperSpaceModelType(value: String): this.type = { + require( + Array("RandomForest", "LinearRegression", "XGBoost").contains(value), + s"Model type $value is not supported for post " + + s"modeling hyper space optimization! Please choose one of: RandomForest, LinearRegression, XGBoost" + ) + _hyperSpaceModelType = value + setGeneticConfig() + setConfigs() + this + } + + def setHyperSpaceModelCount(value: Int): this.type = { + if (value > 50) + println( + "WARNING! Setting this value above 50 will incur more than 50 additional models to be built. 
Proceed " + + "only if this is intended." + ) + _hyperSpaceModelCount = value + setGeneticConfig() + setConfigs() + this + } + + def setFirstGenerationMode(value: String): this.type = { + require( + _allowableInitialGenerationModes.contains(value), + s"Invalid First Generation Mode: $value . " + + s"First Generation Mode must be one of: ${_allowableInitialGenerationModes.mkString(", ")}" + ) + _firstGenerationMode = value + setGeneticConfig() + setConfigs() + this + } + + def setMlFlowConfig(value: MLFlowConfig): this.type = { + _mlFlowConfig = value + setConfigs() + this + } + + def mlFlowLoggingOn(): this.type = { + _mlFlowLoggingFlag = true + setConfigs() + this + } + + def mlFlowLoggingOff(): this.type = { + _mlFlowLoggingFlag = false + setConfigs() + this + } + + def mlFlowLogArtifactsOn(): this.type = { + _mlFlowArtifactsFlag = true + setConfigs() + this + } + + def mlFlowLogArtifactsOff(): this.type = { + _mlFlowArtifactsFlag = false + setConfigs() + this + } + + def setMlFlowTrackingURI(value: String): this.type = { + _mlFlowTrackingURI = value + setMlFlowConfig() + setConfigs() + this + } + + def setMlFlowExperimentName(value: String): this.type = { + _mlFlowExperimentName = value + setMlFlowConfig() + setConfigs() + this + } + + def setMlFlowAPIToken(value: String): this.type = { + _mlFlowAPIToken = value + setMlFlowConfig() + setConfigs() + this + } + + @throws(classOf[IllegalArgumentException]) + def setMlFlowModelSaveDirectory(value: String): this.type = { + require( + value.take(6) == "dbfs:/", + s"Model save directory must be written to dbfs:/." + ) + _mlFlowModelSaveDirectory = value + setMlFlowConfig() + setConfigs() + this + } + + def setMlFlowLoggingMode(value: String): this.type = { + require( + _allowableMlFlowLoggingModes.contains(value), + s"MlFlow logging mode $value is not permitted. 
Must be " + + s"one of: ${_allowableMlFlowLoggingModes.mkString(", ")}" + ) + _mlFlowLoggingMode = value + setMlFlowConfig() + setConfigs() + this + } + + def setMlFlowBestSuffix(value: String): this.type = { + _mlFlowBestSuffix = value + setMlFlowConfig() + setConfigs() + this + } + + def setMlFlowCustomRunTags(value: Map[String, String]): this.type = { + _mlFlowCustomRunTags = value + setMlFlowConfig() + setConfigs() + this + } + + private def setMlFlowConfig(): this.type = { + _mlFlowConfig = MLFlowConfig( + mlFlowTrackingURI = _mlFlowTrackingURI, + mlFlowExperimentName = _mlFlowExperimentName, + mlFlowAPIToken = _mlFlowAPIToken, + mlFlowModelSaveDirectory = _mlFlowModelSaveDirectory, + mlFlowLoggingMode = _mlFlowLoggingMode, + mlFlowBestSuffix = _mlFlowBestSuffix, + mlFlowCustomRunTags = _mlFlowCustomRunTags + ) + this + } + + def autoStoppingOn(): this.type = { + _autoStoppingFlag = true + setConfigs() + this + } + + def autoStoppingOff(): this.type = { + _autoStoppingFlag = false + setConfigs() + this + } + + def setAutoStoppingScore(value: Double): this.type = { + _autoStoppingScore = value + setConfigs() + this + } + + /** + * Setter for defining the secondary stopping criteria for continuous training mode (the number of consistently + * not-improving runs after which the learning algorithm terminates due to diminishing returns). + * @param value Negative Integer (an improvement to a priori will reset the counter and subsequent non-improvements + * will decrement a mutable counter. If the counter hits this limit specified in value, the continuous + * mode algorithm will stop). + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not negative. + */ + @throws(classOf[IllegalArgumentException]) + def setContinuousEvolutionImprovementThreshold(value: Int): this.type = { + require( + value < 0, + s"ContinuousEvolutionImprovementThreshold must be less than zero. It is " + + s"recommended to set this value to less than -4." 
+ ) + _continuousEvolutionImprovementThreshold = value + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for selecting the type of Regressor to use for the within-epoch generation MBO of candidates + * @param value String - one of "XGBoost", "LinearRegression" or "RandomForest" + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not supported + */ + @throws(classOf[IllegalArgumentException]) + def setGeneticMBORegressorType(value: String): this.type = { + require( + allowableMBORegressorTypes.contains(value), + s"GeneticRegressorType $value is not a supported Regressor " + + s"Type. Must be one of: ${allowableMBORegressorTypes.mkString(", ")}" + ) + _geneticMBORegressorType = value + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through + * mutation for each generation other than the initial and post-modeling optimization phases. The larger this + * value (default: 10), the more potential space can be searched. There is not a large performance hit to this, + * and as such, values in excess of 100 are viable. + * @param value Int - a factor to multiply the numberOfMutationsPerGeneration by to generate a count of potential + * candidates. + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not greater than zero. + */ + @throws(classOf[IllegalArgumentException]) + def setGeneticMBOCandidateFactor(value: Int): this.type = { + require(value > 0, s"GeneticMBOCandidateFactor must be greater than zero.") + _geneticMBOCandidateFactor = value + setGeneticConfig() + setConfigs() + this + } + + def setFeatureImportanceCutoffType(value: String): this.type = { + + require( + _supportedFeatureImportanceCutoffTypes.contains(value), + s"Feature Importance Cutoff Type '$value' is not supported. 
Allowable values: " + + s"${_supportedFeatureImportanceCutoffTypes.mkString(", ")}" + ) + _featureImportanceCutoffType = value + setConfigs() + this + } + + def setFeatureImportanceCutoffValue(value: Double): this.type = { + _featureImportanceCutoffValue = value + setConfigs() + this + } + + def setEvolutionStrategy(value: String): this.type = { + require( + _allowableEvolutionStrategies.contains(value), + s"Evolution Strategy '$value' is not a supported mode. Must be one of: ${_allowableEvolutionStrategies + .mkString(", ")}" + ) + _evolutionStrategy = value + setGeneticConfig() + setConfigs() + this + } + + def setContinuousEvolutionMaxIterations(value: Int): this.type = { + if (value > 500) + println( + s"[WARNING] Total Modeling count $value is higher than recommended limit of 500. " + + s"This tuning will take a long time to run." + ) + _continuousEvolutionMaxIterations = value + setGeneticConfig() + setConfigs() + this + } + + def setContinuousEvolutionStoppingScore(value: Double): this.type = { + _continuousEvolutionStoppingScore = value + setGeneticConfig() + setConfigs() + this + } + + def setContinuousEvolutionParallelism(value: Int): this.type = { + if (value > 10) + println( + s"[WARNING] ContinuousEvolutionParallelism -> $value is higher than recommended " + + s"concurrency for efficient optimization for convergence." + + s"\n Setting this value below 11 will converge faster in most cases." + ) + _continuousEvolutionParallelism = value + setGeneticConfig() + setConfigs() + this + } + + def setContinuousEvolutionMutationAggressiveness(value: Int): this.type = { + if (value > 4) + println( + s"[WARNING] ContinuousEvolutionMutationAggressiveness -> $value. " + + s"\n Setting this higher than 4 will result in extensive random search and will take longer to converge " + + s"to optimal hyperparameters." 
+ ) + _continuousEvolutionMutationAggressiveness = value + setGeneticConfig() + setConfigs() + this + } + + def setContinuousEvolutionGeneticMixing(value: Double): this.type = { + require( + value < 1.0 && value > 0.0, + s"ContinuousEvolutionGeneticMixing must be in range (0,1). Current Setting of $value is not permitted." + ) + _continuousEvolutionGeneticMixing = value + setGeneticConfig() + setConfigs() + this + } + + def setContinuousEvolutionRollingImprovementCount(value: Int): this.type = { + require( + value > 0, + s"ContinuousEvolutionRollingImprovementCount must be > 0. $value is invalid." + ) + if (value < 10) + println( + s"[WARNING] ContinuousEvolutionRollingImprovementCount -> $value setting is low. " + + s"Optimal Convergence may not occur due to early stopping." + ) + _continuousEvolutionRollingImprovementCount = value + setGeneticConfig() + setConfigs() + this + } + + @throws(classOf[IllegalArgumentException]) + def setInferenceConfigSaveLocation(value: String): this.type = { + require( + value.take(6) == "dbfs:/", + s"Inference save location must be on dbfs:/." + ) + _inferenceConfigSaveLocation = value + setConfigs() + this + } + + def setDataReductionFactor(value: Double): this.type = { + require(value > 0, s"Data Reduction Factor must be between 0 and 1") + require(value < 1, s"Data Reduction Factor must be between 0 and 1") + _dataReductionFactor = value + setConfigs() + this + } + + /** + * Setter for providing a path to write the kfold train/test splits as Delta data sets to (useful for extremely + * large data sets or a situation where using local disk storage might be prohibitively expensive) + * @param value String path to a dbfs location for creating the temporary (or persisted) train/test splits + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def setDeltaCacheBackingDirectory(value: String): this.type = { + + if (value != "") { + require( + value.take(6) == "dbfs:/", + s"Delta backing location must be written to dbfs." 
+ ) + } + _deltaCacheBackingDirectory = value + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for determining the split caching strategy (persist to disk for each kfold split, cache, or backing to Delta) + * @param value Configuration string, one of 'persist', 'cache', or 'delta' + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def setSplitCachingStrategy(value: String): this.type = { + val valueSet = value.toLowerCase + require( + valueSet == "persist" || valueSet == "delta" || valueSet == "cache", + s"SplitCachingStrategy '$value' is invalid. Must be one of 'delta', 'cache', or 'persist'" + ) + _splitCachingStrategy = valueSet + setGeneticConfig() + setConfigs() + this + } + + /** + * Setter for whether or not to delete the written train/test splits for the run in Delta. Defaulted to true + * which means that the job will delete the data on Object store to clean itself up after the run is completed + * if the splitCachingStrategy is set to 'delta' + * @param value Boolean - true => delete false => leave on Object Store + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def setDeltaCacheBackingDirectoryRemovalFlag(value: Boolean): this.type = { + _deltaCacheBackingDirectoryRemovalFlag = value + setGeneticConfig() + setConfigs() + this + } + + def deltaCheckBackingDirectoryRemovalOn(): this.type = { + _deltaCacheBackingDirectoryRemovalFlag = true + setGeneticConfig() + setConfigs() + this + } + + def deltaCheckBackingDirectoryRemovalOff(): this.type = { + _deltaCacheBackingDirectoryRemovalFlag = false + setGeneticConfig() + setConfigs() + this + } + + private def setGeneticConfig(): this.type = { + _geneticConfig = GeneticConfig( + parallelism = _parallelism, + kFold = _kFold, + trainPortion = _trainPortion, + trainSplitMethod = _trainSplitMethod, + kSampleConfig = _kSampleConfig, + trainSplitChronologicalColumn = _trainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage = + _trainSplitChronologicalRandomPercentage, + seed = 
_seed, + firstGenerationGenePool = _firstGenerationGenePool, + numberOfGenerations = _numberOfGenerations, + numberOfParentsToRetain = _numberOfParentsToRetain, + numberOfMutationsPerGeneration = _numberOfMutationsPerGeneration, + geneticMixing = _geneticMixing, + generationalMutationStrategy = _generationalMutationStrategy, + fixedMutationValue = _fixedMutationValue, + mutationMagnitudeMode = _mutationMagnitudeMode, + evolutionStrategy = _evolutionStrategy, + geneticMBORegressorType = _geneticMBORegressorType, + geneticMBOCandidateFactor = _geneticMBOCandidateFactor, + continuousEvolutionMaxIterations = _continuousEvolutionMaxIterations, + continuousEvolutionStoppingScore = _continuousEvolutionStoppingScore, + continuousEvolutionImprovementThreshold = + _continuousEvolutionImprovementThreshold, + continuousEvolutionParallelism = _continuousEvolutionParallelism, + continuousEvolutionMutationAggressiveness = + _continuousEvolutionMutationAggressiveness, + continuousEvolutionGeneticMixing = _continuousEvolutionGeneticMixing, + continuousEvolutionRollingImprovementCount = + _continuousEvolutionRollingImprovementCount, + modelSeed = _modelSeedMap, + hyperSpaceInference = _hyperSpaceInference, + hyperSpaceInferenceCount = _hyperSpaceInferenceCount, + hyperSpaceModelType = _hyperSpaceModelType, + hyperSpaceModelCount = _hyperSpaceModelCount, + initialGenerationMode = _firstGenerationMode, + initialGenerationConfig = _firstGenerationConfig, + deltaCacheBackingDirectory = _deltaCacheBackingDirectory, + splitCachingStrategy = _splitCachingStrategy, + deltaCacheBackingDirectoryRemovalFlag = + _deltaCacheBackingDirectoryRemovalFlag + ) + this + } + + def setMainConfig(): this.type = { + _mainConfig = MainConfig( + modelFamily = _modelingFamily, + labelCol = _labelCol, + featuresCol = _featuresCol, + naFillFlag = _naFillFlag, + varianceFilterFlag = _varianceFilterFlag, + outlierFilterFlag = _outlierFilterFlag, + pearsonFilteringFlag = _pearsonFilterFlag, + 
covarianceFilteringFlag = _covarianceFilterFlag, + oneHotEncodeFlag = _oneHotEncodeFlag, + scalingFlag = _scalingFlag, + featureInteractionFlag = _featureInteractionFlag, + dataPrepCachingFlag = _dataPrepCachingFlag, + dataPrepParallelism = _dataPrepParallelism, + autoStoppingFlag = _autoStoppingFlag, + autoStoppingScore = _autoStoppingScore, + featureImportanceCutoffType = _featureImportanceCutoffType, + featureImportanceCutoffValue = _featureImportanceCutoffValue, + dateTimeConversionType = _dateTimeConversionType, + fieldsToIgnoreInVector = _fieldsToIgnoreInVector, + numericBoundaries = _numericBoundaries, + stringBoundaries = _stringBoundaries, + scoringMetric = _scoringMetric, + scoringOptimizationStrategy = _scoringOptimizationStrategy, + fillConfig = _fillConfig, + outlierConfig = _outlierConfig, + pearsonConfig = _pearsonConfig, + covarianceConfig = _covarianceConfig, + scalingConfig = _scalingConfig, + featureInteractionConfig = _featureInteractionConfig, + geneticConfig = _geneticConfig, + mlFlowLoggingFlag = _mlFlowLoggingFlag, + mlFlowLogArtifactsFlag = _mlFlowArtifactsFlag, + mlFlowConfig = _mlFlowConfig, + inferenceConfigSaveLocation = _inferenceConfigSaveLocation, + dataReductionFactor = _dataReductionFactor, + pipelineDebugFlag = _pipelineDebugFlag, + pipelineId = _pipelineId + ) + this + } + + private def setFillConfig(config: FillConfig): this.type = { + + _fillConfig = config + _numericFillStat = config.numericFillStat + _characterFillStat = config.characterFillStat + _modelSelectionDistinctThreshold = config.modelSelectionDistinctThreshold + _cardinalitySwitchFlag = config.cardinalitySwitch + _cardinalityType = config.cardinalityType + _cardinalityLimit = config.cardinalityLimit + _cardinalityPrecision = config.cardinalityPrecision + _cardinalityCheckMode = config.cardinalityCheckMode + _naFillFilterPrecision = config.filterPrecision + _categoricalNAFillMap = config.categoricalNAFillMap + _numericNAFillMap = config.numericNAFillMap + 
_characterNABlanketFillValue = config.characterNABlanketFillValue + _numericNABlanketFillValue = config.numericNABlanketFillValue + _naFillMode = config.naFillMode + + this + } + + private def setOutlierConfig(config: OutlierConfig): this.type = { + + _outlierConfig = config + _filterBounds = config.filterBounds + _lowerFilterNTile = config.lowerFilterNTile + _upperFilterNTile = config.upperFilterNTile + _filterPrecision = config.filterPrecision + _continuousDataThreshold = config.continuousDataThreshold + _fieldsToIgnore = config.fieldsToIgnore + + this + } + + private def setPearsonConfig(config: PearsonConfig): this.type = { + + _pearsonConfig = config + _pearsonFilterStatistic = config.filterStatistic + _pearsonFilterDirection = config.filterDirection + _pearsonFilterManualValue = config.filterManualValue + _pearsonFilterMode = config.filterMode + _pearsonAutoFilterNTile = config.autoFilterNTile + + this + } + + private def setCovarianceConfig(config: CovarianceConfig): this.type = { + _covarianceConfig = config + _correlationCutoffLow = config.correlationCutoffLow + _correlationCutoffHigh = config.correlationCutoffHigh + + this + } + + private def setScalerConfig(config: ScalingConfig): this.type = { + + _scalingConfig = config + _scalerType = config.scalerType + _scalerMin = config.scalerMin + _scalerMax = config.scalerMax + _standardScalerMeanFlag = config.standardScalerMeanFlag + _standardScalerStdDevFlag = config.standardScalerStdDevFlag + _pNorm = config.pNorm + + this + } + + private def setFeatureInteractionConfig( + config: FeatureInteractionConfig + ): this.type = { + + _featureInteractionConfig = config + _featureInteractionRetentionMode = config.retentionMode + _featureInteractionContinuousDiscretizerBucketCount = + config.continuousDiscretizerBucketCount + _featureInteractionParallelism = config.parallelism + _featureInteractionTargetInteractionPercentage = + config.targetInteractionPercentage + + this + } + + private def setKSampleConfig(config: 
KSampleConfig): this.type = { + + _kSampleConfig = config + _syntheticCol = config.syntheticCol + _kGroups = config.kGroups + _kMeansMaxIter = config.kMeansMaxIter + _kMeansTolerance = config.kMeansTolerance + _kMeansDistanceMeasurement = config.kMeansDistanceMeasurement + _kMeansSeed = config.kMeansSeed + _kMeansPredictionCol = config.kMeansPredictionCol + _lshHashTables = config.lshHashTables + _lshSeed = config.lshSeed + _lshOutputCol = config.lshOutputCol + _quorumCount = config.quorumCount + _minimumVectorCountToMutate = config.minimumVectorCountToMutate + _vectorMutationMethod = config.vectorMutationMethod + _mutationMode = config.mutationMode + _mutationValue = config.mutationValue + _labelBalanceMode = config.labelBalanceMode + _cardinalityThreshold = config.cardinalityThreshold + _numericRatio = config.numericRatio + _numericTarget = config.numericTarget + _outputDfRepartitionScaleFactor = config.outputDfRepartitionScaleFactor + + this + } + + private def setFirstGenerationConfig( + config: FirstGenerationConfig + ): this.type = { + + _firstGenerationConfig = config + _firstGenerationPermutationCount = config.permutationCount + _firstGenerationIndexMixingMode = config.indexMixingMode + _firstGenerationArraySeed = config.arraySeed + + this + } + + private def setGeneticConfig(config: GeneticConfig): this.type = { + + _geneticConfig = config + _parallelism = config.parallelism + _kFold = config.kFold + _trainPortion = config.trainPortion + _trainSplitMethod = config.trainSplitMethod + setKSampleConfig(config.kSampleConfig) + _trainSplitChronologicalColumn = config.trainSplitChronologicalColumn + _trainSplitChronologicalRandomPercentage = + config.trainSplitChronologicalRandomPercentage + _seed = config.seed + _firstGenerationGenePool = config.firstGenerationGenePool + _numberOfGenerations = config.numberOfGenerations + _numberOfParentsToRetain = config.numberOfParentsToRetain + _numberOfMutationsPerGeneration = config.numberOfMutationsPerGeneration + 
_geneticMixing = config.geneticMixing + _generationalMutationStrategy = config.generationalMutationStrategy + _fixedMutationValue = config.fixedMutationValue + _mutationMagnitudeMode = config.mutationMagnitudeMode + _evolutionStrategy = config.evolutionStrategy + _continuousEvolutionMaxIterations = config.continuousEvolutionMaxIterations + _continuousEvolutionStoppingScore = config.continuousEvolutionStoppingScore + _continuousEvolutionParallelism = config.continuousEvolutionParallelism + _continuousEvolutionMutationAggressiveness = + config.continuousEvolutionMutationAggressiveness + _continuousEvolutionGeneticMixing = config.continuousEvolutionGeneticMixing + _continuousEvolutionRollingImprovementCount = + config.continuousEvolutionRollingImprovementCount + _modelSeedMap = config.modelSeed + _hyperSpaceInference = config.hyperSpaceInference + _hyperSpaceInferenceCount = config.hyperSpaceInferenceCount + _hyperSpaceModelType = config.hyperSpaceModelType + _hyperSpaceModelCount = config.hyperSpaceModelCount + _firstGenerationMode = config.initialGenerationMode + _continuousEvolutionImprovementThreshold = + config.continuousEvolutionImprovementThreshold + _geneticMBORegressorType = config.geneticMBORegressorType + _geneticMBOCandidateFactor = config.geneticMBOCandidateFactor + setFirstGenerationConfig(config.initialGenerationConfig) + _deltaCacheBackingDirectoryRemovalFlag = + config.deltaCacheBackingDirectoryRemovalFlag + _deltaCacheBackingDirectory = config.deltaCacheBackingDirectory + _splitCachingStrategy = config.splitCachingStrategy + this + } + + private def resetMlFlowConfig(config: MLFlowConfig): this.type = { + + _mlFlowConfig = config + _mlFlowTrackingURI = config.mlFlowTrackingURI + _mlFlowExperimentName = config.mlFlowExperimentName + _mlFlowAPIToken = config.mlFlowAPIToken + _mlFlowModelSaveDirectory = config.mlFlowModelSaveDirectory + _mlFlowLoggingMode = config.mlFlowLoggingMode + _mlFlowBestSuffix = config.mlFlowBestSuffix + _mlFlowCustomRunTags = 
config.mlFlowCustomRunTags + + this + } + + def setMainConfig(value: MainConfig): this.type = { + _mainConfig = value + + /** + * Reset all of the local var's so that setters can be used in a chained manner without reverting to defaults. + */ + _modelingFamily = value.modelFamily + _labelCol = value.labelCol + _featuresCol = value.featuresCol + _naFillFlag = value.naFillFlag + _varianceFilterFlag = value.varianceFilterFlag + _outlierFilterFlag = value.outlierFilterFlag + _pearsonFilterFlag = value.pearsonFilteringFlag + _covarianceFilterFlag = value.covarianceFilteringFlag + _oneHotEncodeFlag = value.oneHotEncodeFlag + _scalingFlag = value.scalingFlag + _featureInteractionFlag = value.featureInteractionFlag + _dataPrepCachingFlag = value.dataPrepCachingFlag + _dataPrepParallelism = value.dataPrepParallelism + _autoStoppingFlag = value.autoStoppingFlag + _autoStoppingScore = value.autoStoppingScore + _featureImportanceCutoffType = value.featureImportanceCutoffType + _featureImportanceCutoffValue = value.featureImportanceCutoffValue + _dateTimeConversionType = value.dateTimeConversionType + _fieldsToIgnoreInVector = value.fieldsToIgnoreInVector + _numericBoundaries = value.numericBoundaries + _stringBoundaries = value.stringBoundaries + _scoringMetric = value.scoringMetric + _scoringOptimizationStrategy = value.scoringOptimizationStrategy + setFillConfig(value.fillConfig) + setOutlierConfig(value.outlierConfig) + setPearsonConfig(value.pearsonConfig) + setCovarianceConfig(value.covarianceConfig) + setScalerConfig(value.scalingConfig) + setFeatureInteractionConfig(value.featureInteractionConfig) + setGeneticConfig(value.geneticConfig) + _mlFlowLoggingFlag = value.mlFlowLoggingFlag + _mlFlowArtifactsFlag = value.mlFlowLogArtifactsFlag + resetMlFlowConfig(value.mlFlowConfig) + _inferenceConfigSaveLocation = value.inferenceConfigSaveLocation + _dataReductionFactor = value.dataReductionFactor + _pipelineDebugFlag = value.pipelineDebugFlag + _pipelineId = value.pipelineId 
+ this + } + + def setFeatConfig(): this.type = { + _featureImportancesConfig = MainConfig( + modelFamily = "RandomForest", + labelCol = _labelCol, + featuresCol = _featuresCol, + naFillFlag = _naFillFlag, + varianceFilterFlag = _varianceFilterFlag, + outlierFilterFlag = _outlierFilterFlag, + pearsonFilteringFlag = _pearsonFilterFlag, + covarianceFilteringFlag = _covarianceFilterFlag, + oneHotEncodeFlag = _oneHotEncodeFlag, + scalingFlag = _scalingFlag, + featureInteractionFlag = _featureInteractionFlag, + dataPrepCachingFlag = _dataPrepCachingFlag, + dataPrepParallelism = _dataPrepParallelism, + autoStoppingFlag = _autoStoppingFlag, + autoStoppingScore = _autoStoppingScore, + featureImportanceCutoffType = _featureImportanceCutoffType, + featureImportanceCutoffValue = _featureImportanceCutoffValue, + dateTimeConversionType = _dateTimeConversionType, + fieldsToIgnoreInVector = _fieldsToIgnoreInVector, + numericBoundaries = _numericBoundaries, + stringBoundaries = _stringBoundaries, + scoringMetric = _scoringMetric, + scoringOptimizationStrategy = _scoringOptimizationStrategy, + fillConfig = _fillConfig, + outlierConfig = _outlierConfig, + pearsonConfig = _pearsonConfig, + covarianceConfig = _covarianceConfig, + scalingConfig = _scalingConfig, + featureInteractionConfig = _featureInteractionConfig, + geneticConfig = _geneticConfig, + mlFlowLoggingFlag = _mlFlowLoggingFlag, + mlFlowLogArtifactsFlag = _mlFlowArtifactsFlag, + mlFlowConfig = _mlFlowConfig, + inferenceConfigSaveLocation = _inferenceConfigSaveLocation, + dataReductionFactor = _dataReductionFactor, + pipelineDebugFlag = _pipelineDebugFlag, + pipelineId = _pipelineId + ) + this + } + + def setFeatConfig(value: MainConfig): this.type = { + _featureImportancesConfig = value + require( + value.modelFamily == "RandomForest", + s"Model Family for Feature Importances must be 'RandomForest'. ${value.modelFamily} is not supported." 
+ ) + setConfigs() + this + } + + def setTreeSplitsConfig(): this.type = { + _treeSplitsConfig = MainConfig( + modelFamily = "Trees", + labelCol = _labelCol, + featuresCol = _featuresCol, + naFillFlag = _naFillFlag, + varianceFilterFlag = _varianceFilterFlag, + outlierFilterFlag = _outlierFilterFlag, + pearsonFilteringFlag = _pearsonFilterFlag, + covarianceFilteringFlag = _covarianceFilterFlag, + oneHotEncodeFlag = _oneHotEncodeFlag, + scalingFlag = _scalingFlag, + featureInteractionFlag = _featureInteractionFlag, + dataPrepCachingFlag = _dataPrepCachingFlag, + dataPrepParallelism = _dataPrepParallelism, + autoStoppingFlag = _autoStoppingFlag, + autoStoppingScore = _autoStoppingScore, + featureImportanceCutoffType = _featureImportanceCutoffType, + featureImportanceCutoffValue = _featureImportanceCutoffValue, + dateTimeConversionType = _dateTimeConversionType, + fieldsToIgnoreInVector = _fieldsToIgnoreInVector, + numericBoundaries = _numericBoundaries, + stringBoundaries = _stringBoundaries, + scoringMetric = _scoringMetric, + scoringOptimizationStrategy = _scoringOptimizationStrategy, + fillConfig = _fillConfig, + outlierConfig = _outlierConfig, + pearsonConfig = _pearsonConfig, + covarianceConfig = _covarianceConfig, + scalingConfig = _scalingConfig, + featureInteractionConfig = _featureInteractionConfig, + geneticConfig = _geneticConfig, + mlFlowLoggingFlag = _mlFlowLoggingFlag, + mlFlowLogArtifactsFlag = _mlFlowArtifactsFlag, + mlFlowConfig = _mlFlowConfig, + inferenceConfigSaveLocation = _inferenceConfigSaveLocation, + dataReductionFactor = _dataReductionFactor, + pipelineDebugFlag = _pipelineDebugFlag, + pipelineId = _pipelineId + ) + this + } + + def setTreeSplitsConfig(value: MainConfig): this.type = { + _treeSplitsConfig = value + require( + value.modelFamily == "Trees", + s"Model Family for Trees Splits must be 'Trees'. ${value.modelFamily} is not supported." 
+ ) + setConfigs() + this + } + + def getPipelineId: String = _mainConfig.pipelineId + + def getModelingFamily: String = _modelingFamily + + def getLabelCol: String = _labelCol + + def getFeaturesCol: String = _featuresCol + + def getNaFillStatus: Boolean = _naFillFlag + + def getVarianceFilterStatus: Boolean = _varianceFilterFlag + + def getOutlierFilterStatus: Boolean = _outlierFilterFlag + + def getPearsonFilterStatus: Boolean = _pearsonFilterFlag + + def getCovarianceFilterStatus: Boolean = _covarianceFilterFlag + + def getOneHotEncodingStatus: Boolean = _oneHotEncodeFlag + + def getScalingStatus: Boolean = _scalingFlag + + def getFeatureInteractionStatus: Boolean = _featureInteractionFlag + + def getDataPrepCachingStatus: Boolean = _dataPrepCachingFlag + + def getDataPrepParallelism: Int = _dataPrepParallelism + + def getNumericBoundaries: Map[String, (Double, Double)] = _numericBoundaries + + def getStringBoundaries: Map[String, List[String]] = _stringBoundaries + + def getScoringMetric: String = _scoringMetric + + def getScoringOptimizationStrategy: String = _scoringOptimizationStrategy + + def getNumericFillStat: String = _numericFillStat + + def getCharacterFillStat: String = _characterFillStat + + def getDateTimeConversionType: String = _dateTimeConversionType + + def getFieldsToIgnoreInVector: Array[String] = _fieldsToIgnoreInVector + + def getNAFillFilterPrecision: Double = _naFillFilterPrecision + + def getCategoricalNAFillMap: Map[String, String] = _categoricalNAFillMap + + def getNumericNAFillMap: Map[String, AnyVal] = _numericNAFillMap + + def getCharacterNABlanketFillValue: String = _characterNABlanketFillValue + + def getNumericNABlanketFillValue: Double = _numericNABlanketFillValue + + def getNAFillMode: String = _naFillMode + + def getCardinalitySwitch: Boolean = _cardinalitySwitchFlag + + def getCardinalityType: String = _cardinalityType + + def getCardinalityLimit: Int = _cardinalityLimit + + def getCardinalityPrecision: Double = 
_cardinalityPrecision + + def getCardinalityCheckMode: String = _cardinalityCheckMode + + def getModelSelectionDistinctThreshold: Int = _modelSelectionDistinctThreshold + + def getFillConfig: FillConfig = _fillConfig + + def getFilterBounds: String = _filterBounds + + def getLowerFilterNTile: Double = _lowerFilterNTile + + def getUpperFilterNTile: Double = _upperFilterNTile + + def getFilterPrecision: Double = _filterPrecision + + def getContinuousDataThreshold: Int = _continuousDataThreshold + + def getFieldsToIgnore: Array[String] = _fieldsToIgnore + + def getOutlierConfig: OutlierConfig = _outlierConfig + + def getPearsonFilterStatistic: String = _pearsonFilterStatistic + + def getPearsonFilterDirection: String = _pearsonFilterDirection + + def getPearsonFilterManualValue: Double = _pearsonFilterManualValue + + def getPearsonFilterMode: String = _pearsonFilterMode + + def getPearsonAutoFilterNTile: Double = _pearsonAutoFilterNTile + + def getPearsonConfig: PearsonConfig = _pearsonConfig + + def getCorrelationCutoffLow: Double = _correlationCutoffLow + + def getCorrelationCutoffHigh: Double = _correlationCutoffHigh + + def getCovarianceConfig: CovarianceConfig = _covarianceConfig + + def getScalerType: String = _scalerType + + def getScalerMin: Double = _scalerMin + + def getScalerMax: Double = _scalerMax + + def getStandardScalingMeanFlag: Boolean = _standardScalerMeanFlag + + def getStandardScalingStdDevFlag: Boolean = _standardScalerStdDevFlag + + def getPNorm: Double = _pNorm + + def getScalingConfig: ScalingConfig = _scalingConfig + + def getFeatureInteractionConfig: FeatureInteractionConfig = + _featureInteractionConfig + + def getFeatureInteractionRetentionMode: String = + _featureInteractionRetentionMode + + def getFeatureInteractionContinuousDiscretizerBucketCount: Int = + _featureInteractionContinuousDiscretizerBucketCount + + def getFeatureInteractionParallelism: Int = _featureInteractionParallelism + + def 
getFeatureInteractionTargetInteractionPercentage: Double = + _featureInteractionTargetInteractionPercentage + + def getParallelism: Int = _parallelism + + def getKFold: Int = _kFold + + def getTrainPortion: Double = _trainPortion + + def getTrainSplitMethod: String = _trainSplitMethod + + def getKSampleConfig: KSampleConfig = _kSampleConfig + + def getSyntheticCol: String = _syntheticCol + + def getKGroups: Int = _kGroups + + def getKMeansMaxIter: Int = _kMeansMaxIter + + def getKMeansTolerance: Double = _kMeansTolerance + + def getKMeansDistanceMeasurement: String = _kMeansDistanceMeasurement + + def getKMeansSeed: Long = _kMeansSeed + + def getKMeansPredictionCol: String = _kMeansPredictionCol + + def getLSHHashTables: Int = _lshHashTables + + def getLSHOutputCol: String = _lshOutputCol + + def getQuorumCount: Int = _quorumCount + + def getMinimumVectorCountToMutate: Int = _minimumVectorCountToMutate + + def getVectorMutationMethod: String = _vectorMutationMethod + + def getMutationMode: String = _mutationMode + + def getMutationValue: Double = _mutationValue + + def getTrainSplitChronologicalColumn: String = _trainSplitChronologicalColumn + + def getTrainSplitChronologicalRandomPercentage: Double = + _trainSplitChronologicalRandomPercentage + + def getSeed: Long = _seed + + def getFirstGenerationGenePool: Int = _firstGenerationGenePool + + def getNumberOfGenerations: Int = _numberOfGenerations + + def getNumberOfParentsToRetain: Int = _numberOfParentsToRetain + + def getNumberOfMutationsPerGeneration: Int = _numberOfMutationsPerGeneration + + def getGeneticMixing: Double = _geneticMixing + + def getGenerationalMutationStrategy: String = _generationalMutationStrategy + + def getFixedMutationValue: Int = _fixedMutationValue + + def getMutationMagnitudeMode: String = _mutationMagnitudeMode + + def getModelSeedSetStatus: Boolean = _modelSeedSetStatus + + def getModelSeedMap: Map[String, Any] = _modelSeedMap + + def getFirstGenerationPermutationCount: Int = 
_firstGenerationPermutationCount + + def getFirstGenerationIndexMixingMode: String = + _firstGenerationIndexMixingMode + + def getFirstGenerationArraySeed: Long = _firstGenerationArraySeed + + def getHyperSpaceInferenceStatus: Boolean = _hyperSpaceInference + + def getHyperSpaceInferenceCount: Int = _hyperSpaceInferenceCount + + def getHyperSpaceModelType: String = _hyperSpaceModelType + + def getHyperSpaceModelCount: Int = _hyperSpaceModelCount + + def getFirstGenerationConfig: FirstGenerationConfig = _firstGenerationConfig + + def getFirstGenerationMode: String = _firstGenerationMode + + def getMlFlowLoggingFlag: Boolean = _mlFlowLoggingFlag + + def getMlFlowLogArtifactsFlag: Boolean = _mlFlowArtifactsFlag + + def getMlFlowTrackingURI: String = _mlFlowTrackingURI + + def getMlFlowExperimentName: String = _mlFlowExperimentName + + def getMlFlowModelSaveDirectory: String = _mlFlowModelSaveDirectory + + def getMlFlowLoggingMode: String = _mlFlowLoggingMode + + def getMlFlowBestSuffix: String = _mlFlowBestSuffix + + def getMlFlowCustomRunTags: Map[String, String] = _mlFlowCustomRunTags + + def getMlFlowConfig: MLFlowConfig = _mlFlowConfig + + def getGeneticConfig: GeneticConfig = _geneticConfig + + def getMainConfig: MainConfig = _mainConfig + + def getFeatConfig: MainConfig = _featureImportancesConfig + + def getTreeSplitsConfig: MainConfig = _treeSplitsConfig + + def getAutoStoppingFlag: Boolean = _autoStoppingFlag + + def getAutoStoppingScore: Double = _autoStoppingScore + + def getFeatureImportanceCutoffType: String = _featureImportanceCutoffType + + def getFeatureImportanceCutoffValue: Double = _featureImportanceCutoffValue + + def getEvolutionStrategy: String = _evolutionStrategy + + def getContinuousEvolutionMaxIterations: Int = + _continuousEvolutionMaxIterations + + def getContinuousEvolutionStoppingScore: Double = + _continuousEvolutionStoppingScore + + def getContinuousEvolutionParallelism: Int = _continuousEvolutionParallelism + + def 
getContinuousEvolutionMutationAggressiveness: Int = + _continuousEvolutionMutationAggressiveness + + def getContinuousEvolutionGeneticMixing: Double = + _continuousEvolutionGeneticMixing + + def getContinuousEvolutionRollingImporvementCount: Int = + _continuousEvolutionRollingImprovementCount + + def getInferenceConfigSaveLocation: String = _inferenceConfigSaveLocation + + def getDataReductionFactor: Double = _dataReductionFactor + + def getDeltaCacheBackingDirectory: String = _deltaCacheBackingDirectory + + def getDeltaCacheBackingDirectoryRemovalFlag: Boolean = + _deltaCacheBackingDirectoryRemovalFlag + + def getSplitCachingStrategy: String = _splitCachingStrategy + + /** + * Helper method for extracting the config from a run's GenericModelReturn payload + * This is designed to handle "lazy" copy/paste from either stdout or the mlflow ui. + * The alternative (preferred method of seeding a run start) is to submit a Map() for the run configuration seed. + * + * @param fullModelReturn: String The Generic Model Config of a run, to be used as a starting point for further + * tuning or refinement. + * @return A Map Object that can be parsed into the requisite case class definition to set a seed for a particular + * type of model run. 
+ */ + private def extractGenericModelReturnMap( + fullModelReturn: String + ): Map[String, Any] = { + + val patternToMatch = "(?<=\\()[^()]*".r + + val configElements = + patternToMatch.findAllIn(fullModelReturn).toList(1).split(",") + + var configMap = Map[String, Any]() + + configElements.foreach { x => + val components = x.trim.split(" -> ") + configMap += (components(0) -> components(1)) + } + configMap + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/DataPrep.scala b/src/main/scala/com/databricks/labs/automl/executor/DataPrep.scala new file mode 100644 index 00000000..56cd67f3 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/DataPrep.scala @@ -0,0 +1,711 @@ +package com.databricks.labs.automl.executor + +import com.databricks.labs.automl.feature.{ + FeatureInteraction, + SyntheticFeatureGenerator +} +import com.databricks.labs.automl.feature.structures.{ + FeatureInteractionOutputPayload, + InteractionPayloadExtract +} +import com.databricks.labs.automl.inference.{ + FeatureInteractionConfig, + InferenceConfig, + NaFillConfig +} +import com.databricks.labs.automl.params.{ + DataGeneration, + DataPrepReturn, + OutlierFilteringReturn +} +import com.databricks.labs.automl.pipeline.FeaturePipeline +import com.databricks.labs.automl.sanitize._ +import com.databricks.labs.automl.utils.{ + AutomationTools, + WorkspaceDirectoryValidation +} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.storage.StorageLevel + +class DataPrep(df: DataFrame) extends AutomationConfig with AutomationTools { + + // TODO: parallelism config for non genetic parallel control should be added + private val logger: Logger = Logger.getLogger(this.getClass) + + private def logConfig(): Unit = { + + val configString = s"Configuration setting flags: \n NA Fill Flag: ${_mainConfig.naFillFlag.toString}" + + s"\n Zero Variance Filter Flag: 
${_mainConfig.varianceFilterFlag.toString}" + + s"\n Outlier Filter Flag: ${_mainConfig.outlierFilterFlag.toString}" + + s"\n Covariance Filter Flag: ${_mainConfig.covarianceFilteringFlag.toString}" + + s"\n Pearson Filter Flag: ${_mainConfig.pearsonFilteringFlag.toString}" + + s"\n OneHotEncoding Flag: ${_mainConfig.oneHotEncodeFlag.toString}" + + s"\n Scaling Flag: ${_mainConfig.scalingFlag.toString}" + + s"\n Feature Interaction Flag: ${_mainConfig.featureInteractionFlag.toString}" + + s"\n Hyperspace Inference Flag: ${_mainConfig.geneticConfig.hyperSpaceInference.toString}" + + s"\n First Generation Seed Mode: ${_mainConfig.geneticConfig.initialGenerationMode}" + + s"\n MlFlow Logging Flag: ${_mainConfig.mlFlowLoggingFlag.toString}" + + s"\n Early Stopping Flag: ${_mainConfig.autoStoppingFlag.toString}" + + s"\n Data Prep Caching Flag: ${_mainConfig.dataPrepCachingFlag.toString}" + println( + configString + "\nFull Model Tuning Run Config: \n" + prettyPrintConfig( + _mainConfig + ) + ) + logger.log(Level.INFO, configString) + + } + + private def vectorPipeline( + data: DataFrame, + cardinalityFlag: Boolean + ): (DataFrame, Array[String], Array[String]) = { + + // Creates the feature vector and returns the fields that go into the vector + + new FeaturePipeline(data) + .setLabelCol(_mainConfig.labelCol) + .setFeatureCol(_mainConfig.featuresCol) + .setDateTimeConversionType(_mainConfig.dateTimeConversionType) + .setCardinalityCheck(cardinalityFlag) + .setCardinalityCheckMode(_mainConfig.fillConfig.cardinalityCheckMode) + .setCardinalityLimit(_mainConfig.fillConfig.cardinalityLimit) + .setCardinalityPrecision(_mainConfig.fillConfig.cardinalityPrecision) + .setCardinalityType(_mainConfig.fillConfig.cardinalityType) + .makeFeaturePipeline(_mainConfig.fieldsToIgnoreInVector) + + } + + private def oneHotEncodeVector( + data: DataFrame, + featureColumns: Array[String], + totalFields: Array[String] + ): (DataFrame, Array[String], Array[String]) = { + + new 
FeaturePipeline(data) + .setLabelCol(_mainConfig.labelCol) + .setFeatureCol(_mainConfig.featuresCol) + .applyOneHotEncoding(featureColumns, totalFields) + + } + + private def fillNA(data: DataFrame): (DataFrame, NaFillConfig, String) = { + + // Output has no feature vector + + val naConfig = new DataSanitizer(data) + .setLabelCol(_mainConfig.labelCol) + .setFeatureCol(_mainConfig.featuresCol) + .setModelSelectionDistinctThreshold( + _mainConfig.fillConfig.modelSelectionDistinctThreshold + ) + .setNumericFillStat(_mainConfig.fillConfig.numericFillStat) + .setCharacterFillStat(_mainConfig.fillConfig.characterFillStat) + .setParallelism(_mainConfig.dataPrepParallelism) + .setFieldsToIgnoreInVector(_mainConfig.fieldsToIgnoreInVector) + .setFilterPrecision(_mainConfig.fillConfig.filterPrecision) + .setCategoricalNAFillMap(_mainConfig.fillConfig.categoricalNAFillMap) + .setNumericNAFillMap(_mainConfig.fillConfig.numericNAFillMap) + .setCharacterNABlanketFillValue( + _mainConfig.fillConfig.characterNABlanketFillValue + ) + .setNumericNABlanketFillValue( + _mainConfig.fillConfig.numericNABlanketFillValue + ) + .setNAFillMode(_mainConfig.fillConfig.naFillMode) + + val (naFilledDataFrame, fillMap, detectedModelType) = + if (_mainConfig.naFillFlag) { + naConfig.generateCleanData() + } else { + ( + data, + NaFillConfig(Map("" -> ""), Map("" -> 0.0), Map("" -> false)), + naConfig.decideModel() + ) + } + + val naLog: String = if (_mainConfig.naFillFlag) { + s"NA values filled on Dataframe. 
Detected Model Type: $detectedModelType" + } else { + s"Detected Model Type: $detectedModelType" + } + + logger.log(Level.INFO, naLog) + println(naLog) + + (naFilledDataFrame, fillMap, detectedModelType) + + } + + private def varianceFilter(data: DataFrame): DataPrepReturn = { + + // Output has no feature vector + val varianceFiltering = new VarianceFiltering(data) + .setLabelCol(_mainConfig.labelCol) + .setFeatureCol(_mainConfig.featuresCol) + .setDateTimeConversionType(_mainConfig.dateTimeConversionType) + .setParallelism(_mainConfig.dataPrepParallelism) + + val (varianceFilteredData, removedColumns) = + varianceFiltering.filterZeroVariance(_mainConfig.fieldsToIgnoreInVector) + + val varianceFilterLog = if (removedColumns.length == 0) { + "Zero Variance fields have been removed from the data." + } else { + s"The following columns were removed due to zero variance: ${removedColumns.mkString(", ")}" + } + + logger.log(Level.INFO, varianceFilterLog) + println(varianceFilterLog) + + DataPrepReturn(varianceFilteredData, removedColumns) + + } + + private def outlierFilter(data: DataFrame): OutlierFilteringReturn = { + + // Output has no feature vector + val outlierFiltering = new OutlierFiltering(data) + .setLabelCol(_mainConfig.labelCol) + .setFilterBounds(_mainConfig.outlierConfig.filterBounds) + .setLowerFilterNTile(_mainConfig.outlierConfig.lowerFilterNTile) + .setUpperFilterNTile(_mainConfig.outlierConfig.upperFilterNTile) + .setFilterPrecision(_mainConfig.outlierConfig.filterPrecision) + .setParallelism(_mainConfig.dataPrepParallelism) + .setContinuousDataThreshold( + _mainConfig.outlierConfig.continuousDataThreshold + ) + + val (outlierCleanedData, outlierRemovedData, filteringMap) = + outlierFiltering.filterContinuousOutliers( + _mainConfig.fieldsToIgnoreInVector, + _mainConfig.outlierConfig.fieldsToIgnore + ) + + val outlierRemovalInfo = + s"Removed outlier data. 
Total rows removed = ${outlierRemovedData.count()}" + + logger.log(Level.INFO, outlierRemovalInfo) + println(outlierRemovalInfo) + + OutlierFilteringReturn(outlierCleanedData, filteringMap) + + } + + private def covarianceFilter(data: DataFrame, + fields: Array[String]): DataPrepReturn = { + + // Output has no feature vector + + val covarianceFilteredData = new FeatureCorrelationDetection(data, fields) + .setLabelCol(_mainConfig.labelCol) + .setParallelism(_mainConfig.dataPrepParallelism) + .setCorrelationCutoffLow( + _mainConfig.covarianceConfig.correlationCutoffLow + ) + .setCorrelationCutoffHigh( + _mainConfig.covarianceConfig.correlationCutoffHigh + ) + .filterFeatureCorrelation() + + val removedFields = + fieldRemovalCompare(fields, covarianceFilteredData.schema.fieldNames) + + val covarianceFilterLog = + s"Covariance Filtering completed.\n Removed fields: ${removedFields.mkString(", ")}" + + logger.log(Level.INFO, covarianceFilterLog) + println(covarianceFilterLog) + + DataPrepReturn(covarianceFilteredData, removedFields.toArray) + + } + + private def pearsonFilter(data: DataFrame, + fields: Array[String], + modelType: String): DataPrepReturn = { + + // Requires a Dataframe that has a feature vector field. Output has no feature vector. 
+ + val pearsonFiltering = new PearsonFiltering(data, fields, modelType) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFilterStatistic(_mainConfig.pearsonConfig.filterStatistic) + .setFilterDirection(_mainConfig.pearsonConfig.filterDirection) + .setFilterManualValue(_mainConfig.pearsonConfig.filterManualValue) + .setFilterMode(_mainConfig.pearsonConfig.filterMode) + .setAutoFilterNTile(_mainConfig.pearsonConfig.autoFilterNTile) + .setParallelism(_mainConfig.dataPrepParallelism) + .filterFields(_mainConfig.fieldsToIgnoreInVector) + + val removedFields = + fieldRemovalCompare(fields, pearsonFiltering.schema.fieldNames) + + val pearsonFilterLog = + s"Pearson Filtering completed.\n Removed fields: ${removedFields.mkString(", ")}" + + logger.log(Level.INFO, pearsonFilterLog) + println(pearsonFilterLog) + + DataPrepReturn(pearsonFiltering, removedFields.toArray) + + } + + private def scaler(data: DataFrame): DataFrame = { + + // Requires a Dataframe that has a feature vector field. Output has a feature vector. + + val scaledData = new Scaler(data) + .setFeaturesCol(_mainConfig.featuresCol) + .setScalerType(_mainConfig.scalingConfig.scalerType) + .setScalerMin(_mainConfig.scalingConfig.scalerMin) + .setScalerMax(_mainConfig.scalingConfig.scalerMax) + .setStandardScalerMeanMode( + _mainConfig.scalingConfig.standardScalerMeanFlag + ) + .setStandardScalerStdDevMode( + _mainConfig.scalingConfig.standardScalerStdDevFlag + ) + .setPNorm(_mainConfig.scalingConfig.pNorm) + .scaleFeatures() + + val scaleLog = + s"Scaling of type '${_mainConfig.scalingConfig.scalerType}' completed." 
+ + logger.log(Level.INFO, scaleLog) + println(scaleLog) + + scaledData + + } + + def prepData(): DataGeneration = { + + // Record the Switch Settings from MainConfig to return an InferenceSwitchSettings object + val inferenceSwitchSettings = recordInferenceSwitchSettings(_mainConfig) + InferenceConfig.setInferenceSwitchSettings(inferenceSwitchSettings) + + // Perform validation of mlflow logging location so that it can fail early in case logging doesn't work. + // Only run this if mlflow Logging flag is turned on. + if (_mainConfig.mlFlowLoggingFlag) { + val dirValidate = WorkspaceDirectoryValidation( + _mainConfig.mlFlowConfig.mlFlowTrackingURI, + _mainConfig.mlFlowConfig.mlFlowAPIToken, + _mainConfig.mlFlowConfig.mlFlowExperimentName + ) + if (dirValidate) { + val rgx = "(\\/\\w+$)".r + val dir = + rgx.replaceFirstIn(_mainConfig.mlFlowConfig.mlFlowExperimentName, "") + println( + s"MLFlow Logging Directory confirmed accessible at: " + + s"$dir" + ) + } + } + + val includeFieldsFinalData = _mainConfig.fieldsToIgnoreInVector + + println( + s"Fields Set To Ignore: ${_mainConfig.fieldsToIgnoreInVector.mkString(", ")}" + ) + + val cacheLevel = StorageLevel.MEMORY_AND_DISK + val unpersistBlock = true + + // log the settings used for the run + logConfig() + + if (_mainConfig.dataPrepCachingFlag) { + // cache the main DataFrame + df.persist(cacheLevel) + // force the cache + df.count() + } + + //DEBUG + logger.log(Level.DEBUG, printSchema(df, "input").toString) + + val (naFilledData, fillMap, detectedModelType) = fillNA(df) + val (entryPointData, entryPointFields, selectFields) = + vectorPipeline(naFilledData, _mainConfig.fillConfig.cardinalitySwitch) + + // Record the Inference Settings for DataConfig + val inferenceDataConfig = + recordInferenceDataConfig(_mainConfig, selectFields) + InferenceConfig.setInferenceDataConfig(inferenceDataConfig) + + val dataStage1 = entryPointData.select(selectFields map col: _*) + + // Record the Inference Settings for NaFillConfig 
mappings + InferenceConfig.setInferenceNaFillConfig( + fillMap.categoricalColumns, + fillMap.numericColumns, + fillMap.booleanColumns + ) + + // uncache the main DataFrame, force the GC + val (persistDataStage1, dataStage1RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.naFillFlag) { + dataPersist(df, dataStage1, cacheLevel, unpersistBlock) + } else { + (dataStage1, "no count when data prep caching is disabled") + } + + if (_mainConfig.naFillFlag) { + println(dataStage1RowCount) + logger.log(Level.INFO, dataStage1RowCount) + } + + //DEBUG + logger.log(Level.DEBUG, printSchema(dataStage1, "stage1").toString) + logger.log(Level.DEBUG, printSchema(selectFields, "stage1_full").toString) + + // Variance Filtering + val dataStage2 = + if (_mainConfig.varianceFilterFlag) varianceFilter(persistDataStage1) + else DataPrepReturn(persistDataStage1, Array.empty[String]) + + // Record the Inference Settings for Variance Filtering + InferenceConfig.setInferenceVarianceFilterConfig(dataStage2.fieldListing) + + val (persistDataStage2, dataStage2RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.varianceFilterFlag) { + dataPersist( + persistDataStage1, + dataStage2.outputData, + cacheLevel, + unpersistBlock + ) + } else { + (dataStage2.outputData, "no count when data prep caching is disabled") + } + + if (_mainConfig.varianceFilterFlag) { + println(dataStage2RowCount) + logger.log(Level.INFO, dataStage2RowCount) + } + + //DEBUG + logger.log( + Level.DEBUG, + printSchema(dataStage2.outputData, "stage2").toString + ) + + // Outlier Filtering + val dataStage3 = + if (_mainConfig.outlierFilterFlag) outlierFilter(persistDataStage2) + else + OutlierFilteringReturn( + persistDataStage2, + Map.empty[String, (Double, String)] + ) + + val (persistDataStage3, dataStage3RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.outlierFilterFlag) { + dataPersist( + persistDataStage2, + dataStage3.outputData, + cacheLevel, + unpersistBlock + ) + } else { + 
(dataStage3.outputData, "no count when data prep caching is disabled") + } + + if (_mainConfig.outlierFilterFlag) { + println(dataStage3RowCount) + logger.log(Level.INFO, dataStage3RowCount) + } + + //DEBUG + logger.log( + Level.DEBUG, + printSchema(dataStage3.outputData, "stage3").toString + ) + + // Record the Inference Settings for Outlier Filtering + InferenceConfig.setInferenceOutlierFilteringConfig( + dataStage3.fieldRemovalMap + ) + + // Next stages require a feature vector + val (featurizedData, initialFields, initialFullFields) = + vectorPipeline(persistDataStage3, cardinalityFlag = false) + + // Ensure that the only fields in the DataFrame are the Individual Feature Columns, Label, and Exclusion Fields + val featureFieldCleanup = initialFields ++ Array(_mainConfig.labelCol) + + val featurizedDataCleaned = if (_mainConfig.dataPrepCachingFlag) { + dataPersist( + persistDataStage3, + featurizedData.select(featureFieldCleanup map col: _*), + cacheLevel, + unpersistBlock + )._1 + } else { + featurizedData.select(featureFieldCleanup map col: _*) + } + + //DEBUG + logger.log( + Level.DEBUG, + printSchema(featurizedDataCleaned, "featurizedDataCleaned").toString + ) + + // Covariance Filtering + val dataStage4 = if (_mainConfig.covarianceFilteringFlag) { + covarianceFilter(featurizedDataCleaned, initialFields) + } else DataPrepReturn(featurizedDataCleaned, Array.empty[String]) + + val (persistDataStage4, dataStage4RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.covarianceFilteringFlag) { + dataPersist( + featurizedDataCleaned, + dataStage4.outputData, + cacheLevel, + unpersistBlock + ) + } else { + (dataStage4.outputData, "no count when data prep caching is disabled") + } + + if (_mainConfig.covarianceFilteringFlag) { + println(dataStage4RowCount) + logger.log(Level.INFO, dataStage4RowCount) + } + + //DEBUG + logger.log( + Level.DEBUG, + printSchema(dataStage4.outputData, "stage4").toString + ) + + // Record the Inference Settings for Covariance 
Filtering + InferenceConfig.setInferenceCovarianceFilteringConfig( + dataStage4.fieldListing + ) + + // All stages after this point require a feature vector. + val (dataStage5, stage5Fields, stage5FullFields) = + vectorPipeline(persistDataStage4, cardinalityFlag = false) + + val (persistDataStage5, dataStage5RowCount) = + if (_mainConfig.dataPrepCachingFlag) { + dataPersist(persistDataStage4, dataStage5, cacheLevel, unpersistBlock) + } else { + (dataStage5, "no count when data prep caching is disabled") + } + + // Pearson Filtering (generates a vector features Field) + val (dataStage6, stage6Fields, stage6FullFields) = + if (_mainConfig.pearsonFilteringFlag) { + + val pearsonReturn = + pearsonFilter(persistDataStage5, stage5Fields, detectedModelType) + + // Record the Inference Settings for Pearson Filtering + InferenceConfig.setInferencePearsonFilteringConfig( + pearsonReturn.fieldListing + ) + + vectorPipeline(pearsonReturn.outputData, cardinalityFlag = false) + } else { + // Record the Inference Settings for Pearson Filtering + InferenceConfig.setInferencePearsonFilteringConfig(Array.empty[String]) + (persistDataStage5, stage5Fields, stage5FullFields) + } + + val (persistDataStage6, dataStage6RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.pearsonFilteringFlag) { + dataPersist(persistDataStage5, dataStage6, cacheLevel, unpersistBlock) + } else { + (dataStage6, "no count when data prep caching is disabled") + } + + //DEBUG + logger.log(Level.DEBUG, printSchema(persistDataStage6, "stage6").toString) + + // Feature Interaction Stage + val featureInteractionResult = if (_mainConfig.featureInteractionFlag) { + + val nominalFields = stage6Fields + .filter(x => x.takeRight(3) == "_si") + .filterNot(x => x.contains(_labelCol)) + + val continuousFields = stage6Fields + .diff(nominalFields) + .filterNot(_.contains(_labelCol)) + .filterNot(_.contains(_featuresCol)) + + FeatureInteraction.interactFeatures( + persistDataStage6, + nominalFields, + 
continuousFields, + detectedModelType, + _mainConfig.featureInteractionConfig.retentionMode, + _labelCol, + _featuresCol, + _mainConfig.featureInteractionConfig.continuousDiscretizerBucketCount, + _mainConfig.featureInteractionConfig.parallelism, + _mainConfig.featureInteractionConfig.targetInteractionPercentage + ) + } else { + FeatureInteractionOutputPayload( + persistDataStage6, + stage6Fields, + Array[InteractionPayloadExtract]() + ) + } + + // Log the Inference config elements for Feature Interactions + InferenceConfig.setFeatureInteractionConfig( + FeatureInteractionConfig(featureInteractionResult.interactionReport) + ) + + val (persistFeatureInteractionData, persistFeatureInteractionCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.featureInteractionFlag) { + dataPersist( + persistDataStage6, + featureInteractionResult.data, + cacheLevel, + unpersistBlock + ) + } else { + ( + featureInteractionResult.data, + "no count when data prep caching is disabled" + ) + } + + //DEBUG + logger.log( + Level.DEBUG, + printSchema(persistFeatureInteractionData, "featureInteractionStage").toString + ) + + // OneHotEncoding Option + val (dataStage65, stage65Fields, stage65FullFields) = + if (_mainConfig.oneHotEncodeFlag) { + oneHotEncodeVector( + persistFeatureInteractionData, + featureInteractionResult.fullFeatureVectorColumns, + persistFeatureInteractionData.schema.names + ) + } else + ( + persistFeatureInteractionData, + featureInteractionResult.fullFeatureVectorColumns, + persistFeatureInteractionData.schema.names + ) + + val (persistDataStage65, dataStage65RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.oneHotEncodeFlag) { + dataPersist( + persistFeatureInteractionData, + dataStage65, + cacheLevel, + unpersistBlock + ) + } else { + (dataStage65, "no count when data prep caching is disabled") + } + + //DEBUG + logger.log(Level.DEBUG, printSchema(persistDataStage65, "stage65").toString) + + // Scaler + val dataStage7 = + if 
(_mainConfig.scalingFlag) scaler(dataStage65) else dataStage65 + + val (persistDataStage7, dataStage7RowCount) = + if (_mainConfig.dataPrepCachingFlag && _mainConfig.scalingFlag) { + dataPersist(persistDataStage65, dataStage7, cacheLevel, unpersistBlock) + } else { + (dataStage7, "no count when data prep caching is disabled") + } + + if (_mainConfig.scalingFlag && _mainConfig.dataPrepCachingFlag) { + println(dataStage7RowCount) + logger.log(Level.INFO, dataStage7RowCount) + } + + // Record the Inference Settings for Scaling + InferenceConfig.setInferenceScalingConfig(_mainConfig.scalingConfig) + + // Get the final DataFrame Field Loading + + val finalStageDF = + persistDataStage7.select(persistDataStage7.columns map col: _*) + if (_mainConfig.dataPrepCachingFlag) + dataPersist(persistDataStage7, finalStageDF, cacheLevel, unpersistBlock) + else finalStageDF.persist(cacheLevel) + + val finalCount = finalStageDF.count + + val finalSchema = s"Final Schema: \n ${stage65Fields.mkString(", ")}" + val finalFullSchema = + s"Final Full Schema: \n ${finalStageDF.columns.mkString(", ")}" + + val finalOutputDataFrame1 = + if (_mainConfig.geneticConfig.trainSplitMethod == "kSample") { + SyntheticFeatureGenerator( + finalStageDF, + _mainConfig.featuresCol, + _mainConfig.labelCol, + _mainConfig.geneticConfig.kSampleConfig.syntheticCol, + _mainConfig.fieldsToIgnoreInVector, + _mainConfig.geneticConfig.kSampleConfig.kGroups, + _mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter, + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance, + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement, + _mainConfig.geneticConfig.kSampleConfig.kMeansSeed, + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol, + _mainConfig.geneticConfig.kSampleConfig.lshHashTables, + _mainConfig.geneticConfig.kSampleConfig.lshSeed, + _mainConfig.geneticConfig.kSampleConfig.lshOutputCol, + _mainConfig.geneticConfig.kSampleConfig.quorumCount, + 
_mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate, + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod, + _mainConfig.geneticConfig.kSampleConfig.mutationMode, + _mainConfig.geneticConfig.kSampleConfig.mutationValue, + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode, + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold, + _mainConfig.geneticConfig.kSampleConfig.numericRatio, + _mainConfig.geneticConfig.kSampleConfig.numericTarget + ) + } else finalStageDF + + // If scaling is used, make sure that the synthetic data has the same scaling. + val finalOutputDataFrame2 = + if (_mainConfig.scalingFlag & _mainConfig.geneticConfig.trainSplitMethod == "kSample") { + val syntheticData = finalOutputDataFrame1.filter( + col(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + ) + scaler(syntheticData).unionByName( + finalOutputDataFrame1.filter( + col(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) === false + ) + ) + } else finalOutputDataFrame1 + + val finalStatement = + s"Data Prep complete. Final Dataframe cached. 
Total Observations: $finalCount" + // DEBUG + logger.log(Level.INFO, finalSchema) + logger.log(Level.INFO, finalFullSchema) + logger.log(Level.INFO, finalStatement) + println(finalStatement) + + DataGeneration( + finalOutputDataFrame2, + finalOutputDataFrame2.columns, + detectedModelType + ) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/FamilyRunner.scala b/src/main/scala/com/databricks/labs/automl/executor/FamilyRunner.scala new file mode 100644 index 00000000..eef15803 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/FamilyRunner.scala @@ -0,0 +1,382 @@ +package com.databricks.labs.automl.executor + +import com.databricks.labs.automl.AutomationRunner +import com.databricks.labs.automl.exceptions.PipelineExecutionException +import com.databricks.labs.automl.executor.config.{ + ConfigurationGenerator, + InstanceConfig +} +import com.databricks.labs.automl.model.tools.ModelUtils +import com.databricks.labs.automl.model.tools.split.PerformanceSettings +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.pipeline._ +import com.databricks.labs.automl.tracking.{ + MLFlowReportStructure, + MLFlowTracker +} +import com.databricks.labs.automl.utils.{ + AutoMlPipelineMlFlowUtils, + PipelineMlFlowTagKeys, + SparkSessionWrapper +} +import org.apache.spark.ml.PipelineModel +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer + +case class ModelReportSchema(generation: Int, score: Double, model: String) + +case class GenerationReportSchema(model_family: String, + model_type: String, + generation: Int, + generation_mean_score: Double, + generation_std_dev_score: Double, + model: String) + +/** + * @constructor Determine the best possible SparkML model for an ML task by supplying a DataFrame and an Array of + * InstanceConfig objects that have been defined with ConfigurationGenerator() + * @author Ben Wilson, Databricks + * @since 
0.5.0.3 + * @param data A Spark DataFrame that contains feature columns and a label column + * @param configs The configuration for each of the model types that are to be tested, stored in an Array. + * @example + * ``` + * val data: DataFrame = spark.table("db.test") + * val mapOverrides: Map[String, Any] = Map("labelCol" -> "finalLabelCol", "tunerNumberOfMutationsPerGeneration" -> 5) + * + * val randomForestConfig = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", mapOverrides) + * val logRegConfig = ConfigurationGenerator.generateConfigFromMap("LogisticRegression", "classifier", mapOverrides) + * val treeConfig = ConfigurationGenerator.generateConfigFromMap("Trees", "classifier", mapOverrides) + * + * val runner = new FamilyRunner(data, Array(randomForestConfig, logRegConfig, treeConfig)).execute + * ``` + */ +class FamilyRunner(data: DataFrame, configs: Array[InstanceConfig]) + extends SparkSessionWrapper { + + /** + * Private method for adding a field to the output collection DataFrame to tell which model family generated + * the data report. + * + * @param modelType The model type that was used for the experiment run + * @param dataFrame the dataframe whose contents will be added to with a field of the literal model type that + * generated the results. 
+ * @return a dataframe with the modeltype column added + */ + private def augmentDF(modelType: String, dataFrame: DataFrame): DataFrame = { + + dataFrame.withColumn("model", lit(modelType)) + + } + + /** + * TODO All other tuner requirements should be called from here as well to enable fail fast + * Enable fail fast for poorly configured environment + * @param _parallelism Tuner Parallelism from Main Config + * @throws java.lang.IllegalArgumentException if Invalid environment config + */ + @throws(classOf[IllegalArgumentException]) + private def validatePerformanceSettings(_parallelism: Int, + modelFamily: String): Unit = { + if (modelFamily == "XGBoost") { + PerformanceSettings.xgbWorkers(_parallelism) + } else PerformanceSettings.optimalJVMModelPartitions(_parallelism) + } + + /** + * Private method for unifying the outputs of each modeling run by family. Allows for collapsing the array outputs + * and unioning the DataFrames with additional information about what model was used to generate the summary report + * data. + * + * @param outputArray output array of each modeling family's run + * @return condensed report structure for all of the runs in a similar API return format. 
+ */ + private def unifyFamilyOutput( + outputArray: Array[FamilyOutput] + ): FamilyFinalOutput = { + + import spark.implicits._ + + var modelReport = ArrayBuffer[GroupedModelReturn]() + var generationReport = ArrayBuffer[GenerationalReport]() + var modelReportDataFrame = spark.emptyDataset[ModelReportSchema].toDF + var generationReportDataFrame = + spark.emptyDataset[GenerationReportSchema].toDF + var mlFlowOutput = ArrayBuffer[MLFlowReportStructure]() + + outputArray.foreach { x => + x.modelReport.foreach { y => + val model = y.model + + modelReport += GroupedModelReturn( + modelFamily = x.modelType, + hyperParams = y.hyperParams, + model = model, + score = y.score, + metrics = y.metrics, + generation = y.generation + ) + } + generationReport ++= x.generationReport + modelReportDataFrame = modelReportDataFrame.union(x.modelReportDataFrame) + generationReportDataFrame = generationReportDataFrame.union(x.generationReportDataFrame) + mlFlowOutput += x.mlFlowOutput + } + + FamilyFinalOutput( + modelReport = modelReport.toArray, + generationReport = generationReport.toArray, + modelReportDataFrame = modelReportDataFrame, + generationReportDataFrame = generationReportDataFrame, + mlFlowReport = mlFlowOutput.toArray + ) + } + + def getNewFamilyOutPut(output: TunerOutput, + instanceConfig: InstanceConfig): FamilyOutput = { + new FamilyOutput(instanceConfig.modelFamily, output.mlFlowOutput) { + override def modelReport: Array[GenericModelReturn] = output.modelReport + + override def generationReport: Array[GenerationalReport] = + output.generationReport + + override def modelReportDataFrame: DataFrame = + augmentDF(instanceConfig.modelFamily, output.modelReportDataFrame) + + override def generationReportDataFrame: DataFrame = + augmentDF(instanceConfig.modelFamily, output.generationReportDataFrame) + } + } + + /** + * + * @deprecated Use [[executeWithPipeline()]] instead. + * Start using executeWithPipeline to leverage Pipeline semantics + * + * Main method for executing the family runs as configured. 
+ * @return FamilyOutput object that reports the results of each of the family modeling runs. + */ + @Deprecated + def execute(): FamilyFinalOutput = { + + val outputBuffer = ArrayBuffer[FamilyOutput]() + + configs.foreach { x => + val mainConfiguration = ConfigurationGenerator.generateMainConfig(x) + + val runner = new AutomationRunner(data) + .setMainConfig(mainConfiguration) + + val preppedData = runner.prepData() + + val preppedDataOverride = preppedData.copy(modelType = x.predictionType) + + val output = runner.executeTuning(preppedDataOverride) + + outputBuffer += getNewFamilyOutPut(output, x) + } + unifyFamilyOutput(outputBuffer.toArray) + + } + + private def addMainConfigToPipelineCache(mainConfig: MainConfig): Unit = { + PipelineStateCache + .addToPipelineCache( + mainConfig.pipelineId, + PipelineVars.MAIN_CONFIG.key, + mainConfig + ) + } + + private def addMlFlowConfigForPipelineUse(mainConfig: MainConfig) = { + addMainConfigToPipelineCache(mainConfig) + if (mainConfig.mlFlowLoggingFlag) { + val mlFlowRunId = + MLFlowTracker(mainConfig).generateMlFlowRunId() + PipelineStateCache + .addToPipelineCache( + mainConfig.pipelineId, + PipelineVars.MLFLOW_RUN_ID.key, + mlFlowRunId + ) + AutoMlPipelineMlFlowUtils + .logTagsToMlFlow( + mainConfig.pipelineId, + Map( + s"${PipelineMlFlowTagKeys.PIPELINE_ID}" + -> + mainConfig.pipelineId + ) + ) + PipelineMlFlowProgressReporter.starting(mainConfig.pipelineId) + } + } + + /** + * + * @return grouped results same as execute [[FamilyFinalOutputWithPipeline]] but + * also contains a map of model family and best pipeline model (along with mlflow Run ID) + * based on optimization strategy settings + */ + def executeWithPipeline(): FamilyFinalOutputWithPipeline = { + + val outputBuffer = ArrayBuffer[FamilyOutput]() + + val pipelineConfigMap = scala.collection.mutable + .Map[String, (FeatureEngineeringOutput, MainConfig)]() + configs.foreach { x => + val mainConfiguration = ConfigurationGenerator.generateMainConfig(x) + 
validatePerformanceSettings( + mainConfiguration.geneticConfig.parallelism, + mainConfiguration.modelFamily + ) + val runner = new AutomationRunner(data).setMainConfig(mainConfiguration) + + // Perform cardinality check if the model type is a tree-based family and update the + // numeric mappings to handle the maxBins issue for nominal and categorical data. + + x.modelFamily.toLowerCase.replaceAll("\\s", "") match { + case "randomforest" | "trees" | "gbt" | "xgboost" => { + val updatedNumBoundaries = ModelUtils.resetTreeBinsSearchSpace( + data, + x.algorithmConfig.numericBoundaries, + x.genericConfig.fieldsToIgnoreInVector, + x.genericConfig.labelCol, + x.genericConfig.featuresCol + ) + runner.setNumericBoundaries(updatedNumBoundaries) + } + case _ => () + } + + // Setup MLflow Run + addMlFlowConfigForPipelineUse(mainConfiguration) + try { + //Get feature engineering pipeline and transform it to get feature engineered dataset + val featureEngOutput = FeatureEngineeringPipelineContext + .generatePipelineModel(data, mainConfiguration) + val featureEngineeredDf = featureEngOutput.transformedForTrainingDf + val preppedDataOverride = DataGeneration( + featureEngineeredDf, + featureEngineeredDf.columns, + featureEngOutput.decidedModel + ).copy(modelType = x.predictionType) + + val output = + runner.executeTuning(preppedDataOverride, isPipeline = true) + + outputBuffer += getNewFamilyOutPut(output, x) + pipelineConfigMap += x.modelFamily -> (featureEngOutput, mainConfiguration) + } catch { + case ex: Exception => { + ex.printStackTrace() + PipelineMlFlowProgressReporter.failed( + mainConfiguration.pipelineId, + ex.getMessage + ) + throw PipelineExecutionException(mainConfiguration.pipelineId, ex) + } + } + } + withPipelineInferenceModel( + unifyFamilyOutput(outputBuffer.toArray), + configs, + pipelineConfigMap.toMap + ) + } + + /** + * @param verbose: If set to true, any dataset transformed with this feature engineered pipeline will include all + * input 
columns for the vector assembler stage. + * @return Generates feature engineering pipeline for a given configuration under a given Model Family + * Note: It does not trigger any Model training. + */ + def generateFeatureEngineeredPipeline( + verbose: Boolean = false + ): Map[String, PipelineModel] = { + val featureEngineeredMap = + scala.collection.mutable.Map[String, PipelineModel]() + configs.foreach { x => + val mainConfiguration = ConfigurationGenerator.generateMainConfig(x) + addMainConfigToPipelineCache(mainConfiguration) + val featureEngOutput = + FeatureEngineeringPipelineContext.generatePipelineModel( + data, + mainConfiguration, + verbose, + isFeatureEngineeringOnly = true + ) + val finalPipelineModel = + FeatureEngineeringPipelineContext.addUserReturnViewStage( + featureEngOutput.pipelineModel, + mainConfiguration, + featureEngOutput.pipelineModel.transform(data), + featureEngOutput.originalDfViewName + ) + featureEngineeredMap += x.modelFamily -> finalPipelineModel + } + featureEngineeredMap.toMap + } + + def withPipelineInferenceModel( + familyFinalOutput: FamilyFinalOutput, + configs: Array[InstanceConfig], + pipelineConfigs: Map[String, (FeatureEngineeringOutput, MainConfig)] + ): FamilyFinalOutputWithPipeline = { + val pipelineModels = scala.collection.mutable.Map[String, PipelineModel]() + val bestMlFlowRunIds = scala.collection.mutable.Map[String, String]() + configs.foreach(config => { + val modelReport = familyFinalOutput.modelReport.filter( + item => item.modelFamily.equals(config.modelFamily) + ) + pipelineModels += config.modelFamily -> FeatureEngineeringPipelineContext + .buildFullPredictPipeline( + pipelineConfigs(config.modelFamily)._1, + modelReport, + pipelineConfigs(config.modelFamily)._2, + data + ) + bestMlFlowRunIds += config.modelFamily -> familyFinalOutput + .mlFlowReport(0) + .bestLog + .runIdPayload(0) + ._1 + }) + FamilyFinalOutputWithPipeline( + familyFinalOutput, + pipelineModels.toMap, + bestMlFlowRunIds.toMap + ) + } +} + 
+/** + * Companion Object allowing for class instantiation through configs either as an Instance config or Map overrides + * collection. + */ +object FamilyRunner { + + def apply(data: DataFrame, configs: Array[InstanceConfig]): FamilyRunner = + new FamilyRunner(data, configs) + + def apply(data: DataFrame, + modelFamily: String, + predictionType: String, + configs: Array[Map[String, Any]]): FamilyRunner = { + + val configBuffer = ArrayBuffer[InstanceConfig]() + + configs.foreach { x => + configBuffer += ConfigurationGenerator.generateConfigFromMap( + modelFamily, + predictionType, + x + ) + } + + new FamilyRunner(data, configBuffer.toArray) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/config/BatteryDefaults.scala b/src/main/scala/com/databricks/labs/automl/executor/config/BatteryDefaults.scala new file mode 100644 index 00000000..68191789 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/config/BatteryDefaults.scala @@ -0,0 +1,68 @@ +package com.databricks.labs.automl.executor.config + +import com.databricks.labs.automl.executor.config.ModelSelector.ModelSelector +import com.databricks.labs.automl.executor.config.PredictionType.PredictionType + +trait BatteryDefaults { + + /** + * Default constructor for a given family of models to generate enumeration types of models to execute + * @param predictionType Supplied prediction type (either `Regressor` or `Classifier` + * @return An array of models that support the provided predictionType supplied in the constructor. 
+ * @since 0.5.0.3 + */ + def modelSelection(predictionType: PredictionType): Array[ModelSelector] = { + + predictionType match { + case PredictionType.Regressor => + RegressorModels.values.toArray + .map(_.toString) + .map(x => ModelSelector.withName(x)) + case PredictionType.Classifier => + ClassiferModels.values.toArray + .map(_.toString) + .map(x => ModelSelector.withName(x)) + case _ => + throw new UnsupportedOperationException( + s"PredictionType ${predictionType.toString} is not a supported " + + s"type. Must be one of: ${PredictionType.values.mkString(", ")}" + ) + } + } + + def modelToStandardizedString(modelType: ModelSelector): String = { + + modelType match { + case ModelSelector.GBTClassifier => "gbt" + case ModelSelector.GBTRegressor => "gbt" + case ModelSelector.LinearRegression => "linearregression" + case ModelSelector.LogisticRegression => "logisticregression" + case ModelSelector.MLPC => "mlpc" + case ModelSelector.RandomForestClassifier => "randomforest" + case ModelSelector.RandomForestRegressor => "randomforest" + case ModelSelector.SVM => "svm" + case ModelSelector.TreesClassifier => "trees" + case ModelSelector.TreesRegressor => "trees" + case ModelSelector.XGBoostClassifier => "xgboost" + case ModelSelector.XGBoostRegressor => "xgboost" + case ModelSelector.LightGBMBinary => "gbmbinary" + case ModelSelector.LightGBMMulti => "gbmmulti" + case ModelSelector.LightGBMMultiOVA => "gbmmultiova" + case ModelSelector.LightGBMHuber => "gbmhuber" + case ModelSelector.LightGBMFair => "gbmfair" + case ModelSelector.LightGBMLasso => "gbmlasso" + case ModelSelector.LightGBMRidge => "gbmridge" + case ModelSelector.LightGBMPoisson => "gbmpoisson" + case ModelSelector.LightGBMQuantile => "gbmquantile" + case ModelSelector.LightGBMMape => "gbmmape" + case ModelSelector.LightGBMTweedie => "gbmtweedie" + case ModelSelector.LightGBMGamma => "gbmgamma" + case _ => + throw new UnsupportedOperationException( + s"'${modelType.toString}' is not supported." 
+ ) + } + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/config/BatteryGenerator.scala b/src/main/scala/com/databricks/labs/automl/executor/config/BatteryGenerator.scala new file mode 100644 index 00000000..f5d7dbe1 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/config/BatteryGenerator.scala @@ -0,0 +1,210 @@ +package com.databricks.labs.automl.executor.config + +import com.databricks.labs.automl.executor.config.ModelSelector.ModelSelector +import com.databricks.labs.automl.executor.config.PredictionType.PredictionType +import com.databricks.labs.automl.params.MainConfig + +import scala.collection.mutable.ArrayBuffer + +/** + * A class and companion object to generate configurations for a particular modeling type. Developer API. + * Currently available modeling types: `regressor` and `classifier` + * + * @example + * ``` + * val regressorConfigs = new BatteryGenerator("regressor").setModelsToTest(Array("RandomForest", "GBT")) + * .generateDefaultConfigs( + * "labelColumn", + * "features", + * "https://instance.host.com", + * "regressorTest", + * "", + * "/model/save/dir", + * "/inference/save/dir", + * "full", + * "_best_model", + * true, + * false) + * ``` + * @author Ben Wilson, Databricks + * @since 0.5.0.3 + * @param predictionType modeling type to generate: either 'regressor' or 'classifier' + */ +class BatteryGenerator(predictionType: String) + extends BatteryDefaults + with ConfigurationDefaults { + + private[config] val batteryType: PredictionType = + BatteryGenerator.predictionTypeAssignment(predictionType) + + // Instantiate the Defaults + private var _modelsToTest: Array[ModelSelector] = modelSelection(batteryType) + + // Override setter for defining models to be run + def setModelsToTest(modelFamilies: Array[String]): this.type = { + + // Define the allowable collection of model types + val allowableModels = modelSelection(batteryType) + + // Create a holding collection + var modelCollection = 
ArrayBuffer[ModelSelector]() + + // Run validation on supplied models to create instance types + modelFamilies foreach { x => + val definedModel = modelTypeEvaluator(x, predictionType) + require( + allowableModels.contains(definedModel), + s"ModelsToTest setting $x is not allowable. Must be one of: " + + s"${allowableModels.map(_.toString).mkString(", ")}" + ) + modelCollection += definedModel + } + + // Overwrite the defined defaults for restricting the families of models based on user input. + _modelsToTest = modelCollection.toArray + this + } + + def setModelsToTest(modelFamilies: String*): this.type = { + setModelsToTest(modelFamilies.toArray) + this + } + + def getModelsToTest: Array[ModelSelector] = _modelsToTest + + /** + * Public method that requires definition of settings that will need to be unique for the run, but using + * all other default settings for modeling and algorithm tuning (quick-start) + * + * @param labelCol Label column in the DataFrame + * @param featuresCol Name of the to-be-generated feature vector column + * @param mlFlowTrackingURI Tracking host address + * @param mlFlowExperimentName Unique name for the experiment to be run + * @param mlFlowAPIToken API token for the MLFlow service + * @param modelSaveDirectory Blob storage location for saving built models + * @param inferenceConfigSaveLocation Blob storage location for saving the Inference Config to reproduce the run + * @param mlFlowLoggingMode modes supported: "tuningOnly", "bestOnly", "full" + * @param mlFlowBestSuffix string to append to a new experiment run that captures only the best run found in its + * own experiment location + * @param mlFlowLoggingFlag setting on whether or not to log anything to mlflow + * @param mlFlowLogArtifactsFlag setting on whether or not to log the model artifacts to mlflow + * @param fieldsToIgnoreInModeling optional field that can be populated by an array of column names to be ignored + * for the purposes of modeling, but retained in the 
final result dataframe(s) + * @return An array of configuration objects with (mostly) default settings for each model type that has been + * supplied in the configuration of this class. + */ + def generateDefaultConfigs(labelCol: String, + featuresCol: String, + mlFlowTrackingURI: String, + mlFlowExperimentName: String, + mlFlowAPIToken: String, + modelSaveDirectory: String, + inferenceConfigSaveLocation: String, + mlFlowLoggingMode: String = "full", + mlFlowBestSuffix: String = "_best", + mlFlowLoggingFlag: Boolean = true, + mlFlowLogArtifactsFlag: Boolean = true, + fieldsToIgnoreInModeling: Array[String] = + Array.empty[String]): Array[InstanceConfig] = { + + val defaultBuffer: ArrayBuffer[InstanceConfig] = + ArrayBuffer[InstanceConfig]() + + _modelsToTest foreach { x => + val defaultGenericConfig: GenericConfig = + new GenericConfigGenerator(predictionType) + .setLabelCol(labelCol) + .setFeaturesCol(featuresCol) + .setFieldsToIgnoreInVector(fieldsToIgnoreInModeling) + .getConfig + + val modelSpecificConfig = new ConfigurationGenerator( + modelToStandardizedString(x), + predictionType, + defaultGenericConfig + ).setMlFlowTrackingURI(mlFlowTrackingURI) + .setMlFlowAPIToken(mlFlowAPIToken) + .setMlFlowExperimentName(mlFlowExperimentName) + .setMlFlowModelSaveDirectory(modelSaveDirectory) + .setMlFlowLoggingFlag(mlFlowLoggingFlag) + .setMlFlowLogArtifactsFlag(mlFlowLogArtifactsFlag) + + defaultBuffer += modelSpecificConfig.getInstanceConfig + } + + defaultBuffer.toArray + + } + + /** + * Public method for generating main config objects based on a family + * + * @param labelCol Label column in the DataFrame + * @param featuresCol Name of the to-be-generated feature vector column + * @param mlFlowTrackingURI Tracking host address + * @param mlFlowExperimentName Unique name for the experiment to be run + * @param mlFlowAPIToken API token for the MLFlow service + * @param modelSaveDirectory Blob storage location for saving build models + * @param 
inferenceConfigSaveLocation Blob storage location for saving the Inference Config to reproduce the run + * @param mlFlowLoggingMode modes supported: "tuningOnly", "bestOnly", "full" + * @param mlFlowBestSuffix string to append to a new experiment run that captures only the best run found in its + * own experiment location + * @param mlFlowLoggingFlag setting on whether or not to log anything to mlflow + * @param mlFlowLogArtifactsFlag setting on whether or not to log the model artifacts to mlflow + * @param fieldsToIgnoreInModeling optional field that can be populated by an array of column names to be ignored + * for the purposes of modeling, but retained in the final result dataframe(s) + * @return Array of MainConfig objects + */ + @deprecated( + "Main Config accessor will be replaced in future versions by InstanceConfig()." + ) + def generateDefaultMainConfigs(labelCol: String, + featuresCol: String, + mlFlowTrackingURI: String, + mlFlowExperimentName: String, + mlFlowAPIToken: String, + modelSaveDirectory: String, + inferenceConfigSaveLocation: String, + mlFlowLoggingMode: String = "full", + mlFlowBestSuffix: String = "_best", + mlFlowLoggingFlag: Boolean = true, + mlFlowLogArtifactsFlag: Boolean = true, + fieldsToIgnoreInModeling: Array[String] = + Array.empty[String]): Array[MainConfig] = { + + val mainConfigBuffer: ArrayBuffer[MainConfig] = ArrayBuffer[MainConfig]() + + val instanceConfigs = generateDefaultConfigs( + labelCol, + featuresCol, + mlFlowTrackingURI, + mlFlowExperimentName, + mlFlowAPIToken, + modelSaveDirectory, + inferenceConfigSaveLocation, + mlFlowLoggingMode, + mlFlowBestSuffix, + mlFlowLoggingFlag, + mlFlowLogArtifactsFlag, + fieldsToIgnoreInModeling + ) + + instanceConfigs foreach { x => + mainConfigBuffer += ConfigurationGenerator.generateMainConfig(x) + } + mainConfigBuffer.toArray + + } + +} + +object BatteryGenerator extends ConfigurationDefaults { + + def apply(predictionType: String): BatteryGenerator = + new 
BatteryGenerator(predictionType) + + def predictionTypeAssignment(predictionType: String): PredictionType = { + predictionTypeEvaluator(predictionType) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationDefaults.scala b/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationDefaults.scala new file mode 100644 index 00000000..f3c6ac99 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationDefaults.scala @@ -0,0 +1,821 @@ +package com.databricks.labs.automl.executor.config + +import com.databricks.labs.automl.utils.InitDbUtils + +trait ConfigurationDefaults { + + import FamilyValidator._ + import ModelDefaults._ + import ModelSelector._ + import PredictionType._ + + /** + * General Tools + */ + private[config] def modelTypeEvaluator( + modelFamily: String, + predictionType: String + ): ModelSelector = { + ( + modelFamily.toLowerCase.replaceAll("\\s", ""), + predictionType.toLowerCase.replaceAll("\\s", "") + ) match { + case ("trees", "regressor") => TreesRegressor + case ("trees", "classifier") => TreesClassifier + case ("gbt", "regressor") => GBTRegressor + case ("gbt", "classifier") => GBTClassifier + case ("randomforest", "regressor") => RandomForestRegressor + case ("randomforest", "classifier") => RandomForestClassifier + case ("linearregression", "regressor") => LinearRegression + case ("logisticregression", "classifier") => LogisticRegression + case ("xgboost", "regressor") => XGBoostRegressor + case ("xgboost", "classifier") => XGBoostClassifier + case ("mlpc", "classifier") => MLPC + case ("svm", "regressor") => SVM +// case ("gbmbinary", "classifier") => LightGBMBinary // turning these off until LightGBM is fixed by MSFT +// case ("gbmmulti", "classifier") => LightGBMMulti +// case ("gbmmultiova", "classifier") => LightGBMMultiOVA +// case ("gbmhuber", "regressor") => LightGBMHuber +// case ("gbmfair", "regressor") => LightGBMFair +// case ("gbmlasso", "regressor") 
=> LightGBMLasso +// case ("gbmridge", "regressor") => LightGBMRidge +// case ("gbmpoisson", "regressor") => LightGBMPoisson +// case ("gbmquantile", "regressor") => LightGBMQuantile +// case ("gbmmape", "regressor") => LightGBMMape +// case ("gbmtweedie", "regressor") => LightGBMTweedie +// case ("gbmgamma", "regressor") => LightGBMGamma + case (_, _) => + throw new IllegalArgumentException( + s"'$modelFamily' Model Family and PredictionType " + + s"'$predictionType' are not supported." + ) + } + } + + private[config] def predictionTypeEvaluator( + predictionType: String + ): PredictionType = { + predictionType.toLowerCase.replaceAll("\\s", "") match { + case "regressor" => Regressor + case "classifier" => Classifier + case _ => + throw new IllegalArgumentException( + s"'$predictionType' is not a supported type! Must be either: " + + s"'regressor' or 'classifier'" + ) + } + } + + private[config] def familyTypeEvaluator( + modelFamily: String + ): FamilyValidator = { + modelFamily.toLowerCase.replaceAll("\\s", "") match { + case "trees" | "gbt" | "randomforest" | "xgboost" => Trees + case _ => NonTrees + } + } + + private[config] def zeroToOneValidation(value: Double, + parameterName: String): Unit = { + require( + value >= 0.0 & value <= 1.0, + s"$parameterName submitted value of '$value' is outside of the allowable " + + s"bounds of 0.0 to 1.0." + ) + } + + private[config] def validateMembership(value: String, + collection: List[String], + parameterName: String): Unit = { + require( + collection.contains(value), + s"$parameterName value '$value' is not supported. 
Must be one of: '" + + s"${collection.mkString(", ")}'" + ) + } + + /** + * Static restrictions + */ + final val allowableDateTimeConversionTypes: List[String] = + List("unix", "split") + final val allowableRegressionScoringMetrics: List[String] = + List("rmse", "mse", "r2", "mae") + final val allowableClassificationScoringMetrics: List[String] = List( + "f1", + "weightedPrecision", + "weightedRecall", + "accuracy", + "areaUnderPR", + "areaUnderROC" + ) + final val allowableScoringOptimizationStrategies: List[String] = + List("minimize", "maximize") + final val allowableNumericFillStats: List[String] = + List("min", "25p", "mean", "median", "75p", "max") + final val allowableCharacterFillStats: List[String] = List("min", "max") + final val allowableOutlierFilterBounds: List[String] = + List("lower", "upper", "both") + final val allowablePearsonFilterStats: List[String] = + List("pValue", "degreesFreedom", "pearsonStat") + final val allowablePearsonFilterDirections: List[String] = + List("greater", "lesser") + final val allowablePearsonFilterModes: List[String] = List("auto", "manual") + final val allowableScalers: List[String] = + List("minMax", "standard", "normalize", "maxAbs") + final val allowableTrainSplitMethods: List[String] = List( + "random", + "chronological", + "stratifyReduce", + "stratified", + "overSample", + "underSample", + "kSample" + ) + final val allowableEvolutionStrategies: List[String] = + List("batch", "continuous") + final val allowableMlFlowLoggingModes: List[String] = + List("tuningOnly", "bestOnly", "full") + final val allowableInitialGenerationModes: List[String] = + List("random", "permutations") + final val allowableInitialGenerationIndexMixingModes: List[String] = + List("random", "linear") + final val allowableMutationStrategies: List[String] = List("linear", "fixed") + final val allowableMutationMagnitudeMode: List[String] = + List("random", "fixed") + final val allowableHyperSpaceModelTypes: List[String] = + List("RandomForest", 
"LinearRegression", "XGBoost") + final val allowableFeatureImportanceCutoffTypes: List[String] = + List("none", "value", "count") + final val allowableKMeansDistanceMeasurements: List[String] = + List("cosine", "euclidean") + final val allowableMutationModes: List[String] = + List("weighted", "random", "ratio") + final val allowableVectorMutationMethods: List[String] = + List("random", "fixed", "all") + final val allowableLabelBalanceModes: List[String] = + List("match", "percentage", "target") + final val allowableDateTimeConversions: List[String] = List("unix", "split") + final val allowableCategoricalFilterModes: List[String] = + List("silent", "warn") + final val allowableCardinalilties: List[String] = List("approx", "exact") + final val allowableNAFillModes: List[String] = + List( + "auto", + "mapFill", + "blanketFillAll", + "blanketFillCharOnly", + "blanketFillNumOnly" + ) + final val allowableGeneticMBORegressorTypes: List[String] = + List("XGBoost", "LinearRegression", "RandomForest") + + final val allowableFeatureInteractionModes = + List("optimistic", "strict", "all") + + /** + * Generic Helper Methods + */ + private def familyScoringCheck(predictionType: PredictionType): String = { + predictionType match { + case Regressor => "rmse" + case _ => "f1" + } + } + + private def familyScoringCheck(predictionType: String): String = { + familyScoringCheck(predictionTypeEvaluator(predictionType)) + } + + private def treesBooleanSwitch(modelType: FamilyValidator): Boolean = { + modelType match { + case Trees => false + case _ => true + } + } + + def oneHotEncodeFlag(family: FamilyValidator): Boolean = + treesBooleanSwitch(family) + def scalingFlag(family: FamilyValidator): Boolean = treesBooleanSwitch(family) + + private def familyScoringDirection(predictionType: PredictionType): String = { + predictionType match { + case Regressor => "minimize" + case _ => "maximize" + } + } + + private def familyScoringDirection(predictionType: String): String = { + 
familyScoringDirection(predictionTypeEvaluator(predictionType)) + } + + /** + * Algorithm Helper Methods + */ + private[config] def boundaryValidation(modelKeys: Set[String], + overwriteKeys: Set[String]): Unit = { + require( + modelKeys == overwriteKeys, + s"The provided configuration does not match. Expected: " + + s"${modelKeys.mkString(", ")}, but got: ${overwriteKeys.mkString(", ")}" + ) + } + + private[config] def validateNumericBoundariesKeys( + modelType: ModelSelector, + value: Map[String, (Double, Double)] + ): Unit = { + modelType match { + case RandomForestRegressor => + boundaryValidation(randomForestNumeric.keys.toSet, value.keys.toSet) + case RandomForestClassifier => + boundaryValidation(randomForestNumeric.keys.toSet, value.keys.toSet) + case TreesRegressor => + boundaryValidation(treesNumeric.keys.toSet, value.keys.toSet) + case TreesClassifier => + boundaryValidation(treesNumeric.keys.toSet, value.keys.toSet) + case XGBoostRegressor => + boundaryValidation(xgBoostNumeric.keys.toSet, value.keys.toSet) + case XGBoostClassifier => + boundaryValidation(xgBoostNumeric.keys.toSet, value.keys.toSet) + case MLPC => boundaryValidation(mlpcNumeric.keys.toSet, value.keys.toSet) + case GBTRegressor => + boundaryValidation(gbtNumeric.keys.toSet, value.keys.toSet) + case GBTClassifier => + boundaryValidation(gbtNumeric.keys.toSet, value.keys.toSet) + case LinearRegression => + boundaryValidation(linearRegressionNumeric.keys.toSet, value.keys.toSet) + case LogisticRegression => + boundaryValidation( + logisticRegressionNumeric.keys.toSet, + value.keys.toSet + ) + case SVM => boundaryValidation(svmNumeric.keys.toSet, value.keys.toSet) + case LightGBMBinary | LightGBMMulti | LightGBMMultiOVA | LightGBMHuber | + LightGBMFair | LightGBMLasso | LightGBMRidge | + LightGBMPoisson | LightGBMQuantile | LightGBMMape | LightGBMTweedie | + LightGBMGamma => + boundaryValidation(lightGBMnumeric.keys.toSet, value.keys.toSet) + } + } + + private[config] def
validateNumericBoundariesValues( + values: Map[String, (Double, Double)] + ): Unit = { + values.foreach( + k => + require( + k._2._1 < k._2._2, + s"Numeric Boundary key ${k._1} is set incorrectly! " + + s"Boundary definitions must be in the form: (min, max)" + ) + ) + } + + private[config] def numericBoundariesAssignment( + modelType: ModelSelector + ): Map[String, (Double, Double)] = { + modelType match { + case RandomForestRegressor => randomForestNumeric + case RandomForestClassifier => randomForestNumeric + case TreesRegressor => treesNumeric + case TreesClassifier => treesNumeric + case XGBoostRegressor => xgBoostNumeric + case XGBoostClassifier => xgBoostNumeric + case MLPC => mlpcNumeric + case GBTRegressor => gbtNumeric + case GBTClassifier => gbtNumeric + case LinearRegression => linearRegressionNumeric + case LogisticRegression => logisticRegressionNumeric + case SVM => svmNumeric + case LightGBMBinary | LightGBMMulti | LightGBMMultiOVA | LightGBMHuber | + LightGBMFair | LightGBMLasso | LightGBMRidge | + LightGBMPoisson | LightGBMQuantile | LightGBMMape | LightGBMTweedie | + LightGBMGamma => + lightGBMnumeric + case _ => + throw new NotImplementedError( + s"Model Type ${modelType.toString} is not implemented." 
+ ) + } + } + + private[config] def validateStringBoundariesKeys( + modelType: ModelSelector, + value: Map[String, List[String]] + ): Unit = { + modelType match { + case RandomForestRegressor => + boundaryValidation(randomForestString.keys.toSet, value.keys.toSet) + case RandomForestClassifier => + boundaryValidation(randomForestString.keys.toSet, value.keys.toSet) + case TreesRegressor => + boundaryValidation(treesString.keys.toSet, value.keys.toSet) + case TreesClassifier => + boundaryValidation(treesString.keys.toSet, value.keys.toSet) + case MLPC => boundaryValidation(mlpcString.keys.toSet, value.keys.toSet) + case GBTRegressor => + boundaryValidation(gbtString.keys.toSet, value.keys.toSet) + case GBTClassifier => + boundaryValidation(gbtString.keys.toSet, value.keys.toSet) + case LinearRegression => + boundaryValidation(linearRegressionString.keys.toSet, value.keys.toSet) + case LightGBMBinary | LightGBMMulti | LightGBMMultiOVA | LightGBMHuber | + LightGBMFair | LightGBMLasso | LightGBMRidge | + LightGBMPoisson | LightGBMQuantile | LightGBMMape | LightGBMTweedie | + LightGBMGamma => + boundaryValidation(lightGBMString.keys.toSet, value.keys.toSet) + case _ => () + } + } + + private[config] def stringBoundariesAssignment( + modelType: ModelSelector + ): Map[String, List[String]] = { + modelType match { + case RandomForestRegressor => randomForestString + case RandomForestClassifier => randomForestString + case TreesRegressor => treesString + case TreesClassifier => treesString + case XGBoostRegressor => Map.empty + case XGBoostClassifier => Map.empty + case MLPC => mlpcString + case GBTRegressor => gbtString + case GBTClassifier => gbtString + case LinearRegression => linearRegressionString + case LogisticRegression => Map.empty + case SVM => Map.empty + case LightGBMBinary | LightGBMMulti | LightGBMMultiOVA | LightGBMHuber | + LightGBMFair | LightGBMLasso | LightGBMRidge | + LightGBMPoisson | LightGBMQuantile | LightGBMMape | 
LightGBMTweedie | + LightGBMGamma => + lightGBMString + case _ => + throw new NotImplementedError( + s"Model Type ${modelType.toString} is not implemented." + ) + } + } + + /** + * Generate the default configuration objects + */ + private[config] def genericConfig( + predictionType: PredictionType + ): GenericConfig = { + val labelCol = "label" + val featuresCol = "features" + val dateTimeConversionType = "split" + val fieldsToIgnoreInVector = Array.empty[String] + val scoringMetric = familyScoringCheck(predictionType) + val scoringOptimizationStrategy = familyScoringDirection(predictionType) + + GenericConfig( + labelCol, + featuresCol, + dateTimeConversionType, + fieldsToIgnoreInVector, + scoringMetric, + scoringOptimizationStrategy + ) + } + + private[config] def switchConfig(family: FamilyValidator): SwitchConfig = { + val naFillFlag = true + val varianceFilterFlag = true + val outlierFilterFlag = false + val pearsonFilterFlag = false + val covarianceFilterFlag = false + val oheFlag = oneHotEncodeFlag(family) + val scaleFlag = scalingFlag(family) + val dataPrepCachingFlag = false + val autoStoppingFlag = false + val pipelineDebugFlag = false + val featureInteractionFlag = false + + SwitchConfig( + naFillFlag, + varianceFilterFlag, + outlierFilterFlag, + pearsonFilterFlag, + covarianceFilterFlag, + oheFlag, + scaleFlag, + featureInteractionFlag, + dataPrepCachingFlag, + autoStoppingFlag, + pipelineDebugFlag + ) + } + + private[config] def algorithmConfig( + modelType: ModelSelector + ): AlgorithmConfig = + AlgorithmConfig( + stringBoundariesAssignment(modelType), + numericBoundariesAssignment(modelType) + ) + + private[config] def featureEngineeringConfig(): FeatureEngineeringConfig = { + val dataPrepParallelism = 10 + val numericFillStat = "mean" + val characterFillStat = "max" + val modelSelectionDistinctThreshold = 50 + val outlierFilterBounds = "both" + val outlierLowerFilterNTile = 0.02 + val outlierUpperFilterNTile = 0.98 + val outlierFilterPrecision = 
0.01 + val outlierContinuousDataThreshold = 50 + val outlierFieldsToIgnore = Array.empty[String] + val pearsonFilterStatistic = "pValue" + val pearsonFilterDirection = "greater" + val pearsonFilterManualValue = 0.0 + val pearsonFilterMode = "auto" + val pearsonAutoFilterNTile = 0.75 + val covarianceCorrelationCutoffLow = -0.8 + val covarianceCorrelationCutoffHigh = 0.8 + val scalingType = "minMax" + val scalingMin = 0.0 + val scalingMax = 1.0 + val scalingStandardMeanFlag = false + val scalingStdDevFlag = true + val scalingPNorm = 2.0 + val featureImportanceCutoffType = "count" + val featureImportanceCutoffValue = 15.0 + val dataReductionFactor = 0.5 + val cardinalitySwitch = true + val cardinalityType = "exact" + val cardinalityLimit = 200 + val cardinalityPrecision = 0.05 + val cardinalityCheckMode = "silent" + val filterPrecision = 0.01 + val categoricalNAFillMap = Map.empty[String, String] + val numericNAFillMap = Map.empty[String, AnyVal] + val characterNABlanketFillValue = "" + val numericNABlanketFillValue = 0.0 + val naFillMode = "auto" + val featureInteractionRetentionMode = "optimistic" + val featureInteractionContinuousDiscretizerBucketCount = 10 + val featureInteractionParallelism = 12 + val featureInteractionTargetInteractionPercentage = 10.0 + + FeatureEngineeringConfig( + dataPrepParallelism, + numericFillStat, + characterFillStat, + modelSelectionDistinctThreshold, + outlierFilterBounds, + outlierLowerFilterNTile, + outlierUpperFilterNTile, + outlierFilterPrecision, + outlierContinuousDataThreshold, + outlierFieldsToIgnore, + pearsonFilterStatistic, + pearsonFilterDirection, + pearsonFilterManualValue, + pearsonFilterMode, + pearsonAutoFilterNTile, + covarianceCorrelationCutoffLow, + covarianceCorrelationCutoffHigh, + scalingType, + scalingMin, + scalingMax, + scalingStandardMeanFlag, + scalingStdDevFlag, + scalingPNorm, + featureImportanceCutoffType, + featureImportanceCutoffValue, + dataReductionFactor, + cardinalitySwitch, + cardinalityType, + 
cardinalityLimit, + cardinalityPrecision, + cardinalityCheckMode, + filterPrecision, + categoricalNAFillMap, + numericNAFillMap, + characterNABlanketFillValue, + numericNABlanketFillValue, + naFillMode, + featureInteractionRetentionMode, + featureInteractionContinuousDiscretizerBucketCount, + featureInteractionParallelism, + featureInteractionTargetInteractionPercentage + ) + } + + private[config] def tunerConfig(): TunerConfig = { + val tunerAutoStoppingScore = 0.99 + val tunerParallelism = 20 + val tunerKFold = 5 + val tunerTrainPortion = 0.8 + val tunerTrainSplitMethod = "random" + val tunerKSampleSyntheticCol = "synthetic_ksample" + val tunerKSampleKGroups = 25 + val tunerKSampleKMeansMaxIter = 100 + val tunerKSampleKMeansTolerance = 1E-6 + val tunerKSampleKMeansDistanceMeasurement = "euclidean" + val tunerKSampleKMeansSeed = 42L + val tunerKSampleKMeansPredictionCol = "kGroups_ksample" + val tunerKSampleLSHHashTables = 10 + val tunerKSampleLSHSeed = 42L + val tunerKSampleLSHOutputCol = "hashes_ksample" + val tunerKSampleQuorumCount = 7 + val tunerKSampleMinimumVectorCountToMutate = 1 + val tunerKSampleVectorMutationMethod = "random" + val tunerKSampleMutationMode = "weighted" + val tunerKSampleMutationValue = 0.5 + val tunerKSampleLabelBalanceMode = "match" + val tunerKSampleCardinalityThreshold = 20 + val tunerKSampleNumericRatio = 0.2 + val tunerKSampleNumericTarget = 500 + val tunerTrainSplitChronologicalColumn = "" + val tunerTrainSplitChronologicalRandomPercentage = 0.0 + val tunerSeed = 42L + val tunerFirstGenerationGenePool = 20 + val tunerNumberOfGenerations = 10 + val tunerNumberOfParentsToRetain = 3 + val tunerNumberOfMutationsPerGeneration = 10 + val tunerGeneticMixing = 0.7 + val tunerGenerationMutationStrategy = "linear" + val tunerFixedMutationValue = 1 + val tunerMutationMagnitudeMode = "fixed" + val tunerEvolutionStrategy = "batch" + val tunerGeneticMBORegressorType = "XGBoost" + val tunerGeneticMBOCandidateFactor = 10 + val 
tunerContinuousImprovementThreshold = -10 + val tunerContinuousEvolutionMaxIterations = 200 + val tunerContinuousEvolutionStoppingScore = 1.0 + val tunerContinuousEvolutionParallelism = 4 + val tunerContinuousEvolutionMutationAggressiveness = 3 + val tunerContinuousEvolutionGeneticMixing = 0.7 + val tunerContinuousEvolutionRollingImprovementCount = 20 + val tunerModelSeed = Map.empty[String, Any] + val tunerHyperSpaceInference = true + val tunerHyperSpaceInferenceCount = 200000 + val tunerHyperSpaceModelCount = 10 + val tunerHyperSpaceModelType = "RandomForest" + val tunerInitialGenerationMode = "random" + val tunerInitialGenerationPermutationCount = 10 + val tunerInitialGenerationIndexMixingMode = "linear" + val tunerInitialGenerationArraySeed = 42L + val tunerOutputDfRepartitionScaleFactor = 3 + val tunerDeltaCacheBackingDirectory = "dbfs:/mnt/automl/" + val tunerDeltaCacheBackingDirectoryRemovalFlag = true + val splitCachingStrategy = "persist" + + TunerConfig( + tunerAutoStoppingScore, + tunerParallelism, + tunerKFold, + tunerTrainPortion, + tunerTrainSplitMethod, + tunerKSampleSyntheticCol, + tunerKSampleKGroups, + tunerKSampleKMeansMaxIter, + tunerKSampleKMeansTolerance, + tunerKSampleKMeansDistanceMeasurement, + tunerKSampleKMeansSeed, + tunerKSampleKMeansPredictionCol, + tunerKSampleLSHHashTables, + tunerKSampleLSHSeed, + tunerKSampleLSHOutputCol, + tunerKSampleQuorumCount, + tunerKSampleMinimumVectorCountToMutate, + tunerKSampleVectorMutationMethod, + tunerKSampleMutationMode, + tunerKSampleMutationValue, + tunerKSampleLabelBalanceMode, + tunerKSampleCardinalityThreshold, + tunerKSampleNumericRatio, + tunerKSampleNumericTarget, + tunerTrainSplitChronologicalColumn, + tunerTrainSplitChronologicalRandomPercentage, + tunerSeed, + tunerFirstGenerationGenePool, + tunerNumberOfGenerations, + tunerNumberOfParentsToRetain, + tunerNumberOfMutationsPerGeneration, + tunerGeneticMixing, + tunerGenerationMutationStrategy, + tunerFixedMutationValue, + 
tunerMutationMagnitudeMode, + tunerEvolutionStrategy, + tunerGeneticMBORegressorType, + tunerGeneticMBOCandidateFactor, + tunerContinuousImprovementThreshold, + tunerContinuousEvolutionMaxIterations, + tunerContinuousEvolutionStoppingScore, + tunerContinuousEvolutionParallelism, + tunerContinuousEvolutionMutationAggressiveness, + tunerContinuousEvolutionGeneticMixing, + tunerContinuousEvolutionRollingImprovementCount, + tunerModelSeed, + tunerHyperSpaceInference, + tunerHyperSpaceInferenceCount, + tunerHyperSpaceModelCount, + tunerHyperSpaceModelType, + tunerInitialGenerationMode, + tunerInitialGenerationPermutationCount, + tunerInitialGenerationIndexMixingMode, + tunerInitialGenerationArraySeed, + tunerOutputDfRepartitionScaleFactor, + tunerDeltaCacheBackingDirectory, + tunerDeltaCacheBackingDirectoryRemovalFlag, + splitCachingStrategy + ) + } + + private[config] def loggingConfig(): LoggingConfig = { + val mlFlowLoggingFlag = true + val mlFlowLogArtifactsFlag = false + val mlFlowLoggingConfig = + InitDbUtils.getMlFlowLoggingConfig(mlFlowLoggingFlag) + val mlFlowLoggingMode = "full" + val mlFlowBestSuffix = "_best" + val inferenceSaveLocation = "dbfs:/tmp/automl/inference/" + val mlFlowCustomRunTags = Map[String, String]() + + LoggingConfig( + mlFlowLoggingFlag, + mlFlowLogArtifactsFlag, + mlFlowLoggingConfig.mlFlowTrackingURI, + mlFlowLoggingConfig.mlFlowExperimentName, + mlFlowLoggingConfig.mlFlowAPIToken, + mlFlowLoggingConfig.mlFlowModelSaveDirectory, + mlFlowLoggingMode, + mlFlowBestSuffix, + inferenceSaveLocation, + mlFlowCustomRunTags + ) + } + + private[config] def instanceConfig(modelFamily: String, + predictionType: String): InstanceConfig = { + val modelingType = predictionTypeEvaluator(predictionType) + val family = familyTypeEvaluator(modelFamily) + val modelType = modelTypeEvaluator(modelFamily, predictionType) + InstanceConfig( + modelFamily, + predictionType, + genericConfig(modelingType), + switchConfig(family), + featureEngineeringConfig(), + 
algorithmConfig(modelType), + tunerConfig(), + loggingConfig() + ) + } + + def getDefaultConfig(modelFamily: String, + predictionType: String): InstanceConfig = + instanceConfig(modelFamily, predictionType) + + private[config] def defaultConfigMap( + modelFamily: String, + predictionType: String + ): Map[String, Any] = { + + val genDef = genericConfig(predictionTypeEvaluator(predictionType)) + val switchDef = switchConfig(familyTypeEvaluator(modelFamily)) + val featDef = featureEngineeringConfig() + val algDef = algorithmConfig( + modelTypeEvaluator(modelFamily, predictionType) + ) + val tunerDef = tunerConfig() + + val logDef = loggingConfig() + + Map( + "labelCol" -> genDef.labelCol, + "featuresCol" -> genDef.featuresCol, + "dateTimeConversionType" -> genDef.dateTimeConversionType, + "fieldsToIgnoreInVector" -> genDef.fieldsToIgnoreInVector, + "scoringMetric" -> genDef.scoringMetric, + "scoringOptimizationStrategy" -> genDef.scoringOptimizationStrategy, + "dataPrepParallelism" -> featDef.dataPrepParallelism, + "naFillFlag" -> switchDef.naFillFlag, + "varianceFilterFlag" -> switchDef.varianceFilterFlag, + "outlierFilterFlag" -> switchDef.outlierFilterFlag, + "pearsonFilterFlag" -> switchDef.pearsonFilterFlag, + "covarianceFilterFlag" -> switchDef.covarianceFilterFlag, + "oneHotEncodeFlag" -> switchDef.oneHotEncodeFlag, + "scalingFlag" -> switchDef.scalingFlag, + "featureInteractionFlag" -> switchDef.featureInteractionFlag, + "dataPrepCachingFlag" -> switchDef.dataPrepCachingFlag, + "autoStoppingFlag" -> switchDef.autoStoppingFlag, + "pipelineDebugFlag" -> switchDef.pipelineDebugFlag, + "fillConfigNumericFillStat" -> featDef.numericFillStat, + "fillConfigCharacterFillStat" -> featDef.characterFillStat, + "fillConfigModelSelectionDistinctThreshold" -> featDef.modelSelectionDistinctThreshold, + "outlierFilterBounds" -> featDef.outlierFilterBounds, + "outlierLowerFilterNTile" -> featDef.outlierLowerFilterNTile, + "outlierUpperFilterNTile" -> 
featDef.outlierUpperFilterNTile, + "outlierFilterPrecision" -> featDef.outlierFilterPrecision, + "outlierContinuousDataThreshold" -> featDef.outlierContinuousDataThreshold, + "outlierFieldsToIgnore" -> featDef.outlierFieldsToIgnore, + "pearsonFilterStatistic" -> featDef.pearsonFilterStatistic, + "pearsonFilterDirection" -> featDef.pearsonFilterDirection, + "pearsonFilterManualValue" -> featDef.pearsonFilterManualValue, + "pearsonFilterMode" -> featDef.pearsonFilterMode, + "pearsonAutoFilterNTile" -> featDef.pearsonAutoFilterNTile, + "covarianceCutoffLow" -> featDef.covarianceCorrelationCutoffLow, + "covarianceCutoffHigh" -> featDef.covarianceCorrelationCutoffHigh, + "scalingType" -> featDef.scalingType, + "scalingMin" -> featDef.scalingMin, + "scalingMax" -> featDef.scalingMax, + "scalingStandardMeanFlag" -> featDef.scalingStandardMeanFlag, + "scalingStdDevFlag" -> featDef.scalingStdDevFlag, + "scalingPNorm" -> featDef.scalingPNorm, + "featureInteractionRetentionMode" -> featDef.featureInteractionRetentionMode, + "featureInteractionContinuousDiscretizerBucketCount" -> featDef.featureInteractionContinuousDiscretizerBucketCount, + "featureInteractionParallelism" -> featDef.featureInteractionParallelism, + "featureInteractionTargetInteractionPercentage" -> featDef.featureInteractionTargetInteractionPercentage, + "featureImportanceCutoffType" -> featDef.featureImportanceCutoffType, + "featureImportanceCutoffValue" -> featDef.featureImportanceCutoffValue, + "dataReductionFactor" -> featDef.dataReductionFactor, + "fillConfigCardinalitySwitch" -> featDef.cardinalitySwitch, + "fillConfigCardinalityType" -> featDef.cardinalityType, + "fillConfigCardinalityLimit" -> featDef.cardinalityLimit, + "fillConfigCardinalityPrecision" -> featDef.cardinalityPrecision, + "fillConfigCardinalityCheckMode" -> featDef.cardinalityCheckMode, + "fillConfigFilterPrecision" -> featDef.filterPrecision, + "fillConfigCategoricalNAFillMap" -> featDef.categoricalNAFillMap, + 
"fillConfigNumericNAFillMap" -> featDef.numericNAFillMap, + "fillConfigCharacterNABlanketFillValue" -> featDef.characterNABlanketFillValue, + "fillConfigNumericNABlanketFillValue" -> featDef.numericNABlanketFillValue, + "fillConfigNAFillMode" -> featDef.naFillMode, + "stringBoundaries" -> algDef.stringBoundaries, + "numericBoundaries" -> algDef.numericBoundaries, + "tunerAutoStoppingScore" -> tunerDef.tunerAutoStoppingScore, + "tunerParallelism" -> tunerDef.tunerParallelism, + "tunerKFold" -> tunerDef.tunerKFold, + "tunerTrainPortion" -> tunerDef.tunerTrainPortion, + "tunerTrainSplitMethod" -> tunerDef.tunerTrainSplitMethod, + "tunerKSampleSyntheticCol" -> tunerDef.tunerKSampleSyntheticCol, + "tunerKSampleKGroups" -> tunerDef.tunerKSampleKGroups, + "tunerKSampleKMeansMaxIter" -> tunerDef.tunerKSampleKMeansMaxIter, + "tunerKSampleKMeansTolerance" -> tunerDef.tunerKSampleKMeansTolerance, + "tunerKSampleKMeansDistanceMeasurement" -> tunerDef.tunerKSampleKMeansDistanceMeasurement, + "tunerKSampleKMeansSeed" -> tunerDef.tunerKSampleKMeansSeed, + "tunerKSampleKMeansPredictionCol" -> tunerDef.tunerKSampleKMeansPredictionCol, + "tunerKSampleLSHHashTables" -> tunerDef.tunerKSampleLSHHashTables, + "tunerKSampleLSHSeed" -> tunerDef.tunerKSampleLSHSeed, + "tunerKSampleLSHOutputCol" -> tunerDef.tunerKSampleLSHOutputCol, + "tunerKSampleQuorumCount" -> tunerDef.tunerKSampleQuorumCount, + "tunerKSampleMinimumVectorCountToMutate" -> tunerDef.tunerKSampleMinimumVectorCountToMutate, + "tunerKSampleVectorMutationMethod" -> tunerDef.tunerKSampleVectorMutationMethod, + "tunerKSampleMutationMode" -> tunerDef.tunerKSampleMutationMode, + "tunerKSampleMutationValue" -> tunerDef.tunerKSampleMutationValue, + "tunerKSampleLabelBalanceMode" -> tunerDef.tunerKSampleLabelBalanceMode, + "tunerKSampleCardinalityThreshold" -> tunerDef.tunerKSampleCardinalityThreshold, + "tunerKSampleNumericRatio" -> tunerDef.tunerKSampleNumericRatio, + "tunerKSampleNumericTarget" -> 
tunerDef.tunerKSampleNumericTarget, + "tunerTrainSplitChronologicalColumn" -> tunerDef.tunerTrainSplitChronologicalColumn, + "tunerTrainSplitChronologicalRandomPercentage" -> tunerDef.tunerTrainSplitChronologicalRandomPercentage, + "tunerSeed" -> tunerDef.tunerSeed, + "tunerFirstGenerationGenePool" -> tunerDef.tunerFirstGenerationGenePool, + "tunerNumberOfGenerations" -> tunerDef.tunerNumberOfGenerations, + "tunerNumberOfParentsToRetain" -> tunerDef.tunerNumberOfParentsToRetain, + "tunerNumberOfMutationsPerGeneration" -> tunerDef.tunerNumberOfMutationsPerGeneration, + "tunerGeneticMixing" -> tunerDef.tunerGeneticMixing, + "tunerGenerationalMutationStrategy" -> tunerDef.tunerGenerationalMutationStrategy, + "tunerFixedMutationValue" -> tunerDef.tunerFixedMutationValue, + "tunerMutationMagnitudeMode" -> tunerDef.tunerMutationMagnitudeMode, + "tunerEvolutionStrategy" -> tunerDef.tunerEvolutionStrategy, + "tunerGeneticMBORegressorType" -> tunerDef.tunerGeneticMBORegressorType, + "tunerGeneticMBOCandidateFactor" -> tunerDef.tunerGeneticMBOCandidateFactor, + "tunerContinuousEvolutionImprovementThreshold" -> tunerDef.tunerContinuousEvolutionImprovementThreshold, + "tunerContinuousEvolutionMaxIterations" -> tunerDef.tunerContinuousEvolutionMaxIterations, + "tunerContinuousEvolutionStoppingScore" -> tunerDef.tunerContinuousEvolutionStoppingScore, + "tunerContinuousEvolutionParallelism" -> tunerDef.tunerContinuousEvolutionParallelism, + "tunerContinuousEvolutionMutationAggressiveness" -> tunerDef.tunerContinuousEvolutionMutationAggressiveness, + "tunerContinuousEvolutionGeneticMixing" -> tunerDef.tunerContinuousEvolutionGeneticMixing, + "tunerContinuousEvolutionRollingImprovementCount" -> tunerDef.tunerContinuousEvolutionRollingImprovingCount, + "tunerModelSeed" -> tunerDef.tunerModelSeed, + "tunerHyperSpaceInferenceFlag" -> tunerDef.tunerHyperSpaceInference, + "tunerHyperSpaceInferenceCount" -> tunerDef.tunerHyperSpaceInferenceCount, + "tunerHyperSpaceModelCount" -> 
tunerDef.tunerHyperSpaceModelCount, + "tunerHyperSpaceModelType" -> tunerDef.tunerHyperSpaceModelType, + "tunerInitialGenerationMode" -> tunerDef.tunerInitialGenerationMode, + "tunerInitialGenerationPermutationCount" -> tunerDef.tunerInitialGenerationPermutationCount, + "tunerInitialGenerationIndexMixingMode" -> tunerDef.tunerInitialGenerationIndexMixingMode, + "tunerInitialGenerationArraySeed" -> tunerDef.tunerInitialGenerationArraySeed, + "tunerOutputDfRepartitionScaleFactor" -> tunerDef.tunerOutputDfRepartitionScaleFactor, + "mlFlowLoggingFlag" -> logDef.mlFlowLoggingFlag, + "mlFlowLogArtifactsFlag" -> logDef.mlFlowLogArtifactsFlag, + "mlFlowTrackingURI" -> logDef.mlFlowTrackingURI, + "mlFlowExperimentName" -> logDef.mlFlowExperimentName, + "mlFlowAPIToken" -> logDef.mlFlowAPIToken, + "mlFlowModelSaveDirectory" -> logDef.mlFlowModelSaveDirectory, + "mlFlowLoggingMode" -> logDef.mlFlowLoggingMode, + "mlFlowBestSuffix" -> logDef.mlFlowBestSuffix, + "inferenceConfigSaveLocation" -> logDef.inferenceConfigSaveLocation, + "mlFlowCustomRunTags" -> logDef.mlFlowCustomRunTags, + "tunerDeltaCacheBackingDirectory" -> tunerDef.tunerDeltaCacheBackingDirectory, + "tunerDeltaCacheBackingDirectoryRemovalFlag" -> tunerDef.tunerDeltaCacheBackingDirectoryRemovalFlag, + "splitCachingStrategy" -> tunerDef.splitCachingStrategy + ) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationGenerator.scala b/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationGenerator.scala new file mode 100644 index 00000000..823e9646 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationGenerator.scala @@ -0,0 +1,3676 @@ +package com.databricks.labs.automl.executor.config + +import com.databricks.labs.automl.exploration.structures.FeatureImportanceConfig +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.pipeline.PipelineStateCache +import com.fasterxml.jackson.databind.ObjectMapper 
+import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.json4s.jackson.Serialization +import org.json4s.jackson.Serialization.{read, writePretty} +import org.json4s.{Formats, FullTypeHints, NoTypeHints} +import org.json4s.jackson.JsonMethods + +import scala.collection.mutable.ListBuffer + +/** + * @constructor Generate a configuration InstanceConfig for a given prediction type (either regressor or classifier) + * @author Ben Wilson, Databricks + * @param predictionType either 'regressor' or 'classifier', depending on the type of supervised ML needed for the task + */ +class GenericConfigGenerator(predictionType: String) + extends ConfigurationDefaults { + + import PredictionType._ + + private val familyType: PredictionType = predictionTypeEvaluator( + predictionType + ) + + private var _genericConfig = genericConfig(familyType) + + /** + * Setter + * + * @param value name of the Label column for the supervised learning task + */ + def setLabelCol(value: String): this.type = { + _genericConfig.labelCol = value + this + } + + /** + * Setter + * + * @param value name of the feature vector to be used throughout the modeling process. + */ + def setFeaturesCol(value: String): this.type = { + _genericConfig.featuresCol = value + this + } + + /** + * Setter + * + * @param value type of data to convert a datetime field to allowable values: + * "unix" - converts to a LongType for the number of milliseconds since Jan 1, 1970 + * "split" - converts the aspects of the date into representative columns -> + * Year, Month, Day, Hour, Minute, Second + * @throws IllegalArgumentException() if an invalid entry is made. 
+ */ + @throws(classOf[IllegalArgumentException]) + def setDateTimeConversionType(value: String): this.type = { + validateMembership( + value, + allowableDateTimeConversionTypes, + "DateTimeConversionType" + ) + _genericConfig.dateTimeConversionType = value + this + } + + /** + * Setter + * + * @param value Collection (Array) of fields that will be ignored throughout modeling and will not be included + * in feature vector operations. + */ + def setFieldsToIgnoreInVector(value: Array[String]): this.type = { + _genericConfig.fieldsToIgnoreInVector = value + this + } + + /** + * Setter + * + * @param value Metric to be used to determine the 'best of' within generations of mutation. + * Allowable values for regressor: List("rmse", "mse", "r2", "mae") + * Allowable values for classifier: List("f1", "weightedPrecision", "weightedRecall", + * "accuracy", "areaUnderPR", "areaUnderROC") + * @throws IllegalArgumentException() if an invalid entry is made. + */ + @throws(classOf[IllegalArgumentException]) + def setScoringMetric(value: String): this.type = { + + val adjusted_value = value.toLowerCase + val matched_value = adjusted_value match { + case "f1" => "f1" + case "weightedprecision" => "weightedPrecision" + case "weightedrecall" => "weightedRecall" + case "accuracy" => "accuracy" + case "areaunderpr" => "areaUnderPR" + case "areaunderroc" => "areaUnderROC" + case "rmse" => "rmse" + case "mse" => "mse" + case "r2" => "r2" + case "mae" => "mae" + case _ => + throw new IllegalArgumentException( + s"Supplied Scoring Metric '${value}' is not supported. 
" + + s"Must be one of: f1, weightedPrecision, weightedRecall, accuracy, areaUnderPR, areaUnderROC, rmse, mse, r2, mae." + ) + } + + familyType match { + case Regressor => + validateMembership( + matched_value, + allowableRegressionScoringMetrics, + s"$predictionType Scoring Metric" + ) + case Classifier => + validateMembership( + matched_value, + allowableClassificationScoringMetrics, + s"$predictionType Scoring Metric" + ) + } + _genericConfig.scoringMetric = matched_value + this + } + + /** + * Setter + * + * @param value Direction of optimization. Options:
+ * 'maximize' - will sort returned scores in descending order and take the top(n)
+ * 'minimize' - will sort returned scores in ascending order and take the top(n) + * @throws IllegalArgumentException if an invalid entry is made. + */ + @throws(classOf[IllegalArgumentException]) + def setScoringOptimizationStrategy(value: String): this.type = { + validateMembership( + value, + allowableScoringOptimizationStrategies, + "ScoringOptimizationStrategy" + ) + _genericConfig.scoringOptimizationStrategy = value + this + } + + /** + * Setter
+ * + * Aids in creating multiple instances of a Generic Config (useful for Feature Importance usages) + * + * @param value an Instance of a GenericConfig Object + */ + def setConfig(value: GenericConfig): this.type = { + _genericConfig = value + this + } + + /** + * Getter + * + * @return Currently assigned name of the label column for modeling. + */ + def getLabelCol: String = _genericConfig.labelCol + + /** + * Getter + * + * @return Currently assigned name of the feature column for the modeling vector. + */ + def getFeaturesCol: String = _genericConfig.featuresCol + + /** + * Getter + * + * @return Currently assigned setting for the datetime column conversion methodology. + */ + def getDateTimeConversionType: String = _genericConfig.dateTimeConversionType + + /** + * Getter + * + * @return A collection (default Empty Array) of fields that are to be ignored for the purposes of modeling. + */ + def getFieldsToIgnoreInVector: Array[String] = + _genericConfig.fieldsToIgnoreInVector + + /** + * Getter + * + * @return Currently assigned setting for the metric to be used for determining quality of models for subsequent + * optimization generations / iterations. + */ + def getScoringMetric: String = _genericConfig.scoringMetric + + /** + * Getter + * + * @return Currently assigned setting for the direction of sorting for the provided scoringMetric value + * (either 'minimize' or 'maximize') + */ + def getScoringOptimizationStrategy: String = + _genericConfig.scoringOptimizationStrategy + + /** + * Main Method accessor to return the GenericConfig current state. + * + * @return :GenericConfig type objects of the results of setter usage. + */ + def getConfig: GenericConfig = _genericConfig + +} + +object GenericConfigGenerator { + + /** + * Companion object apply generator + * + * @param predictionType the type of modeling desired: 'regressor' or 'classifier' + * @return Instance of the GenericConfigGenerator with defaults applied. 
+ */ + def apply(predictionType: String): GenericConfigGenerator = + new GenericConfigGenerator(predictionType) + + /** + * Helper method that allows for default settings for a classifier to be used and generated + * + * @example ``` + * val defaultClassifierGenericConfig = GenericConfigGenerator.generateDefaultClassifierConfig + * ``` + * @return GenericConfig Object, setup for classifiers. + * + */ + def generateDefaultClassifierConfig: GenericConfig = + new GenericConfigGenerator("classifier").getConfig + + /** + * Helper method that allows for default settings for a regressor to be used and generated + * + * @example ``` + * val defaultRegressorGenericConfig = GenericConfigGenerator.generateDefaultRegressorConfig + * ``` + * @return GenericConfig Object, setup for regressors. + */ + def generateDefaultRegressorConfig: GenericConfig = + new GenericConfigGenerator("regressor").getConfig + +} + +/** + * Main Configuration Generator utility class, used for generating a modeling configuration to execute the autoML + * framework. + * + * @since 0.5 + * @author Ben Wilson, Databricks + * @param modelFamily The model family that is desired to be run (e.g. 'RandomForest') + * Allowable Options: + * "Trees", "GBT", "RandomForest", "LinearRegression", "LogisticRegression", "XGBoost", "MLPC", + * "SVM" + * @param predictionType The modeling type that is desired to be run (e.g.
'classifier') + * Allowable Options: + * "classifier" or "regressor" + * @param genericConfig Configuration object from GenericConfigGenerator + */ +class ConfigurationGenerator(modelFamily: String, + predictionType: String, + var genericConfig: GenericConfig) + extends ConfigurationDefaults { + + import FamilyValidator._ + import ModelSelector._ + + private val modelType: ModelSelector = + modelTypeEvaluator(modelFamily, predictionType) + private val family: FamilyValidator = familyTypeEvaluator(modelFamily) + + // Default configuration generation + + private var _instanceConfig = instanceConfig(modelFamily, predictionType) + + _instanceConfig.genericConfig = genericConfig + + /** + * Helper method for copying a pre-defined InstanceConfig to a new instance. + * + * @param value InstanceConfig object + */ + def setConfig(value: InstanceConfig): this.type = { + _instanceConfig = value + this + } + + //Switch Config + /** + * Boolean switch for turning on naFill actions + * + * @note Default: On + * @note HIGHLY RECOMMENDED TO LEAVE ON. 
+ */ + def naFillOn(): this.type = { + _instanceConfig.switchConfig.naFillFlag = true + this + } + + /** + * Boolean switch for turning off naFill actions + * + * @note Default: On + * @note HIGHLY RECOMMENDED TO NOT TURN OFF + */ + def naFillOff(): this.type = { + _instanceConfig.switchConfig.naFillFlag = false + this + } + + /** + * Boolean switch for setting the state of naFillFlag + * + * @param value Boolean + * (whether to execute filling of na values on the DataFrame's non-ignored fields) + */ + def setNaFillFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.naFillFlag = value + this + } + + /** + * Boolean switch for turning variance filtering on + * + * @note Default: On + */ + def varianceFilterOn(): this.type = { + _instanceConfig.switchConfig.varianceFilterFlag = true + this + } + + /** + * Boolean switch for turning variance filtering off + * + * @note Default: On + */ + def varianceFilterOff(): this.type = { + _instanceConfig.switchConfig.varianceFilterFlag = false + this + } + + /** + * Boolean switch for setting the state of varianceFilterFlag + * + * @param value Boolean + * (whether or not to filter out fields from the feature vector that all have the same value) + */ + def setVarianceFilterFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.varianceFilterFlag = value + this + } + + /** + * Boolean switch for turning outlier filtering on + * + * @note Default: Off + */ + def outlierFilterOn(): this.type = { + _instanceConfig.switchConfig.outlierFilterFlag = true + this + } + + /** + * Boolean switch for turning outlier filtering off + * + * @note Default: Off + */ + def outlierFilterOff(): this.type = { + _instanceConfig.switchConfig.outlierFilterFlag = false + this + } + + /** + * Boolean switch for setting the state of outlierFilterFlag + * + * @param value Boolean + */ + def setOutlierFilterFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.outlierFilterFlag = value + this + } + + /** + * Boolean 
switch for turning Pearson filtering on + * + * @note Default: Off + */ + def pearsonFilterOn(): this.type = { + _instanceConfig.switchConfig.pearsonFilterFlag = true + this + } + + /** + * Boolean switch for turning Pearson filtering off + * + * @note Default: Off + */ + def pearsonFilterOff(): this.type = { + _instanceConfig.switchConfig.pearsonFilterFlag = false + this + } + + /** + * Boolean switch for setting the state of pearsonFilterFlag + * + * @param value Boolean + */ + def setPearsonFilterFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.pearsonFilterFlag = value + this + } + + /** + * Boolean switch for turning Covariance filtering on + * + * @note Default: Off + */ + def covarianceFilterOn(): this.type = { + _instanceConfig.switchConfig.covarianceFilterFlag = true + this + } + + /** + * Boolean switch for turning Covariance filtering off + * + * @note Default: Off + */ + def covarianceFilterOff(): this.type = { + _instanceConfig.switchConfig.covarianceFilterFlag = false + this + } + + /** + * Boolean switch for setting the state of covarianceFilterFlag + * + * @param value Boolean + */ + def setCovarianceFilterFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.covarianceFilterFlag = value + this + } + + /** + * Boolean switch for turning One Hot Encoding of string and character features on + * + * @note Default: Off for Tree based algorithms, On for all others. + * @note Turning One Hot Encoding on for a tree-based algorithm (XGBoost, RandomForest, Trees, GBT) is not + * recommended. Introducing synthetic dummy variables in a tree algorithm will force the creation of + * sparse tree splits. + * @see See [[https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769]] + * for a full explanation. + */ + def oneHotEncodeOn(): this.type = { + family match { + case Trees => + println( + "WARNING! OneHotEncoding set on a trees algorithm will likely create a poor model. 
" + + "Proceed at your own risk!" + ) + case _ => None + } + _instanceConfig.switchConfig.oneHotEncodeFlag = true + this + } + + /** + * Boolean switch for turning off One Hot Encoding + * + * @note Default: Off for Tree based algorithms, On for all others. + */ + def oneHotEncodeOff(): this.type = { + _instanceConfig.switchConfig.oneHotEncodeFlag = false + this + } + + /** + * Boolean switch for setting the state of oneHotEncodeFlag + * + * @param value Boolean + */ + def setOneHotEncodeFlag(value: Boolean): this.type = { + if (value) oneHotEncodeOn() + else oneHotEncodeOff() + this + } + + /** + * Boolean switch for turning scaling On + * + * @note Default: Off for Tree based algorithms, On for all others. + * @note For Tree based algorithms (RandomForest, XGBoost, GBT, Trees), it is not necessary (and can adversely + * affect the model performance) that this be turned on. + */ + def scalingOn(): this.type = { + _instanceConfig.switchConfig.scalingFlag = true + this + } + + /** + * Boolean switch for turning scaling Off + * + * @note Default: Off for Tree based algorithms, On for all others. + */ + def scalingOff(): this.type = { + _instanceConfig.switchConfig.scalingFlag = false + this + } + + /** + * Boolean switch for setting the state of the scalingFlag + * + * @param value Boolean + */ + def setScalingFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.scalingFlag = value + this + } + + /** + * Boolean switch for setting featureInteraction on. + * This setting will, in conjunction with the settings for featureInteraction elements in the config, + * perform pair-wise product interactions of all elements of the feature vector, retaining either all or some + * of those interactions for inclusion to the feature vector. + * For classification tasks, InformationGain is used as the metric to compare inclusion (for modes other than 'all') + * For regression tasks, Variance is used as the metric. 
+ * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def featureInteractionOn(): this.type = { + _instanceConfig.switchConfig.featureInteractionFlag = true + this + } + + /** + * Boolean switch for turning featureInteraction off + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def featureInteractionOff(): this.type = { + _instanceConfig.switchConfig.featureInteractionFlag = false + this + } + + /** + * Setter for defining the state of the featureInteractionFlag + * @param value Boolean on/off + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def setFeatureInteractionFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.featureInteractionFlag = value + this + } + + /** + * Boolean switch for setting the Data Prep Caching On + * + * @note Default: On + * @note Depending on the size and partitioning of the data set, caching may or may not improve performance. + */ + def dataPrepCachingOn(): this.type = { + _instanceConfig.switchConfig.dataPrepCachingFlag = true + this + } + + /** + * Boolean switch for setting the Data Prep Caching Off + * + * @note Default: On + * @note Depending on the size and partitioning of the data set, caching may or may not improve performance. + */ + def dataPrepCachingOff(): this.type = { + _instanceConfig.switchConfig.dataPrepCachingFlag = false + this + } + + /** + * Boolean switch for setting the state of DataPrepCachingFlag + * + * @param value Boolean + */ + def setDataPrepCachingFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.dataPrepCachingFlag = value + this + } + + /** + * Boolean switch for setting Auto Stopping On + * + * @note Default: Off + * @note Early stopping will invalidate the progress measurement system (due to non-determinism) + * Early termination will not occur immediately. Futures objects already committed will continue to run, but + * no new actions will be enqueued when a stopping criteria is met. 
+ */ + def autoStoppingOn(): this.type = { + _instanceConfig.switchConfig.autoStoppingFlag = true + this + } + + /** + * Boolean switch for setting Auto Stopping Off + * + * @note Default: Off + */ + def autoStoppingOff(): this.type = { + _instanceConfig.switchConfig.autoStoppingFlag = false + this + } + + /** + * Boolean switch for setting the state of autoStoppingFlag + * + * @param value Boolean + */ + def setAutoStoppingFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.autoStoppingFlag = value + this + } + + def setPipelineDebugFlag(value: Boolean): this.type = { + _instanceConfig.switchConfig.pipelineDebugFlag = value + this + } + + def pipelineDebugFlagOn(value: Boolean): this.type = { + _instanceConfig.switchConfig.pipelineDebugFlag = true + this + } + + def pipelineDebugFlagOff(value: Boolean): this.type = { + _instanceConfig.switchConfig.pipelineDebugFlag = false + this + } + + // Feature Engineering Config + + /** + * Setter for defining the number of concurrent threads allocated to performing asynchronous data prep tasks within + * the feature engineering aspect of this application. + * @param value Int: A value that must be greater than zero. + * @note This value has an upper limit, depending on driver size, that will restrict the efficacy of the asynchronous + * tasks within the pool. Setting this too high may cause cluster instability. + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if a value less than or equal to zero is supplied. + */ + @throws(classOf[IllegalArgumentException]) + def setDataPrepParallelism(value: Int): this.type = { + + require(value > 0, s"DataPrepParallelism must be greater than zero.") + _instanceConfig.featureEngineeringConfig.dataPrepParallelism = value + this + } + + /** + * Setter + * Specifies the behavior of the naFill algorithm for numeric (continuous) fields.
+ * Values that are generated as potential fill candidates are set according to the available statistics that are + * calculated from a df.summary() method.
+ * Available options are:
+ * "min", "25p", "mean", "median", "75p", or "max" + * + * @param value String: member of allowable list. + * @note Default: "mean" + * @throws IllegalArgumentException if an invalid entry is made. + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigNumericFillStat(value: String): this.type = { + validateMembership( + value, + allowableNumericFillStats, + "FillConfigNumericFillStat" + ) + _instanceConfig.featureEngineeringConfig.numericFillStat = value + this + } + + /** + * Setter + * Specifies the behavior of the naFill algorithm for character (String, Char, Boolean, Byte, etc.) fields. + * Generated through a df.summary() method
+ * Available options are:
+ * "min" (least frequently occurring value)
+ * or
+ * "max" (most frequently occurring value) + * + * @param value String: member of allowable list + * @note Default: "max" + * @throws IllegalArgumentException if an invalid entry is made. + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigCharacterFillStat(value: String): this.type = { + validateMembership( + value, + allowableCharacterFillStats, + "FillConfigCharacterFillStat" + ) + _instanceConfig.featureEngineeringConfig.characterFillStat = value + this + } + + /** + * Setter
+ * The threshold value that is used to detect, based on the supplied labelCol, the cardinality of the label through + * a .distinct().count() being issued to the label column. Values from this cardinality determination that are + * above this setter's value will be considered to be a Regression Task, those below will be considered a + * Classification Task. + * + * @note In the case of exceptions being thrown for incorrect type (detected a classifier, but intended usage is for + * a regression), lower this value. Conversely, if a classification problem has a significant number of + * classes, above the default threshold of this setting (50), increase this value. + * @param value Int: Threshold value for the labelCol cardinality check. Values above this setting will be + * determined to be a regression task; below to be a classification task. + * @note Default: 50 + */ + @deprecated( + "This setter, and the logic underlying it for automatically detecting modeling type, will be removed " + + "in future versions, as it is now required to be specified for utilizing a ConfigurationGenerator Object." + ) + def setFillConfigModelSelectionDistinctThreshold(value: Int): this.type = { + _instanceConfig.featureEngineeringConfig.modelSelectionDistinctThreshold = + value + this + } + + /** + * Setter + *

Configures the tails of a distribution to filter out, along with the ntile settings defined in: + * .setOutlierLowerFilterNTile() and/or .setOutlierUpperFilterNTile() + *

Available Modes:
+ * "lower" -> filters out rows from the data that are below the value set in + * ```.setOutlierLowerFilterNTile()```
+ * "upper" -> filters out rows from the data that are above the value set in + * ```.setOutlierUpperFilterNTile()```
+ * "both" -> two-tailed filter that combines both an "upper" and "lower" filter.
+ * + *

+ *

+ * + * @param value String: Tailed direction setting for outlier filtering. + * @note Default: "both" + * @note This filter action is disabled by default. Before enabling, please ensure the fields to be filtered are + * adequately reflected in the ```.setOutlierFieldsToIgnore()``` inverse selection, as well as verifying the + * general distribution of the fields that have outlier data in order to select an appropriate NTile value. + * This feature should only be used in rare instances, and the impacts that this + * filter may have should be fully understood before enabling it. + */ + def setOutlierFilterBounds(value: String): this.type = { + validateMembership( + value, + allowableOutlierFilterBounds, + "OutlierFilterBounds" + ) + _instanceConfig.featureEngineeringConfig.outlierFilterBounds = value + this + } + + /** + * Setter
+ * Defines the NTILE value of the distributions of feature fields below which rows will + * be filtered from the data. + * + * @param value Double: Lower Threshold boundary NTILE for Outlier Filtering + * @note Only used if Outlier filtering is set to 'On' and Filter Direction is either 'both' or 'lower' + * @throws IllegalArgumentException if the value supplied is outside of the Range(0.0,1.0) + */ + @throws(classOf[IllegalArgumentException]) + def setOutlierLowerFilterNTile(value: Double): this.type = { + zeroToOneValidation(value, "OutlierLowerFilterNTile") + _instanceConfig.featureEngineeringConfig.outlierLowerFilterNTile = value + this + } + + /** + * Setter
+ * Defines the NTILE value of the distributions of feature fields above which rows will + * be filtered from the data. + * + * @param value Double: Upper Threshold boundary NTILE value for Outlier Filtering + * @note Only used if Outlier filtering is set to 'On' and Filter Direction is either 'both' or 'upper' + * @throws IllegalArgumentException if the value supplied is outside of the Range(0.0,1.0) + */ + @throws(classOf[IllegalArgumentException]) + def setOutlierUpperFilterNTile(value: Double): this.type = { + zeroToOneValidation(value, "OutlierUpperFilterNTile") + _instanceConfig.featureEngineeringConfig.outlierUpperFilterNTile = value + this + } + + /** + * Setter
+ * Defines the precision (RSD) in which each field's cardinality is calculated through the use of + * ```approx_count_distinct``` SparkSQL function. Lower values specify higher accuracy, but consume + * more computational resources. + * + * @param value Double: In range of 0.0, 1.0 + * @note A Value of 0.0 will be an exact computation of distinct values. Therefore, all data must be shuffled, + * which is an expensive task. + * @see [[https://en.wikipedia.org/wiki/Coefficient_of_variation]] for explanation of RSD + * @throws IllegalArgumentException if the value supplied is outside of the Range(0.0, 1.0) + */ + @throws(classOf[IllegalArgumentException]) + def setOutlierFilterPrecision(value: Double): this.type = { + zeroToOneValidation(value, "OutlierFilterPrecision") + if (value == 0.0) + println( + "Warning! Precision of 0 is an exact calculation of quantiles and may not be performant!" + ) + _instanceConfig.featureEngineeringConfig.outlierFilterPrecision = value + this + } + + /** + * Setter
+ * Defines the determination of whether to classify a numeric field as ordinal (categorical) or + * continuous. + * + * @param value Int: Threshold for distinct counts within a numeric feature field. + * @note Continuous data fields are eligible for outlier filtering. Categorical fields are not, and if below + * cardinality thresholds set by this value setter, those fields will be ignored by the filtering action. + */ + def setOutlierContinuousDataThreshold(value: Int): this.type = { + if (value < 50) + println( + "Warning! Values less than 50 may indicate ordinal (categorical numeric) data!" + ) + _instanceConfig.featureEngineeringConfig.outlierContinuousDataThreshold = + value + this + } + + /** + * Setter
+ * Defines an Array of fields to be ignored from outlier filtering. + * + * @param value Array[String]: field names to be ignored from outlier filtering. + */ + def setOutlierFieldsToIgnore(value: Array[String]): this.type = { + _instanceConfig.featureEngineeringConfig.outlierFieldsToIgnore = value + this + } + + /** + * Setter
+ * Selection for filter statistic to be used in Pearson Filtering.
+ * Available modes: "pvalue", "degreesFreedom", or "pearsonStat" + * + * @note Default: pearsonStat + * @param value String: one of available modes. + * @throws IllegalArgumentException if the value provided is not in available modes list. + */ + @throws(classOf[IllegalArgumentException]) + def setPearsonFilterStatistic(value: String): this.type = { + validateMembership( + value, + allowablePearsonFilterStats, + "PearsonFilterStatistic" + ) + _instanceConfig.featureEngineeringConfig.pearsonFilterStatistic = value + this + } + + /** + * Setter
+ * Controls which direction of correlation values to filter out. Allowable modes:
+ * "greater" or "lesser" + * + * @note Default: greater + * @param value String: one of available modes + * @throws IllegalArgumentException if the value provided is not in available modes list. + */ + @throws(classOf[IllegalArgumentException]) + def setPearsonFilterDirection(value: String): this.type = { + validateMembership( + value, + allowablePearsonFilterDirections, + "PearsonFilterDirection" + ) + _instanceConfig.featureEngineeringConfig.pearsonFilterDirection = value + this + } + + /** + * Setter
+ * Controls the Pearson manual filter value, if the PearsonFilterMode is set to "manual"
+ * + * @example with .setPearsonFilterMode("manual") and .setPearsonFilterDirection("greater")
+ * fields that have a Pearson correlation coefficient result above this
+ * value will be dropped from modeling runs. + * @param value Double: A cut-off point. Fields whose correlation statistic is above or below it + * (depending on the PearsonFilterDirection setting) will be culled from the feature vector. + */ + def setPearsonFilterManualValue(value: Double): this.type = { + _instanceConfig.featureEngineeringConfig.pearsonFilterManualValue = value + this + } + + /** + * Setter
+ * Controls whether to use "auto" mode (using the PearsonAutoFilterNTile) or "manual" mode (using the
+ * PearsonFilterManualValue) to cull fields from the feature vector. + * + * @param value String: either "auto" or "manual" + * @note Default: "auto" + * @throws IllegalArgumentException if the value provided is not in available modes list (auto and manual) + */ + @throws(classOf[IllegalArgumentException]) + def setPearsonFilterMode(value: String): this.type = { + validateMembership(value, allowablePearsonFilterModes, "PearsonFilterMode") + _instanceConfig.featureEngineeringConfig.pearsonFilterMode = value + this + } + + /** + * Setter
+ * Provides the ntile threshold above or below which (depending on PearsonFilterDirection setting) fields will
+ * be removed, depending on the distribution of pearson statistics from all feature columns. + * + * @note WARNING - this feature is ONLY recommended to be used for exploratory development work. + * @note Default: 0.75 (Q3) + * @param value Double: In range of (0.0, 1.0) + * @throws IllegalArgumentException if the value provided is outside of the range of (0.0, 1.0) + */ + @throws(classOf[IllegalArgumentException]) + def setPearsonAutoFilterNTile(value: Double): this.type = { + zeroToOneValidation(value, "PearsonAutoFilterNTile") + _instanceConfig.featureEngineeringConfig.pearsonAutoFilterNTile = value + this + } + + /** + * Setter
+ * Covariance Cutoff for specifying the feature-to-feature correlation statistic lower cutoff boundary + * + * @example For feature columns A, B, and C, if A->B is 0.02, A->C is 0.1, B->C is 0.85, with a value set of 0.05, + *
Column A would be removed from the feature vector for having a low value of the correlation + * statistic. + * @param value Double: Threshold Cutoff Value + * @note Default: -0.99 + * @note WARNING This setting is not recommended to be used in a production use case and is only potentially + * useful for data exploration and experimentation. + * @note WARNING the lower threshold boundary for correlation is less frequently used. Filtering of auto-correlated + * features is done primarily through .setCovarianceCutoffHigh values lower than the default of 0.99 + * @throws IllegalArgumentException if the value is <= -1.0 + */ + @throws(classOf[IllegalArgumentException]) + def setCovarianceCutoffLow(value: Double): this.type = { + require( + value > -1.0, + s"Covariance Cutoff Low value $value is outside of allowable range. Value must be " + + s"greater than -1.0." + ) + _instanceConfig.featureEngineeringConfig.covarianceCorrelationCutoffLow = + value + this + } + + /** + * Setter
+ * Covariance Cutoff for specifying the feature-to-feature correlation statistic upper cutoff boundary + * + * @example For feature columns A, B, and C, if A<->B is 0.02, A<->C is 0.1, B<->C is 0.85, with a value set of 0.8, + *
Column C would be removed from the feature vector for having a high value of the correlation + * statistic. + * @param value Double: Threshold Cutoff Value + * @note Default: 0.99 + * @note WARNING This setting is not recommended to be used in a production use case and is only potentially + * useful for data exploration and experimentation. + * @throws IllegalArgumentException if the value is >= 1.0 + */ + @throws(classOf[IllegalArgumentException]) + def setCovarianceCutoffHigh(value: Double): this.type = { + require( + value < 1.0, + s"Covariance Cutoff High value $value is outside of allowable range. Value must be " + + s"less than 1.0." + ) + _instanceConfig.featureEngineeringConfig.covarianceCorrelationCutoffHigh = + value + this + } + + def setScalingType(value: String): this.type = { + validateMembership(value, allowableScalers, "ScalingType") + _instanceConfig.featureEngineeringConfig.scalingType = value + this + } + + def setScalingMin(value: Double): this.type = { + _instanceConfig.featureEngineeringConfig.scalingMin = value + this + } + + def setScalingMax(value: Double): this.type = { + _instanceConfig.featureEngineeringConfig.scalingMax = value + this + } + + def setScalingStandardMeanFlagOn(): this.type = { + _instanceConfig.featureEngineeringConfig.scalingStandardMeanFlag = true + this + } + + def setScalingStandardMeanFlagOff(): this.type = { + _instanceConfig.featureEngineeringConfig.scalingStandardMeanFlag = false + this + } + + def setScalingStandardMeanFlag(value: Boolean): this.type = { + _instanceConfig.featureEngineeringConfig.scalingStandardMeanFlag = value + this + } + + def setScalingStdDevFlagOn(): this.type = { + _instanceConfig.featureEngineeringConfig.scalingStdDevFlag = true + this + } + + def setScalingStdDevFlagOff(): this.type = { + _instanceConfig.featureEngineeringConfig.scalingStdDevFlag = false + this + } + + def setScalingStdDevFlag(value: Boolean): this.type = { + _instanceConfig.featureEngineeringConfig.scalingStdDevFlag = 
value + this + } + + def setScalingPNorm(value: Double): this.type = { + require( + value >= 1.0, + s"pNorm value: $value is invalid. Value must be greater than or equal to 1.0." + ) + _instanceConfig.featureEngineeringConfig.scalingPNorm = value + this + } + + /** + * Setter for determining the mode of operation for inclusion of interacted features. + * Modes are: + * - all -> Includes all interactions between all features (after string indexing of categorical values) + * - optimistic -> If the Information Gain / Variance, as compared to at least ONE of the parents of the interaction + * is above the threshold set by featureInteractionTargetInteractionPercentage + * (e.g. if IG of left parent is 0.5 and right parent is 0.9, with threshold set at 10, if the interaction + * between these two parents has an IG of 0.42, it would be rejected, but if it was 0.46, it would be kept) + * - strict -> the threshold percentage must be met for BOTH parents. + * (in the above example, the IG for the interaction would have to be > 0.81 in order to be included in + * the feature vector). + * @param value String -> one of: 'all', 'optimistic', or 'strict' + * @throws IllegalArgumentException if the specified value submitted is not permitted + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFeatureInteractionRetentionMode(value: String): this.type = { + validateMembership( + value, + allowableFeatureInteractionModes, + "featureInteractionRetentionMode" + ) + _instanceConfig.featureEngineeringConfig.featureInteractionRetentionMode = + value + this + } + + /** + * Setter for determining the behavior of continuous feature columns. In order to calculate Entropy for a continuous + * variable, the distribution must be converted to nominal values for estimation of per-split information gain. 
+ * This setting defines how many nominal categorical values to create out of a continuously distributed feature + * in order to calculate Entropy. + * @param value Int -> must be greater than 1 + * @throws IllegalArgumentException if the value specified is <= 1 + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def setFeatureInteractionContinuousDiscretizerBucketCount( + value: Int + ): this.type = { + require( + value > 1, + s"FeatureInteractionContinuousDiscretizerBucketCount must be greater than 1." + ) + _instanceConfig.featureEngineeringConfig.featureInteractionContinuousDiscretizerBucketCount = + value + this + } + + /** + * Setter for configuring the concurrent count for scoring of feature interaction candidates. + * Due to the nature of these operations, the configuration here may need to be set differently to that of + * the modeling and general feature engineering phases of the toolkit. This is highly dependent on the row + * count of the data set being submitted. + * @param value Int -> must be greater than 0 + * @since 0.6.2 + * @author Ben Wilson, Databricks + * @throws IllegalArgumentException if the value is < 1 + */ + @throws(classOf[IllegalArgumentException]) + def setFeatureInteractionParallelism(value: Int): this.type = { + require( + value >= 1, + s"FeatureInteractionParallelism must be set to a value >= 1." + ) + _instanceConfig.featureEngineeringConfig.featureInteractionParallelism = + value + this + } + + /** + * Setter for establishing the minimum acceptable InformationGain or Variance allowed for an interaction + * candidate based on comparison to the scores of its parents. 
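+    * A minimal worked sketch of that comparison, inferred from the example given for
+    * setFeatureInteractionRetentionMode (threshold of 10, parent scores of 0.5 and 0.9).
+    * The formula is an assumption drawn from those numbers, not the toolkit's confirmed internals:
+    * {{{
+    * val thresholdPercent = 10.0
+    * // Hypothetical helper: lowest interaction score that still passes against one parent.
+    * def minimumScore(parentScore: Double): Double =
+    *   parentScore * (1.0 - thresholdPercent / 100.0)
+    *
+    * val lowBound = minimumScore(0.5)  // approximately 0.45
+    * val highBound = minimumScore(0.9) // approximately 0.81
+    * // "optimistic": 0.46 is kept (it beats lowBound), 0.42 is rejected
+    * // "strict": the interaction score must beat highBound as well
+    * }}}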
+ * @param value Double in range of -inf -> inf + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def setFeatureInteractionTargetInteractionPercentage( + value: Double + ): this.type = { + _instanceConfig.featureEngineeringConfig.featureInteractionTargetInteractionPercentage = + value + this + } + + def setFeatureImportanceCutoffType(value: String): this.type = { + validateMembership( + value, + allowableFeatureImportanceCutoffTypes, + "FeatureImportanceCutoffType" + ) + _instanceConfig.featureEngineeringConfig.featureImportanceCutoffType = value + this + } + + def setFeatureImportanceCutoffValue(value: Double): this.type = { + _instanceConfig.featureEngineeringConfig.featureImportanceCutoffValue = + value + this + } + + def setDataReductionFactor(value: Double): this.type = { + zeroToOneValidation(value, "DataReductionFactor") + _instanceConfig.featureEngineeringConfig.dataReductionFactor = value + this + } + + /** + * Setter switch for turning cardinality switch on. + * This switch sets whether a cardinality check is performed on StringIndexed columns. + * + * @note Default: true + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def cardinalitySwitchOn(): this.type = { + _instanceConfig.featureEngineeringConfig.cardinalitySwitch = true + this + } + + /** + * Setter switch for turning cardinality switch off. + * + * @note Not recommended for exploratory data set features. 
+ * @note Default: true + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def cardinalitySwitchOff(): this.type = { + _instanceConfig.featureEngineeringConfig.cardinalitySwitch = false + this + } + + /** + * Setter for direct override of the cardinality switch + * + * @note Default: true + * @param value: Boolean + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setFillConfigCardinalitySwitch(value: Boolean): this.type = { + _instanceConfig.featureEngineeringConfig.cardinalitySwitch = value + this + } + + /** + * Setter for specifying the mode of cardinality checking [either "approx" for approximate distinct or "exact"] + * + * @param value String: either "approx" or "exact" + * @note Default - exact + * @throws IllegalArgumentException if a mode other than exact or approx is specified. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigCardinalityType(value: String): this.type = { + validateMembership( + value, + allowableCardinalilties, + "fillConfigCardinalityType" + ) + _instanceConfig.featureEngineeringConfig.cardinalityType = value + this + } + + /** + * Setter for overriding the default cardinality limit when validating whether a field should be considered for + * OneHotEncoding or StringIndexing + * + * @param value Int: The value at above which a field will be declared to be of too high a cardinality for StringIndexing or OneHotEncoding + * @note Default: 200 + * @throws java.lang.IllegalArgumentException if the number is <= to 0 + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigCardinalityLimit(value: Int): this.type = { + require(value > 0, s"Cardinality limit must be greater than 0") + _instanceConfig.featureEngineeringConfig.cardinalityLimit = value + this + } + + /** + * Setter for defining the precision calculation when in "approx" mode for cardinalityType. 
Must be in range 0 -> 1 + * + * @param value Double: The precision for approximate distinct calculations for cardinality purposes + * @throws java.lang.IllegalArgumentException if the Double supplied is outside of the range of 0 -> 1 + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigCardinalityPrecision(value: Double): this.type = { + require(value >= 0.0, s"Precision must be greater than or equal to 0.") + require(value <= 1.0, s"Precision must be less than or equal to 1.") + _instanceConfig.featureEngineeringConfig.cardinalityPrecision = value + this + } + + /** + * Setter for the cardinality check mode to be used. Available modes are "warn" and "silent". + * In "warn" mode, an exception will be thrown if the cardinality for a categorical column is above the threshold. + * In "silent" mode, the field will be ignored from processing and will not be included in the feature vector. + * + * @note Default: "silent" + * @param value String: either "warn" or "silent" + * @throws IllegalArgumentException if the mode supplied is not either "warn" or "silent" + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigCardinalityCheckMode(value: String): this.type = { + validateMembership( + value, + allowableCategoricalFilterModes, + "fillConfigCardinalityCheckMode" + ) + _instanceConfig.featureEngineeringConfig.cardinalityCheckMode = value + this + } + + /** + * Setter for defining the precision for calculating the model type as per the label column + * @note setting this value to zero (0) for a large regression problem will incur a long processing time and + * an expensive shuffle. + * @param value Double: Precision accuracy for approximate distinct calculation. 
+ * @throws IllegalArgumentException If the value is outside of the allowable range of [0, 1] + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigFilterPrecision(value: Double): this.type = { + require( + value >= 0, + s"Filter Precision for NA Fill must be greater than or equal to 0." + ) + require( + value <= 1, + s"Filter Precision for NA Fill must be less than or equal to 1." + ) + _instanceConfig.featureEngineeringConfig.filterPrecision = value + this + } + + /** + * Setter for providing a map of [Column Name -> String Fill Value] for manual by-column overrides. Any non-specified + * fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values + * in non-numeric columns. + * @note if naFillMode is specified as using Map Fill modes, this setter or the numeric na fill map MUST be set. + * @note If fields are specified in here that are not part of the DataFrame's schema, an exception will be thrown. + * @param value Map[String, String]: Column Name as String -> Fill Value as String + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setFillConfigCategoricalNAFillMap( + value: Map[String, String] + ): this.type = { + _instanceConfig.featureEngineeringConfig.categoricalNAFillMap = value + this + } + + /** + * Setter for providing a map of [Column Name -> AnyVal Fill Value] (must be numeric). Any non-specified + * fields in this map will utilize the "auto" statistics-based fill paradigm to calculate and fill any NA values + * in numeric columns. + * @note if naFillMode is specified as using Map Fill modes, this setter or the categorical na fill map MUST be set. + * @note If fields are specified in here that are not part of the DataFrame's schema, an exception will be thrown. 
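+    * @example A minimal sketch, assuming `conf` is an already-constructed instance of this
+    *          configuration class (a hypothetical name) and that "age" and "income" are numeric
+    *          columns in the DataFrame (also hypothetical):
+    * {{{
+    * val numericFills: Map[String, AnyVal] = Map("age" -> 0, "income" -> 0.0)
+    * conf
+    *   .setFillConfigNAFillMode("mapFill")
+    *   .setFillConfigNumericNAFillMap(numericFills)
+    * }}}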
+ * @param value Map[String, AnyVal]: Column Name as String -> Fill Numeric Type Value + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setFillConfigNumericNAFillMap(value: Map[String, AnyVal]): this.type = { + _instanceConfig.featureEngineeringConfig.numericNAFillMap = value + this + } + + /** + * Setter for providing a 'blanket override' value (fill all found categorical columns' missing values with this + * specified value). + * @param value String: A value to fill all categorical na values in the DataFrame with. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setFillConfigCharacterNABlanketFillValue(value: String): this.type = { + _instanceConfig.featureEngineeringConfig.characterNABlanketFillValue = value + this + } + + /** + * Setter for providing a 'blanket override' value (fill all found numeric columns' missing values with this + * specified value) + * @param value Double: A value to fill all numeric na value in the DataFrame with. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + def setFillConfigNumericNABlanketFillValue(value: Double): this.type = { + _instanceConfig.featureEngineeringConfig.numericNABlanketFillValue = value + this + } + + /** + * Mode for na fill
+ * Available modes:
+ * auto : Stats-based na fill for fields. Usage of .setNumericFillStat and + * .setCharacterFillStat will inform the type of statistics that will be used to fill.
+ * mapFill : Custom by-column overrides to 'blanket fill' na values on a per-column + * basis. The categorical (string) fields are set via .setFillConfigCategoricalNAFillMap while the + * numeric fields are set via .setFillConfigNumericNAFillMap.
+ * blanketFillAll : Fills all fields based on the values specified by + * .setFillConfigCharacterNABlanketFillValue and .setFillConfigNumericNABlanketFillValue. All NA's for the + * appropriate types will be filled in accordingly throughout all columns.
+ * blanketFillCharOnly : Will use statistics to fill in numeric fields, but will replace + * the na values of all categorical (character) fields with a blanket fill value.
+ * blanketFillNumOnly : Will use statistics to fill in character fields, but will replace + * the na values of all numeric fields with a blanket value. + * @throws IllegalArgumentException if the mode specified is not supported. + * @param value String: Mode for NA Fill + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + def setFillConfigNAFillMode(value: String): this.type = { + validateMembership(value, allowableNAFillModes, "fillConfigNAFillMode") + _instanceConfig.featureEngineeringConfig.naFillMode = value + this + } + + /** + * Algorithm Config + */ + def setStringBoundaries(value: Map[String, List[String]]): this.type = { + validateStringBoundariesKeys(modelType, value) + _instanceConfig.algorithmConfig.stringBoundaries = value + this + } + + def setNumericBoundaries(value: Map[String, (Double, Double)]): this.type = { + validateNumericBoundariesValues(value) + validateNumericBoundariesKeys(modelType, value) + _instanceConfig.algorithmConfig.numericBoundaries = value + this + } + + /** + * Tuner Config + */ + def setTunerAutoStoppingScore(value: Double): this.type = { + _instanceConfig.tunerConfig.tunerAutoStoppingScore = value + this + } + + def setTunerParallelism(value: Int): this.type = { + if (value > 30) + println( + "WARNING - Setting Tuner Parallelism greater than 30 could put excessive stress on the " + + "Driver. Ensure driver is monitored for stability." + ) + _instanceConfig.tunerConfig.tunerParallelism = value + this + } + + def setTunerKFold(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKFold = value + this + } + + def setTunerTrainPortion(value: Double): this.type = { + require( + value > 0.0 && value < 1.0, + s"TunerTrainPortion must be within the range of 0.0 to 1.0." + ) + if (value < 0.5) + println( + s"WARNING - setting TunerTrainPortion below 0.5 may result in a poorly fit model. Best" + + s" practices guidance typically adheres to a train portion of 0.7 or 0.8." 
+ ) + _instanceConfig.tunerConfig.tunerTrainPortion = value + this + } + + def setTunerTrainSplitMethod(value: String): this.type = { + validateMembership( + value, + allowableTrainSplitMethods, + "TunerTrainSplitMethod" + ) + _instanceConfig.tunerConfig.tunerTrainSplitMethod = value + this + } + + /** + * Setter - for setting the name of the Synthetic column name + * + * @param value String: A column name that is uniquely not part of the main DataFrame + * @since 0.5.1 + * @author Ben Wilson + */ + def setTunerKSampleSyntheticCol(value: String): this.type = { + _instanceConfig.tunerConfig.tunerKSampleSyntheticCol = value + this + } + + /** + * Setter for specifying the number of K-Groups to generate in the KMeans model + * + * @param value Int: number of k groups to generate + * @return this + */ + def setTunerKSampleKGroups(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleKGroups = value + this + } + + /** + * Setter for specifying the maximum number of iterations for the KMeans model to go through to converge + * + * @param value Int: Maximum limit on iterations + * @return this + */ + def setTunerKSampleKMeansMaxIter(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleKMeansMaxIter = value + this + } + + /** + * Setter for Setting the tolerance for KMeans (must be >0) + * + * @param value The tolerance value setting for KMeans + * @see reference: [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans]] + * for further details. + * @return this + * @throws IllegalArgumentException() if a value less than 0 is entered + */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleKMeansTolerance(value: Double): this.type = { + require( + value > 0, + s"KMeans tolerance value ${value.toString} is out of range. Must be > 0." 
+ ) + _instanceConfig.tunerConfig.tunerKSampleKMeansTolerance = value + this + } + + /** + * Setter for which distance measurement to use to calculate the nearness of vectors to a centroid + * + * @param value String: Options -> "euclidean" or "cosine" Default: "euclidean" + * @return this + * @throws IllegalArgumentException() if an invalid value is entered + */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleKMeansDistanceMeasurement(value: String): this.type = { + validateMembership( + value, + allowableKMeansDistanceMeasurements, + "tunerKSampleKMeansDistanceMeasurement" + ) + _instanceConfig.tunerConfig.tunerKSampleKMeansDistanceMeasurement = value + this + } + + /** + * Setter for a KMeans seed for the clustering algorithm + * + * @param value Long: Seed value + * @return this + */ + def setTunerKSampleKMeansSeed(value: Long): this.type = { + _instanceConfig.tunerConfig.tunerKSampleKMeansSeed = value + this + } + + /** + * Setter for the internal KMeans column for cluster membership attribution + * + * @param value String: column name for internal algorithm column for group membership + * @return this + */ + def setTunerKSampleKMeansPredictionCol(value: String): this.type = { + _instanceConfig.tunerConfig.tunerKSampleKMeansPredictionCol = value + this + } + + /** + * Setter for Configuring the number of Hash Tables to use for MinHashLSH + * + * @param value Int: Count of hash tables to use + * @see [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH]] + * for more information + * @return this + */ + def setTunerKSampleLSHHashTables(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleLSHHashTables = value + this + } + + def setTunerKSampleLSHSeed(value: Long): this.type = { + _instanceConfig.tunerConfig.tunerKSampleLSHSeed = value + this + } + + /** + * Setter for the internal LSH output hash information column + * + * @param value String: column name for the internal MinHashLSH 
Model transformation value + * @return this + */ + def setTunerKSampleLSHOutputCol(value: String): this.type = { + _instanceConfig.tunerConfig.tunerKSampleLSHOutputCol = value + this + } + + /** + * Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data + * + * @note the higher the value set here, the higher the variance in synthetic data generation + * @param value Int: Number of vectors to find nearest each centroid within the class + * @return this + */ + def setTunerKSampleQuorumCount(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleQuorumCount = value + this + } + + /** + * Setter for minimum threshold for vector indexes to mutate within the feature vector. + * + * @note In vectorMutationMethod "fixed" this sets the fixed count of how many vector positions to mutate. + * In vectorMutationMethod "random" this sets the lower threshold for 'at least this many indexes will + * be mutated' + * @param value The minimum (or fixed) number of indexes to mutate. + * @return this + */ + def setTunerKSampleMinimumVectorCountToMutate(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleMinimumVectorCountToMutate = value + this + } + + /** + * Setter for the Vector Mutation Method + * + * @note Options: + * "fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes. + * "random" - will use this number as a lower bound on a random selection of indexes between this and the vector length. + * "all" - will mutate all of the vectors. + * @param value String - the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. 
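+    * @example A minimal sketch, assuming `conf` is an already-constructed instance of this
+    *          configuration class (a hypothetical name). With "fixed", the count set through
+    *          setTunerKSampleMinimumVectorCountToMutate is the number of randomly selected
+    *          vector positions to mutate:
+    * {{{
+    * conf
+    *   .setTunerKSampleMinimumVectorCountToMutate(3)
+    *   .setTunerKSampleVectorMutationMethod("fixed")
+    * }}}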
+ */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleVectorMutationMethod(value: String): this.type = { + validateMembership( + value, + allowableVectorMutationMethods, + "tunerKSampleVectorMutationMethod" + ) + _instanceConfig.tunerConfig.tunerKSampleVectorMutationMethod = value + this + } + + /** + * Setter for the Mutation Mode of the feature vector individual values + * + * @note Options: + * "weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors + * "random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors + * "ratio" - uses a ratio between the values of the centroid vector and the mutation vector * + * @param value String: the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleMutationMode(value: String): this.type = { + validateMembership( + value, + allowableMutationModes, + "tunerKSampleMutationMode" + ) + _instanceConfig.tunerConfig.tunerKSampleMutationMode = value + this + } + + /** + * Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode + * + * @param value Double: value between 0 and 1 for mutation magnitude adjustment. + * @note the higher this value, the closer to the centroid vector vs. the candidate mutation vector the synthetic row data will be. + * @return this + * @throws IllegalArgumentException() if the value specified is outside of the range (0, 1) + */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleMutationValue(value: Double): this.type = { + require( + value > 0 & value < 1, + s"Mutation Value must be between 0 and 1. Value $value is not permitted." + ) + _instanceConfig.tunerConfig.tunerKSampleMutationValue = value + this + } + + /** + * Setter - for determining the label balance approach mode. 
+ * + * @note Available modes:
+ * 'match': Will match all smaller class counts to largest class count. [WARNING] - May significantly increase memory pressure!
+ * 'percentage': Will adjust smaller classes to a percentage value of the largest class count. + * 'target': Will increase smaller class counts to a fixed numeric target of rows. + * @param value String: one of: 'match', 'percentage' or 'target' + * @note Default: "percentage" + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException if the provided mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleLabelBalanceMode(value: String): this.type = { + validateMembership( + value, + allowableLabelBalanceModes, + "tunerKSampleLabelBalanceMode" + ) + _instanceConfig.tunerConfig.tunerKSampleLabelBalanceMode = value + this + } + + /** + * Setter - for overriding the cardinality exception threshold. [WARNING] increasing this value on + * a sufficiently large data set could incur, during runtime, excessive memory and cpu pressure on the cluster. + * + * @param value Int: the limit above which an exception will be thrown for a classification problem wherein the + * label distinct count is too large to successfully generate synthetic data. + * @note Default: 20 + * @since 0.5.1 + * @author Ben Wilson + */ + def setTunerKSampleCardinalityThreshold(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleCardinalityThreshold = value + this + } + + /** + * Setter - for specifying the percentage ratio for the mode 'percentage' in setTunerKSampleLabelBalanceMode() + * + * @param value Double: A fractional double in the range of 0.0 to 1.0. + * @note Setting this value to 1.0 is equivalent to setting the label balance mode to 'match' + * @note Default: 0.2 + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException if the provided value is outside of the range of 0.0 -> 1.0 + */ + @throws(classOf[IllegalArgumentException]) + def setTunerKSampleNumericRatio(value: Double): this.type = { + require( + value <= 1.0 && value > 0.0, + s"Invalid Numeric Ratio entered! Must be between 0 and 1. " 
+ + s"${value.toString} is not valid." + ) + _instanceConfig.tunerConfig.tunerKSampleNumericRatio = value + this + } + + /** + * Setter - for specifying the target row count to generate for 'target' mode in setTunerKSampleLabelBalanceMode() + * + * @param value Int: The desired final number of rows per minority class label + * @note [WARNING] Setting this value too high will greatly increase runtime and memory pressure. + * @since 0.5.1 + * @author Ben Wilson + */ + def setTunerKSampleNumericTarget(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerKSampleNumericTarget = value + this + } + + def setTunerTrainSplitChronologicalColumn(value: String): this.type = { + _instanceConfig.tunerConfig.tunerTrainSplitChronologicalColumn = value + if (value.length > 0) { + val updatedFieldsToIgnore = genericConfig.fieldsToIgnoreInVector ++ Array( + value + ) + genericConfig.fieldsToIgnoreInVector = updatedFieldsToIgnore + } + this + } + + def setTunerTrainSplitChronologicalRandomPercentage( + value: Double + ): this.type = { + if (value > 10) + println( + "[WARNING] Setting TunerTrainSplitChronologicalRandomPercentage above 10 " + + "percent will cause significant per-run train/test skew and variability in row counts during training. " + + "Use higher values only if this is desired." + ) + _instanceConfig.tunerConfig.tunerTrainSplitChronologicalRandomPercentage = + value + this + } + + def setTunerSeed(value: Long): this.type = { + _instanceConfig.tunerConfig.tunerSeed = value + this + } + + def setTunerFirstGenerationGenePool(value: Int): this.type = { + if (value < 10) + println( + "[WARNING] TunerFirstGenerationGenePool values of less than 10 may not find global minima " + + "for hyperparameters. Consider setting the value > 30 for best performance." 
+ ) + _instanceConfig.tunerConfig.tunerFirstGenerationGenePool = value + this + } + + def setTunerNumberOfGenerations(value: Int): this.type = { + if (value < 3) + println( + "[WARNING] TunerNumberOfGenerations set below 3 may not explore hyperparameter feature " + + "space effectively to arrive at a global minimum." + ) + if (value > 20) + println( + "[WARNING] TunerNumberOfGenerations set above 20 will take a long time to run. Evaluate " + + "whether first generation gene pool count and number of mutations per generation should be adjusted higher " + + "instead." + ) + _instanceConfig.tunerConfig.tunerNumberOfGenerations = value + this + } + + def setTunerNumberOfParentsToRetain(value: Int): this.type = { + require( + value > 0, + s"TunerNumberOfParentsToRetain must be > 0. $value is outside of bounds." + ) + _instanceConfig.tunerConfig.tunerNumberOfParentsToRetain = value + this + } + + def setTunerNumberOfMutationsPerGeneration(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerNumberOfMutationsPerGeneration = value + this + } + + def setTunerGeneticMixing(value: Double): this.type = { + zeroToOneValidation(value, "TunerGeneticMixing") + if (value > 0.9) + println( + s"[WARNING] Setting TunerGeneticMixing to a value greater than 0.9 will not effectively " + + s"explore the hyperparameter feature space. Use such settings only for fine-tuning around a pre-calculated " + + s"global minimum." 
+ ) + _instanceConfig.tunerConfig.tunerGeneticMixing = value + this + } + + def setTunerGenerationalMutationStrategy(value: String): this.type = { + validateMembership( + value, + allowableMutationStrategies, + "TunerGenerationalMutationStrategy" + ) + _instanceConfig.tunerConfig.tunerGenerationalMutationStrategy = value + this + } + + def setTunerFixedMutationValue(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerFixedMutationValue = value + this + } + + def setTunerMutationMagnitudeMode(value: String): this.type = { + validateMembership( + value, + allowableMutationMagnitudeMode, + "TunerMutationMagnitudeMode" + ) + _instanceConfig.tunerConfig.tunerMutationMagnitudeMode = value + this + } + + def setTunerEvolutionStrategy(value: String): this.type = { + validateMembership( + value, + allowableEvolutionStrategies, + "TunerEvolutionStrategy" + ) + _instanceConfig.tunerConfig.tunerEvolutionStrategy = value + this + } + + /** + * Setter for defining the secondary stopping criteria for continuous training mode ( number of consistently + * not-improving runs to terminate the learning algorithm due to diminishing returns. + * @param value Negative Integer (an improvement to a priori will reset the counter and subsequent non-improvements + * will decrement a mutable counter. If the counter hits this limit specified in value, the continuous + * mode algorithm will stop). + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is positive. + */ + @throws(classOf[IllegalArgumentException]) + def setTunerContinuousEvolutionImprovementThreshold(value: Int): this.type = { + require( + value < 0, + s"ContinuousEvolutionImprovementThreshold must be less than zero. It is " + + s"recommended to set this value to less than -4." 
+ ) + _instanceConfig.tunerConfig.tunerContinuousEvolutionImprovementThreshold = + value + this + } + + /** + * Setter for selecting the type of Regressor to use for the within-epoch generation MBO of candidates + * @param value String - one of "XGBoost", "LinearRegression" or "RandomForest" + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not supported + */ + @throws(classOf[IllegalArgumentException]) + def setTunerGeneticMBORegressorType(value: String): this.type = { + validateMembership( + value, + allowableGeneticMBORegressorTypes, + "GeneticMBORegressorType" + ) + _instanceConfig.tunerConfig.tunerGeneticMBORegressorType = value + this + } + + /** + * Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through + * mutation for each generation other than the initial and post-modeling optimization phases. The larger this + * value (default: 10), the more potential space can be searched. There is not a large performance hit to this, + * and as such, values in excess of 100 are viable. + * @param value Int - a factor to multiply the numberOfMutationsPerGeneration by to generate a count of potential + * candidates. + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not greater than zero. + */ + @throws(classOf[IllegalArgumentException]) + def setTunerGeneticMBOCandidateFactor(value: Int): this.type = { + require(value > 0, s"GeneticMBOCandidateFactor must be greater than zero.") + _instanceConfig.tunerConfig.tunerGeneticMBOCandidateFactor = value + this + } + + def setTunerContinuousEvolutionMaxIterations(value: Int): this.type = { + if (value > 500) + println( + s"[WARNING] Setting this value higher increases runtime by O(n/parallelism) amount. " + + s"Values higher than 500 may take an unacceptably long time to run. 
" + ) + _instanceConfig.tunerConfig.tunerContinuousEvolutionMaxIterations = value + this + } + + def setTunerContinuousEvolutionStoppingScore(value: Double): this.type = { + zeroToOneValidation(value, "TunerContinuuousEvolutionStoppingScore") + _instanceConfig.tunerConfig.tunerContinuousEvolutionStoppingScore = value + this + } + + def setTunerContinuousEvolutionParallelism(value: Int): this.type = { + if (value > 10) + println( + "[WARNING] Setting value of TunerContinuousEvolutionParallelism greater than 10 may have" + + "unintended side-effects of a longer convergence time due to async Futures that have not returned results" + + "by the time that the next iteration is initiated. Recommended settings are in the range of [4:8] for " + + "continuous mode." + ) + _instanceConfig.tunerConfig.tunerContinuousEvolutionParallelism = value + this + } + + def setTunerContinuousEvolutionMutationAggressiveness( + value: Int + ): this.type = { + _instanceConfig.tunerConfig.tunerContinuousEvolutionMutationAggressiveness = + value + this + } + + def setTunerContinuousEvolutionGeneticMixing(value: Double): this.type = { + zeroToOneValidation(value, "TunerContinuousEvolutionGeneticMixing") + if (value > 0.9) + println( + s"[WARNING] Setting TunerContinuousEvolutionGeneticMixing to a value greater than 0.9 " + + s"will not effectively explore the hyperparameter feature space. Use such settings only for fine-tuning " + + s"around a pre-calculated global minima." + ) + _instanceConfig.tunerConfig.tunerContinuousEvolutionGeneticMixing = value + this + } + + def setTunerContinuousEvolutionRollingImprovementCount( + value: Int + ): this.type = { + _instanceConfig.tunerConfig.tunerContinuousEvolutionRollingImprovingCount = + value + this + } + + //TODO: per model validation of keys? 
+ def setTunerModelSeed(value: Map[String, Any]): this.type = { + _instanceConfig.tunerConfig.tunerModelSeed = value + this + } + + def setTunerHyperSpaceInferenceOn(): this.type = { + _instanceConfig.tunerConfig.tunerHyperSpaceInference = true + this + } + + def setTunerHyperSpaceInferenceOff(): this.type = { + _instanceConfig.tunerConfig.tunerHyperSpaceInference = false + this + } + + def setTunerHyperSpaceInferenceFlag(value: Boolean): this.type = { + _instanceConfig.tunerConfig.tunerHyperSpaceInference = value + this + } + + def setTunerHyperSpaceInferenceCount(value: Int): this.type = { + if (value > 500000) + println( + "[WARNING] Setting TunerHyperSpaceInferenceCount above 500,000 will put stress on the " + + "driver for generating so many leaves." + ) + if (value > 1000000) + throw new UnsupportedOperationException( + s"Setting TunerHyperSpaceInferenceCount above " + + s"1,000,000 is not supported due to runtime considerations. $value is too large of a value." + ) + _instanceConfig.tunerConfig.tunerHyperSpaceInferenceCount = value + this + } + + def setTunerHyperSpaceModelCount(value: Int): this.type = { + if (value > 50) + println( + "[WARNING] TunerHyperSpaceModelCount values set excessively high will incur long runtime " + + "costs after the conclusion of Genetic Tuner running. Gains are diminishing after a value of 20." 
+ ) + _instanceConfig.tunerConfig.tunerHyperSpaceModelCount = value + this + } + + def setTunerHyperSpaceModelType(value: String): this.type = { + validateMembership( + value, + allowableHyperSpaceModelTypes, + "TunerHyperSpaceModelType" + ) + _instanceConfig.tunerConfig.tunerHyperSpaceModelType = value + this + } + + def setTunerInitialGenerationMode(value: String): this.type = { + validateMembership( + value, + allowableInitialGenerationModes, + "TunerInitialGenerationMode" + ) + _instanceConfig.tunerConfig.tunerInitialGenerationMode = value + this + } + + def setTunerInitialGenerationPermutationCount(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerInitialGenerationPermutationCount = value + this + } + + def setTunerInitialGenerationIndexMixingMode(value: String): this.type = { + validateMembership( + value, + allowableInitialGenerationIndexMixingModes, + "TunerInitialGenerationIndexMixingMode" + ) + _instanceConfig.tunerConfig.tunerInitialGenerationIndexMixingMode = value + this + } + + def setTunerInitialGenerationArraySeed(value: Long): this.type = { + _instanceConfig.tunerConfig.tunerInitialGenerationArraySeed = value + this + } + + def setTunerOutputDfRepartitionScaleFactor(value: Int): this.type = { + _instanceConfig.tunerConfig.tunerOutputDfRepartitionScaleFactor = value + this + } + + /** + * MLFlow Logging Config + */ + def setMlFlowLoggingOn(): this.type = { + _instanceConfig.loggingConfig.mlFlowLoggingFlag = true + this + } + + def setMlFlowLoggingOff(): this.type = { + _instanceConfig.loggingConfig.mlFlowLoggingFlag = false + this + } + + def setMlFlowLoggingFlag(value: Boolean): this.type = { + _instanceConfig.loggingConfig.mlFlowLoggingFlag = value + this + } + + def setMlFlowLogArtifactsOn(): this.type = { + _instanceConfig.loggingConfig.mlFlowLogArtifactsFlag = true + this + } + + def setMlFlowLogArtifactsOff(): this.type = { + _instanceConfig.loggingConfig.mlFlowLogArtifactsFlag = false + this + } + + def 
setMlFlowLogArtifactsFlag(value: Boolean): this.type = { + _instanceConfig.loggingConfig.mlFlowLogArtifactsFlag = value + this + } + + //TODO: Add path validation here!! + def setMlFlowTrackingURI(value: String): this.type = { + _instanceConfig.loggingConfig.mlFlowTrackingURI = value + this + } + + def setMlFlowExperimentName(value: String): this.type = { + _instanceConfig.loggingConfig.mlFlowExperimentName = value + this + } + + def setMlFlowAPIToken(value: String): this.type = { + _instanceConfig.loggingConfig.mlFlowAPIToken = value + this + } + + @throws(classOf[IllegalArgumentException]) + def setMlFlowModelSaveDirectory(value: String): this.type = { + require( + value.take(6) == "dbfs:/", + s"Model save directory must be written to dbfs:/." + ) + _instanceConfig.loggingConfig.mlFlowModelSaveDirectory = value + this + } + + def setMlFlowLoggingMode(value: String): this.type = { + validateMembership(value, allowableMlFlowLoggingModes, "MlFlowLoggingMode") + _instanceConfig.loggingConfig.mlFlowLoggingMode = value + this + } + + def setMlFlowBestSuffix(value: String): this.type = { + _instanceConfig.loggingConfig.mlFlowBestSuffix = value + this + } + + @throws(classOf[IllegalArgumentException]) + def setInferenceConfigSaveLocation(value: String): this.type = { + require( + value.take(6) == "dbfs:/", + s"Inference save location must be on dbfs:/." + ) + _instanceConfig.loggingConfig.inferenceConfigSaveLocation = value + this + } + + /** + * Setter
+ * Allows for setting a series of custom mlflow logging tags to an experiment run (universal across all + * iterations and models of the run) to be logged in mlflow as a custom tag key value pair + * + * @param value Map[String -> AnyVal] + * @note The mapped values can be of types: Double, Float, Long, Int, Short, Byte, or Boolean; each value is stored as its String representation + */ + def setMlFlowCustomRunTags(value: Map[String, AnyVal]): this.type = { + + val parsedValue = + value.map { case (k, v) => k -> v.toString } + + _instanceConfig.loggingConfig.mlFlowCustomRunTags = parsedValue + this + } + + /** + * Setter for providing a path to write the kfold train/test splits as Delta data sets to (useful for extremely + * large data sets or a situation where using local disk storage might be prohibitively expensive) + * @param value String path to a dbfs location for creating the temporary (or persisted) + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def setTunerDeltaCacheBackingDirectory(value: String): this.type = { + + require( + value.take(6) == "dbfs:/", + s"Delta backing location must be written to dbfs." + ) + _instanceConfig.tunerConfig.tunerDeltaCacheBackingDirectory = value + this + } + + /** + * Setter for determining the split caching strategy (either persist to disk for each kfold split or backing to Delta) + * @param value Configuration string, one of 'persist', 'delta', or 'cache' + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def setSplitCachingStrategy(value: String): this.type = { + val valueSet = value.toLowerCase + require( + valueSet == "persist" || valueSet == "delta" || valueSet == "cache", + s"SplitCachingStrategy '$value' is invalid. Must be either 'delta', 'cache', or 'persist'" + ) + _instanceConfig.tunerConfig.splitCachingStrategy = valueSet + this + } + + /** + * Setter for whether or not to delete the written train/test splits for the run in Delta. 
Defaulted to true + * which means that the job will delete the data on Object store to clean itself up after the run is completed + * if the splitCachingStrategy is set to 'delta' + * @param value Boolean - true => delete false => leave on Object Store + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def setTunerDeltaCacheBackingDirectoryRemovalFlag( + value: Boolean + ): this.type = { + _instanceConfig.tunerConfig.tunerDeltaCacheBackingDirectoryRemovalFlag = + value + this + } + + def deltaCheckBackingDirectoryRemovalOn(): this.type = { + _instanceConfig.tunerConfig.tunerDeltaCacheBackingDirectoryRemovalFlag = + true + this + } + + def deltaCheckBackingDirectoryRemovalOff(): this.type = { + _instanceConfig.tunerConfig.tunerDeltaCacheBackingDirectoryRemovalFlag = + false + this + } + + /** + * Getters + */ + def getInstanceConfig: InstanceConfig = _instanceConfig + + def generateMainConfig: MainConfig = + ConfigurationGenerator.generateMainConfig(_instanceConfig) + + def generateFeatureImportanceConfig: MainConfig = + ConfigurationGenerator.generateMainConfig(_instanceConfig) + + def generateTreeSplitConfig: MainConfig = + ConfigurationGenerator.generateMainConfig(_instanceConfig) + +} + +object ConfigurationGenerator extends ConfigurationDefaults { + + import PredictionType._ + + def apply(modelFamily: String, + predictionType: String, + genericConfig: GenericConfig): ConfigurationGenerator = + new ConfigurationGenerator(modelFamily, predictionType, genericConfig) + + /** + * + * @param modelFamily + * @param predictionType + * @return + */ + def generateDefaultConfig(modelFamily: String, + predictionType: String): InstanceConfig = { + + predictionTypeEvaluator(predictionType) match { + case Regressor => + new ConfigurationGenerator( + modelFamily, + predictionType, + GenericConfigGenerator.generateDefaultRegressorConfig + ).getInstanceConfig + case Classifier => + new ConfigurationGenerator( + modelFamily, + predictionType, + 
GenericConfigGenerator.generateDefaultClassifierConfig + ).getInstanceConfig + } + + } + + private def standardizeModelFamilyStrings(value: String): String = { + value.toLowerCase match { + case "randomforest" => "RandomForest" + case "gbt" => "GBT" + case "linearregression" => "LinearRegression" + case "logisticregression" => "LogisticRegression" + case "mlpc" => "MLPC" + case "svm" => "SVM" + case "trees" => "Trees" + case "xgboost" => "XGBoost" + case "gbmbinary" => "gbmBinary" + case "gbmmulti" => "gbmMulti" + case "gbmmultiova" => "gbmMultiOVA" + case "gbmhuber" => "gbmHuber" + case "gbmfair" => "gbmFair" + case "gbmlasso" => "gbmLasso" + case "gbmridge" => "gbmRidge" + case "gbmpoisson" => "gbmPoisson" + case "gbmquantile" => "gbmQuantile" + case "gbmmape" => "gbmMape" + case "gbmtweedie" => "gbmTweedie" + case "gbmgamma" => "gbmGamma" + case _ => + throw new IllegalArgumentException( + s"standardizeModelFamilyStrings does not have a supported " + + s"type of: $value" + ) + } + } + + /** + * + * @param config + * @return + */ + def generateMainConfig(config: InstanceConfig): MainConfig = { + MainConfig( + modelFamily = standardizeModelFamilyStrings(config.modelFamily), + labelCol = config.genericConfig.labelCol, + featuresCol = config.genericConfig.featuresCol, + naFillFlag = config.switchConfig.naFillFlag, + varianceFilterFlag = config.switchConfig.varianceFilterFlag, + outlierFilterFlag = config.switchConfig.outlierFilterFlag, + pearsonFilteringFlag = config.switchConfig.pearsonFilterFlag, + covarianceFilteringFlag = config.switchConfig.covarianceFilterFlag, + oneHotEncodeFlag = config.switchConfig.oneHotEncodeFlag, + scalingFlag = config.switchConfig.scalingFlag, + featureInteractionFlag = config.switchConfig.featureInteractionFlag, + dataPrepCachingFlag = config.switchConfig.dataPrepCachingFlag, + dataPrepParallelism = config.featureEngineeringConfig.dataPrepParallelism, + autoStoppingFlag = config.switchConfig.autoStoppingFlag, + autoStoppingScore = 
config.tunerConfig.tunerAutoStoppingScore, + featureImportanceCutoffType = + config.featureEngineeringConfig.featureImportanceCutoffType, + featureImportanceCutoffValue = + config.featureEngineeringConfig.featureImportanceCutoffValue, + dateTimeConversionType = config.genericConfig.dateTimeConversionType, + fieldsToIgnoreInVector = config.genericConfig.fieldsToIgnoreInVector, + numericBoundaries = config.algorithmConfig.numericBoundaries, + stringBoundaries = config.algorithmConfig.stringBoundaries, + scoringMetric = config.genericConfig.scoringMetric, + scoringOptimizationStrategy = + config.genericConfig.scoringOptimizationStrategy, + fillConfig = FillConfig( + numericFillStat = config.featureEngineeringConfig.numericFillStat, + characterFillStat = config.featureEngineeringConfig.characterFillStat, + modelSelectionDistinctThreshold = + config.featureEngineeringConfig.modelSelectionDistinctThreshold, + cardinalitySwitch = config.featureEngineeringConfig.cardinalitySwitch, + cardinalityType = config.featureEngineeringConfig.cardinalityType, + cardinalityLimit = config.featureEngineeringConfig.cardinalityLimit, + cardinalityPrecision = + config.featureEngineeringConfig.cardinalityPrecision, + cardinalityCheckMode = + config.featureEngineeringConfig.cardinalityCheckMode, + filterPrecision = config.featureEngineeringConfig.filterPrecision, + categoricalNAFillMap = + config.featureEngineeringConfig.categoricalNAFillMap, + numericNAFillMap = config.featureEngineeringConfig.numericNAFillMap, + characterNABlanketFillValue = + config.featureEngineeringConfig.characterNABlanketFillValue, + numericNABlanketFillValue = + config.featureEngineeringConfig.numericNABlanketFillValue, + naFillMode = config.featureEngineeringConfig.naFillMode + ), + outlierConfig = OutlierConfig( + filterBounds = config.featureEngineeringConfig.outlierFilterBounds, + lowerFilterNTile = + config.featureEngineeringConfig.outlierLowerFilterNTile, + upperFilterNTile = + 
config.featureEngineeringConfig.outlierUpperFilterNTile, + filterPrecision = config.featureEngineeringConfig.outlierFilterPrecision, + continuousDataThreshold = + config.featureEngineeringConfig.outlierContinuousDataThreshold, + fieldsToIgnore = config.featureEngineeringConfig.outlierFieldsToIgnore + ), + pearsonConfig = PearsonConfig( + filterStatistic = config.featureEngineeringConfig.pearsonFilterStatistic, + filterDirection = config.featureEngineeringConfig.pearsonFilterDirection, + filterManualValue = + config.featureEngineeringConfig.pearsonFilterManualValue, + filterMode = config.featureEngineeringConfig.pearsonFilterMode, + autoFilterNTile = config.featureEngineeringConfig.pearsonAutoFilterNTile + ), + covarianceConfig = CovarianceConfig( + correlationCutoffLow = + config.featureEngineeringConfig.covarianceCorrelationCutoffLow, + correlationCutoffHigh = + config.featureEngineeringConfig.covarianceCorrelationCutoffHigh + ), + featureInteractionConfig = FeatureInteractionConfig( + retentionMode = + config.featureEngineeringConfig.featureInteractionRetentionMode, + continuousDiscretizerBucketCount = + config.featureEngineeringConfig.featureInteractionContinuousDiscretizerBucketCount, + parallelism = + config.featureEngineeringConfig.featureInteractionParallelism, + targetInteractionPercentage = + config.featureEngineeringConfig.featureInteractionTargetInteractionPercentage + ), + scalingConfig = ScalingConfig( + scalerType = config.featureEngineeringConfig.scalingType, + scalerMin = config.featureEngineeringConfig.scalingMin, + scalerMax = config.featureEngineeringConfig.scalingMax, + standardScalerMeanFlag = + config.featureEngineeringConfig.scalingStandardMeanFlag, + standardScalerStdDevFlag = + config.featureEngineeringConfig.scalingStdDevFlag, + pNorm = config.featureEngineeringConfig.scalingPNorm + ), + geneticConfig = GeneticConfig( + parallelism = config.tunerConfig.tunerParallelism, + kFold = config.tunerConfig.tunerKFold, + trainPortion = 
config.tunerConfig.tunerTrainPortion, + trainSplitMethod = config.tunerConfig.tunerTrainSplitMethod, + kSampleConfig = KSampleConfig( + syntheticCol = config.tunerConfig.tunerKSampleSyntheticCol, + kGroups = config.tunerConfig.tunerKSampleKGroups, + kMeansMaxIter = config.tunerConfig.tunerKSampleKMeansMaxIter, + kMeansTolerance = config.tunerConfig.tunerKSampleKMeansTolerance, + kMeansDistanceMeasurement = + config.tunerConfig.tunerKSampleKMeansDistanceMeasurement, + kMeansSeed = config.tunerConfig.tunerKSampleKMeansSeed, + kMeansPredictionCol = + config.tunerConfig.tunerKSampleKMeansPredictionCol, + lshHashTables = config.tunerConfig.tunerKSampleLSHHashTables, + lshSeed = config.tunerConfig.tunerKSampleLSHSeed, + lshOutputCol = config.tunerConfig.tunerKSampleLSHOutputCol, + quorumCount = config.tunerConfig.tunerKSampleQuorumCount, + minimumVectorCountToMutate = + config.tunerConfig.tunerKSampleMinimumVectorCountToMutate, + vectorMutationMethod = + config.tunerConfig.tunerKSampleVectorMutationMethod, + mutationMode = config.tunerConfig.tunerKSampleMutationMode, + mutationValue = config.tunerConfig.tunerKSampleMutationValue, + labelBalanceMode = config.tunerConfig.tunerKSampleLabelBalanceMode, + cardinalityThreshold = + config.tunerConfig.tunerKSampleCardinalityThreshold, + numericRatio = config.tunerConfig.tunerKSampleNumericRatio, + numericTarget = config.tunerConfig.tunerKSampleNumericTarget, + outputDfRepartitionScaleFactor = + config.tunerConfig.tunerOutputDfRepartitionScaleFactor + ), + trainSplitChronologicalColumn = + config.tunerConfig.tunerTrainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage = + config.tunerConfig.tunerTrainSplitChronologicalRandomPercentage, + seed = config.tunerConfig.tunerSeed, + firstGenerationGenePool = + config.tunerConfig.tunerFirstGenerationGenePool, + numberOfGenerations = config.tunerConfig.tunerNumberOfGenerations, + numberOfParentsToRetain = + config.tunerConfig.tunerNumberOfParentsToRetain, + 
numberOfMutationsPerGeneration = + config.tunerConfig.tunerNumberOfMutationsPerGeneration, + geneticMixing = config.tunerConfig.tunerGeneticMixing, + generationalMutationStrategy = + config.tunerConfig.tunerGenerationalMutationStrategy, + fixedMutationValue = config.tunerConfig.tunerFixedMutationValue, + mutationMagnitudeMode = config.tunerConfig.tunerMutationMagnitudeMode, + evolutionStrategy = config.tunerConfig.tunerEvolutionStrategy, + geneticMBORegressorType = + config.tunerConfig.tunerGeneticMBORegressorType, + geneticMBOCandidateFactor = + config.tunerConfig.tunerGeneticMBOCandidateFactor, + continuousEvolutionMaxIterations = + config.tunerConfig.tunerContinuousEvolutionMaxIterations, + continuousEvolutionStoppingScore = + config.tunerConfig.tunerContinuousEvolutionStoppingScore, + continuousEvolutionImprovementThreshold = + config.tunerConfig.tunerContinuousEvolutionImprovementThreshold, + continuousEvolutionParallelism = + config.tunerConfig.tunerContinuousEvolutionParallelism, + continuousEvolutionMutationAggressiveness = + config.tunerConfig.tunerContinuousEvolutionMutationAggressiveness, + continuousEvolutionGeneticMixing = + config.tunerConfig.tunerContinuousEvolutionGeneticMixing, + continuousEvolutionRollingImprovementCount = + config.tunerConfig.tunerContinuousEvolutionRollingImprovingCount, + modelSeed = config.tunerConfig.tunerModelSeed, + hyperSpaceInference = config.tunerConfig.tunerHyperSpaceInference, + hyperSpaceInferenceCount = + config.tunerConfig.tunerHyperSpaceInferenceCount, + hyperSpaceModelType = config.tunerConfig.tunerHyperSpaceModelType, + hyperSpaceModelCount = config.tunerConfig.tunerHyperSpaceModelCount, + initialGenerationMode = config.tunerConfig.tunerInitialGenerationMode, + initialGenerationConfig = FirstGenerationConfig( + permutationCount = + config.tunerConfig.tunerInitialGenerationPermutationCount, + indexMixingMode = + config.tunerConfig.tunerInitialGenerationIndexMixingMode, + arraySeed = 
config.tunerConfig.tunerInitialGenerationArraySeed + ), + deltaCacheBackingDirectory = + config.tunerConfig.tunerDeltaCacheBackingDirectory, + deltaCacheBackingDirectoryRemovalFlag = + config.tunerConfig.tunerDeltaCacheBackingDirectoryRemovalFlag, + splitCachingStrategy = config.tunerConfig.splitCachingStrategy + ), + mlFlowLoggingFlag = config.loggingConfig.mlFlowLoggingFlag, + mlFlowLogArtifactsFlag = config.loggingConfig.mlFlowLogArtifactsFlag, + mlFlowConfig = MLFlowConfig( + mlFlowTrackingURI = config.loggingConfig.mlFlowTrackingURI, + mlFlowExperimentName = config.loggingConfig.mlFlowExperimentName, + mlFlowAPIToken = config.loggingConfig.mlFlowAPIToken, + mlFlowModelSaveDirectory = config.loggingConfig.mlFlowModelSaveDirectory, + mlFlowLoggingMode = config.loggingConfig.mlFlowLoggingMode, + mlFlowBestSuffix = config.loggingConfig.mlFlowBestSuffix, + mlFlowCustomRunTags = config.loggingConfig.mlFlowCustomRunTags + ), + inferenceConfigSaveLocation = + config.loggingConfig.inferenceConfigSaveLocation, + dataReductionFactor = config.featureEngineeringConfig.dataReductionFactor, + pipelineDebugFlag = config.switchConfig.pipelineDebugFlag, + pipelineId = PipelineStateCache.generatePipelineId() + ) + } + + /** + * Helper method for generating the configuration for executing an exploratory FeatureImportance run + * + * @param config InstanceConfig Object + * @return Instance of FeatureImportanceConfig + * @since 0.5.1 + * @author Ben Wilson + */ + def generateFeatureImportanceConfig( + config: InstanceConfig + ): FeatureImportanceConfig = { + + FeatureImportanceConfig( + labelCol = config.genericConfig.labelCol, + featuresCol = config.genericConfig.featuresCol, + dataPrepParallelism = config.featureEngineeringConfig.dataPrepParallelism, + numericBoundaries = config.algorithmConfig.numericBoundaries, + stringBoundaries = config.algorithmConfig.stringBoundaries, + scoringMetric = config.genericConfig.scoringMetric, + trainPortion = 
config.tunerConfig.tunerTrainPortion, + trainSplitMethod = config.tunerConfig.tunerTrainSplitMethod, + trainSplitChronologicalColumn = + config.tunerConfig.tunerTrainSplitChronologicalColumn, + trainSplitChronlogicalRandomPercentage = + config.tunerConfig.tunerTrainSplitChronologicalRandomPercentage, + parallelism = config.tunerConfig.tunerParallelism, + kFold = config.tunerConfig.tunerKFold, + seed = config.tunerConfig.tunerSeed, + scoringOptimizationStrategy = + config.genericConfig.scoringOptimizationStrategy, + firstGenerationGenePool = config.tunerConfig.tunerFirstGenerationGenePool, + numberOfGenerations = config.tunerConfig.tunerNumberOfGenerations, + numberOfMutationsPerGeneration = + config.tunerConfig.tunerNumberOfMutationsPerGeneration, + numberOfParentsToRetain = config.tunerConfig.tunerNumberOfParentsToRetain, + geneticMixing = config.tunerConfig.tunerGeneticMixing, + generationalMutationStrategy = + config.tunerConfig.tunerGenerationalMutationStrategy, + mutationMagnitudeMode = config.tunerConfig.tunerMutationMagnitudeMode, + fixedMutationValue = config.tunerConfig.tunerFixedMutationValue, + autoStoppingScore = config.tunerConfig.tunerAutoStoppingScore, + autoStoppingFlag = config.switchConfig.autoStoppingFlag, + evolutionStrategy = config.tunerConfig.tunerEvolutionStrategy, + continuousEvolutionMaxIterations = + config.tunerConfig.tunerContinuousEvolutionMaxIterations, + continuousEvolutionStoppingScore = + config.tunerConfig.tunerContinuousEvolutionStoppingScore, + continuousEvolutionParallelism = + config.tunerConfig.tunerContinuousEvolutionParallelism, + continuousEvolutionMutationAggressiveness = + config.tunerConfig.tunerContinuousEvolutionMutationAggressiveness, + continuousEvolutionGeneticMixing = + config.tunerConfig.tunerContinuousEvolutionGeneticMixing, + continuousEvolutionRollingImprovementCount = + config.tunerConfig.tunerContinuousEvolutionRollingImprovingCount, + dataReductionFactor = 
config.featureEngineeringConfig.dataReductionFactor, + firstGenMode = config.tunerConfig.tunerInitialGenerationMode, + firstGenPermutations = + config.tunerConfig.tunerInitialGenerationPermutationCount, + firstGenIndexMixingMode = + config.tunerConfig.tunerInitialGenerationIndexMixingMode, + firstGenArraySeed = config.tunerConfig.tunerInitialGenerationArraySeed, + fieldsToIgnore = config.genericConfig.fieldsToIgnoreInVector, + numericFillStat = config.featureEngineeringConfig.numericFillStat, + characterFillStat = config.featureEngineeringConfig.characterFillStat, + modelSelectionDistinctThreshold = + config.featureEngineeringConfig.modelSelectionDistinctThreshold, + dateTimeConversionType = config.genericConfig.dateTimeConversionType, + modelType = config.predictionType, + featureImportanceModelFamily = config.modelFamily, + featureInteractionFlag = config.switchConfig.featureInteractionFlag, + featureInteractionRetentionMode = + config.featureEngineeringConfig.featureInteractionRetentionMode, + featureInteractionContinuousDiscretizerBucketCount = + config.featureEngineeringConfig.featureInteractionContinuousDiscretizerBucketCount, + featureInteractionParallelism = + config.featureEngineeringConfig.featureInteractionParallelism, + featureInteractionTargetInteractionPercentage = + config.featureEngineeringConfig.featureInteractionTargetInteractionPercentage, + deltaCacheBackingDirectory = + config.tunerConfig.tunerDeltaCacheBackingDirectory, + deltaCacheBackingDirectoryRemovalFlag = + config.tunerConfig.tunerDeltaCacheBackingDirectoryRemovalFlag, + splitCachingStrategy = config.tunerConfig.splitCachingStrategy + ) + + } + + /** + * Method for generating a default configuration if no overrides are specified + * @param modelFamily The model family (i.e. 
RandomForest) to generate a tuned model for + * @param predictionType The type of prediction being made (either classifier or regressor) + * @return MainConfig instance + * @since 0.4.0 + * @author Ben Wilson, Databricks + */ + def generateDefaultMainConfig(modelFamily: String, + predictionType: String): MainConfig = { + val defaultInstanceConfig = + generateDefaultConfig(modelFamily, predictionType) + generateMainConfig(defaultInstanceConfig) + } + + /** + * + * @param config + * @return + */ + def generatePrettyJsonInstanceConfig(config: InstanceConfig): String = { + + implicit val formats: Formats = Serialization.formats(hints = NoTypeHints) + writePretty(config) + } + + /** + * User quality of life method for showing human readable config in pseudo-Map form + * @param config Instance config object + */ + def printFullConfig(config: InstanceConfig): Unit = { + + implicit val formats: Formats = + Serialization.formats(hints = FullTypeHints(List(config.getClass))) + println( + writePretty(config) + .replaceAll(": \\{", "\\{") + .replaceAll(":", "->") + ) + } + + /** + * User quality of life method for showing human readable config in pseudo-Map form + * @param config Instance config object + */ + def printFullConfig(config: MainConfig): Unit = { + + implicit val formats: Formats = + Serialization.formats(hints = FullTypeHints(List(config.getClass))) + println( + writePretty(config) + .replaceAll(": \\{", "\\{") + .replaceAll(":", "->") + ) + } + + /** + * + * @param json + * @return + */ + def generateInstanceConfigFromJson(json: String): InstanceConfig = { + implicit val formats: Formats = Serialization.formats(hints = NoTypeHints) + read[InstanceConfig](json) + } + + def generateMainConfigFromJson(json: String): MainConfig = { + val objectMapper = new ObjectMapper() + objectMapper.registerModule(DefaultScalaModule) + objectMapper.readValue(json, classOf[MainConfig]) + } + + def jsonStrToMap(jsonStr: String): Map[String, Any] = { + implicit val formats = 
org.json4s.DefaultFormats + + JsonMethods.parse(jsonStr).extract[Map[String, Any]] + } + + private def validateMapConfig(defaultMap: Map[String, Any], + submittedMap: Map[String, Any]): Unit = { + + val definedKeys = defaultMap.keys + val submittedKeys = submittedMap.keys + + // Perform a quick-check + + val contained = submittedKeys.forall(definedKeys.toList.contains) + if (!contained) { + + val invalidKeys = ListBuffer[String]() + + submittedKeys.map( + x => if (!definedKeys.toList.contains(x)) invalidKeys += x + ) + + throw new IllegalArgumentException( + s"Invalid map key(s) submitted for configuration generation. \nInvalid Keys: " + + s"'${invalidKeys.mkString("','")}'. \n\tTo see a list of available keys, submit: \n\n\t\t" + + s"ConfigurationGenerator.getConfigMapKeys \n\t\t\tor \n\t\tConfigurationGenerator.printConfigMapKeys \n\t\t\t " + + s"to visualize in stdout.\n" + ) + } + + } + + /** + * + * @param modelFamily + * @param predictionType + * @param config + * @return + */ + def generateConfigFromMap(modelFamily: String, + predictionType: String, + config: Map[String, Any]): InstanceConfig = { + + val defaultMap = defaultConfigMap(modelFamily, predictionType) + + // Validate the submitted keys to ensure that there are no invalid or misspelled entries + validateMapConfig(defaultMap, config) + + lazy val genericConfigObject = new GenericConfigGenerator(predictionType) + .setLabelCol( + config.getOrElse("labelCol", defaultMap("labelCol")).toString + ) + .setFeaturesCol( + config.getOrElse("featuresCol", defaultMap("featuresCol")).toString + ) + .setDateTimeConversionType( + config + .getOrElse( + "dateTimeConversionType", + defaultMap("dateTimeConversionType") + ) + .toString + ) + .setFieldsToIgnoreInVector( + config + .getOrElse( + "fieldsToIgnoreInVector", + defaultMap("fieldsToIgnoreInVector") + ) + .asInstanceOf[Array[String]] + ) + .setScoringMetric( + config.getOrElse("scoringMetric", defaultMap("scoringMetric")).toString + ) + 
.setScoringOptimizationStrategy( + config + .getOrElse( + "scoringOptimizationStrategy", + defaultMap("scoringOptimizationStrategy") + ) + .toString + ) + + lazy val configObject = new ConfigurationGenerator( + modelFamily, + predictionType, + genericConfigObject.getConfig + ).setDataPrepParallelism( + config + .getOrElse("dataPrepParallelism", defaultMap("dataPrepParallelism")) + .toString + .toInt + ) + .setNaFillFlag( + config + .getOrElse("naFillFlag", defaultMap("naFillFlag")) + .toString + .toBoolean + ) + .setVarianceFilterFlag( + config + .getOrElse("varianceFilterFlag", defaultMap("varianceFilterFlag")) + .toString + .toBoolean + ) + .setOutlierFilterFlag( + config + .getOrElse("outlierFilterFlag", defaultMap("outlierFilterFlag")) + .toString + .toBoolean + ) + .setPearsonFilterFlag( + config + .getOrElse("pearsonFilterFlag", defaultMap("pearsonFilterFlag")) + .toString + .toBoolean + ) + .setCovarianceFilterFlag( + config + .getOrElse("covarianceFilterFlag", defaultMap("covarianceFilterFlag")) + .toString + .toBoolean + ) + .setOneHotEncodeFlag( + config + .getOrElse("oneHotEncodeFlag", defaultMap("oneHotEncodeFlag")) + .toString + .toBoolean + ) + .setScalingFlag( + config + .getOrElse("scalingFlag", defaultMap("scalingFlag")) + .toString + .toBoolean + ) + .setFeatureInteractionFlag( + config + .getOrElse( + "featureInteractionFlag", + defaultMap("featureInteractionFlag") + ) + .toString + .toBoolean + ) + .setDataPrepCachingFlag( + config + .getOrElse("dataPrepCachingFlag", defaultMap("dataPrepCachingFlag")) + .toString + .toBoolean + ) + .setAutoStoppingFlag( + config + .getOrElse("autoStoppingFlag", defaultMap("autoStoppingFlag")) + .toString + .toBoolean + ) + .setPipelineDebugFlag( + config + .getOrElse("pipelineDebugFlag", defaultMap("pipelineDebugFlag")) + .toString + .toBoolean + ) + .setFillConfigNumericFillStat( + config + .getOrElse( + "fillConfigNumericFillStat", + defaultMap("fillConfigNumericFillStat") + ) + .toString + ) + 
.setFillConfigCharacterFillStat( + config + .getOrElse( + "fillConfigCharacterFillStat", + defaultMap("fillConfigCharacterFillStat") + ) + .toString + ) + .setFillConfigModelSelectionDistinctThreshold( + config + .getOrElse( + "fillConfigModelSelectionDistinctThreshold", + defaultMap("fillConfigModelSelectionDistinctThreshold") + ) + .toString + .toInt + ) + .setFillConfigCardinalitySwitch( + config + .getOrElse( + "fillConfigCardinalitySwitch", + defaultMap("fillConfigCardinalitySwitch") + ) + .toString + .toBoolean + ) + .setFillConfigCardinalityType( + config + .getOrElse( + "fillConfigCardinalityType", + defaultMap("fillConfigCardinalityType") + ) + .toString + ) + .setFillConfigCardinalityPrecision( + config + .getOrElse( + "fillConfigCardinalityPrecision", + defaultMap("fillConfigCardinalityPrecision") + ) + .toString + .toDouble + ) + .setFillConfigCardinalityCheckMode( + config + .getOrElse( + "fillConfigCardinalityCheckMode", + defaultMap("fillConfigCardinalityCheckMode") + ) + .toString + ) + .setFillConfigCardinalityLimit( + config + .getOrElse( + "fillConfigCardinalityLimit", + defaultMap("fillConfigCardinalityLimit") + ) + .toString + .toInt + ) + .setFillConfigFilterPrecision( + config + .getOrElse( + "fillConfigFilterPrecision", + defaultMap("fillConfigFilterPrecision") + ) + .toString + .toDouble + ) + .setFillConfigCategoricalNAFillMap( + config + .getOrElse( + "fillConfigCategoricalNAFillMap", + defaultMap("fillConfigCategoricalNAFillMap") + ) + .asInstanceOf[Map[String, String]] + ) + .setFillConfigNumericNAFillMap( + config + .getOrElse( + "fillConfigNumericNAFillMap", + defaultMap("fillConfigNumericNAFillMap") + ) + .asInstanceOf[Map[String, AnyVal]] + ) + .setFillConfigCharacterNABlanketFillValue( + config + .getOrElse( + "fillConfigCharacterNABlanketFillValue", + defaultMap("fillConfigCharacterNABlanketFillValue") + ) + .toString + ) + .setFillConfigNumericNABlanketFillValue( + config + .getOrElse( + "fillConfigNumericNABlanketFillValue", + 
defaultMap("fillConfigNumericNABlanketFillValue") + ) + .toString + .toDouble + ) + .setFillConfigNAFillMode( + config + .getOrElse("fillConfigNAFillMode", defaultMap("fillConfigNAFillMode")) + .toString + ) + .setOutlierFilterBounds( + config + .getOrElse("outlierFilterBounds", defaultMap("outlierFilterBounds")) + .toString + ) + .setOutlierLowerFilterNTile( + config + .getOrElse( + "outlierLowerFilterNTile", + defaultMap("outlierLowerFilterNTile") + ) + .toString + .toDouble + ) + .setOutlierUpperFilterNTile( + config + .getOrElse( + "outlierUpperFilterNTile", + defaultMap("outlierUpperFilterNTile") + ) + .toString + .toDouble + ) + .setOutlierFilterPrecision( + config + .getOrElse( + "outlierFilterPrecision", + defaultMap("outlierFilterPrecision") + ) + .toString + .toDouble + ) + .setOutlierContinuousDataThreshold( + config + .getOrElse( + "outlierContinuousDataThreshold", + defaultMap("outlierContinuousDataThreshold") + ) + .toString + .toInt + ) + .setOutlierFieldsToIgnore( + config + .getOrElse( + "outlierFieldsToIgnore", + defaultMap("outlierFieldsToIgnore") + ) + .asInstanceOf[Array[String]] + ) + .setPearsonFilterStatistic( + config + .getOrElse( + "pearsonFilterStatistic", + defaultMap("pearsonFilterStatistic") + ) + .toString + ) + .setPearsonFilterDirection( + config + .getOrElse( + "pearsonFilterDirection", + defaultMap("pearsonFilterDirection") + ) + .toString + ) + .setPearsonFilterManualValue( + config + .getOrElse( + "pearsonFilterManualValue", + defaultMap("pearsonFilterManualValue") + ) + .toString + .toDouble + ) + .setPearsonFilterMode( + config + .getOrElse("pearsonFilterMode", defaultMap("pearsonFilterMode")) + .toString + ) + .setPearsonAutoFilterNTile( + config + .getOrElse( + "pearsonAutoFilterNTile", + defaultMap("pearsonAutoFilterNTile") + ) + .toString + .toDouble + ) + .setCovarianceCutoffLow( + config + .getOrElse("covarianceCutoffLow", defaultMap("covarianceCutoffLow")) + .toString + .toDouble + ) + .setCovarianceCutoffHigh( + 
config + .getOrElse("covarianceCutoffHigh", defaultMap("covarianceCutoffHigh")) + .toString + .toDouble + ) + .setScalingType( + config.getOrElse("scalingType", defaultMap("scalingType")).toString + ) + .setScalingMin( + config + .getOrElse("scalingMin", defaultMap("scalingMin")) + .toString + .toDouble + ) + .setScalingMax( + config + .getOrElse("scalingMax", defaultMap("scalingMax")) + .toString + .toDouble + ) + .setScalingStandardMeanFlag( + config + .getOrElse( + "scalingStandardMeanFlag", + defaultMap("scalingStandardMeanFlag") + ) + .toString + .toBoolean + ) + .setScalingStdDevFlag( + config + .getOrElse("scalingStdDevFlag", defaultMap("scalingStdDevFlag")) + .toString + .toBoolean + ) + .setScalingPNorm( + config + .getOrElse("scalingPNorm", defaultMap("scalingPNorm")) + .toString + .toDouble + ) + .setFeatureInteractionRetentionMode( + config + .getOrElse( + "featureInteractionRetentionMode", + defaultMap("featureInteractionRetentionMode") + ) + .toString + ) + .setFeatureInteractionContinuousDiscretizerBucketCount( + config + .getOrElse( + "featureInteractionContinuousDiscretizerBucketCount", + defaultMap("featureInteractionContinuousDiscretizerBucketCount") + ) + .toString + .toInt + ) + .setFeatureInteractionParallelism( + config + .getOrElse( + "featureInteractionParallelism", + defaultMap("featureInteractionParallelism") + ) + .toString + .toInt + ) + .setFeatureInteractionTargetInteractionPercentage( + config + .getOrElse( + "featureInteractionTargetInteractionPercentage", + defaultMap("featureInteractionTargetInteractionPercentage") + ) + .toString + .toDouble + ) + .setFeatureImportanceCutoffType( + config + .getOrElse( + "featureImportanceCutoffType", + defaultMap("featureImportanceCutoffType") + ) + .toString + ) + .setFeatureImportanceCutoffValue( + config + .getOrElse( + "featureImportanceCutoffValue", + defaultMap("featureImportanceCutoffValue") + ) + .toString + .toDouble + ) + .setDataReductionFactor( + config + 
.getOrElse("dataReductionFactor", defaultMap("dataReductionFactor")) + .toString + .toDouble + ) + .setStringBoundaries( + config + .getOrElse("stringBoundaries", defaultMap("stringBoundaries")) + .asInstanceOf[Map[String, List[String]]] + ) + .setNumericBoundaries( + config + .getOrElse("numericBoundaries", defaultMap("numericBoundaries")) + .asInstanceOf[Map[String, (Double, Double)]] + ) + .setTunerAutoStoppingScore( + config + .getOrElse( + "tunerAutoStoppingScore", + defaultMap("tunerAutoStoppingScore") + ) + .toString + .toDouble + ) + .setTunerParallelism( + config + .getOrElse("tunerParallelism", defaultMap("tunerParallelism")) + .toString + .toInt + ) + .setTunerKFold( + config.getOrElse("tunerKFold", defaultMap("tunerKFold")).toString.toInt + ) + .setTunerTrainPortion( + config + .getOrElse("tunerTrainPortion", defaultMap("tunerTrainPortion")) + .toString + .toDouble + ) + .setTunerTrainSplitMethod( + config + .getOrElse( + "tunerTrainSplitMethod", + defaultMap("tunerTrainSplitMethod") + ) + .toString + ) + .setTunerKSampleSyntheticCol( + config + .getOrElse( + "tunerKSampleSyntheticCol", + defaultMap("tunerKSampleSyntheticCol") + ) + .toString + ) + .setTunerKSampleKGroups( + config + .getOrElse("tunerKSampleKGroups", defaultMap("tunerKSampleKGroups")) + .toString + .toInt + ) + .setTunerKSampleKMeansMaxIter( + config + .getOrElse( + "tunerKSampleKMeansMaxIter", + defaultMap("tunerKSampleKMeansMaxIter") + ) + .toString + .toInt + ) + .setTunerKSampleKMeansTolerance( + config + .getOrElse( + "tunerKSampleKMeansTolerance", + defaultMap("tunerKSampleKMeansTolerance") + ) + .toString + .toDouble + ) + .setTunerKSampleKMeansDistanceMeasurement( + config + .getOrElse( + "tunerKSampleKMeansDistanceMeasurement", + defaultMap("tunerKSampleKMeansDistanceMeasurement") + ) + .toString + ) + .setTunerKSampleKMeansSeed( + config + .getOrElse( + "tunerKSampleKMeansSeed", + defaultMap("tunerKSampleKMeansSeed") + ) + .toString + .toLong + ) + 
.setTunerKSampleKMeansPredictionCol( + config + .getOrElse( + "tunerKSampleKMeansPredictionCol", + defaultMap("tunerKSampleKMeansPredictionCol") + ) + .toString + ) + .setTunerKSampleLSHHashTables( + config + .getOrElse( + "tunerKSampleLSHHashTables", + defaultMap("tunerKSampleLSHHashTables") + ) + .toString + .toInt + ) + .setTunerKSampleLSHSeed( + config + .getOrElse("tunerKSampleLSHSeed", defaultMap("tunerKSampleLSHSeed")) + .toString + .toLong + ) + .setTunerKSampleLSHOutputCol( + config + .getOrElse( + "tunerKSampleLSHOutputCol", + defaultMap("tunerKSampleLSHOutputCol") + ) + .toString + ) + .setTunerKSampleQuorumCount( + config + .getOrElse( + "tunerKSampleQuorumCount", + defaultMap("tunerKSampleQuorumCount") + ) + .toString + .toInt + ) + .setTunerKSampleMinimumVectorCountToMutate( + config + .getOrElse( + "tunerKSampleMinimumVectorCountToMutate", + defaultMap("tunerKSampleMinimumVectorCountToMutate") + ) + .toString + .toInt + ) + .setTunerKSampleVectorMutationMethod( + config + .getOrElse( + "tunerKSampleVectorMutationMethod", + defaultMap("tunerKSampleVectorMutationMethod") + ) + .toString + ) + .setTunerKSampleMutationMode( + config + .getOrElse( + "tunerKSampleMutationMode", + defaultMap("tunerKSampleMutationMode") + ) + .toString + ) + .setTunerKSampleMutationValue( + config + .getOrElse( + "tunerKSampleMutationValue", + defaultMap("tunerKSampleMutationValue") + ) + .toString + .toDouble + ) + .setTunerKSampleLabelBalanceMode( + config + .getOrElse( + "tunerKSampleLabelBalanceMode", + defaultMap("tunerKSampleLabelBalanceMode") + ) + .toString + ) + .setTunerKSampleCardinalityThreshold( + config + .getOrElse( + "tunerKSampleCardinalityThreshold", + defaultMap("tunerKSampleCardinalityThreshold") + ) + .toString + .toInt + ) + .setTunerKSampleNumericRatio( + config + .getOrElse( + "tunerKSampleNumericRatio", + defaultMap("tunerKSampleNumericRatio") + ) + .toString + .toDouble + ) + .setTunerKSampleNumericTarget( + config + .getOrElse( + 
"tunerKSampleNumericTarget", + defaultMap("tunerKSampleNumericTarget") + ) + .toString + .toInt + ) + .setTunerOutputDfRepartitionScaleFactor( + config + .getOrElse( + "tunerOutputDfRepartitionScaleFactor", + defaultMap("tunerOutputDfRepartitionScaleFactor") + ) + .toString + .toInt + ) + .setTunerTrainSplitChronologicalColumn( + config + .getOrElse( + "tunerTrainSplitChronologicalColumn", + defaultMap("tunerTrainSplitChronologicalColumn") + ) + .toString + ) + .setTunerTrainSplitChronologicalRandomPercentage( + config + .getOrElse( + "tunerTrainSplitChronologicalRandomPercentage", + defaultMap("tunerTrainSplitChronologicalRandomPercentage") + ) + .toString + .toDouble + ) + .setTunerSeed( + config.getOrElse("tunerSeed", defaultMap("tunerSeed")).toString.toLong + ) + .setTunerFirstGenerationGenePool( + config + .getOrElse( + "tunerFirstGenerationGenePool", + defaultMap("tunerFirstGenerationGenePool") + ) + .toString + .toInt + ) + .setTunerNumberOfGenerations( + config + .getOrElse( + "tunerNumberOfGenerations", + defaultMap("tunerNumberOfGenerations") + ) + .toString + .toInt + ) + .setTunerNumberOfParentsToRetain( + config + .getOrElse( + "tunerNumberOfParentsToRetain", + defaultMap("tunerNumberOfParentsToRetain") + ) + .toString + .toInt + ) + .setTunerNumberOfMutationsPerGeneration( + config + .getOrElse( + "tunerNumberOfMutationsPerGeneration", + defaultMap("tunerNumberOfMutationsPerGeneration") + ) + .toString + .toInt + ) + .setTunerGeneticMixing( + config + .getOrElse("tunerGeneticMixing", defaultMap("tunerGeneticMixing")) + .toString + .toDouble + ) + .setTunerGenerationalMutationStrategy( + config + .getOrElse( + "tunerGenerationalMutationStrategy", + defaultMap("tunerGenerationalMutationStrategy") + ) + .toString + ) + .setTunerFixedMutationValue( + config + .getOrElse( + "tunerFixedMutationValue", + defaultMap("tunerFixedMutationValue") + ) + .toString + .toInt + ) + .setTunerMutationMagnitudeMode( + config + .getOrElse( + "tunerMutationMagnitudeMode", 
+ defaultMap("tunerMutationMagnitudeMode") + ) + .toString + ) + .setTunerEvolutionStrategy( + config + .getOrElse( + "tunerEvolutionStrategy", + defaultMap("tunerEvolutionStrategy") + ) + .toString + ) + .setTunerGeneticMBORegressorType( + config + .getOrElse( + "tunerGeneticMBORegressorType", + defaultMap("tunerGeneticMBORegressorType") + ) + .toString + ) + .setTunerGeneticMBOCandidateFactor( + config + .getOrElse( + "tunerGeneticMBOCandidateFactor", + defaultMap("tunerGeneticMBOCandidateFactor") + ) + .toString + .toInt + ) + .setTunerContinuousEvolutionImprovementThreshold( + config + .getOrElse( + "tunerContinuousEvolutionImprovementThreshold", + defaultMap("tunerContinuousEvolutionImprovementThreshold") + ) + .toString + .toInt + ) + .setTunerContinuousEvolutionMaxIterations( + config + .getOrElse( + "tunerContinuousEvolutionMaxIterations", + defaultMap("tunerContinuousEvolutionMaxIterations") + ) + .toString + .toInt + ) + .setTunerContinuousEvolutionStoppingScore( + config + .getOrElse( + "tunerContinuousEvolutionStoppingScore", + defaultMap("tunerContinuousEvolutionStoppingScore") + ) + .toString + .toDouble + ) + .setTunerContinuousEvolutionParallelism( + config + .getOrElse( + "tunerContinuousEvolutionParallelism", + defaultMap("tunerContinuousEvolutionParallelism") + ) + .toString + .toInt + ) + .setTunerContinuousEvolutionMutationAggressiveness( + config + .getOrElse( + "tunerContinuousEvolutionMutationAggressiveness", + defaultMap("tunerContinuousEvolutionMutationAggressiveness") + ) + .toString + .toInt + ) + .setTunerContinuousEvolutionGeneticMixing( + config + .getOrElse( + "tunerContinuousEvolutionGeneticMixing", + defaultMap("tunerContinuousEvolutionGeneticMixing") + ) + .toString + .toDouble + ) + .setTunerContinuousEvolutionRollingImprovementCount( + config + .getOrElse( + "tunerContinuousEvolutionRollingImprovementCount", + defaultMap("tunerContinuousEvolutionRollingImprovementCount") + ) + .toString + .toInt + ) + .setTunerModelSeed( + 
config + .getOrElse("tunerModelSeed", defaultMap("tunerModelSeed")) + .asInstanceOf[Map[String, Any]] + ) + .setTunerHyperSpaceInferenceFlag( + config + .getOrElse( + "tunerHyperSpaceInferenceFlag", + defaultMap("tunerHyperSpaceInferenceFlag") + ) + .toString + .toBoolean + ) + .setTunerHyperSpaceInferenceCount( + config + .getOrElse( + "tunerHyperSpaceInferenceCount", + defaultMap("tunerHyperSpaceInferenceCount") + ) + .toString + .toInt + ) + .setTunerHyperSpaceModelCount( + config + .getOrElse( + "tunerHyperSpaceModelCount", + defaultMap("tunerHyperSpaceModelCount") + ) + .toString + .toInt + ) + .setTunerHyperSpaceModelType( + config + .getOrElse( + "tunerHyperSpaceModelType", + defaultMap("tunerHyperSpaceModelType") + ) + .toString + ) + .setTunerInitialGenerationMode( + config + .getOrElse( + "tunerInitialGenerationMode", + defaultMap("tunerInitialGenerationMode") + ) + .toString + ) + .setTunerInitialGenerationPermutationCount( + config + .getOrElse( + "tunerInitialGenerationPermutationCount", + defaultMap("tunerInitialGenerationPermutationCount") + ) + .toString + .toInt + ) + .setTunerInitialGenerationIndexMixingMode( + config + .getOrElse( + "tunerInitialGenerationIndexMixingMode", + defaultMap("tunerInitialGenerationIndexMixingMode") + ) + .toString + ) + .setTunerInitialGenerationArraySeed( + config + .getOrElse( + "tunerInitialGenerationArraySeed", + defaultMap("tunerInitialGenerationArraySeed") + ) + .toString + .toLong + ) + .setMlFlowLoggingFlag( + config + .getOrElse("mlFlowLoggingFlag", defaultMap("mlFlowLoggingFlag")) + .toString + .toBoolean + ) + .setMlFlowLogArtifactsFlag( + config + .getOrElse( + "mlFlowLogArtifactsFlag", + defaultMap("mlFlowLogArtifactsFlag") + ) + .toString + .toBoolean + ) + .setMlFlowTrackingURI( + config + .getOrElse("mlFlowTrackingURI", defaultMap("mlFlowTrackingURI")) + .toString + ) + .setMlFlowExperimentName( + config + .getOrElse("mlFlowExperimentName", defaultMap("mlFlowExperimentName")) + .toString + ) + 
.setMlFlowAPIToken( + config + .getOrElse("mlFlowAPIToken", defaultMap("mlFlowAPIToken")) + .toString + ) + .setMlFlowModelSaveDirectory( + config + .getOrElse( + "mlFlowModelSaveDirectory", + defaultMap("mlFlowModelSaveDirectory") + ) + .toString + ) + .setMlFlowLoggingMode( + config + .getOrElse("mlFlowLoggingMode", defaultMap("mlFlowLoggingMode")) + .toString + ) + .setMlFlowBestSuffix( + config + .getOrElse("mlFlowBestSuffix", defaultMap("mlFlowBestSuffix")) + .toString + ) + .setInferenceConfigSaveLocation( + config + .getOrElse( + "inferenceConfigSaveLocation", + defaultMap("inferenceConfigSaveLocation") + ) + .toString + ) + .setMlFlowCustomRunTags( + config + .getOrElse("mlFlowCustomRunTags", defaultMap("mlFlowCustomRunTags")) + .asInstanceOf[Map[String, AnyVal]] + ) + .setTunerDeltaCacheBackingDirectory( + config + .getOrElse( + "tunerDeltaCacheBackingDirectory", + defaultMap("tunerDeltaCacheBackingDirectory") + ) + .toString + ) + .setTunerDeltaCacheBackingDirectoryRemovalFlag( + config + .getOrElse( + "tunerDeltaCacheBackingDirectoryRemovalFlag", + defaultMap("tunerDeltaCacheBackingDirectoryRemovalFlag") + ) + .toString + .toBoolean + ) + .setSplitCachingStrategy( + config + .getOrElse("splitCachingStrategy", defaultMap("splitCachingStrategy")) + .toString + ) + + configObject.getInstanceConfig + + } + + def getDefaultConfigMap(modelFamily: String, + predictionType: String): Map[String, Any] = + defaultConfigMap(modelFamily, predictionType) + + def getConfigMapKeys: Iterable[String] = + defaultConfigMap("randomForest", "classifier").keys + + def printConfigMapKeys(): Unit = { getConfigMapKeys.foreach(println(_)) } + +} diff --git a/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationStructures.scala b/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationStructures.scala new file mode 100644 index 00000000..067b9ab9 --- /dev/null +++ 
b/src/main/scala/com/databricks/labs/automl/executor/config/ConfigurationStructures.scala @@ -0,0 +1,192 @@ +package com.databricks.labs.automl.executor.config + +object RegressorModels extends Enumeration { + type RegressorModels = Value + val TreesRegressor, GBTRegressor, LinearRegression, RandomForestRegressor, + SVM, XGBoostRegressor, LightGBMHuber, LightGBMFair, LightGBMLasso, + LightGBMRidge, LightGBMPoisson, LightGBMQuantile, LightGBMMape, + LightGBMTweedie, LightGBMGamma = Value +} + +object ClassiferModels extends Enumeration { + type ClassifierModels = Value + val TreesClassifier, GBTClassifier, LogisticRegression, MLPC, + RandomForestClassifier, XGBoostClassifier, LightGBMBinary, LightGBMMulti, + LightGBMMultiOVA = Value +} + +object ModelSelector extends Enumeration { + type ModelSelector = Value + val TreesRegressor, TreesClassifier, GBTRegressor, GBTClassifier, + LinearRegression, LogisticRegression, MLPC, RandomForestRegressor, + RandomForestClassifier, SVM, XGBoostRegressor, XGBoostClassifier, + LightGBMBinary, LightGBMMulti, LightGBMMultiOVA, LightGBMHuber, LightGBMFair, + LightGBMLasso, LightGBMRidge, LightGBMPoisson, LightGBMQuantile, LightGBMMape, + LightGBMTweedie, LightGBMGamma = Value +} + +object FamilyValidator extends Enumeration { + type FamilyValidator = Value + val Trees, NonTrees = Value +} + +object PredictionType extends Enumeration { + type PredictionType = Value + val Regressor, Classifier = Value +} + +/** + * + * @param labelCol + * @param featuresCol + * @param dateTimeConversionType + * @param fieldsToIgnoreInVector + * @param scoringMetric + * @param scoringOptimizationStrategy + */ +case class GenericConfig(var labelCol: String, + var featuresCol: String, + var dateTimeConversionType: String, + var fieldsToIgnoreInVector: Array[String], + var scoringMetric: String, + var scoringOptimizationStrategy: String) + +case class FeatureEngineeringConfig( + var dataPrepParallelism: Int, + var numericFillStat: String, + var 
characterFillStat: String, + var modelSelectionDistinctThreshold: Int, + var outlierFilterBounds: String, + var outlierLowerFilterNTile: Double, + var outlierUpperFilterNTile: Double, + var outlierFilterPrecision: Double, + var outlierContinuousDataThreshold: Int, + var outlierFieldsToIgnore: Array[String], + var pearsonFilterStatistic: String, + var pearsonFilterDirection: String, + var pearsonFilterManualValue: Double, + var pearsonFilterMode: String, + var pearsonAutoFilterNTile: Double, + var covarianceCorrelationCutoffLow: Double, + var covarianceCorrelationCutoffHigh: Double, + var scalingType: String, + var scalingMin: Double, + var scalingMax: Double, + var scalingStandardMeanFlag: Boolean, + var scalingStdDevFlag: Boolean, + var scalingPNorm: Double, + var featureImportanceCutoffType: String, + var featureImportanceCutoffValue: Double, + var dataReductionFactor: Double, + var cardinalitySwitch: Boolean, + var cardinalityType: String, + var cardinalityLimit: Int, + var cardinalityPrecision: Double, + var cardinalityCheckMode: String, + var filterPrecision: Double, + var categoricalNAFillMap: Map[String, String], + var numericNAFillMap: Map[String, AnyVal], + var characterNABlanketFillValue: String, + var numericNABlanketFillValue: Double, + var naFillMode: String, + var featureInteractionRetentionMode: String, + var featureInteractionContinuousDiscretizerBucketCount: Int, + var featureInteractionParallelism: Int, + var featureInteractionTargetInteractionPercentage: Double +) + +case class SwitchConfig(var naFillFlag: Boolean, + var varianceFilterFlag: Boolean, + var outlierFilterFlag: Boolean, + var pearsonFilterFlag: Boolean, + var covarianceFilterFlag: Boolean, + var oneHotEncodeFlag: Boolean, + var scalingFlag: Boolean, + var featureInteractionFlag: Boolean, + var dataPrepCachingFlag: Boolean, + var autoStoppingFlag: Boolean, + var pipelineDebugFlag: Boolean) + +case class TunerConfig(var tunerAutoStoppingScore: Double, + var tunerParallelism: Int, + var 
tunerKFold: Int, + var tunerTrainPortion: Double, + var tunerTrainSplitMethod: String, + var tunerKSampleSyntheticCol: String, + var tunerKSampleKGroups: Int, + var tunerKSampleKMeansMaxIter: Int, + var tunerKSampleKMeansTolerance: Double, + var tunerKSampleKMeansDistanceMeasurement: String, + var tunerKSampleKMeansSeed: Long, + var tunerKSampleKMeansPredictionCol: String, + var tunerKSampleLSHHashTables: Int, + var tunerKSampleLSHSeed: Long, + var tunerKSampleLSHOutputCol: String, + var tunerKSampleQuorumCount: Int, + var tunerKSampleMinimumVectorCountToMutate: Int, + var tunerKSampleVectorMutationMethod: String, + var tunerKSampleMutationMode: String, + var tunerKSampleMutationValue: Double, + var tunerKSampleLabelBalanceMode: String, + var tunerKSampleCardinalityThreshold: Int, + var tunerKSampleNumericRatio: Double, + var tunerKSampleNumericTarget: Int, + var tunerTrainSplitChronologicalColumn: String, + var tunerTrainSplitChronologicalRandomPercentage: Double, + var tunerSeed: Long, + var tunerFirstGenerationGenePool: Int, + var tunerNumberOfGenerations: Int, + var tunerNumberOfParentsToRetain: Int, + var tunerNumberOfMutationsPerGeneration: Int, + var tunerGeneticMixing: Double, + var tunerGenerationalMutationStrategy: String, + var tunerFixedMutationValue: Int, + var tunerMutationMagnitudeMode: String, + var tunerEvolutionStrategy: String, + var tunerGeneticMBORegressorType: String, + var tunerGeneticMBOCandidateFactor: Int, + var tunerContinuousEvolutionImprovementThreshold: Int, + var tunerContinuousEvolutionMaxIterations: Int, + var tunerContinuousEvolutionStoppingScore: Double, + var tunerContinuousEvolutionParallelism: Int, + var tunerContinuousEvolutionMutationAggressiveness: Int, + var tunerContinuousEvolutionGeneticMixing: Double, + var tunerContinuousEvolutionRollingImprovingCount: Int, + var tunerModelSeed: Map[String, Any], + var tunerHyperSpaceInference: Boolean, + var tunerHyperSpaceInferenceCount: Int, + var tunerHyperSpaceModelCount: Int, + 
var tunerHyperSpaceModelType: String, + var tunerInitialGenerationMode: String, + var tunerInitialGenerationPermutationCount: Int, + var tunerInitialGenerationIndexMixingMode: String, + var tunerInitialGenerationArraySeed: Long, + var tunerOutputDfRepartitionScaleFactor: Int, + var tunerDeltaCacheBackingDirectory: String, + var tunerDeltaCacheBackingDirectoryRemovalFlag: Boolean, + var splitCachingStrategy: String) + +case class AlgorithmConfig(var stringBoundaries: Map[String, List[String]], + var numericBoundaries: Map[String, (Double, Double)]) + +case class LoggingConfig(var mlFlowLoggingFlag: Boolean, + var mlFlowLogArtifactsFlag: Boolean, + var mlFlowTrackingURI: String, + var mlFlowExperimentName: String, + var mlFlowAPIToken: String, + var mlFlowModelSaveDirectory: String, + var mlFlowLoggingMode: String, + var mlFlowBestSuffix: String, + var inferenceConfigSaveLocation: String, + var mlFlowCustomRunTags: Map[String, String]) + +case class InstanceConfig( + var modelFamily: String, + var predictionType: String, + var genericConfig: GenericConfig, + var switchConfig: SwitchConfig, + var featureEngineeringConfig: FeatureEngineeringConfig, + var algorithmConfig: AlgorithmConfig, + var tunerConfig: TunerConfig, + var loggingConfig: LoggingConfig +) diff --git a/src/main/scala/com/databricks/labs/automl/executor/config/ModelDefaults.scala b/src/main/scala/com/databricks/labs/automl/executor/config/ModelDefaults.scala new file mode 100644 index 00000000..0b47d77b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/executor/config/ModelDefaults.scala @@ -0,0 +1,112 @@ +package com.databricks.labs.automl.executor.config + +object ModelDefaults { + + protected[config] def randomForestNumeric: Map[String, (Double, Double)] = + Map( + "numTrees" -> Tuple2(50.0, 1000.0), + "maxBins" -> Tuple2(10.0, 500.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ) + + protected[config] def 
randomForestString: Map[String, List[String]] = Map( + "impurity" -> List("gini", "entropy"), + "featureSubsetStrategy" -> List("auto") + ) + + protected[config] def treesNumeric: Map[String, (Double, Double)] = Map( + "maxBins" -> Tuple2(10.0, 500.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "minInstancesPerNode" -> Tuple2(1.0, 50.0) + ) + + protected[config] def treesString: Map[String, List[String]] = Map( + "impurity" -> List("gini", "entropy") + ) + + protected[config] def xgBoostNumeric: Map[String, (Double, Double)] = Map( + "alpha" -> Tuple2(0.0, 1.0), + "eta" -> Tuple2(0.1, 0.5), + "gamma" -> Tuple2(0.0, 10.0), + "lambda" -> Tuple2(0.1, 10.0), + "maxDepth" -> Tuple2(3.0, 10.0), + "subSample" -> Tuple2(0.4, 0.6), + "minChildWeight" -> Tuple2(0.1, 10.0), + "numRound" -> Tuple2(5.0, 25.0), + "maxBins" -> Tuple2(25.0, 500.0), + "trainTestRatio" -> Tuple2(0.2, 0.8) + ) + + protected[config] def mlpcNumeric: Map[String, (Double, Double)] = Map( + "layers" -> Tuple2(1.0, 10.0), + "maxIter" -> Tuple2(10.0, 100.0), + "stepSize" -> Tuple2(0.01, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5), + "hiddenLayerSizeAdjust" -> Tuple2(0.0, 50.0) + ) + + protected[config] def mlpcString: Map[String, List[String]] = Map( + "solver" -> List("gd", "l-bfgs") + ) + + protected[config] def gbtNumeric: Map[String, (Double, Double)] = Map( + "maxBins" -> Tuple2(10.0, 500.0), + "maxIter" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "minInstancesPerNode" -> Tuple2(1.0, 50.0), + "stepSize" -> Tuple2(1E-4, 1.0) + ) + + protected[config] def gbtString: Map[String, List[String]] = + Map("impurity" -> List("gini", "entropy"), "lossType" -> List("logistic")) + + protected[config] def linearRegressionNumeric: Map[String, (Double, Double)] = + Map( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + + 
protected[config] def linearRegressionString: Map[String, List[String]] = Map( + "loss" -> List("squaredError", "huber") + ) + + protected[config] def logisticRegressionNumeric + : Map[String, (Double, Double)] = Map( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + + protected[config] def svmNumeric: Map[String, (Double, Double)] = Map( + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + + protected[config] def lightGBMnumeric: Map[String, (Double, Double)] = Map( + "baggingFraction" -> Tuple2(0.5, 1.0), + "baggingFreq" -> Tuple2(0.0, 1.0), + "featureFraction" -> Tuple2(0.6, 1.0), + "learningRate" -> Tuple2(1E-8, 1.0), + "maxBin" -> Tuple2(50, 1000), + "maxDepth" -> Tuple2(3.0, 20.0), + "minSumHessianInLeaf" -> Tuple2(1e-5, 50.0), + "numIterations" -> Tuple2(25.0, 250.0), + "numLeaves" -> Tuple2(10.0, 50.0), + "lambdaL1" -> Tuple2(0.0, 1.0), + "lambdaL2" -> Tuple2(0.0, 1.0), + "alpha" -> Tuple2(0.0, 1.0) + ) + + protected[config] def lightGBMString: Map[String, List[String]] = Map( + "boostingType" -> List("gbdt", "rf", "dart", "goss") + ) + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/FeatureImportances.scala b/src/main/scala/com/databricks/labs/automl/exploration/FeatureImportances.scala new file mode 100644 index 00000000..70028aad --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/FeatureImportances.scala @@ -0,0 +1,425 @@ +package com.databricks.labs.automl.exploration + +import com.databricks.labs.automl.exploration.structures.{ + FeatureImportanceConfig, + FeatureImportanceOutput, + FeatureImportanceReturn, + FeatureImportanceTools +} +import com.databricks.labs.automl.feature.FeatureInteraction +import com.databricks.labs.automl.model.tools.split.{ + DataSplitCustodial, + DataSplitUtility +} +import 
com.databricks.labs.automl.model.{RandomForestTuner, XGBoostTuner} +import com.databricks.labs.automl.pipeline.FeaturePipeline +import com.databricks.labs.automl.sanitize.DataSanitizer +import com.databricks.labs.automl.utils.SparkSessionWrapper +import ml.dmlc.xgboost4j.scala.spark.{ + XGBoostClassificationModel, + XGBoostRegressionModel +} +import org.apache.spark.ml.classification.RandomForestClassificationModel +import org.apache.spark.ml.regression.RandomForestRegressionModel +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +/** + * + * @param data DataFrame: The data used to determine which fields are most important when predicting a label column. + * @param config FeatureImportanceConfig: A configuration object for the Feature Importance run. + * @param cutoffType String: The type of cutoff to use, either: 'count', 'none', or 'value'
+ * @note Count => The top n most important fields, matching the naming of the original data, + * are returned in descending order.
+ * None => A sorted list of columns, in descending order of importance, is returned.
+ * Value => All values above the thresholded value set in cutoffValue are returned + * in descending order. + * @param cutoffValue Double: Linked to cutoffType, providing a threshold value for how many fields to return. + * @note for cutoffType 'None', this value can be set to 0.0 + * @example ``` + * + * val genericMapOverrides = Map("labelCol" -> "label", "tunerKFold" -> 2, "tunerTrainSplitMethod" -> + * "stratified", "tunerNumberOfGenerations" -> 4, "tunerNumberOfMutationsPerGeneration" -> 6, + * "tunerInitialGenerationPermutationCount" -> 25, + * "fieldsToIgnoreInVector" -> Array("final_weight"),"tunerInitialGenerationMode" -> "permutations") + * + * val xgbConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides) + * + * val featConfig = ConfigurationGenerator.generateFeatureImportanceConfig(xgbConfig) + * + * val importances = new FeatureImportances(data, featConfig, "count", 5.0).generateFeatureImportances() + * ``` + */ +class FeatureImportances(data: DataFrame, + config: FeatureImportanceConfig, + cutoffType: String, + cutoffValue: Double) + extends FeatureImportanceTools + with SparkSessionWrapper { + + import com.databricks.labs.automl.exploration.structures.FeatureImportanceModelFamily._ + import com.databricks.labs.automl.exploration.structures.ModelType._ + import com.databricks.labs.automl.exploration.structures.CutoffTypes._ + + private val cutOff = cutoffTypeEvaluator(cutoffType) + private val modelFamily = featureImportanceFamilyEvaluator( + config.featureImportanceModelFamily + ) + private val modelType = modelTypeEvaluator(config.modelType) + + val importanceCol = "Importance" + val featureCol = "Feature" + + private def fillNaValues(): DataFrame = { + + val (cleanedData, fillMap, modelDetectedType) = new DataSanitizer(data) + .setLabelCol(config.labelCol) + .setFeatureCol(config.featuresCol) + .setNumericFillStat(config.numericFillStat) + .setCharacterFillStat(config.characterFillStat) + 
.setModelSelectionDistinctThreshold( + config.modelSelectionDistinctThreshold + ) + .setFieldsToIgnoreInVector(config.fieldsToIgnore) + .setParallelism(config.dataPrepParallelism) + .setFilterPrecision(0.01) + .generateCleanData() + + cleanedData + } + + private def createFeatureVector(df: DataFrame): FeatureImportanceOutput = { + + val (pipelinedData, vectorFields, totalFields) = new FeaturePipeline(df) + .setLabelCol(config.labelCol) + .setFeatureCol(config.featuresCol) + .setDateTimeConversionType(config.dateTimeConversionType) + .makeFeaturePipeline(config.fieldsToIgnore) + + new FeatureImportanceOutput { + override def data: DataFrame = pipelinedData + + override def fieldsInVector: Array[String] = vectorFields + + override def allFields: Array[String] = totalFields + } + + } + + /** + * Private method for interacting features to determine if they'll be useful in a model + * @param vectorPayload DataFrame payload with feature vector already created + * @return FeatureImportanceOutput with the DataFrame supporting potential additions to the feature vector + */ + private def interactFeatures( + vectorPayload: FeatureImportanceOutput + ): FeatureImportanceOutput = { + + val nominalFeatures = vectorPayload.fieldsInVector + .filter(x => x.takeRight(3) == "_si") + .filterNot(x => x.contains(config.labelCol)) + + val continuousFeatures = vectorPayload.fieldsInVector + .diff(nominalFeatures) + .filterNot(_.contains(config.labelCol)) + .filterNot(_.contains(config.featuresCol)) + + val interaction = FeatureInteraction.interactFeatures( + vectorPayload.data, + nominalFeatures, + continuousFeatures, + config.modelType, + config.featureInteractionRetentionMode, + config.labelCol, + config.featuresCol, + config.featureInteractionContinuousDiscretizerBucketCount, + config.featureInteractionParallelism, + config.featureInteractionTargetInteractionPercentage + ) + + new FeatureImportanceOutput { + override def data: DataFrame = interaction.data + + override def 
fieldsInVector: Array[String] = + interaction.fullFeatureVectorColumns + + override def allFields: Array[String] = + interaction.data.schema.names + .filterNot(_.contains(config.labelCol)) + .filterNot(_.contains(config.featuresCol)) + } + + } + + private def cleanFieldNames(fields: Array[String]): Array[String] = { + fields.map { x => + x.takeRight(3) match { + case "_si" => x.dropRight(3) + case "_oh" => x.dropRight(3) + case _ => x + } + } + } + + private def getImportances( + df: DataFrame, + vectorFields: Array[String] + ): Map[String, Double] = { + + val adjustedFieldNames = cleanFieldNames(vectorFields) + + val splitData = DataSplitUtility.split( + df, + config.kFold, + config.trainSplitMethod, + config.labelCol, + config.deltaCacheBackingDirectory, + config.splitCachingStrategy, + config.featureImportanceModelFamily, + config.parallelism, + config.trainPortion, + "syntheticColumn", + config.trainSplitChronologicalColumn, + config.trainSplitChronlogicalRandomPercentage, + config.dataReductionFactor + ) + + val result = modelFamily match { + case RandomForest => + val rfModel = new RandomForestTuner(df, splitData, config.modelType) + .setLabelCol(config.labelCol) + .setFeaturesCol(config.featuresCol) + .setRandomForestNumericBoundaries(config.numericBoundaries) + .setRandomForestStringBoundaries(config.stringBoundaries) + .setScoringMetric(config.scoringMetric) + .setTrainPortion(config.trainPortion) + .setTrainSplitMethod(config.trainSplitMethod) + .setTrainSplitChronologicalColumn( + config.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + config.trainSplitChronlogicalRandomPercentage + ) + .setParallelism(config.parallelism) + .setKFold(config.kFold) + .setSeed(config.seed) + .setOptimizationStrategy(config.scoringOptimizationStrategy) + .setFirstGenerationGenePool(config.firstGenerationGenePool) + .setNumberOfMutationGenerations(config.numberOfGenerations) + .setNumberOfMutationsPerGeneration( + 
config.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain(config.numberOfParentsToRetain) + .setGeneticMixing(config.geneticMixing) + .setGenerationalMutationStrategy(config.generationalMutationStrategy) + .setMutationMagnitudeMode(config.mutationMagnitudeMode) + .setFixedMutationValue(config.fixedMutationValue) + .setEarlyStoppingScore(config.autoStoppingScore) + .setEarlyStoppingFlag(config.autoStoppingFlag) + .setEvolutionStrategy(config.evolutionStrategy) + .setContinuousEvolutionMaxIterations( + config.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + config.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + config.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + config.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + config.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + config.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(config.dataReductionFactor) + .setFirstGenMode(config.firstGenMode) + .setFirstGenPermutations(config.firstGenPermutations) + .setFirstGenIndexMixingMode(config.firstGenIndexMixingMode) + .setFirstGenArraySeed(config.firstGenArraySeed) + .evolveBest() + .model + modelType match { + case Regressor => + val importances = rfModel + .asInstanceOf[RandomForestRegressionModel] + .featureImportances + .toArray + adjustedFieldNames.zip(importances).toMap[String, Double] + + case Classifier => + val importances = rfModel + .asInstanceOf[RandomForestClassificationModel] + .featureImportances + .toArray + adjustedFieldNames.zip(importances).toMap[String, Double] + } + case XGBoost => + val xgModel = new XGBoostTuner(df, splitData, config.modelType) + .setLabelCol(config.labelCol) + .setFeaturesCol(config.featuresCol) + .setXGBoostNumericBoundaries(config.numericBoundaries) + .setScoringMetric(config.scoringMetric) + 
.setTrainPortion(config.trainPortion) + .setTrainSplitMethod(config.trainSplitMethod) + .setTrainSplitChronologicalColumn( + config.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + config.trainSplitChronlogicalRandomPercentage + ) + .setParallelism(config.parallelism) + .setKFold(config.kFold) + .setSeed(config.seed) + .setOptimizationStrategy(config.scoringOptimizationStrategy) + .setFirstGenerationGenePool(config.firstGenerationGenePool) + .setNumberOfMutationGenerations(config.numberOfGenerations) + .setNumberOfMutationsPerGeneration( + config.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain(config.numberOfParentsToRetain) + .setGeneticMixing(config.geneticMixing) + .setGenerationalMutationStrategy(config.generationalMutationStrategy) + .setMutationMagnitudeMode(config.mutationMagnitudeMode) + .setFixedMutationValue(config.fixedMutationValue) + .setEarlyStoppingScore(config.autoStoppingScore) + .setEarlyStoppingFlag(config.autoStoppingFlag) + .setEvolutionStrategy(config.evolutionStrategy) + .setContinuousEvolutionMaxIterations( + config.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + config.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + config.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + config.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + config.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + config.continuousEvolutionRollingImprovementCount + ) + .setDataReductionFactor(config.dataReductionFactor) + .setFirstGenMode(config.firstGenMode) + .setFirstGenPermutations(config.firstGenPermutations) + .setFirstGenIndexMixingMode(config.firstGenIndexMixingMode) + .setFirstGenArraySeed(config.firstGenArraySeed) + .evolveBest() + .model + modelType match { + case Regressor => + xgModel + .asInstanceOf[XGBoostRegressionModel] + 
.nativeBooster + .getFeatureScore(adjustedFieldNames) + .map { case (k, v) => k -> v.toDouble } + .toMap + case Classifier => + xgModel + .asInstanceOf[XGBoostClassificationModel] + .nativeBooster + .getFeatureScore(adjustedFieldNames) + .map { case (k, v) => k -> v.toDouble } + .toMap + } + } + + DataSplitCustodial.cleanCachedInstances(splitData, config) + + result + } + + private def getTopFeaturesCount(featureDataFrame: DataFrame, + featureCount: Int): Array[String] = { + featureDataFrame + .sort(col(importanceCol).desc) + .limit(featureCount) + .collect() + .map(x => x(0).toString) + } + + private def getTopFeaturesValue(featureDataFrame: DataFrame, + importanceValue: Double): Array[String] = { + featureDataFrame + .filter(col(importanceCol) >= importanceValue) + .sort(col(importanceCol).desc) + .collect() + .map(x => x(0).toString) + } + + private def getAllImportances(featureDataFrame: DataFrame): Array[String] = { + featureDataFrame + .sort(col(importanceCol).desc) + .collect() + .map(x => x(0).toString) + } + + /** + * Main method for retrieving Feature Importances + * @return FeatureImportanceReturn: DataFrame of Importances, sorted top fields in an Array[String], the raw data + * with vector applied, the fields included in the vector as Array[String], and all fields as Array[String] + */ + def generateFeatureImportances(): FeatureImportanceReturn = { + + import spark.implicits._ + + val cleanedData = fillNaValues() + + val vectorOutput = createFeatureVector(cleanedData) + + val interactionSwitch = if (config.featureInteractionFlag) { + interactFeatures(vectorOutput) + } else vectorOutput + + val importances = + getImportances(interactionSwitch.data, interactionSwitch.fieldsInVector) + + val importancesDF = importances.toSeq + .toDF(featureCol, importanceCol) + .orderBy(col(importanceCol).desc) + + val importancesDFOutput = modelFamily match { + case XGBoost => + importancesDF + .withColumn(featureCol, split(col(featureCol), "_si$")(0)) + case _ => + 
importancesDF + .withColumn(importanceCol, col(importanceCol) * 100.0) + .withColumn(featureCol, split(col(featureCol), "_si$")(0)) + } + + val topFieldArray = cutOff match { + case Count => getTopFeaturesCount(importancesDF, cutoffValue.toInt) + case Threshold => getTopFeaturesValue(importancesDF, cutoffValue) + case None => getAllImportances(importancesDF) + } + + new FeatureImportanceReturn( + importances = importancesDFOutput, + topFields = topFieldArray + ) { + override def data: DataFrame = vectorOutput.data + + override def fieldsInVector: Array[String] = vectorOutput.fieldsInVector + + override def allFields: Array[String] = vectorOutput.allFields + } + } +} + +object FeatureImportances { + + def apply(data: DataFrame, + config: FeatureImportanceConfig, + cutoffType: String, + cutoffValue: Double): FeatureImportances = + new FeatureImportances(data, config, cutoffType, cutoffValue) + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/structures/FeatureImportanceConfig.scala b/src/main/scala/com/databricks/labs/automl/exploration/structures/FeatureImportanceConfig.scala new file mode 100644 index 00000000..85681f2e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/structures/FeatureImportanceConfig.scala @@ -0,0 +1,55 @@ +package com.databricks.labs.automl.exploration.structures + +case class FeatureImportanceConfig( + labelCol: String, + featuresCol: String, + dataPrepParallelism: Int, + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]], + scoringMetric: String, + trainPortion: Double, + trainSplitMethod: String, + trainSplitChronologicalColumn: String, + trainSplitChronlogicalRandomPercentage: Double, + parallelism: Int, + kFold: Int, + seed: Long, + scoringOptimizationStrategy: String, + firstGenerationGenePool: Int, + numberOfGenerations: Int, + numberOfMutationsPerGeneration: Int, + numberOfParentsToRetain: Int, + geneticMixing: Double, + 
generationalMutationStrategy: String, + mutationMagnitudeMode: String, + fixedMutationValue: Int, + autoStoppingScore: Double, + autoStoppingFlag: Boolean, + evolutionStrategy: String, + continuousEvolutionMaxIterations: Int, + continuousEvolutionStoppingScore: Double, + continuousEvolutionParallelism: Int, + continuousEvolutionMutationAggressiveness: Int, + continuousEvolutionGeneticMixing: Double, + continuousEvolutionRollingImprovementCount: Int, + dataReductionFactor: Double, + firstGenMode: String, + firstGenPermutations: Int, + firstGenIndexMixingMode: String, + firstGenArraySeed: Long, + fieldsToIgnore: Array[String], + numericFillStat: String, + characterFillStat: String, + modelSelectionDistinctThreshold: Int, + dateTimeConversionType: String, + modelType: String, + featureImportanceModelFamily: String, + featureInteractionFlag: Boolean, + featureInteractionRetentionMode: String, + featureInteractionContinuousDiscretizerBucketCount: Int, + featureInteractionParallelism: Int, + featureInteractionTargetInteractionPercentage: Double, + deltaCacheBackingDirectory: String, + deltaCacheBackingDirectoryRemovalFlag: Boolean, + splitCachingStrategy: String +) diff --git a/src/main/scala/com/databricks/labs/automl/exploration/structures/FeatureImportanceTools.scala b/src/main/scala/com/databricks/labs/automl/exploration/structures/FeatureImportanceTools.scala new file mode 100644 index 00000000..1d9eb05d --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/structures/FeatureImportanceTools.scala @@ -0,0 +1,47 @@ +package com.databricks.labs.automl.exploration.structures + +trait FeatureImportanceTools { + + import com.databricks.labs.automl.exploration.structures.CutoffTypes._ + import com.databricks.labs.automl.exploration.structures.FeatureImportanceModelFamily._ + import com.databricks.labs.automl.exploration.structures.ModelType._ + + private[exploration] def cutoffTypeEvaluator(value: String): CutoffTypes = { + + 
value.toLowerCase.replaceAll("\\s", "") match { + case "none" => None + case "value" => Threshold + case "count" => Count + case _ => + throw new IllegalArgumentException( + s"$value is not supported! Must be one of: 'none', 'value', or " + + s"'count' " + ) + } + } + + private[exploration] def featureImportanceFamilyEvaluator( + value: String + ): FeatureImportanceModelFamily = { + value.toLowerCase.replaceAll("\\s", "") match { + case "randomforest" => RandomForest + case "xgboost" => XGBoost + case _ => + throw new IllegalArgumentException( + s"$value is not supported! Must be either 'RandomForest' or 'XGBoost'" + ) + } + } + + private[exploration] def modelTypeEvaluator(value: String): ModelType = { + value.toLowerCase.replaceAll("\\s", "") match { + case "regressor" => Regressor + case "classifier" => Classifier + case _ => + throw new IllegalArgumentException( + s"$value is not supported! Must be either 'Regressor' or 'Classifier'" + ) + } + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/structures/Utilities.scala b/src/main/scala/com/databricks/labs/automl/exploration/structures/Utilities.scala new file mode 100644 index 00000000..31891a52 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/structures/Utilities.scala @@ -0,0 +1,29 @@ +package com.databricks.labs.automl.exploration.structures + +import org.apache.spark.sql.DataFrame + +object CutoffTypes extends Enumeration { + type CutoffTypes = Value + val None, Threshold, Count = Value +} + +object FeatureImportanceModelFamily extends Enumeration { + type FeatureImportanceModelFamily = Value + val RandomForest, XGBoost = Value +} + +object ModelType extends Enumeration { + type ModelType = Value + val Regressor, Classifier = Value +} + +sealed trait FIReturn { + def data: DataFrame + def fieldsInVector: Array[String] + def allFields: Array[String] +} + +abstract case class FeatureImportanceOutput() extends FIReturn +abstract case class 
FeatureImportanceReturn(importances: DataFrame, + topFields: Array[String]) + extends FIReturn diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/AnovaTest.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/AnovaTest.scala new file mode 100644 index 00000000..d746a1c6 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/AnovaTest.scala @@ -0,0 +1,3 @@ +package com.databricks.labs.automl.exploration.tools + +object AnovaTest {} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/ExplorationStructures.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/ExplorationStructures.scala new file mode 100644 index 00000000..45661fb6 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/ExplorationStructures.scala @@ -0,0 +1,110 @@ +package com.databricks.labs.automl.exploration.tools + +import org.apache.commons.math3.analysis.polynomials.PolynomialFunction +import org.apache.commons.math3.distribution.RealDistribution +import org.apache.commons.math3.stat.descriptive.SummaryStatistics + +case class SummaryStats(count: Long, + min: Double, + max: Double, + sum: Double, + mean: Double, + geometricMean: Double, + variance: Double, + popVariance: Double, + secondMoment: Double, + sumOfSquares: Double, + stdDeviation: Double, + sumOfLogs: Double) + +case class DistributionValidationData(test: String, + pValue: Double, + dStatistic: Double) + +case class DistributionTestPayload(testName: String, + distribution: RealDistribution) + +case class DistributionTestResult(bestDistributionFit: String, + bestDistributionPValue: Double, + bestDistributionDStatistic: Double, + allTests: Array[DistributionValidationData]) + +case class OneDimStatsData(mean: Double, + geomMean: Double, + variance: Double, + semiVariance: Double, + stddev: Double, + skew: Double, + kurtosis: Double, + kurtosisType: String, + skewType: String, + summaryStats: SummaryStats, + 
distributionData: DistributionTestResult) + +case class ShapiroScoreData(w: Double, z: Double, probability: Double) + +case class ShapiroInternalData(w: Double, + z: Double, + probability: Double, + normalcyTest: Boolean, + normalcy: String) + +case class RegressionInternal(sumX: Double, + sumY: Double, + sumSqX: Double, + sumSqY: Double, + sumProduct: Double) + +case class RegressionCoefficients(slope: Double, + intercept: Double, + t1: Double, + t2: Double, + t3: Double) + +case class RegressionBarData(xBar: Double, yBar: Double, xyBar: Double) +case class RegressionResidualData(ssr: Double, rss: Double) + +case class SimpleRegressorResult(slope: Double, + slopeStdErr: Double, + slopeConfidenceInterval: Double, + intercept: Double, + interceptStdErr: Double, + rSquared: Double, + significance: Double, + mse: Double, + rmse: Double, + sumSquares: Double, + totalSumSquares: Double, + sumSquareError: Double, + pairLength: Long, + pearsonR: Double, + crossProductSum: Double) + +case class PolynomialRegressorResult(order: Int, + function: PolynomialFunction, + residualSumSquares: Double, + sumSquareError: Double, + totalSumSquares: Double, + mse: Double, + rmse: Double, + r2: Double) + +case class PairedSeq(left: SummaryStatistics, right: SummaryStatistics) +case class TTestData(alpha: Double, + tStat: Double, + tTestSignificance: Boolean, + tTestPValue: Double, + equivalencyJudgement: Char) + +case class KSTestResult(ksTestPvalue: Double, + ksTestDStatistic: Double, + ksTestEquivalency: Char) + +case class CorrelationTestResult(covariance: Double, + pearsonCoefficient: Double, + spearmanCoefficient: Double, + kendallsTauCoefficient: Double) + +case class PairedTestResult(correlationTestData: CorrelationTestResult, + tTestData: TTestData, + kolmogorovSmirnovData: KSTestResult) diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/OneDimStats.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/OneDimStats.scala new file mode 100644 
index 00000000..9cb0e58e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/OneDimStats.scala @@ -0,0 +1,238 @@ +package com.databricks.labs.automl.exploration.tools + +import org.apache.commons.math3.distribution +import org.apache.commons.math3.distribution._ +import org.apache.commons.math3.stat.descriptive.SummaryStatistics +import org.apache.commons.math3.stat.descriptive.moment._ +import org.apache.commons.math3.stat.inference.TestUtils + +/** + * Package for testing a one-dimensional data set with standard data explanation metrics + * + * Attributes tested: + * Mean + * Geometric Mean + * Variance + * Semi Variance + * Standard Deviation + * Skew + * Kurtosis + * + * The return type also provides String classifications based on thresholds for Skew and Kurtosis to classify the + * distribution. + * + * @note + * Kurtosis types: + * Mesokurtic - excess kurtosis around zero + * Leptokurtic - positive excess kurtosis (long heavy tails) + * Platykurtic - negative excess kurtosis (short thin tails) + * + * + * Skewness types: + * Symmetrical -> normal + * Asymmetrical Positive skewness -> right tailed + * Asymmetrical Negative skewness -> left tailed + * @param data Array[Double] to test one dimensional stats for + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ +class OneDimStats(data: Array[Double]) { + + final private val SKEW_LOWER_BOUND = -1 + final private val SKEW_UPPER_BOUND = 1 + final private val KURTOSIS_LOWER_BOUND = -0.5 + final private val KURTOSIS_UPPER_BOUND = 0.5 + + /** + * Private method for calculating Kurtosis of the data set + * @return Excess Kurtosis (k - 3), which is approximately zero for normally distributed data + */ + private def calculateKurtosis: Double = new Kurtosis().evaluate(data) + + /** + * Private method for calculating Skewness of the data set + * @return Skew value + */ + private def calculateSkew: Double = new Skewness().evaluate(data) + + /** + * Private method for calculating Variance of the data set + * @return Variance value + */ + private def 
calculateVariance: Double = new Variance().evaluate(data) + + /** + * Private method for calculating Mean of the data set + * @return Mean value + */ + private def calculateMean: Double = new Mean().evaluate(data) + + /** + * Private method for calculating the geometric mean and resetting in the event that the series is negative and + * a geometric mean cannot be calculated + * @return Geometric Mean value or 0.0 if invalid + */ + private def calculateGeometricMean: Double = { + val gm = new GeometricMean().evaluate(data) + gm match { + case x if x.isNaN => 0.0 + case _ => gm + } + } + + /** + * Private method for calculating semi-variance (variance of values less than the mean) for the data set + * @return Semi Variance value + */ + private def calculateSemiVariance: Double = new SemiVariance().evaluate(data) + + /** + * Private method for calculating the standard deviation for the data set + * @return Standard Deviation value + */ + private def calculateStandardDeviation: Double = + new StandardDeviation().evaluate(data) + + /** + * Private method for returning the statistics evaluator for the data series + * @return SummaryStats payload + */ + private def getSummaryStatistics: SummaryStats = { + + val stats = new SummaryStatistics() + data.foreach(x => stats.addValue(x)) + + SummaryStats( + count = stats.getN, + min = stats.getMin, + max = stats.getMax, + sum = stats.getSum, + mean = stats.getMean, + geometricMean = stats.getGeometricMean, + variance = stats.getVariance, + popVariance = stats.getPopulationVariance, + secondMoment = stats.getSecondMoment, + sumOfSquares = stats.getSumsq, + stdDeviation = stats.getStandardDeviation, + sumOfLogs = stats.getSumOfLogs + ) + + } + + /** + * Helper method for comparing a distribution to the data series in order to get the p-value and the d-statistic + * of difference from the 'shape' of the distribution. + * @param test The payload consisting of test name (for pass-through) and the RealDistribution under test. 
+ * @return DistributionValidationData containing the test name, the Kolmogorov-Smirnov p-value, and the D-statistic + */ + private def compareKolmogorovSmirnov( + test: DistributionTestPayload + ): DistributionValidationData = { + + val p = TestUtils.kolmogorovSmirnovTest(test.distribution, data) + val d = TestUtils.kolmogorovSmirnovStatistic(test.distribution, data) + + DistributionValidationData(test.testName, p, d) + } + + /** + * Method for comparing a data series to a set of standard distributions + * @return DistributionTestResult, consisting of, at a top-level, the best fit + * based on p-value of similarity, the p-value, and the D-statistic. + * Also returned are all of the different distribution test results for the series (for performing + * distributed analytic roll-up on a distributed DataFrame's partitions) + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def compareToDistributions: DistributionTestResult = { + + val tests = Array( + DistributionTestPayload("normal", new NormalDistribution(0, 1)), + DistributionTestPayload("standardBetaPrime", new BetaDistribution(1, 1)), + DistributionTestPayload("cauchy", new CauchyDistribution(0, 1)), + DistributionTestPayload("standardChiSq", new ChiSquaredDistribution(1)), + DistributionTestPayload("exponential", new ExponentialDistribution(1)), + DistributionTestPayload("f", new FDistribution(1, 1)), + DistributionTestPayload("erlang", new GammaDistribution(1, 2)), + DistributionTestPayload("gumbel", new GumbelDistribution(1, 2)), + DistributionTestPayload("laplace", new LaplaceDistribution(0, 1)), + DistributionTestPayload("levy", new LevyDistribution(0, 1)), + DistributionTestPayload("logistic", new LogisticDistribution(0, 1)), + DistributionTestPayload("logNormal", new LogNormalDistribution(0, 1)), + DistributionTestPayload("nakagami", new NakagamiDistribution(1, 1)), + DistributionTestPayload( + "pareto", + new distribution.ParetoDistribution(1, 1) + ), + DistributionTestPayload("studentT", new TDistribution(1)), + DistributionTestPayload("weibull", new WeibullDistribution(1, 1)) + ) + + val 
allTests = tests.map(x => compareKolmogorovSmirnov(x)) + + val bestFit = allTests.maxBy(_.pValue) + + DistributionTestResult( + bestDistributionFit = bestFit.test, + bestDistributionPValue = bestFit.pValue, + bestDistributionDStatistic = bestFit.dStatistic, + allTests = allTests + ) + + } + + /** + * Public method for executing analysis of all metrics + * @return OneDimStatsData, containing all of the information for the analysis of the One Dimensional series of data. + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def evaluate: OneDimStatsData = { + + val kurtosis = calculateKurtosis + val skew = calculateSkew + + val kurtosisType = kurtosis match { + case x if x <= KURTOSIS_LOWER_BOUND => "Platykurtic" + case x if x > KURTOSIS_LOWER_BOUND && x < KURTOSIS_UPPER_BOUND => + "Mesokurtic" + case x if x >= KURTOSIS_UPPER_BOUND => "Leptokurtic" + case _ => throw new IllegalArgumentException("Unsupported Kurtosis range") + } + + val skewType = skew match { + case x if x <= SKEW_LOWER_BOUND => "Asymmetrical Left Tailed" + case x if x > SKEW_LOWER_BOUND && x < SKEW_UPPER_BOUND => + "Symmetric Normal" + case x if x >= SKEW_UPPER_BOUND => "Asymmetrical Right Tailed" + case _ => throw new IllegalArgumentException("Unsupported Skew Range") + } + + OneDimStatsData( + calculateMean, + calculateGeometricMean, + calculateVariance, + calculateSemiVariance, + calculateStandardDeviation, + skew, + kurtosis, + kurtosisType, + skewType, + getSummaryStatistics, + compareToDistributions + ) + + } + +} + +/** + * Companion Object + */ +object OneDimStats { + + def evaluate(data: Array[Double]): OneDimStatsData = { + new OneDimStats(data).evaluate + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/PCAReducer.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/PCAReducer.scala new file mode 100644 index 00000000..854211a8 --- /dev/null +++ 
b/src/main/scala/com/databricks/labs/automl/exploration/tools/PCAReducer.scala @@ -0,0 +1,22 @@ +package com.databricks.labs.automl.exploration.tools + +import org.apache.spark.sql.DataFrame + +class PCAReducer(data: DataFrame) { + + //TODO in progress AML- + + final private val K_VALUE = 2 + + var labelColumn = "label" + + def setLabelColumn(value: String): this.type = { + labelColumn = value + this + } + + def getLabelColumn: String = labelColumn + + // create feature vector + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/PairedTesting.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/PairedTesting.scala new file mode 100644 index 00000000..a8cb4bfb --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/PairedTesting.scala @@ -0,0 +1,179 @@ +package com.databricks.labs.automl.exploration.tools + +import org.apache.commons.math3.stat.correlation.{ + Covariance, + KendallsCorrelation, + PearsonsCorrelation, + SpearmansCorrelation +} +import org.apache.commons.math3.stat.descriptive.SummaryStatistics +import org.apache.commons.math3.stat.inference.{TTest, TestUtils} + +class PairedTesting(left: Array[Double], + right: Array[Double], + alpha: Double = 0.05) { + + /** + * Private helper method for creating the PairedSeq construct for calculating t tests + * @return PairedSeq of SummaryStatistics Instances for both data series + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def createPairedStatistics: PairedSeq = { + + assert( + left.length == right.length, + s"Length of pairs is not equal. 
Left: ${left.length} Right: ${right.length}" + ) + + val leftStats = new SummaryStatistics() + val rightStats = new SummaryStatistics() + + left.foreach(x => leftStats.addValue(x)) + right.foreach(x => rightStats.addValue(x)) + PairedSeq(leftStats, rightStats) + + } + + /** + * Method for determining the t-test values for comparing if the mean values are equivalent between two sequences + * of Doubles. + * @return TTestData payload, consisting of the alpha that was used, the t-stat value, significance determination, + * significance p-value, and a judgement of equivalency (Y or N) + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def computePairedTTest: TTestData = { + + val pairData = createPairedStatistics + + val tStatisticValue = TestUtils.t(pairData.left, pairData.right) + val tStatisticSignificance = + TestUtils.pairedTTest(left, right, alpha) + val tStatisticPValue = new TTest().pairedTTest(left, right) + + val equivalencyJudgement = tStatisticSignificance match { + case x if x => "N" + case _ => "Y" + } + + TTestData( + alpha = alpha, + tStat = tStatisticValue, + tTestSignificance = tStatisticSignificance, + tTestPValue = tStatisticPValue, + equivalencyJudgement = equivalencyJudgement.head + ) + } + + /** + * Equivalency tests for the distribution of data between two series. 
+ * @return Payload of the equivalency p value, D statistic, and equivalency judgement (Y or N) between the two distributions + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def computeKolmogorovSmirnovTest: KSTestResult = { + + val ksTestPValue = + TestUtils.kolmogorovSmirnovTest(left, right) + val ksTestDStatistic = + TestUtils.kolmogorovSmirnovStatistic(left, right) + val equivalency = ksTestPValue match { + case x if x <= alpha => "N" + case _ => "Y" + } + + KSTestResult(ksTestPValue, ksTestDStatistic, equivalency.head) + + } + + /** + * Method for calculating unbiased covariance between two data series + * @return unbiased covariance score + * @note https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/stat/correlation/Covariance.html + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def computeCovariance: Double = + new Covariance().covariance(left, right, true) + + /** + * Method for calculating Pearson's product-moment correlation coefficient for two data series + * @return Pearson's product-moment correlation coefficient + * @note https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/stat/correlation/PearsonsCorrelation.html + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def computePearsons: Double = + new PearsonsCorrelation().correlation(left, right) + + /** + * Method for calculating Spearman's Rank correlation for two data series using Natural Ranking + * @return Spearman's rank correlation coefficient + * @note https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/stat/correlation/SpearmansCorrelation.html + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def computeSpearmans: Double = + new SpearmansCorrelation().correlation(left, right) + + /** + * Method for calculating Kendall's Tau-b Rank correlation for two data series + * @return Kendall's Tau-b correlation coefficient + * @note 
https://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/stat/correlation/KendallsCorrelation.html + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def computeKendallsTauRank: Double = + new KendallsCorrelation().correlation(left, right) + + /** + * Main execution method for getting the pair testing data for two data series. + * Performs equivalency correlation testing for: + * - Unbiased correlation testing + * - Pearson's correlation testing + * - Spearman's correlation testing + * - Kendall's correlation testing + * Computes a t-test for mean equivalency + * Computes distribution equivalency testing between the two series + * @return Testing Payload of the statistical data + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def execute: PairedTestResult = { + + val correlationChecks = CorrelationTestResult( + computeCovariance, + computePearsons, + computeSpearmans, + computeKendallsTauRank + ) + + PairedTestResult( + correlationChecks, + computePairedTTest, + computeKolmogorovSmirnovTest + ) + + } + +} + +/** + * Companion Object for Paired Testing + */ +object PairedTesting { + + def evaluate(left: Seq[Double], + right: Seq[Double], + alpha: Double): PairedTestResult = { + new PairedTesting(left.toArray, right.toArray, alpha).execute + } + + def evaluate(left: Array[Double], + right: Array[Double], + alpha: Double): PairedTestResult = { + new PairedTesting(left, right, alpha).execute + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/PolynomialRegressor.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/PolynomialRegressor.scala new file mode 100644 index 00000000..6692ca88 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/PolynomialRegressor.scala @@ -0,0 +1,105 @@ +package com.databricks.labs.automl.exploration.tools + +import org.apache.commons.math3.analysis.polynomials +import 
org.apache.commons.math3.analysis.polynomials.PolynomialFunction +import org.apache.commons.math3.analysis.solvers.LaguerreSolver +import org.apache.commons.math3.fitting +import org.apache.commons.math3.fitting.{ + PolynomialCurveFitter, + WeightedObservedPoints +} + +object PolynomialRegressor { + + private def createObservations(x: Seq[Double], + y: Seq[Double]): WeightedObservedPoints = { + + val data = x.zip(y) + val payload = new fitting.WeightedObservedPoints() + + data.foreach { case (x, y) => payload.add(x, y) } + + payload + } + + private def fitParameters(order: Int, + data: WeightedObservedPoints): Array[Double] = { + + val parameterFitter = PolynomialCurveFitter.create(order) + + parameterFitter.fit(data.toList) + + } + + def getRoot(params: Array[Double]): Double = { + + val polynomial = new polynomials.PolynomialFunction(params) + val solver = new LaguerreSolver() + solver.solve(100, polynomial, -1000, 1000) + } + + private def calculateFit(x: Double, poly: PolynomialFunction): Double = { + poly.value(x) + } + + private def calculateSSR(data: Seq[(Double, Double)], + predictions: Seq[Double]): Double = { + + data.map(_._2).zip(predictions).foldLeft(0.0) { + case (acc, i) => + acc + math.pow(i._1 - i._2, 2) + } + } + + private def calculateSSE(predictions: Seq[Double], + meanActual: Double): Double = { + predictions.foldLeft(0.0) { + case (acc, i) => acc + math.pow(i - meanActual, 2) + } + } + + private def calculateSST(data: Seq[(Double, Double)], + meanActual: Double): Double = { + data.map(_._2).foldLeft(0.0) { + case (acc, i) => acc + math.pow(i - meanActual, 2) + } + } + + private def calculateR2(ssr: Double, sst: Double): Double = { + 1.0 - (ssr / sst) + } + + def fit(x: Seq[Double], + y: Seq[Double], + order: Int): PolynomialRegressorResult = { + + val points = createObservations(x, y) + val params = fitParameters(order, points) + val polynomial = new PolynomialFunction(params) + + val zippedData = x.zip(y) + + val predictions = x.map(a => 
calculateFit(a, polynomial)) + + val ssr = calculateSSR(zippedData, predictions) + val sse = calculateSSE(predictions, y.sum / y.length) + val sst = calculateSST(zippedData, y.sum / y.length) + val mse = ssr / (x.size - order) + val rmse = math.sqrt(mse) + val r2 = calculateR2(ssr, sst) + + PolynomialRegressorResult(order, polynomial, ssr, sse, sst, mse, rmse, r2) + + } + + def fitMultipleOrders( + x: Seq[Double], + y: Seq[Double], + orders: Array[Int] + ): Array[PolynomialRegressorResult] = { + + orders.map(o => fit(x, y, o)) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/ShapiroBase.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/ShapiroBase.scala new file mode 100644 index 00000000..81d7a460 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/ShapiroBase.scala @@ -0,0 +1,157 @@ +package com.databricks.labs.automl.exploration.tools + +trait ShapiroBase { + + final val C1 = + Array(0.0, 0.221157E0, -0.147981E0, -0.207119E1, 0.4434685E1, -0.2706056E1) + final val C2 = Array(0.0E0, 0.42981E-1, -0.293762E0, -0.1752461E1, + 0.5682633E1, -0.3582633E1) + final val C3 = Array(0.5440E0, -0.39978E0, 0.25054E-1, -0.6714E-3) + final val C4 = + Array(0.13822E1, -0.77857E0, 0.62767E-1, -0.20322E-2) + final val C5 = + Array(Double.NaN, -0.15861E1, -0.31082E0, -0.83751E-1, 0.38915E-2) + final val C6 = Array(-0.4803E0, -0.82676E-1, 0.30302E-2) + final val C7 = Array(0.164E0, 0.533E0) + final val C8 = Array(0.1736E0, 0.315E0) + final val C9 = Array(0.256E0, -0.635E-2) + final val G = Array(-0.2273E1, 0.459E0) + final val Z90 = 0.12816E1 + final val Z95 = 0.16449E1 + final val Z99 = 0.23263E1 + final val ZM = 0.17509E1 + final val ZSS = 0.56268E0 + final val BF1 = 0.8378E0 + final val XX90 = 0.556E0 + final val XX95 = 0.622E0 + final val SQRTH = 0.70711E0 + final val TH = 0.375E0 + final val SMALL = 1E-19 + final val PI6 = 0.1909859E1 + final val STQR = 0.1047198E1 + final val UPPER = true + + 
final val NORMALCY_TRUE = "Y" + final val NORMALCY_FALSE = "N" + + /** + * Compute the quantile function for the normal distribution. For small to moderate probabilities, algorithm referenced + * below is used to obtain an initial approximation which is polished with a final Newton step. For very large arguments, an algorithm of Wichura is used. + * Used by ShapiroWilk Test + * Ported by Javascript implementation found at https://raw.github.com/rniwa/js-shapiro-wilk/master/shapiro-wilk.js + * Originally ported from http://svn.r-project.org/R/trunk/src/nmath/qnorm.c + * + * @param p + * @param mu + * @param sigma + * @return + */ + def normalQuantile(p: Double, mu: Double, sigma: Double): Double = { // The inverse of cdf. + if (sigma < 0) + throw new IllegalArgumentException( + "The sigma parameter must be positive." + ) + else if (sigma == 0) return mu + var r = .0 + var `val` = .0 + val q = p - 0.5 + if (0.075 <= p && p <= 0.925) { + r = 0.180625 - q * q + `val` = q * (((((((r * 2509.0809287301226727 + 33430.575583588128105) * r + 67265.770927008700853) + * r + 45921.953931549871457) * r + 13731.693765509461125) * r + 1971.5909503065514427) + * r + 133.14166789178437745) * r + 3.387132872796366608) / + (((((((r * 5226.495278852854561 + 28729.085735721942674) * r + 39307.89580009271061) + * r + 21213.794301586595867) * r + 5394.1960214247511077) * r + 687.1870074920579083) + * r + 42.313330701600911252) * r + 1) + } else { + /* closer than 0.075 from {0,1} boundary */ /* r = min(p, 1-p) < 0.075 */ + if (q > 0) r = 1 - p + else r = p /* = R_DT_Iv(p) ^= p */ + r = Math.sqrt(-Math.log(r)) /* r = sqrt(-log(r)) <==> min(p, 1-p) = exp( - r^2 ) */ + if (r <= 5.0) { + /* <==> min(p,1-p) >= exp(-25) ~= 1.3888e-11 */ + r += -1.6 + `val` = (((((((r * 7.7454501427834140764e-4 + 0.0227238449892691845833) * r + 0.24178072517745061177) + * r + 1.27045825245236838258) * r + 3.64784832476320460504) * r + 5.7694972214606914055) + * r + 4.6303378461565452959) * r + 
1.42343711074968357734) / + (((((((r * 1.05075007164441684324e-9 + 5.475938084995344946e-4) * r + 0.0151986665636164571966) + * r + 0.14810397642748007459) * r + 0.68976733498510000455) * r + 1.6763848301838038494) + * r + 2.05319162663775882187) * r + 1.0) + } else { + /* very close to 0 or 1 */ + r += -5.0 + `val` = (((((((r * 2.01033439929228813265e-7 + 2.71155556874348757815e-5) * r + 0.0012426609473880784386) + * r + 0.026532189526576123093) * r + 0.29656057182850489123) * r + 1.7848265399172913358) + * r + 5.4637849111641143699) * r + 6.6579046435011037772) / + (((((((r * 2.04426310338993978564e-15 + 1.4215117583164458887e-7) * r + 1.8463183175100546818e-5) + * r + 7.868691311456132591e-4) * r + 0.0148753612908506148525) * r + 0.13692988092273580531) + * r + 0.59983220655588793769) * r + 1.0) + } + if (q < 0.0) `val` = -`val` + } + mu + sigma * `val` + } + + def gaussCdf(z: Double): Double = { // input = z-value (-inf to +inf) + + // ACM Algorithm #209 + var y = 0.0 // 209 scratch variable + var p = 0.0 // result. called ‘z’ in 209 + var w = 0.0 + if (z == 0.0) p = 0.0 + else { + y = Math.abs(z) / 2.0 + if (y >= 3.0) p = 1.0 + else if (y < 1.0) { + w = y * y + p = ((((((((0.000124818987 * w - 0.001075204047) * w + 0.005198775019) * w - 0.019198292004) + * w + 0.059054035642) * w - 0.151968751364) * w + 0.319152932694) * w - 0.531923007300) + * w + 0.797884560593) * y * 2.0 + } else { + y = y - 2.0 + p = (((((((((((((-0.000045255659 * y + 0.000152529290) * y - 0.000019538132) + * y - 0.000676904986) * y + 0.001390604284) * y - 0.000794620820) * y - 0.002034254874) + * y + 0.006549791214) * y - 0.010557625006) * y + 0.011630447319) * y - 0.009279453341) + * y + 0.005353579108) * y - 0.002141268741) * y + 0.000535310849) * y + 0.999936657524 + } + } + if (z > 0.0) return (p + 1.0) / 2.0 + (1.0 - p) / 2.0 + } + + /** + * Used internally by ShapiroWilkW(). 
+ * + * @param cc + * @param nord + * @param x + * @return + */ + def poly(cc: Array[Double], nord: Int, x: Double): Double = { + /* Algorithm AS 181.2 Appl. Statist. (1982) Vol. 31, No. 2 + Calculates the algebraic polynomial of order nord-1 with array of coefficients cc. + Zero order coefficient is cc(1) = cc[0] */ + var ret_val = cc(0) + if (nord > 1) { + var p = x * cc(nord - 1) + for (j <- nord - 2 until 0 by -1) { + p = (p + cc(j)) * x + } + ret_val += p + } + ret_val + } + + /** + * Used internally by ShapiroWilkW() + * + * @param x + * @return + */ + def sign(x: Double): Int = { + if (x == 0) return 0 + if (x > 0) 1 + else -1 + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/ShapiroWilk.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/ShapiroWilk.scala new file mode 100644 index 00000000..24e80b07 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/ShapiroWilk.scala @@ -0,0 +1,234 @@ +package com.databricks.labs.automl.exploration.tools + +/** + * Shapiro-Wilk test for normality. + * @note the algorithm below is restricted to a maximum of 5000 elements. + */ +object ShapiroWilk extends ShapiroBase { + + /** + * Calculates P-value for ShapiroWilk Test and return: + * W score + * Z score + * probability p-value + * Boolean decision on whether or not to reject the null hypothesis that the data is normally distributed + * String value describing whether the data is normally distributed or not + * + * This equation and the modeled source code is ported from the Javascript implementation + * which is originally based on the FORTRAN code used to execute this test. 
+ * https://github.com/rniwa/js-shapiro-wilk/blob/master/shapiro-wilk.js + * + * @param x Array[Double] The data to be tested + * @return Test results from the ShapiroWilk algorithm + * @throws IllegalArgumentException if fewer than 3 or more than 5000 elements are supplied, or if the data range is too small + */ + @throws[IllegalArgumentException] + private def ShapiroWilkW(x: Array[Double]): ShapiroScoreData = { + java.util.Arrays.sort(x) + val n = x.length + if (n < 3) + throw new IllegalArgumentException( + s"Count of elements to measure W is too small ($n); must be at least 3" + ) + if (n > 5000) { + throw new IllegalArgumentException( + s"Count of elements to measure W is too large ($n); must be at most 5000" + ) + } + val nn2 = n / 2 + val a = new Array[Double](nn2 + 1) + /* 1-based */ + /* + ALGORITHM AS R94 APPL. STATIST. (1995) vol.44, no.4, 547-551. + Calculates the Shapiro-Wilk W test and its significance level + */ + val small = 1e-19 + /* polynomial coefficients */ + val g = Array(-2.273, 0.459) + val c1 = Array(0.0, 0.221157, -0.147981, -2.07119, 4.434685, -2.706056) + val c2 = Array(0.0, 0.042981, -0.293762, -1.752461, 5.682633, -3.582633) + val c3 = Array(0.544, -0.39978, 0.025054, -6.714e-4) + val c4 = Array(1.3822, -0.77857, 0.062767, -0.0020322) + val c5 = Array(-1.5861, -0.31082, -0.083751, 0.0038915) + val c6 = Array(-0.4803, -0.082676, 0.0030302) + /* Local variables */ + var i = 0 + var j = 0 + var i1 = 0 + var ssassx = 0.0 + var summ2 = 0.0 + var ssumm2 = 0.0 + var gamma = 0.0 + var range = 0.0 + var a1 = 0.0 + var a2 = 0.0 + var an = 0.0 + var m = 0.0 + var s = 0.0 + var sa = 0.0 + var xi = 0.0 + var sx = 0.0 + var xx = 0.0 + var y = 0.0 + var w1 = 0.0 + var fac = 0.0 + var asa = 0.0 + var an25 = 0.0 + var ssa = 0.0 + var sax = 0.0 + var rsn = 0.0 + var ssx = 0.0 + var xsx = 0.0 + var pw = 1.0 + an = n.toDouble + if (n == 3) a(1) = 0.70710678 /* = sqrt(1/2) */ + else { + an25 = an + 0.25 + summ2 = 0.0 + i = 1 + while ({ + i <= nn2 + }) { + a(i) = normalQuantile((i - 0.375) / an25, 0, 1) // p(X <= x), + + summ2 += a(i) * a(i) + 
+ i += 1 + } + summ2 *= 2.0 + ssumm2 = Math.sqrt(summ2) + rsn = 1.0 / Math.sqrt(an) + a1 = poly(c1, 6, rsn) - a(1) / ssumm2 + /* Normalize a[] */ + if (n > 5) { + i1 = 3 + a2 = -a(2) / ssumm2 + poly(c2, 6, rsn) + fac = Math.sqrt( + (summ2 - 2.0 * (a(1) * a(1)) - 2.0 * (a(2) * a(2))) / (1.0 - 2.0 * (a1 * a1) - 2.0 * (a2 * a2)) + ) + a(2) = a2 + } else { + i1 = 2 + fac = Math.sqrt((summ2 - 2.0 * (a(1) * a(1))) / (1.0 - 2.0 * (a1 * a1))) + } + a(1) = a1 + i = i1 + while ({ + i <= nn2 + }) { + a(i) /= -fac + + i += 1 + } + } + range = x(n - 1) - x(0) + if (range < small) { + throw new IllegalArgumentException( + s"Total Range of data is too small to calculate ShapiroWilk test (${range}) which is less than minimum of ($small)" + ) + } + /* Check for correct sort order on range - scaled X */ + xx = x(0) / range + sx = xx + sa = -a(1) + i = 1 + j = n - 1 + while ({ + i < n + }) { + xi = x(i) / range + if (xx - xi > small) { + throw new IllegalArgumentException( + s"Scaled range is too small to calculate ShapiroWilk properly (${xx - xi}) is less than minimum of ($small)" + ) + } + sx += xi + i += 1 + if (i != j) sa += sign(i - j) * a(Math.min(i, j)) + xx = xi + + j -= 1 + } + // Calculate W statistic + sa /= n + sx /= n + ssa = 0.0 + ssx = 0.0 + sax = 0.0 + i = 0 + j = n - 1 + while ({ + i < n + }) { + if (i != j) asa = sign(i - j) * a(1 + Math.min(i, j)) - sa + else asa = -sa + xsx = x(i) / range - sx + ssa += asa * asa + ssx += xsx * xsx + sax += asa * xsx + + i += 1 + j -= 1 + } + ssassx = Math.sqrt(ssa * ssx) + w1 = (ssassx - sax) * (ssassx + sax) / (ssa * ssx) + val w = 1.0 - w1 + /* Calculate significance level for W */ + if (n == 3) { + /* exact P value : */ + val pi6 = 1.90985931710274 + /* = 6/pi */ + val stqr = 1.04719755119660 + pw = pi6 * (Math.asin(Math.sqrt(w)) - stqr) + if (pw < 0.0) pw = 0 + //return w; + return ShapiroScoreData(w, 0.0, pw) + } + y = Math.log(w1) + xx = Math.log(an) + if (n <= 11) { + gamma = poly(g, 2, an) + if (y >= gamma) { + pw = 1e-99 
+ return ShapiroScoreData(w, 0.0, pw) + } + y = -Math.log(gamma - y) + m = poly(c3, 4, an) + s = Math.exp(poly(c4, 4, an)) + } else { + m = poly(c5, 4, xx) + s = Math.exp(poly(c6, 3, xx)) + } + val z = (y - m) / s + pw = gaussCdf(z) + ShapiroScoreData(w, z, pw) + } + + /** + * Tests the rejection of null Hypothesis for a particular confidence level + * + * @param data the Array of data to perform a test against + * @param aLevel the alpha for calculating the probability of normalcy resulting in a test pass or failure. + * Defaulted to 5% + * @return ShapiroInternalData payload consisting of the W value, Z score, probability result, normalcy boolean, + * and String value of "Normally Distributed y/n" + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def test(data: Array[Double], aLevel: Double = 0.05): ShapiroInternalData = { + var normalcyTest = false + val swTest = ShapiroWilkW(data) + val a = aLevel + if (swTest.probability <= a || swTest.probability >= (1.0 - a)) + normalcyTest = true + val normalcy = if (normalcyTest) NORMALCY_FALSE else NORMALCY_TRUE + + ShapiroInternalData( + swTest.w, + swTest.z, + swTest.probability, + normalcyTest, + normalcy + ) + } +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/tools/SimpleRegressor.scala b/src/main/scala/com/databricks/labs/automl/exploration/tools/SimpleRegressor.scala new file mode 100644 index 00000000..764173aa --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/tools/SimpleRegressor.scala @@ -0,0 +1,45 @@ +package com.databricks.labs.automl.exploration.tools + +import org.apache.commons.math3.stat.regression.SimpleRegression + +/** + * + */ +object SimpleRegressor { + + private def createMatchedPairs(x: Seq[Double], + y: Seq[Double]): Array[Array[Double]] = { + + x.zip(y).map(x => Array(x._1, x._2)).toArray + + } + + def calculate(x: Seq[Double], y: Seq[Double]): SimpleRegressorResult = { + + val pairs = createMatchedPairs(x, y) + + val r = new 
SimpleRegression() + + r.addData(pairs) + + SimpleRegressorResult( + r.getSlope, + r.getSlopeStdErr, + r.getSlopeConfidenceInterval, + r.getIntercept, + r.getInterceptStdErr, + r.getRSquare, + r.getSignificance, + r.getMeanSquareError, + math.sqrt(r.getMeanSquareError), + r.getRegressionSumSquares, + r.getTotalSumSquares, + r.getSumSquaredErrors, + r.getN, + r.getR, + r.getSumOfCrossProducts + ) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/exploration/visualizations/OneDimVisualizations.scala b/src/main/scala/com/databricks/labs/automl/exploration/visualizations/OneDimVisualizations.scala new file mode 100644 index 00000000..782f4c49 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/exploration/visualizations/OneDimVisualizations.scala @@ -0,0 +1,25 @@ +package com.databricks.labs.automl.exploration.visualizations + +import com.databricks.labs.automl.exploration.tools.OneDimStatsData +import vegas.DSL.ExtendedUnitSpecBuilder +import vegas._ + +object OneDimVisualizations { + + def renderHTML(plot: ExtendedUnitSpecBuilder, name: String): String = { + plot.html.pageHTML(name) + } + + def generateOneDimPlots(results: OneDimStatsData) = { + + val plot = Vegas("Variance") + .withData(Seq(Map("field" -> "A", "variance" -> results.variance))) + .encodeX("field", Nom) + .encodeY("variance", Quant) + .mark(Bar) + + renderHTML(plot, "variance") + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/DataEvaluator.scala b/src/main/scala/com/databricks/labs/automl/feature/DataEvaluator.scala new file mode 100644 index 00000000..0435aecf --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/DataEvaluator.scala @@ -0,0 +1,17 @@ +package com.databricks.labs.automl.feature + +class DataEvaluator { + + //TODO: make this useable as a stand-alone data exploration toolkit. 
+ + //TODO: phase 1 - run through a supplied DF and get the scores for Variance / Information Gain of each field + + //TODO: phase 2 - Provide test suite for Shapiro-Wilk test for normalcy + + //TODO: phase 3 - Provide test suite for Agostino normalcy ? + + //TODO: phase 4 - PCA + + //TODO: phase 5 - Pearson report + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/FeatureEvaluator.scala b/src/main/scala/com/databricks/labs/automl/feature/FeatureEvaluator.scala new file mode 100644 index 00000000..c1a8a146 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/FeatureEvaluator.scala @@ -0,0 +1,196 @@ +package com.databricks.labs.automl.feature + +import org.apache.spark.ml.feature.QuantileDiscretizer +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, lit, log2, sum, variance} +import com.databricks.labs.automl.feature.structures._ +import org.apache.spark.sql.types.StructType + +object FeatureEvaluator extends FeatureInteractionBase { + + /** + * Helper method for calculating the Information Gain of a feature field + * @param df DataFrame that contains at least the fieldToTest and the Label Column + * @param fieldToTest The field to calculate Information Gain for + * @param totalRecordCount Total number of records in the data set + * @return The Information Gain of the field + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def calculateCategoricalInformationGain(df: DataFrame, + labelColumn: String, + fieldToTest: String, + totalRecordCount: Long): Double = { + + val groupedData = df.groupBy(labelColumn, fieldToTest).count() + + val fieldCounts = df + .select(fieldToTest) + .groupBy(fieldToTest) + .count() + .withColumnRenamed(COUNT_COLUMN, AGGREGATE_COLUMN) + + val mergeCounts = groupedData + .join(fieldCounts, Seq(fieldToTest), "left") + .withColumn(RATIO_COLUMN, col(COUNT_COLUMN) / col(AGGREGATE_COLUMN)) + .withColumn( + ENTROPY_COLUMN, + lit(-1) * col(RATIO_COLUMN) * 
log2(col(RATIO_COLUMN)) + ) + .withColumn( + TOTAL_RATIO_COLUMN, + col(AGGREGATE_COLUMN) / lit(totalRecordCount) + ) + + val distinctValues = + mergeCounts.select(fieldToTest, TOTAL_RATIO_COLUMN).distinct() + + val mergedEntropy = mergeCounts + .groupBy(fieldToTest) + .agg(sum(ENTROPY_COLUMN).alias(ENTROPY_COLUMN)) + + val joinedEntropy = mergedEntropy + .join(distinctValues, Seq(fieldToTest)) + .withColumn( + FIELD_ENTROPY_COLUMN, + col(ENTROPY_COLUMN) * col(TOTAL_RATIO_COLUMN) + ) + .select(fieldToTest, FIELD_ENTROPY_COLUMN) + .collect() + .map(r => EntropyData(r.get(0).toString.toDouble, r.getDouble(1))) + + joinedEntropy.map(_.entropy).sum / joinedEntropy.length.toDouble + + } + + /** + * Helper method for converting a continuous feature to a discrete bucketed value so that entropy can be calculated + * effectively for the feature. + * @param df DataFrame containing at least the field to test in continuous numeric format + * @param fieldToTest The name of the field under conversion + * @return A Dataframe with the continuous value converted to a quantized bucket membership value. + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def discretizeContinuousFeature(df: DataFrame, + fieldToTest: String, + bucketCount: Int): DataFrame = { + + val renamedFieldToTest = s"d_$fieldToTest" + + val discretizer = new QuantileDiscretizer() + .setInputCol(renamedFieldToTest) + .setOutputCol(fieldToTest) + .setNumBuckets(bucketCount) + .setHandleInvalid("keep") + .setRelativeError(QUANTILE_PRECISION) + + val modifiedData = df.withColumnRenamed(fieldToTest, renamedFieldToTest) + + discretizer.fit(modifiedData).transform(modifiedData) + + } + + /** + * Helper method for handling Information Gain Calculation for classification data set when dealing with continuous + * (numeric) feature elements. 
The continuous feature will be split upon the configured value of _continuousDiscretizerBucketCount, + * which is set by overriding .setContinuousDiscretizerBucketCount() + * @param df DataFrame that contains the feature to test and the label column + * @param fieldToTest The feature field that is under test for entropy evaluation + * @param totalRecordCount Total number of elements in the data set. + * @return Information Gain associated with the feature field based on splits that could occur. + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def calculateContinuousInformationGain(df: DataFrame, + labelCol: String, + fieldToTest: String, + totalRecordCount: Long, + bucketCount: Int): Double = { + + val adjustedFieldData = + discretizeContinuousFeature(df, fieldToTest, bucketCount) + + calculateCategoricalInformationGain( + adjustedFieldData, + labelCol, + fieldToTest, + totalRecordCount + ) + + } + + /** + * Method for calculating the variance of a categorical (nominal) field based on a post-split first-layer variance + * of the label column's values to determine the minimum variance achievable in the label column. 
+ * @param df DataFrame that contains the label column and the field under test for minimum by-group variance + * @param labelColumn The label column of the data set + * @param fieldToTest The feature column to test for variance reduction + * @return The minimum split variance of the aggregated label data by nominal group of the fieldToTest + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def calculateCategoricalVariance(df: DataFrame, + labelColumn: String, + fieldToTest: String): Double = { + + val groupedData = df + .select(labelColumn, fieldToTest) + .groupBy(fieldToTest) + .agg(variance(labelColumn).alias(labelColumn)) + .collect() + .map(r => VarianceData(r.get(0).toString.toDouble, r.getDouble(1))) + groupedData.map(_.variance).min + } + + /** + * Method for calculating the variance of a continuous field for variance reduction in the label column based on + * bucketized grouping of the field under test. + * @param df DataFrame that contains the label column and the field under test of continuous numeric type + * @param labelColumn The label column of the data set + * @param fieldToTest The field to test (continuous numeric) that needs to be evaluated + * @param bucketCount The number of quantized buckets to create to group the field under test into in order to + * simulate where a decision split would occur. 
+ * @return The minimum split variance among the buckets that have been created + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def calculateContinuousVariance(df: DataFrame, + labelColumn: String, + fieldToTest: String, + bucketCount: Int): Double = { + + val convertedContinuousData = + discretizeContinuousFeature(df, fieldToTest, bucketCount) + + calculateCategoricalVariance( + convertedContinuousData, + labelColumn, + fieldToTest + ) + + } + + /** + * Helper method for extracting field names and ensuring that the feature vector is present + * @param schema Schema of the DataFrame undergoing feature interaction + * @param featureVector The name of the features column + * @throws IllegalArgumentException if the feature vector column is not present in the schema + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def extractAndValidateSchema(schema: StructType, + featureVector: String): Unit = { + + val schemaFields = schema.names + + require( + schemaFields.contains(featureVector), + s"The feature vector column $featureVector does not " + + s"exist in the DataFrame supplied to FeatureInteraction.createCandidatesAndAddToVector. 
Field listing is: " + + s"${schemaFields.mkString(", ")} " + ) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/FeatureInteraction.scala b/src/main/scala/com/databricks/labs/automl/feature/FeatureInteraction.scala new file mode 100644 index 00000000..6adc7f5f --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/FeatureInteraction.scala @@ -0,0 +1,651 @@ +package com.databricks.labs.automl.feature + +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool +import com.databricks.labs.automl.feature.structures._ +import com.databricks.labs.automl.pipeline.{ + DropColumnsTransformer, + InteractionTransformer +} +import org.apache.spark.ml.Pipeline + +class FeatureInteraction(modelingType: String, retentionMode: String) + extends FeatureInteractionBase { + + import com.databricks.labs.automl.feature.structures.FieldEncodingType._ + import com.databricks.labs.automl.feature.structures.InteractionRetentionMode._ + import com.databricks.labs.automl.feature.structures.ModelingType._ + + private var _labelCol: String = "label" + + private var _fullDataEntropy: Double = 0.0 + + private var _fullDataVariance: Double = 0.0 + + private var _continuousDiscretizerBucketCount: Int = 10 + + private var _parallelism: Int = 4 + + private var _targetInteractionPercentage: Double = modelingType match { + case "regressor" => -1.0 + case "classifier" => 1.0 + } + + def setLabelCol(value: String): this.type = { + _labelCol = value + this + } + + def setContinuousDiscretizerBucketCount(value: Int): this.type = { + + require( + value > 1, + s"Continuous Discretizer Bucket Count for continuous features must be greater than 1. $value is invalid." 
+ ) + _continuousDiscretizerBucketCount = value + this + } + + def setParallelism(value: Int): this.type = { + require( + value > 0, + s"Parallelism value $value is invalid. Must be 1 or greater." + ) + _parallelism = value + this + } + + def setTargetInteractionPercentage(value: Double): this.type = { + _targetInteractionPercentage = value + this + } + + /** + * Helper method to set the class property for data-level entropy based on the values of a nominal label column + * @param df The raw data frame + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def setFullDataEntropy(df: DataFrame): this.type = { + + val uniqueLabelEntries = df.select(_labelCol).groupBy(_labelCol).count() + + val labelEntropy = + uniqueLabelEntries + .agg(sum(COUNT_COLUMN).alias(AGGREGATE_COLUMN)) + .crossJoin(uniqueLabelEntries) + .withColumn(RATIO_COLUMN, col(COUNT_COLUMN) / col(AGGREGATE_COLUMN)) + .withColumn( + ENTROPY_COLUMN, + lit(-1) * col(RATIO_COLUMN) * log2(col(RATIO_COLUMN)) + ) + .select(_labelCol, ENTROPY_COLUMN) + .collect() + .map(r => EntropyData(r.get(0).toString.toDouble, r.getDouble(1))) + + _fullDataEntropy = labelEntropy.map(_.entropy).sum + this + + } + + /** + * Private method for setting the data set label's variance value + * @param df The source DataFrame + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def setFullDataVariance(df: DataFrame): this.type = { + + _fullDataVariance = scala.math.pow( + df.select(_labelCol) + .summary(VARIANCE_STATISTIC) + .first() + .getAs[String](_labelCol) + .toDouble, + 2 + ) + this + } + + /** + * Private method for scoring a column based on the model type and the field type + * @param df Dataframe for evaluation + * @param modelType Model Type: Either Classifier or Regressor from Enum + * @param fieldToTest The field to be scored + * @param fieldType The type of the field: Either Nominal (String Indexed) or Continuous from Enum + * @param totalRecordCount Total number of rows in the data set in 
order to calculate Entropy correctly + * @return A Score as Double + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def scoreColumn(df: DataFrame, + modelType: ModelingType.Value, + fieldToTest: String, + fieldType: FieldEncodingType.Value, + totalRecordCount: Long): Double = { + + val subsetData = df.select(fieldToTest, _labelCol) + + modelType match { + case Classifier => + fieldType match { + case Nominal => + FeatureEvaluator.calculateCategoricalInformationGain( + subsetData, + _labelCol, + fieldToTest, + totalRecordCount + ) + case Continuous => + FeatureEvaluator.calculateContinuousInformationGain( + subsetData, + _labelCol, + fieldToTest, + totalRecordCount, + _continuousDiscretizerBucketCount + ) + } + case Regressor => + fieldType match { + case Nominal => + FeatureEvaluator.calculateCategoricalVariance( + subsetData, + _labelCol, + fieldToTest + ) + case Continuous => + FeatureEvaluator.calculateContinuousVariance( + subsetData, + _labelCol, + fieldToTest, + _continuousDiscretizerBucketCount + ) + } + } + + } + + /** + * Private method for evaluating an interacted column + * @param df A DataFrame to be used for candidate feature interaction evaluations + * @param candidate The InteractionPayload for the parents of left/right to make the interacted feature + * @param totalRecordCount Total number of records in the DataFrame (calculated only once for the Object) + * @return InteractionResult payload of interaction scores associated with the interacted features + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def evaluateInteraction(df: DataFrame, + candidate: InteractionPayload, + totalRecordCount: Long): InteractionResult = { + + // Generate a subset DataFrame, create the interaction column, and retain only the fields needed. 
+ val evaluationDf = df + .select(candidate.left, candidate.right, _labelCol) + + val interactedDf = interactProduct(evaluationDf, candidate) + + val dataModelDecision = + (candidate.leftDataType, candidate.rightDataType) match { + case ("nominal", "nominal") => "nominal" + case _ => "continuous" + } + + // Score the interaction + val score = scoreColumn( + interactedDf, + getModelType(modelingType), + candidate.outputName, + getFieldType(dataModelDecision), + totalRecordCount + ) + + InteractionResult( + candidate.left, + candidate.right, + candidate.outputName, + score + ) + + } + + /** + * Private method for comparing the parents scores to the interacted feature score and return a Boolean keep / not keep for the interacted feature in the final + * data set and configuration for this module. + * @param interactionResult the evaluated result of an interacted feature + * @param leftScore left parent of interaction's score + * @param rightScore right parent of interaction's score + * @return Boolean value of whether to keep the interacted field or not + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def parentCompare(interactionResult: InteractionResult, + leftScore: ColumnScoreData, + rightScore: ColumnScoreData): Boolean = { + + val percentageChangeLeft = + calculatePercentageChange(leftScore.score, interactionResult.score) + + val percentageChangeRight = + calculatePercentageChange(rightScore.score, interactionResult.score) + + val keepCheck = getRetentionMode(retentionMode) match { + case Optimistic => + getModelType(modelingType) match { + case Regressor => + percentageChangeLeft <= _targetInteractionPercentage * -100 | percentageChangeRight <= _targetInteractionPercentage * -100 + case Classifier => + percentageChangeLeft >= _targetInteractionPercentage * -100 | percentageChangeRight >= _targetInteractionPercentage * -100 + } + case Strict => + getModelType(modelingType) match { + case Regressor => + percentageChangeLeft <= 
_targetInteractionPercentage * -100 & percentageChangeRight <= _targetInteractionPercentage * -100 + case Classifier => + percentageChangeLeft >= _targetInteractionPercentage * -100 & percentageChangeRight >= _targetInteractionPercentage * -100 + } + case All => true + } + + keepCheck + + } + + /** + * Main method for generating a list of interaction candidates based on the configuration specified in the class configuration. + * @param df The DataFrame to process interactions for + * @param nominalFields The nominal fields (String Indexed) to be used for interaction + * @param continuousFields The continuous fields (Original Numeric Types) to be used for interaction + * @return Array[InteractionPayload] for candidate fields interactions that meet the acceptance criteria as set by configuration. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateCandidates( + df: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String] + ): Array[InteractionPayloadExtract] = { + + val modelType = getModelType(modelingType) + + val totalRecordCount = df.count() + + modelType match { + case Regressor => setFullDataVariance(df) + case Classifier => setFullDataEntropy(df) + } + + val nominalScores = nominalFields.map { x => + x -> ColumnScoreData( + scoreColumn( + df, + modelType, + x, + getFieldType("nominal"), + totalRecordCount + ), + "nominal" + ) + + }.toMap + + val continuousScores = continuousFields.map { x => + x -> ColumnScoreData( + scoreColumn( + df, + modelType, + x, + getFieldType("continuous"), + totalRecordCount + ), + "continuous" + ) + }.toMap + + val mergedParentScores = nominalScores ++ continuousScores + + val interactionCandidatePayload = nominalFields.map( + x => ColumnTypeData(x, "nominal") + ) ++ continuousFields.map(y => ColumnTypeData(y, "continuous")) + + val interactionCandidates = generateInteractionCandidates( + interactionCandidatePayload + ) + + val forkJoinTaskSupport = new ForkJoinTaskSupport( + new 
ForkJoinPool(_parallelism) + ) + + val scoredCandidates = ArrayBuffer[InteractionResult]() + + val candidateChecks = interactionCandidates.par + candidateChecks.tasksupport = forkJoinTaskSupport + + candidateChecks.foreach { x => + val interaction = evaluateInteraction(df, x, totalRecordCount) + scoredCandidates.synchronized { scoredCandidates += interaction } + } + + var interactionBuffer = ArrayBuffer[InteractionPayloadExtract]() + + // Iterate over the evaluations and determine whether to keep them + + for (x <- scoredCandidates) { + + if (parentCompare( + x, + mergedParentScores(x.left), + mergedParentScores(x.right) + )) + interactionBuffer += InteractionPayloadExtract( + x.left, + mergedParentScores(x.left).dataType, + x.right, + mergedParentScores(x.right).dataType, + x.interaction, + x.score + ) + + } + + interactionBuffer.toArray + + } + + /** + * Method for determining feature interaction candidates, applying those candidates as new fields to the DataFrame, + * and returning a configuration payload that has the information about the interactions that can be used in a Pipeline. + * @param df DataFrame to be used to calculate and potentially add feature interactions to + * @param nominalFields Fields from the DataFrame that were originally non-numeric (Character, String, etc.) + * @param continuousFields Fields from the DataFrame that were originally numeric, continuous types. + * @return FeatureInteractionCollection -> the DataFrame with candidate feature interactions added in and the + * payload of interaction features and their constituent parents needed to recreate them in a Pipeline. 
+ * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def createCandidates( + df: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String] + ): FeatureInteractionCollection = { + + val fieldsToCreatePrime = + generateCandidates(df, nominalFields, continuousFields) + val fieldsToCreate = fieldsToCreatePrime.map(x => { + InteractionPayload( + x.left, + x.leftDataType, + x.right, + x.rightDataType, + x.outputName + ) + }) + + var data = df + + for (c <- fieldsToCreate) { + data = interactProduct(data, c) + } + + FeatureInteractionCollection(data, fieldsToCreatePrime) + + } + + /** + * Method for generating interaction candidates and re-building a feature vector + * @param df DataFrame to interact features with (that has a feature vector already built) + * @param nominalFields Array of column names for nominal (string indexed) values + * @param continuousFields Array of column names for continuous numeric values + * @param featureVectorColumn Name of the feature vector column + * @return FeatureInteractionOutputPayload containing the DataFrame with a re-built feature vector that includes the interacted feature columns. 
+ * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def createCandidatesAndAddToVector( + df: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String], + featureVectorColumn: String + ): FeatureInteractionOutputPayload = { + + FeatureEvaluator.extractAndValidateSchema(df.schema, featureVectorColumn) + + val strippedDf = df.drop(featureVectorColumn) + + val candidatePayload = + createCandidates(strippedDf, nominalFields, continuousFields) + + // Reset the nominal interaction fields + val indexedInteractions = generateNominalIndexesInteractionFields( + candidatePayload + ) + + // Build the Vector again + val vectorFields = nominalFields ++ continuousFields + + val assemblerOutput = regenerateFeatureVector( + indexedInteractions.data, + vectorFields, + indexedInteractions.adjustedFields, + featureVectorColumn + ) + + val outputData = restructureSchema( + assemblerOutput.data, + df.schema.names, + vectorFields, + indexedInteractions.adjustedFields, + featureVectorColumn, + _labelCol + ) + + FeatureInteractionOutputPayload( + outputData, + vectorFields ++ indexedInteractions.adjustedFields, + candidatePayload.interactionPayload + ) + + } + + /** + * Method for generating a pipeline-friendly feature interaction to support serialization of the automl pipeline + * properly. Utilizes the InteractionTransformer to generate the fields required for inference + * @param df DataFrame to be used for generating the interaction candidates and pipeline + * @param nominalFields Nominal type numeric fields that are part of the vector + * @param continuousFields Continuous type numeric fields that are part of the vector + * @param featureVectorColumn Name of the current feature vector column + * @return PipelineInteractionOutput which contains the pipeline to be applied to the automl pipeline flow. 
+ * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def createPipeline(df: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String], + featureVectorColumn: String): PipelineInteractionOutput = { + + FeatureEvaluator.extractAndValidateSchema(df.schema, featureVectorColumn) + + // Create the pipeline stage for dropping the feature vector + val columnDropTransformer = + new DropColumnsTransformer().setInputCols(Array(featureVectorColumn)) + + // Remove the feature vector + val strippedDf = columnDropTransformer.transform(df) + + // Get the fields that are needed for interaction, if any + val candidatePayload = + createCandidates(strippedDf, nominalFields, continuousFields) + + // Create the fields through the Interaction Transformer + val leftColumns = candidatePayload.interactionPayload.map(_.left) + val rightColumns = candidatePayload.interactionPayload.map(_.right) + + val interactor = new InteractionTransformer() + .setLeftCols(leftColumns) + .setRightCols(rightColumns) + + // Create the string indexers + val indexedInteractions = generateNominalIndexesInteractionFields( + candidatePayload + ) + + val preIndexerFieldsToRemove = indexedInteractions.fieldsToRemove + + val indexedColumnDropTransformer = + new DropColumnsTransformer().setInputCols(preIndexerFieldsToRemove) + + // Create the vector + val vectorFields = nominalFields ++ continuousFields + + val assemblerOutput = regenerateFeatureVector( + indexedInteractions.data, + vectorFields, + indexedInteractions.adjustedFields, + featureVectorColumn + ) + + // create the pipeline + val pipelineElement = new Pipeline().setStages( + Array(columnDropTransformer) ++ Array(interactor) ++ indexedInteractions.indexers ++ Array( + indexedColumnDropTransformer + ) //++ Array(assemblerOutput.assembler) + ) + + PipelineInteractionOutput( + pipelineElement, + pipelineElement.fit(df).transform(df), + vectorFields ++ indexedInteractions.adjustedFields, + candidatePayload.interactionPayload + ) + + } + 
+ /** + * Private method for enforcing re-ordering of the Dataframe that is returned to preserve the structure of the + * original dataframe before being passed to this module and to create appropriate placement of interacted features + * @param data DataFrame that has been interacted + * @param originalSchemaNames Names within the original schema + * @param originalFeatureNames Features that were originally contained in the vector prior to interaction + * @param interactedFields Fields that have been retained as interaction candidates + * @param featureCol feature column name + * @param labelCol label column name + * @return DataFrame in correct order + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def restructureSchema(data: DataFrame, + originalSchemaNames: Array[String], + originalFeatureNames: Array[String], + interactedFields: Array[String], + featureCol: String, + labelCol: String): DataFrame = { + + val startingFields = originalFeatureNames + .filterNot(x => x.contains(featureCol)) + .filterNot(x => x.contains(labelCol)) + + val ignoredFields = originalSchemaNames + .filterNot(originalFeatureNames.contains) + .filterNot(x => x.contains(labelCol)) + .filterNot(x => x.contains(featureCol)) + val featureOrdered = startingFields.filterNot(ignoredFields.contains) ++ interactedFields + val orderedFields = featureOrdered ++ ignoredFields ++ Array( + featureCol, + labelCol + ) + data.select(orderedFields map col: _*) + + } + +} + +object FeatureInteraction { + + def interactFeatures( + data: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String], + modelingType: String, + retentionMode: String, + labelCol: String, + featureCol: String, + continuousDiscretizerBucketCount: Int, + parallelism: Int, + targetInteractionPercentage: Double + ): FeatureInteractionOutputPayload = + new FeatureInteraction(modelingType, retentionMode) + .setLabelCol(labelCol) + .setContinuousDiscretizerBucketCount(continuousDiscretizerBucketCount) + 
.setParallelism(parallelism) + .setTargetInteractionPercentage(targetInteractionPercentage) + .createCandidatesAndAddToVector( + data, + nominalFields, + continuousFields, + featureCol + ) + + def interactDataFrame( + data: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String], + modelingType: String, + retentionMode: String, + labelCol: String, + continuousDiscretizerBucketCount: Int, + parallelism: Int, + targetInteractionPercentage: Double + ): FeatureInteractionCollection = { + new FeatureInteraction(modelingType, retentionMode) + .setLabelCol(labelCol) + .setContinuousDiscretizerBucketCount(continuousDiscretizerBucketCount) + .setParallelism(parallelism) + .setTargetInteractionPercentage(targetInteractionPercentage) + .createCandidates(data, nominalFields, continuousFields) + } + + def interactionReport( + data: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String], + modelingType: String, + retentionMode: String, + labelCol: String, + continuousDiscretizerBucketCount: Int, + parallelism: Int, + targetInteractionPercentage: Double + ): Array[InteractionPayloadExtract] = { + new FeatureInteraction(modelingType, retentionMode) + .setLabelCol(labelCol) + .setContinuousDiscretizerBucketCount(continuousDiscretizerBucketCount) + .setParallelism(parallelism) + .setTargetInteractionPercentage(targetInteractionPercentage) + .generateCandidates(data, nominalFields, continuousFields) + + } + + def interactionPipeline( + data: DataFrame, + nominalFields: Array[String], + continuousFields: Array[String], + modelingType: String, + retentionMode: String, + labelCol: String, + featureCol: String, + continuousDiscretizerBucketCount: Int, + parallelism: Int, + targetInteractionPercentage: Double + ): PipelineInteractionOutput = { + new FeatureInteraction(modelingType, retentionMode) + .setLabelCol(labelCol) + .setContinuousDiscretizerBucketCount(continuousDiscretizerBucketCount) + .setParallelism(parallelism) + 
.setTargetInteractionPercentage(targetInteractionPercentage) + .createPipeline(data, nominalFields, continuousFields, featureCol) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/FeatureInteractionBase.scala b/src/main/scala/com/databricks/labs/automl/feature/FeatureInteractionBase.scala new file mode 100644 index 00000000..1073486c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/FeatureInteractionBase.scala @@ -0,0 +1,206 @@ +package com.databricks.labs.automl.feature + +import com.databricks.labs.automl.exceptions.ModelingTypeException +import com.databricks.labs.automl.feature.structures._ +import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.col + +trait FeatureInteractionBase { + import com.databricks.labs.automl.feature.structures.FieldEncodingType._ + import com.databricks.labs.automl.feature.structures.InteractionRetentionMode._ + import com.databricks.labs.automl.feature.structures.ModelingType._ + + private final val allowableModelTypes = Array("classifier", "regressor") + private final val allowableFieldTypes = Array("nominal", "continuous") + private final val allowableRetentionModes = + Array("optimistic", "strict", "all") + + final val AGGREGATE_COLUMN: String = "totalCount" + final val COUNT_COLUMN: String = "count" + final val RATIO_COLUMN: String = "labelRatio" + final val TOTAL_RATIO_COLUMN: String = "totalRatio" + final val ENTROPY_COLUMN: String = "entropy" + final val FIELD_ENTROPY_COLUMN: String = "fieldEntropy" + final val QUANTILE_THRESHOLD: Double = 0.5 + final val QUANTILE_PRECISION: Double = 0.95 + final val VARIANCE_STATISTIC: String = "stddev" + final val INDEXED_SUFFIX: String = "_si" + + protected[feature] def getModelType( + modelingType: String + ): ModelingType.Value = { + modelingType match { + case "regressor" => Regressor + case "classifier" => Classifier 
+ case _ => throw ModelingTypeException(modelingType, allowableModelTypes) + } + } + + protected[feature] def getFieldType( + fieldType: String + ): FieldEncodingType.Value = { + fieldType match { + case "nominal" => Nominal + case "continuous" => Continuous + case _ => throw ModelingTypeException(fieldType, allowableFieldTypes) + } + } + + protected[feature] def getRetentionMode( + retentionMode: String + ): InteractionRetentionMode.Value = { + retentionMode match { + case "optimistic" => Optimistic + case "strict" => Strict + case "all" => All + case _ => + throw ModelingTypeException(retentionMode, allowableRetentionModes) + } + } + + /** + * Method for generating a collection of Interaction Candidates to be tested and applied to the feature set + * if the tests for inclusion pass. + * @param featureColumns List of the columns that make up the feature vector + * @return Array of InteractionPayload values. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + protected[feature] def generateInteractionCandidates( + featureColumns: Array[ColumnTypeData] + ): Array[InteractionPayload] = { + val colIdx = featureColumns.zipWithIndex + colIdx.flatMap { + case (x, i) => + val maxIdx = colIdx.length + for (j <- Range(i + 1, maxIdx)) yield { + InteractionPayload( + x.name, + x.dataType, + colIdx(j)._1.name, + colIdx(j)._1.dataType, + s"i_${x.name}_${colIdx(j)._1.name}" + ) + } + } + } + + /** + * Method for evaluating the percentage change to the score metric to normalize. 
+ * @param before Score of a parent feature + * @param after Score of an interaction feature + * @return the percentage change + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + protected[feature] def calculatePercentageChange(before: Double, + after: Double): Double = { + + (after - before) / math.abs(before) * 100.0 + + } + + /** + * Method for generating a product interaction between feature columns + * @param df A DataFrame to add a field for an interaction between two columns + * @param candidate InteractionPayload information about the two parent columns and the name of the new interaction column to be created. + * @return A modified DataFrame with the new column. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + protected[feature] def interactProduct( + df: DataFrame, + candidate: InteractionPayload + ): DataFrame = { + + df.withColumn( + candidate.outputName, + col(candidate.left) * col(candidate.right) + ) + + } + + /** + * Method for converting nominal interaction fields to a new StringIndexed value to preserve information type and + * eliminate the possibility of data distribution skew + * @param payload FeatureInteractionCollection of the source parents and their interacted children fields + * @return NominalDataCollection payload containing a DataFrame that has new StringIndexed fields for nominal + * interactions and the fields that need to be seen as included in the final feature vector + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + protected[feature] def generateNominalIndexesInteractionFields( + payload: FeatureInteractionCollection + ): NominalDataCollection = { + + // Check for nominal data types on interactions + + val parsedNames = payload.interactionPayload + .map( + x => + (x.rightDataType, x.leftDataType) match { + case ("nominal", "nominal") => + NominalIndexCollection(x.outputName, indexCheck = true) + case _ => NominalIndexCollection(x.outputName, indexCheck = false) + } + ) + + val nominalFields = 
parsedNames + .filter(x => x.indexCheck) + .map(x => x.name) + + // String Index these fields + + val indexers = nominalFields.map { x => + new StringIndexer() + .setHandleInvalid("keep") + .setInputCol(x) + .setOutputCol(x + INDEXED_SUFFIX) + } + + val pipeline = new Pipeline() + .setStages(indexers) + .fit(payload.data) + + val adjustedFieldsToIncludeInVector = parsedNames.map { x => + if (x.indexCheck) x.name + INDEXED_SUFFIX + else x.name + } + + NominalDataCollection( + pipeline.transform(payload.data), + adjustedFieldsToIncludeInVector, + nominalFields, + indexers + ) + + } + + /** + * Helper method for recreating the feature vector after interactions have been completed on individual columns + * @param df DataFrame containing the interacted fields with the original feature vector dropped + * @param preInteractedFields Fields making up the original vector before interaction + * @param interactedFields Interaction candidate fields that have been selected to be included in the final feature vector + * @param featureCol Name of the feature vector field + * @return DataFrame with a new feature vector. 
+ * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + protected[feature] def regenerateFeatureVector( + df: DataFrame, + preInteractedFields: Array[String], + interactedFields: Array[String], + featureCol: String + ): VectorAssemblyOutput = { + + val assembler = new VectorAssembler() + .setInputCols(preInteractedFields ++ interactedFields) + .setOutputCol(featureCol) + + VectorAssemblyOutput(assembler, assembler.transform(df)) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/KSampling.scala b/src/main/scala/com/databricks/labs/automl/feature/KSampling.scala new file mode 100644 index 00000000..b5de56c8 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/KSampling.scala @@ -0,0 +1,842 @@ +package com.databricks.labs.automl.feature + +import java.util.{Calendar, Date} + +import com.databricks.labs.automl.feature.structures.{ + CentroidVectors, + RowGenerationConfig, + RowMapping, + SchemaDefinitions, + SchemaMapping, + StructMapping +} +import org.apache.spark.ml.clustering.{KMeans, KMeansModel} +import org.apache.spark.ml.feature.{ + MaxAbsScaler, + MinHashLSH, + MinHashLSHModel, + VectorAssembler +} +import org.apache.spark.sql.expressions.Window +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.{Column, DataFrame, Row} + +import scala.collection.mutable.ListBuffer + +class KSampling(df: DataFrame) extends KSamplingBase { + + /** + * Build a KMeans model in order to find centroids for data simulation + * @param data The source DataFrame, consisting of the feature fields and a vector column + * @return KMeansModel that will be used to extract the centroid vectors + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def buildKMeans(data: DataFrame): KMeansModel = { + + val model = new KMeans() + .setK(conf.kGroups) + .setSeed(conf.kMeansSeed) + .setFeaturesCol(conf.featuresCol) + .setDistanceMeasure(conf.kMeansDistanceMeasurement) + 
.setPredictionCol(conf.kMeansPredictionCol) + .setTol(conf.kMeansTolerance) + .setMaxIter(conf.kMeansMaxIter) + + model.fit(data) + } + + /** + * Build a MinHashLSH Model so that approximate nearest neighbors can be used to find adjacent vectors to a given + * centroid vector + * @param data The source DataFrame, consisting of the feature fields and a vector column + * @return MinHashLSHModel that will be used to generate distances between centroids and a provided vector + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def buildLSH(data: DataFrame): MinHashLSHModel = { + + val model = new MinHashLSH() + .setNumHashTables(conf.lshHashTables) + .setSeed(conf.lshSeed) + .setInputCol(conf.featuresCol) + .setOutputCol(conf.lshOutputCol) + + model.fit(data) + } + + /** + * Method for scaling the feature vector to enable better K Means performance for highly unbalanced vectors + * @param data The 'raw' vector assembled dataframe + * @return A DataFrame that has the feature vector scaled as a replacement. + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def scaleFeatureVector(data: DataFrame): DataFrame = { + + val renamedCol = conf.featuresCol + "_f" + + val renamedData = data.withColumnRenamed(conf.featuresCol, renamedCol) + + // Initialize the Scaler + val scaler = new MaxAbsScaler() + .setInputCol(renamedCol) + .setOutputCol(conf.featuresCol) + + // Create the scaler model + val scalerModel = scaler.fit(renamedData) + + // Apply the scaler and replace the feature vector with scaled features + scalerModel.transform(renamedData).drop(renamedCol) + + } + + /** + * Method for getting representative rows that are closest to the calculated centroid positions. + * + * @param data The DataFrame that has been transformed by the LSH Model + * @param lshModel a fit MinHashLSHModel + * @param kModel a fit KMeansModel + * @return Array of CentroidVectors that contains the vector and the KMeans Group that it is assigned to. 
+ * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def acquireNearestVectorToCentroids( + data: DataFrame, + lshModel: MinHashLSHModel, + kModel: KMeansModel + ): Array[CentroidVectors] = { + + val centerCandidates = kModel.clusterCenters + .map { x => + lshModel + .approxNearestNeighbors(kModel.transform(data), x, conf.quorumCount) + .toDF + } + .reduce(_ union _) + .distinct + .withColumn( + "rank", + dense_rank.over( + Window + .partitionBy(col(conf.kMeansPredictionCol)) + .orderBy(col("distCol"), col(conf.featuresCol)) + ) + ) + .where(col("rank") === 1) + .drop("rank") + + centerCandidates + .select(col(conf.featuresCol), col(conf.kMeansPredictionCol)) + .collect() + .map { x => + CentroidVectors( + x.getAs[org.apache.spark.ml.linalg.Vector](conf.featuresCol), + x.getAs[Int](conf.kMeansPredictionCol) + ) + } + + } + + /** + * Method for retrieving the feature vectors that are closest in n-dimensional space to the provided vector + * @param data The transformed data (transformed from KMeans) + * @param lshModel a fit MinHashLSHModel + * @param vectorCenter The vector and the kGroup that is closest to that group's centroid + * @param targetCount The desired number of closest neighbors to find + * @return A DataFrame consisting of the rows that are closest to the supplied vector nearest the centroid + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def acquireNeighborVectors(data: DataFrame, + lshModel: MinHashLSHModel, + vectorCenter: CentroidVectors, + targetCount: Int): DataFrame = { + + lshModel + .approxNearestNeighbors( + data.filter(col(conf.kMeansPredictionCol) === vectorCenter.kGroup), + vectorCenter.vector, + targetCount + ) + .toDF + + } + + /** + * Method for converting the Row object to a map of key/value pairs + * + * @param row a row of data + * @return a Map of key/value pairs for the feature columns of the row. 
+ * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def getRowAsMap( + row: org.apache.spark.sql.Row + ): Map[String, Any] = { + val rowSchema = row.schema.fieldNames + row.getValuesMap[Any](rowSchema) + } + + /** + * Method for mutating between two row values of all features in the rows. + * A mutation value is set to provide a ratio between the min and max values. + * @param first a value to mix with the second variable + * @param second a value to mix with the first variable + * @param mutationValue the ratio of mixing between the two variables. + * @return The scaled value between first and second + * @author Ben Wilson + * @since 0.5.1 + */ + def mutateValueFixed(first: Double, + second: Double, + mutationValue: Double): Double = { + + val minVal = scala.math.min(first, second) + val maxVal = scala.math.max(first, second) + (minVal * mutationValue) + (maxVal * (1 - mutationValue)) + } + + /** + * Method for mutating between row values with a fixed ratio value + * @param first a value to mix + * @param second a value to mix + * @param mutationValue ratio modifier between the two values + * @return the mutated value + * @author Ben Wilson + * @since 0.5.1 + */ + def ratioValueFixed(first: Double, + second: Double, + mutationValue: Double): Double = { + (first + second) * mutationValue + } + + /** + * Method for randomly mutating between the bounds of two values + * @param first a value to mix + * @param second a value to mix + * @return the randomly mutated value + * @author Ben Wilson + * @since 0.5.1 + */ + def mutateValueRandom(first: Double, second: Double): Double = { + val rand = scala.util.Random + mutateValueFixed(first, second, rand.nextDouble()) + } + + /** + * Method for converting Integer Types to Double Types + * @param x Numeric: Integer or Double + * @return Option Double type conversion + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def toDoubleType(x: Any): Option[Double] = x match { + case i: Int => Some(i) + case 
f: Float => Some(f) + case l: Long => Some(l) + case d: Double => Some(d) + case _ => None + } + + /** + * Method for modifying a row of feature data by finding a linear point along the + * vector between the features. + * Options for modification of a vector include: + * - Full Vector mutation based on a provided ratio, random modification, or a weighted average + * - Partial Random Vector mutation of a random number of index sites, bound by a lower limit + * - Fixed Vector mutation of random indexes at a constant count of indexes to modify + * @param originalRow The Map() representation of one of the vectors + * @param mutateRow The Map() representation of the other vector + * @param indexMapping Vector definition of the payload (field name and index) + * @param indexesToMutate A list of Integers of index positions within the vector to mutate + * @param mode The method of mutation (weighted average, ratio, or random mutation) + * @param mutation The magnitude of % share of the value from the centroid-adjacent vector + * to the other vector. A higher percentage value will be closer in euclidean + * distance to the centroid vector. 
+ * @return A List of Doubles of the feature vector modifications (data for a new row) + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def mutateRow(originalRow: Map[String, Any], + mutateRow: Map[String, Any], + indexMapping: Array[RowMapping], + indexesToMutate: List[Int], + mode: String, + mutation: Double): List[Double] = { + + val outputRow = ListBuffer[Double]() + + indexMapping.foreach { r => + val origData = originalRow(r.fieldName).toString.toDouble + val mutateData = mutateRow(r.fieldName).toString.toDouble + + if (indexesToMutate.contains(r.idx)) { + + mode match { + case "weighted" => + outputRow += mutateValueFixed(origData, mutateData, mutation) + case "random" => outputRow += mutateValueRandom(origData, mutateData) + case "ratio" => + outputRow += ratioValueFixed(origData, mutateData, mutation) + } + + } else outputRow += origData + + } + + outputRow.toList + + } + + /** + * Method for generating a random collection of random-counts of vectors for mutation + * @param vectorSize The size of the feature vector components (columns) + * @param minimumCount The minimum number of vectors that can be selected for mutation + * - used to ensure that at least some values will be modified. + * @return The list of indexes that will be modified through averaging along the vector between + * the centroid and a chosen feature vector + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def generateRandomIndexPositions( + vectorSize: Int, + minimumCount: Int + ): List[Int] = { + val candidateList = List.range(0, vectorSize) + val restrictionSize = scala.util.Random.nextInt(vectorSize) + val adjustedSize = + if (restrictionSize < minimumCount) minimumCount else restrictionSize + scala.util.Random.shuffle(candidateList).take(adjustedSize).sortWith(_ < _) + } + + /** + * Method for generating a fixed number of random indexes in the feature vector to manipulate. 
+ * @param vectorSize The size of the feature vector + * @param minimumCount The number of features to mutate which are randomly selected. + * @note if the value specified in .setMinimumVectorCountToMutate is greater than the + * feature vector size, all indexes will be mutated. + * @return The list of indexes selected for mutation + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def generateFixedIndexPositions( + vectorSize: Int, + minimumCount: Int + ): List[Int] = { + val candidateList = List.range(0, vectorSize) + val adjustedSize = + if (minimumCount > vectorSize) vectorSize else minimumCount + scala.util.Random.shuffle(candidateList).take(adjustedSize).sortWith(_ < _) + } + + /** + * Builder method for controlling what type of index selection will be used, returning the list + * of indexes that are selected for mutation + * @param vectorSize The size of the feature vector + * @return The list of indexes selected for mutation + * @author Ben Wilson + * @since 0.5.1 + * @throws IllegalArgumentException() if an invalid entry is made. + */ + @throws(classOf[IllegalArgumentException]) + private[feature] def generateIndexPositions(vectorSize: Int): List[Int] = { + + conf.vectorMutationMethod match { + case "random" => + generateRandomIndexPositions( + vectorSize, + conf.minimumVectorCountToMutate + ) + case "all" => List.range(0, vectorSize) + case "fixed" => + generateFixedIndexPositions(vectorSize, conf.minimumVectorCountToMutate) + case _ => + throw new IllegalArgumentException( + s"Vector Mutation Method ${conf.vectorMutationMethod} is not supported. " + + s"Please use one of: ${allowableVectorMutationMethods.mkString(", ")}" + ) + } + + } + + /** + * Method for generating a collection of synthetic row data, stored as lists of lists of doubles. 
+ * @param nearestNeighborData A Dataframe that has been transformed by a KMeans model + * @param targetCount The target number of synthetic rows to generate + * @param minVectorsToMutate The minimum (or exact) target of vectors to mutate in the feature vector + * @param mutationMode The mutation mode (random, weighted, or Ratio) + * @param mutationValue The value of vector ratio share between the centroid-associated vector and the other vectors + * @return A list of Lists of Doubles (basic DF structure) + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def acquireRowCollections( + nearestNeighborData: DataFrame, + targetCount: Int, + minVectorsToMutate: Int, + mutationMode: String, + mutationValue: Double + ): List[List[Double]] = { + + val mutatedRows = ListBuffer[List[Double]]() + + val colIdx = nearestNeighborData.schema.names.zipWithIndex + .map(x => RowMapping.tupled(x)) + + val kGroupCollection = nearestNeighborData.collect.map(getRowAsMap) + + val (center, others) = kGroupCollection.splitAt(1) + + val centerVector = center(0) + + val vectorLength = centerVector.size + val candidateLength = others.length + + var iter = 0 + var idx = 0 + + val minIndexes = + if (minVectorsToMutate > vectorLength) vectorLength + else minVectorsToMutate + + do { + + val indexesToMutate = generateIndexPositions(vectorLength) + + mutatedRows += mutateRow( + others(idx), + centerVector, + colIdx, + indexesToMutate, + mutationMode, + mutationValue + ) + + iter += 1 + if (idx >= candidateLength - 1) idx = 0 else idx += 1 + + } while (iter < targetCount) + + mutatedRows.toList + + } + + /** + * Helper method for constructing a valid DF schema Struct object wherein all of the numeric columns types are + * converted to DoubleType + * @param data The DataFrame that contains various numeric type data columns + * @param fieldsToExclude Fields that should not be included in the conversion and subsequent schema definition + * @note This function allows for casting the mutated 
return value of acquireRowCollections to a DataFrame since + * the return types of that collection are all of DoubleType. + * @return A DoubleType encoded Schema + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def generateDoublesSchema( + data: DataFrame, + fieldsToExclude: List[String] + ): StructType = { + + val baseStruct = new StructType() + + val baseSchema = data + .drop(fieldsToExclude: _*) + .schema + .names + .flatMap(x => baseStruct.add(x, DoubleType, nullable = false)) + + DataTypes.createStructType(baseSchema) + } + + /** + * Method for converting the raw `List[List[Double]]` collection from the data generator method acquireRowCollections + * to a DataFrame object + * @param collections The collection of synthetic feature data in `List[List[Double]]` format + * @param schema The Double-formatted schema from the helper method generateDoublesSchema + * @return A DataFrame that has all numeric types of feature columns as DoubleType. + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def convertCollectionsToDataFrame( + collections: List[List[Double]], + schema: StructType + ): DataFrame = { + spark.createDataFrame(sc.makeRDD(collections.map(x => Row(x: _*))), schema) + } + + /** + * Method for generating the Group Vector Centroids that are used for generating the mutated feature rows. 
+ * @param clusteredData KMeans transformed DataFrame + * @param centroids Array of the Centroid Vector data for look-up purposes through MinHashLSH + * @param lshModel The trained MinHashLSH Model + * @param labelCol The label Column of the DataFrame + * @param labelGroup The canonical class of the label Column for which data is to be generated + * @param targetCount The desired number of rows of synthetic data from the class to generate + * @param fieldsToDrop Fields to be ignored from the DataFrame (that are not part of the feature vector) + * @tparam T Numeric Type of the Label column + * @return A DataFrame of the rows that are closest to the K cluster centroids. + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def generateGroupVectors[T: scala.math.Numeric]( + clusteredData: DataFrame, + centroids: Array[CentroidVectors], + lshModel: MinHashLSHModel, + labelCol: String, + labelGroup: T, + targetCount: Int, + fieldsToDrop: List[String] + ): DataFrame = { + + val rowsToGeneratePerGroup = + scala.math.ceil(targetCount.toDouble / centroids.length).toInt + + val calculatedRowsToGenerate = + if (rowsToGeneratePerGroup < 1) 1 else rowsToGeneratePerGroup + + centroids + .map { x => + acquireNeighborVectors( + clusteredData.filter(col(labelCol) === labelGroup), + lshModel, + x, + calculatedRowsToGenerate + ) + } + .reduce(_.union(_)) + .drop(fieldsToDrop: _*) + .limit(targetCount) + + } + + /** + * Helper Method for converting from Spark DataType to scala types. 
+ * @param sparkType The spark type of a column + * @return a string representation of the native scala type + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def sparkToScalaTypeConversion( + sparkType: DataType + ): String = { + + sparkType match { + case x: ByteType => "Byte" + case x: ShortType => "Short" + case x: LongType => "Long" + case x: FloatType => "Float" + case x: StringType => "String" + case x: BinaryType => "Array[Byte]" + case x: BooleanType => "Boolean" + case x: TimestampType => "java.sql.Timestamp" + case x: DateType => "java.sql.Date" + case x: IntegerType => "Int" + case x: DoubleType => "Double" + case _ => "Unknown" + } + + } + + def scalaToSparkTypeConversion(scalaType: String): DataType = { + scalaType match { + case "Byte" => ByteType + case "Short" => ShortType + case "Long" => LongType + case "Float" => FloatType + case "String" => StringType + case "Array[Byte]" => BinaryType + case "Boolean" => BooleanType + case "java.sql.Timestamp" => TimestampType + case "java.sql.Date" => DateType + case "Int" => IntegerType + case "Double" => DoubleType + } + } + + /** + * Method for generating the original DataFrame's schema information, storing it in a collection that defines + * both the full starting schema types as well as information about the feature vector that was passed in. 
+ * @param fullSchema The full schema as a StructType collection from the input DataFrame + * @param fieldsNotInVector Array of field names to ignore + * @return SchemaDefinitions collection payload + * @author Ben Wilson + * @since 0.5.1 + */ + private[feature] def generateSchemaInformationPayload( + fullSchema: StructType, + fieldsNotInVector: Array[String] + ): SchemaDefinitions = { + + val schemaMapped: Seq[StructMapping] = + fullSchema.zipWithIndex.map(x => StructMapping.tupled(x)) + + // Extract the schema as a case class collection for further manipulations + val allFields: Seq[SchemaMapping] = schemaMapped.map { x => + SchemaMapping( + fieldName = x.field.name, + originalFieldIndex = x.idx, + dfType = x.field.dataType, + scalaType = sparkToScalaTypeConversion(x.field.dataType) + ) + } + + // Get only the fields involved in the feature vector + val featureFields: Seq[RowMapping] = schemaMapped + .filterNot(x => fieldsNotInVector.contains(x.field.name)) + .map(x => RowMapping(x.field.name, x.idx)) + + SchemaDefinitions(allFields.toArray, featureFields.toArray) + + } + + /** + * Helper method for converting the generated data DataFrame's types to the original types. + * @note This is critical in order to be able to re-join the data back to the original DataFrame with the correct + * types for each field. + * @param dataFrame The synthetic DataFrame with all fields as DoubleType + * @param schemaPayload the original DataFrame's type information payload + * @return a converted types DataFrame + */ + private[feature] def castColumnsToCorrectTypes( + dataFrame: DataFrame, + schemaPayload: SchemaDefinitions + ): DataFrame = { + + // Extract the fields and types that are part of the feature vector. + val featureFieldsPayload = schemaPayload.features + .map(x => x.fieldName) + .flatMap(y => schemaPayload.fullSchema.filter(z => y == z.fieldName)) + + // Perform casting by applying the original DataTypes to the feature vector fields. 
+ featureFieldsPayload + .foldLeft(dataFrame) { + case (accum, x) => + accum.withColumn(x.fieldName, dataFrame(x.fieldName).cast(x.dfType)) + } + + } + + /** + * Method for filling in with dummy data any field that was not part of the feature vector + * @param dataFrame Input DataFrame with feature fields + * @param schemaPayload schema definitions from original dataframe + * @param featureCol name of the feature field + * @param labelCol name of the label field + * @return the dataframe with any of the missing fields populated with dummy data of the correct type. + */ + private[feature] def fillMissingColumns(dataFrame: DataFrame, + schemaPayload: SchemaDefinitions, + featureCol: String, + labelCol: String): DataFrame = { + + // Get a roster of all current fields, plus the feature and label column names + val currentSchema = dataFrame.schema.names ++ Array(featureCol, labelCol) + + // Find the fields that don't exist yet, excluding the feature and label columns from the manifest. + val fieldsToAdd = schemaPayload.fullSchema + .map(x => x.fieldName) + .filterNot(y => currentSchema.contains(y)) + + // Get the definition of the fields that need to be added. + val fieldsToAddDefinition = + schemaPayload.fullSchema.filter(x => fieldsToAdd.contains(x.fieldName)) + + fieldsToAddDefinition.foldLeft(dataFrame) { + case (accum, x) => + accum.withColumn(x.fieldName, lit(x.dfType match { + case _: IntegerType => defaultFill(x.dfType).toString.toInt + case _: DoubleType => defaultFill(x.dfType).toString.toDouble + // TODO: fill out the rest of this. And split this out into its own method. 
+ })) + } + + } + + /** + * Method for rebuilding the feature vector in the same manner as the original DataFrame's feature vector + * @param dataFrame The synthetic data DataFrame + * @param featureFields The indexed feature fields for re-creating the original DataFrame's feature vector + * @return The synthetic DataFrame with an added feature vector column + */ + private[feature] def rebuildFeatureVector( + dataFrame: DataFrame, + featureFields: Array[RowMapping] + ): DataFrame = { + + val assembler = new VectorAssembler() + .setInputCols(featureFields.map(_.fieldName)) + .setOutputCol(conf.featuresCol) + + assembler.transform(dataFrame.drop(conf.featuresCol)) + } + + case class MapTypeVal(colName: String, colValue: Column) + + private def addDummyDataForIgnoredColumns( + dataframe: DataFrame, + fieldsToIgnore: Array[StructField] + ): DataFrame = { + var newDataFrame: DataFrame = dataframe + + val dummyDate = new Date() + val dummyTime = Calendar.getInstance().getTime + + fieldsToIgnore + .map( + item => + item.dataType match { + case StringType => MapTypeVal(item.name, lit("DUMMY")) + case IntegerType => MapTypeVal(item.name, lit(0)) + case DoubleType => MapTypeVal(item.name, lit(0.0)) + case FloatType => MapTypeVal(item.name, lit(0.0f)) + case LongType => MapTypeVal(item.name, lit(0L)) + case ByteType => MapTypeVal(item.name, lit(0.toByte)) + case BooleanType => MapTypeVal(item.name, lit(false)) + case BinaryType => MapTypeVal(item.name, lit("DUMMY".getBytes)) + case DateType => MapTypeVal(item.name, lit(dummyDate)) + case TimestampType => MapTypeVal(item.name, lit(dummyTime)) + case _ => + throw new UnsupportedOperationException( + s"Field '${item.name}' is of type ${item.dataType}, which is not supported." 
+ ) + } + ) + .foreach { m: MapTypeVal => + newDataFrame = newDataFrame.withColumn(m.colName, m.colValue) + } + + newDataFrame + } + + /** + * Main Method for generating synthetic data + * @param labelValues Array[RowGenerationConfig] for specifying which categorical labels and the target counts to + * generate data for + * @return A synthetic data DataFrame with an added field for specifying that this data is synthetic in nature. + */ + def makeRows(labelValues: Array[RowGenerationConfig]): DataFrame = { + + val collectedFieldsToIgnore = conf.fieldsToIgnore ++ Array( + conf.featuresCol, + conf.labelCol + ) + + // Get the schema information + val ignoredFieldsTypes = + df.schema.fields.filter(field => conf.fieldsToIgnore.contains(field.name)) + val origSchema = df.schema.names + val schemaMappings = + generateSchemaInformationPayload(df.schema, collectedFieldsToIgnore) + val labelColumnType = + schemaMappings.fullSchema + .filter(x => x.fieldName == _labelCol) + .head + .dfType + + val doublesSchema = + generateDoublesSchema(df, collectedFieldsToIgnore.toList) + + // Scale the feature vector + val scaled = scaleFeatureVector(df) + + // Build a KMeans Model + val kModel = buildKMeans(scaled) + + // Build a MinHashLSHModel + val lshModel = buildLSH(scaled) + + // Transform the scaled data with the KMeans model + val kModelData = kModel.transform(scaled) + + // Get the original partition count + val sourcePartitions = df.rdd.partitions.length + + val returnfinalDf = labelValues + .map { x => + val vecs = acquireNearestVectorToCentroids( + scaled.filter(col(conf.labelCol) === x.labelValue), + lshModel, + kModel + ) + val groupData = generateGroupVectors( + kModelData, + vecs, + lshModel, + conf.labelCol, + x.labelValue, + x.targetCount, + fieldsToDrop + ) + val rowCollections = acquireRowCollections( + groupData.drop(conf.featuresCol), + x.targetCount, + conf.minimumVectorCountToMutate, + conf.mutationMode, + conf.mutationValue + ) + val convertedDF = + 
convertCollectionsToDataFrame(rowCollections, doublesSchema) + val finalDF = castColumnsToCorrectTypes(convertedDF, schemaMappings) + // rebuild the feature vector + rebuildFeatureVector(finalDF, schemaMappings.features) + .withColumn(conf.labelCol, lit(x.labelValue)) + } + .reduce(_.unionByName(_)) + .toDF() + .repartition(sourcePartitions) + + addDummyDataForIgnoredColumns(returnfinalDf, ignoredFieldsTypes) + .select(origSchema map col: _*) + .withColumn(conf.syntheticCol, lit(true)) + .withColumn(_labelCol, col(_labelCol).cast(labelColumnType)) + + } + +} + +object KSampling extends KSamplingBase { + + def apply(data: DataFrame, + labelValues: Array[RowGenerationConfig], + featuresCol: String, + labelsCol: String, + syntheticCol: String, + fieldsToIgnore: Array[String], + kGroups: Int, + kMeansMaxIter: Int, + kMeansTolerance: Double, + kMeansDistanceMeasurement: String, + kMeansSeed: Long, + kMeansPredictionCol: String, + lshHashTables: Int, + lshSeed: Long, + lshOutputCol: String, + quorumCount: Int, + minimumVectorCountToMutate: Int, + vectorMutationMethod: String, + mutationMode: String, + mutationValue: Double): DataFrame = + new KSampling(data) + .setFeaturesCol(featuresCol) + .setLabelCol(labelsCol) + .setSyntheticCol(syntheticCol) + .setFieldsToIgnore(fieldsToIgnore) + .setKGroups(kGroups) + .setKMeansMaxIter(kMeansMaxIter) + .setKMeansTolerance(kMeansTolerance) + .setKMeansDistanceMeasurement(kMeansDistanceMeasurement) + .setKMeansSeed(kMeansSeed) + .setKMeansPredictionCol(kMeansPredictionCol) + .setLSHHashTables(lshHashTables) + .setLSHOutputCol(lshOutputCol) + .setQuorumCount(quorumCount) + .setMinimumVectorCountToMutate(minimumVectorCountToMutate) + .setVectorMutationMethod(vectorMutationMethod) + .setMutationMode(mutationMode) + .setMutationValue(mutationValue) + .makeRows(labelValues) + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/KSamplingBase.scala b/src/main/scala/com/databricks/labs/automl/feature/KSamplingBase.scala new 
file mode 100644 index 00000000..e3527261 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/KSamplingBase.scala @@ -0,0 +1,351 @@ +package com.databricks.labs.automl.feature + +import com.databricks.labs.automl.feature.structures.{ + KSamplingConfiguration, + KSamplingDefaults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper + +trait KSamplingBase extends KSamplingDefaults with SparkSessionWrapper { + + final private[feature] val allowableKMeansDistanceMeasurements: List[String] = + List("cosine", "euclidean") + final private[feature] val allowableMutationModes: List[String] = + List("weighted", "random", "ratio") + final private[feature] val allowableVectorMutationMethods: List[String] = + List("random", "fixed", "all") + + private[feature] var _featuresCol: String = defaultFeaturesCol + private[feature] var _labelCol: String = defaultLabelCol + private[feature] var _syntheticCol: String = defaultSyntheticCol + private[feature] var _fieldsToIgnore: Array[String] = defaultFieldsToIgnore + private[feature] var _kGroups: Int = defaultKGroups + private[feature] var _kMeansMaxIter: Int = defaultKMeansMaxIter + private[feature] var _kMeansTolerance: Double = defaultKMeansTolerance + private[feature] var _kMeansDistanceMeasurement: String = + defaultKMeansDistanceMeasurement + private[feature] var _kMeansSeed: Long = defaultKMeansSeed + private[feature] var _kMeansPredictionCol: String = defaultKMeansPredictionCol + private[feature] var _lshHashTables = defaultHashTables + private[feature] var _lshSeed = defaultLSHSeed + private[feature] var _lshOutputCol = defaultLSHOutputCol + private[feature] var _quorumCount = defaultQuorumCount + private[feature] var _minimumVectorCountToMutate = + defaultMinimumVectorCountToMutate + private[feature] var _vectorMutationMethod = defaultVectorMutationMethod + private[feature] var _mutationMode = defaultMutationMode + private[feature] var _mutationValue = defaultMutationValue + + private[feature] var 
conf = getKSamplingConfig + + /** + * Setter for the Feature Column name of the input DataFrame + * @param value String: name of the feature vector column + * @return this + */ + def setFeaturesCol(value: String): this.type = { + _featuresCol = value; setConfig; this + } + + /** + * Setter for the Label Column name of the input DataFrame + * @param value String: name of the label column + * @return this + */ + def setLabelCol(value: String): this.type = { + _labelCol = value + setConfig + this + } + + /** + * Setter for the name to be used for the synthetic column flag that is attached to the output dataframe as an + * indication that the data present is generated and not original. + * @param value String: name to be used throughout the job to delineate the fact that the data in the row is + * generated. + * @return this + */ + def setSyntheticCol(value: String): this.type = { + _syntheticCol = value + setConfig + this + } + + /** + * Setter to provide a listing of any fields that are intended to be ignored in the generated dataframe + * @param value Array[String]: field names to ignore in the data generation aspect + * @return this + */ + def setFieldsToIgnore(value: Array[String]): this.type = { + _fieldsToIgnore = value + setConfig + this + } + + /** + * Setter for specifying the number of K-Groups to generate in the KMeans model + * @param value Int: number of k groups to generate + * @return this + */ + def setKGroups(value: Int): this.type = { + _kGroups = value + setConfig + this + } + + /** + * Setter for specifying the maximum number of iterations for the KMeans model to go through to converge + * @param value Int: Maximum limit on iterations + * @return this + */ + def setKMeansMaxIter(value: Int): this.type = { + _kMeansMaxIter = value + setConfig + this + } + + /** + * Setter for Setting the tolerance for KMeans (must be >0) + * @param value The tolerance value setting for KMeans + * @see reference: 
[[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans]] + * for further details. + * @return this + * @throws IllegalArgumentException() if a value less than or equal to 0 is entered + */ + @throws(classOf[IllegalArgumentException]) + def setKMeansTolerance(value: Double): this.type = { + require( + value > 0, + s"KMeans tolerance value ${value.toString} is out of range. Must be > 0." + ) + _kMeansTolerance = value + setConfig + this + } + + /** + * Setter for which distance measurement to use to calculate the nearness of vectors to a centroid + * @param value String: Options -> "euclidean" or "cosine" Default: "euclidean" + * @return this + * @throws IllegalArgumentException() if an invalid value is entered + */ + @throws(classOf[IllegalArgumentException]) + def setKMeansDistanceMeasurement(value: String): this.type = { + require( + allowableKMeansDistanceMeasurements.contains(value), + s"KMeans Distance Measurement $value is not " + + s"a valid mode of operation. 
Must be one of: ${allowableKMeansDistanceMeasurements.mkString(", ")}" + ) + _kMeansDistanceMeasurement = value + setConfig + this + } + + /** + * Setter for a KMeans seed for the clustering algorithm + * @param value Long: Seed value + * @return this + */ + def setKMeansSeed(value: Long): this.type = { + _kMeansSeed = value + setConfig + this + } + + /** + * Setter for the internal KMeans column for cluster membership attribution + * @param value String: column name for internal algorithm column for group membership + * @return this + */ + def setKMeansPredictionCol(value: String): this.type = { + _kMeansPredictionCol = value + setConfig + this + } + + /** + * Setter for Configuring the number of Hash Tables to use for MinHashLSH + * @param value Int: Count of hash tables to use + * @see [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH]] + * for more information + * @return this + */ + def setLSHHashTables(value: Int): this.type = { + _lshHashTables = value + setConfig + this + } + + /** + * Setter for a MinHashLSH seed value for the model. 
+ * @param value Long: a seed value + * @return this + */ + def setLSHSeed(value: Long): this.type = { + _lshSeed = value + setConfig + this + } + + /** + * Setter for the internal LSH output hash information column + * @param value String: column name for the internal MinHashLSH Model transformation value + * @return this + */ + def setLSHOutputCol(value: String): this.type = { + _lshOutputCol = value + setConfig + this + } + + /** + * Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data + * @note the higher the value set here, the higher the variance in synthetic data generation + * @param value Int: Number of vectors to find nearest each centroid within the class + * @return this + */ + def setQuorumCount(value: Int): this.type = { + _quorumCount = value + setConfig + this + } + + /** + * Setter for minimum threshold for vector indexes to mutate within the feature vector. + * @note In vectorMutationMethod "fixed" this sets the fixed count of how many vector positions to mutate. + * In vectorMutationMethod "random" this sets the lower threshold for 'at least this many indexes will + * be mutated' + * @param value The minimum (or fixed) number of indexes to mutate. + * @return this + */ + def setMinimumVectorCountToMutate(value: Int): this.type = { + _minimumVectorCountToMutate = value + setConfig + this + } + + /** + * Setter for the Vector Mutation Method + * @note Options: + * "fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes. + * "random" - will use this number as a lower bound on a random selection of indexes between this and the vector length. + * "all" - will mutate all of the vectors. + * @param value String - the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. 
+ */ + @throws(classOf[IllegalArgumentException]) + def setVectorMutationMethod(value: String): this.type = { + require( + allowableVectorMutationMethods.contains(value), + s"Vector Mutation Mode $value is not supported. " + + s"Must be one of: ${allowableVectorMutationMethods.mkString(", ")} " + ) + _vectorMutationMethod = value + setConfig + this + } + + /** + * Setter for the Mutation Mode of the feature vector individual values + * @note Options: + * "weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors + * "random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors + * "ratio" - uses a ratio between the values of the centroid vector and the mutation vector * + * @param value String: the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setMutationMode(value: String): this.type = { + require( + allowableMutationModes.contains(value), + s"Mutation Mode $value is not a valid mode of operation. " + + s"Must be one of: ${allowableMutationModes.mkString(", ")}" + ) + _mutationMode = value + setConfig + this + } + + /** + * Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode + * @param value Double: value between 0 and 1 for mutation magnitude adjustment. + * @note the higher this value, the closer to the centroid vector vs. the candidate mutation vector the synthetic row data will be. + * @return this + * @throws IllegalArgumentException() if the value specified is outside of the range (0, 1) + */ + @throws(classOf[IllegalArgumentException]) + def setMutationValue(value: Double): this.type = { + require( + value > 0 & value < 1, + s"Mutation Value must be between 0 and 1. Value $value is not permitted." 
+ ) + _mutationValue = value + setConfig + this + } + + /** + * Private method for setting the configuration instantiation. + * @return this + */ + private def setConfig: this.type = { + conf = KSamplingConfiguration( + featuresCol = _featuresCol, + labelCol = _labelCol, + syntheticCol = _syntheticCol, + fieldsToIgnore = _fieldsToIgnore, + kGroups = _kGroups, + kMeansMaxIter = _kMeansMaxIter, + kMeansTolerance = _kMeansTolerance, + kMeansDistanceMeasurement = _kMeansDistanceMeasurement, + kMeansSeed = _kMeansSeed, + kMeansPredictionCol = _kMeansPredictionCol, + lshHashTables = _lshHashTables, + lshSeed = _lshSeed, + lshOutputCol = _lshOutputCol, + quorumCount = _quorumCount, + minimumVectorCountToMutate = _minimumVectorCountToMutate, + vectorMutationMethod = _vectorMutationMethod, + mutationMode = _mutationMode, + mutationValue = _mutationValue + ) + this + } + + /** + * Public method for returning the current state of the configuration as a new instance of the KSamplingConfiguration + * @return the current state of the KSamplingConfiguration conf + */ + def getKSamplingConfig: KSamplingConfiguration = { + KSamplingConfiguration( + featuresCol = _featuresCol, + labelCol = _labelCol, + syntheticCol = _syntheticCol, + fieldsToIgnore = _fieldsToIgnore, + kGroups = _kGroups, + kMeansMaxIter = _kMeansMaxIter, + kMeansTolerance = _kMeansTolerance, + kMeansDistanceMeasurement = _kMeansDistanceMeasurement, + kMeansSeed = _kMeansSeed, + kMeansPredictionCol = _kMeansPredictionCol, + lshHashTables = _lshHashTables, + lshSeed = _lshSeed, + lshOutputCol = _lshOutputCol, + quorumCount = _quorumCount, + minimumVectorCountToMutate = _minimumVectorCountToMutate, + vectorMutationMethod = _vectorMutationMethod, + mutationMode = _mutationMode, + mutationValue = _mutationValue + ) + } + + /** + * Static method for generating the fields to drop from the interstitial dataframes during the algorithm's execution. 
+ * @return + */ + private[feature] def fieldsToDrop: List[String] = + List(_kMeansPredictionCol, _lshOutputCol, "distCol") + +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/SyntheticFeatureBase.scala b/src/main/scala/com/databricks/labs/automl/feature/SyntheticFeatureBase.scala new file mode 100644 index 00000000..c572da6e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/SyntheticFeatureBase.scala @@ -0,0 +1,12 @@ +package com.databricks.labs.automl.feature + +trait SyntheticFeatureBase extends KSamplingBase { + + final val allowableLabelBalanceModes: List[String] = + List("match", "percentage", "target") + + def defaultCardinalityThreshold: Int = 20 + def defaultLabelBalanceMode: String = "match" + def defaultNumericRatio: Double = 0.2 + def defaultNumericTarget: Int = 500 +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/SyntheticFeatureGenerator.scala b/src/main/scala/com/databricks/labs/automl/feature/SyntheticFeatureGenerator.scala new file mode 100644 index 00000000..ae8a67bb --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/SyntheticFeatureGenerator.scala @@ -0,0 +1,297 @@ +package com.databricks.labs.automl.feature + +import com.databricks.labs.automl.feature.structures.{ + CardinalityPayload, + RowGenerationConfig +} +import com.databricks.labs.automl.feature.tools.LabelValidation +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +class SyntheticFeatureGenerator(data: DataFrame) + extends SparkSessionWrapper + with SyntheticFeatureBase + with KSamplingBase { + + private var _labelBalanceMode: String = defaultLabelBalanceMode + private var _cardinalityThreshold: Int = defaultCardinalityThreshold + private var _numericRatio: Double = defaultNumericRatio + private var _numericTarget: Int = defaultNumericTarget + + /** + * Setter - for determining the label balance approach mode. 
+ * @note Available modes:
+ * 'match': Will match all smaller class counts to largest class count. [WARNING] - May significantly increase memory pressure!
+ * 'percentage' Will adjust smaller classes to a percentage value of the largest class count. + * 'target' Will increase smaller class counts to a fixed numeric target of rows. + * @param value String: one of: 'match', 'percentage' or 'target' + * @note Default: "match" + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException() if the provided mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setLabelBalanceMode(value: String): this.type = { + require( + allowableLabelBalanceModes.contains(value), + s"Label Balance Mode $value is not supported. " + + s"Must be one of: ${allowableLabelBalanceModes.mkString(", ")}" + ) + _labelBalanceMode = value + this + } + + /** + * Setter - for overriding the label cardinality threshold above which an exception is thrown. [WARNING] increasing this value on + * a sufficiently large data set could incur, during runtime, excessive memory and cpu pressure on the cluster. + * @param value Int: the limit above which an exception will be thrown for a classification problem wherein the + * label distinct count is too large to successfully generate synthetic data. + * @note Default: 20 + * @since 0.5.1 + * @author Ben Wilson + */ + def setCardinalityThreshold(value: Int): this.type = { + _cardinalityThreshold = value + this + } + + /** + * Setter - for specifying the percentage ratio for the mode 'percentage' in setLabelBalanceMode() + * @param value Double: A fractional double in the range of 0.0 (exclusive) to 1.0 (inclusive). + * @note Setting this value to 1.0 is equivalent to setting the label balance mode to 'match' + * @note Default: 0.2 + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException() if the provided value is outside of the range of 0.0 -> 1.0 + */ + @throws(classOf[IllegalArgumentException]) + def setNumericRatio(value: Double): this.type = { + require( + value <= 1.0 & value > 0.0, + s"Invalid Numeric Ratio entered! Must be between 0 and 1. " 
+ + s"${value.toString} is not valid." + ) + _numericRatio = value + this + } + + /** + * Setter - for specifying the target row count to generate for 'target' mode in setLabelBalanceMode() + * @param value Int: The desired final number of rows per minority class label + * @note [WARNING] Setting this value to too high a number will greatly increase runtime and memory pressure. + * @since 0.5.1 + * @author Ben Wilson + */ + def setNumericTarget(value: Int): this.type = { + _numericTarget = value + this + } + + /** + * Helper method for detecting the primary class, segregating it, and returning the remaining minority classes + * in a collection + * @param full The entire cardinality result for the data set + * @return A tuple of the most frequent class payload and an Array of the remaining (minority) class payloads + */ + def getMaxAndRest( + full: Array[CardinalityPayload] + ): (CardinalityPayload, Array[CardinalityPayload]) = { + + val sortedValues = full.sortWith(_.labelCounts > _.labelCounts) + + (sortedValues.head, sortedValues.drop(1)) + } + + /** + * Helper method for calculating the targets for all smaller classes for the percentage mode + * @param max The most frequently occurring label + * @param rest The remaining labels + * @return Array[RowGenerationConfig] to supply the candidate target numbers for KSampling + * @throws RuntimeException if the configuration will not result in any KSampling synthetic rows to be generated. + * @since 0.5.1 + * @author Ben Wilson + */ + @throws(classOf[RuntimeException]) + def percentageTargets( + max: CardinalityPayload, + rest: Array[CardinalityPayload] + ): Array[RowGenerationConfig] = { + + val targetValue = max.labelCounts + + if (rest.last.labelCounts > math.floor(targetValue * _numericRatio).toInt) + throw new RuntimeException( + s"The ratio target of label counts for the smallest minority class ${rest.last.labelValue} (count: " + + s"${rest.last.labelCounts}) is already above the target " + + s"threshold value of ${math.floor(targetValue * _numericRatio).toInt}. 
" + + s"Revisit the configuration settings made in setNumericRatio() for KSampling Configuration." + ) + + rest + .map { x => + val targetCounts = math + .floor(targetValue * _numericRatio) + .toInt - x.labelCounts + + if (targetCounts > 0) { + + RowGenerationConfig(x.labelValue, targetCounts) + } else RowGenerationConfig(x.labelValue, 0) + } + .filter(x => x.targetCount > 0) + + } + + /** + * Private method for generating the row count targets for each minority class label for the target mode + * @param max The most frequently occurring label + * @param rest The remaining labels + * @return Array[RowGenerationConfig] to supply the candidate target numbers for KSampling + * @throws RuntimeException if the configuration will not result in any KSampling synthetic rows to be generated. + * @since 0.5.1 + * @author Ben Wilson + */ + def targetValidation( + max: CardinalityPayload, + rest: Array[CardinalityPayload] + ): Array[RowGenerationConfig] = { + + if (rest.last.labelCounts > _numericTarget) + throw new RuntimeException( + s"The target value of label counts ${_numericTarget} for KSampling class label target match" + + s"for the smallest minority class ${rest.last.labelValue} (count: ${rest.last.labelCounts})is " + + s"already above the target value. Revisit the settings made in " + + s"setNumericTarget(). 
" + ) + + rest + .filterNot(x => x.labelCounts > _numericTarget) + .map { x => + RowGenerationConfig(x.labelValue, _numericTarget - x.labelCounts) + } + } + + /** + * Private method for generating the row count target for each minority class label for the match mode + * @param max The most frequently occurring label + * @param rest The remaining labels + * @return Array[RowGenerationConfig] to supply the candidate target numbers for KSampling + * @since 0.5.1 + * @author Ben Wilson + */ + def matchValidation( + max: CardinalityPayload, + rest: Array[CardinalityPayload] + ): Array[RowGenerationConfig] = { + rest.map { x => + RowGenerationConfig(x.labelValue, max.labelCounts - x.labelCounts) + } + } + + /** + * Private method for generating the row config objects that KSampling requires for label targets + * @return Array[RowGeneration] for input to KSampling processing + * @since 0.5.1 + * @author Ben Wilson + */ + def determineRatios(): Array[RowGenerationConfig] = { + + val generatedGroups = + LabelValidation(data, _labelCol, _cardinalityThreshold) + + val (max, rest) = getMaxAndRest(generatedGroups) + + _labelBalanceMode match { + case "percentage" => percentageTargets(max, rest) + case "target" => targetValidation(max, rest) + case "match" => matchValidation(max, rest) + } + + } + + def upSample(): DataFrame = { + + // Get the label statistics + val labelPayload = determineRatios() + + // Generate synthetic data + val syntheticData = KSampling( + data = data, + labelValues = labelPayload, + featuresCol = _featuresCol, + labelsCol = _labelCol, + syntheticCol = _syntheticCol, + fieldsToIgnore = _fieldsToIgnore, + kGroups = _kGroups, + kMeansMaxIter = _kMeansMaxIter, + kMeansTolerance = _kMeansTolerance, + kMeansDistanceMeasurement = _kMeansDistanceMeasurement, + kMeansSeed = _kMeansSeed, + kMeansPredictionCol = _kMeansPredictionCol, + lshHashTables = _lshHashTables, + lshSeed = _lshSeed, + lshOutputCol = _lshOutputCol, + quorumCount = _quorumCount, + 
minimumVectorCountToMutate = _minimumVectorCountToMutate, + vectorMutationMethod = _vectorMutationMethod, + mutationMode = _mutationMode, + mutationValue = _mutationValue + ) + + // Merge the original DataFrame with the synthetic data + data.withColumn(_syntheticCol, lit(false)).unionByName(syntheticData) + + } + +} + +object SyntheticFeatureGenerator { + def apply(data: DataFrame, + featuresCol: String, + labelCol: String, + syntheticCol: String, + fieldsToIgnore: Array[String], + kGroups: Int, + kMeansMaxIter: Int, + kMeansTolerance: Double, + kMeansDistanceMeasurement: String, + kMeansSeed: Long, + kMeansPredictionCol: String, + lshHashTables: Int, + lshSeed: Long, + lshOutputCol: String, + quorumCount: Int, + minimumVectorCountToMutate: Int, + vectorMutationMethod: String, + mutationMode: String, + mutationValue: Double, + labelBalanceMode: String, + cardinalityThreshold: Int, + numericRatio: Double, + numericTarget: Int): DataFrame = + new SyntheticFeatureGenerator(data) + .setFeaturesCol(featuresCol) + .setLabelCol(labelCol) + .setSyntheticCol(syntheticCol) + .setFieldsToIgnore(fieldsToIgnore) + .setKGroups(kGroups) + .setKMeansMaxIter(kMeansMaxIter) + .setKMeansTolerance(kMeansTolerance) + .setKMeansDistanceMeasurement(kMeansDistanceMeasurement) + .setKMeansSeed(kMeansSeed) + .setKMeansPredictionCol(kMeansPredictionCol) + .setLSHHashTables(lshHashTables) + .setLSHSeed(lshSeed) + .setLSHOutputCol(lshOutputCol) + .setQuorumCount(quorumCount) + .setMinimumVectorCountToMutate(minimumVectorCountToMutate) + .setVectorMutationMethod(vectorMutationMethod) + .setMutationMode(mutationMode) + .setMutationValue(mutationValue) + .setLabelBalanceMode(labelBalanceMode) + .setCardinalityThreshold(cardinalityThreshold) + .setNumericRatio(numericRatio) + .setNumericTarget(numericTarget) + .upSample() +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/structures/FeatureInteractionStructures.scala 
b/src/main/scala/com/databricks/labs/automl/feature/structures/FeatureInteractionStructures.scala new file mode 100644 index 00000000..947b02f8 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/structures/FeatureInteractionStructures.scala @@ -0,0 +1,69 @@ +package com.databricks.labs.automl.feature.structures + +import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler} +import org.apache.spark.sql.DataFrame + +case class ColumnTypeData(name: String, dataType: String) +case class VarianceData(labelValue: Double, variance: Double) +case class EntropyData(labelValue: Double, entropy: Double) +case class InteractionPayloadExtract(left: String, + leftDataType: String, + right: String, + rightDataType: String, + outputName: String, + score: Double) +case class InteractionPayload(left: String, + leftDataType: String, + right: String, + rightDataType: String, + outputName: String) +case class ColumnScoreData(score: Double, dataType: String) +case class InteractionResult(left: String, + right: String, + interaction: String, + score: Double) +case class FeatureInteractionCollection( + data: DataFrame, + interactionPayload: Array[InteractionPayloadExtract] +) +case class FeatureInteractionOutputPayload( + data: DataFrame, + fullFeatureVectorColumns: Array[String], + interactionReport: Array[InteractionPayloadExtract] +) +case class NominalIndexCollection(name: String, indexCheck: Boolean) +case class NominalDataCollection(data: DataFrame, + adjustedFields: Array[String], + fieldsToRemove: Array[String], + indexers: Array[StringIndexer]) +case class PipelineInteractionOutput( + pipeline: Pipeline, + data: DataFrame, + fullFeatureVectorColumns: Array[String], + interactionReport: Array[InteractionPayloadExtract] +) +case class VectorAssemblyOutput(assembler: VectorAssembler, data: DataFrame) + +object ModelingType extends Enumeration { + val Regressor = ModelType("regressor") + val Classifier = 
ModelType("classifier") + protected case class ModelType(modelType: String) extends super.Val() + implicit def convert(value: Value): ModelType = value.asInstanceOf[ModelType] +} + +object FieldEncodingType extends Enumeration { + val Nominal = FieldType("nominal") + val Continuous = FieldType("continuous") + protected case class FieldType(fieldType: String) extends super.Val() + implicit def convert(value: Value): FieldType = value.asInstanceOf[FieldType] +} + +object InteractionRetentionMode extends Enumeration { + val Optimistic = RetentionMode("optimistic") + val Strict = RetentionMode("strict") + val All = RetentionMode("all") + protected case class RetentionMode(retentionMode: String) extends super.Val() + implicit def convert(value: Value): RetentionMode = + value.asInstanceOf[RetentionMode] +} diff --git a/src/main/scala/com/databricks/labs/automl/feature/structures/KSamplingStructures.scala b/src/main/scala/com/databricks/labs/automl/feature/structures/KSamplingStructures.scala new file mode 100644 index 00000000..15654623 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/structures/KSamplingStructures.scala @@ -0,0 +1,77 @@ +package com.databricks.labs.automl.feature.structures + +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.sql.types._ + +trait KSamplingDefaults { + + def defaultFeaturesCol = "features" + def defaultLabelCol = "label" + def defaultSyntheticCol = "synthetic" + def defaultFieldsToIgnore: Array[String] = Array[String]() + def defaultKGroups = 25 + def defaultKMeansMaxIter = 100 + def defaultKMeansTolerance = 1E-6 + def defaultKMeansDistanceMeasurement = "euclidean" + def defaultKMeansSeed = 42L + def defaultKMeansPredictionCol = "kGroups" + def defaultHashTables = 10 + def defaultLSHSeed = 42L + def defaultLSHOutputCol = "hashes" + def defaultQuorumCount = 7 + def defaultMinimumVectorCountToMutate = 1 + def defaultVectorMutationMethod = "random" + def defaultMutationMode = "weighted" + def 
defaultMutationValue = 0.5 + + def defaultFill: Map[DataType, Any] = + Map( + DoubleType -> 0.0, + IntegerType -> 0, + StringType -> "hodor", + ShortType -> 0, + LongType -> 0L, + FloatType -> 0.0, + BooleanType -> true, + TimestampType -> "1980-01-08T08:03:52.0", + DateType -> "1980-06-01", + BinaryType -> Array(0, 1, 1, 0) + ) +} + +case class CentroidVectors(vector: Vector, kGroup: Int) + +case class KSamplingConfiguration(featuresCol: String, + labelCol: String, + syntheticCol: String, + fieldsToIgnore: Array[String], + kGroups: Int, + kMeansMaxIter: Int, + kMeansTolerance: Double, + kMeansDistanceMeasurement: String, + kMeansSeed: Long, + kMeansPredictionCol: String, + lshHashTables: Int, + lshSeed: Long, + lshOutputCol: String, + quorumCount: Int, + minimumVectorCountToMutate: Int, + vectorMutationMethod: String, + mutationMode: String, + mutationValue: Double) + +case class SchemaMapping(fieldName: String, + originalFieldIndex: Int, + dfType: DataType, + scalaType: String) + +case class StructMapping(field: StructField, idx: Int) + +case class RowMapping(fieldName: String, idx: Int) + +case class SchemaDefinitions(fullSchema: Array[SchemaMapping], + features: Array[RowMapping]) + +case class RowGenerationConfig(labelValue: Double, targetCount: Int) + +case class CardinalityPayload(labelValue: Double, labelCounts: Int) diff --git a/src/main/scala/com/databricks/labs/automl/feature/tools/LabelValidation.scala b/src/main/scala/com/databricks/labs/automl/feature/tools/LabelValidation.scala new file mode 100644 index 00000000..98040acf --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/feature/tools/LabelValidation.scala @@ -0,0 +1,114 @@ +package com.databricks.labs.automl.feature.tools + +import com.databricks.labs.automl.feature.SyntheticFeatureBase +import com.databricks.labs.automl.feature.structures.CardinalityPayload +import org.apache.log4j.{Level, Logger} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.col + 
+class LabelValidation(data: DataFrame) extends SyntheticFeatureBase { + + private var _cardinalityThreshold: Int = defaultCardinalityThreshold + + def setCardinalityThreshold(value: Int): this.type = { + value match { + case x if x > 20 => + println( + s"[WARNING] Setting the cardinality threshold to a value greater " + + s"than 20 may indicate that this is a regression problem." + ) + case _ => () + } + _cardinalityThreshold = value + this + } + + /** + * Private helper method for checking that the cardinality of the label column is within the + * range expected of a categorical label, to ensure that there is not a 'runaway' condition of + * submitting too many unique labels to generate data for. + * @param grouped DataFrame: the grouped label data with counts. + * @since 0.5.1 + * @author Ben Wilson + * @throws RuntimeException if the cardinality of the label column exceeds the threshold + */ + @throws(classOf[RuntimeException]) + private def validateCardinalityCounts(grouped: DataFrame): Unit = { + + val logger: Logger = Logger.getLogger(this.getClass) + + grouped.count() match { + case x if x <= _cardinalityThreshold => + logger.log( + Level.INFO, + s"Unique counts of label " + + s"column ${_labelCol} : ${x.toString}" + ) + case _ => + throw new RuntimeException( + s"[ALERT] Cardinality of label column is greater " + + s"than threshold of: ${_cardinalityThreshold.toString}" + ) + } + } + + /** + * Private method for retrieving and validating the skew in the label column in order to support + * KSampling synthetic label boosting. + * @return Array[CardinalityPayload] of all of the counts of the labels throughout the data set. 
+ * @since 0.5.1 + * @author Ben Wilson + */ + private def determineCardinality(): Array[CardinalityPayload] = { + + // Count the rows for each distinct value of the label column + val groupedLabel = data + .select(col(_labelCol)) + .groupBy(col(_labelCol)) + .count() + + // Fail fast if the label column has too many distinct values + validateCardinalityCounts(groupedLabel) + + // Get the data type of the label column + val labelType = + data.schema.filter(x => x.name == _labelCol).head.dataType.typeName + + // Create the cardinality collection + groupedLabel.collect.map { x => + labelType match { + case "double" => + CardinalityPayload( + x.getAs[Double](_labelCol), + x.getAs[Long]("count").toInt + ) + case "integer" => + CardinalityPayload( + x.getAs[Int](_labelCol).toDouble, + x.getAs[Long]("count").toInt + ) + case "float" => + CardinalityPayload( + x.getAs[Float](_labelCol).toDouble, + x.getAs[Long]("count").toInt + ) + case _ => + throw new RuntimeException( + s"The data type of the label column ${_labelCol} is: $labelType " + + s"which is not supported. 
Must be one of: DoubleType, IntegerType, or FloatType" + ) + } + } + } + +} + +object LabelValidation { + def apply(data: DataFrame, + labelCol: String, + cardinalityThreshold: Int): Array[CardinalityPayload] = + new LabelValidation(data) + .setLabelCol(labelCol) + .setCardinalityThreshold(cardinalityThreshold) + .determineCardinality() +} diff --git a/src/main/scala/com/databricks/labs/automl/inference/InferenceConfig.scala b/src/main/scala/com/databricks/labs/automl/inference/InferenceConfig.scala new file mode 100644 index 00000000..8629e5ab --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/inference/InferenceConfig.scala @@ -0,0 +1,474 @@ +package com.databricks.labs.automl.inference + +import com.databricks.labs.automl.params.{MLFlowConfig, ScalingConfig} + +object InferenceConfig extends InferenceDefaults { + + final val allowableModelLoads: Array[String] = Array("path", "mlflow") + final val allowableOutlierFilteringDirections: Array[String] = + Array("greater", "lesser") + + var _inferenceConfigStorageLocation: String = "" + var _inferenceDataConfig: InferenceDataConfig = _defaultInferenceDataConfig + var _inferenceDataConfigLabelCol: String = + _defaultInferenceDataConfig.labelCol + var _inferenceDataConfigFeaturesCol: String = + _defaultInferenceDataConfig.featuresCol + var _inferenceDataConfigStartingColumns: Array[String] = + _defaultInferenceDataConfig.startingColumns + var _inferenceDataConfigFieldsToIgnore: Array[String] = + _defaultInferenceDataConfig.fieldsToIgnore + var _inferenceDataConfigDateTimeConversionType: String = + _defaultInferenceDataConfig.dateTimeConversionType + var _inferenceSwitchSettings: InferenceSwitchSettings = + _defaultInferenceSwitchSettings + var _inferenceModelConfig: InferenceModelConfig = _defaultInferenceModelConfig + var _featureEngineeringConfig: FeatureEngineeringConfig = + _defaultFeatureEngineeringConfig + var _inferenceConfig: InferenceMainConfig = _defaultInferenceConfig + var 
_inferenceConfigModelFamily: String = + _defaultInferenceModelConfig.modelFamily + var _inferenceConfigModelType: String = _defaultInferenceModelConfig.modelType + var _inferenceConfigModelLoadMethod: String = + _defaultInferenceModelConfig.modelLoadMethod + var _inferenceConfigMlFlowConfig: MLFlowConfig = + _defaultInferenceModelConfig.mlFlowConfig + var _inferenceConfigMlFlowRunId: String = + _defaultInferenceModelConfig.mlFlowRunId + var _inferenceConfigModelPathLocation: String = + _defaultInferenceModelConfig.modelPathLocation + var _inferenceConfigMlFlowTrackingURI: String = + _defaultInferenceModelConfig.mlFlowConfig.mlFlowTrackingURI + var _inferenceConfigMlFlowExperimentName: String = + _defaultInferenceModelConfig.mlFlowConfig.mlFlowExperimentName + var _inferenceConfigMlFlowAPIToken: String = + _defaultInferenceModelConfig.mlFlowConfig.mlFlowAPIToken + var _inferenceConfigMlFlowModelSaveDirectory: String = + _defaultInferenceModelConfig.mlFlowConfig.mlFlowModelSaveDirectory + var _inferenceNaFillConfig: NaFillConfig = _defaultNaFillConfig + var _inferenceVarianceFilterConfig: VarianceFilterConfig = + _defaultVarianceFilterConfig + var _inferenceOutlierFilteringConfig: OutlierFilteringConfig = + _defaultOutlierFilteringConfig + var _inferenceCovarianceFilteringConfig: CovarianceFilteringConfig = + _defaultCovarianceFilteringConfig + var _inferencePearsonFilteringConfig: PearsonFilteringConfig = + _defaultPearsonFilteringConfig + var _inferenceScalingConfig: ScalingConfig = _scalingConfigDefaults + var _inferenceScalerType: String = _scalingConfigDefaults.scalerType + var _inferenceScalerMin: Double = _scalingConfigDefaults.scalerMin + var _inferenceScalerMax: Double = _scalingConfigDefaults.scalerMax + var _inferenceStandardScalerMeanFlag: Boolean = + _scalingConfigDefaults.standardScalerMeanFlag + var _inferenceStandardScalerStdDevFlag: Boolean = + _scalingConfigDefaults.standardScalerStdDevFlag + var _inferenceScalerPNorm: Double = 
_scalingConfigDefaults.pNorm + var _inferenceFeatureInteractionConfig: FeatureInteractionConfig = + _defaultInferenceFeatureInteractionConfig + + def setInferenceConfigStorageLocation(value: String): this.type = { + _inferenceConfigStorageLocation = value + setInferenceConfig() + this + } + + def setInferenceConfig(value: InferenceMainConfig): this.type = { + _inferenceConfig = value + this + } + + def setInferenceConfig(): this.type = { + _inferenceConfig = InferenceMainConfig( + inferenceDataConfig = _inferenceDataConfig, + inferenceSwitchSettings = _inferenceSwitchSettings, + inferenceModelConfig = _inferenceModelConfig, + featureEngineeringConfig = _featureEngineeringConfig, + inferenceConfigStorageLocation = _inferenceConfigStorageLocation + ) + this + } + + def setInferenceSwitchSettings(value: InferenceSwitchSettings): this.type = { + _inferenceSwitchSettings = value + setInferenceConfig() + this + } + + def setInferenceDataConfig(value: InferenceDataConfig): this.type = { + _inferenceDataConfig = value + _inferenceDataConfigLabelCol = value.labelCol + _inferenceDataConfigFeaturesCol = value.featuresCol + _inferenceDataConfigStartingColumns = value.startingColumns + _inferenceDataConfigFieldsToIgnore = value.fieldsToIgnore + _inferenceDataConfigDateTimeConversionType = value.dateTimeConversionType + setInferenceConfig() + this + } + + private def setInferenceDataConfig(): this.type = { + _inferenceDataConfig = InferenceDataConfig( + labelCol = _inferenceDataConfigLabelCol, + featuresCol = _inferenceDataConfigFeaturesCol, + startingColumns = _inferenceDataConfigStartingColumns, + fieldsToIgnore = _inferenceDataConfigFieldsToIgnore, + dateTimeConversionType = _inferenceDataConfigDateTimeConversionType + ) + setInferenceConfig() + this + } + + def setInferenceDataConfigLabelCol(value: String): this.type = { + _inferenceDataConfigLabelCol = value + setInferenceDataConfig() + this + } + + def setInferenceDataConfigFeaturesCol(value: String): this.type = { + 
_inferenceDataConfigFeaturesCol = value + setInferenceDataConfig() + this + } + + def setInferenceDataConfigStartingColumns(value: Array[String]): this.type = { + _inferenceDataConfigStartingColumns = value + setInferenceDataConfig() + this + } + + def setInferenceDataConfigFieldsToIgnore(value: Array[String]): this.type = { + _inferenceDataConfigFieldsToIgnore = value + setInferenceDataConfig() + this + } + + def setInferenceDataConfigDateTimeConversionType(value: String): this.type = { + _inferenceDataConfigDateTimeConversionType = value + setInferenceDataConfig() + this + } + + def setInferenceModelConfig(value: InferenceModelConfig): this.type = { + _inferenceModelConfig = value + setInferenceConfig() + this + } + + private def setInferenceModelConfig(): this.type = { + _inferenceModelConfig = InferenceModelConfig( + modelFamily = _inferenceConfigModelFamily, + modelType = _inferenceConfigModelType, + modelLoadMethod = _inferenceConfigModelLoadMethod, + mlFlowConfig = _inferenceConfigMlFlowConfig, + mlFlowRunId = _inferenceConfigMlFlowRunId, + modelPathLocation = _inferenceConfigModelPathLocation + ) + setInferenceConfig() + this + } + + def setInferenceModelConfigModelFamily(value: String): this.type = { + _inferenceConfigModelFamily = value + setInferenceModelConfig() + this + } + + def setInferenceModelConfigModelType(value: String): this.type = { + _inferenceConfigModelType = value + setInferenceModelConfig() + this + } + + def setInferenceModelConfigModelLoadMethod(value: String): this.type = { + require( + allowableModelLoads.contains(value), + s"Inference Model Config Model Load Method invalid '$value' is not " + + s"in ${allowableModelLoads.mkString(", ")}" + ) + _inferenceConfigModelLoadMethod = value + setInferenceModelConfig() + this + } + + def setInferenceModelConfigMlFlowConfig(value: MLFlowConfig): this.type = { + _inferenceConfigMlFlowConfig = value + setInferenceModelConfig() + this + } + + private def setInferenceModelConfigMlFlowConfig(): 
this.type = { + _inferenceConfigMlFlowConfig = MLFlowConfig( + mlFlowTrackingURI = _inferenceConfigMlFlowTrackingURI, + mlFlowExperimentName = _inferenceConfigMlFlowExperimentName, + mlFlowAPIToken = _inferenceConfigMlFlowAPIToken, + mlFlowModelSaveDirectory = _inferenceConfigMlFlowModelSaveDirectory, + mlFlowLoggingMode = "full", + mlFlowBestSuffix = "_best", + mlFlowCustomRunTags = Map("" -> "") + ) + setInferenceModelConfig() + this + } + + def setInferenceConfigMlFlowTrackingURI(value: String): this.type = { + _inferenceConfigMlFlowTrackingURI = value + setInferenceModelConfigMlFlowConfig() + this + } + + def setInferenceConfigMlFlowExperimentName(value: String): this.type = { + _inferenceConfigMlFlowExperimentName = value + setInferenceModelConfigMlFlowConfig() + this + } + + def setInferenceConfigMlFlowAPIToken(value: String): this.type = { + _inferenceConfigMlFlowAPIToken = value + setInferenceModelConfigMlFlowConfig() + this + } + + def setInferenceConfigMlFlowModelSaveDirectory(value: String): this.type = { + _inferenceConfigMlFlowModelSaveDirectory = value + setInferenceModelConfigMlFlowConfig() + this + } + + def setInferenceModelConfigMlFlowRunID(value: String): this.type = { + _inferenceConfigMlFlowRunId = value + setInferenceModelConfig() + this + } + + def setInferenceModelConfigModelPathLocation(value: String): this.type = { + _inferenceConfigModelPathLocation = value + setInferenceModelConfig() + this + } + + def setInferenceFeatureEngineeringConfig( + value: FeatureEngineeringConfig + ): this.type = { + _featureEngineeringConfig = value + setInferenceConfig() + this + } + + private def setInferenceFeatureEngineeringConfig(): this.type = { + _featureEngineeringConfig = FeatureEngineeringConfig( + naFillConfig = _inferenceNaFillConfig, + varianceFilterConfig = _inferenceVarianceFilterConfig, + outlierFilteringConfig = _inferenceOutlierFilteringConfig, + covarianceFilteringConfig = _inferenceCovarianceFilteringConfig, + pearsonFilteringConfig = 
_inferencePearsonFilteringConfig, + scalingConfig = _inferenceScalingConfig, + featureInteractionConfig = _inferenceFeatureInteractionConfig + ) + setInferenceConfig() + this + } + + def setInferenceNaFillConfig(categoricalMap: Map[String, String], + numericMap: Map[String, Double], + booleanMap: Map[String, Boolean]): this.type = { + _inferenceNaFillConfig = NaFillConfig( + categoricalColumns = categoricalMap, + numericColumns = numericMap, + booleanColumns = booleanMap + ) + setInferenceFeatureEngineeringConfig() + this + } + + def setInferenceVarianceFilterConfig(value: Array[String]): this.type = { + _inferenceVarianceFilterConfig = VarianceFilterConfig(fieldsRemoved = value) + setInferenceFeatureEngineeringConfig() + this + } + + def setInferenceOutlierFilteringConfig( + value: Map[String, (Double, String)] + ): this.type = { + _inferenceOutlierFilteringConfig = OutlierFilteringConfig( + fieldRemovalMap = value + ) + setInferenceFeatureEngineeringConfig() + this + } + + def setInferenceCovarianceFilteringConfig(value: Array[String]): this.type = { + _inferenceCovarianceFilteringConfig = CovarianceFilteringConfig( + fieldsRemoved = value + ) + setInferenceFeatureEngineeringConfig() + this + } + + def setInferencePearsonFilteringConfig(value: Array[String]): this.type = { + _inferencePearsonFilteringConfig = PearsonFilteringConfig( + fieldsRemoved = value + ) + setInferenceFeatureEngineeringConfig() + this + } + + private def setInferenceScalingConfig(): this.type = { + _inferenceScalingConfig = ScalingConfig( + scalerType = _inferenceScalerType, + scalerMin = _inferenceScalerMin, + scalerMax = _inferenceScalerMax, + standardScalerMeanFlag = _inferenceStandardScalerMeanFlag, + standardScalerStdDevFlag = _inferenceStandardScalerStdDevFlag, + pNorm = _inferenceScalerPNorm + ) + setInferenceFeatureEngineeringConfig() + this + } + + def setFeatureInteractionConfig( + value: FeatureInteractionConfig + ): this.type = { + _inferenceFeatureInteractionConfig = value + 
setInferenceFeatureEngineeringConfig() + this + } + + def setInferenceScalingConfig(value: ScalingConfig): this.type = { + _inferenceScalingConfig = value + setInferenceFeatureEngineeringConfig() + this + } + + def setInferenceScalerType(value: String): this.type = { + _inferenceScalerType = value + setInferenceScalingConfig() + this + } + + def setInferenceScalerMin(value: Double): this.type = { + _inferenceScalerMin = value + setInferenceScalingConfig() + this + } + + def setInferenceScalerMax(value: Double): this.type = { + _inferenceScalerMax = value + setInferenceScalingConfig() + this + } + + def setInferenceStandardScalerMeanFlagOn(): this.type = { + _inferenceStandardScalerMeanFlag = true + setInferenceScalingConfig() + this + } + + def setInferenceStandardScalerMeanFlagOff(): this.type = { + _inferenceStandardScalerMeanFlag = false + setInferenceScalingConfig() + this + } + + def setInferenceStandardScalerStdDevFlagOn(): this.type = { + _inferenceStandardScalerStdDevFlag = true + setInferenceScalingConfig() + this + } + + def setInferenceStandardScalerStdDevFlagOff(): this.type = { + _inferenceStandardScalerStdDevFlag = false + setInferenceScalingConfig() + this + } + + def setInferenceScalerPNorm(value: Double): this.type = { + _inferenceScalerPNorm = value + setInferenceScalingConfig() + this + } + + def getInferenceConfigStorageLocation: String = + _inferenceConfigStorageLocation + + def getInferenceConfig: InferenceMainConfig = _inferenceConfig + + def getInferenceSwitchSettings: InferenceSwitchSettings = + _inferenceSwitchSettings + + def getInferenceDataConfig: InferenceDataConfig = _inferenceDataConfig + + def getInferenceDataConfigLabelCol: String = _inferenceDataConfigLabelCol + + def getInferenceDataConfigFeaturesCol: String = + _inferenceDataConfigFeaturesCol + + def getInferenceDataConfigStartingColumns: Array[String] = + _inferenceDataConfigStartingColumns + + def getInferenceDataConfigFieldsToIgnore: Array[String] = + _inferenceDataConfigFieldsToIgnore + + def 
getInferenceDataConfigDateTimeConversionType: String = + _inferenceDataConfigDateTimeConversionType + + def getInferenceModelConfig: InferenceModelConfig = _inferenceModelConfig + + def getInferenceModelConfigModelFamily: String = _inferenceConfigModelFamily + + def getInferenceModelConfigModelType: String = _inferenceConfigModelType + + def getInferenceModelConfigModelLoadMethod: String = + _inferenceConfigModelLoadMethod + + def getInferenceModelConfigMlFlowTrackingURI: String = + _inferenceConfigMlFlowTrackingURI + + def getInferenceModelConfigMlFlowExperimentName: String = + _inferenceConfigMlFlowExperimentName + + def getInferenceModelConfigMlFlowModelSaveDirectory: String = + _inferenceConfigMlFlowModelSaveDirectory + + def getInferenceModelConfigMlFlowRunID: String = _inferenceConfigMlFlowRunId + + def getInferenceModelConfigModelPathLocation: String = + _inferenceConfigModelPathLocation + + def getInferenceFeatureEngineeringConfig: FeatureEngineeringConfig = + _featureEngineeringConfig + + def getInferenceNaFillConfig: NaFillConfig = _inferenceNaFillConfig + + def getInferenceVarianceFilterConfig: VarianceFilterConfig = + _inferenceVarianceFilterConfig + + def getInferenceOutlierFilteringConfig: OutlierFilteringConfig = + _inferenceOutlierFilteringConfig + + def getInferenceCovarianceFilteringConfig: CovarianceFilteringConfig = + _inferenceCovarianceFilteringConfig + + def getInferencePearsonFilteringConfig: PearsonFilteringConfig = + _inferencePearsonFilteringConfig + + def getInferenceScalingConfig: ScalingConfig = _inferenceScalingConfig + + def getInferenceScalerType: String = _inferenceScalerType + + def getInferenceScalerMin: Double = _inferenceScalerMin + + def getInferenceScalerMax: Double = _inferenceScalerMax + + def getInferenceStandardScalerMeanFlag: Boolean = + _inferenceStandardScalerMeanFlag + + def getInferenceStandardScalerStdDevFlag: Boolean = + _inferenceStandardScalerStdDevFlag + + def getInferenceScalerPNorm: Double = 
_inferenceScalerPNorm + + def getInferenceFeatureInteractionConfig: FeatureInteractionConfig = + _inferenceFeatureInteractionConfig + +} +//object InferenceConfig extends InferenceConfig{ +//} diff --git a/src/main/scala/com/databricks/labs/automl/inference/InferenceDefaults.scala b/src/main/scala/com/databricks/labs/automl/inference/InferenceDefaults.scala new file mode 100644 index 00000000..5783c19f --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/inference/InferenceDefaults.scala @@ -0,0 +1,80 @@ +package com.databricks.labs.automl.inference + +import com.databricks.labs.automl.feature.structures.InteractionPayloadExtract +import com.databricks.labs.automl.params.Defaults + +trait InferenceDefaults extends Defaults { + + def _defaultInferenceSwitchSettings: InferenceSwitchSettings = + InferenceSwitchSettings( + naFillFlag = true, + varianceFilterFlag = true, + outlierFilterFlag = false, + pearsonFilterFlag = false, + covarianceFilterFlag = false, + oneHotEncodeFlag = false, + scalingFlag = false, + featureInteractionFlag = false + ) + + def _defaultInferenceDataConfig: InferenceDataConfig = InferenceDataConfig( + labelCol = "label", + featuresCol = "features", + startingColumns = Array.empty[String], + fieldsToIgnore = Array.empty[String], + dateTimeConversionType = "split" + ) + + def _defaultInferenceModelConfig: InferenceModelConfig = InferenceModelConfig( + modelFamily = "RandomForest", + modelType = "classifier", + modelLoadMethod = "path", + mlFlowConfig = _mlFlowConfigDefaults, + mlFlowRunId = "a", + modelPathLocation = "/models/" + ) + + def _defaultNaFillConfig: NaFillConfig = NaFillConfig( + categoricalColumns = Map("default" -> "default"), + numericColumns = Map("default_num" -> 0.0), + booleanColumns = Map("default_bool" -> false) + ) + + def _defaultVarianceFilterConfig: VarianceFilterConfig = VarianceFilterConfig( + fieldsRemoved = Array.empty[String] + ) + + def _defaultOutlierFilteringConfig: OutlierFilteringConfig = + 
OutlierFilteringConfig( + fieldRemovalMap = Map("" -> (Double.MaxValue, "greater")) + ) + + def _defaultCovarianceFilteringConfig: CovarianceFilteringConfig = + CovarianceFilteringConfig(fieldsRemoved = Array.empty[String]) + + def _defaultPearsonFilteringConfig: PearsonFilteringConfig = + PearsonFilteringConfig(fieldsRemoved = Array.empty[String]) + + def _defaultInferenceFeatureInteractionConfig: FeatureInteractionConfig = + FeatureInteractionConfig(interactions = Array[InteractionPayloadExtract]()) + + def _defaultFeatureEngineeringConfig: FeatureEngineeringConfig = + FeatureEngineeringConfig( + naFillConfig = _defaultNaFillConfig, + varianceFilterConfig = _defaultVarianceFilterConfig, + outlierFilteringConfig = _defaultOutlierFilteringConfig, + covarianceFilteringConfig = _defaultCovarianceFilteringConfig, + pearsonFilteringConfig = _defaultPearsonFilteringConfig, + scalingConfig = _scalingConfigDefaults, + featureInteractionConfig = _defaultInferenceFeatureInteractionConfig + ) + + def _defaultInferenceConfig: InferenceMainConfig = InferenceMainConfig( + inferenceDataConfig = _defaultInferenceDataConfig, + inferenceSwitchSettings = _defaultInferenceSwitchSettings, + inferenceModelConfig = _defaultInferenceModelConfig, + featureEngineeringConfig = _defaultFeatureEngineeringConfig, + inferenceConfigStorageLocation = "" + ) + +} diff --git a/src/main/scala/com/databricks/labs/automl/inference/InferencePipeline.scala b/src/main/scala/com/databricks/labs/automl/inference/InferencePipeline.scala new file mode 100644 index 00000000..b222049b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/inference/InferencePipeline.scala @@ -0,0 +1,437 @@ +package com.databricks.labs.automl.inference + +import com.databricks.labs.automl.executor.AutomationConfig +import com.databricks.labs.automl.feature.structures.NominalIndexCollection +import com.databricks.labs.automl.pipeline.FeaturePipeline +import com.databricks.labs.automl.sanitize.Scaler +import 
com.databricks.labs.automl.utils.{AutomationTools, DataValidation} +import ml.dmlc.xgboost4j.scala.spark.{ + XGBoostClassificationModel, + XGBoostRegressionModel +} +import org.apache.spark.ml.classification._ +import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler} +import org.apache.spark.ml.regression._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import com.databricks.labs.automl.inference.InferenceConfig._ +import org.apache.spark.ml.Pipeline + +class InferencePipeline(df: DataFrame) + extends AutomationConfig + with AutomationTools + with DataValidation + with InferenceTools { + + /** + * Data Prep to: + * - Select only the initial columns that were present at the beginning of the training run + * - Convert the datetime entities to correct actionable types + * - StringIndex categorical (text or ordinal) fields + * - Fill NA with the values that were used during the training run for each column + * @return The courier object InferencePayload (the DataFrame, the modeling columns, and all columns) + */ + private def dataPreparation(): InferencePayload = { + + // Filter out any non-used fields that may be included in future data sets that weren't part of model training +// TODO - Have to remove this temporarily +// val initialColumnRestriction = df.select(_inferenceConfig.inferenceDataConfig.startingColumns map col:_*) + + // Build the feature Pipeline + + val featurePipelineObject = new FeaturePipeline(df, isInferenceRun = true) + .setLabelCol(_inferenceConfig.inferenceDataConfig.labelCol) + .setFeatureCol(_inferenceConfig.inferenceDataConfig.featuresCol) + .setDateTimeConversionType( + _inferenceConfig.inferenceDataConfig.dateTimeConversionType + ) + + // Get the StringIndexed DataFrame, the fields that are set for modeling, and all fields combined. 
+ val (indexedData, columnsForModeling, allColumns) = featurePipelineObject + .makeFeaturePipeline(_inferenceConfig.inferenceDataConfig.fieldsToIgnore) + + val outputData = if (_inferenceConfig.inferenceSwitchSettings.naFillFlag) { + indexedData.na + .fill( + _inferenceConfig.featureEngineeringConfig.naFillConfig.categoricalColumns + ) + .na + .fill( + _inferenceConfig.featureEngineeringConfig.naFillConfig.numericColumns + ) + } else { + indexedData + } + + createInferencePayload(outputData, columnsForModeling, allColumns) + + } + + /** + * Helper method for creating the Feature Vector for modeling / feature engineering tasks + * @param payload InferencePayload object that contains: + * - The DataFrame + * - The List of Columns to be included in the Feature Vector + * - The Full List of Columns (including ignored columns used for post-inference joining, etc.) + * @return a new InferencePayload object (with the DataFrame now including a feature vector) + */ + private def createFeatureVector( + payload: InferencePayload + ): InferencePayload = { + + val vectorAssembler = new VectorAssembler() + .setInputCols(payload.modelingColumns) + .setOutputCol(_inferenceConfig.inferenceDataConfig.featuresCol) + + val vectorAppliedDataFrame = vectorAssembler.transform(payload.data) + + createInferencePayload( + vectorAppliedDataFrame, + payload.modelingColumns, + payload.allColumns ++ Array( + _inferenceConfig.inferenceDataConfig.featuresCol + ) + ) + + } + + /** + * Helper method for applying one hot encoding to the feature vector, if used in the original modeling run + * @param payload InferencePayload object + * @return a new InferencePayload object (the DataFrame, with an updated feature vector, and the field listings + * now having any previous StringIndexed fields converted to OneHotEncoded fields.)
+ */ + private def oneHotEncodingTransform( + payload: InferencePayload + ): InferencePayload = { + + val featurePipeline = + new FeaturePipeline(payload.data, isInferenceRun = true) + .setLabelCol(_inferenceConfig.inferenceDataConfig.labelCol) + .setFeatureCol(_inferenceConfig.inferenceDataConfig.featuresCol) + .setDateTimeConversionType( + _inferenceConfig.inferenceDataConfig.dateTimeConversionType + ) + + val (returnData, vectorCols, allCols) = featurePipeline.applyOneHotEncoding( + payload.modelingColumns, + payload.allColumns + ) + + createInferencePayload(returnData, vectorCols, allCols) + + } + + /** + * Private helper function for recreating the feature interaction fields that were specified during model creation + * @param payload Previous step payload of data, columns in feature vector, and all columns + * @return a new InferencePayload object that has the added feature interaction fields. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def createFeatureInteractions( + payload: InferencePayload + ): InferencePayload = { + + // Interact the columns + val interactions = + _inferenceConfig.featureEngineeringConfig.featureInteractionConfig.interactions + + var mutatingDataFrame = payload.data + + for (c <- interactions) { + mutatingDataFrame = + mutatingDataFrame.withColumn(c.outputName, col(c.left) * col(c.right)) + } + + val parsedNames = interactions.map { x => + (x.leftDataType, x.rightDataType) match { + case ("nominal", "nominal") => + NominalIndexCollection(x.outputName, indexCheck = true) + case _ => NominalIndexCollection(x.outputName, indexCheck = false) + } + } + + val nominalFields = parsedNames + .filter(x => x.indexCheck) + .map(x => x.name) + + val indexers = nominalFields.map { x => + new StringIndexer() + .setHandleInvalid("keep") + .setInputCol(x) + .setOutputCol(x + "_si") + } + + val pipeline = new Pipeline().setStages(indexers).fit(mutatingDataFrame) + + val adjustedFields = parsedNames.map { x => + if (x.indexCheck)
x.name + "_si" else x.name + } + + createInferencePayload( + pipeline.transform(mutatingDataFrame), + payload.modelingColumns ++ adjustedFields, + payload.allColumns ++ adjustedFields + ) + + } + + /** + * Method for performing all configured FeatureEngineering tasks as set in the InferenceMainConfig + * @param payload InferencePayload object + * @return new InferencePayload object with all actions applied to the Dataframe and associated field listings + * that were originally performed in model training. + */ + private def executeFeatureEngineering( + payload: InferencePayload + ): InferencePayload = { + + // Variance Filtering + val variancePayload = + if (_inferenceConfig.inferenceSwitchSettings.varianceFilterFlag) { + + val fieldsToRemove = + _inferenceConfig.featureEngineeringConfig.varianceFilterConfig.fieldsRemoved + + removeArrayOfColumns(payload, fieldsToRemove) + + } else payload + + // Outlier Filtering + val outlierPayload = + if (_inferenceConfig.inferenceSwitchSettings.outlierFilterFlag) { + + // apply filtering in a foreach + var outlierData = variancePayload.data + + _inferenceConfig.featureEngineeringConfig.outlierFilteringConfig.fieldRemovalMap + .foreach { x => + val field = x._1 + val direction = x._2._2 + val value = x._2._1 + + outlierData = direction match { + case "greater" => outlierData.filter(col(field) <= value) + case "lesser" => outlierData.filter(col(field) >= value) + } + + } + + createInferencePayload( + outlierData, + variancePayload.modelingColumns, + variancePayload.allColumns + ) + + } else variancePayload + + // Covariance Filtering + val covariancePayload = + if (_inferenceConfig.inferenceSwitchSettings.covarianceFilterFlag) { + + val fieldsToRemove = + _inferenceConfig.featureEngineeringConfig.covarianceFilteringConfig.fieldsRemoved + + removeArrayOfColumns(outlierPayload, fieldsToRemove) + + } else outlierPayload + + // Pearson Filtering + val pearsonPayload = + if (_inferenceConfig.inferenceSwitchSettings.pearsonFilterFlag) 
{ + + val fieldsToRemove = + _inferenceConfig.featureEngineeringConfig.pearsonFilteringConfig.fieldsRemoved + + removeArrayOfColumns(covariancePayload, fieldsToRemove) + + } else covariancePayload + + // Build the Interacted Features + val featureInteractionPayload = + if (_inferenceConfig.inferenceSwitchSettings.featureInteractionFlag) { + createFeatureInteractions(pearsonPayload) + } else pearsonPayload + + // Build the Feature Vector + val featureVectorPayload = createFeatureVector(featureInteractionPayload) + + // OneHotEncoding + val oneHotEncodedPayload = + if (_inferenceConfig.inferenceSwitchSettings.oneHotEncodeFlag) { + + oneHotEncodingTransform(featureVectorPayload) + + } else featureVectorPayload + + // Scaling + val scaledPayload = + if (_inferenceConfig.inferenceSwitchSettings.scalingFlag) { + + val scalerConfig = + _inferenceConfig.featureEngineeringConfig.scalingConfig + + val scaledData = new Scaler(oneHotEncodedPayload.data) + .setFeaturesCol(_inferenceConfig.inferenceDataConfig.featuresCol) + .setScalerType(scalerConfig.scalerType) + .setScalerMin(scalerConfig.scalerMin) + .setScalerMax(scalerConfig.scalerMax) + .setStandardScalerMeanMode(scalerConfig.standardScalerMeanFlag) + .setStandardScalerStdDevMode(scalerConfig.standardScalerStdDevFlag) + .setPNorm(scalerConfig.pNorm) + .scaleFeatures() + + createInferencePayload( + scaledData, + oneHotEncodedPayload.modelingColumns, + oneHotEncodedPayload.allColumns + ) + + } else oneHotEncodedPayload + + // yield the Data and the Columns for the payload + + scaledPayload + + } + + /** + * Helper method for loading and applying a transformation on the Dataframe from FeatureEngineering tasks. + * @param data The Dataframe from feature engineering output. + * @return A Dataframe with a prediction and/or probability column applied. 
+ */ + private def loadModelAndInfer(data: DataFrame): DataFrame = { + + val modelFamily = _inferenceConfig.inferenceModelConfig.modelFamily + val modelType = _inferenceConfig.inferenceModelConfig.modelType + + val modelLoadPath = _inferenceConfig.inferenceModelConfig.modelPathLocation + + // load the model and transform the dataframe to batch predict on the data + modelFamily match { + case "XGBoost" => + modelType match { + case "regressor" => + val xgboostRegressor = XGBoostRegressionModel.load(modelLoadPath) + xgboostRegressor.transform(data) + case "classifier" => + val xgboostClassifier = + XGBoostClassificationModel.load(modelLoadPath) + xgboostClassifier.transform(data) + } + case "RandomForest" => + modelType match { + case "regressor" => + val rfRegressor = RandomForestRegressionModel.load(modelLoadPath) + rfRegressor.transform(data) + case "classifier" => + val rfClassifier = + RandomForestClassificationModel.load(modelLoadPath) + rfClassifier.transform(data) + } + case "GBT" => + modelType match { + case "regressor" => + val gbtRegressor = GBTRegressionModel.load(modelLoadPath) + gbtRegressor.transform(data) + case "classifier" => + val gbtClassifier = GBTClassificationModel.load(modelLoadPath) + gbtClassifier.transform(data) + } + case "Trees" => + modelType match { + case "regressor" => + val treesRegressor = DecisionTreeRegressionModel.load(modelLoadPath) + treesRegressor.transform(data) + case "classifier" => + val treesClassifier = + DecisionTreeClassificationModel.load(modelLoadPath) + treesClassifier.transform(data) + } + case "MLPC" => + val mlpcClassifier = + MultilayerPerceptronClassificationModel.load(modelLoadPath) + mlpcClassifier.transform(data) + case "LinearRegression" => + val linearRegressor = LinearRegressionModel.load(modelLoadPath) + linearRegressor.transform(data) + case "LogisticRegression" => + val logisticRegressor = LogisticRegressionModel.load(modelLoadPath) + logisticRegressor.transform(data) + case "SVM" => + val 
svmClassifier = LinearSVCModel.load(modelLoadPath) + svmClassifier.transform(data) + } + } + + /** + * Helper method for loading the InferenceMainConfig from a DataFrame that has been written to a storage location + * during model training. After loading the Dataframe, the value in row 1 column 1 will be extracted, converted + * to json, converted to an instance of InferenceMainConfig, and finally used to set the current state of this + * class' MainInferenceConfig. + * @param inferenceDataFrameSaveLocation The storage location path of the Dataframe. + */ + private def getAndSetConfigFromDataFrame( + inferenceDataFrameSaveLocation: String + ): Unit = { + + val inferenceDataFrame = spark.read.load(inferenceDataFrameSaveLocation) + + val config = extractInferenceConfigFromDataFrame(inferenceDataFrame) + + setInferenceConfig(config) + + } + + /** + * Main private method for executing an inference run. + * @return A Dataframe with an applied model prediction. + */ + private def inferencePipeline(): DataFrame = { + + // Run through the Data Preparation steps as a prelude to Feature Engineering + val prep = dataPreparation() + + // Execute the Feature Engineering that was performed during initial model training + val featureEngineering = executeFeatureEngineering(prep) + + // Execute the model inference and return a transformed DataFrame. + loadModelAndInfer(featureEngineering.data) + + } + + /** + * Public method for performing an inference run from a stored InferenceConfig Dataframe location. + * @param inferenceConfigDFPath Path on storage of where the Dataframe was written during the training run. + * @return A Dataframe with predictions based on a pre-trained model. 
+ */ + def runInferenceFromStoredDataFrame( + inferenceConfigDFPath: String + ): DataFrame = { + + // Load the Dataframe containing the configuration and set the InferenceMainConfig + getAndSetConfigFromDataFrame(inferenceConfigDFPath) + + inferencePipeline() + + } + + /** + * Public method for performing an inference run from a supplied inference config string. + * @param jsonConfig the saved inference config from a previous run as string-encoded json + * @return A Dataframe with prediction based on a pre-trained model. + */ + def runInferenceFromJSONConfig(jsonConfig: String): DataFrame = { + + val config = convertJsonConfigToClass(jsonConfig) + + setInferenceConfig(config) + + inferencePipeline() + + } + + def getInferenceConfig: InferenceMainConfig = _inferenceConfig + +} diff --git a/src/main/scala/com/databricks/labs/automl/inference/InferenceStructures.scala b/src/main/scala/com/databricks/labs/automl/inference/InferenceStructures.scala new file mode 100644 index 00000000..eeaa0e68 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/inference/InferenceStructures.scala @@ -0,0 +1,78 @@ +package com.databricks.labs.automl.inference + +import com.databricks.labs.automl.feature.structures.InteractionPayloadExtract +import com.databricks.labs.automl.params.{MLFlowConfig, ScalingConfig} +import org.apache.spark.sql.DataFrame + +case class InferenceSwitchSettings(naFillFlag: Boolean, + varianceFilterFlag: Boolean, + outlierFilterFlag: Boolean, + pearsonFilterFlag: Boolean, + covarianceFilterFlag: Boolean, + oneHotEncodeFlag: Boolean, + scalingFlag: Boolean, + featureInteractionFlag: Boolean) + +case class InferenceDataConfig(labelCol: String, + featuresCol: String, + startingColumns: Array[String], + fieldsToIgnore: Array[String], + dateTimeConversionType: String) + +case class InferenceModelConfig(modelFamily: String, + modelType: String, + modelLoadMethod: String, + mlFlowConfig: MLFlowConfig, + mlFlowRunId: String, + modelPathLocation: String) + +case 
class NaFillConfig(categoricalColumns: Map[String, String], + numericColumns: Map[String, Double], + booleanColumns: Map[String, Boolean]) + +case class NaFillPayload(categorical: Array[(String, Any)], + numeric: Array[(String, Any)], + boolean: Array[(String, Boolean)]) + +case class VarianceFilterConfig(fieldsRemoved: Array[String]) + +case class OutlierFilteringConfig( + fieldRemovalMap: Map[String, (Double, String)] +) + +case class CovarianceFilteringConfig(fieldsRemoved: Array[String]) + +case class PearsonFilteringConfig(fieldsRemoved: Array[String]) + +case class FeatureInteractionConfig( + interactions: Array[InteractionPayloadExtract] +) + +case class FeatureEngineeringConfig( + naFillConfig: NaFillConfig, + varianceFilterConfig: VarianceFilterConfig, + outlierFilteringConfig: OutlierFilteringConfig, + covarianceFilteringConfig: CovarianceFilteringConfig, + pearsonFilteringConfig: PearsonFilteringConfig, + scalingConfig: ScalingConfig, + featureInteractionConfig: FeatureInteractionConfig +) + +case class InferenceMainConfig( + inferenceDataConfig: InferenceDataConfig, + inferenceSwitchSettings: InferenceSwitchSettings, + inferenceModelConfig: InferenceModelConfig, + featureEngineeringConfig: FeatureEngineeringConfig, + inferenceConfigStorageLocation: String +) + +case class InferenceJsonReturn(compactJson: String, prettyJson: String) +case class MainJsonReturn(compactJson: String, prettyJson: String) + +trait InferenceBaseConstructor { + def data: DataFrame + def modelingColumns: Array[String] + def allColumns: Array[String] +} + +abstract case class InferencePayload() extends InferenceBaseConstructor diff --git a/src/main/scala/com/databricks/labs/automl/inference/InferenceTools.scala b/src/main/scala/com/databricks/labs/automl/inference/InferenceTools.scala new file mode 100644 index 00000000..a3d4095a --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/inference/InferenceTools.scala @@ -0,0 +1,137 @@ +package 
com.databricks.labs.automl.inference + +import com.databricks.labs.automl.executor.config.InstanceConfig +import com.databricks.labs.automl.params.MainConfig +import com.databricks.labs.automl.utils.SparkSessionWrapper +import com.fasterxml.jackson.databind.ObjectMapper +import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.json4s._ +import org.json4s.jackson.Serialization +import org.json4s.jackson.Serialization._ + +trait InferenceTools extends SparkSessionWrapper { + + + //TODO: for chained feature importances, strip out the _si and _oh from field names. + /** + * Helper method for building the InferencePayload courier object that is passed between inference stages + * @param dataFrame The DataFrame at the current stage of the inference run + * @param modelingColumnsPayload The columns to be included in the feature vector + * @param allColumnsPayload The full list of columns (including ignored columns) + * @return a new InferencePayload object wrapping the supplied DataFrame and column listings + */ + def createInferencePayload(dataFrame: DataFrame, modelingColumnsPayload: Array[String], allColumnsPayload: Array[String]): + InferencePayload = { + new InferencePayload { + override def data: DataFrame = dataFrame + override def modelingColumns: Array[String] = modelingColumnsPayload + override def allColumns: Array[String] = allColumnsPayload + } + } + + /** + * Helper method for removing a set of columns from the DataFrame and from both column listings of a payload + * @param payload InferencePayload object + * @param removalArray The names of the columns to remove + * @return a new InferencePayload object with the columns removed from the data and the column listings + */ + def removeArrayOfColumns(payload: InferencePayload, removalArray: Array[String]): InferencePayload = { + + val featureRemoval = payload.modelingColumns.diff(removalArray) + val fullRemoval = payload.allColumns.diff(removalArray) + val data = payload.data.select(fullRemoval map col:_*) + + createInferencePayload(data, featureRemoval, fullRemoval) + + } + + /** + * Handler method for converting the InferenceMainConfig object to a serializable Json String with correct + * scala-compatible data structures.
+ * @param config instance of InferenceMainConfig + * @return [InferenceJsonReturn] consisting of compact form (for logging) and prettyprint form (human readable) + */ + def convertInferenceConfigToJson(config: InferenceMainConfig): InferenceJsonReturn = { + + implicit val formats: Formats = Serialization.formats(hints=NoTypeHints) + val pretty = writePretty(config) + val compact = write(config) + + InferenceJsonReturn( + compactJson = compact, + prettyJson = pretty + ) + } + + def convertMainConfigToJson(config: MainConfig): MainJsonReturn = { + + val objectMapper = new ObjectMapper() + objectMapper.registerModule(DefaultScalaModule) + + MainJsonReturn( + compactJson = objectMapper.writeValueAsString(config), + prettyJson = objectMapper.writerWithDefaultPrettyPrinter.writeValueAsString(config) + ) + } + + /** + * Handler method for converting a read-in json config String to an instance of InferenceMainConfig + * @param jsonConfig the config as a Json-formatted String + * @return config as InstanceOf[InferenceMainConfig] + */ + def convertJsonConfigToClass(jsonConfig: String): InferenceMainConfig = { + + implicit val formats: Formats = Serialization.formats(hints = NoTypeHints) + read[InferenceMainConfig](jsonConfig) + + } + + /** + * Seems a bit counter-intuitive to do this, but this allows for cloud-agnostic storage of the config. + * Otherwise, a configuration would need to be created to manage which cloud this is operating on and handle + * native SDK object writers. Instead of re-inventing the wheel here, a DataFrame can be serialized to + * any cloud-native storage medium with very little issue. + * @param config The inference configuration generated for a particular modeling run + * @return A DataFrame consisting of a single row and a single field. Cell 1:1 contains the json string. 
+ */ + def convertInferenceConfigToDataFrame(config: InferenceMainConfig): DataFrame = { + + import spark.sqlContext.implicits._ + + val jsonConfig = convertInferenceConfigToJson(config) + + sc.parallelize(Seq(jsonConfig.compactJson)).toDF("config") + + } + + /** + * From a supplied DataFrame that contains the configuration in cell 1:1, get the json string + * @param configDataFrame A Dataframe that contains the configuration for the Inference run. + * @return The string-encoded json payload for InferenceMainConfig + */ + def extractInferenceJsonFromDataFrame(configDataFrame: DataFrame): String = { + + configDataFrame.collect()(0).get(0).toString + + } + + /** + * Extract the InferenceMainConfig from a stored DataFrame containing the string-encoded json in row 1, column 1 + * @param configDataFrame A Dataframe that contains the configuration for the Inference run. + * @return an instance of InferenceMainConfig + */ + def extractInferenceConfigFromDataFrame(configDataFrame: DataFrame): InferenceMainConfig = { + + val encodedJson = extractInferenceJsonFromDataFrame(configDataFrame) + + convertJsonConfigToClass(encodedJson) + + } + + +} + + + diff --git a/src/main/scala/com/databricks/labs/automl/model/DecisionTreeTuner.scala b/src/main/scala/com/databricks/labs/automl/model/DecisionTreeTuner.scala new file mode 100644 index 00000000..7179d5fa --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/DecisionTreeTuner.scala @@ -0,0 +1,781 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting, + ModelUtils +} +import com.databricks.labs.automl.params +import com.databricks.labs.automl.params.{ + Defaults, + TreesConfig, + TreesModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import 
org.apache.spark.ml.classification.DecisionTreeClassifier +import org.apache.spark.ml.regression.DecisionTreeRegressor +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.col +import org.apache.spark.storage.StorageLevel + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class DecisionTreeTuner(df: DataFrame, + data: Array[TrainSplitReferences], + modelSelection: String, + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Evolution + with Defaults { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = modelSelection match { + case "regressor" => "rmse" + case "classifier" => "f1" + case _ => + throw new UnsupportedOperationException( + s"Model $modelSelection is not supported." + ) + } + + private var _treesNumericBoundaries = _treesDefaultNumBoundaries + + private var _treesStringBoundaries = _treesDefaultStringBoundaries + + private var _classificationMetrics = classificationMetrics + + def setScoringMetric(value: String): this.type = { + modelSelection match { + case "regressor" => + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + case "classifier" => + require( + classificationMetrics.contains(value), + s"Classifier scoring metric '$value' is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + this._scoringMetric = value + this + } + + def setTreesNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + _treesNumericBoundaries = value + this + } + + def setTreesStringBoundaries(value: Map[String, List[String]]): this.type = { +
_treesStringBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getTreesNumericBoundaries: Map[String, (Double, Double)] = + _treesNumericBoundaries + + def getTreesStringBoundaries: Map[String, List[String]] = + _treesStringBoundaries + + def getClassificationMetrics: List[String] = classificationMetrics + + def getRegressionMetrics: List[String] = regressionMetrics + + /** + * Private method for updating the maxBins setting for the tree algorithm so that cardinality validation + * occurs for each nominal field in the feature vector, ensuring that entropy / information gain / gini calculations + * can be conducted correctly. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def resetNumericBoundaries: this.type = { + + _treesNumericBoundaries = ModelUtils.resetTreeBinsSearchSpace( + df, + _treesNumericBoundaries, + _fieldsToIgnore, + _labelCol, + _featureCol + ) + this + + } + + private def resetClassificationMetrics: List[String] = modelSelection match { + case "classifier" => + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + case _ => classificationMetrics + } + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def modelDecider[A, B](modelConfig: TreesConfig) = { + + val builtModel = modelSelection match { + case "classifier" => + new DecisionTreeClassifier() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setMaxBins(modelConfig.maxBins) + .setImpurity(modelConfig.impurity) + .setMaxDepth(modelConfig.maxDepth) + .setMinInfoGain(modelConfig.minInfoGain) + .setMinInstancesPerNode(modelConfig.minInstancesPerNode) + case "regressor" => + new DecisionTreeRegressor() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setMaxBins(modelConfig.maxBins) + .setImpurity(modelConfig.impurity) + .setMaxDepth(modelConfig.maxDepth) + .setMinInfoGain(modelConfig.minInfoGain)
+ .setMinInstancesPerNode(modelConfig.minInstancesPerNode) + case _ => + throw new UnsupportedOperationException( + s"Unsupported model type $modelSelection" + ) + } + builtModel + } + + override def generateRandomString( + param: String, + boundaryMap: Map[String, List[String]] + ): String = { + + val stringListing = param match { + case "impurity" => + modelSelection match { + case "regressor" => List("variance") + case _ => boundaryMap(param) + } + case _ => boundaryMap(param) + } + _randomizer.shuffle(stringListing).head + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[TreesModelsWithResults] + ): (TreesConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[TreesModelsWithResults] + ): Array[TreesModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[TreesModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[TreesConfig] = { + + val iterations = 
new ArrayBuffer[TreesConfig] + + var i = 0 + do { + val impurity = generateRandomString("impurity", _treesStringBoundaries) + val maxBins = generateRandomInteger("maxBins", _treesNumericBoundaries) + val maxDepth = generateRandomInteger("maxDepth", _treesNumericBoundaries) + val minInfoGain = + generateRandomDouble("minInfoGain", _treesNumericBoundaries) + val minInstancesPerNode = + generateRandomInteger("minInstancesPerNode", _treesNumericBoundaries) + + iterations += TreesConfig( + impurity, + maxBins, + maxDepth, + minInfoGain, + minInstancesPerNode + ) + i += 1 + } while (i < iterationCount) + + iterations.toArray + } + + private def generateAndScoreTreesModel( + train: DataFrame, + test: DataFrame, + modelConfig: TreesConfig, + generation: Int = 1 + ): TreesModelsWithResults = { + + val treesModel = modelDecider(modelConfig) + + val builtModel = treesModel.fit(train) + + val predictedData = builtModel.transform(test) + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + modelSelection match { + case "classifier" => + for (i <- _classificationMetrics) { + scoringMap(i) = + classificationScoring(i, _labelCol, optimizedPredictions) + } + case "regressor" => + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, optimizedPredictions) + } + } + println(s" Scoring metric = ${_scoringMetric}") + val treeModelsWithResults = TreesModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + treeModelsWithResults + } + + private def runBattery(battery: Array[TreesConfig], + generation: Int = 1): Array[TreesModelsWithResults] = { + + val metrics = modelSelection match { + case "classifier" => _classificationMetrics + case _ => regressionMetrics + } + + val statusObj = new ModelReporting("trees", metrics) + + 
validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[TreesModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreTreesModel(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + modelSelection match { + case "classifier" => + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case "regressor" => + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case _ => + throw new UnsupportedOperationException( + s"$modelSelection is not a supported model type." 
+ ) + } + + val runAvg = params.TreesModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + sortAndReturnAll(results) + } + + private def irradiateGeneration( + parents: Array[TreesConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[TreesConfig] = { + + val mutationPayload = new ArrayBuffer[TreesConfig] + val totalConfigs = modelConfigLength[TreesConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + mutationPayload += TreesConfig( + if (mutationIndexIteration.contains(0)) + geneMixing(randomParent.impurity, mutationIteration.impurity) + else randomParent.impurity, + if (mutationIndexIteration.contains(1)) + geneMixing( + randomParent.maxBins, + mutationIteration.maxBins, + mutationMagnitude + ) + else randomParent.maxBins, + if (mutationIndexIteration.contains(2)) + geneMixing( + randomParent.maxDepth, + mutationIteration.maxDepth, + mutationMagnitude + ) + else randomParent.maxDepth, + if (mutationIndexIteration.contains(3)) + geneMixing( + randomParent.minInfoGain, + mutationIteration.minInfoGain, + mutationMagnitude + ) + else randomParent.minInfoGain, + if (mutationIndexIteration.contains(4)) + 
geneMixing( + randomParent.minInstancesPerNode, + mutationIteration.minInstancesPerNode, + mutationMagnitude + ) + else randomParent.minInstancesPerNode + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution(): Array[TreesModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + if (!isPipeline) resetNumericBoundaries + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[TreesModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[TreesConfig] + // Generate the first pool of attempts to seed the hyperparameter space + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[TreesConfig] + val startingModelSeed = generateTreesConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("Trees") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedTrees( + _treesNumericBoundaries, + _treesStringBoundaries + ) + ParHashSet(startingPool: _*) + } + + // Apply 
ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. + val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. 
+ if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[TreesModelsWithResults] + ): Array[TreesConfig] = { + val bestParents = new ArrayBuffer[TreesConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[TreesModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + if (!isPipeline) resetNumericBoundaries + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + // Record of all generations results + val 
fossilRecord = new ArrayBuffer[TreesModelsWithResults] + + val totalConfigs = modelConfigLength[TreesConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[TreesConfig] + val startingModelSeed = generateTreesConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("Trees") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedTrees( + _treesNumericBoundaries, + _treesStringBoundaries + ) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .decisionTreesCandidates( + "Trees", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, 
+ _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .decisionTreesCandidates( + "Trees", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): TreesModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[TreesModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF(): (Array[TreesModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper Method for a post-run model 
optimization based on theoretical hyperparam multidimensional grid search space + * After a genetic tuning run is complete, this allows for a model to be trained and run to predict a potential + * best-condition of hyper parameter configurations. + * + * @param paramsToTest Array of Decision Trees Configuration (hyper parameter settings) from the post-run model + * inference + * @return The results of the hyper parameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[TreesConfig] + ): (Array[TreesModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/Evolution.scala b/src/main/scala/com/databricks/labs/automl/model/Evolution.scala new file mode 100644 index 00000000..92a8b83c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/Evolution.scala @@ -0,0 +1,1220 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.split.PerformanceSettings +import com.databricks.labs.automl.params.{ + Defaults, + EvolutionDefaults, + KSampleConfig, + RandomForestConfig +} +import com.databricks.labs.automl.utils.{ + DataValidation, + SeedConverters, + SparkSessionWrapper +} +import org.apache.spark.ml.evaluation.{ + BinaryClassificationEvaluator, + MulticlassClassificationEvaluator, + RegressionEvaluator +} +import org.apache.spark.sql.DataFrame + +import scala.collection.mutable.ArrayBuffer +import scala.reflect.runtime.universe._ + +trait Evolution + extends DataValidation + with EvolutionDefaults + with SeedConverters + with SparkSessionWrapper + with Defaults { + + var _labelCol: String = _defaultLabel + var _featureCol: String = _defaultFeature + var _trainPortion: Double = _defaultTrainPortion + var _trainSplitMethod: String = _defaultTrainSplitMethod + var 
_kSampleConfig: KSampleConfig = _defaultKSampleConfig + var _trainSplitChronologicalColumn: String = + _defaultTrainSplitChronologicalColumn + var _trainSplitChronologicalRandomPercentage: Double = + _defaultTrainSplitChronologicalRandomPercentage + var _parallelism: Int = _defaultParallelism + var _kFold: Int = _defaultKFold + var _seed: Long = _defaultSeed + var _kFoldIteratorRange: scala.collection.parallel.immutable.ParRange = + Range(0, _kFold).par + var _fieldsToIgnore = _defaultFieldsToIgnoreInVector + var _optimizationStrategy: String = _defaultOptimizationStrategy + var _firstGenerationGenePool: Int = _defaultFirstGenerationGenePool + var _numberOfMutationGenerations: Int = _defaultNumberOfMutationGenerations + var _numberOfParentsToRetain: Int = _defaultNumberOfParentsToRetain + var _numberOfMutationsPerGeneration: Int = + _defaultNumberOfMutationsPerGeneration + var _geneticMixing: Double = _defaultGeneticMixing + var _generationalMutationStrategy: String = + _defaultGenerationalMutationStrategy + var _mutationMagnitudeMode: String = _defaultMutationMagnitudeMode + var _fixedMutationValue: Int = _defaultFixedMutationValue + var _earlyStoppingScore: Double = _defaultEarlyStoppingScore + var _earlyStoppingFlag: Boolean = _defaultEarlyStoppingFlag + + var _evolutionStrategy: String = _defaultEvolutionStrategy + var _geneticMBOCandidateFactor: Int = _defaultGeneticMBOCandidateFactor + var _geneticMBORegressorType: String = _defaultGeneticMBORegressorType + var _continuousEvolutionImprovementThreshold: Int = + _defaultContinuousEvolutionImprovementThreshold + var _continuousEvolutionMaxIterations: Int = + _defaultContinuousEvolutionMaxIterations + var _continuousEvolutionStoppingScore: Double = + _defaultContinuousEvolutionStoppingScore + var _continuousEvolutionParallelism: Int = + _defaultContinuousEvolutionParallelism + var _continuousEvolutionMutationAggressiveness: Int = + _defaultContinuousEvolutionMutationAggressiveness + var 
_continuousEvolutionGeneticMixing: Double = + _defaultContinuousEvolutionGeneticMixing + var _continuousEvolutionRollingImprovementCount: Int = + _defaultContinuousEvolutionRollingImprovementCount + + var _initialGenerationMode: String = _defaultFirstGenMode + var _initialGenerationPermutationCount: Int = _defaultFirstGenPermutations + var _initialGenerationIndexMixingMode: String = + _defaultFirstGenIndexMixingMode + var _initialGenerationArraySeed: Long = _defaultFirstGenArraySeed + var _hyperSpaceModelCount: Int = _defaultHyperSpaceModelCount + + var _modelSeedSet: Boolean = false + var _modelSeed: Map[String, Any] = Map.empty + + var _dataReduce: Double = _defaultDataReduce + + var _syntheticCol: String = _defaultKSampleConfig.syntheticCol + var _kGroups: Int = _defaultKSampleConfig.kGroups + var _kMeansMaxIter: Int = _defaultKSampleConfig.kMeansMaxIter + var _kMeansTolerance: Double = _defaultKSampleConfig.kMeansTolerance + var _kMeansDistanceMeasurement: String = + _defaultKSampleConfig.kMeansDistanceMeasurement + var _kMeansSeed: Long = _defaultKSampleConfig.kMeansSeed + var _kMeansPredictionCol: String = _defaultKSampleConfig.kMeansPredictionCol + var _lshHashTables: Int = _defaultKSampleConfig.lshHashTables + var _lshSeed: Long = _defaultKSampleConfig.lshSeed + var _lshOutputCol: String = _defaultKSampleConfig.lshOutputCol + var _quorumCount: Int = _defaultKSampleConfig.quorumCount + var _minimumVectorCountToMutate: Int = + _defaultKSampleConfig.minimumVectorCountToMutate + var _vectorMutationMethod: String = _defaultKSampleConfig.vectorMutationMethod + var _mutationMode: String = _defaultKSampleConfig.mutationMode + var _mutationValue: Double = _defaultKSampleConfig.mutationValue + var _labelBalanceMode: String = _defaultKSampleConfig.labelBalanceMode + var _cardinalityThreshold: Int = _defaultKSampleConfig.cardinalityThreshold + var _numericRatio: Double = _defaultKSampleConfig.numericRatio + var _numericTarget: Int = _defaultKSampleConfig.numericTarget 
+ + var _randomizer: scala.util.Random = scala.util.Random + _randomizer.setSeed(_seed) + + def setLabelCol(value: String): this.type = { + _labelCol = value + this + } + + def setFeaturesCol(value: String): this.type = { + _featureCol = value + this + } + + def setFieldsToIgnore(value: Array[String]): this.type = { + _fieldsToIgnore = value + this + } + + def setTrainPortion(value: Double): this.type = { + require( + value < 1.0 & value > 0.0, + "Training portion must be in the range > 0 and < 1" + ) + _trainPortion = value + this + } + + def setTrainSplitMethod(value: String): this.type = { + require( + allowableTrainSplitMethod.contains(value), + s"TrainSplitMethod $value must be one of: ${allowableTrainSplitMethod.mkString(", ")}" + ) + _trainSplitMethod = value + this + } + + /** + * Setter - for setting the name of the Synthetic column name + * @param value String: A column name that is uniquely not part of the main DataFrame + * @since 0.5.1 + * @author Ben Wilson + */ + def setSyntheticCol(value: String): this.type = { + _syntheticCol = value + this + } + + /** + * Setter for specifying the number of K-Groups to generate in the KMeans model + * @param value Int: number of k groups to generate + * @return this + */ + def setKGroups(value: Int): this.type = { + _kGroups = value + this + } + + /** + * Setter for specifying the maximum number of iterations for the KMeans model to go through to converge + * @param value Int: Maximum limit on iterations + * @return this + */ + def setKMeansMaxIter(value: Int): this.type = { + _kMeansMaxIter = value + this + } + + /** + * Setter for Setting the tolerance for KMeans (must be >0) + * @param value The tolerance value setting for KMeans + * @see reference: [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans]] + * for further details. 
+ * @return this + * @throws IllegalArgumentException() if a value less than 0 is entered + */ + @throws(classOf[IllegalArgumentException]) + def setKMeansTolerance(value: Double): this.type = { + require( + value > 0, + s"KMeans tolerance value ${value.toString} is out of range. Must be > 0." + ) + _kMeansTolerance = value + this + } + + /** + * Setter for which distance measurement to use to calculate the nearness of vectors to a centroid + * @param value String: Options -> "euclidean" or "cosine" Default: "euclidean" + * @return this + * @throws IllegalArgumentException() if an invalid value is entered + */ + @throws(classOf[IllegalArgumentException]) + def setKMeansDistanceMeasurement(value: String): this.type = { + require( + allowableKMeansDistanceMeasurements.contains(value), + s"Kmeans Distance Measurement $value is not " + + s"a valid mode of operation. Must be one of: ${allowableKMeansDistanceMeasurements.mkString(", ")}" + ) + _kMeansDistanceMeasurement = value + this + } + + /** + * Setter for a KMeans seed for the clustering algorithm + * @param value Long: Seed value + * @return this + */ + def setKMeansSeed(value: Long): this.type = { + _kMeansSeed = value + this + } + + /** + * Setter for the internal KMeans column for cluster membership attribution + * @param value String: column name for internal algorithm column for group membership + * @return this + */ + def setKMeansPredictionCol(value: String): this.type = { + _kMeansPredictionCol = value + this + } + + /** + * Setter for Configuring the number of Hash Tables to use for MinHashLSH + * @param value Int: Count of hash tables to use + * @see [[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH]] + * for more information + * @return this + */ + def setLSHHashTables(value: Int): this.type = { + _lshHashTables = value + this + } + + /** + * Setter for the LSH Seed for the model + * @param value Long: Seed value + * @return this + */ + def 
setLSHSeed(value: Long): this.type = { + _lshSeed = value + this + } + + /** + * Setter for the internal LSH output hash information column + * @param value String: column name for the internal MinHashLSH Model transformation value + * @return this + */ + def setLSHOutputCol(value: String): this.type = { + _lshOutputCol = value + this + } + + /** + * Setter for how many vectors to find in adjacency to the centroid for generation of synthetic data + * @note the higher the value set here, the higher the variance in synthetic data generation + * @param value Int: Number of vectors to find nearest each centroid within the class + * @return this + */ + def setQuorumCount(value: Int): this.type = { + _quorumCount = value + this + } + + /** + * Setter for minimum threshold for vector indexes to mutate within the feature vector. + * @note In vectorMutationMethod "fixed" this sets the fixed count of how many vector positions to mutate. + * In vectorMutationMethod "random" this sets the lower threshold for 'at least this many indexes will + * be mutated' + * @param value The minimum (or fixed) number of indexes to mutate. + * @return this + */ + def setMinimumVectorCountToMutate(value: Int): this.type = { + _minimumVectorCountToMutate = value + this + } + + /** + * Setter for the Vector Mutation Method + * @note Options: + * "fixed" - will use the value of minimumVectorCountToMutate to select random indexes of this number of indexes. + * "random" - will use this number as a lower bound on a random selection of indexes between this and the vector length. + * "all" - will mutate all of the vectors. + * @param value String - the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setVectorMutationMethod(value: String): this.type = { + require( + allowableVectorMutationMethods.contains(value), + s"Vector Mutation Mode $value is not supported. 
" + + s"Must be one of: ${allowableVectorMutationMethods.mkString(", ")} " + ) + _vectorMutationMethod = value + this + } + + /** + * Setter for the Mutation Mode of the feature vector individual values + * @note Options: + * "weighted" - uses weighted averaging to scale the euclidean distance between the centroid vector and mutation candidate vectors + * "random" - randomly selects a position on the euclidean vector between the centroid vector and the candidate mutation vectors + * "ratio" - uses a ratio between the values of the centroid vector and the mutation vector * + * @param value String: the mode to use. + * @return this + * @throws IllegalArgumentException() if the mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setMutationMode(value: String): this.type = { + require( + allowableMutationModes.contains(value), + s"Mutation Mode $value is not a valid mode of operation. " + + s"Must be one of: ${allowableMutationModes.mkString(", ")}" + ) + _mutationMode = value + this + } + + /** + * Setter for specifying the mutation magnitude for the modes 'weighted' and 'ratio' in mutationMode + * @param value Double: value between 0 and 1 for mutation magnitude adjustment. + * @note the higher this value, the closer to the centroid vector vs. the candidate mutation vector the synthetic row data will be. + * @return this + * @throws IllegalArgumentException() if the value specified is outside of the range (0, 1) + */ + @throws(classOf[IllegalArgumentException]) + def setMutationValue(value: Double): this.type = { + require( + value > 0 & value < 1, + s"Mutation Value must be between 0 and 1. Value $value is not permitted." + ) + _mutationValue = value + this + } + + /** + * Setter - for determining the label balance approach mode. + * @note Available modes:
+ * 'match': Will match all smaller class counts to the largest class count. [WARNING] - May significantly increase memory pressure!
+ * 'percentage' Will adjust smaller classes to a percentage value of the largest class count. + * 'target' Will increase smaller class counts to a fixed numeric target of rows. + * @param value String: one of: 'match', 'percentage' or 'target' + * @note Default: "percentage" + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException() if the provided mode is not supported. + */ + @throws(classOf[IllegalArgumentException]) + def setLabelBalanceMode(value: String): this.type = { + require( + allowableLabelBalanceModes.contains(value), + s"Label Balance Mode $value is not supported. " + + s"Must be one of: ${allowableLabelBalanceModes.mkString(", ")}" + ) + _labelBalanceMode = value + this + } + + /** + * Setter - for overriding the cardinality threshold exception threshold. [WARNING] increasing this value on + * a sufficiently large data set could incur, during runtime, excessive memory and cpu pressure on the cluster. + * @param value Int: the limit above which an exception will be thrown for a classification problem wherein the + * label distinct count is too large to successfully generate synthetic data. + * @note Default: 20 + * @since 0.5.1 + * @author Ben Wilson + */ + def setCardinalityThreshold(value: Int): this.type = { + _cardinalityThreshold = value + this + } + + /** + * Setter - for specifying the percentage ratio for the mode 'percentage' in setLabelBalanceMode() + * @param value Double: A fractional double in the range of 0.0 to 1.0. + * @note Setting this value to 1.0 is equivalent to setting the label balance mode to 'match' + * @note Default: 0.2 + * @since 0.5.1 + * @author Ben Wilson + * @throws IllegalArgumentException() if the provided value is outside of the range of 0.0 -> 1.0 + */ + @throws(classOf[IllegalArgumentException]) + def setNumericRatio(value: Double): this.type = { + require( + value <= 1.0 & value > 0.0, + s"Invalid Numeric Ratio entered! Must be between 0 and 1." 
+ s" ${value.toString} is not valid." + ) + _numericRatio = value + this + } + + /** + * Setter - for specifying the target row count to generate for 'target' mode in setLabelBalanceMode() + * @param value Int: The desired final number of rows per minority class label + * @note [WARNING] Setting this value to too high of a number will greatly increase runtime and memory pressure. + * @since 0.5.1 + * @author Ben Wilson + */ + def setNumericTarget(value: Int): this.type = { + _numericTarget = value + this + } + + def setTrainSplitChronologicalColumn(value: String): this.type = { + _trainSplitChronologicalColumn = value + this + } + + def setTrainSplitChronologicalRandomPercentage(value: Double): this.type = { + _trainSplitChronologicalRandomPercentage = value + if (value > 10) + println( + "[WARNING] setTrainSplitChronologicalRandomPercentage() setting this value above 10 " + + "percent will cause significant per-run train/test skew and variability in row counts during training. " + + "Use higher values only if this is desired." + ) + this + } + + def setParallelism(value: Int): this.type = { + require( + value < 10000, + s"Parallelism above 10000 will result in cluster instability." 
+ ) + _parallelism = value + this + } + + def setKFold(value: Int): this.type = { + _kFold = value + _kFoldIteratorRange = Range(0, _kFold).par + this + } + + def setSeed(value: Long): this.type = { + _seed = value + this + } + + def setOptimizationStrategy(value: String): this.type = { + val valueLC = value.toLowerCase + require( + allowableOptimizationStrategies.contains(valueLC), + s"Optimization Strategy '$valueLC' is not a member of ${invalidateSelection(valueLC, allowableOptimizationStrategies)}" + ) + _optimizationStrategy = valueLC + this + } + + def setFirstGenerationGenePool(value: Int): this.type = { + require( + value >= 5, + s"Values less than 5 for firstGenerationGenePool will require excessive generational mutation to converge" + ) + _firstGenerationGenePool = value + this + } + + def setNumberOfMutationGenerations(value: Int): this.type = { + require(value > 0, s"Number of Generations must be greater than 0") + _numberOfMutationGenerations = value + this + } + + def setNumberOfParentsToRetain(value: Int): this.type = { + require( + value > 0, + s"Number of Parents must be greater than 0. '$value' is not a valid number." + ) + _numberOfParentsToRetain = value + this + } + + def setNumberOfMutationsPerGeneration(value: Int): this.type = { + require( + value > 0, + s"Number of Mutations per generation must be greater than 0. '$value' is not a valid number." + ) + _numberOfMutationsPerGeneration = value + this + } + + def setGeneticMixing(value: Double): this.type = { + require( + value < 1.0 & value > 0.0, + s"Genetic Mixing must be in range (0,1). Current Setting of $value is not permitted." 
+ ) + _geneticMixing = value + this + } + + def setGenerationalMutationStrategy(value: String): this.type = { + val valueLC = value.toLowerCase + require( + allowableMutationStrategies.contains(valueLC), + s"Generational Mutation Strategy '$valueLC' is not a member of ${invalidateSelection(valueLC, allowableMutationStrategies)}" + ) + _generationalMutationStrategy = valueLC + this + } + + def setMutationMagnitudeMode(value: String): this.type = { + val valueLC = value.toLowerCase + require( + allowableMutationMagnitudeMode.contains(valueLC), + s"Mutation Magnitude Mode '$valueLC' is not a member of ${invalidateSelection(valueLC, allowableMutationMagnitudeMode)}" + ) + _mutationMagnitudeMode = valueLC + this + } + + def setFixedMutationValue(value: Int): this.type = { + val maxMutationCount = modelConfigLength[RandomForestConfig] + require( + value <= maxMutationCount, + s"Mutation count '$value' cannot exceed number of hyperparameters ($maxMutationCount)" + ) + require(value > 0, s"Mutation count '$value' must be greater than 0") + _fixedMutationValue = value + this + } + + def setEarlyStoppingScore(value: Double): this.type = { + _earlyStoppingScore = value + this + } + + def setEarlyStoppingFlag(value: Boolean): this.type = { + _earlyStoppingFlag = value + this + } + + def setEvolutionStrategy(value: String): this.type = { + require( + allowableEvolutionStrategies.contains(value), + s"Evolution Strategy '$value' is not a supported mode. Must be one of: ${invalidateSelection(value, allowableEvolutionStrategies)}" + ) + _evolutionStrategy = value + this + } + + /** + * Setter for defining the secondary stopping criterion for continuous training mode (the number of consistently + * non-improving runs after which the learning algorithm terminates due to diminishing returns). + * @param value Negative Integer (an improvement to a priori will reset the counter and subsequent non-improvements + * will decrement a mutable counter. 
If the counter hits this limit specified in value, the continuous + * mode algorithm will stop). + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is positive. + */ + @throws(classOf[IllegalArgumentException]) + def setContinuousEvolutionImprovementThreshold(value: Int): this.type = { + require( + value < 0, + s"ContinuousEvolutionImprovementThreshold must be less than zero. It is " + + s"recommended to set this value to less than -4." + ) + _continuousEvolutionImprovementThreshold = value + this + } + + /** + * Setter for selecting the type of Regressor to use for the within-epoch generation MBO of candidates + * @param value String - one of "XGBoost", "LinearRegression" or "RandomForest" + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not supported + */ + @throws(classOf[IllegalArgumentException]) + def setGeneticMBORegressorType(value: String): this.type = { + require( + allowableMBORegressorTypes.contains(value), + s"GeneticRegressorType $value is not a supported Regressor " + + s"Type. Must be one of: ${allowableMBORegressorTypes.mkString(", ")}" + ) + _geneticMBORegressorType = value + this + } + + /** + * Setter for defining the factor to be applied to the candidate listing of hyperparameters to generate through + * mutation for each generation other than the initial and post-modeling optimization phases. The larger this + * value (default: 10), the more potential space can be searched. There is not a large performance hit to this, + * and as such, values in excess of 100 are viable. + * @param value Int - a factor to multiply the numberOfMutationsPerGeneration by to generate a count of potential + * candidates. + * @author Ben Wilson, Databricks + * @since 0.6.0 + * @throws IllegalArgumentException if the value is not greater than zero. 
+ */ + @throws(classOf[IllegalArgumentException]) + def setGeneticMBOCandidateFactor(value: Int): this.type = { + require(value > 0, s"GeneticMBOCandidateFactor must be greater than zero.") + _geneticMBOCandidateFactor = value + this + } + + def setContinuousEvolutionMaxIterations(value: Int): this.type = { + if (value > 500) + println( + s"[WARNING] Total Modeling count $value is higher than recommended limit of 500. " + + s"This tuning will take a long time to run." + ) + _continuousEvolutionMaxIterations = value + this + } + + def setContinuousEvolutionStoppingScore(value: Double): this.type = { + _continuousEvolutionStoppingScore = value + this + } + + def setContinuousEvolutionParallelism(value: Int): this.type = { + if (value > 10) + println( + s"[WARNING] ContinuousEvolutionParallelism -> $value is higher than recommended " + + s"concurrency for efficient optimization for convergence." + + s"\n Setting this value below 11 will converge faster in most cases." + ) + _continuousEvolutionParallelism = value + this + } + + def setContinuousEvolutionMutationAggressiveness(value: Int): this.type = { + if (value > 4) + println( + s"[WARNING] ContinuousEvolutionMutationAggressiveness -> $value. " + + s"\n Setting this higher than 4 will result in extensive random search and will take longer to converge " + + s"to optimal hyperparameters." + ) + _continuousEvolutionMutationAggressiveness = value + this + } + + def setContinuousEvolutionGeneticMixing(value: Double): this.type = { + require( + value < 1.0 & value > 0.0, + s"ContinuousEvolutionGeneticMixing must be in range (0,1). Current Setting of $value is not permitted." + ) + _continuousEvolutionGeneticMixing = value + this + } + + def setContinuousEvolutionRollingImporvementCount(value: Int): this.type = { + require( + value > 0, + s"ContinuousEvolutionRollingImprovementCount must be > 0. $value is invalid." + ) + if (value < 10) + println( + s"[WARNING] ContinuousEvolutionRollingImprovementCount -> $value setting is low. 
" + + s"Optimal Convergence may not occur due to early stopping." + ) + _continuousEvolutionRollingImprovementCount = value + this + } + + def setModelSeed(value: Map[String, Any]): this.type = { + _modelSeed = value + _modelSeedSet = true + this + } + + def setDataReductionFactor(value: Double): this.type = { + require(value > 0, s"Data Reduction Factor must be between 0 and 1") + require(value < 1, s"Data Reduction Factor must be between 0 and 1") + _dataReduce = value + this + } + + def setFirstGenMode(value: String): this.type = { + require( + allowableInitialGenerationModes.contains(value), + s"First Generation Mode '$value' is not a supported mode." + + s" Must be one of: ${invalidateSelection(value, allowableInitialGenerationModes)}" + ) + _initialGenerationMode = value + this + } + + def setFirstGenPermutations(value: Int): this.type = { + _initialGenerationPermutationCount = value + this + } + + def setHyperSpaceModelCount(value: Int): this.type = { + _hyperSpaceModelCount = value + this + } + + def setFirstGenIndexMixingMode(value: String): this.type = { + require( + allowableInitialGenerationIndexMixingModes.contains(value), + s"First Generation Mode '$value' is not a" + + s"supported mode. 
Must be one of ${invalidateSelection(value, allowableInitialGenerationIndexMixingModes)}" + ) + _initialGenerationIndexMixingMode = value + this + } + + def setFirstGenArraySeed(value: Long): this.type = { + _initialGenerationArraySeed = value + this + } + + def getFirstGenArraySeed: Long = _initialGenerationArraySeed + + def getFirstGenIndexMixingMode: String = _initialGenerationIndexMixingMode + + def getFirstGenPermutations: Int = _initialGenerationPermutationCount + + def getFirstGenMode: String = _initialGenerationMode + + def getHyperSpaceModelCount: Int = _hyperSpaceModelCount + + def getLabelCol: String = _labelCol + + def getFeaturesCol: String = _featureCol + + def getFieldsToIgnore: Array[String] = _fieldsToIgnore + + def getTrainPortion: Double = _trainPortion + + def getTrainSplitMethod: String = _trainSplitMethod + + def getTrainSplitChronologicalColumn: String = _trainSplitChronologicalColumn + + def getTrainSplitChronologicalRandomPercentage: Double = + _trainSplitChronologicalRandomPercentage + + def getParallelism: Int = _parallelism + + def getKFold: Int = _kFold + + def getSeed: Long = _seed + + def getOptimizationStrategy: String = _optimizationStrategy + + def getFirstGenerationGenePool: Int = _firstGenerationGenePool + + def getNumberOfMutationGenerations: Int = _numberOfMutationGenerations + + def getNumberOfParentsToRetain: Int = _numberOfParentsToRetain + + def getNumberOfMutationsPerGeneration: Int = _numberOfMutationsPerGeneration + + def getGeneticMixing: Double = _geneticMixing + + def getGenerationalMutationStrategy: String = _generationalMutationStrategy + + def getMutationMagnitudeMode: String = _mutationMagnitudeMode + + def getFixedMutationValue: Int = _fixedMutationValue + + def getEarlyStoppingScore: Double = _earlyStoppingScore + + def getEarlyStoppingFlag: Boolean = _earlyStoppingFlag + + def getEvolutionStrategy: String = _evolutionStrategy + + def getGeneticMBORegressorType: String = _geneticMBORegressorType + + def 
getGeneticMBOCandidateFactor: Int = _geneticMBOCandidateFactor + + def getContinuousEvolutionImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + def getContinuousEvolutionMaxIterations: Int = + _continuousEvolutionMaxIterations + + def getContinuousEvolutionStoppingScore: Double = + _continuousEvolutionStoppingScore + + def getContinuousEvolutionParallelism: Int = _continuousEvolutionParallelism + + def getContinuousEvolutionMutationAggressiveness: Int = + _continuousEvolutionMutationAggressiveness + + def getContinuousEvolutionGeneticMixing: Double = + _continuousEvolutionGeneticMixing + + def getContinuousEvolutionRollingImporvementCount: Int = + _continuousEvolutionRollingImprovementCount + + def getModelSeed: Map[String, Any] = _modelSeed + + def getDataReductionFactor: Double = _dataReduce + + // DEBUG for logging purposes of configurations. + def debugSettings: String = { + + s"DEBUG: \n Evolution.scala --> xgbWorkers: ${PerformanceSettings.xgbWorkers(_parallelism)} \n " + + s"Evolution.scala --> totalCores: ${PerformanceSettings.totalCores} \n " + + s"Evolution.scala --> _parallelism: ${_parallelism} \n " + + s"Evolution.scala --> getParallelism: ${getParallelism} \n " + + s"Evolution.scala --> optimalJVMModelPartitions: ${PerformanceSettings + .optimalJVMModelPartitions(_parallelism)} \n " + + s"Evolution.scala --> parTasks: ${PerformanceSettings.parTasks}" + + } + + /** + * Internal method for validating if a numeric mapping that is specified contains any invalid keys + * @param standardConfig The static defined numeric mapping for a model type + * @param modConfig a user-specified mapping override + * @since 0.6.1 + * @author Ben Wilson, Databricks + * @throws IllegalArgumentException if the key is invalid for the model type specified. 
+ */ + @throws(classOf[IllegalArgumentException]) + protected[model] def validateNumericMapping( + standardConfig: Map[String, (Double, Double)], + modConfig: Map[String, (Double, Double)] + ): Unit = { + + val staticKeys = standardConfig.keys.toArray + val modKeys = modConfig.keys.toArray + + modKeys.foreach( + x => + if (!staticKeys.contains(x)) + throw new IllegalArgumentException( + s"The numeric Boundary map key " + + s"supplied: [$x] is not a valid member of Numeric Mapping. " + + s"\nKeys are restricted to: [${staticKeys.mkString(", ")}]" + ) + ) + + } + + /** + * Internal method for validating if a string mapping that is specified contains any invalid keys + * @param standardConfig The static defined string mapping for a model type + * @param modConfig a user-specified mapping override + * @since 0.6.1 + * @author Ben Wilson, Databricks + * @throws IllegalArgumentException if the key is invalid for the model type specified. + */ + @throws(classOf[IllegalArgumentException]) + protected[model] def validateStringMapping( + standardConfig: Map[String, List[String]], + modConfig: Map[String, List[String]] + ): Unit = { + val staticKeys = standardConfig.keys.toArray + val modKeys = modConfig.keys.toArray + + modKeys.foreach( + x => + if (!staticKeys.contains(x)) + throw new IllegalArgumentException( + s"The string Boundary map key " + + s"supplied: [$x] is not a valid member of String Mapping. 
" + + s"\nKeys are restricted to: [${staticKeys.mkString(", ")}]" + ) + ) + } + + /** + * Helper function for partially updating a numeric mapping + * @param defaultMap The default configuration Map for a numeric mapping for model hyperparameter search space + * @param updateMap user-supplied updated map (doesn't have to have all elements in it) + * @return The default map, updated with the user-supplied overrides + * @since 0.6.1 + * @author Ben Wilson, Jas Bali Databricks + */ + def partialOverrideNumericMapping( + defaultMap: Map[String, (Double, Double)], + updateMap: Map[String, (Double, Double)] + ): Map[String, (Double, Double)] = { + + defaultMap ++ updateMap + } + + /** + * Helper function for partially updating a string mapping + * + * @param defaultMap The default configuration Map for a string mapping for model hyperparameter search space + * @param updateMap user-supplied updated map (doesn't have to have all elements in it) + * @return The default map, updated with the user-supplied overrides + * @since 0.6.1 + * @author Ben Wilson, Jas Bali Databricks + */ + def partialOverrideStringMapping( + defaultMap: Map[String, List[String]], + updateMap: Map[String, List[String]] + ): Map[String, List[String]] = { + defaultMap ++ updateMap + } + + // TODO - Calculation should take into account early stopping + def totalModels: Int = _evolutionStrategy match { + case "batch" => + (_numberOfMutationsPerGeneration * _numberOfMutationGenerations) + _firstGenerationGenePool + + _initialGenerationPermutationCount + _hyperSpaceModelCount + case "continuous" => + _continuousEvolutionMaxIterations - _continuousEvolutionParallelism + _firstGenerationGenePool + case _ => + throw new MatchError( + s"EvolutionStrategy mode ${_evolutionStrategy} is not supported." 
+ + s"\n Choose one of: ${allowableEvolutionStrategies.mkString(", ")}" + ) + } + + def modelConfigLength[T: TypeTag]: Int = { + typeOf[T].members + .collect { + case m: MethodSymbol if m.isCaseAccessor => m + } + .toList + .length + } + + def extractBoundaryDouble( + param: String, + boundaryMap: Map[String, (AnyVal, AnyVal)] + ): (Double, Double) = { + val minimum = boundaryMap(param)._1.asInstanceOf[Double] + val maximum = boundaryMap(param)._2.asInstanceOf[Double] + (minimum, maximum) + } + + def extractBoundaryInteger( + param: String, + boundaryMap: Map[String, (AnyVal, AnyVal)] + ): (Int, Int) = { + val minimum = boundaryMap(param)._1.asInstanceOf[Double].toInt + val maximum = boundaryMap(param)._2.asInstanceOf[Double].toInt + (minimum, maximum) + } + + def generateRandomDouble( + param: String, + boundaryMap: Map[String, (AnyVal, AnyVal)] + ): Double = { + val (minimumValue, maximumValue) = extractBoundaryDouble(param, boundaryMap) + minimumValue + _randomizer.nextDouble() * (maximumValue - minimumValue) + } + + def generateRandomInteger(param: String, + boundaryMap: Map[String, (AnyVal, AnyVal)]): Int = { + val (minimumValue, maximumValue) = + extractBoundaryInteger(param, boundaryMap) + _randomizer.nextInt(maximumValue - minimumValue) + minimumValue + } + + def generateRandomString(param: String, + boundaryMap: Map[String, List[String]]): String = { + _randomizer.shuffle(boundaryMap(param)).head + } + + def coinFlip(): Boolean = { + math.random < 0.5 + } + + def coinFlip(parent: Boolean, child: Boolean, p: Double): Boolean = { + if (math.random < p) parent else child + } + + def buildLayerArray(inputFeatureSize: Int, + distinctClasses: Int, + nLayers: Int, + hiddenLayerSizeAdjust: Int): Array[Int] = { + + val layerConstruct = new ArrayBuffer[Int] + + layerConstruct += inputFeatureSize + + (1 to nLayers).foreach { x => + layerConstruct += inputFeatureSize + nLayers - x + hiddenLayerSizeAdjust + } + layerConstruct += distinctClasses + 
layerConstruct.result.toArray + } + + def generateLayerArray(layerParam: String, + layerSizeParam: String, + boundaryMap: Map[String, (AnyVal, AnyVal)], + inputFeatureSize: Int, + distinctClasses: Int): Array[Int] = { + + val layersToGenerate = generateRandomInteger(layerParam, boundaryMap) + val hiddenLayerSizeAdjust = + generateRandomInteger(layerSizeParam, boundaryMap) + + buildLayerArray( + inputFeatureSize, + distinctClasses, + layersToGenerate, + hiddenLayerSizeAdjust + ) + + } + + def getRandomIndeces(minimum: Int, + maximum: Int, + parameterCount: Int): List[Int] = { + val fullIndexArray = List.range(0, maximum) + val randomSeed = new scala.util.Random + val count = minimum + randomSeed.nextInt((parameterCount - minimum) + 1) + val adjCount = if (count < 1) 1 else count + val shuffledArray = scala.util.Random.shuffle(fullIndexArray).take(adjCount) + shuffledArray.sortWith(_ < _) + } + + def getFixedIndeces(minimum: Int, + maximum: Int, + parameterCount: Int): List[Int] = { + val fullIndexArray = List.range(0, maximum) + val randomSeed = new scala.util.Random + randomSeed.shuffle(fullIndexArray).take(parameterCount).sortWith(_ < _) + } + + def generateMutationIndeces(minimum: Int, + maximum: Int, + parameterCount: Int, + mutationCount: Int): Array[List[Int]] = { + val mutations = new ArrayBuffer[List[Int]] + for (_ <- 0 to mutationCount) { + _mutationMagnitudeMode match { + case "random" => + mutations += getRandomIndeces(minimum, maximum, parameterCount) + case "fixed" => + mutations += getFixedIndeces(minimum, maximum, parameterCount) + case _ => + throw new UnsupportedOperationException( + s"Unsupported mutationMagnitudeMode ${_mutationMagnitudeMode}" + ) + } + } + mutations.result.toArray + } + + def geneMixing(parent: Double, + child: Double, + parentMutationPercentage: Double): Double = { + (parent * parentMutationPercentage) + (child * (1 - parentMutationPercentage)) + } + + def geneMixing(parent: Int, + child: Int, + parentMutationPercentage: Double): Int 
= { + ((parent * parentMutationPercentage) + (child * (1 - parentMutationPercentage))).toInt + } + + def geneMixing(parent: String, child: String): String = { + val mixed = new ArrayBuffer[String] + mixed += parent += child + scala.util.Random.shuffle(mixed.toList).head + } + + def geneMixing(parent: Array[Int], + child: Array[Int], + parentMutationPercentage: Double): Array[Int] = { + + val staticStart = parent.head + val staticEnd = parent.last + + val parentHiddenLayers = parent.length - 2 + val childHiddenLayers = child.length - 2 + + val parentMagnitude = parent(1) - staticStart + val childMagnitude = child(1) - staticStart + + val hiddenLayerMix = geneMixing( + parentHiddenLayers, + childHiddenLayers, + parentMutationPercentage + ) + val sizeAdjustMix = + geneMixing(parentMagnitude, childMagnitude, parentMutationPercentage) + + buildLayerArray(staticStart, staticEnd, hiddenLayerMix, sizeAdjustMix) + + } + + /** + * Method for calculating the remaining time left on the genetic algorithm training (roughly) + * @note Due to the asynchronous nature of the algorithm, the times are not exact and are a reflection of time + * since the creation of the Futures and when they were initially inserted into the thread pool. + * @param currentGen The current Generation that the model is running on + * @param currentModel The index of the current model that is being run. + * @return A Double representing the total completion percentage of the modeling portion of the run. 
+ * @since 0.2.1 + * @author Ben Wilson + */ + def calculateModelingFamilyRemainingTime(currentGen: Int, + currentModel: Int): Double = { + + val modelsComplete = _evolutionStrategy match { + case "batch" => + if (currentGen == 1) { + currentModel + } else { + _firstGenerationGenePool + (_numberOfMutationsPerGeneration * (currentGen - 2) + currentModel) + } + case _ => currentGen + _firstGenerationGenePool + } + + (modelsComplete.toDouble / totalModels.toDouble) * 100 + + } + + /** + * Method for validating the distinct class count for a classification type model (for use in determining which + * evaluator to employ for scoring and optimization of each model) + * @param df source Dataframe (prior to splitting for train/test) + * @return Boolean true for Binary Classification problem, false for multi-class problem + * @since 0.4.0 + * @author Ben Wilson + */ + def classificationAdjudicator(df: DataFrame): Boolean = { + + // Calculate the distinct entries of the label value for a classification problem + val uniqueLabelCounts = df.select(_labelCol).distinct().count() + + uniqueLabelCounts <= 2 + + } + + /** + * Method for restricting the metrics that are available for optimizing classification problems + * @param binaryValidation boolean check from classificationAdjudicator() method + * @param metricPayload the hard-coded List[String] of allowable classification metrics + * from com.databricks.labs.automl.params.EvolutionDefaults + * @return a copy of the allowable params list with the Binary metrics removed if this is a multiclass problem. 
+ * @since 0.4.0 + * @author Ben Wilson + */ + def classificationMetricValidator( + binaryValidation: Boolean, + metricPayload: List[String] + ): List[String] = { + + if (binaryValidation) { + metricPayload + } else { + metricPayload.diff(List("areaUnderROC", "areaUnderPR")) + } + + } + + /** + * Method for scoring and evaluating classification models (supporting both multi-class and binary classification + * problems) + * @param metricName the metric to be tested against (both for binary and multi-class) + * @param labelColumn the column name in the data set that is the 'source of truth' to compare against + * @param data the DataFrame that has been transformed + * @return the score, as a Double value. + * @since 0.4.0 + * @author Ben Wilson + */ + def classificationScoring(metricName: String, + labelColumn: String, + data: DataFrame): Double = { + + metricName match { + case "areaUnderPR" | "areaUnderROC" => + new BinaryClassificationEvaluator() + .setLabelCol(labelColumn) + .setRawPredictionCol("probability") + .setMetricName(metricName) + .evaluate(data) + case _ => + new MulticlassClassificationEvaluator() + .setLabelCol(labelColumn) + .setPredictionCol("prediction") + .setMetricName(metricName) + .evaluate(data) + } + + } + + /** + * Method for scoring Regression models. + * @param metricName The metric desired to be tested + * @param labelColumn The name of the label column + * @param data the DataFrame that has been transformed by a model. + * @return the score for the metric, as a Double value. 
+ * @since 0.4.0 + * @author Ben Wilson + */ + def regressionScoring(metricName: String, + labelColumn: String, + data: DataFrame): Double = { + + new RegressionEvaluator() + .setLabelCol(labelColumn) + .setMetricName(metricName) + .evaluate(data) + + } + + def generateAggressiveness(totalConfigs: Int, currentIteration: Int): Int = { + val mutationAggressiveness = _generationalMutationStrategy match { + case "linear" => + if (totalConfigs - (currentIteration + 1) < 1) 1 + else + totalConfigs - (currentIteration + 1) + case _ => _fixedMutationValue + } + mutationAggressiveness + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/GBTreesTuner.scala b/src/main/scala/com/databricks/labs/automl/model/GBTreesTuner.scala new file mode 100644 index 00000000..44d28ecd --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/GBTreesTuner.scala @@ -0,0 +1,818 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting, + ModelUtils +} +import com.databricks.labs.automl.params.{ + Defaults, + GBTConfig, + GBTModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.ml.classification.GBTClassifier +import org.apache.spark.ml.regression.GBTRegressor +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions.col + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class GBTreesTuner(df: DataFrame, + data: Array[TrainSplitReferences], + modelSelection: String, + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Evolution + with Defaults 
{ + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = modelSelection match { + case "regressor" => "rmse" + case "classifier" => "f1" + case _ => + throw new UnsupportedOperationException( + s"Model $modelSelection is not a supported modeling mode" + ) + } + + private var _gbtNumericBoundaries = _gbtDefaultNumBoundaries + + private var _gbtStringBoundaries = _gbtDefaultStringBoundaries + + private var _classificationMetrics = classificationMetrics + + def setScoringMetric(value: String): this.type = { + modelSelection match { + case "regressor" => + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + case "classifier" => + require( + classificationMetrics.contains(value), + s"Classifier scoring metric '$value' is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + this._scoringMetric = value + this + } + + def setGBTNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + _gbtNumericBoundaries = value + this + } + + def setGBTStringBoundaries(value: Map[String, List[String]]): this.type = { + _gbtStringBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getGBTNumericBoundaries: Map[String, (Double, Double)] = + _gbtNumericBoundaries + + def getGBTStringBoundaries: Map[String, List[String]] = _gbtStringBoundaries + + def getClassificationMetrics: List[String] = classificationMetrics + + def getRegressionMetrics: List[String] = regressionMetrics + + /** + * Private method for updating the maxBins setting for the tree algorithm to ensure that cardinality validation + * occurs for each nominal field in the feature vector to ensure that entropy / information gain / gini calculations + * can be conducted correctly. 
+ * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def resetNumericBoundaries: this.type = { + + _gbtNumericBoundaries = ModelUtils.resetTreeBinsSearchSpace( + df, + _gbtNumericBoundaries, + _fieldsToIgnore, + _labelCol, + _featureCol + ) + this + + } + + private def resetClassificationMetrics: List[String] = modelSelection match { + case "classifier" => + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + case _ => classificationMetrics + } + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def modelDecider[A, B](modelConfig: GBTConfig) = { + + val builtModel = modelSelection match { + case "classifier" => + new GBTClassifier() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setCheckpointInterval(-1) + .setImpurity(modelConfig.impurity) + .setLossType(modelConfig.lossType) + .setMaxBins(modelConfig.maxBins) + .setMaxDepth(modelConfig.maxDepth) + .setMaxIter(modelConfig.maxIter) + .setMinInfoGain(modelConfig.minInfoGain) + .setMinInstancesPerNode(modelConfig.minInstancesPerNode) + .setStepSize(modelConfig.stepSize) + case "regressor" => + new GBTRegressor() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setCheckpointInterval(-1) + .setImpurity(modelConfig.impurity) + .setLossType(modelConfig.lossType) + .setMaxBins(modelConfig.maxBins) + .setMaxDepth(modelConfig.maxDepth) + .setMaxIter(modelConfig.maxIter) + .setMinInfoGain(modelConfig.minInfoGain) + .setMinInstancesPerNode(modelConfig.minInstancesPerNode) + .setStepSize(modelConfig.stepSize) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + builtModel + } + + override def generateRandomString( + param: String, + boundaryMap: Map[String, List[String]] + ): String = { + + val stringListing = param match { + case "impurity" => + modelSelection match { + case "regressor" => List("variance") + case 
_ => boundaryMap(param) + } + case "lossType" => + modelSelection match { + case "regressor" => List("squared", "absolute") + case _ => boundaryMap(param) + } + case _ => boundaryMap(param) + } + _randomizer.shuffle(stringListing).head + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[GBTModelsWithResults] + ): (GBTConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[GBTModelsWithResults] + ): Array[GBTModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[GBTModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[GBTConfig] = { + + val iterations = new ArrayBuffer[GBTConfig] + + var i = 0 + do { + val impurity = generateRandomString("impurity", _gbtStringBoundaries) + val lossType = generateRandomString("lossType", _gbtStringBoundaries) + val maxBins = generateRandomInteger("maxBins", _gbtNumericBoundaries) + val maxDepth = 
generateRandomInteger("maxDepth", _gbtNumericBoundaries) + val maxIter = generateRandomInteger("maxIter", _gbtNumericBoundaries) + val minInfoGain = + generateRandomDouble("minInfoGain", _gbtNumericBoundaries) + val minInstancesPerNode = + generateRandomInteger("minInstancesPerNode", _gbtNumericBoundaries) + val stepSize = generateRandomDouble("stepSize", _gbtNumericBoundaries) + iterations += GBTConfig( + impurity, + lossType, + maxBins, + maxDepth, + maxIter, + minInfoGain, + minInstancesPerNode, + stepSize + ) + i += 1 + } while (i < iterationCount) + + iterations.toArray + } + + private def generateAndScoreGBTModel( + train: DataFrame, + test: DataFrame, + modelConfig: GBTConfig, + generation: Int = 1 + ): GBTModelsWithResults = { + + val gbtModel = modelDecider(modelConfig) + + val builtModel = gbtModel.fit(train) + + val predictedData = builtModel.transform(test) + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + modelSelection match { + case "classifier" => + for (i <- _classificationMetrics) { + scoringMap(i) = + classificationScoring(i, _labelCol, optimizedPredictions) + } + case "regressor" => + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, optimizedPredictions) + } + } + + val gbtModelsWithResults = GBTModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + gbtModelsWithResults + } + + private def runBattery(battery: Array[GBTConfig], + generation: Int = 1): Array[GBTModelsWithResults] = { + + val metrics = modelSelection match { + case "classifier" => _classificationMetrics + case _ => regressionMetrics + } + + val statusObj = new ModelReporting("gbt", metrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[GBTModelsWithResults] + 
@volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreGBTModel(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + modelSelection match { + case "classifier" => + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case "regressor" => + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case _ => + throw new UnsupportedOperationException( + s"$modelSelection is not a supported model type." 
+ ) + } + + val runAvg = GBTModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + sortAndReturnAll(results) + } + + private def irradiateGeneration( + parents: Array[GBTConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[GBTConfig] = { + + val mutationPayload = new ArrayBuffer[GBTConfig] + val totalConfigs = modelConfigLength[GBTConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + mutationPayload += GBTConfig( + if (mutationIndexIteration.contains(0)) + geneMixing(randomParent.impurity, mutationIteration.impurity) + else randomParent.impurity, + if (mutationIndexIteration.contains(1)) + geneMixing(randomParent.lossType, mutationIteration.lossType) + else randomParent.lossType, + if (mutationIndexIteration.contains(2)) + geneMixing( + randomParent.maxBins, + mutationIteration.maxBins, + mutationMagnitude + ) + else randomParent.maxBins, + if (mutationIndexIteration.contains(3)) + geneMixing( + randomParent.maxDepth, + mutationIteration.maxDepth, + mutationMagnitude + ) + else randomParent.maxDepth, + if (mutationIndexIteration.contains(4)) + geneMixing( + randomParent.maxIter, + 
mutationIteration.maxIter, + mutationMagnitude + ) + else randomParent.maxIter, + if (mutationIndexIteration.contains(5)) + geneMixing( + randomParent.minInfoGain, + mutationIteration.minInfoGain, + mutationMagnitude + ) + else randomParent.minInfoGain, + if (mutationIndexIteration.contains(6)) + geneMixing( + randomParent.minInstancesPerNode, + mutationIteration.minInstancesPerNode, + mutationMagnitude + ) + else randomParent.minInstancesPerNode, + if (mutationIndexIteration.contains(7)) + geneMixing( + randomParent.stepSize, + mutationIteration.stepSize, + mutationMagnitude + ) + else randomParent.stepSize + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution(): Array[GBTModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + if (!isPipeline) resetNumericBoundaries + + if (modelSelection == "classifier") + ModelUtils.validateGBTClassifier(df, _labelCol) + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[GBTModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + // Generate the first pool of attempts to seed the hyperparameter space + // var runSet = ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + + val totalConfigs = modelConfigLength[GBTConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[GBTConfig] + val startingModelSeed = generateGBTConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + 
Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("GBT") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedGBT(_gbtNumericBoundaries, _gbtStringBoundaries) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. + val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. 
+ if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[GBTModelsWithResults] + ): Array[GBTConfig] = { + val bestParents = new ArrayBuffer[GBTConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[GBTModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + if (!isPipeline) resetNumericBoundaries + if (modelSelection == "classifier") + ModelUtils.validateGBTClassifier(df, _labelCol) + + logger.log(Level.DEBUG, 
debugSettings) + + var generation = 1 + // Record of all generations results + val fossilRecord = new ArrayBuffer[GBTModelsWithResults] + + val totalConfigs = modelConfigLength[GBTConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[GBTConfig] + val startingModelSeed = generateGBTConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("GBT") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedGBT(_gbtNumericBoundaries, _gbtStringBoundaries) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .gbtCandidates( + "GBT", + _geneticMBORegressorType, + fossilRecord, 
+ expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .gbtCandidates( + "GBT", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): GBTModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[GBTModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF(): (Array[GBTModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper Method 
for a post-run model optimization based on theoretical hyperparam multidimensional grid search space + * After a genetic tuning run is complete, this allows for a model to be trained and run to predict a potential + * best-condition of hyper parameter configurations. + * + * @param paramsToTest Array of GBT Configuration (hyper parameter settings) from the post-run model + * inference + * @return The results of the hyper parameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[GBTConfig] + ): (Array[GBTModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/LightGBMTuner.scala b/src/main/scala/com/databricks/labs/automl/model/LightGBMTuner.scala new file mode 100644 index 00000000..b799f4a7 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/LightGBMTuner.scala @@ -0,0 +1,953 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools._ +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.params.{ + Defaults, + LightGBMConfig, + LightGBMModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import com.microsoft.ml.spark.lightgbm.{LightGBMClassifier, LightGBMRegressor} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.{DataFrame, Row} + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class LightGBMTuner(df: DataFrame, + data: Array[TrainSplitReferences], + modelSelection: String, + lightGBMType: String, 
+ isPipeline: Boolean = false) + extends LightGBMBase + with SparkSessionWrapper + with Defaults + with Serializable + with Evolution { + + import GBMTypes._ + import InitialGenerationMode._ + @transient private lazy val logger: Logger = Logger.getLogger(this.getClass) + + @transient private lazy val _gbmType = + getGBMType(modelSelection, lightGBMType) + + @transient private lazy val _initialGenMode = getInitialGenMode( + _initialGenerationMode + ) + + @transient final lazy val _uniqueLabels: Int = modelSelection match { + case "regressor" => 0 + case "classifier" => df.select(col(_labelCol)).distinct.count.toInt + } + + // mutable variable instantiation + + private var _scoringMetric = _gbmType.modelType match { + case "regressor" => "rmse" + case "classifier" => "f1" + case _ => + throw new UnsupportedOperationException( + s"Model $modelSelection is not supported." + ) + } + private var _classificationMetrics = classificationMetrics + private var _lightgbmNumericBoundaries = _lightGBMDefaultNumBoundaries + private var _lightgbmStringBoundaries = _lightGBMDefaultStringBoundaries + + // Setters + + def setScoringMetric(value: String): this.type = { + _gbmType.modelType match { + case "regressor" => + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + case "classifier" => + require( + classificationMetrics.contains(value), + s"Classifier scoring metric '$value' is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + _scoringMetric = value + this + } + + /** + * Setter for overriding the numeric boundary mappings + * Allows for partial replacement of mappings (any not defined will use defaults) + * + * @param value a numeric mapping override to the defaults + * @since 0.6.1 + * @author Ben Wilson, Databricks + * @throws 
IllegalArgumentException + */ + @throws(classOf[IllegalArgumentException]) + def setLGBMNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + validateNumericMapping(_lightGBMDefaultNumBoundaries, value) + _lightgbmNumericBoundaries = + partialOverrideNumericMapping(_lightGBMDefaultNumBoundaries, value) + this + } + + /** + * Setter for partial overrides of string mappings + * + * @param value a string mapping override to the default values + * @since 0.6.1 + * @author Ben Wilson, Databricks + * @throws IllegalArgumentException + */ + @throws(classOf[IllegalArgumentException]) + def setLGBMStringBoundaries(value: Map[String, List[String]]): this.type = { + validateStringMapping(_lightgbmStringBoundaries, value) + _lightgbmStringBoundaries = + partialOverrideStringMapping(_lightgbmStringBoundaries, value) + this + } + + // Getters + + def getScoringMetric: String = _scoringMetric + def getLightGBMNumericBoundaries: Map[String, (Double, Double)] = + _lightgbmNumericBoundaries + def getLightGBMStringBoundaries: Map[String, List[String]] = + _lightgbmStringBoundaries + def getClassificationMetrics: List[String] = _classificationMetrics + def getRegressionMetrics: List[String] = regressionMetrics + + // Internal methods + + /** + * Private internal method for resetting the metrics to employ for the scoring of each kfold during tuning and + * evaluating model performance (primarily to select the correct type of evaluation for binary / multiclass + * classification tasks) + * + * @return + */ + private def resetClassificationMetrics: List[String] = + _gbmType.modelType match { + case "classifier" => + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + case _ => classificationMetrics + } + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[LightGBMModelsWithResults] + ): 
(LightGBMConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[LightGBMModelsWithResults] + ): Array[LightGBMModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[LightGBMModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def recommendModeClassifier: String = { + _uniqueLabels match { + case x if x < 2 => + s"None. The label count [${_uniqueLabels}] is invalid for prediction." + case x if x > 2 => s"Either gbmMulti or gbmMultiOVA" + case x if x == 2 => s"gbmBinary" + } + } + + private def validateGBMClassificationSetting(): Unit = { + + _gbmType match { + case GBMBinary => + if (_uniqueLabels != 2) + throw new UnsupportedOperationException( + s"LightGBM Model type was selected as [$lightGBMType] but the unique counts of the label column: " + + s"[${_uniqueLabels}] is not supported by Binary Classification. 
The recommended gbmModel to use is: " + + s"$recommendModeClassifier" + ) + case GBMMulti | GBMMultiOVA => + if (_uniqueLabels <= 2) + throw new UnsupportedOperationException( + s"LightGBM Model type was selected as [$lightGBMType] but the unique counts of the label column: " + + s"[${_uniqueLabels}] is not supported by Multi-class Classification. The recommended gbmModel to use is: " + + s"$recommendModeClassifier" + ) + case _ => () + } + + } + + /** + * Private method for returning the top n parents' hyper parameters + * + * @param results Scored model hyper parameters results collection + * @return the top n hyper parameters thus far + * @since 0.6.1 + * @author Ben Wilson, Databricks + */ + private def generateIdealParents( + results: Array[LightGBMModelsWithResults] + ): Array[LightGBMConfig] = { + results.take(_numberOfParentsToRetain).map(_.modelHyperParams) + } + + private def generateRandomThresholdedParams( + iterationCount: Int + ): Array[LightGBMConfig] = { + + val iterations = new ArrayBuffer[LightGBMConfig] + + var i = 0 + do { + + val baggingFraction: Double = + generateRandomDouble("baggingFraction", _lightgbmNumericBoundaries) + val baggingFreq: Int = + generateRandomInteger("baggingFreq", _lightgbmNumericBoundaries) + val featureFraction: Double = + generateRandomDouble("featureFraction", _lightgbmNumericBoundaries) + val learningRate: Double = + generateRandomDouble("learningRate", _lightgbmNumericBoundaries) + val maxBin: Int = + generateRandomInteger("maxBin", _lightgbmNumericBoundaries) + val maxDepth: Int = + generateRandomInteger("maxDepth", _lightgbmNumericBoundaries) + val minSumHessianInLeaf: Double = + generateRandomDouble("minSumHessianInLeaf", _lightgbmNumericBoundaries) + val numIterations: Int = + generateRandomInteger("numIterations", _lightgbmNumericBoundaries) + val numLeaves: Int = + generateRandomInteger("numLeaves", _lightgbmNumericBoundaries) + val boostFromAverage: Boolean = coinFlip() + val lambdaL1: Double = + 
generateRandomDouble("lambdaL1", _lightgbmNumericBoundaries) + val lambdaL2: Double = + generateRandomDouble("lambdaL2", _lightgbmNumericBoundaries) + val alpha: Double = + generateRandomDouble("alpha", _lightgbmNumericBoundaries) + val boostingType: String = + generateRandomString("boostingType", _lightgbmStringBoundaries) + + iterations += LightGBMConfig( + baggingFraction = baggingFraction, + baggingFreq = baggingFreq, + featureFraction = featureFraction, + learningRate = learningRate, + maxBin = maxBin, + maxDepth = maxDepth, + minSumHessianInLeaf = minSumHessianInLeaf, + numIterations = numIterations, + numLeaves = numLeaves, + boostFromAverage = boostFromAverage, + lambdaL1 = lambdaL1, + lambdaL2 = lambdaL2, + alpha = alpha, + boostingType = boostingType + ) + + i += 1 + } while (i < iterationCount) + + iterations.toArray + } + + def generateRegressorModel( + modelConfig: LightGBMConfig, + gbmModelType: GBMTypes.Value + ): LightGBMRegressor = { + + val base = new LightGBMRegressor() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setBaggingFraction(modelConfig.baggingFraction) + .setBaggingFreq(modelConfig.baggingFreq) + .setFeatureFraction(modelConfig.featureFraction) + .setLearningRate(modelConfig.learningRate) + .setMaxBin(modelConfig.maxBin) + .setMaxDepth(modelConfig.maxDepth) + .setMinSumHessianInLeaf(modelConfig.minSumHessianInLeaf) + .setNumIterations(modelConfig.numIterations) + .setNumLeaves(modelConfig.numLeaves) + .setBoostFromAverage(modelConfig.boostFromAverage) + .setLambdaL1(modelConfig.lambdaL1) + .setLambdaL2(modelConfig.lambdaL2) + .setAlpha(modelConfig.alpha) + .setBoostingType(modelConfig.boostingType) + .setTimeout(TIMEOUT) + .setUseBarrierExecutionMode(BARRIER_MODE) + + gbmModelType match { + case GBMFair => base.setObjective("fair") + case GBMLasso => base.setObjective("regression_l1") + case GBMRidge => base.setObjective("regression_l2") + case GBMPoisson => base.setObjective("poisson") + case GBMMape => 
base.setObjective("mape") + case GBMTweedie => base.setObjective("tweedie") + case GBMGamma => base.setObjective("gamma") + case GBMHuber => base.setObjective("huber") + case GBMQuantile => base.setObjective("quantile") + } + + } + + def generateClassfierModel( + modelConfig: LightGBMConfig, + gbmModelType: GBMTypes.Value + ): LightGBMClassifier = { + + val base = new LightGBMClassifier() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setBaggingFreq(modelConfig.baggingFreq) + .setBaggingFraction(modelConfig.baggingFraction) + .setFeatureFraction(modelConfig.featureFraction) + .setLearningRate(modelConfig.learningRate) + .setMaxBin(modelConfig.maxBin) + .setMaxDepth(modelConfig.maxDepth) + .setMinSumHessianInLeaf(modelConfig.minSumHessianInLeaf) + .setNumIterations(modelConfig.numIterations) + .setNumLeaves(modelConfig.numLeaves) + .setBoostFromAverage(modelConfig.boostFromAverage) + .setLambdaL1(modelConfig.lambdaL1) + .setLambdaL2(modelConfig.lambdaL2) + .setBoostingType(modelConfig.boostingType) + .setTimeout(TIMEOUT) + .setUseBarrierExecutionMode(BARRIER_MODE) + + gbmModelType match { + case GBMBinary => base.setObjective("binary") + case GBMMulti => base.setObjective("multiclass") + case GBMMultiOVA => base.setObjective("multiclassova") + } + + } + + /** + * Method for performing the fit and transform with scoring for the LGBMmodel + * + * @param train Training data set + * @param test Test validation data set + * @param modelConfig configuration of hyper parameters to use + * @param generation the generation in which the model is executing within + * @return LightGBMModelsWithResults to store the information about the run. 
+ * @since 0.6.1 + * @author Ben Wilson, Databricks + */ + def generateAndScoreGBMModel( + train: DataFrame, + test: DataFrame, + modelConfig: LightGBMConfig, + generation: Int = 1 + ): LightGBMModelsWithResults = { + + val model = _gbmType.modelType match { + case "classifier" => generateClassfierModel(modelConfig, _gbmType) + case _ => generateRegressorModel(modelConfig, _gbmType) + } + + val builtModel = model.fit(train) + + val predictedData = builtModel.transform(test) + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + _gbmType.modelType match { + case "classifier" => + for (i <- _classificationMetrics) { + scoringMap(i) = + classificationScoring(i, _labelCol, optimizedPredictions) + } + case "regressor" => + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, optimizedPredictions) + } + } + + val lightGBMModelsWithResults = LightGBMModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + lightGBMModelsWithResults + } + + /** + * Private method for execution of a collection of hyper parameters to tune against. + * This method will instantiate models for each hyper parameter configuration, build the model, split the data, + * train k number of models, collect the evaluated scores from each of the models, and average out the results + * over the kFold grouping. + * + * @param battery Array of Configurations of the LightGBM model + * @param generation The generation that this battery execution is operating within + * @return Array[LightGBMModelsWithResults] that contains the results and configurations for each of the hyper + * parameter configurations that have been tested. 
+ * @since 0.6.1 + * @author Ben Wilson, Databricks + */ + private def runBattery( + battery: Array[LightGBMConfig], + generation: Int = 1 + ): Array[LightGBMModelsWithResults] = { + + val metrics = _gbmType.modelType match { + case "classifier" => _classificationMetrics + case _ => regressionMetrics + } + + val statusObj = new ModelReporting("lightgbm", metrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = ArrayBuffer[LightGBMModelsWithResults]() + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreGBMModel(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + _gbmType.modelType match { + case "classifier" => + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case "regressor" => + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case _ => + throw new UnsupportedOperationException( + s"$modelSelection is not a supported model type." 
+ ) + } + + val runAvg = LightGBMModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[LightGBMConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[LightGBMConfig] = { + + val mutationPayload = new ArrayBuffer[LightGBMConfig] + val totalConfigs = modelConfigLength[LightGBMConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateRandomThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + val mutationMerge = mutationCandidates.zip(mutationIndeces) + + mutationMerge.map { x => + val randomParent = scala.util.Random.shuffle(parents.toList).head + + LightGBMConfig( + if (x._2.contains(0)) + geneMixing( + randomParent.baggingFraction, + x._1.baggingFraction, + mutationMagnitude + ) + else randomParent.baggingFraction, + if (x._2.contains(1)) + geneMixing( + randomParent.baggingFreq, + x._1.baggingFreq, + mutationMagnitude + ) + else randomParent.baggingFreq, + if (x._2.contains(2)) + geneMixing( + randomParent.featureFraction, + x._1.featureFraction, + mutationMagnitude + ) + else randomParent.featureFraction, + if (x._2.contains(3)) + geneMixing( + randomParent.learningRate, + x._1.learningRate, + mutationMagnitude + ) + else randomParent.learningRate, + if (x._2.contains(4)) + geneMixing(randomParent.maxBin, x._1.maxBin, mutationMagnitude) + else randomParent.maxBin, + if 
(x._2.contains(5)) + geneMixing(randomParent.maxDepth, x._1.maxDepth, mutationMagnitude) + else randomParent.maxDepth, + if (x._2.contains(6)) + geneMixing( + randomParent.minSumHessianInLeaf, + x._1.minSumHessianInLeaf, + mutationMagnitude + ) + else randomParent.minSumHessianInLeaf, + if (x._2.contains(7)) + geneMixing( + randomParent.numIterations, + x._1.numIterations, + mutationMagnitude + ) + else randomParent.numIterations, + if (x._2.contains(8)) + geneMixing(randomParent.numLeaves, x._1.numLeaves, mutationMagnitude) + else randomParent.numLeaves, + if (x._2.contains(9)) + coinFlip( + randomParent.boostFromAverage, + x._1.boostFromAverage, + mutationMagnitude + ) + else randomParent.boostFromAverage, + if (x._2.contains(10)) + geneMixing(randomParent.lambdaL1, x._1.lambdaL1, mutationMagnitude) + else randomParent.lambdaL1, + if (x._2.contains(11)) + geneMixing(randomParent.lambdaL2, x._1.lambdaL2, mutationMagnitude) + else randomParent.lambdaL2, + if (x._2.contains(12)) + geneMixing(randomParent.alpha, x._1.alpha, mutationMagnitude) + else randomParent.alpha, + if (x._2.contains(13)) + geneMixing(randomParent.boostingType, x._1.boostingType) + else randomParent.boostingType + ) + + } + + } + + private def continuousEvolution(): Array[LightGBMModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + validateGBMClassificationSetting() + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[LightGBMModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = 
modelConfigLength[LightGBMConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[LightGBMConfig] + val startingModelSeed = generateLightGBMConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet( + generateRandomThresholdedParams(_firstGenerationGenePool): _* + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily(_gbmType.gbmType) + .setModelType(_gbmType.modelType) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedLightGBM( + _lightgbmNumericBoundaries, + _lightgbmStringBoundaries + ) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. 
+ val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. + if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + /** + * Method for batch hyperparameter generational tuning. 
+ * @return Tuning results + * @since 0.6.1 + * @author Ben Wilson, Databricks + */ + private def evolveParameters(): Array[LightGBMModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + validateGBMClassificationSetting() + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + + val fossilRecord = ArrayBuffer[LightGBMModelsWithResults]() + + val totalConfigs = modelConfigLength[LightGBMConfig] + + val primordial = _initialGenMode match { + case RANDOM => + if (_modelSeedSet) { + + val startingModelSeed = generateLightGBMConfig(_modelSeed) + + runBattery( + irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) ++ Array(startingModelSeed), + generation + ) + + } else + runBattery( + generateRandomThresholdedParams(_firstGenerationGenePool), + generation + ) + } + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + while (currentIteration <= _numberOfMutationGenerations && evaluateStoppingScore( + currentBestResult, + _earlyStoppingScore + )) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = + GenerationOptimizer.lightGBMCandidates( + "LightGBM", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + fossilRecord ++= runBattery(evolution, generation) + generation += 1 + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if 
(evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + + } else { + + (1 to _numberOfMutationGenerations).map(i => { + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .lightGBMCandidates( + "LightGBM", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + + } + + def evolveBest(): LightGBMModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[LightGBMModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + spark.sparkContext + .parallelize(results.map(x => (x.generation, x.score)).toList) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + + } + + def evolveWithScoringDF(): (Array[LightGBMModelsWithResults], DataFrame) = { + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + def postRunModeledHyperParams( + paramsToTest: Array[LightGBMConfig] + ): (Array[LightGBMModelsWithResults], DataFrame) = { + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} + +// ensure that LightGBM package is installed: 
com.microsoft.ml.spark:mmlspark_2.11:0.18.1 diff --git a/src/main/scala/com/databricks/labs/automl/model/LinearRegressionTuner.scala b/src/main/scala/com/databricks/labs/automl/model/LinearRegressionTuner.scala new file mode 100644 index 00000000..488f425c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/LinearRegressionTuner.scala @@ -0,0 +1,706 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting +} +import com.databricks.labs.automl.params.{ + Defaults, + LinearRegressionConfig, + LinearRegressionModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.ml.regression.LinearRegression +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions.col + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class LinearRegressionTuner(df: DataFrame, + data: Array[TrainSplitReferences], + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Defaults + with Evolution { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = _scoringDefaultRegressor + private var _linearRegressionNumericBoundaries = + _linearRegressionDefaultNumBoundaries + private var _linearRegressionStringBoundaries = + _linearRegressionDefaultStringBoundaries + + def setScoringMetric(value: String): this.type = { + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + this._scoringMetric = value + this + } + + def 
setLinearRegressionNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + this._linearRegressionNumericBoundaries = value + this + } + + def setLinearRegressionStringBoundaries( + value: Map[String, List[String]] + ): this.type = { + this._linearRegressionStringBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getLinearRegressionNumericBoundaries: Map[String, (Double, Double)] = + _linearRegressionNumericBoundaries + + def getLinearRegressionStringBoundaries: Map[String, List[String]] = + _linearRegressionStringBoundaries + + def getRegressionMetrics: List[String] = regressionMetrics + + private def configureModel( + modelConfig: LinearRegressionConfig + ): LinearRegression = { + new LinearRegression() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setElasticNetParam(modelConfig.elasticNetParams) + .setFitIntercept(modelConfig.fitIntercept) + .setLoss(modelConfig.loss) + .setMaxIter(modelConfig.maxIter) + .setRegParam(modelConfig.regParam) + .setSolver("auto") + .setStandardization(modelConfig.standardization) + .setTol(modelConfig.tolerance) + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[LinearRegressionModelsWithResults] + ): (LinearRegressionConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true 
else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[LinearRegressionModelsWithResults] + ): Array[LinearRegressionModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[LinearRegressionModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[LinearRegressionConfig] = { + + val iterations = new ArrayBuffer[LinearRegressionConfig] + + var i = 0 + do { + // get the loss metric first + val loss = generateRandomString("loss", _linearRegressionStringBoundaries) + + // modify the allowable results for huber loss since Huber solver can only support L2 regularization. + + val elasticNetParams = loss match { + case "huber" => 0.0 + case _ => + generateRandomDouble( + "elasticNetParams", + _linearRegressionNumericBoundaries + ) + } + val fitIntercept = coinFlip() + val maxIter = + generateRandomInteger("maxIter", _linearRegressionNumericBoundaries) + val regParam = + generateRandomDouble("regParam", _linearRegressionNumericBoundaries) + val standardization = coinFlip() + val tolerance = + generateRandomDouble("tolerance", _linearRegressionNumericBoundaries) + iterations += LinearRegressionConfig( + elasticNetParams, + fitIntercept, + loss, + maxIter, + regParam, + standardization, + tolerance + ) + i += 1 + } while (i < iterationCount) + iterations.toArray + } + + private def generateAndScoreLinearRegression( + train: DataFrame, + test: DataFrame, + modelConfig: LinearRegressionConfig, + generation: Int = 1 + ): LinearRegressionModelsWithResults = { + + val regressionModel = configureModel(modelConfig) + + val builtModel = regressionModel.fit(train) + + val predictedData = builtModel.transform(test) + 
val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, optimizedPredictions) + } + val lrModelsWithResults = LinearRegressionModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + lrModelsWithResults + } + + private def runBattery( + battery: Array[LinearRegressionConfig], + generation: Int = 1 + ): Array[LinearRegressionModelsWithResults] = { + + val statusObj = new ModelReporting("linearRegression", regressionMetrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[LinearRegressionModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreLinearRegression(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + + val runAvg = LinearRegressionModelsWithResults( + x, + 
kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[LinearRegressionConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[LinearRegressionConfig] = { + + val mutationPayload = new ArrayBuffer[LinearRegressionConfig] + val totalConfigs = modelConfigLength[LinearRegressionConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + val lossSelect = + if (mutationIndexIteration.contains(2)) + geneMixing(randomParent.loss, mutationIteration.loss) + else randomParent.loss + + val elasticNetParamSelect = lossSelect match { + case "huber" => 0.0 + case _ => + if (mutationIndexIteration.contains(0)) + geneMixing( + randomParent.elasticNetParams, + mutationIteration.elasticNetParams, + mutationMagnitude + ) + else randomParent.elasticNetParams + } + + mutationPayload += LinearRegressionConfig( + elasticNetParamSelect, + if (mutationIndexIteration.contains(1)) + coinFlip( + randomParent.fitIntercept, + mutationIteration.fitIntercept, + mutationMagnitude + ) + else randomParent.fitIntercept, + lossSelect, + if (mutationIndexIteration.contains(3)) + geneMixing( 
+ randomParent.maxIter, + mutationIteration.maxIter, + mutationMagnitude + ) + else randomParent.maxIter, + if (mutationIndexIteration.contains(4)) + geneMixing( + randomParent.regParam, + mutationIteration.regParam, + mutationMagnitude + ) + else randomParent.regParam, + if (mutationIndexIteration.contains(5)) + coinFlip( + randomParent.standardization, + mutationIteration.standardization, + mutationMagnitude + ) + else randomParent.standardization, + if (mutationIndexIteration.contains(6)) + geneMixing( + randomParent.tolerance, + mutationIteration.tolerance, + mutationMagnitude + ) + else randomParent.tolerance + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution() + : Array[LinearRegressionModelsWithResults] = { + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[LinearRegressionModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[LinearRegressionConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[LinearRegressionConfig] + val startingModelSeed = generateLinearRegressionConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + 
+ .setModelFamily("LinearRegression") + .setModelType("regressor") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedLinearRegression( + _linearRegressionNumericBoundaries, + _linearRegressionStringBoundaries + ) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. + val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. 
+ if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[LinearRegressionModelsWithResults] + ): Array[LinearRegressionConfig] = { + val bestParents = new ArrayBuffer[LinearRegressionConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[LinearRegressionModelsWithResults] = { + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + // Record of all generations results + val fossilRecord = new 
ArrayBuffer[LinearRegressionModelsWithResults] + + val totalConfigs = modelConfigLength[LinearRegressionConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[LinearRegressionConfig] + val startingModelSeed = generateLinearRegressionConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("LinearRegression") + .setModelType("regressor") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedLinearRegression( + _linearRegressionNumericBoundaries, + _linearRegressionStringBoundaries + ) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .linearRegressionCandidates( + "LinearRegression", + 
_geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .linearRegressionCandidates( + "LinearRegression", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): LinearRegressionModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[LinearRegressionModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF() + : (Array[LinearRegressionModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => 
continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper Method for a post-run model optimization based on theoretical hyperparam multidimensional grid search space + * After a genetic tuning run is complete, this allows for a model to be trained and run to predict a potential + * best-condition of hyper parameter configurations. + * + * @param paramsToTest Array of Linear Regression Configuration (hyper parameter settings) from the post-run model + * inference + * @return The results of the hyper parameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[LinearRegressionConfig] + ): (Array[LinearRegressionModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/LogisticRegressionTuner.scala b/src/main/scala/com/databricks/labs/automl/model/LogisticRegressionTuner.scala new file mode 100644 index 00000000..c176195a --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/LogisticRegressionTuner.scala @@ -0,0 +1,689 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting +} +import com.databricks.labs.automl.params.{ + Defaults, + LogisticRegressionConfig, + LogisticRegressionModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions.col + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import 
scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class LogisticRegressionTuner(df: DataFrame, + data: Array[TrainSplitReferences], + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Defaults + with Evolution { + + validateInputDataframe(df) + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = _scoringDefaultClassifier + + private var _logisticRegressionNumericBoundaries = + _logisticRegressionDefaultNumBoundaries + + private var _classificationMetrics = classificationMetrics + + def setScoringMetric(value: String): this.type = { + require( + classificationMetrics.contains(value), + s"Classification scoring metric $value is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + this._scoringMetric = value + this + } + + def setLogisticRegressionNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + this._logisticRegressionNumericBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getLogisticRegressionNumericBoundaries: Map[String, (Double, Double)] = + _logisticRegressionNumericBoundaries + + def getClassificationMetrics: List[String] = classificationMetrics + + private def resetClassificationMetrics: List[String] = + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def configureModel( + modelConfig: LogisticRegressionConfig + ): LogisticRegression = { + new LogisticRegression() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setElasticNetParam(modelConfig.elasticNetParams) + .setFamily("auto") + .setFitIntercept(modelConfig.fitIntercept) + .setMaxIter(modelConfig.maxIter) + .setRegParam(modelConfig.regParam) + 
.setStandardization(modelConfig.standardization) + .setTol(modelConfig.tolerance) + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[LogisticRegressionModelsWithResults] + ): (LogisticRegressionConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[LogisticRegressionModelsWithResults] + ): Array[LogisticRegressionModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[LogisticRegressionModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[LogisticRegressionConfig] = { + + val iterations = new ArrayBuffer[LogisticRegressionConfig] + + var i = 0 + do { + val elasticNetParams = generateRandomDouble( + "elasticNetParams", + _logisticRegressionNumericBoundaries + ) + val fitIntercept = coinFlip() + val maxIter = + generateRandomInteger("maxIter", _logisticRegressionNumericBoundaries) + val regParam = + generateRandomDouble("regParam", 
_logisticRegressionNumericBoundaries) + val standardization = coinFlip() + val tolerance = + generateRandomDouble("tolerance", _logisticRegressionNumericBoundaries) + iterations += LogisticRegressionConfig( + elasticNetParams, + fitIntercept, + maxIter, + regParam, + standardization, + tolerance + ) + i += 1 + } while (i < iterationCount) + iterations.toArray + } + + private def generateAndScoreLogisticRegression( + train: DataFrame, + test: DataFrame, + modelConfig: LogisticRegressionConfig, + generation: Int = 1 + ): LogisticRegressionModelsWithResults = { + val regressionModel = configureModel(modelConfig) + + val builtModel = regressionModel.fit(train) + + val predictedData = builtModel.transform(test) + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (i <- _classificationMetrics) { + scoringMap(i) = classificationScoring(i, _labelCol, optimizedPredictions) + } + val logRModelsWithResults = LogisticRegressionModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + logRModelsWithResults + } + + private def runBattery( + battery: Array[LogisticRegressionConfig], + generation: Int = 1 + ): Array[LogisticRegressionModelsWithResults] = { + + val statusObj = + new ModelReporting("logisticRegression", _classificationMetrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[LogisticRegressionModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, 
modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreLogisticRegression(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + + val runAvg = LogisticRegressionModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[LogisticRegressionConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[LogisticRegressionConfig] = { + + val mutationPayload = new ArrayBuffer[LogisticRegressionConfig] + val totalConfigs = modelConfigLength[LogisticRegressionConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = 
mutationIndeces(i) + + mutationPayload += LogisticRegressionConfig( + if (mutationIndexIteration.contains(0)) + geneMixing( + randomParent.elasticNetParams, + mutationIteration.elasticNetParams, + mutationMagnitude + ) + else randomParent.elasticNetParams, + if (mutationIndexIteration.contains(1)) + coinFlip( + randomParent.fitIntercept, + mutationIteration.fitIntercept, + mutationMagnitude + ) + else randomParent.fitIntercept, + if (mutationIndexIteration.contains(2)) + geneMixing( + randomParent.maxIter, + mutationIteration.maxIter, + mutationMagnitude + ) + else randomParent.maxIter, + if (mutationIndexIteration.contains(3)) + geneMixing( + randomParent.regParam, + mutationIteration.regParam, + mutationMagnitude + ) + else randomParent.regParam, + if (mutationIndexIteration.contains(4)) + coinFlip( + randomParent.standardization, + mutationIteration.standardization, + mutationMagnitude + ) + else randomParent.standardization, + if (mutationIndexIteration.contains(5)) + geneMixing( + randomParent.tolerance, + mutationIteration.tolerance, + mutationMagnitude + ) + else randomParent.tolerance + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution() + : Array[LogisticRegressionModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[LogisticRegressionModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[LogisticRegressionConfig] + + var runSet = 
_initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[LogisticRegressionConfig] + val startingModelSeed = generateLogisticRegressionConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("LogisticRegression") + .setModelType("classifier") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedLogisticRegression( + _logisticRegressionNumericBoundaries + ) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. 
+ val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. + if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[LogisticRegressionModelsWithResults] + ): Array[LogisticRegressionConfig] = { + val bestParents = new ArrayBuffer[LogisticRegressionConfig] + results + 
.take(_numberOfParentsToRetain) + .map(x => { + bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[LogisticRegressionModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + // Record of all generations results + val fossilRecord = new ArrayBuffer[LogisticRegressionModelsWithResults] + + val totalConfigs = modelConfigLength[LogisticRegressionConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[LogisticRegressionConfig] + val startingModelSeed = generateLogisticRegressionConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("LogisticRegression") + .setModelType("classifier") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedLogisticRegression( + _logisticRegressionNumericBoundaries + ) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted 
state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .logisticRegressionCandidates( + "LogisticRegression", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .logisticRegressionCandidates( + "LogisticRegression", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): LogisticRegressionModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[LogisticRegressionModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = 
scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF() + : (Array[LogisticRegressionModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper method for post-run model optimization based on a theoretical multidimensional hyperparameter grid search space. + * After a genetic tuning run is complete, this allows a model to be trained and run to predict a potential + * best-case hyperparameter configuration. + * + * @param paramsToTest Array of LogisticRegression configurations (hyperparameter settings) from the post-run model + * inference + * @return The results of the hyperparameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[LogisticRegressionConfig] + ): (Array[LogisticRegressionModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/MLPCTuner.scala b/src/main/scala/com/databricks/labs/automl/model/MLPCTuner.scala new file mode 100644 index 00000000..32eb1f2f --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/MLPCTuner.scala @@ -0,0 +1,702 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting +} +import com.databricks.labs.automl.params.{ + Defaults, + MLPCConfig, + MLPCModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import 
org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.ml.classification.MultilayerPerceptronClassifier +import org.apache.spark.ml.linalg.DenseVector +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions.col + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class MLPCTuner(df: DataFrame, + data: Array[TrainSplitReferences], + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Evolution + with Defaults { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = _scoringDefaultClassifier + private var _mlpcNumericBoundaries = _mlpcDefaultNumBoundaries + private var _mlpcStringBoundaries = _mlpcDefaultStringBoundaries + private var _featureInputSize: Int = 0 + private var _classDistinctCount: Int = 0 + private var _classificationMetrics = classificationMetrics + + private def calcFeatureInputSize: this.type = { + _featureInputSize = + df.select(_featureCol).head()(0).asInstanceOf[DenseVector].size + this + } + + private def calcClassDistinctCount: this.type = { + _classDistinctCount = df.select(_labelCol).distinct().count().toInt + this + } + + def setScoringMetric(value: String): this.type = { + require( + classificationMetrics.contains(value), + s"Classification scoring metric $value is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + _scoringMetric = value + this + } + + def setMlpcNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + _mlpcNumericBoundaries = value + this + } + + def setMlpcStringBoundaries(value: Map[String, List[String]]): this.type = { + _mlpcStringBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getMlpcNumericBoundaries: Map[String, (Double, Double)] 
= + _mlpcNumericBoundaries + + def getMlpcStringBoundaries: Map[String, List[String]] = _mlpcStringBoundaries + + def getClassificationMetrics: List[String] = classificationMetrics + + def getFeatureInputSize: Int = _featureInputSize + + def getClassDistinctCount: Int = _classDistinctCount + + private def resetClassificationMetrics: List[String] = + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def configureModel( + modelConfig: MLPCConfig + ): MultilayerPerceptronClassifier = { + new MultilayerPerceptronClassifier() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setLayers(modelConfig.layers) + .setMaxIter(modelConfig.maxIter) + .setSolver(modelConfig.solver) + .setStepSize(modelConfig.stepSize) + .setTol(modelConfig.tolerance) + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[MLPCModelsWithResults] + ): (MLPCConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[MLPCModelsWithResults] + ): Array[MLPCModelsWithResults] = { + _optimizationStrategy match { + case 
"minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[MLPCModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[MLPCConfig] = { + + val iterations = new ArrayBuffer[MLPCConfig] + + var i = 0 + do { + val layers = generateLayerArray( + "layers", + "hiddenLayerSizeAdjust", + _mlpcNumericBoundaries, + _featureInputSize, + _classDistinctCount + 1 + ) + val maxIter = generateRandomInteger("maxIter", _mlpcNumericBoundaries) + val solver = generateRandomString("solver", _mlpcStringBoundaries) + val stepSize = generateRandomDouble("stepSize", _mlpcNumericBoundaries) + val tolerance = generateRandomDouble("tolerance", _mlpcNumericBoundaries) + iterations += MLPCConfig(layers, maxIter, solver, stepSize, tolerance) + i += 1 + } while (i < iterationCount) + iterations.toArray + } + + private def generateAndScoreMLPCModel( + train: DataFrame, + test: DataFrame, + modelConfig: MLPCConfig, + generation: Int = 1 + ): MLPCModelsWithResults = { + + val mlpcModel = configureModel(modelConfig) + val builtModel = mlpcModel.fit(train) + val predictedData = builtModel.transform(test) + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (i <- _classificationMetrics) { + scoringMap(i) = classificationScoring(i, _labelCol, optimizedPredictions) + } + + val mlpcModelsWithResults = MLPCModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + mlpcModelsWithResults + } + + private def runBattery(battery: Array[MLPCConfig], + generation: Int = 1): Array[MLPCModelsWithResults] = { + + val statusObj = new 
ModelReporting("mlpc", _classificationMetrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[MLPCModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreMLPCModel(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + + val runAvg = MLPCModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[MLPCConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[MLPCConfig] = { + + val mutationPayload = new ArrayBuffer[MLPCConfig] + val totalConfigs = modelConfigLength[MLPCConfig] + val indexMutation 
= + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + mutationPayload += MLPCConfig( + if (mutationIndexIteration.contains(0)) + geneMixing( + randomParent.layers, + mutationIteration.layers, + mutationMagnitude + ) + else randomParent.layers, + if (mutationIndexIteration.contains(1)) + geneMixing( + randomParent.maxIter, + mutationIteration.maxIter, + mutationMagnitude + ) + else randomParent.maxIter, + if (mutationIndexIteration.contains(2)) + geneMixing(randomParent.solver, mutationIteration.solver) + else randomParent.solver, + if (mutationIndexIteration.contains(3)) + geneMixing( + randomParent.stepSize, + mutationIteration.stepSize, + mutationMagnitude + ) + else randomParent.stepSize, + if (mutationIndexIteration.contains(4)) + geneMixing( + randomParent.tolerance, + mutationIteration.tolerance, + mutationMagnitude + ) + else randomParent.tolerance + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution(): Array[MLPCModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + + logger.log(Level.DEBUG, debugSettings) + + // Set the parameter guides for layers / label counts (only set once) + calcFeatureInputSize + calcClassDistinctCount + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[MLPCModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 
+ var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[MLPCConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[MLPCConfig] + val startingModelSeed = generateMLPCConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("MLPC") + .setModelType("classifier") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedMLPC( + _mlpcNumericBoundaries, + _mlpcStringBoundaries, + _featureInputSize, + _classDistinctCount + ) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. 
+ val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. + if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[MLPCModelsWithResults] + ): Array[MLPCConfig] = { + val bestParents = new ArrayBuffer[MLPCConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + 
bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[MLPCModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + + logger.log(Level.DEBUG, debugSettings) + + // Set the parameter guides for layers / label counts (only set once) + this.calcFeatureInputSize + this.calcClassDistinctCount + + var generation = 1 + // Record of all generations results + val fossilRecord = new ArrayBuffer[MLPCModelsWithResults] + + val totalConfigs = modelConfigLength[MLPCConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[MLPCConfig] + val startingModelSeed = generateMLPCConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("MLPC") + .setModelType("classifier") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedMLPC( + _mlpcNumericBoundaries, + _mlpcStringBoundaries, + this.getFeatureInputSize, + this.getClassDistinctCount + ) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + 
generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .mlpcCandidates( + "MLPC", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration, + _featureInputSize, + _classDistinctCount + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .mlpcCandidates( + "MLPC", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration, + _featureInputSize, + _classDistinctCount + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): MLPCModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[MLPCModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] 
+ results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF(): (Array[MLPCModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper method for post-run model optimization based on a theoretical multidimensional hyperparameter grid search space. + * After a genetic tuning run is complete, this allows a model to be trained and run to predict a potential + * best-case hyperparameter configuration. + * + * @param paramsToTest Array of MLPC configurations (hyperparameter settings) from the post-run model + * inference + * @return The results of the hyperparameter test, as well as the scored DataFrame report. 
+ */ + def postRunModeledHyperParams( + paramsToTest: Array[MLPCConfig] + ): (Array[MLPCModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/NaiveBayesTuner.scala b/src/main/scala/com/databricks/labs/automl/model/NaiveBayesTuner.scala new file mode 100644 index 00000000..8016ab17 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/NaiveBayesTuner.scala @@ -0,0 +1,201 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.params.{ + Defaults, + NaiveBayesConfig, + NaiveBayesModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.Logger +import org.apache.spark.ml.classification.NaiveBayes +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer + +class NaiveBayesTuner(df: DataFrame) + extends SparkSessionWrapper + with Defaults + with Evolution { + + //TODO: finish this some time. 
+ + // Perform a check to validate the structure and conditions of the input DataFrame to ensure that it can be modeled + validateInputDataframe(df) + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = _scoringDefaultClassifier + + private var _naiveBayesNumericBoundaries = _naiveBayesDefaultNumBoundaries + + private var _naiveBayesStringBoundaries = _naiveBayesDefaultStringBoundaries + + private var _classificationMetrics = classificationMetrics + + private var _naiveBayesThresholds = calculateThresholds() + + def setScoringMetric(value: String): this.type = { + require( + classificationMetrics.contains(value), + s"Classification scoring metric $value is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + this._scoringMetric = value + this + } + + def setNaiveBayesNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + this._naiveBayesNumericBoundaries = value + this + } + + def setNaiveBayesStringBoundaries( + value: Map[String, List[String]] + ): this.type = { + this._naiveBayesStringBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getNaiveBayesNumericBoundaries: Map[String, (Double, Double)] = + _naiveBayesNumericBoundaries + + def getNaiveBayesStringBoundaries: Map[String, List[String]] = + _naiveBayesStringBoundaries + + def getClassificationMetrics: List[String] = classificationMetrics + + private def resetClassificationMetrics: List[String] = + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def calculateThresholds(): Array[Double] = { + + val uniqueLabels = df + .select(_labelCol) + .groupBy(col(_labelCol)) + // Alias (and cast) the aggregate column itself so that "counts" resolves and can be read as a Double + .agg(count("*").cast("double").alias("counts")) + .orderBy(col("counts").desc) + .collect() + + val values = uniqueLabels.map(x => 
x.getAs[Double]("counts")) + + val totals = values.sum + + values.map(x => x / totals) + + } + + private def configureModel(modelConfig: NaiveBayesConfig): NaiveBayes = { + + val nbModel = new NaiveBayes() + .setFeaturesCol(_featureCol) + .setLabelCol(_labelCol) + .setSmoothing(modelConfig.smoothing) + + if (modelConfig.thresholds) nbModel.setThresholds(_naiveBayesThresholds) + + nbModel + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[NaiveBayesModelsWithResults] + ): (NaiveBayesConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[NaiveBayesModelsWithResults] + ): Array[NaiveBayesModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[NaiveBayesModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[NaiveBayesConfig] = { + + val iterations = new ArrayBuffer[NaiveBayesConfig] + + var i = 0 + do { + val modelType = + generateRandomString("modelType", 
_naiveBayesStringBoundaries) + val smoothing = + generateRandomDouble("smoothing", _naiveBayesNumericBoundaries) + val thresholds = coinFlip() + iterations += NaiveBayesConfig(modelType, smoothing, thresholds) + i += 1 + } while (i < iterationCount) + iterations.toArray + } + + private def generateAndScoreNaiveBayes( + train: DataFrame, + test: DataFrame, + modelConfig: NaiveBayesConfig, + generation: Int = 1 + ): NaiveBayesModelsWithResults = { + val model = configureModel(modelConfig) + + val builtModel = model.fit(train) + + val predictedData = builtModel.transform(test) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (i <- _classificationMetrics) { + scoringMap(i) = classificationScoring(i, _labelCol, predictedData) + } + NaiveBayesModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/RandomForestTuner.scala b/src/main/scala/com/databricks/labs/automl/model/RandomForestTuner.scala new file mode 100644 index 00000000..87125dd3 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/RandomForestTuner.scala @@ -0,0 +1,813 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools._ +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.params.{ + Defaults, + RandomForestConfig, + RandomForestModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.ml.classification.RandomForestClassifier +import org.apache.spark.ml.regression.RandomForestRegressor +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import 
scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class RandomForestTuner(df: DataFrame, + data: Array[TrainSplitReferences], + modelSelection: String, + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Evolution + with Defaults { + + @transient private val logger: Logger = Logger.getLogger(this.getClass) + + // Instantiate the default scoring metric + private var _scoringMetric = modelSelection match { + case "regressor" => "rmse" + case "classifier" => "f1" + case _ => + throw new UnsupportedOperationException( + s"Model $modelSelection is not supported." + ) + } + + private var _randomForestNumericBoundaries = _rfDefaultNumBoundaries + + private var _randomForestStringBoundaries = _rfDefaultStringBoundaries + + private var _classificationMetrics = classificationMetrics + + def setScoringMetric(value: String): this.type = { + modelSelection match { + case "regressor" => + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + case "classifier" => + require( + _classificationMetrics.contains(value), + s"Classification scoring metric '$value' is not a valid member of ${invalidateSelection(value, _classificationMetrics)}" + ) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + this._scoringMetric = value + this + } + + def setRandomForestNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + this._randomForestNumericBoundaries = value + this + } + + def setRandomForestStringBoundaries( + value: Map[String, List[String]] + ): this.type = { + this._randomForestStringBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getRandomForestNumericBoundaries: Map[String, (Double, Double)] = + _randomForestNumericBoundaries + + def getRandomForestStringBoundaries: Map[String, List[String]] = + 
_randomForestStringBoundaries + + def getClassificationMetrics: List[String] = _classificationMetrics + + def getRegressionMetrics: List[String] = regressionMetrics + + /** + * Private method for updating the maxBins setting for the tree algorithm to ensure that cardinality validation + * occurs for each nominal field in the feature vector to ensure that entropy / information gain / gini calculations + * can be conducted correctly. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def resetNumericBoundaries: this.type = { + + _randomForestNumericBoundaries = ModelUtils.resetTreeBinsSearchSpace( + df, + _randomForestNumericBoundaries, + _fieldsToIgnore, + _labelCol, + _featureCol + ) + this + + } + + private def resetClassificationMetrics: List[String] = modelSelection match { + case "classifier" => + classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + case _ => classificationMetrics + } + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + private def modelDecider[A, B](modelConfig: RandomForestConfig) = { + + val builtModel = modelSelection match { + case "classifier" => + new RandomForestClassifier() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setNumTrees(modelConfig.numTrees) + .setCheckpointInterval(-1) + .setImpurity(modelConfig.impurity) + .setMaxBins(modelConfig.maxBins) + .setMaxDepth(modelConfig.maxDepth) + .setMinInfoGain(modelConfig.minInfoGain) + .setFeatureSubsetStrategy(modelConfig.featureSubsetStrategy) + .setSubsamplingRate(modelConfig.subSamplingRate) + case "regressor" => + new RandomForestRegressor() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setNumTrees(modelConfig.numTrees) + .setCheckpointInterval(-1) + .setImpurity(modelConfig.impurity) + .setMaxBins(modelConfig.maxBins) + .setMaxDepth(modelConfig.maxDepth) + .setMinInfoGain(modelConfig.minInfoGain) + 
.setFeatureSubsetStrategy(modelConfig.featureSubsetStrategy) + .setSubsamplingRate(modelConfig.subSamplingRate) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + builtModel + } + + override def generateRandomString( + param: String, + boundaryMap: Map[String, List[String]] + ): String = { + + val stringListing = param match { + case "impurity" => + modelSelection match { + case "regressor" => List("variance") + case _ => boundaryMap(param) + } + case _ => boundaryMap(param) + } + _randomizer.shuffle(stringListing).head + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[RandomForestModelsWithResults] + ): (RandomForestConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[RandomForestModelsWithResults] + ): Array[RandomForestModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[RandomForestModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def 
generateThresholdedParams( + iterationCount: Int + ): Array[RandomForestConfig] = { + + val iterations = new ArrayBuffer[RandomForestConfig] + + var i = 0 + do { + val featureSubsetStrategy = generateRandomString( + "featureSubsetStrategy", + _randomForestStringBoundaries + ) + val subSamplingRate = + generateRandomDouble("subSamplingRate", _randomForestNumericBoundaries) + val impurity = + generateRandomString("impurity", _randomForestStringBoundaries) + val minInfoGain = + generateRandomDouble("minInfoGain", _randomForestNumericBoundaries) + val maxBins = + generateRandomInteger("maxBins", _randomForestNumericBoundaries) + val numTrees = + generateRandomInteger("numTrees", _randomForestNumericBoundaries) + val maxDepth = + generateRandomInteger("maxDepth", _randomForestNumericBoundaries) + iterations += RandomForestConfig( + numTrees, + impurity, + maxBins, + maxDepth, + minInfoGain, + subSamplingRate, + featureSubsetStrategy + ) + i += 1 + } while (i < iterationCount) + + iterations.toArray + } + + private def generateAndScoreRandomForestModel( + train: DataFrame, + test: DataFrame, + modelConfig: RandomForestConfig, + generation: Int = 1 + ): RandomForestModelsWithResults = { + + val randomForestModel = modelDecider(modelConfig) + + val builtModel = randomForestModel.fit(train) + + val predictedData = builtModel.transform(test) + + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + modelSelection match { + case "classifier" => + for (i <- _classificationMetrics) { + scoringMap(i) = + classificationScoring(i, _labelCol, optimizedPredictions) + } + case "regressor" => + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, optimizedPredictions) + } + } + + val rfModelsWithResults = RandomForestModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + 
rfModelsWithResults + } + + private def runBattery( + battery: Array[RandomForestConfig], + generation: Int = 1 + ): Array[RandomForestModelsWithResults] = { + + val metrics = modelSelection match { + case "classifier" => _classificationMetrics + case _ => regressionMetrics + } + + val statusObj = new ModelReporting("randomForest", metrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[RandomForestModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreRandomForestModel(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + modelSelection match { + case "classifier" => + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case "regressor" => + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case _ => + throw new UnsupportedOperationException( + s"$modelSelection is not a supported model type." 
+ ) + } + + val runAvg = RandomForestModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[RandomForestConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[RandomForestConfig] = { + + val mutationPayload = new ArrayBuffer[RandomForestConfig] + val totalConfigs = modelConfigLength[RandomForestConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + mutationPayload += RandomForestConfig( + if (mutationIndexIteration.contains(0)) + geneMixing( + randomParent.numTrees, + mutationIteration.numTrees, + mutationMagnitude + ) + else randomParent.numTrees, + if (mutationIndexIteration.contains(1)) + geneMixing(randomParent.impurity, mutationIteration.impurity) + else randomParent.impurity, + if (mutationIndexIteration.contains(2)) + geneMixing( + randomParent.maxBins, + mutationIteration.maxBins, + mutationMagnitude + ) + else randomParent.maxBins, + if (mutationIndexIteration.contains(3)) + geneMixing( + randomParent.maxDepth, + mutationIteration.maxDepth, + mutationMagnitude + ) + else randomParent.maxDepth, + if 
(mutationIndexIteration.contains(4)) + geneMixing( + randomParent.minInfoGain, + mutationIteration.minInfoGain, + mutationMagnitude + ) + else randomParent.minInfoGain, + if (mutationIndexIteration.contains(5)) + geneMixing( + randomParent.subSamplingRate, + mutationIteration.subSamplingRate, + mutationMagnitude + ) + else randomParent.subSamplingRate, + if (mutationIndexIteration.contains(6)) + geneMixing( + randomParent.featureSubsetStrategy, + mutationIteration.featureSubsetStrategy + ) + else randomParent.featureSubsetStrategy + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution(): Array[RandomForestModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + if (!isPipeline) resetNumericBoundaries + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[RandomForestModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[RandomForestConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[RandomForestConfig] + val startingModelSeed = generateRandomForestConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + 
.setModelFamily("RandomForest") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedRandomForest( + _randomForestNumericBoundaries, + _randomForestStringBoundaries + ) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. + val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. 
+ if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[RandomForestModelsWithResults] + ): Array[RandomForestConfig] = { + val bestParents = new ArrayBuffer[RandomForestConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[RandomForestModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + if (!isPipeline) resetNumericBoundaries + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + // Record of all 
generations results + val fossilRecord = new ArrayBuffer[RandomForestModelsWithResults] + + val totalConfigs = modelConfigLength[RandomForestConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[RandomForestConfig] + val startingModelSeed = generateRandomForestConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("RandomForest") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedRandomForest( + _randomForestNumericBoundaries, + _randomForestStringBoundaries + ) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .randomForestCandidates( + "RandomForest", + 
_geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .randomForestCandidates( + "RandomForest", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): RandomForestModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[RandomForestModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF() + : (Array[RandomForestModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + 
(evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper Method for a post-run model optimization based on theoretical hyperparam multidimensional grid search space + * After a genetic tuning run is complete, this allows for a model to be trained and run to predict a potential + * best-condition of hyper parameter configurations. + * + * @param paramsToTest Array of RandomForest Configuration (hyper parameter settings) from the post-run model + * inference + * @return The results of the hyper parameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[RandomForestConfig] + ): (Array[RandomForestModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/SVMTuner.scala b/src/main/scala/com/databricks/labs/automl/model/SVMTuner.scala new file mode 100644 index 00000000..f96ada1e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/SVMTuner.scala @@ -0,0 +1,637 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting +} +import com.databricks.labs.automl.params.{ + Defaults, + SVMConfig, + SVMModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import org.apache.spark.storage.StorageLevel +import org.apache.spark.ml.classification.LinearSVC +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions.col + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import 
scala.concurrent.forkjoin.ForkJoinPool + +class SVMTuner(df: DataFrame, + data: Array[TrainSplitReferences], + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Evolution + with Defaults { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = _scoringDefaultRegressor + + private var _svmNumericBoundaries = _svmDefaultNumBoundaries + + def setScoringMetric(value: String): this.type = { + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + _scoringMetric = value + this + } + + def setSvmNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + _svmNumericBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getSvmNumericBoundaries: Map[String, (Double, Double)] = + _svmNumericBoundaries + + def getRegressionMetrics: List[String] = regressionMetrics + + private def configureModel(modelConfig: SVMConfig): LinearSVC = { + new LinearSVC() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setFitIntercept(modelConfig.fitIntercept) + .setMaxIter(modelConfig.maxIter) + .setRegParam(modelConfig.regParam) + .setStandardization(modelConfig.standardization) + .setTol(modelConfig.tolerance) + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[SVMModelsWithResults] + ): (SVMConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + 
+ private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[SVMModelsWithResults] + ): Array[SVMModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[SVMModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[SVMConfig] = { + val iterations = new ArrayBuffer[SVMConfig] + + var i = 0 + do { + val fitIntercept = coinFlip() + val maxIter = generateRandomInteger("maxIter", _svmNumericBoundaries) + val regParam = generateRandomDouble("regParam", _svmNumericBoundaries) + val standardization = coinFlip() + val tolerance = generateRandomDouble("tolerance", _svmNumericBoundaries) + iterations += SVMConfig( + fitIntercept, + maxIter, + regParam, + standardization, + tolerance + ) + i += 1 + } while (i < iterationCount) + iterations.toArray + } + + private def generateAndScoreSVM(train: DataFrame, + test: DataFrame, + modelConfig: SVMConfig, + generation: Int = 1): SVMModelsWithResults = { + + val svmModel = configureModel(modelConfig) + val builtModel = svmModel.fit(train) + val predictedData = builtModel.transform(test) + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, optimizedPredictions) + } + val svmModelsWithResults = SVMModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation 
+ ) + + optimizedPredictions.unpersist() + svmModelsWithResults + } + + private def runBattery(battery: Array[SVMConfig], + generation: Int = 1): Array[SVMModelsWithResults] = { + + val statusObj = new ModelReporting("svm", regressionMetrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[SVMModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreSVM(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + + val runAvg = SVMModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[SVMConfig], + mutationCount: Int, + mutationAggression: 
Int, + mutationMagnitude: Double + ): Array[SVMConfig] = { + + val mutationPayload = new ArrayBuffer[SVMConfig] + val totalConfigs = modelConfigLength[SVMConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + mutationPayload += SVMConfig( + if (mutationIndexIteration.contains(0)) + coinFlip( + randomParent.fitIntercept, + mutationIteration.fitIntercept, + mutationMagnitude + ) + else randomParent.fitIntercept, + if (mutationIndexIteration.contains(1)) + geneMixing( + randomParent.maxIter, + mutationIteration.maxIter, + mutationMagnitude + ) + else randomParent.maxIter, + if (mutationIndexIteration.contains(2)) + geneMixing( + randomParent.regParam, + mutationIteration.regParam, + mutationMagnitude + ) + else randomParent.regParam, + if (mutationIndexIteration.contains(3)) + coinFlip( + randomParent.standardization, + mutationIteration.standardization, + mutationMagnitude + ) + else randomParent.standardization, + if (mutationIndexIteration.contains(4)) + geneMixing( + randomParent.tolerance, + mutationIteration.tolerance, + mutationMagnitude + ) + else randomParent.tolerance + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution(): Array[SVMModelsWithResults] = { + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[SVMModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding
the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[SVMConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[SVMConfig] + val startingModelSeed = generateSVMConfig(_modelSeed) + genArray += startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("SVM") + .setModelType("regressor") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedSVM(_svmNumericBoundaries) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. 
+ val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. + if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[SVMModelsWithResults] + ): Array[SVMConfig] = { + val bestParents = new ArrayBuffer[SVMConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + bestParents 
+= x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[SVMModelsWithResults] = { + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + // Record of all generations' results + val fossilRecord = new ArrayBuffer[SVMModelsWithResults] + + val totalConfigs = modelConfigLength[SVMConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[SVMConfig] + val startingModelSeed = generateSVMConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("SVM") + .setModelType("regressor") + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedSVM(_svmNumericBoundaries) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness,
+ _geneticMixing + ) + + val evolution = GenerationOptimizer + .svmCandidates( + "SVM", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .svmCandidates( + "SVM", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): SVMModelsWithResults = { + require(df != null && df.count() > 0) + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[SVMModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF(): (Array[SVMModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case 
"batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper Method for a post-run model optimization based on theoretical hyperparam multidimensional grid search space + * After a genetic tuning run is complete, this allows for a model to be trained and run to predict a potential + * best-condition of hyper parameter configurations. + * + * @param paramsToTest Array of SVM Configuration (hyper parameter settings) from the post-run model + * inference + * @return The results of the hyper parameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[SVMConfig] + ): (Array[SVMModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/XGBoostTuner.scala b/src/main/scala/com/databricks/labs/automl/model/XGBoostTuner.scala new file mode 100644 index 00000000..5911b3e4 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/XGBoostTuner.scala @@ -0,0 +1,854 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.split.PerformanceSettings +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.model.tools.{ + GenerationOptimizer, + HyperParameterFullSearch, + ModelReporting +} +import com.databricks.labs.automl.params.{ + Defaults, + XGBoostConfig, + XGBoostModelsWithResults +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostRegressor} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.storage.StorageLevel + +import 
scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.collection.parallel.mutable.ParHashSet +import scala.concurrent.forkjoin.ForkJoinPool + +class XGBoostTuner(df: DataFrame, + data: Array[TrainSplitReferences], + modelSelection: String, + isPipeline: Boolean = false) + extends SparkSessionWrapper + with Evolution + with Defaults + with Serializable { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _scoringMetric = modelSelection match { + case "regressor" => "rmse" + case "classifier" => "f1" + case _ => + throw new UnsupportedOperationException( + s"Model $modelSelection is not supported." + ) + } + + private var _classificationMetrics = classificationMetrics + + private var _xgboostNumericBoundaries = _xgboostDefaultNumBoundaries + + def setScoringMetric(value: String): this.type = { + modelSelection match { + case "regressor" => + require( + regressionMetrics.contains(value), + s"Regressor scoring metric '$value' is not a valid member of ${invalidateSelection(value, regressionMetrics)}" + ) + case "classifier" => + require( + classificationMetrics.contains(value), + s"Classifier scoring metric '$value' is not a valid member of ${invalidateSelection(value, classificationMetrics)}" + ) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + this._scoringMetric = value + this + } + + def setXGBoostNumericBoundaries( + value: Map[String, (Double, Double)] + ): this.type = { + _xgboostNumericBoundaries = value + this + } + + def getScoringMetric: String = _scoringMetric + + def getXGBoostNumericBoundaries: Map[String, (Double, Double)] = + _xgboostNumericBoundaries + + def getClassificationMetrics: List[String] = _classificationMetrics + + def getRegressionMetrics: List[String] = regressionMetrics + + private def resetClassificationMetrics: List[String] = modelSelection match { + case "classifier" =>
classificationMetricValidator( + classificationAdjudicator(df), + classificationMetrics + ) + case _ => classificationMetrics + } + + private def setClassificationMetrics(value: List[String]): this.type = { + _classificationMetrics = value + this + } + + final lazy val uniqueLabels: Int = modelSelection match { + case "regressor" => 0 + case "classifier" => + df.select(col(_labelCol)).distinct.count.toInt + } + + private def modelDecider[A, B](modelConfig: XGBoostConfig) = { + + val xgObjective: String = modelSelection match { + case "regressor" => "None" + case _ => + uniqueLabels match { + case x if x <= 2 => "reg:squarederror" + case _ => "multi:softmax" + } + } + + val xgbStartString = + s"Building XGBoost model with: ${PerformanceSettings.coresPerTask} threads & ${PerformanceSettings + .xgbWorkers(_parallelism)} workers." + logger.log(Level.INFO, xgbStartString) + + val builtModel = modelSelection match { + case "classifier" => + val xgClass = new XGBoostClassifier() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setAlpha(modelConfig.alpha) + .setEta(modelConfig.eta) + .setGamma(modelConfig.gamma) + .setLambda(modelConfig.lambda) + .setMaxDepth(modelConfig.maxDepth) + .setMaxBins(modelConfig.maxBins) + .setSubsample(modelConfig.subSample) + .setMinChildWeight(modelConfig.minChildWeight) + .setNumRound(modelConfig.numRound) + .setTrainTestRatio(modelConfig.trainTestRatio) + .setNthread(PerformanceSettings.coresPerTask) + .setNumWorkers(PerformanceSettings.xgbWorkers(_parallelism)) + .setMissing(0.0f) + if (uniqueLabels > 2) { + xgClass + .setNumClass(uniqueLabels) + .setObjective(xgObjective) + } + xgClass + case "regressor" => + new XGBoostRegressor() + .setLabelCol(_labelCol) + .setFeaturesCol(_featureCol) + .setAlpha(modelConfig.alpha) + .setEta(modelConfig.eta) + .setGamma(modelConfig.gamma) + .setLambda(modelConfig.lambda) + .setMaxDepth(modelConfig.maxDepth) + .setMaxBins(modelConfig.maxBins) + .setSubsample(modelConfig.subSample) + 
.setMinChildWeight(modelConfig.minChildWeight) + .setNumRound(modelConfig.numRound) + .setTrainTestRatio(modelConfig.trainTestRatio) + .setNthread(PerformanceSettings.coresPerTask) + .setNumWorkers(PerformanceSettings.xgbWorkers(_parallelism)) + .setMissing(0.0f) + case _ => + throw new UnsupportedOperationException( + s"Unsupported modelType $modelSelection" + ) + } + builtModel + } + + private def returnBestHyperParameters( + collection: ArrayBuffer[XGBoostModelsWithResults] + ): (XGBoostConfig, Double) = { + + val bestEntry = _optimizationStrategy match { + case "minimize" => + collection.result.toArray.sortWith(_.score < _.score).head + case _ => collection.result.toArray.sortWith(_.score > _.score).head + } + (bestEntry.modelHyperParams, bestEntry.score) + } + + private def evaluateStoppingScore(currentBestScore: Double, + stopThreshold: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (currentBestScore > stopThreshold) true else false + case _ => if (currentBestScore < stopThreshold) true else false + } + } + + private def evaluateBestScore(runScore: Double, + bestScore: Double): Boolean = { + _optimizationStrategy match { + case "minimize" => if (runScore < bestScore) true else false + case _ => if (runScore > bestScore) true else false + } + } + + private def sortAndReturnAll( + results: ArrayBuffer[XGBoostModelsWithResults] + ): Array[XGBoostModelsWithResults] = { + _optimizationStrategy match { + case "minimize" => results.result.toArray.sortWith(_.score < _.score) + case _ => results.result.toArray.sortWith(_.score > _.score) + } + } + + private def sortAndReturnBestScore( + results: ArrayBuffer[XGBoostModelsWithResults] + ): Double = { + sortAndReturnAll(results).head.score + } + + /** + * Method for extracting the predicted class for multi-class classification problems directly from the probabilities + * linalg.Vector field. This is due to a bug in XGBoost4j-spark and should be future-proof. 
+ * @param data The transformed data frame with the incorrect prediction values + * @return Fixed prediction column that acquires the predicted class label from the probability Vector + * @author Ben Wilson + * @since 0.5.1 + */ + private def multiClassPredictionExtract(data: DataFrame): DataFrame = { + + // udf must be defined as a function in order to be serialized as an Object. Defining as a method + // prevents the Future from serializing properly. + val extractUDF = udf( + (v: org.apache.spark.ml.linalg.Vector) => v.toArray.last + ) + // Replace the prediction column with the correct data. + data.withColumn("prediction", extractUDF(col("probability"))) + } + + private def generateThresholdedParams( + iterationCount: Int + ): Array[XGBoostConfig] = { + + val iterations = new ArrayBuffer[XGBoostConfig] + + var i = 0 + do { + val alpha = generateRandomDouble("alpha", _xgboostNumericBoundaries) + val eta = generateRandomDouble("eta", _xgboostNumericBoundaries) + val gamma = generateRandomDouble("gamma", _xgboostNumericBoundaries) + val lambda = generateRandomDouble("lambda", _xgboostNumericBoundaries) + val maxDepth = + generateRandomInteger("maxDepth", _xgboostNumericBoundaries) + val subSample = + generateRandomDouble("subSample", _xgboostNumericBoundaries) + val minChildWeight = + generateRandomDouble("minChildWeight", _xgboostNumericBoundaries) + val numRound = + generateRandomInteger("numRound", _xgboostNumericBoundaries) + val maxBins = generateRandomInteger("maxBins", _xgboostNumericBoundaries) + val trainTestRatio = + generateRandomDouble("trainTestRatio", _xgboostNumericBoundaries) + iterations += XGBoostConfig( + alpha, + eta, + gamma, + lambda, + maxDepth, + subSample, + minChildWeight, + numRound, + maxBins, + trainTestRatio + ) + i += 1 + } while (i < iterationCount) + + iterations.toArray + } + + private def generateAndScoreXGBoostModel( + train: DataFrame, + test: DataFrame, + modelConfig: XGBoostConfig, + generation: Int = 1 + ): 
XGBoostModelsWithResults = { + + val xgboostModel = modelDecider(modelConfig) + + val builtModel = xgboostModel.fit(train) + + val predictedData = builtModel.transform(test) + + val optimizedPredictions = predictedData.persist(StorageLevel.DISK_ONLY) +// optimizedPredictions.foreach(_ => ()) + + // Due to a bug in XGBoost's transformer for accessing the probability Vector to provide a prediction + // This method needs to be called if the unique count for the label class is non-binary for a classifier. + + val fixedPredictionData = modelSelection match { + case "regressor" => optimizedPredictions + case _ => + uniqueLabels match { + case x if x <= 2 => optimizedPredictions + case _ => multiClassPredictionExtract(optimizedPredictions) + } + } + + val scoringMap = scala.collection.mutable.Map[String, Double]() + + modelSelection match { + case "classifier" => + for (i <- _classificationMetrics) { + scoringMap(i) = + classificationScoring(i, _labelCol, fixedPredictionData) + } + case "regressor" => + for (i <- regressionMetrics) { + scoringMap(i) = regressionScoring(i, _labelCol, fixedPredictionData) + } + } + + val xgbModelWithResults = XGBoostModelsWithResults( + modelConfig, + builtModel, + scoringMap(_scoringMetric), + scoringMap.toMap, + generation + ) + + optimizedPredictions.unpersist() + xgbModelWithResults + } + + private def runBattery( + battery: Array[XGBoostConfig], + generation: Int = 1 + ): Array[XGBoostModelsWithResults] = { + + val metrics = modelSelection match { + case "classifier" => _classificationMetrics + case _ => regressionMetrics + } + + val statusObj = new ModelReporting("xgboost", metrics) + + validateLabelAndFeatures(df, _labelCol, _featureCol) + + @volatile var results = new ArrayBuffer[XGBoostModelsWithResults] + @volatile var modelCnt = 0 + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val runs = battery.par + runs.tasksupport = taskSupport + + val uniqueLabels: Array[Row] = 
df.select(_labelCol).distinct().collect() + + val currentStatus = statusObj.generateGenerationStartStatement( + generation, + calculateModelingFamilyRemainingTime(generation, modelCnt) + ) + + println(currentStatus) + logger.log(Level.INFO, currentStatus) + + runs.foreach { x => + val runId = java.util.UUID.randomUUID() + + println(statusObj.generateRunStartStatement(runId, x)) + + val kFoldTimeStamp = System.currentTimeMillis() / 1000 + + val kFoldBuffer = data.map { z => + generateAndScoreXGBoostModel(z.data.train, z.data.test, x) + } + + val scores = kFoldBuffer.map(_.score) + + val scoringMap = scala.collection.mutable.Map[String, Double]() + modelSelection match { + case "classifier" => + for (a <- _classificationMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case "regressor" => + for (a <- regressionMetrics) { + val metricScores = new ListBuffer[Double] + kFoldBuffer.map(x => metricScores += x.evalMetrics(a)) + scoringMap(a) = metricScores.sum / metricScores.length + } + case _ => + throw new UnsupportedOperationException( + s"$modelSelection is not a supported model type." 
+ ) + } + + val runAvg = XGBoostModelsWithResults( + x, + kFoldBuffer.head.model, + scores.sum / scores.length, + scoringMap.toMap, + generation + ) + + results += runAvg + modelCnt += 1 + + val runStatement = statusObj.generateRunScoreStatement( + runId, + scoringMap.result.toMap, + _scoringMetric, + x, + calculateModelingFamilyRemainingTime(generation, modelCnt), + kFoldTimeStamp + ) + + println(runStatement) + + logger.log(Level.INFO, runStatement) + + } + + sortAndReturnAll(results) + + } + + private def irradiateGeneration( + parents: Array[XGBoostConfig], + mutationCount: Int, + mutationAggression: Int, + mutationMagnitude: Double + ): Array[XGBoostConfig] = { + + val mutationPayload = new ArrayBuffer[XGBoostConfig] + val totalConfigs = modelConfigLength[XGBoostConfig] + val indexMutation = + if (mutationAggression >= totalConfigs) totalConfigs - 1 + else totalConfigs - mutationAggression + val mutationCandidates = generateThresholdedParams(mutationCount) + val mutationIndeces = + generateMutationIndeces(1, totalConfigs, indexMutation, mutationCount) + + for (i <- mutationCandidates.indices) { + + val randomParent = scala.util.Random.shuffle(parents.toList).head + val mutationIteration = mutationCandidates(i) + val mutationIndexIteration = mutationIndeces(i) + + mutationPayload += XGBoostConfig( + if (mutationIndexIteration.contains(0)) + geneMixing( + randomParent.alpha, + mutationIteration.alpha, + mutationMagnitude + ) + else randomParent.alpha, + if (mutationIndexIteration.contains(1)) + geneMixing(randomParent.eta, mutationIteration.eta, mutationMagnitude) + else randomParent.eta, + if (mutationIndexIteration.contains(2)) + geneMixing( + randomParent.gamma, + mutationIteration.gamma, + mutationMagnitude + ) + else randomParent.gamma, + if (mutationIndexIteration.contains(3)) + geneMixing( + randomParent.lambda, + mutationIteration.lambda, + mutationMagnitude + ) + else randomParent.lambda, + if (mutationIndexIteration.contains(4)) + geneMixing( + 
randomParent.maxDepth, + mutationIteration.maxDepth, + mutationMagnitude + ) + else randomParent.maxDepth, + if (mutationIndexIteration.contains(5)) + geneMixing( + randomParent.subSample, + mutationIteration.subSample, + mutationMagnitude + ) + else randomParent.subSample, + if (mutationIndexIteration.contains(6)) + geneMixing( + randomParent.minChildWeight, + mutationIteration.minChildWeight, + mutationMagnitude + ) + else randomParent.minChildWeight, + if (mutationIndexIteration.contains(7)) + geneMixing( + randomParent.numRound, + mutationIteration.numRound, + mutationMagnitude + ) + else randomParent.numRound, + if (mutationIndexIteration.contains(8)) + geneMixing( + randomParent.maxBins, + mutationIteration.maxBins, + mutationMagnitude + ) + else randomParent.maxBins, + if (mutationIndexIteration.contains(9)) + geneMixing( + randomParent.trainTestRatio, + mutationIteration.trainTestRatio, + mutationMagnitude + ) + else randomParent.trainTestRatio + ) + } + mutationPayload.result.toArray + } + + private def continuousEvolution(): Array[XGBoostModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + + logger.log(Level.DEBUG, debugSettings) + + val taskSupport = new ForkJoinTaskSupport( + new ForkJoinPool(_continuousEvolutionParallelism) + ) + + var runResults = new ArrayBuffer[XGBoostModelsWithResults] + + var scoreHistory = new ArrayBuffer[Double] + + // Set the beginning of the loop and instantiate a placeholder for holding the current best score + var iter: Int = 1 + var bestScore: Double = 0.0 + var rollingImprovement: Boolean = true + var incrementalImprovementCount: Int = 0 + val earlyStoppingImprovementThreshold: Int = + _continuousEvolutionImprovementThreshold + + val totalConfigs = modelConfigLength[XGBoostConfig] + + var runSet = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val genArray = new ArrayBuffer[XGBoostConfig] + val startingModelSeed = generateXGBoostConfig(_modelSeed) + genArray +=
startingModelSeed + genArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + ParHashSet(genArray.result.toArray: _*) + } else { + ParHashSet(generateThresholdedParams(_firstGenerationGenePool): _*) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("XGBoost") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedXGBoost(_xgboostNumericBoundaries) + ParHashSet(startingPool: _*) + } + + // Apply ForkJoin ThreadPool parallelism + runSet.tasksupport = taskSupport + + do { + + runSet.foreach(x => { + + try { + // Pull the config out of the HashSet + runSet -= x + + // Run the model config + val run = runBattery(Array(x), iter) + + runResults += run.head + scoreHistory += run.head.score + + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + + bestScore = currentBestScore + + // Add a mutated version of the current best model to the ParHashSet + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + + // Evaluate whether the scores are staying static over the last configured rolling window. + val currentWindowValues = scoreHistory.slice( + scoreHistory.length - _continuousEvolutionRollingImprovementCount, + scoreHistory.length + ) + + // Check for static values + val staticCheck = currentWindowValues.toSet.size + + // If there is more than one value, proceed with validation check on whether the model is improving over time. 
+ if (staticCheck > 1) { + val (early, later) = currentWindowValues.splitAt( + scala.math.round(currentWindowValues.size / 2) + ) + if (later.sum / later.length < early.sum / early.length) { + incrementalImprovementCount += 1 + } else { + incrementalImprovementCount -= 1 + } + } else { + rollingImprovement = false + } + + val statusReport = s"Current Best Score: $bestScore as of run: $iter with cumulative improvement count of: " + + s"$incrementalImprovementCount" + + logger.log(Level.INFO, statusReport) + println(statusReport) + + iter += 1 + + } catch { + case e: java.lang.NullPointerException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + case f: java.lang.ArrayIndexOutOfBoundsException => + val (bestConfig, currentBestScore) = + returnBestHyperParameters(runResults) + runSet += irradiateGeneration( + Array(bestConfig), + 1, + _continuousEvolutionMutationAggressiveness, + _continuousEvolutionGeneticMixing + ).head + bestScore = currentBestScore + } + }) + } while (iter < _continuousEvolutionMaxIterations && + evaluateStoppingScore(bestScore, _continuousEvolutionStoppingScore) + && rollingImprovement && incrementalImprovementCount > earlyStoppingImprovementThreshold) + + sortAndReturnAll(runResults) + + } + + def generateIdealParents( + results: Array[XGBoostModelsWithResults] + ): Array[XGBoostConfig] = { + val bestParents = new ArrayBuffer[XGBoostConfig] + results + .take(_numberOfParentsToRetain) + .map(x => { + bestParents += x.modelHyperParams + }) + bestParents.result.toArray + } + + def evolveParameters(): Array[XGBoostModelsWithResults] = { + + setClassificationMetrics(resetClassificationMetrics) + + logger.log(Level.DEBUG, debugSettings) + + var generation = 1 + // Record of all generations results + val fossilRecord = new 
ArrayBuffer[XGBoostModelsWithResults] + + val totalConfigs = modelConfigLength[XGBoostConfig] + + val primordial = _initialGenerationMode match { + + case "random" => + if (_modelSeedSet) { + val generativeArray = new ArrayBuffer[XGBoostConfig] + val startingModelSeed = generateXGBoostConfig(_modelSeed) + generativeArray += startingModelSeed + generativeArray ++= irradiateGeneration( + Array(startingModelSeed), + _firstGenerationGenePool, + totalConfigs - 1, + _geneticMixing + ) + runBattery(generativeArray.result.toArray, generation) + } else { + runBattery( + generateThresholdedParams(_firstGenerationGenePool), + generation + ) + } + case "permutations" => + val startingPool = new HyperParameterFullSearch() + .setModelFamily("XGBoost") + .setModelType(modelSelection) + .setPermutationCount(_initialGenerationPermutationCount) + .setIndexMixingMode(_initialGenerationIndexMixingMode) + .setArraySeed(_initialGenerationArraySeed) + .initialGenerationSeedXGBoost(_xgboostNumericBoundaries) + runBattery(startingPool, generation) + } + + fossilRecord ++= primordial + generation += 1 + + var currentIteration = 1 + + if (_earlyStoppingFlag) { + + var currentBestResult = sortAndReturnBestScore(fossilRecord) + + if (evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + while (currentIteration <= _numberOfMutationGenerations && + evaluateStoppingScore(currentBestResult, _earlyStoppingScore)) { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, currentIteration) + + // Get the sorted state + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .xgBoostCandidates( + "XGBoost", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + 
var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + val postRunBestScore = sortAndReturnBestScore(fossilRecord) + + if (evaluateBestScore(postRunBestScore, currentBestResult)) + currentBestResult = postRunBestScore + + currentIteration += 1 + + } + + sortAndReturnAll(fossilRecord) + + } else { + sortAndReturnAll(fossilRecord) + } + } else { + (1 to _numberOfMutationGenerations).map(i => { + + val mutationAggressiveness: Int = + generateAggressiveness(totalConfigs, i) + + val currentState = sortAndReturnAll(fossilRecord) + + val expandedCandidates = irradiateGeneration( + generateIdealParents(currentState), + _numberOfMutationsPerGeneration * _geneticMBOCandidateFactor, + mutationAggressiveness, + _geneticMixing + ) + + val evolution = GenerationOptimizer + .xgBoostCandidates( + "XGBoost", + _geneticMBORegressorType, + fossilRecord, + expandedCandidates, + _optimizationStrategy, + _numberOfMutationsPerGeneration + ) + + var evolve = runBattery(evolution, generation) + generation += 1 + fossilRecord ++= evolve + + }) + + sortAndReturnAll(fossilRecord) + + } + } + + def evolveBest(): XGBoostModelsWithResults = { + evolveParameters().head + } + + def generateScoredDataFrame( + results: Array[XGBoostModelsWithResults] + ): DataFrame = { + + import spark.sqlContext.implicits._ + + val scoreBuffer = new ListBuffer[(Int, Double)] + results.map(x => scoreBuffer += ((x.generation, x.score))) + val scored = scoreBuffer.result + spark.sparkContext + .parallelize(scored) + .toDF("generation", "score") + .orderBy(col("generation").asc, col("score").asc) + } + + def evolveWithScoringDF(): (Array[XGBoostModelsWithResults], DataFrame) = { + + val evolutionResults = _evolutionStrategy match { + case "batch" => evolveParameters() + case "continuous" => continuousEvolution() + } + + (evolutionResults, generateScoredDataFrame(evolutionResults)) + } + + /** + * Helper Method for a post-run model optimization based on theoretical hyperparam 
multidimensional grid search space + * After a genetic tuning run is complete, this allows for a model to be trained and run to predict a potential + * best-condition of hyper parameter configurations. + * + * @param paramsToTest Array of XGBoost Configuration (hyper parameter settings) from the post-run model + * inference + * @return The results of the hyper parameter test, as well as the scored DataFrame report. + */ + def postRunModeledHyperParams( + paramsToTest: Array[XGBoostConfig] + ): (Array[XGBoostModelsWithResults], DataFrame) = { + + val finalRunResults = + runBattery(paramsToTest, _numberOfMutationGenerations + 2) + + (finalRunResults, generateScoredDataFrame(finalRunResults)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/GenerationOptimizer.scala b/src/main/scala/com/databricks/labs/automl/model/tools/GenerationOptimizer.scala new file mode 100644 index 00000000..4f480b42 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/GenerationOptimizer.scala @@ -0,0 +1,802 @@ +package com.databricks.labs.automl.model.tools + +import com.databricks.labs.automl.exceptions.ModelingTypeException +import com.databricks.labs.automl.model.tools.structures._ +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.utils.SparkSessionWrapper +import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor +import org.apache.spark.ml.feature.{ + MaxAbsScaler, + StringIndexer, + VectorAssembler +} +import org.apache.spark.ml.regression.{LinearRegression, RandomForestRegressor} +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{StringType, StructType} +import org.apache.spark.sql.{DataFrame, Dataset} + +import scala.collection.mutable.ArrayBuffer +import scala.reflect.ClassTag + +case class LayerConfig(layers: Int, hiddenLayers: Int) + +case class MLPCExtractConfig(layers: Int, + maxIter: Int, + solver: String, + stepSize: Double, + 
tolerance: Double, + hiddenLayerSizeAdjust: Int) + +case class FieldTypes(numericHyperParams: Array[String], + stringHyperParams: Array[String], + allHyperParams: Array[String]) + +object ModelTypes extends Enumeration { + type ModelTypes = Value + val Trees, GBT, LinearRegressor, LogisticRegression, MLPC, NaiveBayes, + RandomForest, SVM, XGBoost, LightGBM = Value +} + +object RegressorTypes extends Enumeration { + type RegressorTypes = Value + val RF, LR, XG = Value +} + +object OptimizationTypes extends Enumeration { + type OptimizationTypes = Value + val Minimize, Maximize = Value +} + +trait GenerationOptimizerBase extends SparkSessionWrapper { + + import com.databricks.labs.automl.model.tools.ModelTypes._ + import com.databricks.labs.automl.model.tools.OptimizationTypes._ + import com.databricks.labs.automl.model.tools.RegressorTypes._ + + private def layerExtract(layers: Array[Int]): LayerConfig = { + + val hiddenLayersSizeAdjust = + if (layers.length > 2) layers(1) - layers(0) else 0 + val layerCount = layers.length - 2 + + LayerConfig(layerCount, hiddenLayersSizeAdjust) + + } + + def mlpcLayerGenerator(inputFeatures: Int, + distinctClasses: Int, + layers: Int, + hiddenLayers: Int): Array[Int] = { + + val layerConstruct = new ArrayBuffer[Int] + + layerConstruct += inputFeatures + + (1 to layers).foreach { x => + layerConstruct += inputFeatures + layers - x + hiddenLayers + } + layerConstruct += distinctClasses + layerConstruct.result.toArray + + } + + def enumerateModelType(value: String): ModelTypes = { + + value match { + case "Trees" => Trees + case "GBT" => GBT + case "LinearRegression" => LinearRegressor + case "LogisticRegression" => LogisticRegression + case "MLPC" => MLPC + case "NaiveBayes" => NaiveBayes + case "RandomForest" => RandomForest + case "SVM" => SVM + case "XGBoost" => XGBoost + case "LightGBM" => LightGBM + case _ => + throw ModelingTypeException( + value, + ModelTypes.values.map(_.toString).toArray + ) + } + + } + + def 
enumerateRegressorType(value: String): RegressorTypes = { + value match { + case "RandomForest" => RF + case "LinearRegression" => LR + case "XGBoost" => XG + case _ => + throw ModelingTypeException( + value, + RegressorTypes.values.map(_.toString).toArray + ) + } + } + + def enumerateOptimizationType(value: String): OptimizationTypes = { + value match { + case "minimize" => Minimize + case "maximize" => Maximize + case _ => + throw ModelingTypeException(value, Array("minimize", "maximize")) + } + } + + def convertConfigToDF[A](modelType: ModelTypes, config: Array[A])( + implicit c: ClassTag[A] + ): DataFrame = { + + val data = modelType match { + case Trees => + val conf = config.asInstanceOf[Array[TreesModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + TreesModelRunReport( + impurity = hyperParams.impurity, + maxBins = hyperParams.maxBins, + maxDepth = hyperParams.maxDepth, + minInfoGain = hyperParams.minInfoGain, + minInstancesPerNode = hyperParams.minInstancesPerNode, + score = x.score + ) + }) + spark.createDataFrame(report) + case GBT => + val conf = config.asInstanceOf[Array[GBTModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + GBTModelRunReport( + impurity = hyperParams.impurity, + lossType = hyperParams.lossType, + maxBins = hyperParams.maxBins, + maxDepth = hyperParams.maxDepth, + maxIter = hyperParams.maxIter, + minInfoGain = hyperParams.minInfoGain, + minInstancesPerNode = hyperParams.minInstancesPerNode, + stepSize = hyperParams.stepSize, + score = x.score + ) + }) + spark.createDataFrame(report) + case LinearRegressor => + val conf = config.asInstanceOf[Array[LinearRegressionModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + LinearRegressionModelRunReport( + elasticNetParams = hyperParams.elasticNetParams, + fitIntercept = hyperParams.fitIntercept, + loss = hyperParams.loss, + maxIter = hyperParams.maxIter, + regParam = 
hyperParams.regParam, + standardization = hyperParams.standardization, + tolerance = hyperParams.tolerance, + score = x.score + ) + }) + spark.createDataFrame(report) + case LogisticRegression => + val conf = + config.asInstanceOf[Array[LogisticRegressionModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + LogisticRegressionModelRunReport( + elasticNetParams = hyperParams.elasticNetParams, + fitIntercept = hyperParams.fitIntercept, + maxIter = hyperParams.maxIter, + regParam = hyperParams.regParam, + standardization = hyperParams.standardization, + tolerance = hyperParams.tolerance, + score = x.score + ) + }) + spark.createDataFrame(report) + case MLPC => + val conf = config.asInstanceOf[Array[MLPCModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + val layers = layerExtract(hyperParams.layers) + MLPCModelRunReport( + layers = layers.layers, + maxIter = hyperParams.maxIter, + solver = hyperParams.solver, + stepSize = hyperParams.stepSize, + tolerance = hyperParams.tolerance, + hiddenLayerSizeAdjust = layers.hiddenLayers, + score = x.score + ) + }) + spark.createDataFrame(report) + case RandomForest => + val conf = config.asInstanceOf[Array[RandomForestModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + RandomForestModelRunReport( + numTrees = hyperParams.numTrees, + impurity = hyperParams.impurity, + maxBins = hyperParams.maxBins, + maxDepth = hyperParams.maxDepth, + minInfoGain = hyperParams.minInfoGain, + subSamplingRate = hyperParams.subSamplingRate, + featureSubsetStrategy = hyperParams.featureSubsetStrategy, + score = x.score + ) + }) + spark.createDataFrame(report) + case SVM => + val conf = config.asInstanceOf[Array[SVMModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + SVMModelRunReport( + fitIntercept = hyperParams.fitIntercept, + maxIter = hyperParams.maxIter, + regParam = hyperParams.regParam, + 
standardization = hyperParams.standardization, + tolerance = hyperParams.tolerance, + score = x.score + ) + }) + spark.createDataFrame(report) + case XGBoost => + val conf = config.asInstanceOf[Array[XGBoostModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + XGBoostModelRunReport( + alpha = hyperParams.alpha, + eta = hyperParams.eta, + gamma = hyperParams.gamma, + lambda = hyperParams.lambda, + maxDepth = hyperParams.maxDepth, + subSample = hyperParams.subSample, + minChildWeight = hyperParams.minChildWeight, + numRound = hyperParams.numRound, + maxBins = hyperParams.maxBins, + trainTestRatio = hyperParams.trainTestRatio, + score = x.score + ) + }) + spark.createDataFrame(report) + case LightGBM => + val conf = config.asInstanceOf[Array[LightGBMModelsWithResults]] + val report = conf.map(x => { + val hyperParams = x.modelHyperParams + LightGBMModelRunReport( + baggingFraction = hyperParams.baggingFraction, + baggingFreq = hyperParams.baggingFreq, + featureFraction = hyperParams.featureFraction, + learningRate = hyperParams.learningRate, + maxBin = hyperParams.maxBin, + maxDepth = hyperParams.maxDepth, + minSumHessianInLeaf = hyperParams.minSumHessianInLeaf, + numIterations = hyperParams.numIterations, + numLeaves = hyperParams.numLeaves, + boostFromAverage = hyperParams.boostFromAverage, + lambdaL1 = hyperParams.lambdaL1, + lambdaL2 = hyperParams.lambdaL2, + alpha = hyperParams.alpha, + boostingType = hyperParams.boostingType, + score = x.score + ) + }) + spark.createDataFrame(report) + } + data + } + + def convertCandidatesToDF[B](modelType: ModelTypes, + candidates: Array[B]): DataFrame = { + modelType match { + case Trees => + spark.createDataFrame(candidates.asInstanceOf[Array[TreesConfig]]) + case GBT => + spark.createDataFrame(candidates.asInstanceOf[Array[GBTConfig]]) + case LinearRegressor => + spark.createDataFrame( + candidates.asInstanceOf[Array[LinearRegressionConfig]] + ) + case LogisticRegression => + 
spark.createDataFrame( + candidates.asInstanceOf[Array[LogisticRegressionConfig]] + ) + case MLPC => + val conf = candidates.asInstanceOf[Array[MLPCConfig]] + val adjust = conf.map(x => { + val layers = layerExtract(x.layers) + MLPCExtractConfig( + layers = layers.layers, + maxIter = x.maxIter, + solver = x.solver, + stepSize = x.stepSize, + tolerance = x.tolerance, + hiddenLayerSizeAdjust = layers.hiddenLayers + ) + }) + spark.createDataFrame(adjust) + case RandomForest => + spark.createDataFrame( + candidates.asInstanceOf[Array[RandomForestConfig]] + ) + case SVM => + spark.createDataFrame(candidates.asInstanceOf[Array[SVMConfig]]) + case XGBoost => + spark.createDataFrame(candidates.asInstanceOf[Array[XGBoostConfig]]) + case LightGBM => + spark.createDataFrame(candidates.asInstanceOf[Array[LightGBMConfig]]) + } + } + + def fit(df: Dataset[_], pipeline: Pipeline): PipelineModel = { + + pipeline.fit(df) + + } + + def transform(df: Dataset[_], pipeline: PipelineModel): DataFrame = { + + pipeline.transform(df) + + } + +} + +class GenerationOptimizer[A, B](val modelType: String, + val regressorType: String, + var history: ArrayBuffer[A], + var candidates: Array[B], + val optimizationType: String, + val candidateCount: Int) + extends GenerationOptimizerBase { + + import com.databricks.labs.automl.model.tools.OptimizationTypes._ + import com.databricks.labs.automl.model.tools.RegressorTypes._ + + final val LABEL_COLUMN: String = "score" + final val UNSCALED_FEATURE_COLUMN: String = "features" + final val SCALED_FEATURE_COLUMN: String = "features_scaled" + final val PREDICTION_COLUMN: String = "predicted_score" + final val SI_SUFFIX: String = "_si" + + private final val modelEnum = enumerateModelType(modelType) + private final val regressorEnum = enumerateRegressorType(regressorType) + private final val optimizationEnum = enumerateOptimizationType( + optimizationType + ) + + private def extractFieldsToStringIndex(schema: StructType): FieldTypes = { + + val 
allHyperParams = schema.names.filterNot(LABEL_COLUMN.contains) + val stringHyperParams = schema + .filter(_.dataType == StringType) + .map(_.name) + .toArray + .filterNot(LABEL_COLUMN.contains) + val numericHyperParams = + allHyperParams.filterNot(stringHyperParams.contains) + + FieldTypes( + numericHyperParams = numericHyperParams, + stringHyperParams = stringHyperParams, + allHyperParams = allHyperParams + ) + } + + private def buildFeaturePipeline(fields: FieldTypes): Pipeline = { + + val stringIndexers = fields.stringHyperParams.map( + x => new StringIndexer().setInputCol(x).setOutputCol(x + SI_SUFFIX) + ) + val vectorNames = fields.stringHyperParams.map(_ + SI_SUFFIX) ++ fields.numericHyperParams + + val vectorAssembler = new VectorAssembler() + .setInputCols(vectorNames) + .setOutputCol(UNSCALED_FEATURE_COLUMN) + + val scaler = new MaxAbsScaler() + .setInputCol(UNSCALED_FEATURE_COLUMN) + .setOutputCol(SCALED_FEATURE_COLUMN) + + val regressor = regressorEnum match { + case LR => new LinearRegression().setPredictionCol(PREDICTION_COLUMN) + case RF => new RandomForestRegressor().setPredictionCol(PREDICTION_COLUMN) + case XG => + new XGBoostRegressor() + .setMissing(0.0f) + .setPredictionCol(PREDICTION_COLUMN) + } + + regressor.setLabelCol(LABEL_COLUMN).setFeaturesCol(SCALED_FEATURE_COLUMN) + + new Pipeline() + .setStages(stringIndexers :+ vectorAssembler :+ scaler :+ regressor) + + } + + private def sortRestrict(df: DataFrame, limit: Int): DataFrame = { + optimizationEnum match { + case Maximize => df.orderBy(col(PREDICTION_COLUMN).desc).limit(limit) + case Minimize => df.orderBy(col(PREDICTION_COLUMN).asc).limit(limit) + } + + } + + private def evaluateCandidates()(implicit c: ClassTag[A]): DataFrame = { + + val historyDF = convertConfigToDF(modelEnum, history.toArray) + + val historyFields = extractFieldsToStringIndex(historyDF.schema) + + val candidateDF = convertCandidatesToDF(modelEnum, candidates) + + val candidateFields = 
extractFieldsToStringIndex(candidateDF.schema) + + val pipeline = buildFeaturePipeline(historyFields) + + val model = fit(historyDF, pipeline) + + val prediction = transform(candidateDF, model) + + sortRestrict(prediction, candidateCount) + + } + + def generateRandomForestCandidates()( + implicit c: ClassTag[A] + ): Array[RandomForestConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + RandomForestConfig( + numTrees = x.getAs[Int]("numTrees"), + impurity = x.getAs[String]("impurity"), + maxBins = x.getAs[Int]("maxBins"), + maxDepth = x.getAs[Int]("maxDepth"), + minInfoGain = x.getAs[Double]("minInfoGain"), + subSamplingRate = x.getAs[Double]("subSamplingRate"), + featureSubsetStrategy = x.getAs[String]("featureSubsetStrategy") + ) + ) + } + + def generateDecisionTreesCandidates()( + implicit c: ClassTag[A] + ): Array[TreesConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + TreesConfig( + impurity = x.getAs[String]("impurity"), + maxBins = x.getAs[Int]("maxBins"), + maxDepth = x.getAs[Int]("maxDepth"), + minInfoGain = x.getAs[Double]("minInfoGain"), + minInstancesPerNode = x.getAs[Int]("minInstancesPerNode") + ) + ) + } + + def generateGBTCandidates()(implicit c: ClassTag[A]): Array[GBTConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + GBTConfig( + impurity = x.getAs[String]("impurity"), + lossType = x.getAs[String]("lossType"), + maxBins = x.getAs[Int]("maxBins"), + maxDepth = x.getAs[Int]("maxDepth"), + maxIter = x.getAs[Int]("maxIter"), + minInfoGain = x.getAs[Double]("minInfoGain"), + minInstancesPerNode = x.getAs[Int]("minInstancesPerNode"), + stepSize = x.getAs[Double]("stepSize") + ) + ) + } + + def generateLinearRegressionCandidates()( + implicit c: ClassTag[A] + ): Array[LinearRegressionConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + LinearRegressionConfig( + elasticNetParams 
= x.getAs[Double]("elasticNetParams"), + fitIntercept = x.getAs[Boolean]("fitIntercept"), + loss = x.getAs[String]("loss"), + maxIter = x.getAs[Int]("maxIter"), + regParam = x.getAs[Double]("regParam"), + standardization = x.getAs[Boolean]("standardization"), + tolerance = x.getAs[Double]("tolerance") + ) + ) + } + + def generateLogisticRegressionCandidates()( + implicit c: ClassTag[A] + ): Array[LogisticRegressionConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + LogisticRegressionConfig( + elasticNetParams = x.getAs[Double]("elasticNetParams"), + fitIntercept = x.getAs[Boolean]("fitIntercept"), + maxIter = x.getAs[Int]("maxIter"), + regParam = x.getAs[Double]("regParam"), + standardization = x.getAs[Boolean]("standardization"), + tolerance = x.getAs[Double]("tolerance") + ) + ) + + } + + def generateSVMCandidates()(implicit c: ClassTag[A]): Array[SVMConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + SVMConfig( + fitIntercept = x.getAs[Boolean]("fitIntercept"), + maxIter = x.getAs[Int]("maxIter"), + regParam = x.getAs[Double]("regParam"), + standardization = x.getAs[Boolean]("standardization"), + tolerance = x.getAs[Double]("tolerance") + ) + ) + } + + def generateXGBoostCandidates()( + implicit c: ClassTag[A] + ): Array[XGBoostConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map( + x => + XGBoostConfig( + alpha = x.getAs[Double]("alpha"), + eta = x.getAs[Double]("eta"), + gamma = x.getAs[Double]("gamma"), + lambda = x.getAs[Double]("lambda"), + maxDepth = x.getAs[Int]("maxDepth"), + subSample = x.getAs[Double]("subSample"), + minChildWeight = x.getAs[Double]("minChildWeight"), + numRound = x.getAs[Int]("numRound"), + maxBins = x.getAs[Int]("maxBins"), + trainTestRatio = x.getAs[Double]("trainTestRatio") + ) + ) + + } + + def generateLightGBMCandidates()( + implicit c: ClassTag[A] + ): Array[LightGBMConfig] = { + val candidates = 
evaluateCandidates() + candidates + .collect() + .map( + x => + LightGBMConfig( + baggingFraction = x.getAs[Double]("baggingFraction"), + baggingFreq = x.getAs[Int]("baggingFreq"), + featureFraction = x.getAs[Double]("featureFraction"), + learningRate = x.getAs[Double]("learningRate"), + maxBin = x.getAs[Int]("maxBin"), + maxDepth = x.getAs[Int]("maxDepth"), + minSumHessianInLeaf = x.getAs[Double]("minSumHessianInLeaf"), + numIterations = x.getAs[Int]("numIterations"), + numLeaves = x.getAs[Int]("numLeaves"), + boostFromAverage = x.getAs[Boolean]("boostFromAverage"), + lambdaL1 = x.getAs[Double]("lambdaL1"), + lambdaL2 = x.getAs[Double]("lambdaL2"), + alpha = x.getAs[Double]("alpha"), + boostingType = x.getAs[String]("boostingType") + ) + ) + } + + def generateMLPCCandidates(inputFeatures: Int, distinctClasses: Int)( + implicit c: ClassTag[A] + ): Array[MLPCConfig] = { + + val candidates = evaluateCandidates() + candidates + .collect() + .map(x => { + + val layers = mlpcLayerGenerator( + inputFeatures, + distinctClasses, + x.getAs[Int]("layers"), + x.getAs[Int]("hiddenLayerSizeAdjust") + ) + + MLPCConfig( + layers = layers, + maxIter = x.getAs[Int]("maxIter"), + solver = x.getAs[String]("solver"), + stepSize = x.getAs[Double]("stepSize"), + tolerance = x.getAs[Double]("tolerance") + ) + }) + + } + +} + +object GenerationOptimizer { + + def randomForestCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[RandomForestConfig] = + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateRandomForestCandidates() + + def decisionTreesCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[TreesConfig] = + new 
GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateDecisionTreesCandidates() + + def gbtCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[GBTConfig] = + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateGBTCandidates() + + def linearRegressionCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[LinearRegressionConfig] = + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateLinearRegressionCandidates() + + def logisticRegressionCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[LogisticRegressionConfig] = + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateLogisticRegressionCandidates() + + def svmCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[SVMConfig] = + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateSVMCandidates() + + def xgBoostCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[XGBoostConfig] = + new GenerationOptimizer( + modelType, + 
regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateXGBoostCandidates() + + def lightGBMCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int + )(implicit c: ClassTag[A]): Array[LightGBMConfig] = { + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateLightGBMCandidates() + } + + def mlpcCandidates[A, B]( + modelType: String, + regressorType: String, + history: ArrayBuffer[A], + candidates: Array[B], + optimizationType: String, + candidateCount: Int, + inputFeatures: Int, + distinctClasses: Int + )(implicit c: ClassTag[A]): Array[MLPCConfig] = + new GenerationOptimizer( + modelType, + regressorType, + history, + candidates, + optimizationType, + candidateCount + ).generateMLPCCandidates(inputFeatures, distinctClasses) + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/HyperParameterFullSearch.scala b/src/main/scala/com/databricks/labs/automl/model/tools/HyperParameterFullSearch.scala new file mode 100644 index 00000000..4a79c89e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/HyperParameterFullSearch.scala @@ -0,0 +1,658 @@ +package com.databricks.labs.automl.model.tools + +import com.databricks.labs.automl.model.tools.structures._ +import com.databricks.labs.automl.params._ + +import scala.collection.mutable.ArrayBuffer + +class HyperParameterFullSearch extends Defaults with ModelConfigGenerators { + + var _modelFamily = "" + var _modelType = "" + var _permutationCount = 10 + var _indexMixingMode = "linear" + var _arraySeed = 42L + + private val allowableMixingModes = List("linear", "random") + + def setModelFamily(value: String): this.type = { + require( + _supportedModels.contains(value), + s"${this.getClass.toString} error! Model Family $value is not supported." 
+ + s"\n\t Supported families: ${_supportedModels.mkString(", ")}" + ) + _modelFamily = value + this + } + + def setModelType(value: String): this.type = { + value match { + case "classifier" => _modelType = value + case "regressor" => _modelType = value + case _ => + throw new UnsupportedOperationException( + s"Model type $value is not supported." + ) + } + this + } + + def setPermutationCount(value: Int): this.type = { + _permutationCount = value + this + } + + def setIndexMixingMode(value: String): this.type = { + require( + allowableMixingModes.contains(value), + s"Index Mixing mode $value is not supported. Allowable modes are: " + + s"${allowableMixingModes.mkString(", ")}" + ) + _indexMixingMode = value + this + } + + def setArraySeed(value: Long): this.type = { + _arraySeed = value + this + } + + def getModelFamily: String = _modelFamily + def getModelType: String = _modelType + def getPermutationCount: Int = _permutationCount + def getIndexMixingMode: String = _indexMixingMode + def getArraySeed: Long = _arraySeed + + /** + * Method for generating a geometric space search for a first-generation hyper parameter generation for RandomForest + * @param numericBoundaries The allowable restrictive space for the numeric hyper parameters + * @param stringBoundaries The allowable values for string-based hyper parameters + * @return An Array of Hyperparameter settings for RandomForest algorithms. 
+ */ + def initialGenerationSeedRandomForest( + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]] + ): Array[RandomForestConfig] = { + + var outputPayload = new ArrayBuffer[RandomForestConfig]() + + val impurityValues = _modelType match { + case "regressor" => List("variance") + case _ => stringBoundaries("impurity") + } + + // Set the config object + val rfConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = stringBoundaries + ) + + // Generate the permutation collections + + val generatedArrays = randomForestNumericArrayGenerator(rfConfig) + + // Create some index values + var _impurityIdx = 0 + var _featureSubsetStrategyIdx = 0 + + var numericArrays = Array( + generatedArrays.numTreesArray, + generatedArrays.maxBinsArray, + generatedArrays.maxDepthArray, + generatedArrays.minInfoGainArray, + generatedArrays.subSamplingRateArray + ) + + // Main builder loop + for (i <- 1 to _permutationCount) { + + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." 
+ ) + } + + numericArrays = selectedIndeces.remainingPayload + + // Handle the string value selections + val impurityLoop = selectStringIndex(impurityValues, _impurityIdx) + + _impurityIdx = impurityLoop.IndexCounterStatus + + val featureSubsetStrategyLoop = selectStringIndex( + stringBoundaries("featureSubsetStrategy"), + _featureSubsetStrategyIdx + ) + + _featureSubsetStrategyIdx = featureSubsetStrategyLoop.IndexCounterStatus + + outputPayload += RandomForestConfig( + numTrees = selectedIndeces.selectedPayload(0).toInt, + impurity = impurityLoop.selectedStringValue, + maxBins = selectedIndeces.selectedPayload(1).toInt, + maxDepth = selectedIndeces.selectedPayload(2).toInt, + minInfoGain = selectedIndeces.selectedPayload(3), + subSamplingRate = selectedIndeces.selectedPayload(4), + featureSubsetStrategy = featureSubsetStrategyLoop.selectedStringValue + ) + _impurityIdx += 1 + _featureSubsetStrategyIdx += 1 + } + + outputPayload.result.toArray + + } + + /** + * Method for generating a geometric search space for a first-generation hyper parameter generation for LightGBM + * @param numericBoundaries LightGBM numeric search space boundaries + * @param stringBoundaries LightGBM string search space boundaries + * @return An array of LightGBM configs + * @since 0.6.1 + * @author Ben Wilson, Databricks + * @throws UnsupportedOperationException if the index mixing mode supplied is invalid. 
+ */ + @throws(classOf[UnsupportedOperationException]) + def initialGenerationSeedLightGBM( + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]] + ): Array[LightGBMConfig] = { + + var outputPayload = new ArrayBuffer[LightGBMConfig]() + + val lightGBMConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = stringBoundaries + ) + + val generatedArrays = lightGBMNumericArrayGenerator(lightGBMConfig) + + var _boostFromAverageIdx = 0 + var _boostingTypeIdx = 0 + + var numericArrays = Array( + generatedArrays.baggingFractionArray, + generatedArrays.baggingFreqArray, + generatedArrays.featureFractionArray, + generatedArrays.learningRateArray, + generatedArrays.maxBinArray, + generatedArrays.maxDepthArray, + generatedArrays.minSumHessianInLeafArray, + generatedArrays.numIterationsArray, + generatedArrays.numLeavesArray, + generatedArrays.lambdaL1Array, + generatedArrays.lambdaL2Array, + generatedArrays.alphaArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." 
+ ) + } + numericArrays = selectedIndeces.remainingPayload + + val boostFromAverageLoop = selectCoinFlip(_boostFromAverageIdx) + val boostingTypeLoop = + selectStringIndex(stringBoundaries("boostingType"), _boostingTypeIdx) + _boostingTypeIdx = boostingTypeLoop.IndexCounterStatus + + outputPayload += LightGBMConfig( + baggingFraction = selectedIndeces.selectedPayload(0), + baggingFreq = selectedIndeces.selectedPayload(1).toInt, + featureFraction = selectedIndeces.selectedPayload(2), + learningRate = selectedIndeces.selectedPayload(3), + maxBin = selectedIndeces.selectedPayload(4).toInt, + maxDepth = selectedIndeces.selectedPayload(5).toInt, + minSumHessianInLeaf = selectedIndeces.selectedPayload(6), + numIterations = selectedIndeces.selectedPayload(7).toInt, + numLeaves = selectedIndeces.selectedPayload(8).toInt, + boostFromAverage = boostFromAverageLoop, + lambdaL1 = selectedIndeces.selectedPayload(9), + lambdaL2 = selectedIndeces.selectedPayload(10), + alpha = selectedIndeces.selectedPayload(11), + boostingType = boostingTypeLoop.selectedStringValue + ) + _boostFromAverageIdx += 1 + _boostingTypeIdx += 1 + + } + + outputPayload.result.toArray + + } + + /** + * Method for generating a geometric search space for a first-generation hyper parameter generation for DecisionTrees + * @param numericBoundaries numeric bounds restrictions + * @param stringBoundaries string value restrictions + * @return An Array of Hyperparameter settings for DecisionTrees algorithms + */ + def initialGenerationSeedTrees( + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]] + ): Array[TreesConfig] = { + + var outputPayload = new ArrayBuffer[TreesConfig]() + + val impurityValues = _modelType match { + case "regressor" => List("variance") + case _ => stringBoundaries("impurity") + } + + val treesConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + 
stringBoundaries = stringBoundaries + ) + + val generatedArrays = treesNumericArrayGenerator(treesConfig) + + var _impurityIdx = 0 + + var numericArrays = Array( + generatedArrays.maxBinsArray, + generatedArrays.maxDepthArray, + generatedArrays.minInfoGainArray, + generatedArrays.minInstancesPerNodeArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." + ) + } + + numericArrays = selectedIndeces.remainingPayload + + val impurityLoop = selectStringIndex(impurityValues, _impurityIdx) + _impurityIdx = impurityLoop.IndexCounterStatus + + outputPayload += TreesConfig( + impurity = impurityLoop.selectedStringValue, + maxBins = selectedIndeces.selectedPayload(0).toInt, + maxDepth = selectedIndeces.selectedPayload(1).toInt, + minInfoGain = selectedIndeces.selectedPayload(2), + minInstancesPerNode = selectedIndeces.selectedPayload(3).toInt + ) + _impurityIdx += 1 + } + + outputPayload.result.toArray + + } + + def initialGenerationSeedGBT( + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]] + ): Array[GBTConfig] = { + var outputPayload = new ArrayBuffer[GBTConfig]() + + val impurityValues = _modelType match { + case "regressor" => List("variance") + case _ => stringBoundaries("impurity") + } + val lossTypeValues = _modelType match { + case "regressor" => List("squared", "absolute") + case _ => stringBoundaries("lossType") + } + + val gbtConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = stringBoundaries + ) + + val generatedArrays = gbtNumericArrayGenerator(gbtConfig) + + var _impurityIdx = 0 + var _lossTypeIdx = 0 + + var numericArrays = Array( + 
generatedArrays.maxBinsArray, + generatedArrays.maxDepthArray, + generatedArrays.maxIterArray, + generatedArrays.minInfoGainArray, + generatedArrays.minInstancesPerNodeArray, + generatedArrays.stepSizeArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." + ) + } + + numericArrays = selectedIndeces.remainingPayload + + val impurityLoop = selectStringIndex(impurityValues, _impurityIdx) + val lossTypeLoop = selectStringIndex(lossTypeValues, _lossTypeIdx) + _impurityIdx = impurityLoop.IndexCounterStatus + _lossTypeIdx = lossTypeLoop.IndexCounterStatus + + outputPayload += GBTConfig( + impurity = impurityLoop.selectedStringValue, + lossType = lossTypeLoop.selectedStringValue, + maxBins = selectedIndeces.selectedPayload(0).toInt, + maxDepth = selectedIndeces.selectedPayload(1).toInt, + maxIter = selectedIndeces.selectedPayload(2).toInt, + minInfoGain = selectedIndeces.selectedPayload(3), + minInstancesPerNode = selectedIndeces.selectedPayload(4).toInt, + stepSize = selectedIndeces.selectedPayload(5) + ) + _impurityIdx += 1 + _lossTypeIdx += 1 + } + outputPayload.result.toArray + + } + + def initialGenerationSeedLinearRegression( + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]] + ): Array[LinearRegressionConfig] = { + var outputPayload = new ArrayBuffer[LinearRegressionConfig]() + + val linearRegressionConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = stringBoundaries + ) + + val generatedArrays = linearRegressionNumericArrayGenerator( + linearRegressionConfig + ) + + var _fitInterceptIdx = 0 + var _standardizationIdx = 0 + var _lossIdx = 0 + + 
var numericArrays = Array( + generatedArrays.elasticNetParamsArray, + generatedArrays.maxIterArray, + generatedArrays.regParamArray, + generatedArrays.toleranceArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." + ) + } + + numericArrays = selectedIndeces.remainingPayload + + val fitInterceptLoop = selectCoinFlip(_fitInterceptIdx) + val standardizationLoop = selectCoinFlip(_standardizationIdx) + val lossLoop = selectStringIndex(stringBoundaries("loss"), _lossIdx) + _lossIdx = lossLoop.IndexCounterStatus + + /** + * For Linear Regression, the loss setting of 'huber' does not permit regularization of elasticnet or L1. + * It must be set to L2 regularization (elasticNetParams == 0.0) to function. + */ + val loss = lossLoop.selectedStringValue + val elasticNetParams = loss match { + case "huber" => 0.0 + case _ => selectedIndeces.selectedPayload(0) + } + + outputPayload += LinearRegressionConfig( + loss = loss, + elasticNetParams = elasticNetParams, + fitIntercept = fitInterceptLoop, + maxIter = selectedIndeces.selectedPayload(1).toInt, + regParam = selectedIndeces.selectedPayload(2), + standardization = standardizationLoop, + tolerance = selectedIndeces.selectedPayload(3) + ) + _lossIdx += 1 + _standardizationIdx += 1 + _fitInterceptIdx += 1 + } + outputPayload.result.toArray + } + + def initialGenerationSeedLogisticRegression( + numericBoundaries: Map[String, (Double, Double)] + ): Array[LogisticRegressionConfig] = { + + var outputPayload = new ArrayBuffer[LogisticRegressionConfig]() + + val logisticRegressionConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = Map[String, List[String]]() + 
) + + val generatedArrays = logisticRegressionNumericArrayGenerator( + logisticRegressionConfig + ) + + var _fitInterceptIdx = 0 + var _standardizationIdx = 0 + + var numericArrays = Array( + generatedArrays.elasticNetParamsArray, + generatedArrays.maxIterArray, + generatedArrays.regParamArray, + generatedArrays.toleranceArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." + ) + } + + numericArrays = selectedIndeces.remainingPayload + + val fitInterceptLoop = selectCoinFlip(_fitInterceptIdx) + val standardizationLoop = selectCoinFlip(_standardizationIdx) + + outputPayload += LogisticRegressionConfig( + elasticNetParams = selectedIndeces.selectedPayload(0), + fitIntercept = fitInterceptLoop, + maxIter = selectedIndeces.selectedPayload(1).toInt, + regParam = selectedIndeces.selectedPayload(2), + standardization = standardizationLoop, + tolerance = selectedIndeces.selectedPayload(3) + ) + _standardizationIdx += 1 + _fitInterceptIdx += 1 + } + outputPayload.result.toArray + + } + + def initialGenerationSeedSVM( + numericBoundaries: Map[String, (Double, Double)] + ): Array[SVMConfig] = { + + var outputPayload = new ArrayBuffer[SVMConfig]() + + val svmConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = Map[String, List[String]]() + ) + + val generatedArrays = svmNumericArrayGenerator(svmConfig) + + var _fitInterceptIdx = 0 + var _standardizationIdx = 0 + + var numericArrays = Array( + generatedArrays.maxIterArray, + generatedArrays.regParamArray, + generatedArrays.toleranceArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => 
staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." + ) + } + + numericArrays = selectedIndeces.remainingPayload + + val fitInterceptLoop = selectCoinFlip(_fitInterceptIdx) + val standardizationLoop = selectCoinFlip(_standardizationIdx) + + outputPayload += SVMConfig( + fitIntercept = fitInterceptLoop, + maxIter = selectedIndeces.selectedPayload(0).toInt, + regParam = selectedIndeces.selectedPayload(1), + standardization = standardizationLoop, + tolerance = selectedIndeces.selectedPayload(2) + ) + _standardizationIdx += 1 + _fitInterceptIdx += 1 + } + outputPayload.result.toArray + + } + + def initialGenerationSeedXGBoost( + numericBoundaries: Map[String, (Double, Double)] + ): Array[XGBoostConfig] = { + + var outputPayload = new ArrayBuffer[XGBoostConfig]() + + val xgboostConfig = PermutationConfiguration( + modelType = _modelType, + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = Map[String, List[String]]() + ) + + val generatedArrays = xgboostNumericArrayGenerator(xgboostConfig) + + var numericArrays = Array( + generatedArrays.alphaArray, + generatedArrays.etaArray, + generatedArrays.gammaArray, + generatedArrays.lambdaArray, + generatedArrays.maxDepthArray, + generatedArrays.subSampleArray, + generatedArrays.minChildWeightArray, + generatedArrays.numRoundArray, + generatedArrays.maxBinsArray, + generatedArrays.trainTestRatioArray + ) + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => staticIndexSelection(numericArrays) + case "random" => randomIndexSelection(numericArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." 
+ ) + } + + numericArrays = selectedIndeces.remainingPayload + + outputPayload += XGBoostConfig( + alpha = selectedIndeces.selectedPayload(0), + eta = selectedIndeces.selectedPayload(1), + gamma = selectedIndeces.selectedPayload(2), + lambda = selectedIndeces.selectedPayload(3), + maxDepth = selectedIndeces.selectedPayload(4).toInt, + subSample = selectedIndeces.selectedPayload(5), + minChildWeight = selectedIndeces.selectedPayload(6), + numRound = selectedIndeces.selectedPayload(7).toInt, + maxBins = selectedIndeces.selectedPayload(8).toInt, + trainTestRatio = selectedIndeces.selectedPayload(9) + ) + + } + outputPayload.result.toArray + + } + + def initialGenerationSeedMLPC( + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]], + inputFeatureSize: Int, + distinctClasses: Int + ): Array[MLPCConfig] = { + + var outputPayload = new ArrayBuffer[MLPCConfig]() + + val mlpcConfig = MLPCPermutationConfiguration( + permutationTarget = _permutationCount, + numericBoundaries = numericBoundaries, + stringBoundaries = stringBoundaries, + inputFeatureSize = inputFeatureSize, + distinctClasses = distinctClasses + ) + + var generatedArrays = mlpcNumericArrayGenerator(mlpcConfig) + + var _solverIdx = 0 + + for (i <- 1 to _permutationCount) { + val selectedIndeces = _indexMixingMode match { + case "linear" => mlpcStaticIndexSelection(generatedArrays) + case "random" => mlpcRandomIndexSelection(generatedArrays) + case _ => + throw new UnsupportedOperationException( + s"Index mixing mode ${_indexMixingMode} is not supported." 
+ ) + } + + generatedArrays = selectedIndeces.remainingPayloads + + val solverLoop = selectStringIndex(stringBoundaries("solver"), _solverIdx) + _solverIdx = solverLoop.IndexCounterStatus + + outputPayload += MLPCConfig( + layers = selectedIndeces.selectedPayload.layers, + maxIter = selectedIndeces.selectedPayload.maxIter, + solver = solverLoop.selectedStringValue, + stepSize = selectedIndeces.selectedPayload.stepSize, + tolerance = selectedIndeces.selectedPayload.tolerance + ) + _solverIdx += 1 + } + outputPayload.result.toArray + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/LightGBMBase.scala b/src/main/scala/com/databricks/labs/automl/model/tools/LightGBMBase.scala new file mode 100644 index 00000000..b5550fd1 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/LightGBMBase.scala @@ -0,0 +1,135 @@ +package com.databricks.labs.automl.model.tools + +import com.databricks.labs.automl.exceptions.LightGBMModelTypeException + +import scala.language.implicitConversions + +object GBMTypes extends Enumeration { + val GBMHuber = GBM("gbmHuber", "regressor") + val GBMFair = GBM("gbmFair", "regressor") + val GBMLasso = GBM("gbmLasso", "regressor") + val GBMRidge = GBM("gbmRidge", "regressor") + val GBMPoisson = GBM("gbmPoisson", "regressor") + val GBMQuantile = GBM("gbmQuantile", "regressor") + val GBMMape = GBM("gbmMape", "regressor") + val GBMTweedie = GBM("gbmTweedie", "regressor") + val GBMGamma = GBM("gbmGamma", "regressor") + val GBMBinary = GBM("gbmBinary", "classifier") + val GBMMulti = GBM("gbmMulti", "classifier") + val GBMMultiOVA = GBM("gbmMultiOVA", "classifier") + protected case class GBM(gbmType: String, modelType: String) + extends super.Val() + implicit def convert(value: Value): GBM = value.asInstanceOf[GBM] +} + +object InitialGenerationMode extends Enumeration { + type InitialGenerationMode = Value + val RANDOM, PERMUTATIONS = Value + +} + +trait LightGBMBase { + + import GBMTypes._ + import 
InitialGenerationMode._ + + final val allowableLightGBMRegressorTypes = Array( + "gbmHuber", + "gbmFair", + "gbmLasso", + "gbmRidge", + "gbmPoisson", + "gbmQuantile", + "gbmMape", + "gbmTweedie", + "gbmGamma" + ) + final val allowableLightGBMClassifierTypes = + Array("gbmBinary", "gbmMulti", "gbmMultiOVA") + + final val BARRIER_MODE = false + final val TIMEOUT = 36000 + + protected[model] def getGBMType(modelSelection: String, + lightGBMType: String): GBMTypes.Value = { + + (modelSelection, lightGBMType) match { + case ("classifier", "gbmBinary") => GBMBinary + case ("classifier", "gbmMulti") => GBMMulti + case ("classifier", "gbmMultiOVA") => GBMMultiOVA + case ("regressor", "gbmHuber") => GBMHuber + case ("regressor", "gbmFair") => GBMFair + case ("regressor", "gbmLasso") => GBMLasso + case ("regressor", "gbmRidge") => GBMRidge + case ("regressor", "gbmPoisson") => GBMPoisson + case ("regressor", "gbmQuantile") => GBMQuantile + case ("regressor", "gbmMape") => GBMMape + case ("regressor", "gbmTweedie") => GBMTweedie + case ("regressor", "gbmGamma") => GBMGamma + case _ => + throw LightGBMModelTypeException( + modelSelection, + lightGBMType, + allowableLightGBMRegressorTypes, + allowableLightGBMClassifierTypes + ) + } + } + + protected[model] def getInitialGenMode( + mode: String + ): InitialGenerationMode = { + mode match { + case "random" => RANDOM + case "permutations" => PERMUTATIONS + } + } + +} + +/** + +// https://sites.google.com/view/lauraepp/parameters + +Regressor -> + +alpha -> Double huber loss and quantile regression default: 0.9 + + + +Classifier -> + +baggingFraction -> Double 0:1 (random bagging selection) default 1.0 +baggingFreq -> Int (perform bagging at every k interval) default: 0:10?
+baggingSeed -> Int -> Default 3 +featureFraction -> Double 0:1 can be used to speed up training and prevent overfitting +lambdaL1 -> Double >=0.0 sets l1 regularization (lasso) default 0.0 +lambdaL2 -> Double >=0.0 sets l2 regularization (ridge) default 0.0 +learningRate -> Double 0:1 default 0.1 +maxBin -> Int compression efficiency and lower values can prevent overfitting. default 255 +maxDepth -> Int control the maximum depth of trees default: -1 3:15? +minSumHessianInLeaf -> Double used to deal with overfitting LOG SCALE default 1e-3 +numIterations -> Int built by class count * numIterations default 100 +numLeaves -> Int maximum number of leaves in one tree default 31 + + + +boostFromAverage -> Boolean Adjusts Initial Score for faster convergence default: True + + +boostingType: String -> gbdt, rf, dart, goss default: gbdt +objective -> String + Regression -> regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie + Classification -> binary, multiclass, multiclassova + + + + +categoricalSlotNames ? List of Categorical Columns in the feature Vector (needed?) +earlyStoppingRound ? Set Early stopping for metric evaluation Int + +isUnbalance -> Boolean, set if a Binary Classification problem is heavily skewed default: false +timeout -> default 1200 (might want to increase this) +useBarrierExecutionMode -> default False, might want to try True to speed things up? 
+ + */ diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/ModelReporting.scala b/src/main/scala/com/databricks/labs/automl/model/tools/ModelReporting.scala new file mode 100644 index 00000000..adea142e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/ModelReporting.scala @@ -0,0 +1,343 @@ +package com.databricks.labs.automl.model.tools + +import java.util.UUID + +import com.databricks.labs.automl.params.{ + GBTConfig, + LightGBMConfig, + LinearRegressionConfig, + LogisticRegressionConfig, + MLPCConfig, + RandomForestConfig, + SVMConfig, + TreesConfig, + XGBoostConfig +} + +class ModelReporting(modelType: String, metrics: List[String]) { + + final val _runStart = System.currentTimeMillis / 1000 + + /** + * Private method for generating the run score string + * @param scoreBattery The collection of scores for each of the scoring methodologies + * @return The formatted string for reporting out the model validation scores + * @since 0.5.1 + * @author Ben Wilson + */ + private def getRunScores(scoreBattery: Map[String, Double]): String = { + + val builtString = new StringBuilder() + + "\n\t\tScores: \n".flatMap(x => builtString += x) + + metrics.foreach { x => + s"\t\t\t[$x] -> [${scoreBattery(x)}]\n".flatMap(y => builtString += y) + } + + builtString.toString + + } + + /** + * Private method for generating the parameters as a string for stdout and log4j recording of the run information + * @param config Any: Config for the model hyper parameter collection + * @return String formatted for the hyper parameters. 
+ * @since 0.5.1 + * @author Ben Wilson + */ + private def getParams(config: Any, formatter: String): String = { + + modelType match { + case "xgboost" => + convertXGBoostConfigToHumanReadable( + config.asInstanceOf[XGBoostConfig], + formatter + ) + case "lightgbm" => + convertLightGBMConfigToHumanReadable( + config.asInstanceOf[LightGBMConfig], + formatter + ) + case "trees" => + convertTreesConfigToHumanReadable( + config.asInstanceOf[TreesConfig], + formatter + ) + case "gbt" => + convertGBTConfigToHumanReadable( + config.asInstanceOf[GBTConfig], + formatter + ) + case "linearRegression" => + convertLinearRegressionConfigToHumanReadable( + config.asInstanceOf[LinearRegressionConfig], + formatter + ) + case "logisticRegression" => + convertLogisticRegressionConfigToHumanReadable( + config.asInstanceOf[LogisticRegressionConfig], + formatter + ) + case "mlpc" => + convertMLPCConfigToHumanReadable( + config.asInstanceOf[MLPCConfig], + formatter + ) + case "randomForest" => + convertRFConfigToHumanReadable( + config.asInstanceOf[RandomForestConfig], + formatter + ) + case "svm" => + convertSVMConfigToHumanReadable( + config.asInstanceOf[SVMConfig], + formatter + ) + } + + } + + /** + * Private method for getting the current run progress as a formatted string + * @param runProgressPercentage Utilizes the method from within the Evolution() trait for calculating the + * estimated % complete that the job has achieved thus far.
+ * @return String formatted for % complete of the run + * @since 0.5.1 + * @author Ben Wilson + */ + private def getRunProgress(runProgressPercentage: Double): String = { + s"\t\tCurrent Modeling Progress complete for $modelType: " + + f"$runProgressPercentage%2.4f%%" + } + + /** + * Public method for getting the number of seconds since the modeling family job has started and the current process + * deltas for generation epoch training + * @param modelStart The kFold start time of the model generation's individual training session + * @return String formatted time deltas. + * @since 0.5.1 + * @author Ben Wilson + */ + def generateModelTime(modelStart: Long): String = { + + val currentTime = System.currentTimeMillis / 1000 + val deltaTime = currentTime - modelStart + val batteryDelta = currentTime - _runStart + + s"Completed $modelType in $deltaTime seconds. Generation run time: $batteryDelta seconds." + + } + + /** + * Public method for creating the run start statement as a string based on the uuid of the model run and the + * hyper parameter settings that are being used. + * @param runId UUID representing the unique identifier for the generated model + * @param config The hyper parameter configuration for the run + * @return String a Human readable string for stdout and logging in log4j + * @since 0.5.1 + * @author Ben Wilson + */ + def generateRunStartStatement(runId: UUID, config: Any): String = { + s"Starting run $runId with Params: ${getParams(config, " ")}" + } + + /** + * Public method for reporting on the general completion status of the run. + * @param generation The current generation that the algorithm is on + * @param runProgressPercentage The percentage from Evolution() trait for calculating the current completion + * percentage. + * @return Human readable string for the generation start message. 
+ * @since 0.5.1 + * @author Ben Wilson + */ + def generateGenerationStartStatement( + generation: Int, + runProgressPercentage: Double + ): String = { + f"Starting Generation $generation \n\t\t Completion Status: $runProgressPercentage%2.4f%%" + } + + /** + * Public accessor method for generating the print and logging string payload for modeling results and progress + * @param runId The uuid for the individual model training run + * @param scoreBattery The resulting score payload for the model (all of the scoring metrics) + * @param targetMetric The metric that is being used to adjust model tuning selection + * @param config The hyper parameter configuration for the individual model run + * @param progress The calculated progress of the model as a Double. + * @return String block of text that reports out the run results. + * @since 0.5.1 + * @author Ben Wilson + */ + def generateRunScoreStatement(runId: UUID, + scoreBattery: Map[String, Double], + targetMetric: String, + config: Any, + progress: Double, + modelStartTime: Long): String = { + + val scoreText = getRunScores(scoreBattery) + + val outputText = { + s"\tFinished run $runId with optimization target [$targetMetric] value: ${scoreBattery(targetMetric)} " + + s"\n\tWith full scoring breakdown of: $scoreText" + + s"\n\tWith hyper-parameters: ${getParams(config, "\n\t\t\t\t")}" + + s"\n${getRunProgress(progress)}" + + s"\n\t\t${generateModelTime(modelStartTime)}" + } + + outputText + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable.
+ */ + private def convertXGBoostConfigToHumanReadable(conf: XGBoostConfig, + formatter: String): String = { + s"\n\t\t\tConfig: $formatter[alpha] -> [${conf.alpha.toString}]" + + s"$formatter[eta] -> [${conf.eta.toString}]" + + s"$formatter[gamma] -> [${conf.gamma.toString}]" + + s"$formatter[lambda] -> [${conf.lambda.toString}]" + + s"$formatter[maxBins] -> [${conf.maxBins.toString}]" + + s"$formatter[maxDepth] -> [${conf.maxDepth.toString}]" + + s"$formatter[minChildWeight] -> [${conf.minChildWeight.toString}]" + + s"$formatter[numRound] -> [${conf.numRound.toString}]" + + s"$formatter[subSample] -> [${conf.subSample.toString}]" + + s"$formatter[trainTestRatio] -> [${conf.trainTestRatio.toString}]" + } + + private def convertLightGBMConfigToHumanReadable( + conf: LightGBMConfig, + formatter: String + ): String = { + s"\n\t\t\tConfig: $formatter[baggingFraction] -> [${conf.baggingFraction.toString}]" + + s"$formatter[baggingFreq] -> [${conf.baggingFreq.toString}]" + + s"$formatter[featureFraction] -> [${conf.featureFraction.toString}]" + + s"$formatter[learningRate] -> [${conf.learningRate.toString}]" + + s"$formatter[maxBin] -> [${conf.maxBin.toString}]" + + s"$formatter[maxDepth] -> [${conf.maxDepth.toString}]" + + s"$formatter[minSumHessianInLeaf] -> [${conf.minSumHessianInLeaf.toString}]" + + s"$formatter[numIterations] -> [${conf.numIterations.toString}]" + + s"$formatter[numLeaves] -> [${conf.numLeaves.toString}]" + + s"$formatter[boostFromAverage] -> [${conf.boostFromAverage.toString}]" + + s"$formatter[lambdaL1] -> [${conf.lambdaL1.toString}]" + + s"$formatter[lambdaL2] -> [${conf.lambdaL2.toString}]" + + s"$formatter[alpha] -> [${conf.alpha.toString}]" + + s"$formatter[boostingType] -> [${conf.boostingType.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is
readable. + */ + private def convertTreesConfigToHumanReadable(conf: TreesConfig, + formatter: String): String = { + s"\n\t\t\tConfig: $formatter[impurity] -> [${conf.impurity}]" + + s"$formatter[maxBins] -> [${conf.maxBins.toString}]" + + s"$formatter[maxDepth] -> [${conf.maxDepth.toString}]" + + s"$formatter[minInfoGain] -> [${conf.minInfoGain.toString}]" + + s"$formatter[minInstancesPerNode] -> [${conf.minInstancesPerNode.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable. + */ + private def convertGBTConfigToHumanReadable(conf: GBTConfig, + formatter: String): String = { + s"\n\t\t\tConfig: $formatter[impurity] -> [${conf.impurity}]" + + s"$formatter[lossType] -> [${conf.lossType}]" + + s"$formatter[maxBins] -> [${conf.maxBins.toString}]" + + s"$formatter[maxDepth] -> [${conf.maxDepth.toString}]" + + s"$formatter[maxIter] -> [${conf.maxIter.toString}]" + + s"$formatter[minInfoGain] -> [${conf.minInfoGain.toString}]" + + s"$formatter[minInstancesPerNode] -> [${conf.minInstancesPerNode.toString}]" + + s"$formatter[stepSize] -> [${conf.stepSize.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable. 
+ */ + private def convertLinearRegressionConfigToHumanReadable( + conf: LinearRegressionConfig, + formatter: String + ): String = { + s"\n\t\t\tConfig: $formatter[elasticNetParams] -> [${conf.elasticNetParams.toString}]" + + s"$formatter[fitIntercept] -> [${conf.fitIntercept.toString}]" + + s"$formatter[loss] -> [${conf.loss}]" + + s"$formatter[maxIter] -> [${conf.maxIter.toString}]" + + s"$formatter[regParam] -> [${conf.regParam.toString}]" + + s"$formatter[standardization] -> [${conf.standardization.toString}]" + + s"$formatter[tolerance] -> [${conf.tolerance.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable. + */ + private def convertLogisticRegressionConfigToHumanReadable( + conf: LogisticRegressionConfig, + formatter: String + ): String = { + s"\n\t\t\tConfig: $formatter[elasticNetParams] -> [${conf.elasticNetParams.toString}]" + + s"$formatter[fitIntercept] -> [${conf.fitIntercept.toString}]" + + s"$formatter[maxIter] -> [${conf.maxIter.toString}]" + + s"$formatter[regParam] -> [${conf.regParam.toString}]" + + s"$formatter[standardization] -> [${conf.standardization.toString}]" + + s"$formatter[tolerance] -> [${conf.tolerance.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable. 
+ */ + private def convertMLPCConfigToHumanReadable(conf: MLPCConfig, + formatter: String): String = { + s"\n\t\t\tConfig: $formatter[layers] -> [${conf.layers.mkString(",")}]" + + s"$formatter[maxIter] -> [${conf.maxIter.toString}] $formatter[solver] -> [${conf.solver}]" + + s"$formatter[stepSize] -> [${conf.stepSize.toString}]$formatter[tolerance] -> [${conf.tolerance.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable. + */ + private def convertRFConfigToHumanReadable(conf: RandomForestConfig, + formatter: String): String = { + s"\n\t\t\tConfig: $formatter[featureSubsetStrategy] -> [${conf.featureSubsetStrategy}]" + + s"$formatter[impurity] -> [${conf.impurity}]$formatter[maxBins] -> [${conf.maxBins.toString}]" + + s"$formatter[maxDepth] -> [${conf.maxDepth.toString}]$formatter[minInfoGain] -> [${conf.minInfoGain.toString}]" + + s"$formatter[numTrees] -> [${conf.numTrees.toString}]$formatter[subSamplingRate] -> [${conf.subSamplingRate.toString}]" + } + + /** + * Private method for making stdout and logging of params much more readable, particularly for the array objects + * + * @param conf The configuration of the run (hyper parameters) + * @return A string representation that is readable. 
+ */ + private def convertSVMConfigToHumanReadable(conf: SVMConfig, + formatter: String): String = { + s"\n\t\t\tConfig: $formatter[fitIntercept] -> [${conf.fitIntercept.toString}]" + + s"$formatter[maxIter] -> [${conf.maxIter.toString}]" + + s"$formatter[regParam] -> [${conf.regParam.toString}]" + + s"$formatter[standardization] -> [${conf.standardization.toString}]" + + s"$formatter[tolerance] -> [${conf.tolerance.toString}]" + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/ModelUtils.scala b/src/main/scala/com/databricks/labs/automl/model/tools/ModelUtils.scala new file mode 100644 index 00000000..473a75ab --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/ModelUtils.scala @@ -0,0 +1,119 @@ +package com.databricks.labs.automl.model.tools + +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +object ModelUtils { + + final private val STRING_INDEX_SUFFIX = "_si" + final private val OHE_SUFFIX = "_oh" + final private val LOGGING_COL = "automl_internal_id" + final private val EXACT_CARDINALITY_CUTOFF = 50L + + /** + * Private method for getting the cardinality of a string indexed column to ensure that + * @param df Source DataFrame + * @param field field to test exact cardinality against + * @return Integer count of distinct entries in the column + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def getExactFieldCardinality(df: DataFrame, field: String): Int = { + + df.select(field).distinct().count().toInt + + } + + /** + * Private method for getting an approximate cardinality for numeric field types (not previously string indexed) + * using approx distinct here due to speed and the prevention of a massive shuffle in the instance of high + * cardinality fields. If the approximate value is below a certain threshold, then it will be eligible for + * exact measurement to ensure that maxBins threshold minimum will not cause an exception to be thrown. 
+ * @param df DataFrame for testing cardinality of fields + * @param field Field to test cardinality for + * @return Long - the estimated cardinality for the field + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def getApproxFieldCardinality(df: DataFrame, field: String): Long = { + + df.select(approx_count_distinct(field, 0.1).alias("approx")) + .first() + .getAs[Long]("approx") + + } + + /** + * Method for readjusting the search space for tree-based algorithms to ensure that maxBins search space does not + * initiate a model run where maxBins value is below the cardinality value of nominal fields in the data set. + * Having a cardinality of a field that is higher than maxBins will prevent calculation of InformationGain / gini + * for tree split calculations, since it won't be able to adequately perform the summarization of values + * for the entropy calculation. Resetting the search space based on the data presented for modeling will eliminate + * the possibility of attempting to search an invalid space. 
+ * @param df DataFrame prepared for modeling + * @param fieldsToIgnore fields to ignore from cardinality checks + * @param labelCol label field (not needed for cardinality check) + * @param featuresCol feature field (not needed for cardinality check) + * @return An updated NumericMapping for the model's search space (where maxBins is located for the tree based + * algorithms) + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def resetTreeBinsSearchSpace( + df: DataFrame, + numericMap: Map[String, (Double, Double)], + fieldsToIgnore: Array[String], + labelCol: String, + featuresCol: String + ): Map[String, (Double, Double)] = { + + val numericFields = df.schema.names + .filterNot(_.endsWith(STRING_INDEX_SUFFIX)) + .filterNot(_.endsWith(OHE_SUFFIX)) + .filterNot(fieldsToIgnore.contains) + .filterNot(x => x == labelCol) + .filterNot(x => x == featuresCol) + .filterNot(x => x == LOGGING_COL) + + val numericNominalCandidates = numericFields.foldLeft(Array.empty[String]) { + case (accum, x) => + if (getApproxFieldCardinality(df, x) < EXACT_CARDINALITY_CUTOFF) + accum ++ Array(x) + else accum + } + + val categoricalFields = + df.schema.names + .filter(x => x.endsWith(STRING_INDEX_SUFFIX)) ++ numericNominalCandidates + + val maxBinsFloor = categoricalFields.foldLeft(0) { + + case (a, x) => + math.max(a, getExactFieldCardinality(df, x)) + } + 1 + + val maxBinsTuple = numericMap("maxBins") + + val upperBound = + if (maxBinsFloor > maxBinsTuple._2 - 25) maxBinsFloor + 100 + else maxBinsTuple._2 + + numericMap + ("maxBins" -> (maxBinsFloor, upperBound)) + + } + + def validateGBTClassifier(df: DataFrame, labelCol: String): Unit = { + + val distinctLabelValues = df.select(labelCol).distinct().count() + + distinctLabelValues match { + case x if x > 2L => + throw new IllegalArgumentException( + "GBT Classifier currently only supports binary " + + "classification. 
For multi-class, try 'trees', 'xgboost', 'randomforest', 'logistic', or 'mlpc'" + ) + case _ => () + } + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/PostModelingOptimization.scala b/src/main/scala/com/databricks/labs/automl/model/tools/PostModelingOptimization.scala new file mode 100644 index 00000000..df6113fc --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/PostModelingOptimization.scala @@ -0,0 +1,764 @@ +package com.databricks.labs.automl.model.tools + +import com.databricks.labs.automl.model.tools.structures._ +import com.databricks.labs.automl.params._ +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.spark.ml.PipelineModel +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer + +class PostModelingOptimization + extends Defaults + with ModelConfigGenerators + with SparkSessionWrapper { + + private final val PERMUTATION_FACTOR: Int = 10 + private final val PREDICTION_COL: String = "prediction" + private final val supportedOptimizationStrategies: List[String] = + List("minimize", "maximize") + + var _modelFamily = "" + var _modelType = "" + var _hyperParameterSpaceCount = 100000 + var _numericBoundaries: Map[String, (Double, Double)] = _ + var _stringBoundaries: Map[String, List[String]] = _ + var _seed: Long = 42L + var _optimizationStrategy: String = "maximize" + + def setModelFamily(value: String): this.type = { + require( + _supportedModels.contains(value), + s"${this.getClass.toString} error! Model Family $value is not supported." + + s"\n\t Supported families: ${_supportedModels.mkString(", ")}" + ) + _modelFamily = value + this + } + + def setModelType(value: String): this.type = { + value match { + case "classifier" => _modelType = value + case "regressor" => _modelType = value + case _ => + throw new UnsupportedOperationException( + s"Model type $value is not supported." 
+ ) + } + this + } + + def setHyperParameterSpaceCount(value: Int): this.type = { + if (value > 500000) + println( + "WARNING! Setting permutation counts above 500,000 will put stress on the driver." + ) + if (value > 1000000) + throw new UnsupportedOperationException( + s"Setting permutation above 1,000,000 is not supported" + + s" due to runtime considerations. $value is too large of a value." + ) + _hyperParameterSpaceCount = value + this + } + + def setNumericBoundaries(value: Map[String, (Double, Double)]): this.type = { + _numericBoundaries = value + this + } + + def setStringBoundaries(value: Map[String, List[String]]): this.type = { + _stringBoundaries = value + this + } + + def setSeed(value: Long): this.type = { + _seed = value + this + } + + def setOptimizationStrategy(value: String): this.type = { + + require( + supportedOptimizationStrategies.contains(value), + s"Optimization Strategy for Post Modeling Optimization " + + s"$value is not supported. Must be one of: ${supportedOptimizationStrategies.mkString(", ")}." 
+ ) + _optimizationStrategy = value + this + } + + def getModelFamily: String = _modelFamily + + def getModelType: String = _modelType + + def getHyperParameterSpaceCount: Int = _hyperParameterSpaceCount + + def getNumericBoundaries: Map[String, (Double, Double)] = _numericBoundaries + + def getStringBoundaries: Map[String, List[String]] = _stringBoundaries + + def getSeed: Long = _seed + + def getOptimizationStrategy: String = _optimizationStrategy + + private def generateGenericSearchSpace(): PermutationConfiguration = { + val calculatedPermutationValue = getPermutationCounts( + _hyperParameterSpaceCount, + _numericBoundaries.size + ) + + stringBoundaryPermutationCalculator(_stringBoundaries) + + PermutationConfiguration( + modelType = _modelType, + permutationTarget = calculatedPermutationValue, + numericBoundaries = _numericBoundaries, + stringBoundaries = _stringBoundaries + ) + } + + private def euclideanRestrict(df: DataFrame, + topPredictions: Int, + additionalFields: Array[String] = + Array[String]()): DataFrame = { + + EuclideanSpaceSearch( + df, + _numericBoundaries.keys.toArray, + _stringBoundaries.keys.toArray, + topPredictions, + additionalFields + ) + + } + + /** + * Private method for returning the top n hyper parameters based on the direction of optimization that should occur + * for the metric being evaluated. + * @param pipeline ML Pipeline object + * @param data DataFrame containing the hyper parameters to predict performance for + * @param topPredictions The number of potential candidates to return. 
+ * @return DataFrame of relevant candidates + * @since 0.6.1 + * @author Ben Wilson, Databricks + */ + private def transformAndLimit(pipeline: PipelineModel, + data: DataFrame, + topPredictions: Int): DataFrame = { + + _optimizationStrategy match { + case "minimize" => + pipeline + .transform(data) + .orderBy(col(PREDICTION_COL).asc) + .limit(topPredictions * PERMUTATION_FACTOR) + case _ => + pipeline + .transform(data) + .orderBy(col(PREDICTION_COL).desc) + .limit(topPredictions * PERMUTATION_FACTOR) + } + + } + + //RANDOM FOREST METHODS + /** + * Generates an array of RandomForestConfig hyper parameters to meet the configured target size + * @return a distinct array of RandomForestConfig's + */ + protected[tools] def generateRandomForestSearchSpace() + : Array[RandomForestConfig] = { + // Generate the Permutations + val permutationsArray = randomForestPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + + permutationsArray.distinct + } + + def generateRandomForestSearchSpaceAsDataFrame(): DataFrame = { + + spark.createDataFrame(generateRandomForestSearchSpace()) + + } + + protected[tools] def randomForestResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[RandomForestModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += RandomForestModelRunReport( + numTrees = hyperParams("numTrees").toString.toInt, + impurity = hyperParams("impurity").toString, + maxBins = hyperParams("maxBins").toString.toInt, + maxDepth = hyperParams("maxDepth").toString.toInt, + minInfoGain = hyperParams("minInfoGain").toString.toDouble, + subSamplingRate = hyperParams("subSamplingRate").toString.toDouble, + featureSubsetStrategy = hyperParams("featureSubsetStrategy").toString, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def randomForestPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + 
topPredictions: Int): Array[RandomForestConfig] = { + + val inferenceDataSet = randomForestResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = generateRandomForestSearchSpaceAsDataFrame() + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertRandomForestResultToConfig( + euclideanRestrict(restrictedData, topPredictions) + ) + + } + + //DECISION TREE METHODS + + protected[tools] def generateTreesSearchSpace(): Array[TreesConfig] = { + + val permutationsArray = treesPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateTreesSearchSpaceAsDataFrame(): DataFrame = { + spark.createDataFrame(generateTreesSearchSpace()) + } + + protected[tools] def treesResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[TreesModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += TreesModelRunReport( + impurity = hyperParams("impurity").toString, + maxBins = hyperParams("maxBins").toString.toInt, + maxDepth = hyperParams("maxDepth").toString.toInt, + minInfoGain = hyperParams("minInfoGain").toString.toDouble, + minInstancesPerNode = + hyperParams("minInstancesPerNode").toString.toDouble, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def treesPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int): Array[TreesConfig] = { + val inferenceDataSet = treesResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + 
.setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = generateTreesSearchSpaceAsDataFrame() + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertTreesResultToConfig( + euclideanRestrict(restrictedData, topPredictions) + ) + } + + //GBT METHODS + + protected[tools] def generateGBTSearchSpace(): Array[GBTConfig] = { + + val permutationsArray = gbtPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateGBTSearchSpaceAsDataFrame(): DataFrame = { + spark.createDataFrame(generateGBTSearchSpace()) + } + + protected[tools] def gbtResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[GBTModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += GBTModelRunReport( + impurity = hyperParams("impurity").toString, + lossType = hyperParams("lossType").toString, + maxBins = hyperParams("maxBins").toString.toInt, + maxDepth = hyperParams("maxDepth").toString.toInt, + maxIter = hyperParams("maxIter").toString.toInt, + minInfoGain = hyperParams("minInfoGain").toString.toDouble, + minInstancesPerNode = hyperParams("minInstancesPerNode").toString.toInt, + stepSize = hyperParams("stepSize").toString.toDouble, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def gbtPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int): Array[GBTConfig] = { + val inferenceDataSet = gbtResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = generateGBTSearchSpaceAsDataFrame() + + val restrictedData 
= + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertGBTResultToConfig(euclideanRestrict(restrictedData, topPredictions)) + } + + //LINEAR REGRESSION METHODS + + protected[tools] def generateLinearRegressionSearchSpace() + : Array[LinearRegressionConfig] = { + + val permutationsArray = linearRegressionPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateLinearRegressionSearchSpaceAsDataFrame() + : DataFrame = { + spark.createDataFrame(generateLinearRegressionSearchSpace()) + } + + protected[tools] def linearRegressionResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[LinearRegressionModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += LinearRegressionModelRunReport( + elasticNetParams = hyperParams("elasticNetParams").toString.toDouble, + fitIntercept = hyperParams("fitIntercept").toString.toBoolean, + loss = hyperParams("loss").toString, + maxIter = hyperParams("maxIter").toString.toInt, + regParam = hyperParams("regParam").toString.toDouble, + standardization = hyperParams("standardization").toString.toBoolean, + tolerance = hyperParams("tolerance").toString.toDouble, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def linearRegressionPrediction( + modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int + ): Array[LinearRegressionConfig] = { + val inferenceDataSet = linearRegressionResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = + generateLinearRegressionSearchSpaceAsDataFrame() + + val restrictedData = + 
transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertLinearRegressionResultToConfig( + euclideanRestrict( + restrictedData, + topPredictions, + Array("fitIntercept", "standardization") + ) + ) + } + + //LOGISTIC REGRESSION METHODS + + protected[tools] def generateLogisticRegressionSearchSpace() + : Array[LogisticRegressionConfig] = { + + val permutationsArray = logisticRegressionPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateLogisticRegressionSearchSpaceAsDataFrame() + : DataFrame = { + spark.createDataFrame(generateLogisticRegressionSearchSpace()) + } + + protected[tools] def logisticRegressionResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[LogisticRegressionModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += LogisticRegressionModelRunReport( + elasticNetParams = hyperParams("elasticNetParams").toString.toDouble, + fitIntercept = hyperParams("fitIntercept").toString.toBoolean, + maxIter = hyperParams("maxIter").toString.toInt, + regParam = hyperParams("regParam").toString.toDouble, + standardization = hyperParams("standardization").toString.toBoolean, + tolerance = hyperParams("tolerance").toString.toDouble, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def logisticRegressionPrediction( + modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int + ): Array[LogisticRegressionConfig] = { + val inferenceDataSet = logisticRegressionResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = + 
generateLogisticRegressionSearchSpaceAsDataFrame() + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertLogisticRegressionResultToConfig( + euclideanRestrict( + restrictedData, + topPredictions, + Array("fitIntercept", "standardization") + ) + ) + } + + //SUPPORT VECTOR MACHINES METHODS + + protected[tools] def generateSVMSearchSpace(): Array[SVMConfig] = { + + val permutationsArray = svmPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateSVMSearchSpaceAsDataFrame(): DataFrame = { + spark.createDataFrame(generateSVMSearchSpace()) + } + + protected[tools] def svmResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[SVMModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += SVMModelRunReport( + fitIntercept = hyperParams("fitIntercept").toString.toBoolean, + maxIter = hyperParams("maxIter").toString.toInt, + regParam = hyperParams("regParam").toString.toDouble, + standardization = hyperParams("standardization").toString.toBoolean, + tolerance = hyperParams("tolerance").toString.toDouble, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def svmPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int): Array[SVMConfig] = { + val inferenceDataSet = svmResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = generateSVMSearchSpaceAsDataFrame() + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertSVMResultToConfig( + euclideanRestrict( + restrictedData, + 
topPredictions, + Array("fitIntercept", "standardization") + ) + ) + } + + //XGBOOST METHODS + + protected[tools] def generateXGBoostSearchSpace(): Array[XGBoostConfig] = { + + val permutationsArray = xgboostPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateXGBoostSearchSpaceAsDataFrame(): DataFrame = { + spark.createDataFrame(generateXGBoostSearchSpace()) + } + + protected[tools] def xgBoostResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[XGBoostModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + builder += XGBoostModelRunReport( + alpha = hyperParams("alpha").toString.toDouble, + eta = hyperParams("eta").toString.toDouble, + gamma = hyperParams("gamma").toString.toDouble, + lambda = hyperParams("lambda").toString.toDouble, + maxDepth = hyperParams("maxDepth").toString.toInt, + subSample = hyperParams("subSample").toString.toDouble, + minChildWeight = hyperParams("minChildWeight").toString.toDouble, + numRound = hyperParams("numRound").toString.toInt, + maxBins = hyperParams("maxBins").toString.toInt, + trainTestRatio = hyperParams("trainTestRatio").toString.toDouble, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def xgBoostPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int): Array[XGBoostConfig] = { + val inferenceDataSet = xgBoostResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = generateXGBoostSearchSpaceAsDataFrame() + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertXGBoostResultToConfig( + 
euclideanRestrict(restrictedData, topPredictions) + ) + } + + //LIGHTGBM METHODS + + protected[tools] def generateLightGBMSearchSpace(): Array[LightGBMConfig] = { + + val permutationsArray = lightGBMPermutationGenerator( + generateGenericSearchSpace(), + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateLightGBMSearchSpaceAsDataFrame(): DataFrame = { + spark.createDataFrame(generateLightGBMSearchSpace()) + } + + protected[tools] def lightGBMResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = results.map { x => + val hyperParams = x.hyperParams + LightGBMModelRunReport( + baggingFraction = hyperParams("baggingFraction").toString.toDouble, + baggingFreq = hyperParams("baggingFreq").toString.toInt, + featureFraction = hyperParams("featureFraction").toString.toDouble, + learningRate = hyperParams("learningRate").toString.toDouble, + maxBin = hyperParams("maxBin").toString.toInt, + maxDepth = hyperParams("maxDepth").toString.toInt, + minSumHessianInLeaf = + hyperParams("minSumHessianInLeaf").toString.toDouble, + numIterations = hyperParams("numIterations").toString.toInt, + numLeaves = hyperParams("numLeaves").toString.toInt, + boostFromAverage = hyperParams("boostFromAverage").toString.toBoolean, + lambdaL1 = hyperParams("lambdaL1").toString.toDouble, + lambdaL2 = hyperParams("lambdaL2").toString.toDouble, + alpha = hyperParams("alpha").toString.toDouble, + boostingType = hyperParams("boostingType").toString, + score = x.score + ) + } + spark.createDataFrame(builder) + } + + def lightGBMPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + topPredictions: Int): Array[LightGBMConfig] = { + + val inferenceDataSet = lightGBMResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + 
.regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = generateLightGBMSearchSpaceAsDataFrame() + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + + convertLightGBMResultToConfig( + euclideanRestrict(restrictedData, topPredictions) + ) + + } + + //MLPC METHODS + + protected[tools] def generateMLPCSearchSpace( + inputFeatureSize: Int, + classCount: Int + ): Array[MLPCModelingConfig] = { + + val mlpcSearchSpace = MLPCPermutationConfiguration( + permutationTarget = getPermutationCounts( + _hyperParameterSpaceCount, + _numericBoundaries.size + ) + + stringBoundaryPermutationCalculator(_stringBoundaries), + numericBoundaries = _numericBoundaries, + stringBoundaries = _stringBoundaries, + inputFeatureSize = inputFeatureSize, + distinctClasses = classCount + ) + + val permutationsArray = mlpcPermutationGenerator( + mlpcSearchSpace, + _hyperParameterSpaceCount, + _seed + ) + permutationsArray.distinct + } + + protected[tools] def generateMLPCSearchSpaceAsDataFrame( + inputFeatureSize: Int, + classCount: Int + ): DataFrame = { + spark.createDataFrame(generateMLPCSearchSpace(inputFeatureSize, classCount)) + } + + protected[tools] def mlpcResultMapping( + results: Array[GenericModelReturn] + ): DataFrame = { + + val builder = new ArrayBuffer[MLPCModelRunReport]() + + results.foreach { x => + val hyperParams = x.hyperParams + val (layerCount, hiddenLayerSizeAdjust) = + mlpcLayersExtractor(hyperParams("layers").asInstanceOf[Array[Int]]) + builder += MLPCModelRunReport( + layers = layerCount, + maxIter = hyperParams("maxIter").toString.toInt, + solver = hyperParams("solver").toString, + stepSize = hyperParams("stepSize").toString.toDouble, + tolerance = hyperParams("tolerance").toString.toDouble, + hiddenLayerSizeAdjust = hiddenLayerSizeAdjust, + score = x.score + ) + } + spark.createDataFrame(builder.result.toArray) + } + + def mlpcPrediction(modelingResults: Array[GenericModelReturn], + modelType: String, + 
topPredictions: Int, + featureInputSize: Int, + classDistinctCount: Int): Array[MLPCConfig] = { + + val inferenceDataSet = mlpcResultMapping(modelingResults) + + val fittedPipeline = new PostModelingPipelineBuilder(inferenceDataSet) + .setModelType(modelType) + .setNumericBoundaries(_numericBoundaries) + .setStringBoundaries(_stringBoundaries) + .regressionModelForPermutationTest() + + val fullSearchSpaceDataSet = + generateMLPCSearchSpaceAsDataFrame( + featureInputSize, + classDistinctCount + 1 + ).withColumnRenamed("layers", "layerConstruct") + .withColumnRenamed("layerCount", "layers") + + val restrictedData = + transformAndLimit(fittedPipeline, fullSearchSpaceDataSet, topPredictions) + .withColumnRenamed("layers", "layerCount") + .withColumnRenamed("layerConstruct", "layers") + + convertMLPCResultToConfig( + restrictedData, + featureInputSize, + classDistinctCount + 1 + ) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/PostModelingPipelineBuilder.scala b/src/main/scala/com/databricks/labs/automl/model/tools/PostModelingPipelineBuilder.scala new file mode 100644 index 00000000..89c5627a --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/PostModelingPipelineBuilder.scala @@ -0,0 +1,103 @@ +package com.databricks.labs.automl.model.tools + +import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor +import org.apache.spark.ml.feature.VectorAssembler +import org.apache.spark.ml.regression.{LinearRegression, RandomForestRegressor} +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.sql.DataFrame + +import scala.collection.mutable.ArrayBuffer + +class PostModelingPipelineBuilder(modelResults: DataFrame) { + + var _numericBoundaries: Map[String, (Double, Double)] = _ + var _stringBoundaries: Map[String, List[String]] = _ + var _modelType: String = _ + + def setNumericBoundaries(value: Map[String, (Double, Double)]): this.type = { + _numericBoundaries = value + this + } + + def 
setStringBoundaries(value: Map[String, List[String]]): this.type = { + _stringBoundaries = value + this + } + + def setModelType(value: String): this.type = { + require( + List("RandomForest", "LinearRegression", "XGBoost").contains(value), + s"Model type '$value' is not supported for " + + s"post-run optimization." + ) + _modelType = value + this + } + + def getNumericBoundaries: Map[String, (Double, Double)] = _numericBoundaries + def getStringBoundaries: Map[String, List[String]] = _stringBoundaries + def getModelType: String = _modelType + + protected[tools] def regressionModelForPermutationTest(): PipelineModel = { + + val vectorFields = new ArrayBuffer[String] + val pipelineBuffer = new ArrayBuffer[PipelineStage] + + // Insert the Numeric Values directly into the ArrayBuffer for column listings for the vector assembler + _numericBoundaries.keys.toArray.foreach(x => vectorFields += x) + + // Get the string type fields from the Dataframe to StringIndex them +// _stringBoundaries.keys.foreach{ x => +// +// val indexedName = s"${x}_si" +// +// val stringIndexer = new StringIndexer() +// .setInputCol(x) +// .setOutputCol(indexedName) +// .setHandleInvalid("keep") +// +// vectorFields += indexedName +// pipelineBuffer += stringIndexer +// } + + // Build the vector + val vectorizer = new VectorAssembler() + .setInputCols(vectorFields.result.toArray) + .setOutputCol("features") + + pipelineBuffer += vectorizer + + val model = _modelType match { + case "RandomForest" => + new RandomForestRegressor() + .setMinInfoGain(1E-8) + .setNumTrees(600) + .setMaxDepth(10) + case "LinearRegression" => new LinearRegression() + case "XGBoost" => + new XGBoostRegressor() + .setAlpha(0.5) + .setEta(0.25) + .setGamma(3.0) + .setLambda(10.0) + .setMaxBins(200) + .setMaxDepth(10) + .setMinChildWeight(3.0) + .setNumRound(10) + } + + model + .setLabelCol("score") + .setFeaturesCol("features") + + pipelineBuffer += model + + // Build the pipeline + val fullPipeline = new Pipeline() + 
.setStages(pipelineBuffer.result.toArray) + + // Fit the model pipeline and return it + fullPipeline.fit(modelResults) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/split/DataSplitCustodial.scala b/src/main/scala/com/databricks/labs/automl/model/tools/split/DataSplitCustodial.scala new file mode 100644 index 00000000..5cea78b6 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/split/DataSplitCustodial.scala @@ -0,0 +1,67 @@ +package com.databricks.labs.automl.model.tools.split + +import com.databricks.labs.automl.exploration.structures.FeatureImportanceConfig +import com.databricks.labs.automl.model.tools.structures.TrainSplitReferences +import com.databricks.labs.automl.params.MainConfig + +object DataSplitCustodial { + + /** + * Method for tidying up the cached, persisted, or delta-written test/train splits + * @param splitData reference collection to the cached, persisted, or written-out delta tables + * @param config main config, containing configuration references for how to handle the split data + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def cleanCachedInstances(splitData: Array[TrainSplitReferences], + config: MainConfig): Unit = { + + splitData.foreach { x => + { + config.geneticConfig.splitCachingStrategy match { + case "cache" => + x.data.train.unpersist(true) + x.data.test.unpersist(true) + case "persist" => + x.data.train.unpersist() + x.data.test.unpersist() + case "delta" => + if (config.geneticConfig.deltaCacheBackingDirectoryRemovalFlag) { + DeltaCacheCleanup.removeTrainTestPair(x.paths) + } + } + } + } + + } + + /** + * Method for cleaning up the cached instances based on a FeatureImportance config object. 
+ * @param splitData reference collection to the cached, persisted, or written-out delta tables + * @param config feature importances config, containing configuration references for how to handle the split data + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def cleanCachedInstances(splitData: Array[TrainSplitReferences], + config: FeatureImportanceConfig): Unit = { + + splitData.foreach { x => + { + config.splitCachingStrategy match { + case "cache" => + x.data.train.unpersist(true) + x.data.test.unpersist(true) + case "persist" => + x.data.train.unpersist() + x.data.test.unpersist() + case "delta" => + if (config.deltaCacheBackingDirectoryRemovalFlag) { + DeltaCacheCleanup.removeTrainTestPair(x.paths) + } + } + } + } + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/split/DataSplitUtility.scala b/src/main/scala/com/databricks/labs/automl/model/tools/split/DataSplitUtility.scala new file mode 100644 index 00000000..e3735d3c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/split/DataSplitUtility.scala @@ -0,0 +1,234 @@ +package com.databricks.labs.automl.model.tools.split + +import com.databricks.labs.automl.model.tools.structures.{ + TrainSplitReferences, + TrainTestData, + TrainTestPaths +} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.sql.DataFrame + +/** + * Train / Test split handler class + * @param mainDataset Dataset that contains feature vector, out of DataPrep phase, ready to be split into train and test sets + * @param kIterations number of 'copies' of the split to perform in order to fulfill the number of kFold models to be built + * @param splitMethod The type of split being performed (i.e. 
'stratified', 'random', 'kSample') + * @param labelColumn Name of the label column + * @param rootDir Source directory to use to build the delta persisted data sets if using 'delta' mode in persistMode + * @param persistMode 'cache', 'persist' or 'delta' - how to retain each of the kFold train/test splits. + * @param modelFamily The model family in order to determine how many parts in which to repartition the train and test + * splits for optimal performance. + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ +class DataSplitUtility(mainDataset: DataFrame, + kIterations: Int, + splitMethod: String, + labelColumn: String, + rootDir: String, + persistMode: String, + modelFamily: String, + parallelism: Int, + trainPortion: Double, + syntheticCol: String, + trainSplitChronologicalColumn: String, + trainSplitChronologicalRandomPercentage: Double, + reductionFactor: Double) + extends SplitUtilityTooling { + + @transient private val logger: Logger = Logger.getLogger(this.getClass) + + final val uniqueLabels = mainDataset.select(labelColumn).distinct().collect() + + /** + * Method for persisting the train test splits to local disk. + * @return Array[TrainSplitReferences], containing pointers to the Data, organized by kFold index, as well as + * dummy values for pathing. + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + private def trainSplitPersist: Array[TrainSplitReferences] = { + + val optimalParts = modelFamily match { + case "XGBoost" => PerformanceSettings.xgbWorkers(parallelism) + case _ => PerformanceSettings.optimalJVMModelPartitions(parallelism) + } + + (0 until kIterations).map { x => + val Array(train, test) = + SplitOperators.genTestTrain( + mainDataset, + scala.util.Random.nextLong(), + uniqueLabels, + splitMethod, + labelColumn, + trainPortion, + syntheticCol, + trainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage, + reductionFactor + ) + logger.log( + Level.DEBUG, + s"DEBUG: Generated train/test split for kfold $x. 
Beginning persist." + ) + val (persistedTrain, persistedTest) = + SplitOperators.optimizeTestTrain( + train, + test, + optimalParts, + shuffle = true + ) + + TrainSplitReferences( + x, + TrainTestData(persistedTrain, persistedTest), + TrainTestPaths("", "") + ) + + }.toArray + + } + + /** + * Method for caching the train test splits in memory. + * @return Array[TrainSplitReferences], containing pointers to the Data, organized by kFold index, as well as + * dummy values for pathing. + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + private def trainSplitCache: Array[TrainSplitReferences] = { + + val optimalParts = modelFamily match { + case "XGBoost" => PerformanceSettings.xgbWorkers(parallelism) + case "RandomForest" => + PerformanceSettings.optimalJVMModelPartitions(parallelism) * 4 + case _ => PerformanceSettings.optimalJVMModelPartitions(parallelism) + } + + (0 until kIterations).map { x => + val Array(train, test) = + SplitOperators.genTestTrain( + mainDataset, + scala.util.Random.nextLong(), + uniqueLabels, + splitMethod, + labelColumn, + trainPortion, + syntheticCol, + trainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage, + reductionFactor + ) + + logger.log( + Level.DEBUG, + s"DEBUG: Generated train/test split for kfold $x. Beginning cache to memory." + ) + + val trainCache = train.repartition(optimalParts).cache() + val testCache = test.repartition(optimalParts).cache() + + trainCache.foreach(_ => ()) + testCache.foreach(_ => ()) + + TrainSplitReferences( + x, + TrainTestData(trainCache, testCache), + TrainTestPaths("", "") + ) + }.toArray + + } + + /** + * Method for writing the train test splits to dbfs as a delta data source + * + * @return Array[TrainSplitReferences], containing pointers to the Data as stored by Delta, organized by kFold index, + * as well as the values for pathing for eventual cleanup.
+ * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + private def trainSplitDelta: Array[TrainSplitReferences] = { + + (0 until kIterations).map { x => + val Array(train, test) = + SplitOperators.genTestTrain( + mainDataset, + scala.util.Random.nextLong(), + uniqueLabels, + splitMethod, + labelColumn, + trainPortion, + syntheticCol, + trainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage, + reductionFactor + ) + + val deltaPaths = formTrainTestPaths(rootDir) + + val deltaReferences = storeLoadDelta(train, test, deltaPaths) + + logger.log( + Level.DEBUG, + s"DEBUG: Generated train/test split for kfold $x. Stored tables to Delta paths." + ) + + TrainSplitReferences(x, deltaReferences, deltaPaths) + }.toArray + + } + + /** + * Wrapper interface for performing the splits, dependent on mode + * @return Array[TrainSplitReferences] from the above methods. + */ + def performSplit: Array[TrainSplitReferences] = { + + persistMode match { + case "persist" => trainSplitPersist + case "delta" => trainSplitDelta + case "cache" => trainSplitCache + case _ => + throw new UnsupportedOperationException( + s"Train Split mode $persistMode is not supported."
+ ) + } + + } + +} + +object DataSplitUtility { + + def split(mainDataSet: DataFrame, + kIterations: Int, + splitMethod: String, + labelColumn: String, + rootDir: String, + persistMode: String, + modelFamily: String, + parallelism: Int, + trainPortion: Double, + syntheticCol: String, + trainSplitChronologicalColumn: String, + trainSplitChronologicalRandomPercentage: Double, + reductionFactor: Double): Array[TrainSplitReferences] = + new DataSplitUtility( + mainDataSet, + kIterations, + splitMethod, + labelColumn, + rootDir, + persistMode, + modelFamily, + parallelism, + trainPortion, + syntheticCol, + trainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage, + reductionFactor + ).performSplit + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/split/DeltaCacheCleanup.scala b/src/main/scala/com/databricks/labs/automl/model/tools/split/DeltaCacheCleanup.scala new file mode 100644 index 00000000..7a8e11f8 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/split/DeltaCacheCleanup.scala @@ -0,0 +1,44 @@ +package com.databricks.labs.automl.model.tools.split + +import com.databricks.dbutils_v1.DBUtilsHolder.dbutils0 +import com.databricks.labs.automl.model.tools.structures.{ + TrainSplitReferences, + TrainTestPaths +} + +object DeltaCacheCleanup { + + /** + * Method for cleaning up all of the delta train/test paths that have been created during the modeling phase + * + * @param dataPayload Array of TrainSplitReferences containing the links to the delta paths + * @since 0.7.1 + * @author Ben Wilson, Databricks + */ + def removeCacheDirectories(dataPayload: Array[TrainSplitReferences]): Unit = { + + dataPayload.foreach { x => + { + dbutils0.get().fs.rm(x.paths.train, true) + dbutils0.get().fs.rm(x.paths.test, true) + + } + } + + } + + /** + * Internal method for cleaning up a kfold test/train data delta source + * + * @param dataPaths paths to test and train for a particular delta source + * @since 0.7.1 + * @author 
Ben Wilson, Databricks + */ + def removeTrainTestPair(dataPaths: TrainTestPaths): Unit = { + + dbutils0.get().fs.rm(dataPaths.train, true) + dbutils0.get().fs.rm(dataPaths.test, true) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/split/PerformanceSettings.scala b/src/main/scala/com/databricks/labs/automl/model/tools/split/PerformanceSettings.scala new file mode 100644 index 00000000..7ee025b7 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/split/PerformanceSettings.scala @@ -0,0 +1,77 @@ +package com.databricks.labs.automl.model.tools.split + +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} + +import scala.collection.JavaConverters._ + +object PerformanceSettings extends SparkSessionWrapper { + + @transient private val logger: Logger = Logger.getLogger(this.getClass) + + final val environmentVars: Map[String, String] = System.getenv().asScala.toMap + + lazy val coresPerWorker: Int = sc + .parallelize("1", 1) + .map(_ => java.lang.Runtime.getRuntime.availableProcessors) + .collect()(0) + + lazy val numberOfWorkerNodes + : Int = sc.statusTracker.getExecutorInfos.length - 1 + + lazy val totalCores: Int = coresPerWorker * numberOfWorkerNodes + + lazy val coresPerTask + : Int = try { spark.conf.get("spark.task.cpus").toInt } catch { + case e: java.util.NoSuchElementException => 1 + } + + private lazy val preCalcParTasks: Int = + scala.math.floor(totalCores / coresPerTask).toInt + lazy val parTasks: Int = if (preCalcParTasks < 1) 1 else preCalcParTasks + + def envString: String = + s"coresPerWorker: $coresPerWorker \n" + + s"numberOfWorkerNodes: $numberOfWorkerNodes \n " + + s"totalCores: $totalCores \n " + + s"coresPerTask: $coresPerTask \n " + + s"preCalcParTasks: $preCalcParTasks \n " + + s"parTasks: $parTasks" + + @throws(classOf[IllegalArgumentException]) + def xgbWorkers(parallelism: Int): Int = { + //DEBUG + logger.log(Level.DEBUG, envString) + + try { + 
environmentVars("num_workers").toString.toInt + } catch { + case e: java.util.NoSuchElementException => + val workerCount = scala.math.floor(totalCores / coresPerTask / parallelism).toInt + require(workerCount >= 1, s"XGBoost requires at least one core per XGB worker. " + + s"Current configuration is not compatible with XGBoost. Consider increasing cluster size or " + + s"decreasing parallelism or lowering spark.task.cpus. The XGBWorker count is derived: " + + s"floor(total Cluster Cores / spark.task.cpus / parallelism).toInt. This number must be >= 1. \n " + + s"XGB numWorkers == ${workerCount} \n " + + s"Total Cluster Cores == ${totalCores} \n " + + s"spark.task.cpus == ${coresPerTask} == nThread \n " + + s"Parallelism == ${parallelism}") + workerCount + } + } + + def optimalJVMModelPartitions(parallelism: Int): Int = { + //DEBUG + logger.log(Level.DEBUG, envString) + val jvmParts = scala.math.floor(parTasks / scala.math.max(parallelism / 2, 1)).toInt // guard: parallelism / 2 is 0 when parallelism == 1 + val warnMessage = s"WARNING: JVM Model partitions < 10. Consider a larger " + + s"cluster or reducing Parallelism. JVM Model Parallelism is calculated: floor(parTasks / (parallelism / 2)).
\n " + + s"JVM Parallelism: ${jvmParts} \n " + + s"parTasks: ${parTasks} \n " + + s"Parallelism: ${parallelism}" + if (jvmParts < 10) logger.log(Level.WARN, warnMessage) + if (jvmParts < 10) println(warnMessage) + jvmParts + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/split/SplitOperators.scala b/src/main/scala/com/databricks/labs/automl/model/tools/split/SplitOperators.scala new file mode 100644 index 00000000..85d4aedf --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/split/SplitOperators.scala @@ -0,0 +1,396 @@ +package com.databricks.labs.automl.model.tools.split + +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.log4j.{Level, Logger} +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.expressions.Window +import org.apache.spark.sql.functions.{col, count, lit, row_number} +import org.apache.spark.sql.types.StructType +import org.apache.spark.storage.StorageLevel + +object SplitOperators extends SparkSessionWrapper { + + @transient private val logger: Logger = Logger.getLogger(this.getClass) + + private def toDoubleType(x: Any): Option[Double] = x match { + case i: Int => Some(i) + case d: Double => Some(d) + case _ => None + } + + private def generateEmptyTrainTest( + schema: StructType + ): (DataFrame, DataFrame) = { + + val trainData = spark.createDataFrame(sc.emptyRDD[Row], schema) + val testData = spark.createDataFrame(sc.emptyRDD[Row], schema) + (trainData, testData) + } + + /** + * Method for stratification of the test/train around the unique values of the label column + * This mode is recommended for label value distributions in classification that have relatively balanced + * and uniformly distributed instances of the classes. + * If there is significant skew, it is highly recommended to use under or over sampling. + * + * @param data Dataframe that is the input to the train/test split + * @param seed random seed for splitting the data into train/test.
+ * @return An Array of Dataframes: Array[trainData, testData] + */ + def stratifiedSplit(data: DataFrame, + seed: Long, + uniqueLabels: Array[Row], + labelCol: String, + trainPortion: Double): Array[DataFrame] = { + + logger.log(Level.DEBUG, "DEBUG: Generating empty train/test split sets") + + var (trainData, testData) = generateEmptyTrainTest(data.schema) + + uniqueLabels.foreach { x => + logger.log(Level.DEBUG, s"DEBUG: Unique Label: $x") + + val conversionValue = toDoubleType(x(0)).get + + logger.log(Level.DEBUG, "DEBUG: performing stratified random split") + val Array(trainSplit, testSplit) = data + .filter(col(labelCol) === conversionValue) + .randomSplit(Array(trainPortion, 1 - trainPortion), seed) + + trainData = trainData.union(trainSplit) + testData = testData.union(testSplit) + + } + + logger.log(Level.DEBUG, "DEBUG: returning train & test datasets") + Array(trainData, testData) + } + + def underSampleSplit(data: DataFrame, + seed: Long, + labelCol: String, + trainPortion: Double): Array[DataFrame] = { + + logger.log(Level.DEBUG, "DEBUG: Generating empty train/test split sets") + + var (trainData, testData) = generateEmptyTrainTest(data.schema) + + val totalDataSetCount = data.count() + + val groupedLabelCount = data + .select(labelCol) + .groupBy(labelCol) + .agg(count("*").as("counts")) + .withColumn("skew", col("counts") / lit(totalDataSetCount)) + .select(labelCol, "skew") + + val uniqueGroups = groupedLabelCount.collect() + + val smallestSkew = groupedLabelCount + .sort(col("skew").asc) + .select(col("skew")) + .first() + .getDouble(0) + + uniqueGroups.foreach { x => + logger.log(Level.DEBUG, s"DEBUG: Unique Label: $x") + + val groupData = toDoubleType(x(0)).get + + val groupRatio = x.getDouble(1) + + val groupDataFrame = data.filter(col(labelCol) === groupData) + + val Array(train, test) = if (groupRatio == smallestSkew) { + groupDataFrame.randomSplit(Array(trainPortion, 1 - trainPortion), seed) + } else { + groupDataFrame + .sample(withReplacement = false, smallestSkew /
groupRatio) + .randomSplit(Array(trainPortion, 1 - trainPortion), seed) + } + + trainData = trainData.union(train) + testData = testData.union(test) + + } + + logger.log(Level.DEBUG, "DEBUG: returning train & test datasets") + + Array(trainData, testData) + + } + + def overSampleSplit(data: DataFrame, + seed: Long, + labelCol: String, + trainPortion: Double): Array[DataFrame] = { + + logger.log(Level.DEBUG, "DEBUG: Generating empty train/test split sets") + + var (trainData, testData) = generateEmptyTrainTest(data.schema) + + val groupedLabelCount = data + .select(labelCol) + .groupBy(labelCol) + .agg(count("*").as("counts")) + + val uniqueGroups = groupedLabelCount.collect() + + val largestGroupCount = groupedLabelCount + .sort(col("counts").desc) + .select(col("counts")) + .first() + .getLong(0) + + uniqueGroups.foreach { x => + logger.log(Level.DEBUG, s"DEBUG: Unique Label: $x") + + val groupData = toDoubleType(x(0)).get + + val groupRatio = math.ceil(largestGroupCount.toDouble / x.getLong(1)).toInt + + for (_ <- 1 to groupRatio) { + + val Array(train, test): Array[DataFrame] = data + .filter(col(labelCol) === groupData) + .randomSplit(Array(trainPortion, 1 - trainPortion), seed) + + trainData = trainData.union(train) + testData = testData.union(test) + + } + } + + logger.log(Level.DEBUG, "DEBUG: returning train & test datasets") + + Array(trainData, testData) + + } + + def stratifyReduce(data: DataFrame, + reductionFactor: Double, + seed: Long, + uniqueLabels: Array[Row], + labelCol: String, + trainPortion: Double): Array[DataFrame] = { + + logger.log(Level.DEBUG, "DEBUG: Generating empty train/test split sets") + + var (trainData, testData) = generateEmptyTrainTest(data.schema) + + uniqueLabels.foreach { x => + logger.log(Level.DEBUG, s"DEBUG: Unique Label: $x") + + val conversionValue = toDoubleType(x(0)).get + + val Array(trainSplit, testSplit) = data + .filter(col(labelCol) === conversionValue) + .randomSplit(Array(trainPortion, 1 - trainPortion), seed) + +
trainData = trainData.union(trainSplit.sample(reductionFactor)) + testData = testData.union(testSplit.sample(reductionFactor)) + + } + + logger.log(Level.DEBUG, "DEBUG: returning train & test datasets") + + Array(trainData, testData) + + } + + def chronologicalSplit(data: DataFrame, + seed: Long, + trainSplitChronologicalColumn: String, + trainSplitChronologicalRandomPercentage: Double, + trainPortion: Double): Array[DataFrame] = { + + require( + data.schema.fieldNames.contains(trainSplitChronologicalColumn), + s"Chronological Split Field ${trainSplitChronologicalColumn} is not in schema: " + + s"${data.schema.fieldNames.mkString(", ")}" + ) + + // Validation check for the random 'wiggle value' if it's set that it won't risk creating zero rows in train set. + if (trainSplitChronologicalRandomPercentage > 0.0) + require( + (1 - trainPortion) * trainSplitChronologicalRandomPercentage / 100 < 0.5, + s"With trainSplitChronologicalRandomPercentage set at '${trainSplitChronologicalRandomPercentage}' " + + s"and a train test ratio of ${trainPortion} there is a high probability of train data set being empty." + + s" \n\tAdjust lower to prevent non-deterministic split levels that could break training." 
+ ) + + // Get the row count + val rawDataCount = data.count.toDouble + + val splitValue = scala.math.round(rawDataCount * trainPortion).toInt + + // Estimate the row number at which to conduct the split + val splitRow: Int = if (trainSplitChronologicalRandomPercentage <= 0.0) { + splitValue + } else { + // randomly mutate the size of the test validation set + val splitWiggle = scala.math + .round( + rawDataCount * (1 - trainPortion) * + trainSplitChronologicalRandomPercentage / 100 + ) + .toInt + if (splitWiggle > 0) splitValue - scala.util.Random.nextInt(splitWiggle) else splitValue + } + + // Define the window partition + val uniqueCol = "chron_grp_autoML_" + java.util.UUID.randomUUID().toString + + // Define temporary non-colliding columns for the window partition + val uniqueRow = "row_" + java.util.UUID.randomUUID().toString + val windowDefinition = + Window.partitionBy(uniqueCol).orderBy(trainSplitChronologicalColumn) + + // Generate a new Dataframe that has the row number partition, sorted by the chronological field + val preSplitData = data + .withColumn(uniqueCol, lit("grp")) + .withColumn(uniqueRow, row_number() over windowDefinition) + .drop(uniqueCol) + + logger.log(Level.DEBUG, "DEBUG: returning train & test datasets") + + // Generate the test/train split data based on sorted chronological column + Array( + preSplitData.filter(col(uniqueRow) <= splitRow).drop(uniqueRow), + preSplitData.filter(col(uniqueRow) > splitRow).drop(uniqueRow) + ) + + } + + /** + * Split methodology for getting test and train of KSample up-sampled data.
+ * Both data sets are split into test and train.
+ * The returned collections are a union of the real train + synthetic train, but only the real test data. + * @param data DataFrame: The full data set (containing a synthetic column that indicates whether the data is real or not) + * @param seed Long: A seed value that is consistent across both data sets + * @param uniqueLabels Array[Row]: The unique entries of the label values + * @return Array[DataFrame] of Array(trainData, testData) + * @since 0.5.1 + * @author Ben Wilson + */ + def kSamplingSplit(data: DataFrame, + seed: Long, + uniqueLabels: Array[Row], + syntheticCol: String, + labelCol: String, + trainPortion: Double): Array[DataFrame] = { + + logger.log(Level.DEBUG, "DEBUG: generating KSample data sets") + + // Split out the real from the synthetic data + val realData = data.filter(!col(syntheticCol)) + + // Split out the synthetic data + val syntheticData = data.filter(col(syntheticCol)) + + // Perform stratified splits on both the real and synthetic data + val Array(realTrain, realTest) = + stratifiedSplit(realData, seed, uniqueLabels, labelCol, trainPortion) + + val Array(syntheticTrain, syntheticTest) = + stratifiedSplit(syntheticData, seed, uniqueLabels, labelCol, trainPortion) + + logger.log( + Level.DEBUG, + "DEBUG: returning data sets augmented with KSample synthetic data" + ) + + // Union the real train data with the synthetic train data and return that with only the real test data + Array(realTrain.union(syntheticTrain), realTest) + + } + + def genTestTrain(data: DataFrame, + seed: Long, + uniqueLabels: Array[Row], + trainSplitMethod: String, + labelCol: String, + trainPortion: Double, + syntheticCol: String = "syntheticColumn", + trainSplitChronologicalColumn: String = "datetime", + trainSplitChronologicalRandomPercentage: Double = 0.05, + reductionFactor: Double = 0.5): Array[DataFrame] = { + + logger.log(Level.DEBUG, s"DEBUG: Split Method: ${trainSplitMethod}") + + trainSplitMethod match { + case "random" => + 
data.randomSplit(Array(trainPortion, 1 - trainPortion), seed) + case "chronological" => + chronologicalSplit( + data, + seed, + trainSplitChronologicalColumn, + trainSplitChronologicalRandomPercentage, + trainPortion + ) + case "stratified" => + stratifiedSplit(data, seed, uniqueLabels, labelCol, trainPortion) + case "overSample" => overSampleSplit(data, seed, labelCol, trainPortion) + case "underSample" => underSampleSplit(data, seed, labelCol, trainPortion) + case "stratifyReduce" => + stratifyReduce( + data, + reductionFactor, + seed, + uniqueLabels, + labelCol, + trainPortion + ) + case "kSample" => + kSamplingSplit( + data, + seed, + uniqueLabels, + syntheticCol, + labelCol, + trainPortion + ) + case _ => + throw new IllegalArgumentException( + s"Cannot conduct train test split in mode: '${trainSplitMethod}'" + ) + } + + } + + def optimizeTestTrain(train: DataFrame, + test: DataFrame, + optimalParts: Int, + shuffle: Boolean = false): (DataFrame, DataFrame) = { + // TODO: TOMES - Why is this still hardcoded DISK_ONLY? + logger.log( + Level.DEBUG, + s"DEBUG: Train persist called. Shuffle = $shuffle. Optimal parts: $optimalParts" + ) + val optimizedTrain = if (shuffle) { + train.repartition(optimalParts).persist(StorageLevel.DISK_ONLY) + } else { + train.coalesce(optimalParts).persist(StorageLevel.DISK_ONLY) + } + + logger.log( + Level.DEBUG, + s"DEBUG: Test persist called. Shuffle = $shuffle.
Optimal parts: $optimalParts" + ) + val optimizedTest = if (shuffle) { + test.repartition(optimalParts).persist(StorageLevel.DISK_ONLY) + } else { + test.coalesce(optimalParts).persist(StorageLevel.DISK_ONLY) + } + + logger.log(Level.DEBUG, "DEBUG: Forcing the persist for Train") + optimizedTrain.foreach(_ => ()) + logger.log(Level.DEBUG, "DEBUG: Forcing the persist for Test") + optimizedTest.foreach(_ => ()) + + (optimizedTrain, optimizedTest) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/split/SplitUtilityTooling.scala b/src/main/scala/com/databricks/labs/automl/model/tools/split/SplitUtilityTooling.scala new file mode 100644 index 00000000..d7df51fd --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/split/SplitUtilityTooling.scala @@ -0,0 +1,51 @@ +package com.databricks.labs.automl.model.tools.split + +import com.databricks.labs.automl.model.tools.structures.{ + TrainTestData, + TrainTestPaths +} +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.spark.sql.DataFrame + +trait SplitUtilityTooling extends SparkSessionWrapper { + + def formRootPath(configStoreLocation: String): String = { + + configStoreLocation.takeRight(1) match { + case "/" => configStoreLocation + "modeling_sources/" + case _ => configStoreLocation + "/modeling_sources/" + } + + } + + def formTrainTestPaths(configStoreLocation: String): TrainTestPaths = { + + val uniqueIdentifier = java.util.UUID.randomUUID() + + val rootPath = formRootPath(configStoreLocation) + + val trainPath = rootPath + s"train_$uniqueIdentifier" + val testPath = rootPath + s"test_$uniqueIdentifier" + + TrainTestPaths(trainPath, testPath) + + } + + def storeLoadDelta(trainData: DataFrame, + testData: DataFrame, + paths: TrainTestPaths): TrainTestData = { + + // Write test data to delta location + trainData.write.format("delta").save(paths.train) + testData.write.format("delta").save(paths.test) + + // read from the location and provide a 
reference object to the reader + + TrainTestData( + train = spark.read.format("delta").load(paths.train), + test = spark.read.format("delta").load(paths.test) + ) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/structures/EuclideanSpaceSearch.scala b/src/main/scala/com/databricks/labs/automl/model/tools/structures/EuclideanSpaceSearch.scala new file mode 100644 index 00000000..d1b6f3be --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/structures/EuclideanSpaceSearch.scala @@ -0,0 +1,124 @@ +package com.databricks.labs.automl.model.tools.structures + +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.feature.{ + MaxAbsScaler, + StringIndexer, + VectorAssembler +} +import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ + +/** + * Provides log search space results based on HyperParameter Vector similarity to prevent too-similar post-run + * hyper parameters from being tested, which do not provide much information gain to the modeling run. + * @param df DataFrame containing the hyper parameters that are generated based on the PostModelingOptimization class + * @param numericParams Array[String] numeric Parameters available for the model type + * @param stringParams Array[String] string Parameters available for the model type + * @param outputCount Int Desired number of output predictions to provide. 
+ */ +class EuclideanSpaceSearch(df: DataFrame, + numericParams: Array[String], + stringParams: Array[String], + outputCount: Int, + additionalFields: Array[String] = Array[String]()) + extends Serializable + with SparkSessionWrapper { + + private final val SI_NAME: String = "_si" + private final val UNSCALED_VECTOR: String = "vecParams" + private final val SCALED_VECTOR: String = "scaledVector" + private final val DISTANCE_COL: String = "distanceEuclid" + + @transient private lazy val fullColumns + : Seq[String] = numericParams.toSeq ++ stringParams.toSeq ++ additionalFields.toSeq + + private def euclidean(vec: Vector): UserDefinedFunction = + udf((feat: Vector) => Vectors.sqdist(feat, vec)) + + private def generateLogNTiles: Array[Double] = { + (0 to outputCount) + .map(_ / outputCount.toDouble) + .toArray + .map(x => { + val b = math.log(1.0 / 1E-2) / (1.0 - 1E-2) + val a = 1.0 / math.exp(b) + a * math.exp(b * x) + }) + } + + private def buildVectorPipeline: Pipeline = { + + val indexers = stringParams.map( + x => new StringIndexer().setInputCol(x).setOutputCol(x + SI_NAME) + ) + + val indexFields = stringParams.map(_ + SI_NAME) ++ numericParams + + val vectorAssembler = new VectorAssembler() + .setInputCols(indexFields) + .setOutputCol(UNSCALED_VECTOR) + + val scaler = + new MaxAbsScaler() + .setInputCol(UNSCALED_VECTOR) + .setOutputCol(SCALED_VECTOR) + + new Pipeline().setStages(indexers :+ vectorAssembler :+ scaler) + } + + private def executePipeline: DataFrame = { + + buildVectorPipeline.fit(df).transform(df) + + } + + def searchSpace(): DataFrame = { + + val vectoredData = executePipeline + + val topRecord = + vectoredData.take(1).map(x => x.getAs[Vector](SCALED_VECTOR)).head + + val distanceDF = vectoredData.withColumn( + DISTANCE_COL, + euclidean(topRecord)(col(SCALED_VECTOR)) + ) + + val nTiles = generateLogNTiles + + val nTileValues = distanceDF.stat.approxQuantile(DISTANCE_COL, nTiles, 0.0) + + nTileValues + .map(x => { + distanceDF + 
.filter(col(DISTANCE_COL) >= x) + .sort(col(DISTANCE_COL).asc) + .limit(1) + }) + .reduce(_ union _) + .select(fullColumns.map(x => col(x)): _*) + + } + +} + +object EuclideanSpaceSearch { + + def apply(df: DataFrame, + numericParams: Array[String], + stringParams: Array[String], + outputCount: Int, + additionalFields: Array[String] = Array[String]()): DataFrame = + new EuclideanSpaceSearch( + df, + numericParams, + stringParams, + outputCount, + additionalFields + ).searchSpace() + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/structures/ModelConfigGenerators.scala b/src/main/scala/com/databricks/labs/automl/model/tools/structures/ModelConfigGenerators.scala new file mode 100644 index 00000000..d15979c7 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/structures/ModelConfigGenerators.scala @@ -0,0 +1,1102 @@ +package com.databricks.labs.automl.model.tools.structures + +import com.databricks.labs.automl.params._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer +import scala.reflect.runtime.universe._ + +trait ModelConfigGenerators extends SeedGenerator { + + /** + * Helper method for reading a case class definition, getting the defined names of each key, and returning them as + * an iterable list. + * + * @tparam T The class type as derived through reflection + * @return The List of all case class member names + */ + def getCaseClassNames[T: TypeTag]: List[String] = + typeOf[T].members.sorted.collect { + case m: MethodSymbol if m.isCaseAccessor => m.name.toString + } + + // RANDOM FOREST METHODS + /** + * Method for taking a collection of permutations generated per each hyper parameter and converting them + * into a collection that can be used to execute models by building out all possible permutations of the generated + * hyper parameter collections. 
+ * + * @param randomForestPermutationCollection The Array of values generated for possible hyper parameters for the + * permutation collection creation + * @return Array of Random Forest configurations based on permutations of each value within the arrays supplied. + */ + def randomForestConfigGenerator( + randomForestPermutationCollection: RandomForestPermutationCollection + ): Array[RandomForestConfig] = { + + for { + numTrees <- randomForestPermutationCollection.numTreesArray + impurity <- randomForestPermutationCollection.impurityArray + maxBins <- randomForestPermutationCollection.maxBinsArray + maxDepth <- randomForestPermutationCollection.maxDepthArray + minInfoGain <- randomForestPermutationCollection.minInfoGainArray + subSamplingRate <- randomForestPermutationCollection.subSamplingRateArray + featureSubsetStrategy <- randomForestPermutationCollection.featureSubsetStrategyArray + } yield + RandomForestConfig( + numTrees.toInt, + impurity, + maxBins.toInt, + maxDepth.toInt, + minInfoGain, + subSamplingRate, + featureSubsetStrategy + ) + } + + /** + * Method for generating linear and log spaces for potential hyper parameter values for the model + * + * @param config Configuration value for the generation of permutation arrays + * @return Arrays for all numeric parameters that will be generated for input into the permutation generator + */ + protected[tools] def randomForestNumericArrayGenerator( + config: PermutationConfiguration + ): RandomForestNumericArrays = { + + RandomForestNumericArrays( + numTreesArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("numTrees")), + config.permutationTarget + ), + maxBinsArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxBins")), + config.permutationTarget + ), + maxDepthArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxDepth")), + config.permutationTarget + ), + minInfoGainArray = generateLogSpace( + 
extractContinuousBoundaries(config.numericBoundaries("minInfoGain")), + config.permutationTarget + ), + subSamplingRateArray = generateLinearSpace( + extractContinuousBoundaries( + config.numericBoundaries("subSamplingRate") + ), + config.permutationTarget + ) + ) + } + + /** + * Main accessor for generating permutations for a RandomForest Model + * + * @param config Configuration for holding the number of permutations to generate and the boundaries of the + * search space + * @param countTarget Total maximum count of permutations to return + * @param seed Seed for determining the random sample of permutations that are generated due to the sheer count + * of permutations that are generated to search the space effectively. + * @return An Array of RandomForest Configurations to be used in generating model runs. + */ + def randomForestPermutationGenerator( + config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L + ): Array[RandomForestConfig] = { + + // Get the number of permutations to generate + val numericPayloads = randomForestNumericArrayGenerator(config) + + val impurityOverride = if (config.modelType == "regressor") { + Array("variance") + } else { + config.stringBoundaries("impurity").toArray + } + + val fullPermutationConfig = RandomForestPermutationCollection( + numTreesArray = numericPayloads.numTreesArray, + maxBinsArray = numericPayloads.maxBinsArray, + maxDepthArray = numericPayloads.maxDepthArray, + minInfoGainArray = numericPayloads.minInfoGainArray, + subSamplingRateArray = numericPayloads.subSamplingRateArray, + impurityArray = impurityOverride, + featureSubsetStrategyArray = + config.stringBoundaries("featureSubsetStrategy").toArray + ) + + val permutationCollection = randomForestConfigGenerator( + fullPermutationConfig + ) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + /** + * Helper method for converting a Dataframe of predicted hyper parameters into configurations that can be used + * by models (for
post-run hyper parameter optimization) + * + * @param predictionDataFrame The predicted sets of highest probability hyper parameter collections + * @return An Array of RandomForest Configurations to be used in generating model runs. + */ + def convertRandomForestResultToConfig( + predictionDataFrame: DataFrame + ): Array[RandomForestConfig] = { + + val collectionBuffer = new ArrayBuffer[RandomForestConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[RandomForestConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += RandomForestConfig( + numTrees = x(0).toString.toInt, + impurity = x(1).toString, + maxBins = x(2).toString.toInt, + maxDepth = x(3).toString.toInt, + minInfoGain = x(4).toString.toDouble, + subSamplingRate = x(5).toString.toDouble, + featureSubsetStrategy = x(6).toString + ) + + } + + collectionBuffer.result.toArray + } + + // DECISION TREE METHODS + def treesConfigGenerator( + treesPermutationCollection: TreesPermutationCollection + ): Array[TreesConfig] = { + + for { + impurity <- treesPermutationCollection.impurityArray + maxBins <- treesPermutationCollection.maxBinsArray + maxDepth <- treesPermutationCollection.maxDepthArray + minInfoGain <- treesPermutationCollection.minInfoGainArray + minInstancesPerNode <- treesPermutationCollection.minInstancesPerNodeArray + } yield + TreesConfig( + impurity, + maxBins.toInt, + maxDepth.toInt, + minInfoGain, + minInstancesPerNode.toInt + ) + } + + protected[tools] def treesNumericArrayGenerator( + config: PermutationConfiguration + ): TreesNumericArrays = { + + TreesNumericArrays( + maxBinsArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxBins")), + config.permutationTarget + ), + maxDepthArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxDepth")), + config.permutationTarget + ), + minInfoGainArray = generateLogSpace( + 
extractContinuousBoundaries(config.numericBoundaries("minInfoGain")), + config.permutationTarget + ), + minInstancesPerNodeArray = generateLinearIntSpace( + extractContinuousBoundaries( + config.numericBoundaries("minInstancesPerNode") + ), + config.permutationTarget + ) + ) + } + + def treesPermutationGenerator(config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L): Array[TreesConfig] = { + + // Get the number of permutations to generate + val numericPayloads = treesNumericArrayGenerator(config) + + val impurityOverride = if (config.modelType == "regressor") { + Array("variance") + } else { + config.stringBoundaries("impurity").toArray + } + + val fullPermutationConfig = TreesPermutationCollection( + impurityArray = impurityOverride, + maxBinsArray = numericPayloads.maxBinsArray, + maxDepthArray = numericPayloads.maxDepthArray, + minInfoGainArray = numericPayloads.minInfoGainArray, + minInstancesPerNodeArray = numericPayloads.minInstancesPerNodeArray + ) + + val permutationCollection = treesConfigGenerator(fullPermutationConfig) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertTreesResultToConfig( + predictionDataFrame: DataFrame + ): Array[TreesConfig] = { + + val collectionBuffer = new ArrayBuffer[TreesConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[TreesConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += TreesConfig( + impurity = x(0).toString, + maxBins = x(1).toString.toInt, + maxDepth = x(2).toString.toInt, + minInfoGain = x(3).toString.toDouble, + minInstancesPerNode = x(4).toString.toInt + ) + + } + collectionBuffer.result.toArray + } + + // GRADIENT BOOSTED TREES METHODS + + def gbtConfigGenerator( + gbtPermutationCollection: GBTPermutationCollection + ): Array[GBTConfig] = { + + for { + impurity <- gbtPermutationCollection.impurityArray + lossType <- gbtPermutationCollection.lossTypeArray + maxBins <- 
gbtPermutationCollection.maxBinsArray + maxDepth <- gbtPermutationCollection.maxDepthArray + maxIter <- gbtPermutationCollection.maxIterArray + minInfoGain <- gbtPermutationCollection.minInfoGainArray + minInstancesPerNode <- gbtPermutationCollection.minInstancesPerNodeArray + stepSize <- gbtPermutationCollection.stepSizeArray + } yield + GBTConfig( + impurity, + lossType, + maxBins.toInt, + maxDepth.toInt, + maxIter.toInt, + minInfoGain, + minInstancesPerNode.toInt, + stepSize + ) + } + + protected[tools] def gbtNumericArrayGenerator( + config: PermutationConfiguration + ): GBTNumericArrays = { + + GBTNumericArrays( + maxBinsArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxBins")), + config.permutationTarget + ), + maxDepthArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxDepth")), + config.permutationTarget + ), + maxIterArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxIter")), + config.permutationTarget + ), + minInfoGainArray = generateLogSpace( + extractContinuousBoundaries(config.numericBoundaries("minInfoGain")), + config.permutationTarget + ), + minInstancesPerNodeArray = generateLinearIntSpace( + extractContinuousBoundaries( + config.numericBoundaries("minInstancesPerNode") + ), + config.permutationTarget + ), + stepSizeArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("stepSize")), + config.permutationTarget + ) + ) + } + + def gbtPermutationGenerator(config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L): Array[GBTConfig] = { + + // Get the number of permutations to generate + val numericPayloads = gbtNumericArrayGenerator(config) + + val impurityOverride = if (config.modelType == "regressor") { + Array("variance") + } else { + config.stringBoundaries("impurity").toArray + } + + val lossTypeOverride = if (config.modelType == "regressor") { + Array("squared", "absolute") + } else 
{ + config.stringBoundaries("lossType").toArray + } + + val fullPermutationConfig = GBTPermutationCollection( + impurityArray = impurityOverride, + lossTypeArray = lossTypeOverride, + maxBinsArray = numericPayloads.maxBinsArray, + maxDepthArray = numericPayloads.maxDepthArray, + maxIterArray = numericPayloads.maxIterArray, + minInfoGainArray = numericPayloads.minInfoGainArray, + minInstancesPerNodeArray = numericPayloads.minInstancesPerNodeArray, + stepSizeArray = numericPayloads.stepSizeArray + ) + + val permutationCollection = gbtConfigGenerator(fullPermutationConfig) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertGBTResultToConfig( + predictionDataFrame: DataFrame + ): Array[GBTConfig] = { + + val collectionBuffer = new ArrayBuffer[GBTConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[GBTConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += GBTConfig( + impurity = x(0).toString, + lossType = x(1).toString, + maxBins = x(2).toString.toInt, + maxDepth = x(3).toString.toInt, + maxIter = x(4).toString.toInt, + minInfoGain = x(5).toString.toDouble, + minInstancesPerNode = x(6).toString.toInt, + stepSize = x(7).toString.toDouble + ) + } + collectionBuffer.result.toArray + } + + // LINEAR REGRESSION METHODS + def linearRegressionConfigGenerator( + linearRegressionPermutationCollection: LinearRegressionPermutationCollection + ): Array[LinearRegressionConfig] = { + + for { + elasticNetParams <- linearRegressionPermutationCollection.elasticNetParamsArray + fitIntercept <- linearRegressionPermutationCollection.fitInterceptArray + loss <- linearRegressionPermutationCollection.lossArray + maxIter <- linearRegressionPermutationCollection.maxIterArray + regParam <- linearRegressionPermutationCollection.regParamArray + standardization <- linearRegressionPermutationCollection.standardizationArray + tolerance <- linearRegressionPermutationCollection.toleranceArray + } yield + 
LinearRegressionConfig( + elasticNetParams, + fitIntercept, + loss, + maxIter.toInt, + regParam, + standardization, + tolerance + ) + } + + protected[tools] def linearRegressionNumericArrayGenerator( + config: PermutationConfiguration + ): LinearRegressionNumericArrays = { + + LinearRegressionNumericArrays( + elasticNetParamsArray = generateLinearSpace( + extractContinuousBoundaries( + config.numericBoundaries("elasticNetParams") + ), + config.permutationTarget + ), + maxIterArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxIter")), + config.permutationTarget + ), + regParamArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("regParam")), + config.permutationTarget + ), + toleranceArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("tolerance")), + config.permutationTarget + ) + ) + } + + def linearRegressionPermutationGenerator( + config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L + ): Array[LinearRegressionConfig] = { + + // Get the number of permutations to generate + val numericPayloads = linearRegressionNumericArrayGenerator(config) + + val fullPermutationConfig = LinearRegressionPermutationCollection( + elasticNetParamsArray = numericPayloads.elasticNetParamsArray, + fitInterceptArray = Array(true, false), + lossArray = config.stringBoundaries("loss").toArray, + maxIterArray = numericPayloads.maxIterArray, + regParamArray = numericPayloads.regParamArray, + standardizationArray = Array(true, false), + toleranceArray = numericPayloads.toleranceArray + ) + + val permutationCollection = linearRegressionConfigGenerator( + fullPermutationConfig + ) + + randomSampleArray(permutationCollection, countTarget, seed) + } + + def convertLinearRegressionResultToConfig( + predictionDataFrame: DataFrame + ): Array[LinearRegressionConfig] = { + + val collectionBuffer = new ArrayBuffer[LinearRegressionConfig]() + + val dataCollection = 
predictionDataFrame + .select(getCaseClassNames[LinearRegressionConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + val lossType = x(2).toString + val eNetParams = lossType match { + case "huber" => 0.0 + case _ => x(0).toString.toDouble + } + + collectionBuffer += LinearRegressionConfig( + elasticNetParams = eNetParams, + fitIntercept = x(1).toString.toBoolean, + loss = lossType, + maxIter = x(3).toString.toInt, + regParam = x(4).toString.toDouble, + standardization = x(5).toString.toBoolean, + tolerance = x(6).toString.toDouble + ) + } + collectionBuffer.result.toArray + } + + // LOGISTIC REGRESSION METHODS + def logisticRegressionConfigGenerator( + logisticRegressionPermutationCollection: LogisticRegressionPermutationCollection + ): Array[LogisticRegressionConfig] = { + + for { + elasticNetParams <- logisticRegressionPermutationCollection.elasticNetParamsArray + fitIntercept <- logisticRegressionPermutationCollection.fitInterceptArray + maxIter <- logisticRegressionPermutationCollection.maxIterArray + regParam <- logisticRegressionPermutationCollection.regParamArray + standardization <- logisticRegressionPermutationCollection.standardizationArray + tolerance <- logisticRegressionPermutationCollection.toleranceArray + } yield + LogisticRegressionConfig( + elasticNetParams, + fitIntercept, + maxIter.toInt, + regParam, + standardization, + tolerance + ) + } + + protected[tools] def logisticRegressionNumericArrayGenerator( + config: PermutationConfiguration + ): LogisticRegressionNumericArrays = { + + LogisticRegressionNumericArrays( + elasticNetParamsArray = generateLinearSpace( + extractContinuousBoundaries( + config.numericBoundaries("elasticNetParams") + ), + config.permutationTarget + ), + maxIterArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxIter")), + config.permutationTarget + ), + regParamArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("regParam")), + 
config.permutationTarget + ), + toleranceArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("tolerance")), + config.permutationTarget + ) + ) + } + + def logisticRegressionPermutationGenerator( + config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L + ): Array[LogisticRegressionConfig] = { + + // Get the number of permutations to generate + val numericPayloads = logisticRegressionNumericArrayGenerator(config) + + val fullPermutationConfig = LogisticRegressionPermutationCollection( + elasticNetParamsArray = numericPayloads.elasticNetParamsArray, + fitInterceptArray = Array(true, false), + maxIterArray = numericPayloads.maxIterArray, + regParamArray = numericPayloads.regParamArray, + standardizationArray = Array(true, false), + toleranceArray = numericPayloads.toleranceArray + ) + + val permutationCollection = logisticRegressionConfigGenerator( + fullPermutationConfig + ) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertLogisticRegressionResultToConfig( + predictionDataFrame: DataFrame + ): Array[LogisticRegressionConfig] = { + + val collectionBuffer = new ArrayBuffer[LogisticRegressionConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[LogisticRegressionConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += LogisticRegressionConfig( + elasticNetParams = x(0).toString.toDouble, + fitIntercept = x(1).toString.toBoolean, + maxIter = x(2).toString.toInt, + regParam = x(3).toString.toDouble, + standardization = x(4).toString.toBoolean, + tolerance = x(5).toString.toDouble + ) + } + collectionBuffer.result.toArray + } + + // SUPPORT VECTOR MACHINE METHODS + def svmConfigGenerator( + svmPermutationCollection: SVMPermutationCollection + ): Array[SVMConfig] = { + + for { + fitIntercept <- svmPermutationCollection.fitInterceptArray + maxIter <- svmPermutationCollection.maxIterArray + regParam <- 
svmPermutationCollection.regParamArray + standardization <- svmPermutationCollection.standardizationArray + tolerance <- svmPermutationCollection.toleranceArray + } yield + SVMConfig( + fitIntercept, + maxIter.toInt, + regParam, + standardization, + tolerance + ) + } + + protected[tools] def svmNumericArrayGenerator( + config: PermutationConfiguration + ): SVMNumericArrays = { + + SVMNumericArrays( + maxIterArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxIter")), + config.permutationTarget + ), + regParamArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("regParam")), + config.permutationTarget + ), + toleranceArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("tolerance")), + config.permutationTarget + ) + ) + } + + def svmPermutationGenerator(config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L): Array[SVMConfig] = { + + // Get the number of permutations to generate + val numericPayloads = svmNumericArrayGenerator(config) + + val fullPermutationConfig = SVMPermutationCollection( + fitInterceptArray = Array(true, false), + maxIterArray = numericPayloads.maxIterArray, + regParamArray = numericPayloads.regParamArray, + standardizationArray = Array(true, false), + toleranceArray = numericPayloads.toleranceArray + ) + + val permutationCollection = svmConfigGenerator(fullPermutationConfig) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertSVMResultToConfig( + predictionDataFrame: DataFrame + ): Array[SVMConfig] = { + + val collectionBuffer = new ArrayBuffer[SVMConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[SVMConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += SVMConfig( + fitIntercept = x(0).toString.toBoolean, + maxIter = x(1).toString.toInt, + regParam = x(2).toString.toDouble, + standardization = x(3).toString.toBoolean, + 
tolerance = x(4).toString.toDouble + ) + } + collectionBuffer.result.toArray + } + + // XGBOOST METHODS + def xgboostConfigGenerator( + xgboostPermutationCollection: XGBoostPermutationCollection + ): Array[XGBoostConfig] = { + + for { + alpha <- xgboostPermutationCollection.alphaArray + eta <- xgboostPermutationCollection.etaArray + gamma <- xgboostPermutationCollection.gammaArray + lambda <- xgboostPermutationCollection.lambdaArray + maxDepth <- xgboostPermutationCollection.maxDepthArray + subSample <- xgboostPermutationCollection.subSampleArray + minChildWeight <- xgboostPermutationCollection.minChildWeightArray + numRound <- xgboostPermutationCollection.numRoundArray + maxBins <- xgboostPermutationCollection.maxBinsArray + trainTestRatio <- xgboostPermutationCollection.trainTestRatioArray + } yield + XGBoostConfig( + alpha, + eta, + gamma, + lambda, + maxDepth.toInt, + subSample, + minChildWeight, + numRound.toInt, + maxBins.toInt, + trainTestRatio + ) + } + + protected[tools] def xgboostNumericArrayGenerator( + config: PermutationConfiguration + ): XGBoostNumericArrays = { + + XGBoostNumericArrays( + alphaArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("alpha")), + config.permutationTarget + ), + etaArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("eta")), + config.permutationTarget + ), + gammaArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("gamma")), + config.permutationTarget + ), + lambdaArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("lambda")), + config.permutationTarget + ), + maxDepthArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxDepth")), + config.permutationTarget + ), + subSampleArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("subSample")), + config.permutationTarget + ), + minChildWeightArray = generateLinearSpace( + 
extractContinuousBoundaries(config.numericBoundaries("minChildWeight")), + config.permutationTarget + ), + numRoundArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("numRound")), + config.permutationTarget + ), + maxBinsArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxBins")), + config.permutationTarget + ), + trainTestRatioArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("trainTestRatio")), + config.permutationTarget + ) + ) + } + + def xgboostPermutationGenerator(config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L): Array[XGBoostConfig] = { + + // Get the number of permutations to generate + val numericPayloads = xgboostNumericArrayGenerator(config) + + val fullPermutationConfig = XGBoostPermutationCollection( + alphaArray = numericPayloads.alphaArray, + etaArray = numericPayloads.etaArray, + gammaArray = numericPayloads.gammaArray, + lambdaArray = numericPayloads.lambdaArray, + maxDepthArray = numericPayloads.maxDepthArray, + subSampleArray = numericPayloads.subSampleArray, + minChildWeightArray = numericPayloads.minChildWeightArray, + numRoundArray = numericPayloads.numRoundArray, + maxBinsArray = numericPayloads.maxBinsArray, + trainTestRatioArray = numericPayloads.trainTestRatioArray + ) + + val permutationCollection = xgboostConfigGenerator(fullPermutationConfig) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertXGBoostResultToConfig( + predictionDataFrame: DataFrame + ): Array[XGBoostConfig] = { + + val collectionBuffer = new ArrayBuffer[XGBoostConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[XGBoostConfig] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += XGBoostConfig( + alpha = x(0).toString.toDouble, + eta = x(1).toString.toDouble, + gamma = x(2).toString.toDouble, + lambda = x(3).toString.toDouble, + maxDepth = 
x(4).toString.toInt, + subSample = x(5).toString.toDouble, + minChildWeight = x(6).toString.toDouble, + numRound = x(7).toString.toInt, + maxBins = x(8).toString.toInt, + trainTestRatio = x(9).toString.toDouble + ) + } + collectionBuffer.result.toArray + } + + // MULTILAYER PERCEPTRON CLASSIFIER METHODS + + def mlpcConfigGenerator( + mlpcPermutationCollection: MLPCPermutationCollection + ): Array[MLPCModelingConfig] = { + + for { + layerCount <- mlpcPermutationCollection.layerCountArray + layers <- mlpcPermutationCollection.layersArray + maxIter <- mlpcPermutationCollection.maxIterArray + solver <- mlpcPermutationCollection.solverArray + stepSize <- mlpcPermutationCollection.stepSizeArray + tolerance <- mlpcPermutationCollection.toleranceArray + hiddenLayerSizeAdjust <- mlpcPermutationCollection.hiddenLayerSizeAdjustArray + + } yield + MLPCModelingConfig( + layerCount.toInt, + layers, + maxIter.toInt, + solver, + stepSize, + tolerance, + hiddenLayerSizeAdjust.toInt + ) + } + + case class MLPCModelingConfig(layerCount: Int, + layers: Array[Int], + maxIter: Int, + solver: String, + stepSize: Double, + tolerance: Double, + hiddenLayerSizeAdjust: Int) + + protected[tools] def mlpcNumericArrayGenerator( + config: MLPCPermutationConfiguration + ): MLPCNumericArrays = { + + MLPCNumericArrays( + layersArray = generateArraySpace( + config.numericBoundaries("layers")._1.toInt, + config.numericBoundaries("layers")._2.toInt, + config.numericBoundaries("hiddenLayerSizeAdjust")._1.toInt, + config.numericBoundaries("hiddenLayerSizeAdjust")._2.toInt, + config.inputFeatureSize, + config.distinctClasses + 1, + config.permutationTarget + ), + maxIterArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxIter")), + config.permutationTarget + ), + stepSizeArray = generateLinearSpace( + extractContinuousBoundaries(config.numericBoundaries("stepSize")), + config.permutationTarget + ), + toleranceArray = generateLinearSpace( + 
extractContinuousBoundaries(config.numericBoundaries("tolerance")), + config.permutationTarget + ) + ) + } + + def mlpcPermutationGenerator(config: MLPCPermutationConfiguration, + countTarget: Int, + seed: Long = 42L): Array[MLPCModelingConfig] = { + + // Get the number of permutations to generate + val numericPayloads = mlpcNumericArrayGenerator(config) + + val layerCountBuffer = ArrayBuffer[Int]() + val hiddenLayerBuffer = ArrayBuffer[Int]() + + numericPayloads.layersArray.foreach { x => + val layerCountCalc = x.length - 2 + val hiddenLayerCalc = x(1) - x(0) + layerCountBuffer += layerCountCalc + hiddenLayerBuffer += hiddenLayerCalc + } + + val fullPermutationConfig = MLPCPermutationCollection( + layerCountArray = layerCountBuffer.toArray, + layersArray = numericPayloads.layersArray, + maxIterArray = numericPayloads.maxIterArray, + solverArray = config.stringBoundaries("solver").toArray, + stepSizeArray = numericPayloads.stepSizeArray, + toleranceArray = numericPayloads.toleranceArray, + hiddenLayerSizeAdjustArray = hiddenLayerBuffer.toArray + ) + + val permutationCollection = mlpcConfigGenerator(fullPermutationConfig) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertMLPCResultToConfig(predictionDataFrame: DataFrame, + inputFeatureSize: Int, + distinctClasses: Int): Array[MLPCConfig] = { + + val collectionBuffer = new ArrayBuffer[MLPCConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[MLPCGenerator] map col: _*) + .collect() + + dataCollection.foreach { x => + collectionBuffer += MLPCConfig( + layers = constructLayerArray( + inputFeatureSize, + distinctClasses, + x(0).toString.toInt, + x(1).toString.toInt + ), + maxIter = x(2).toString.toInt, + solver = x(3).toString, + stepSize = x(4).toString.toDouble, + tolerance = x(5).toString.toDouble + ) + } + collectionBuffer.result.toArray + } + + // LightGBM METHODS + + def lightGBMConfigGenerator( + lightGBMPermutationCollection: 
LightGBMPermutationCollection + ): Array[LightGBMConfig] = { + + for { + baggingFraction <- lightGBMPermutationCollection.baggingFractionArray + baggingFreq <- lightGBMPermutationCollection.baggingFreqArray + featureFraction <- lightGBMPermutationCollection.featureFractionArray + learningRate <- lightGBMPermutationCollection.learningRateArray + maxBin <- lightGBMPermutationCollection.maxBinArray + maxDepth <- lightGBMPermutationCollection.maxDepthArray + minSumHessianInLeaf <- lightGBMPermutationCollection.minSumHessianInLeafArray + numIterations <- lightGBMPermutationCollection.numIterationsArray + numLeaves <- lightGBMPermutationCollection.numLeavesArray + boostFromAverage <- lightGBMPermutationCollection.boostFromAverageArray + lambdaL1 <- lightGBMPermutationCollection.lambdaL1Array + lambdaL2 <- lightGBMPermutationCollection.lambdaL2Array + alpha <- lightGBMPermutationCollection.alphaArray + boostingType <- lightGBMPermutationCollection.boostingTypeArray + + } yield + LightGBMConfig( + baggingFraction, + baggingFreq.toInt, + featureFraction, + learningRate, + maxBin.toInt, + maxDepth.toInt, + minSumHessianInLeaf, + numIterations.toInt, + numLeaves.toInt, + boostFromAverage.toString.toBoolean, + lambdaL1, + lambdaL2, + alpha, + boostingType + ) + + } + + protected[tools] def lightGBMNumericArrayGenerator( + config: PermutationConfiguration + ): LightGBMNumericArrays = { + + LightGBMNumericArrays( + baggingFractionArray = generateLinearSpace( + extractContinuousBoundaries( + config.numericBoundaries("baggingFraction") + ), + config.permutationTarget + ), + baggingFreqArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("baggingFreq")), + config.permutationTarget + ), + featureFractionArray = generateLinearSpace( + extractContinuousBoundaries( + config.numericBoundaries("featureFraction") + ), + config.permutationTarget + ), + learningRateArray = generateLinearSpace( + 
extractContinuousBoundaries(config.numericBoundaries("learningRate")), + config.permutationTarget + ), + maxBinArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxBin")), + config.permutationTarget + ), + maxDepthArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("maxDepth")), + config.permutationTarget + ), + minSumHessianInLeafArray = generateLinearSpace( + extractContinuousBoundaries( + config.numericBoundaries("minSumHessianInLeaf") + ), + config.permutationTarget + ), + numIterationsArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("numIterations")), + config.permutationTarget + ), + numLeavesArray = generateLinearIntSpace( + extractContinuousBoundaries(config.numericBoundaries("numLeaves")), + config.permutationTarget + ), + lambdaL1Array = generateLogSpace( + extractContinuousBoundaries(config.numericBoundaries("lambdaL1")), + config.permutationTarget + ), + lambdaL2Array = generateLogSpace( + extractContinuousBoundaries(config.numericBoundaries("lambdaL2")), + config.permutationTarget + ), + alphaArray = generateLogSpace( + extractContinuousBoundaries(config.numericBoundaries("alpha")), + config.permutationTarget + ) + ) + + } + + def lightGBMPermutationGenerator(config: PermutationConfiguration, + countTarget: Int, + seed: Long = 42L): Array[LightGBMConfig] = { + + val numericPayloads = lightGBMNumericArrayGenerator(config) + + val fullPermutationConfig = LightGBMPermutationCollection( + baggingFractionArray = numericPayloads.baggingFractionArray, + baggingFreqArray = numericPayloads.baggingFreqArray, + featureFractionArray = numericPayloads.featureFractionArray, + learningRateArray = numericPayloads.learningRateArray, + maxBinArray = numericPayloads.maxBinArray, + maxDepthArray = numericPayloads.maxDepthArray, + minSumHessianInLeafArray = numericPayloads.minSumHessianInLeafArray, + numIterationsArray = numericPayloads.numIterationsArray, + 
numLeavesArray = numericPayloads.numLeavesArray, + boostFromAverageArray = Array(true, false), + lambdaL1Array = numericPayloads.lambdaL1Array, + lambdaL2Array = numericPayloads.lambdaL2Array, + alphaArray = numericPayloads.alphaArray, + boostingTypeArray = config.stringBoundaries("boostingType").toArray + ) + + val permutationCollection = lightGBMConfigGenerator(fullPermutationConfig) + + randomSampleArray(permutationCollection, countTarget, seed) + + } + + def convertLightGBMResultToConfig( + predictionDataFrame: DataFrame + ): Array[LightGBMConfig] = { + + val collectionBuffer = new ArrayBuffer[LightGBMConfig]() + + val dataCollection = predictionDataFrame + .select(getCaseClassNames[LightGBMConfig] map col: _*) + .collect() + + dataCollection.map( + x => + LightGBMConfig( + baggingFraction = x(0).toString.toDouble, + baggingFreq = x(1).toString.toInt, + featureFraction = x(2).toString.toDouble, + learningRate = x(3).toString.toDouble, + maxBin = x(4).toString.toInt, + maxDepth = x(5).toString.toInt, + minSumHessianInLeaf = x(6).toString.toDouble, + numIterations = x(7).toString.toInt, + numLeaves = x(8).toString.toInt, + boostFromAverage = x(9).toString.toBoolean, + lambdaL1 = x(10).toString.toDouble, + lambdaL2 = x(11).toString.toDouble, + alpha = x(12).toString.toDouble, + boostingType = x(13).toString + ) + ) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/structures/ModelConfigStructures.scala b/src/main/scala/com/databricks/labs/automl/model/tools/structures/ModelConfigStructures.scala new file mode 100644 index 00000000..9dca2d67 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/structures/ModelConfigStructures.scala @@ -0,0 +1,286 @@ +package com.databricks.labs.automl.model.tools.structures + +import com.databricks.labs.automl.params.MLPCConfig + +case class NumericBoundaries(minimum: Double, maximum: Double) + +case class NumericArrayCollection(selectedPayload: Array[Double], + remainingPayload: 
Array[Array[Double]]) +case class StringSelectionReturn(selectedStringValue: String, + IndexCounterStatus: Int) + +case class PermutationConfiguration( + modelType: String, + permutationTarget: Int, + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]] +) + +case class MLPCPermutationConfiguration( + permutationTarget: Int, + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]], + inputFeatureSize: Int, + distinctClasses: Int +) + +// RANDOM FOREST +case class RandomForestPermutationCollection( + numTreesArray: Array[Double], + maxBinsArray: Array[Double], + maxDepthArray: Array[Double], + minInfoGainArray: Array[Double], + subSamplingRateArray: Array[Double], + impurityArray: Array[String], + featureSubsetStrategyArray: Array[String] +) + +case class RandomForestNumericArrays(numTreesArray: Array[Double], + maxBinsArray: Array[Double], + maxDepthArray: Array[Double], + minInfoGainArray: Array[Double], + subSamplingRateArray: Array[Double]) + +case class RandomForestModelRunReport(numTrees: Int, + impurity: String, + maxBins: Int, + maxDepth: Int, + minInfoGain: Double, + subSamplingRate: Double, + featureSubsetStrategy: String, + score: Double) + +//DECISION TREES +case class TreesPermutationCollection(impurityArray: Array[String], + maxBinsArray: Array[Double], + maxDepthArray: Array[Double], + minInfoGainArray: Array[Double], + minInstancesPerNodeArray: Array[Double]) + +case class TreesNumericArrays(maxBinsArray: Array[Double], + maxDepthArray: Array[Double], + minInfoGainArray: Array[Double], + minInstancesPerNodeArray: Array[Double]) + +case class TreesModelRunReport(impurity: String, + maxBins: Int, + maxDepth: Int, + minInfoGain: Double, + minInstancesPerNode: Double, + score: Double) + +//GRADIENT BOOSTED TREES +case class GBTPermutationCollection(impurityArray: Array[String], + lossTypeArray: Array[String], + maxBinsArray: Array[Double], + maxDepthArray: Array[Double], 
+ maxIterArray: Array[Double], + minInfoGainArray: Array[Double], + minInstancesPerNodeArray: Array[Double], + stepSizeArray: Array[Double]) + +case class GBTNumericArrays(maxBinsArray: Array[Double], + maxDepthArray: Array[Double], + maxIterArray: Array[Double], + minInfoGainArray: Array[Double], + minInstancesPerNodeArray: Array[Double], + stepSizeArray: Array[Double]) + +case class GBTModelRunReport(impurity: String, + lossType: String, + maxBins: Int, + maxDepth: Int, + maxIter: Int, + minInfoGain: Double, + minInstancesPerNode: Double, + stepSize: Double, + score: Double) + +//LINEAR REGRESSION +case class LinearRegressionPermutationCollection( + elasticNetParamsArray: Array[Double], + fitInterceptArray: Array[Boolean], + lossArray: Array[String], + maxIterArray: Array[Double], + regParamArray: Array[Double], + standardizationArray: Array[Boolean], + toleranceArray: Array[Double] +) + +case class LinearRegressionNumericArrays(elasticNetParamsArray: Array[Double], + maxIterArray: Array[Double], + regParamArray: Array[Double], + toleranceArray: Array[Double]) + +case class LinearRegressionModelRunReport(elasticNetParams: Double, + fitIntercept: Boolean, + loss: String, + maxIter: Int, + regParam: Double, + standardization: Boolean, + tolerance: Double, + score: Double) + +//LOGISTIC REGRESSION +case class LogisticRegressionPermutationCollection( + elasticNetParamsArray: Array[Double], + fitInterceptArray: Array[Boolean], + maxIterArray: Array[Double], + regParamArray: Array[Double], + standardizationArray: Array[Boolean], + toleranceArray: Array[Double] +) + +case class LogisticRegressionNumericArrays(elasticNetParamsArray: Array[Double], + maxIterArray: Array[Double], + regParamArray: Array[Double], + toleranceArray: Array[Double]) + +case class LogisticRegressionModelRunReport(elasticNetParams: Double, + fitIntercept: Boolean, + maxIter: Int, + regParam: Double, + standardization: Boolean, + tolerance: Double, + score: Double) + +//SVM +case class 
SVMPermutationCollection(fitInterceptArray: Array[Boolean], + maxIterArray: Array[Double], + regParamArray: Array[Double], + standardizationArray: Array[Boolean], + toleranceArray: Array[Double]) + +case class SVMNumericArrays(maxIterArray: Array[Double], + regParamArray: Array[Double], + toleranceArray: Array[Double]) + +case class SVMModelRunReport(fitIntercept: Boolean, + maxIter: Int, + regParam: Double, + standardization: Boolean, + tolerance: Double, + score: Double) + +//MLPC + +case class MLPCGenerator(layerCount: Int, + hiddenLayerSizeAdjust: Int, + maxIter: Int, + solver: String, + stepSize: Double, + tolerance: Double) + +case class MLPCPermutationCollection(layerCountArray: Array[Int], + layersArray: Array[Array[Int]], + maxIterArray: Array[Double], + solverArray: Array[String], + stepSizeArray: Array[Double], + toleranceArray: Array[Double], + hiddenLayerSizeAdjustArray: Array[Int]) + +case class MLPCModelingConfig(layerCount: Int, + layers: Array[Int], + maxIter: Int, + solver: String, + stepSize: Double, + tolerance: Double, + hiddenLayerSizeAdjust: Int) + +case class MLPCNumericArrays(layersArray: Array[Array[Int]], + maxIterArray: Array[Double], + stepSizeArray: Array[Double], + toleranceArray: Array[Double]) + +case class MLPCModelRunReport(layers: Int, + maxIter: Int, + solver: String, + stepSize: Double, + tolerance: Double, + hiddenLayerSizeAdjust: Int, + score: Double) + +case class MLPCArrayCollection(selectedPayload: MLPCConfig, + remainingPayloads: MLPCNumericArrays) + +//XGBOOST +case class XGBoostPermutationCollection(alphaArray: Array[Double], + etaArray: Array[Double], + gammaArray: Array[Double], + lambdaArray: Array[Double], + maxDepthArray: Array[Double], + subSampleArray: Array[Double], + minChildWeightArray: Array[Double], + numRoundArray: Array[Double], + maxBinsArray: Array[Double], + trainTestRatioArray: Array[Double]) + +case class XGBoostNumericArrays(alphaArray: Array[Double], + etaArray: Array[Double], + gammaArray: 
Array[Double], + lambdaArray: Array[Double], + maxDepthArray: Array[Double], + subSampleArray: Array[Double], + minChildWeightArray: Array[Double], + numRoundArray: Array[Double], + maxBinsArray: Array[Double], + trainTestRatioArray: Array[Double]) + +case class XGBoostModelRunReport(alpha: Double, + eta: Double, + gamma: Double, + lambda: Double, + maxDepth: Int, + subSample: Double, + minChildWeight: Double, + numRound: Int, + maxBins: Int, + trainTestRatio: Double, + score: Double) + +//LightGBM +case class LightGBMPermutationCollection( + baggingFractionArray: Array[Double], + baggingFreqArray: Array[Double], + featureFractionArray: Array[Double], + learningRateArray: Array[Double], + maxBinArray: Array[Double], + maxDepthArray: Array[Double], + minSumHessianInLeafArray: Array[Double], + numIterationsArray: Array[Double], + numLeavesArray: Array[Double], + boostFromAverageArray: Array[Boolean], + lambdaL1Array: Array[Double], + lambdaL2Array: Array[Double], + alphaArray: Array[Double], + boostingTypeArray: Array[String] +) + +case class LightGBMNumericArrays(baggingFractionArray: Array[Double], + baggingFreqArray: Array[Double], + featureFractionArray: Array[Double], + learningRateArray: Array[Double], + maxBinArray: Array[Double], + maxDepthArray: Array[Double], + minSumHessianInLeafArray: Array[Double], + numIterationsArray: Array[Double], + numLeavesArray: Array[Double], + lambdaL1Array: Array[Double], + lambdaL2Array: Array[Double], + alphaArray: Array[Double]) + +case class LightGBMModelRunReport(baggingFraction: Double, + baggingFreq: Int, + featureFraction: Double, + learningRate: Double, + maxBin: Int, + maxDepth: Int, + minSumHessianInLeaf: Double, + numIterations: Int, + numLeaves: Int, + boostFromAverage: Boolean, + lambdaL1: Double, + lambdaL2: Double, + alpha: Double, + boostingType: String, + score: Double) diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/structures/SeedGenerator.scala 
b/src/main/scala/com/databricks/labs/automl/model/tools/structures/SeedGenerator.scala new file mode 100644 index 00000000..622c0622 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/structures/SeedGenerator.scala @@ -0,0 +1,281 @@ +package com.databricks.labs.automl.model.tools.structures + +import com.databricks.labs.automl.params.MLPCConfig + +import scala.collection.mutable.ArrayBuffer +import scala.math._ +import scala.reflect.ClassTag +import scala.util.Random + +trait SeedGenerator { + + def generateLinearIntSpace(boundaries: NumericBoundaries, + generatorCount: Int): Array[Double] = { + + val integerSpace = new ArrayBuffer[Double]() + + val generatedDoubles = generateLinearSpace(boundaries, generatorCount) + + generatedDoubles.foreach { x => + integerSpace += x.round + } + + integerSpace.result.toArray + + } + + def generateLinearSpace(boundaries: NumericBoundaries, + generatorCount: Int): Array[Double] = { + + val space = new ArrayBuffer[Double] + + val iteratorDelta = (boundaries.maximum - boundaries.minimum) / (generatorCount.toDouble - 1.0) + + for (i <- 0 until generatorCount - 1) { + space += boundaries.minimum + i * iteratorDelta + } + space += boundaries.maximum + space.result.toArray + } + + def convertToLog(minScale: Double, + maxScale: Double, + value: Double): Double = { + + val minVal = if (minScale == 0.0) 1.0E-10 else minScale + + val b = log(maxScale / minVal) / (maxScale - minVal) + + val a = maxScale / exp(b * maxScale) + + a * exp(b * value) + } + + def generateLogSpace(boundaries: NumericBoundaries, + generatorCount: Int): Array[Double] = { + + val space = new ArrayBuffer[Double] + + val linearSpace = generateLinearSpace(boundaries, generatorCount) + + linearSpace.foreach { x => + space += convertToLog(boundaries.minimum, boundaries.maximum, x) + } + + space.result.toArray + + } + + private[tools] def constructLayerArray(inputFeatureSize: Int, + distinctClasses: Int, + layerCount: Int, + sizeAdjustment: Int): 
Array[Int] = { + + val layerConstruct = new ArrayBuffer[Int] + + layerConstruct += inputFeatureSize + + (1 to layerCount).foreach( + x => layerConstruct += inputFeatureSize + layerCount - x + sizeAdjustment + ) + layerConstruct += distinctClasses + layerConstruct.result.toArray + } + + def generateArraySpace(layerBoundaryLow: Int, + layerBoundaryHigh: Int, + hiddenBoundaryLow: Int, + hiddenBoundaryHigh: Int, + inputFeatureSize: Int, + distinctClasses: Int, + generatorCount: Int): Array[Array[Int]] = { + + val outputBuffer = new ArrayBuffer[Array[Int]]() + + // Generate the layer Boundary space and the size adjustment space + val layerBoundaries = generateLinearIntSpace( + NumericBoundaries(layerBoundaryLow, layerBoundaryHigh), + generatorCount + ) + val sizeAdjustmentBoundaries = generateLinearIntSpace( + NumericBoundaries(hiddenBoundaryLow, hiddenBoundaryHigh), + generatorCount + ) + + (0 until generatorCount).foreach { x => + outputBuffer += constructLayerArray( + inputFeatureSize, + distinctClasses, + layerBoundaries(x).toInt, + sizeAdjustmentBoundaries(x).toInt + ) + } + outputBuffer.result.toArray + } + + private[SeedGenerator] def getNthRoot(n: Double, root: Double): Double = { + pow(exp(1.0 / root), log(n)) + } + + def getNumberOfElements( + numericBoundaries: Map[String, (Double, Double)] + ): Int = { + numericBoundaries.keys.size + } + + def getPermutationCounts(targetIterations: Int, + numberOfElements: Int): Int = { + + getNthRoot(targetIterations.toDouble, numberOfElements.toDouble).ceil.toInt + + } + + protected[tools] def randomSampleArray[T: ClassTag]( + hyperParameterArray: Array[T], + sampleCount: Int, + seed: Long = 42L + ): Array[T] = { + + val randomSeed = new Random(seed) + Array.fill(sampleCount)( + hyperParameterArray(randomSeed.nextInt(hyperParameterArray.length)) + ) + + } + + protected[tools] def extractContinuousBoundaries( + parameter: Tuple2[Double, Double] + ): NumericBoundaries = { + NumericBoundaries(parameter._1, parameter._2) + } + 
+ protected[tools] def selectStringIndex( + availableParams: List[String], + currentIterator: Int + ): StringSelectionReturn = { + val listLength = availableParams.length + val idxSelection = if (currentIterator >= listLength) 0 else currentIterator + + StringSelectionReturn(availableParams(idxSelection), idxSelection) + } + + protected[tools] def selectCoinFlip(currentIterator: Int): Boolean = { + currentIterator % 2 == 0 + } + + protected[tools] def staticIndexSelection( + numericArrays: Array[Array[Double]] + ): NumericArrayCollection = { + + val selectedPayload = numericArrays.map(x => x(0)) + + val remainingArrays = numericArrays.map(x => x.drop(1)) + + NumericArrayCollection(selectedPayload, remainingArrays) + + } + + protected[tools] def randomIndexSelection( + numericArrays: Array[Array[Double]] + ): NumericArrayCollection = { + + val bufferContainer = new ArrayBuffer[Array[Double]]() + + numericArrays.foreach { x => + bufferContainer += Random.shuffle(x.toList).toArray + } + + val arrayRandomHolder = bufferContainer.result.toArray + + val randomlySelectedPayload = arrayRandomHolder.map(x => x(0)) + + val remainingArrays = arrayRandomHolder.map(x => x.drop(1)) + + NumericArrayCollection(randomlySelectedPayload, remainingArrays) + + } + + protected[tools] def mlpcStaticIndexSelection( + numericArrays: MLPCNumericArrays + ): MLPCArrayCollection = { + + val layersArray = numericArrays.layersArray + val selectedLayers = layersArray.take(1)(0) + val updatedLayersArray = layersArray.drop(1) + + val maxIterArray = numericArrays.maxIterArray + val selectedMaxIter = maxIterArray.take(1)(0) + val updatedMaxIterArray = maxIterArray.drop(1) + + val stepSizeArray = numericArrays.stepSizeArray + val selectedStepSize = stepSizeArray.take(1)(0) + val updatedStepSizeArray = stepSizeArray.drop(1) + + val tolArray = numericArrays.toleranceArray + val selectedTol = tolArray.take(1)(0) + val updateTolArray = tolArray.drop(1) + + val
remainingArrays = MLPCNumericArrays( + layersArray = updatedLayersArray, + maxIterArray = updatedMaxIterArray, + stepSizeArray = updatedStepSizeArray, + toleranceArray = updateTolArray + ) + + MLPCArrayCollection( + MLPCConfig( + layers = selectedLayers, + maxIter = selectedMaxIter.toInt, + solver = "placeholder", + stepSize = selectedStepSize, + tolerance = selectedTol + ), + remainingArrays + ) + + } + + protected[tools] def mlpcLayersExtractor(layers: Array[Int]): (Int, Int) = { + + val hiddenLayersSizeAdjust = + if (layers.length > 2) layers(1) - layers(0) else 0 + val layerCount = layers.length - 2 + + (layerCount, hiddenLayersSizeAdjust) + } + + protected[tools] def mlpcRandomIndexSelection( + numericArrays: MLPCNumericArrays + ): MLPCArrayCollection = { + + val shuffledData = MLPCNumericArrays( + layersArray = Random.shuffle(numericArrays.layersArray.toList).toArray, + maxIterArray = Random.shuffle(numericArrays.maxIterArray.toList).toArray, + stepSizeArray = Random.shuffle(numericArrays.stepSizeArray.toList).toArray, + toleranceArray = + Random.shuffle(numericArrays.toleranceArray.toList).toArray + ) + + mlpcStaticIndexSelection(shuffledData) + } + + /** + * Calculates the number of possible additional permutations to be added to the search space for string values + * @param stringBoundaries The string boundary payload for a modeling family + * @return Int representing any additional permutations on the numeric body that will need to be generated in order + * to attempt to reach the target unique hyperparameter search space + */ + protected[tools] def stringBoundaryPermutationCalculator( + stringBoundaries: Map[String, List[String]] + ): Int = { + + var uniqueValues = 0 + + stringBoundaries.foreach { x => + uniqueValues += x._2.length - 1 + } + + uniqueValues + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/model/tools/structures/SplitUtilityStructures.scala 
b/src/main/scala/com/databricks/labs/automl/model/tools/structures/SplitUtilityStructures.scala new file mode 100644 index 00000000..58b6b03c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/model/tools/structures/SplitUtilityStructures.scala @@ -0,0 +1,11 @@ +package com.databricks.labs.automl.model.tools.structures + +import org.apache.spark.sql.DataFrame + +case class TrainSplitReferences(kIndex: Int, + data: TrainTestData, + paths: TrainTestPaths) + +case class TrainTestData(train: DataFrame, test: DataFrame) + +case class TrainTestPaths(train: String, test: String) diff --git a/src/main/scala/com/databricks/labs/automl/params/Configuration.scala b/src/main/scala/com/databricks/labs/automl/params/Configuration.scala new file mode 100755 index 00000000..d1a37adc --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/params/Configuration.scala @@ -0,0 +1,159 @@ +package com.databricks.labs.automl.params + +case class MainConfig(modelFamily: String, + labelCol: String, + featuresCol: String, + naFillFlag: Boolean, + varianceFilterFlag: Boolean, + outlierFilterFlag: Boolean, + pearsonFilteringFlag: Boolean, + covarianceFilteringFlag: Boolean, + oneHotEncodeFlag: Boolean, + scalingFlag: Boolean, + featureInteractionFlag: Boolean, + dataPrepCachingFlag: Boolean, + dataPrepParallelism: Int, + autoStoppingFlag: Boolean, + autoStoppingScore: Double, + featureImportanceCutoffType: String, + featureImportanceCutoffValue: Double, + dateTimeConversionType: String, + fieldsToIgnoreInVector: Array[String], + numericBoundaries: Map[String, (Double, Double)], + stringBoundaries: Map[String, List[String]], + scoringMetric: String, + scoringOptimizationStrategy: String, + fillConfig: FillConfig, + outlierConfig: OutlierConfig, + pearsonConfig: PearsonConfig, + covarianceConfig: CovarianceConfig, + featureInteractionConfig: FeatureInteractionConfig, + scalingConfig: ScalingConfig, + geneticConfig: GeneticConfig, + mlFlowLoggingFlag: Boolean, + 
mlFlowLogArtifactsFlag: Boolean, + mlFlowConfig: MLFlowConfig, + inferenceConfigSaveLocation: String, + dataReductionFactor: Double, + pipelineDebugFlag: Boolean, + pipelineId: String) + +case class DataPrepConfig(naFillFlag: Boolean, + varianceFilterFlag: Boolean, + outlierFilterFlag: Boolean, + pearsonFilterFlag: Boolean, + covarianceFilterFlag: Boolean, + scalingFlag: Boolean) + +case class MLFlowConfig(mlFlowTrackingURI: String, + mlFlowExperimentName: String, + mlFlowAPIToken: String, + mlFlowModelSaveDirectory: String, + mlFlowLoggingMode: String, + mlFlowBestSuffix: String, + mlFlowCustomRunTags: Map[String, String]) + +case class FillConfig(numericFillStat: String, + characterFillStat: String, + modelSelectionDistinctThreshold: Int, + cardinalitySwitch: Boolean, + cardinalityType: String, + cardinalityLimit: Int, + cardinalityPrecision: Double, + cardinalityCheckMode: String, + filterPrecision: Double, + categoricalNAFillMap: Map[String, String], + numericNAFillMap: Map[String, AnyVal], + characterNABlanketFillValue: String, + numericNABlanketFillValue: Double, + naFillMode: String) + +case class OutlierConfig(filterBounds: String, + lowerFilterNTile: Double, + upperFilterNTile: Double, + filterPrecision: Double, + continuousDataThreshold: Int, + fieldsToIgnore: Array[String]) + +case class PearsonConfig(filterStatistic: String, + filterDirection: String, + filterManualValue: Double, + filterMode: String, + autoFilterNTile: Double) + +case class CovarianceConfig(correlationCutoffLow: Double, + correlationCutoffHigh: Double) + +case class FirstGenerationConfig(permutationCount: Int, + indexMixingMode: String, + arraySeed: Long) + +case class KSampleConfig(syntheticCol: String, + kGroups: Int, + kMeansMaxIter: Int, + kMeansTolerance: Double, + kMeansDistanceMeasurement: String, + kMeansSeed: Long, + kMeansPredictionCol: String, + lshHashTables: Int, + lshSeed: Long, + lshOutputCol: String, + quorumCount: Int, + minimumVectorCountToMutate: Int, + 
vectorMutationMethod: String, + mutationMode: String, + mutationValue: Double, + labelBalanceMode: String, + cardinalityThreshold: Int, + numericRatio: Double, + numericTarget: Int, + outputDfRepartitionScaleFactor: Int) + +case class GeneticConfig(parallelism: Int, + kFold: Int, + trainPortion: Double, + trainSplitMethod: String, + kSampleConfig: KSampleConfig, + trainSplitChronologicalColumn: String, + trainSplitChronologicalRandomPercentage: Double, + seed: Long, + firstGenerationGenePool: Int, + numberOfGenerations: Int, + numberOfParentsToRetain: Int, + numberOfMutationsPerGeneration: Int, + geneticMixing: Double, + generationalMutationStrategy: String, + fixedMutationValue: Int, + mutationMagnitudeMode: String, + evolutionStrategy: String, + geneticMBORegressorType: String, + geneticMBOCandidateFactor: Int, + continuousEvolutionMaxIterations: Int, + continuousEvolutionStoppingScore: Double, + continuousEvolutionImprovementThreshold: Int, + continuousEvolutionParallelism: Int, + continuousEvolutionMutationAggressiveness: Int, + continuousEvolutionGeneticMixing: Double, + continuousEvolutionRollingImprovementCount: Int, + modelSeed: Map[String, Any], + hyperSpaceInference: Boolean, + hyperSpaceInferenceCount: Int, + hyperSpaceModelType: String, + hyperSpaceModelCount: Int, + initialGenerationMode: String, + initialGenerationConfig: FirstGenerationConfig, + deltaCacheBackingDirectory: String, + splitCachingStrategy: String, + deltaCacheBackingDirectoryRemovalFlag: Boolean) + +case class ScalingConfig(scalerType: String, + scalerMin: Double, + scalerMax: Double, + standardScalerMeanFlag: Boolean, + standardScalerStdDevFlag: Boolean, + pNorm: Double) + +case class FeatureInteractionConfig(retentionMode: String, + continuousDiscretizerBucketCount: Int, + parallelism: Int, + targetInteractionPercentage: Double) diff --git a/src/main/scala/com/databricks/labs/automl/params/DataStructures.scala b/src/main/scala/com/databricks/labs/automl/params/DataStructures.scala 
new file mode 100755 index 00000000..c1933bf5 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/params/DataStructures.scala @@ -0,0 +1,261 @@ +package com.databricks.labs.automl.params + +import com.databricks.labs.automl.tracking.MLFlowReportStructure +import org.apache.spark.ml.PipelineModel +import org.apache.spark.ml.classification._ +import org.apache.spark.ml.regression.LinearRegressionModel +import org.apache.spark.sql.DataFrame + +case class PearsonPayload(fieldName: String, + pvalue: Double, + degreesFreedom: Int, + pearsonStat: Double) + +case class FeatureCorrelationStats(leftCol: String, + rightCol: String, + correlation: Double) + +case class FilterData(field: String, uniqueValues: Long) + +case class ManualFilters(field: String, threshold: Double) + +case class XGBoostConfig(alpha: Double, + eta: Double, + gamma: Double, + lambda: Double, + maxDepth: Int, + subSample: Double, + minChildWeight: Double, + numRound: Int, + maxBins: Int, + trainTestRatio: Double) + +case class RandomForestConfig(numTrees: Int, + impurity: String, + maxBins: Int, + maxDepth: Int, + minInfoGain: Double, + subSamplingRate: Double, + featureSubsetStrategy: String) + +case class TreesConfig(impurity: String, + maxBins: Int, + maxDepth: Int, + minInfoGain: Double, + minInstancesPerNode: Int) + +case class GBTConfig(impurity: String, + lossType: String, + maxBins: Int, + maxDepth: Int, + maxIter: Int, + minInfoGain: Double, + minInstancesPerNode: Int, + stepSize: Double) + +case class LogisticRegressionConfig(elasticNetParams: Double, + fitIntercept: Boolean, + maxIter: Int, + regParam: Double, + standardization: Boolean, + tolerance: Double) + +case class LinearRegressionConfig(elasticNetParams: Double, + fitIntercept: Boolean, + loss: String, + maxIter: Int, + regParam: Double, + standardization: Boolean, + tolerance: Double) + +case class LinearRegressionModelsWithResults( + modelHyperParams: LinearRegressionConfig, + model: LinearRegressionModel, + score: Double, 
+ evalMetrics: Map[String, Double], + generation: Int +) + +case class LogisticRegressionModelsWithResults( + modelHyperParams: LogisticRegressionConfig, + model: LogisticRegressionModel, + score: Double, + evalMetrics: Map[String, Double], + generation: Int +) + +case class XGBoostModelsWithResults(modelHyperParams: XGBoostConfig, + model: Any, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class RandomForestModelsWithResults(modelHyperParams: RandomForestConfig, + model: Any, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class TreesModelsWithResults(modelHyperParams: TreesConfig, + model: Any, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class GBTModelsWithResults(modelHyperParams: GBTConfig, + model: Any, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class SVMConfig(fitIntercept: Boolean, + maxIter: Int, + regParam: Double, + standardization: Boolean, + tolerance: Double) + +case class SVMModelsWithResults(modelHyperParams: SVMConfig, + model: LinearSVCModel, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class MLPCConfig(layers: Array[Int], + maxIter: Int, + solver: String, + stepSize: Double, + tolerance: Double) + +case class MLPCModelsWithResults(modelHyperParams: MLPCConfig, + model: MultilayerPerceptronClassificationModel, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class NaiveBayesConfig(modelType: String, + smoothing: Double, + thresholds: Boolean) + +case class NaiveBayesModelsWithResults(modelHyperParams: NaiveBayesConfig, + model: NaiveBayesModel, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class LightGBMConfig(baggingFraction: Double, + baggingFreq: Int, + featureFraction: Double, + learningRate: Double, + maxBin: Int, + maxDepth: Int, + minSumHessianInLeaf: Double, + numIterations: Int, + numLeaves: Int, + 
boostFromAverage: Boolean, + lambdaL1: Double, + lambdaL2: Double, + alpha: Double, + boostingType: String) + +case class LightGBMModelsWithResults(modelHyperParams: LightGBMConfig, + model: Any, + score: Double, + evalMetrics: Map[String, Double], + generation: Int) + +case class StaticModelConfig(labelColumn: String, featuresColumn: String) + +case class GenericModelReturn(hyperParams: Map[String, Any], + model: Any, + score: Double, + metrics: Map[String, Double], + generation: Int) + +case class GroupedModelReturn(modelFamily: String, + hyperParams: Map[String, Any], + model: Any, + score: Double, + metrics: Map[String, Double], + generation: Int) + +case class GenerationalReport(modelFamily: String, + modelType: String, + generation: Int, + generationMeanScore: Double, + generationStddevScore: Double) + +case class FeatureImportanceReturn(modelPayload: RandomForestModelsWithResults, + data: DataFrame, + fields: Array[String], + modelType: String) + +case class TreeSplitReport(decisionText: String, + featureImportances: DataFrame, + model: Any) + +case class DataPrepReturn(outputData: DataFrame, fieldListing: Array[String]) + +case class DataGeneration(data: DataFrame, + fields: Array[String], + modelType: String) + +case class OutlierFilteringReturn( + outputData: DataFrame, + fieldRemovalMap: Map[String, (Double, String)] +) + +sealed trait Output { + def modelReport: Array[GenericModelReturn] + def generationReport: Array[GenerationalReport] + def modelReportDataFrame: DataFrame + def generationReportDataFrame: DataFrame +} + +abstract case class AutomationOutput(mlFlowOutput: MLFlowReportStructure) + extends Output + +abstract case class TunerOutput(rawData: DataFrame, + modelSelection: String, + mlFlowOutput: MLFlowReportStructure) + extends Output + +abstract case class PredictionOutput(dataWithPredictions: DataFrame, + mlFlowOutput: MLFlowReportStructure) + extends Output + +abstract case class FeatureImportanceOutput(featureImportances: DataFrame, + 
mlFlowOutput: MLFlowReportStructure) + extends Output + +abstract case class FeatureImportancePredictionOutput( + featureImportances: DataFrame, + predictionData: DataFrame, + mlFlowOutput: MLFlowReportStructure +) extends Output + +abstract case class ConfusionOutput(predictionData: DataFrame, + confusionData: DataFrame, + mlFlowOutput: MLFlowReportStructure) + extends Output + +abstract case class FamilyOutput(modelType: String, + mlFlowOutput: MLFlowReportStructure) + extends Output + +case class FamilyFinalOutput(modelReport: Array[GroupedModelReturn], + generationReport: Array[GenerationalReport], + modelReportDataFrame: DataFrame, + generationReportDataFrame: DataFrame, + mlFlowReport: Array[MLFlowReportStructure]) + +case class FamilyFinalOutputWithPipeline( + familyFinalOutput: FamilyFinalOutput, + bestPipelineModel: Map[String, PipelineModel], + bestMlFlowRunId: Map[String, String] = Map.empty +) + +sealed trait ModelType[A, B] + +final case class ClassiferType[A, B](a: A) extends ModelType[A, B] + +final case class RegressorType[A, B](b: B) extends ModelType[A, B] diff --git a/src/main/scala/com/databricks/labs/automl/params/Defaults.scala b/src/main/scala/com/databricks/labs/automl/params/Defaults.scala new file mode 100755 index 00000000..1ab99776 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/params/Defaults.scala @@ -0,0 +1,601 @@ +package com.databricks.labs.automl.params + +import com.databricks.labs.automl.pipeline.PipelineStateCache +import com.databricks.labs.automl.utils.InitDbUtils + +trait Defaults { + + final val _supportedModels: Array[String] = Array( + "GBT", + "Trees", + "RandomForest", + "LinearRegression", + "LogisticRegression", + "MLPC", + "SVM", + "XGBoost", + "gbmBinary", + "gbmMulti", + "gbmMultiOVA", + "gbmHuber", + "gbmFair", + "gbmLasso", + "gbmRidge", + "gbmPoisson", + "gbmQuantile", + "gbmMape", + "gbmTweedie", + "gbmGamma" + ) + + final val trainSplitMethods: List[String] = List( + "random", + "chronological", 
+ "stratifyReduce", + "stratified", + "overSample", + "underSample", + "kSample" + ) + + final val _supportedFeatureImportanceCutoffTypes: List[String] = + List("none", "value", "count") + + final val _allowableEvolutionStrategies = List("batch", "continuous") + + final val _allowableMlFlowLoggingModes = + List("tuningOnly", "bestOnly", "full") + + final val _allowableInitialGenerationModes = List("random", "permutations") + + final val _allowableInitialGenerationIndexMixingModes = + List("random", "linear") + + final val allowableKMeansDistanceMeasurements: List[String] = + List("cosine", "euclidean") + final val allowableMutationModes: List[String] = + List("weighted", "random", "ratio") + final val allowableVectorMutationMethods: List[String] = + List("random", "fixed", "all") + final val allowableLabelBalanceModes: List[String] = + List("match", "percentage", "target") + + final val allowableDateTimeConversions = List("unix", "split") + final val allowableCategoricalFilterModes = List("silent", "warn") + final val allowableCardinalilties = List("approx", "exact") + final val _allowableNAFillModes: List[String] = + List( + "auto", + "mapFill", + "blanketFillAll", + "blanketFillCharOnly", + "blanketFillNumOnly" + ) + + final val allowableMBORegressorTypes = + List("XGBoost", "LinearRegression", "RandomForest") + + final val allowableFeatureInteractionModes = + List("optimistic", "strict", "all") + + def _defaultModelingFamily: String = "RandomForest" + + def _defaultLabelCol: String = "label" + + def _defaultFeaturesCol: String = "features" + + def _defaultNAFillFlag: Boolean = true + + def _defaultVarianceFilterFlag: Boolean = true + + def _defaultOutlierFilterFlag: Boolean = false + + def _defaultPearsonFilterFlag: Boolean = false + + def _defaultCovarianceFilterFlag: Boolean = false + + def _defaultOneHotEncodeFlag: Boolean = false + + def _defaultScalingFlag: Boolean = false + + def _defaultFeatureInteractionFlag: Boolean = false + + def 
_defaultDataPrepCachingFlag: Boolean = false + + def _defaultDataReductionFactor: Double = 0.5 + + def _defaultPipelineDebugFlag: Boolean = false + + def _defaultDateTimeConversionType: String = "split" + + def _defaultFieldsToIgnoreInVector: Array[String] = Array.empty[String] + + def _defaultHyperSpaceInference: Boolean = false + + def _defaultHyperSpaceInferenceCount: Int = 200000 + + def _defaultHyperSpaceModelType: String = "RandomForest" + + def _defaultHyperSpaceModelCount: Int = 10 + + def _defaultInitialGenerationMode: String = "random" + + def _defaultDataPrepParallelism: Int = 20 + + def _defaultPipelineId: String = PipelineStateCache.generatePipelineId() + + def _defaultFirstGenerationConfig = FirstGenerationConfig( + permutationCount = 10, + indexMixingMode = "linear", + arraySeed = 42L + ) + + def _defaultFeatureInteractionConfig = FeatureInteractionConfig( + retentionMode = "optimistic", + continuousDiscretizerBucketCount = 10, + parallelism = 12, + targetInteractionPercentage = 10 + ) + + def _defaultKSampleConfig: KSampleConfig = KSampleConfig( + syntheticCol = "synthetic_kSample", + kGroups = 25, + kMeansMaxIter = 100, + kMeansTolerance = 1E-6, + kMeansDistanceMeasurement = "euclidean", + kMeansSeed = 42L, + kMeansPredictionCol = "kGroups_kSample", + lshHashTables = 10, + lshSeed = 42L, + lshOutputCol = "hashes_kSample", + quorumCount = 7, + minimumVectorCountToMutate = 1, + vectorMutationMethod = "random", + mutationMode = "weighted", + mutationValue = 0.5, + labelBalanceMode = "percentage", + cardinalityThreshold = 20, + numericRatio = 0.2, + numericTarget = 500, + outputDfRepartitionScaleFactor = 3 + ) + + def _geneticTunerDefaults = GeneticConfig( + parallelism = 20, + kFold = 5, + trainPortion = 0.8, + trainSplitMethod = "random", + kSampleConfig = _defaultKSampleConfig, + trainSplitChronologicalColumn = "datetime", + trainSplitChronologicalRandomPercentage = 0.0, + seed = 42L, + firstGenerationGenePool = 20, + numberOfGenerations = 10, + 
numberOfParentsToRetain = 3, + numberOfMutationsPerGeneration = 10, + geneticMixing = 0.7, + generationalMutationStrategy = "linear", + fixedMutationValue = 1, + mutationMagnitudeMode = "fixed", + evolutionStrategy = "batch", + geneticMBORegressorType = "XGBoost", + geneticMBOCandidateFactor = 10, + continuousEvolutionMaxIterations = 200, + continuousEvolutionStoppingScore = 1.0, + continuousEvolutionImprovementThreshold = -10, + continuousEvolutionParallelism = 4, + continuousEvolutionMutationAggressiveness = 3, + continuousEvolutionGeneticMixing = 0.7, + continuousEvolutionRollingImprovementCount = 20, + modelSeed = Map.empty, + hyperSpaceInference = _defaultHyperSpaceInference, + hyperSpaceInferenceCount = _defaultHyperSpaceInferenceCount, + hyperSpaceModelCount = _defaultHyperSpaceModelCount, + hyperSpaceModelType = _defaultHyperSpaceModelType, + initialGenerationMode = _defaultInitialGenerationMode, + initialGenerationConfig = _defaultFirstGenerationConfig, + deltaCacheBackingDirectory = "dbfs:/mnt/automl/", + splitCachingStrategy = "persist", + deltaCacheBackingDirectoryRemovalFlag = true + ) + + def _fillConfigDefaults = FillConfig( + numericFillStat = "mean", + characterFillStat = "max", + modelSelectionDistinctThreshold = 10, + cardinalitySwitch = true, + cardinalityType = "exact", + cardinalityLimit = 200, + cardinalityPrecision = 0.05, + cardinalityCheckMode = "silent", + filterPrecision = 0.01, + categoricalNAFillMap = Map.empty[String, String], + numericNAFillMap = Map.empty[String, AnyVal], + characterNABlanketFillValue = "", + numericNABlanketFillValue = 0.0, + naFillMode = "auto" + ) + + def _outlierConfigDefaults = OutlierConfig( + filterBounds = "both", + lowerFilterNTile = 0.02, + upperFilterNTile = 0.98, + filterPrecision = 0.01, + continuousDataThreshold = 50, + fieldsToIgnore = Array("") + ) + + def _pearsonConfigDefaults = PearsonConfig( + filterStatistic = "pearsonStat", + filterDirection = "greater", + filterManualValue = 1.0, + filterMode 
= "auto", + autoFilterNTile = 0.99 + ) + + def _covarianceConfigDefaults = + CovarianceConfig(correlationCutoffLow = -0.99, correlationCutoffHigh = 0.99) + + def _scalingConfigDefaults = ScalingConfig( + scalerType = "minMax", + scalerMin = 0.0, + scalerMax = 1.0, + standardScalerMeanFlag = false, + standardScalerStdDevFlag = true, + pNorm = 2.0 + ) + + def _dataPrepConfigDefaults = DataPrepConfig( + naFillFlag = true, + varianceFilterFlag = true, + outlierFilterFlag = false, + pearsonFilterFlag = true, + covarianceFilterFlag = true, + scalingFlag = false + ) + + def _xgboostDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "alpha" -> Tuple2(0.0, 1.0), + "eta" -> Tuple2(0.1, 0.5), + "gamma" -> Tuple2(0.0, 10.0), + "lambda" -> Tuple2(0.1, 10.0), + "maxDepth" -> Tuple2(3.0, 10.0), + "subSample" -> Tuple2(0.4, 0.6), + "minChildWeight" -> Tuple2(0.1, 10.0), + "numRound" -> Tuple2(25.0, 250.0), + "maxBins" -> Tuple2(25.0, 512.0), + "trainTestRatio" -> Tuple2(0.2, 0.8) + ) + + def _rfDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "numTrees" -> Tuple2(50.0, 1000.0), + "maxBins" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ) + + def _rfDefaultStringBoundaries = Map( + "impurity" -> List("gini", "entropy"), + "featureSubsetStrategy" -> List("auto") + ) + + def _treesDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "maxBins" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "minInstancesPerNode" -> Tuple2(1.0, 50.0) + ) + + def _treesDefaultStringBoundaries: Map[String, List[String]] = Map( + "impurity" -> List("gini", "entropy") + ) + + def _mlpcDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "layers" -> Tuple2(1.0, 10.0), + "maxIter" -> Tuple2(10.0, 100.0), + "stepSize" -> Tuple2(0.01, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5), + "hiddenLayerSizeAdjust" -> Tuple2(0.0, 50.0) + ) + + def 
_mlpcDefaultStringBoundaries: Map[String, List[String]] = Map( + "solver" -> List("gd", "l-bfgs") + ) + + def _gbtDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "maxBins" -> Tuple2(10.0, 100.0), + "maxIter" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 1.0), + "minInstancesPerNode" -> Tuple2(1.0, 50.0), + "stepSize" -> Tuple2(1E-4, 1.0) + ) + + def _gbtDefaultStringBoundaries: Map[String, List[String]] = + Map("impurity" -> List("gini", "entropy"), "lossType" -> List("logistic")) + + def _linearRegressionDefaultNumBoundaries: Map[String, (Double, Double)] = + Map( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + def _linearRegressionDefaultStringBoundaries: Map[String, List[String]] = Map( + "loss" -> List("squaredError", "huber") + ) + def _logisticRegressionDefaultNumBoundaries: Map[String, (Double, Double)] = + Map( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + def _logisticRegressionDefaultStringBoundaries: Map[String, List[String]] = + Map("" -> List("")) + def _svmDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "maxIter" -> Tuple2(100.0, 10000.0), + "regParam" -> Tuple2(0.0, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + def _svmDefaultStringBoundaries: Map[String, List[String]] = Map( + "" -> List("") + ) + + def _naiveBayesDefaultStringBoundaries: Map[String, List[String]] = Map( + "modelType" -> List("multinomial", "bernoulli") + ) + + def _naiveBayesDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "smoothing" -> Tuple2(0.0, 1.0) + ) + + def _lightGBMDefaultNumBoundaries: Map[String, (Double, Double)] = Map( + "baggingFraction" -> Tuple2(0.5, 1.0), + "baggingFreq" -> Tuple2(0.0, 1.0), + "featureFraction" -> Tuple2(0.6, 1.0), + "learningRate" -> Tuple2(1E-8, 
1.0), + "maxBin" -> Tuple2(50, 1000), + "maxDepth" -> Tuple2(3.0, 20.0), + "minSumHessianInLeaf" -> Tuple2(1e-5, 50.0), + "numIterations" -> Tuple2(25.0, 250.0), + "numLeaves" -> Tuple2(10.0, 50.0), + "lambdaL1" -> Tuple2(0.0, 1.0), + "lambdaL2" -> Tuple2(0.0, 1.0), + "alpha" -> Tuple2(0.0, 1.0) + ) + + def _lightGBMDefaultStringBoundaries: Map[String, List[String]] = Map( + "boostingType" -> List("gbdt", "rf", "dart", "goss") + ) + + def _scoringDefaultClassifier = "f1" + def _scoringOptimizationStrategyClassifier = "maximize" + def _scoringDefaultRegressor = "rmse" + def _scoringOptimizationStrategyRegressor = "minimize" + + def _modelTypeDefault = "RandomForest" + + def _mlFlowConfigDefaults: MLFlowConfig = { + val mlfloWLoggingConfig = + InitDbUtils.getMlFlowLoggingConfig(_defaultMlFlowLoggingFlag) + MLFlowConfig( + mlFlowTrackingURI = mlfloWLoggingConfig.mlFlowTrackingURI, + mlFlowExperimentName = mlfloWLoggingConfig.mlFlowExperimentName, + mlFlowAPIToken = mlfloWLoggingConfig.mlFlowAPIToken, + mlFlowModelSaveDirectory = mlfloWLoggingConfig.mlFlowModelSaveDirectory, + mlFlowLoggingMode = "full", + mlFlowBestSuffix = "best", + mlFlowCustomRunTags = Map[String, String]() + ) + } + + def _inferenceConfigSaveLocationDefault: String = "/models" + + def _defaultMlFlowLoggingFlag: Boolean = false + + def _defaultMlFlowArtifactsFlag: Boolean = false + + def _defaultAutoStoppingFlag: Boolean = true + + def _defaultAutoStoppingScore: Double = 0.95 + + def _defaultFeatureImportanceCutoffType: String = "count" + + def _defaultFeatureImportanceCutoffValue: Double = 15.0 + + def _mainConfigDefaults = MainConfig( + modelFamily = _modelTypeDefault, + labelCol = "label", + featuresCol = "features", + naFillFlag = true, + varianceFilterFlag = true, + outlierFilterFlag = false, + pearsonFilteringFlag = false, + covarianceFilteringFlag = false, + oneHotEncodeFlag = false, + scalingFlag = false, + featureInteractionFlag = false, + dataPrepCachingFlag = true, + autoStoppingFlag = 
_defaultAutoStoppingFlag, + dataPrepParallelism = _defaultDataPrepParallelism, + autoStoppingScore = _defaultAutoStoppingScore, + featureImportanceCutoffType = _defaultFeatureImportanceCutoffType, + featureImportanceCutoffValue = _defaultFeatureImportanceCutoffValue, + dateTimeConversionType = "split", + fieldsToIgnoreInVector = _defaultFieldsToIgnoreInVector, + numericBoundaries = _rfDefaultNumBoundaries, + stringBoundaries = _rfDefaultStringBoundaries, + scoringMetric = _scoringDefaultClassifier, + scoringOptimizationStrategy = _scoringOptimizationStrategyClassifier, + fillConfig = _fillConfigDefaults, + outlierConfig = _outlierConfigDefaults, + pearsonConfig = _pearsonConfigDefaults, + covarianceConfig = _covarianceConfigDefaults, + featureInteractionConfig = _defaultFeatureInteractionConfig, + scalingConfig = _scalingConfigDefaults, + geneticConfig = _geneticTunerDefaults, + mlFlowLoggingFlag = _defaultMlFlowLoggingFlag, + mlFlowLogArtifactsFlag = _defaultMlFlowArtifactsFlag, + mlFlowConfig = _mlFlowConfigDefaults, + inferenceConfigSaveLocation = _inferenceConfigSaveLocationDefault, + dataReductionFactor = _defaultDataReductionFactor, + pipelineDebugFlag = _defaultPipelineDebugFlag, + pipelineId = _defaultPipelineId + ) + + def _featureImportancesDefaults = MainConfig( + modelFamily = "RandomForest", + labelCol = "label", + featuresCol = "features", + naFillFlag = true, + varianceFilterFlag = true, + outlierFilterFlag = false, + pearsonFilteringFlag = false, + covarianceFilteringFlag = false, + oneHotEncodeFlag = false, + scalingFlag = false, + featureInteractionFlag = false, + dataPrepCachingFlag = true, + autoStoppingFlag = _defaultAutoStoppingFlag, + dataPrepParallelism = _defaultDataPrepParallelism, + autoStoppingScore = _defaultAutoStoppingScore, + featureImportanceCutoffType = _defaultFeatureImportanceCutoffType, + featureImportanceCutoffValue = _defaultFeatureImportanceCutoffValue, + dateTimeConversionType = "split", + fieldsToIgnoreInVector = 
_defaultFieldsToIgnoreInVector, + numericBoundaries = _rfDefaultNumBoundaries, + stringBoundaries = _rfDefaultStringBoundaries, + scoringMetric = _scoringDefaultClassifier, + scoringOptimizationStrategy = _scoringOptimizationStrategyClassifier, + fillConfig = _fillConfigDefaults, + outlierConfig = _outlierConfigDefaults, + pearsonConfig = _pearsonConfigDefaults, + covarianceConfig = _covarianceConfigDefaults, + scalingConfig = _scalingConfigDefaults, + featureInteractionConfig = _defaultFeatureInteractionConfig, + geneticConfig = GeneticConfig( + parallelism = 20, + kFold = 1, + trainPortion = 0.8, + trainSplitMethod = "random", + kSampleConfig = _defaultKSampleConfig, + trainSplitChronologicalColumn = "datetime", + trainSplitChronologicalRandomPercentage = 0.0, + seed = 42L, + firstGenerationGenePool = 25, + numberOfGenerations = 20, + numberOfParentsToRetain = 2, + numberOfMutationsPerGeneration = 10, + geneticMixing = 0.7, + generationalMutationStrategy = "linear", + fixedMutationValue = 1, + mutationMagnitudeMode = "fixed", + evolutionStrategy = "batch", + geneticMBORegressorType = "XGBoost", + geneticMBOCandidateFactor = 10, + continuousEvolutionMaxIterations = 200, + continuousEvolutionStoppingScore = 1.0, + continuousEvolutionImprovementThreshold = -10, + continuousEvolutionParallelism = 4, + continuousEvolutionMutationAggressiveness = 3, + continuousEvolutionGeneticMixing = 0.7, + continuousEvolutionRollingImprovementCount = 20, + modelSeed = Map.empty, + hyperSpaceInference = _defaultHyperSpaceInference, + hyperSpaceInferenceCount = _defaultHyperSpaceInferenceCount, + hyperSpaceModelType = _defaultHyperSpaceModelType, + hyperSpaceModelCount = _defaultHyperSpaceModelCount, + initialGenerationMode = _defaultInitialGenerationMode, + initialGenerationConfig = _defaultFirstGenerationConfig, + deltaCacheBackingDirectory = "", + splitCachingStrategy = "persist", + deltaCacheBackingDirectoryRemovalFlag = true + ), + mlFlowLoggingFlag = _defaultMlFlowLoggingFlag, + 
mlFlowLogArtifactsFlag = _defaultMlFlowArtifactsFlag, + mlFlowConfig = _mlFlowConfigDefaults, + inferenceConfigSaveLocation = _inferenceConfigSaveLocationDefault, + dataReductionFactor = _defaultDataReductionFactor, + pipelineDebugFlag = false, + pipelineId = _defaultPipelineId + ) + + def _treeSplitDefaults = MainConfig( + modelFamily = "Trees", + labelCol = "label", + featuresCol = "features", + naFillFlag = true, + varianceFilterFlag = true, + outlierFilterFlag = false, + pearsonFilteringFlag = false, + covarianceFilteringFlag = false, + oneHotEncodeFlag = false, + scalingFlag = false, + featureInteractionFlag = false, + dataPrepCachingFlag = true, + dateTimeConversionType = "split", + autoStoppingFlag = _defaultAutoStoppingFlag, + dataPrepParallelism = _defaultDataPrepParallelism, + autoStoppingScore = _defaultAutoStoppingScore, + featureImportanceCutoffType = _defaultFeatureImportanceCutoffType, + featureImportanceCutoffValue = _defaultFeatureImportanceCutoffValue, + fieldsToIgnoreInVector = _defaultFieldsToIgnoreInVector, + numericBoundaries = _treesDefaultNumBoundaries, + stringBoundaries = _treesDefaultStringBoundaries, + scoringMetric = _scoringDefaultClassifier, + scoringOptimizationStrategy = _scoringOptimizationStrategyClassifier, + fillConfig = _fillConfigDefaults, + outlierConfig = _outlierConfigDefaults, + pearsonConfig = _pearsonConfigDefaults, + covarianceConfig = _covarianceConfigDefaults, + scalingConfig = _scalingConfigDefaults, + featureInteractionConfig = _defaultFeatureInteractionConfig, + geneticConfig = GeneticConfig( + parallelism = 20, + kFold = 1, + trainPortion = 0.8, + trainSplitMethod = "random", + kSampleConfig = _defaultKSampleConfig, + trainSplitChronologicalColumn = "datetime", + trainSplitChronologicalRandomPercentage = 0.0, + seed = 42L, + firstGenerationGenePool = 25, + numberOfGenerations = 20, + numberOfParentsToRetain = 2, + numberOfMutationsPerGeneration = 10, + geneticMixing = 0.7, + generationalMutationStrategy = 
"linear", + fixedMutationValue = 1, + mutationMagnitudeMode = "fixed", + evolutionStrategy = "batch", + geneticMBORegressorType = "XGBoost", + geneticMBOCandidateFactor = 10, + continuousEvolutionMaxIterations = 200, + continuousEvolutionStoppingScore = 1.0, + continuousEvolutionImprovementThreshold = -10, + continuousEvolutionParallelism = 4, + continuousEvolutionMutationAggressiveness = 3, + continuousEvolutionGeneticMixing = 0.7, + continuousEvolutionRollingImprovementCount = 20, + modelSeed = Map.empty, + hyperSpaceInference = _defaultHyperSpaceInference, + hyperSpaceInferenceCount = _defaultHyperSpaceInferenceCount, + hyperSpaceModelType = _defaultHyperSpaceModelType, + hyperSpaceModelCount = _defaultHyperSpaceModelCount, + initialGenerationMode = _defaultInitialGenerationMode, + initialGenerationConfig = _defaultFirstGenerationConfig, + deltaCacheBackingDirectory = "", + splitCachingStrategy = "persist", + deltaCacheBackingDirectoryRemovalFlag = true + ), + mlFlowLoggingFlag = _defaultMlFlowLoggingFlag, + mlFlowLogArtifactsFlag = _defaultMlFlowArtifactsFlag, + mlFlowConfig = _mlFlowConfigDefaults, + inferenceConfigSaveLocation = _inferenceConfigSaveLocationDefault, + dataReductionFactor = _defaultDataReductionFactor, + pipelineDebugFlag = false, + pipelineId = _defaultPipelineId + ) +} diff --git a/src/main/scala/com/databricks/labs/automl/params/EvolutionDefaults.scala b/src/main/scala/com/databricks/labs/automl/params/EvolutionDefaults.scala new file mode 100755 index 00000000..3cf15903 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/params/EvolutionDefaults.scala @@ -0,0 +1,104 @@ +package com.databricks.labs.automl.params + +trait EvolutionDefaults { + + final val allowableEvolutionStrategies = List("batch", "continuous") + final val allowableOptimizationStrategies = List("minimize", "maximize") + final val allowableMutationStrategies = List("linear", "fixed") + final val allowableMutationMagnitudeMode = List("random", "fixed") + final val 
regressionMetrics: List[String] = List("rmse", "mse", "r2", "mae") + final val classificationMetrics: List[String] = List( + "f1", + "weightedPrecision", + "weightedRecall", + "accuracy", + "areaUnderPR", + "areaUnderROC" + ) + final val allowableTrainSplitMethod: List[String] = List( + "random", + "chronological", + "stratified", + "overSample", + "underSample", + "stratifyReduce", + "kSample" + ) + final val allowableInitialGenerationModes: List[String] = + List("random", "permutations") + final val allowableInitialGenerationIndexMixingModes: List[String] = + List("random", "linear") + final val allowableGeneticMBORegressorTypes: List[String] = + List("XGBoost", "LinearRegression", "RandomForest") + + def _defaultLabel: String = "label" + + def _defaultFeature: String = "features" + + def _defaultTrainPortion: Double = 0.8 + + def _defaultTrainSplitMethod: String = "random" + + def _defaultTrainSplitChronologicalColumn: String = "datetime" + + def _defaultTrainSplitChronologicalRandomPercentage: Double = 0.0 + + def _defaultParallelism: Int = 20 + + def _defaultKFold: Int = 3 + + def _defaultSeed: Long = 42L + + def _defaultOptimizationStrategy: String = "maximize" + + def _defaultFirstGenerationGenePool: Int = 20 + + def _defaultNumberOfMutationGenerations: Int = 10 + + def _defaultNumberOfParentsToRetain: Int = 2 + + def _defaultNumberOfMutationsPerGeneration: Int = 10 + + def _defaultGeneticMixing: Double = 0.7 + + def _defaultGenerationalMutationStrategy: String = "linear" + + def _defaultMutationMagnitudeMode: String = "random" + + def _defaultFixedMutationValue: Int = 1 + + def _defaultEarlyStoppingScore: Double = 1.0 + + def _defaultEarlyStoppingFlag: Boolean = true + + def _defaultEvolutionStrategy: String = "batch" + + def _defaultGeneticMBOCandidateFactor: Int = 10 + + def _defaultGeneticMBORegressorType: String = "XGBoost" + + def _defaultContinuousEvolutionImprovementThreshold = -10 + + def _defaultContinuousEvolutionMaxIterations: Int = 200 + + def 
_defaultContinuousEvolutionStoppingScore: Double = 1.0 + + def _defaultContinuousEvolutionParallelism: Int = 4 + + def _defaultContinuousEvolutionMutationAggressiveness: Int = 3 + + def _defaultContinuousEvolutionGeneticMixing: Double = 0.7 + + def _defaultContinuousEvolutionRollingImprovementCount: Int = 20 + + def _defaultDataReduce: Double = 0.5 + + def _defaultFirstGenMode: String = "random" + + def _defaultFirstGenPermutations: Int = 10 + + def _defaultFirstGenIndexMixingMode: String = "linear" + + def _defaultFirstGenArraySeed: Long = 42L + +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/AbstractTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/AbstractTransformer.scala new file mode 100644 index 00000000..eb9606f5 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/AbstractTransformer.scala @@ -0,0 +1,69 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.log4j.Logger +import org.apache.spark.ml.Transformer +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * Abstract transformer should be extended for all AutoML transformers + * This can contain common validation, exceptions and log messages. 
+ * Internally extends Spark Pipeline transformer [[Transformer]] + */ + +abstract class AbstractTransformer + extends Transformer + with HasAutoMlIdColumn + with HasDebug + with HasPipelineId { + + @transient lazy private val logger: Logger = Logger.getLogger(this.getClass) + + /** + * Final overridden method that cannot be modified by AutoML transformers + * @param schema + * @return Transformed Schema [[StructType]] + */ + final override def transformSchema(schema: StructType): StructType = { + transformSchemaInternal(schema) + } + + /** + * Final overridden method that cannot be modified by AutoML transformers + * + * @param dataset + * @return Transformed DataFrame [[DataFrame]] + */ + final override def transform(dataset: Dataset[_]): DataFrame = { + val startMillis = System.currentTimeMillis() + val outputDf = transformInternal(dataset) + transformSchemaInternal(dataset.schema) + logAutoMlInternalIdPresent(outputDf) + logTransformation(dataset, outputDf, System.currentTimeMillis() - startMillis) + outputDf + } + + final private def logAutoMlInternalIdPresent(outputDf: Dataset[_]): Unit = { + val idAbsentMessage = s"Missing $getAutomlInternalId in the input columns" + val isIdColumnNeeded = outputDf.schema.fieldNames.contains(getAutomlInternalId) || this.isInstanceOf[AutoMlOutputDatasetTransformer] + if(!isIdColumnNeeded) { + logger.fatal(s"$idAbsentMessage in ${this.getClass}") + } + assert(isIdColumnNeeded, idAbsentMessage) + } + + /** + * Abstract Method to be implemented by all AutoML transformers + * @param dataset + * @return transformed output [[DataFrame]] + */ + def transformInternal(dataset: Dataset[_]): DataFrame + + /** + * Abstract Method to be implemented by all AutoML transformers + * @param schema + * @return schema of new output [[DataFrame]] [[StructType]] + */ + def transformSchemaInternal(schema: StructType): StructType +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/AutoMlOutputDatasetTransformer.scala
b/src/main/scala/com/databricks/labs/automl/pipeline/AutoMlOutputDatasetTransformer.scala new file mode 100644 index 00000000..78baf558 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/AutoMlOutputDatasetTransformer.scala @@ -0,0 +1,77 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{Param, ParamMap, StringArrayParam} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} +import org.apache.spark.sql.types.{LongType, StructField, StructType} + +/** + * @author Jas Bali + * This transformer is intended to be used as the last stage in the inference pipeline. + * Note: This transformer is meant to be paired with [[ZipRegisterTempTransformer]], + * which is used as the first transformer in the Inference/Training pipeline. It generates the final + * dataset that is returned by calling transform on the [[org.apache.spark.ml.PipelineModel]]. + * This ensures that all the original columns are present in the final + * transformed dataset, since steps downstream of inference may need to JOIN on + * ignored columns. + * @param uid + */ +class AutoMlOutputDatasetTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFeaturesColumns { + + def this() = { + this(Identifiable.randomUID("AutoMlOutputDatasetTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + final val tempViewOriginalDatasetName: Param[String] = new Param[String](this, "tempViewOriginalDatasetName", "Temp table name") + + def setTempViewOriginalDatasetName(value: String): this.type = set(tempViewOriginalDatasetName, value) + + def getTempViewOriginalDatasetName: String
= $(tempViewOriginalDatasetName) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + val originalUserDf = dataset.sqlContext.sql(s"select * from $getTempViewOriginalDatasetName") + val userViewDf = + if(dataset.columns.contains(getLabelColumn)) { + val tmpDf = dataset + .drop(getFeatureColumns:_*) + .drop(getLabelColumn) + originalUserDf + .join(tmpDf, getAutomlInternalId) + .drop(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + } else { + val tmpDf = dataset + .drop(getFeatureColumns:_*) + originalUserDf + .join(tmpDf, getAutomlInternalId) + .drop(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + } + dataset.sqlContext.dropTempTable(getTempViewOriginalDatasetName) + userViewDf.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + val spark = SparkSession.builder().getOrCreate() + if(spark.catalog.tableExists(getTempViewOriginalDatasetName)) { + val originalDfSchema = spark.sql(s"select * from $getTempViewOriginalDatasetName").schema + return StructType( + schema.fields.filterNot(field => AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL.equals(field.name)) + ++ + originalDfSchema.fields.filterNot(field => getFeatureColumns.contains(field.name))) + } + schema + } + + override def copy(extra: ParamMap): AutoMlOutputDatasetTransformer = defaultCopy(extra) +} + +object AutoMlOutputDatasetTransformer extends DefaultParamsReadable[AutoMlOutputDatasetTransformer] { + override def load(path: String): AutoMlOutputDatasetTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/CardinalityLimitColumnPrunerTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/CardinalityLimitColumnPrunerTransformer.scala new file mode 100644 index 00000000..d02a3617 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/CardinalityLimitColumnPrunerTransformer.scala @@ -0,0 +1,141 @@ +package com.databricks.labs.automl.pipeline + +import 
com.databricks.labs.automl.utils.data.CategoricalHandler +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.spark.ml.param.{ + DoubleParam, + IntParam, + Param, + ParamMap, + StringArrayParam +} +import org.apache.spark.ml.util.{ + DefaultParamsReadable, + DefaultParamsWritable, + Identifiable +} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * A transformer to apply cardinality limit rules to the input dataset. + * Given a cardinality limit, this transformer will drop all columns with + * the cardinality higher than that of a pre-defined limit + */ +class CardinalityLimitColumnPrunerTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasTransformCalculated { + + def this() = { + this(Identifiable.randomUID("CardinalityLimitColumnPrunerTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setCardinalityLimit(500) + setTransformCalculated(false) + setPrunedColumns(null) + setDebugEnabled(false) + } + + final val cardinalityLimit: IntParam = new IntParam( + this, + "cardinalityLimit", + "Setting this to a limit will ignore columns with cardinality higher than this limit" + ) + + final val cardinalityCheckMode: Param[String] = + new Param[String](this, "cardinalityCheckMode", "cardinality check mode") + + final val cardinalityType: Param[String] = + new Param[String](this, "cardinalityType", "cardinality type") + + final val cardinalityPrecision: DoubleParam = + new DoubleParam(this, "cardinalityPrecision", "cardinality precision") + + final val prunedColumns: StringArrayParam = new StringArrayParam( + this, + "prunedColumns", + "Columns to ignore based on cardinality limit" + ) + + def setCardinalityLimit(value: Int): this.type = set(cardinalityLimit, value) + + def getCardinalityLimit: Int = $(cardinalityLimit) + + def 
setPrunedColumns(value: Array[String]): this.type = + set(prunedColumns, value) + + def getPrunedColumns: Array[String] = $(prunedColumns) + + def setCardinalityCheckMode(value: String): this.type = + set(cardinalityCheckMode, value) + + def getCardinalityCheckMode: String = $(cardinalityCheckMode) + + def setCardinalityType(value: String): this.type = set(cardinalityType, value) + + def getCardinalityType: String = $(cardinalityType) + + def setCardinalityPrecision(value: Double): this.type = + set(cardinalityPrecision, value) + + def getCardinalityPrecision: Double = $(cardinalityPrecision) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + if (!getTransformCalculated) { + val columnTypes = SchemaUtils.extractTypes(dataset.toDF(), getLabelColumn) + if (SchemaUtils.isNotEmpty(columnTypes.categoricalFields)) { + + val colsValidated = + new CategoricalHandler(dataset.toDF(), getCardinalityCheckMode) + .setCardinalityType(getCardinalityType) + .setPrecision(getCardinalityPrecision) + .validateCategoricalFields( + columnTypes.categoricalFields + .filterNot(item => getAutomlInternalId.equals(item)), + getCardinalityLimit + ) + + val columnsToDrop = columnTypes.categoricalFields + .filterNot(col => colsValidated.contains(col)) + + if (SchemaUtils.isEmpty(getPrunedColumns)) { + setPrunedColumns(columnsToDrop.toArray[String]) + } + setTransformCalculated(true) + return dataset.drop(columnsToDrop: _*).toDF() + } + } + if (SchemaUtils.isNotEmpty(getPrunedColumns)) { + return dataset.drop(getPrunedColumns: _*).toDF() + } + dataset.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + if (SchemaUtils.isNotEmpty(getPrunedColumns)) { + val allCols = schema.fields.map(field => field.name) + val missingCols = + getPrunedColumns.filterNot(colName => allCols.contains(colName)) + if (missingCols.nonEmpty) { + throw new RuntimeException( + s"""Following columns are missing: ${missingCols.mkString(", ")}""" + ) + } + return 
StructType( + schema.fields.filterNot(field => getPrunedColumns.contains(field.name)) + ) + } + schema + } + + override def copy(extra: ParamMap): CardinalityLimitColumnPrunerTransformer = + defaultCopy(extra) +} + +object CardinalityLimitColumnPrunerTransformer + extends DefaultParamsReadable[CardinalityLimitColumnPrunerTransformer] { + override def load(path: String): CardinalityLimitColumnPrunerTransformer = + super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/ColumnNameTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/ColumnNameTransformer.scala new file mode 100644 index 00000000..37e10302 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/ColumnNameTransformer.scala @@ -0,0 +1,68 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCols} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.{StructField, StructType} +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * This is a useful transformer, if there is a need to rename columns + * in the intermediate transformations of a pipeline. 
Using this transformer + * can help avoid doing intermediate "fit" on pipeline just to rename columns + * in the output dataset + * + * Note: This is a no-op transformer if input columns are not present in the dataset + */ +class ColumnNameTransformer(override val uid: String) + extends Transformer + with DefaultParamsWritable + with HasInputCols + with HasOutputCols + with HasDebug + with HasPipelineId { + + def this() = { + this(Identifiable.randomUID("ColumnNameTransformer")) + setDebugEnabled(false) + } + + def setInputColumns(value: Array[String]): this.type = set(inputCols, value) + + def setOutputColumns(value: Array[String]): this.type = set(outputCols, value) + + override def transform(dataset: Dataset[_]): DataFrame = { + val startMillis = System.currentTimeMillis() + if(getInputCols.forall(item => dataset.columns.contains(item))) { + transformSchema(dataset.schema) + var newDataset = dataset + for((key, i) <- getInputCols.view.zipWithIndex) { + newDataset = newDataset.withColumnRenamed(key, getOutputCols(i)) + } + logTransformation(dataset, newDataset, System.currentTimeMillis() - startMillis) + return newDataset.toDF() + } + dataset.toDF() + } + + override def transformSchema(schema: StructType): StructType = { + require( + getInputCols.length == getOutputCols.length, + s"${getInputCols.toList} input columns array is not equal in length to output columns array ${getOutputCols.toList}") + StructType(schema.fields.zipWithIndex.map{case (element, index) => + if(getInputCols.contains(element.name)) { + StructField(getOutputCols(getInputCols.indexOf(element.name)), element.dataType, element.nullable, element.metadata) + } else { + element + } + }) + } + + override def copy(extra: ParamMap): ColumnNameTransformer = defaultCopy(extra) +} + +object ColumnNameTransformer extends DefaultParamsReadable[ColumnNameTransformer] { + override def load(path: String): ColumnNameTransformer = super.load(path) +} diff --git
a/src/main/scala/com/databricks/labs/automl/pipeline/CovarianceFilterTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/CovarianceFilterTransformer.scala new file mode 100644 index 00000000..f4a7b439 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/CovarianceFilterTransformer.scala @@ -0,0 +1,75 @@ +package com.databricks.labs.automl.pipeline +import com.databricks.labs.automl.sanitize.FeatureCorrelationDetection +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.ml.param.{DoubleParam, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} +/** + * @author Jas Bali + * A transformer stage that wraps [[FeatureCorrelationDetection]] in the transform method. + */ +class CovarianceFilterTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFeaturesColumns + with HasFieldsRemoved + with HasTransformCalculated + with HasFeatureColumn { + private val logger: Logger = Logger.getLogger(this.getClass) + def this() = { + this(Identifiable.randomUID("CovarianceFilterTransformer")) + setFieldsRemoved(Array.empty) + setTransformCalculated(false) + setCorrelationCutoffLow(-0.99) + setCorrelationCutoffHigh(0.99) + setFeatureColumns(Array.empty) + setDebugEnabled(false) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + } + + final val correlationCutoffLow: DoubleParam = new DoubleParam(this, "correlationCutoffLow", "correlationCutoffLow") + final val correlationCutoffHigh: DoubleParam = new DoubleParam(this, "correlationCutoffHigh", "correlationCutoffHigh") + def setCorrelationCutoffLow(value: Double): this.type = set(correlationCutoffLow, value) + def getCorrelationCutoffLow: Double = 
$(correlationCutoffLow) + def setCorrelationCutoffHigh(value: Double): this.type = set(correlationCutoffHigh, value) + def getCorrelationCutoffHigh: Double = $(correlationCutoffHigh) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + if(dataset.columns.contains(getLabelColumn)) { + if (SchemaUtils.isNotEmpty(getFeatureColumns)) { + setFeatureColumns(dataset.columns.filterNot(item => Array(getLabelColumn, AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL).contains(item))) + } + // Output has no feature vector + if (!getTransformCalculated) { + val covarianceFilteredData = + new FeatureCorrelationDetection(dataset.toDF(), getFeatureColumns.filterNot(item => AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL.equals(item) || getFeatureCol.equals(item))) + .setLabelCol(getLabelColumn) + .setCorrelationCutoffLow(getCorrelationCutoffLow) + .setCorrelationCutoffHigh(getCorrelationCutoffHigh) + .filterFeatureCorrelation() + setFieldsRemoved(getFeatureColumns.filterNot(field => covarianceFilteredData.columns.contains(field))) + setTransformCalculated(true) + val covarianceFilterLog = + s"Covariance Filtering completed.\n Removed fields: ${getFieldsRemoved.mkString(", ")}" + logger.log(Level.INFO, covarianceFilterLog) + println(covarianceFilterLog) + return covarianceFilteredData + } + } + dataset.drop(getFieldsRemoved: _*) + } + override def transformSchemaInternal(schema: StructType): StructType = { + if(schema.fieldNames.contains(getLabelColumn)) { + StructType(schema.fields.filterNot(field => getFieldsRemoved.contains(field.name))) + } else { + schema + } + } + override def copy(extra: ParamMap): CovarianceFilterTransformer = defaultCopy(extra) +} +object CovarianceFilterTransformer extends DefaultParamsReadable[CovarianceFilterTransformer] { + override def load(path: String): CovarianceFilterTransformer = super.load(path) +} \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/DataSanitizerTransformer.scala 
b/src/main/scala/com/databricks/labs/automl/pipeline/DataSanitizerTransformer.scala new file mode 100644 index 00000000..5c529a31 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/DataSanitizerTransformer.scala @@ -0,0 +1,314 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.inference.NaFillConfig +import com.databricks.labs.automl.sanitize.DataSanitizer +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.util.{ + DefaultParamsReadable, + DefaultParamsWritable, + Identifiable +} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * Input: Original feature columns + * Output: sanitized, nafilled columns + * + */ +class DataSanitizerTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFeatureColumn { + + final val numericFillStat: Param[String] = + new Param[String](this, "numericFillStat", "Numeric fill stats") + final val characterFillStat: Param[String] = + new Param[String](this, "characterFillStat", "Character fill stat") + final val modelSelectionDistinctThreshold: IntParam = new IntParam( + this, + "modelSelectionDistinctThreshold", + "model selection distinct threshold" + ) + final val filterPrecision: DoubleParam = + new DoubleParam(this, "filterPrecision", "Filter precision") + final val parallelism: IntParam = + new IntParam(this, "parallelism", "filter parallelism") + final val naFillFlag: BooleanParam = + new BooleanParam(this, "naFillFlag", "Na Fill flag") + final val categoricalColumnNames = + new StringArrayParam(this, "categoricalColumnNames", "Categorical Columns") + final val categoricalColumnValues = new StringArrayParam( + this, + "categoricalColumnValues", + "Categorical Columns' Values" + ) + final val 
numericColumnNames = + new StringArrayParam(this, "numericColumnNames", "Numeric Columns") + final val numericColumnValues = + new DoubleArrayParam(this, "numericColumnValues", "Numeric Columns' Values") + final val booleanColumnNames = + new StringArrayParam(this, "booleanColumnNames", "Boolean Columns") + final val booleanColumnValues = + new StringArrayParam(this, "booleanColumnValues", "Boolean Columns' Values") + final val decideModel: Param[String] = + new Param[String](this, "decideModel", "Decided model") + final val fillMode: Param[String] = + new Param[String](this, "fillMode", "fillMode") + final val characterNABlanketFill: Param[String] = + new Param[String](this, "characterNABlanketFill", "characterNABlanketFill") + final val numericNABlanketFill: DoubleParam = + new DoubleParam(this, "numericNABlanketFill", "numericNABlanketFill") + final val categoricalNAFillMapKeys: StringArrayParam = new StringArrayParam( + this, + "categoricalNAFillMapKeys", + "categoricalNAFillMapKeys" + ) + final val categoricalNAFillMapValues: StringArrayParam = new StringArrayParam( + this, + "categoricalNAFillMapValues", + "categoricalNAFillMapValues" + ) + final val numericNAFillMapKeys: StringArrayParam = + new StringArrayParam(this, "numericNAFillMapKeys", "numericNAFillMapKeys") + final val numericNAFillMapValues: DoubleArrayParam = new DoubleArrayParam( + this, + "numericNAFillMapValues", + "numericNAFillMapValues" + ) + + def setNumericFillStat(value: String): this.type = set(numericFillStat, value) + + def getNumericFillStat: String = $(numericFillStat) + + def setCharacterFillStat(value: String): this.type = + set(characterFillStat, value) + + def getCharacterFillStat: String = $(characterFillStat) + + def setModelSelectionDistinctThreshold(value: Int): this.type = + set(modelSelectionDistinctThreshold, value) + + def getModelSelectionDistinctThreshold: Int = + $(modelSelectionDistinctThreshold) + + def setFilterPrecision(value: Double): this.type = 
set(filterPrecision, value) + + def getFilterPrecision: Double = $(filterPrecision) + + def setParallelism(value: Int): this.type = set(parallelism, value) + + def getParallelism: Int = $(parallelism) + + def setNaFillFlag(value: Boolean): this.type = set(naFillFlag, value) + + def getNaFillFlag: Boolean = $(naFillFlag) + + def setCategoricalColumnNames(value: Array[String]): this.type = + set(categoricalColumnNames, value) + + def getCategoricalColumnNames: Array[String] = $(categoricalColumnNames) + + def setCategoricalColumnValues(value: Array[String]): this.type = + set(categoricalColumnValues, value) + + def getCategoricalColumnValues: Array[String] = $(categoricalColumnValues) + + def setNumericColumnNames(value: Array[String]): this.type = + set(numericColumnNames, value) + + def getNumericColumnNames: Array[String] = $(numericColumnNames) + + def setBooleanColumnNames(value: Array[String]): this.type = + set(booleanColumnNames, value) + + def getBooleanColumnNames: Array[String] = $(booleanColumnNames) + + def setBooleanColumnValues(value: Array[Boolean]): this.type = + set(booleanColumnValues, value.map(_.toString)) + + def getBooleanColumnValues: Array[Boolean] = + $(booleanColumnValues).map(_.toBoolean) + + def setNumericColumnValues(value: Array[Double]): this.type = + set(numericColumnValues, value) + + def getNumericColumnValues: Array[Double] = $(numericColumnValues) + + def setDecideModel(value: String): this.type = set(decideModel, value) + + def getDecideModel: String = $(decideModel) + + def setFillMode(value: String): this.type = set(fillMode, value) + + def getFillMode: String = $(fillMode) + + def setCharacterNABlanketFill(value: String): this.type = + set(characterNABlanketFill, value) + + def getCharacterNABlanketFill: String = $(characterNABlanketFill) + + def setNumericNABlanketFill(value: Double): this.type = + set(numericNABlanketFill, value) + + def getNumericNABlanketFill: Double = $(numericNABlanketFill) + + def 
setCategoricalNAFillMapKeys(value: Array[String]): this.type = + set(categoricalNAFillMapKeys, value) + + def getCategoricalNAFillMapKeys: Array[String] = $(categoricalNAFillMapKeys) + + def setCategoricalNAFillMapValues(value: Array[String]): this.type = + set(categoricalNAFillMapValues, value) + + def getCategoricalNAFillMapValues: Array[String] = + $(categoricalNAFillMapValues) + + def setNumericNAFillMapKeys(value: Array[String]): this.type = + set(numericNAFillMapKeys, value) + + def getNumericNAFillMapKeys: Array[String] = $(numericNAFillMapKeys) + + def setNumericNAFillMapValues(value: Array[Double]): this.type = + set(numericNAFillMapValues, value) + + def getNumericNAFillMapValues: Array[Double] = $(numericNAFillMapValues) + + def setCategoricalNAFillMap(value: Map[String, String]): this.type = { + setCategoricalNAFillMapKeys(value.keys.toArray) + setCategoricalNAFillMapValues(value.values.toArray) + } + + def setNumericNAFillMap(value: Map[String, Double]): this.type = { + setNumericNAFillMapKeys(value.keys.toArray) + setNumericNAFillMapValues(value.values.toArray) + } + + def this() = { + this(Identifiable.randomUID("DataSanitizerTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setFeatureCol("features") + setNumericFillStat("mean") + setCharacterFillStat("max") + setModelSelectionDistinctThreshold(10) + setFilterPrecision(0.01) + setParallelism(20) + setNaFillFlag(false) + setDecideModel("") + setCategoricalColumnNames(Array.empty) + setCategoricalColumnValues(Array.empty) + setNumericColumnNames(Array.empty) + setNumericColumnValues(Array.empty) + setBooleanColumnNames(Array.empty) + setBooleanColumnValues(Array.empty) + setCategoricalNAFillMapKeys(Array.empty) + setCategoricalNAFillMapValues(Array.empty) + setNumericNAFillMapKeys(Array.empty) + setNumericNAFillMapValues(Array.empty) + setCharacterNABlanketFill("") + setNumericNABlanketFill(0.0) + setFillMode("auto") + setDebugEnabled(false) + } + + override def 
transformInternal(dataset: Dataset[_]): DataFrame = { + val naConfig = new DataSanitizer(dataset.toDF()) + .setLabelCol(getLabelColumn) + .setFeatureCol(getFeatureCol) + .setModelSelectionDistinctThreshold(getModelSelectionDistinctThreshold) + .setNumericFillStat(getNumericFillStat) + .setCharacterFillStat(getCharacterFillStat) + .setParallelism(getParallelism) + .setCategoricalNAFillMap( + SchemaUtils.generateMapFromKeysValues( + getCategoricalNAFillMapKeys, + getCategoricalNAFillMapValues + ) + ) + .setCharacterNABlanketFillValue(getCharacterNABlanketFill) + .setNumericNABlanketFillValue(getNumericNABlanketFill) + .setNumericNAFillMap( + SchemaUtils.generateMapFromKeysValues( + getNumericNAFillMapKeys, + getNumericNAFillMapValues + ) + ) + .setNAFillMode(getFillMode) + .setFilterPrecision(getFilterPrecision) + .setFieldsToIgnoreInVector(Array(getAutomlInternalId)) + + val (naFilledDataFrame, fillMap, detectedModelType) = + if (getNaFillFlag) { + val naFillConfigTmp = buildNaConfig() + if (naFillConfigTmp.isDefined) { + naConfig.generateCleanData( + naFillConfigTmp.get, + refactorLabelFlag = false, + decidedModel = getDecideModel + ) + } else { + naConfig.generateCleanData( + refactorLabelFlag = false, + decidedModel = getDecideModel + ) + } + } else { + ( + dataset, + NaFillConfig(Map("" -> ""), Map("" -> 0.0), Map("" -> false)), + naConfig.decideModel() + ) + } + if (getDecideModel == null || getDecideModel.isEmpty) { + setCategoricalColumnNames(fillMap.categoricalColumns.keys.toArray) + setCategoricalColumnValues(fillMap.categoricalColumns.values.toArray) + setNumericColumnNames(fillMap.numericColumns.keys.toArray) + setNumericColumnValues(fillMap.numericColumns.values.toArray) + setDecideModel(detectedModelType) + } + + naFilledDataFrame + .toDF() + .filter(col($(labelColumn)).isNotNull) + .filter(!col($(labelColumn)).isNaN) + } + + private def buildNaConfig(): Option[NaFillConfig] = { + if (SchemaUtils.isNotEmpty(getCategoricalColumnNames) && + 
SchemaUtils.isNotEmpty(getNumericColumnNames)) { + return Some( + NaFillConfig( + categoricalColumns = SchemaUtils.generateMapFromKeysValues( + getCategoricalColumnNames, + getCategoricalColumnValues + ), + numericColumns = SchemaUtils.generateMapFromKeysValues( + getNumericColumnNames, + getNumericColumnValues + ), + booleanColumns = SchemaUtils.generateMapFromKeysValues( + getBooleanColumnNames, + getBooleanColumnValues + ) + ) + ) + } + None + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): DataSanitizerTransformer = + defaultCopy(extra) +} + +object DataSanitizerTransformer + extends DefaultParamsReadable[DataSanitizerTransformer] { + override def load(path: String): DataSanitizerTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/DatasetsUnionTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/DatasetsUnionTransformer.scala new file mode 100644 index 00000000..4afd4a97 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/DatasetsUnionTransformer.scala @@ -0,0 +1,75 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +import scala.util.Sorting + +/** + * @author Jas Bali + * A transformer stage that unions two datasets by column name. It is useful + * when two datasets need to be unioned in an intermediate step of a pipeline + * + * NOTE: Transformer semantics do not allow passing two datasets to a transform method. 
+ * As a workaround, the first dataset needs to be registered as a temp table outside of this transformer + * using [[RegisterTempTableTransformer]] transformer. + */ +class DatasetsUnionTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable { + + final val unionDatasetName = new Param[String](this, "unionDatasetName", "unionDatasetName") + + def setUnionDatasetName(value: String): this.type = set(unionDatasetName, value) + + def getUnionDatasetName: String = $(unionDatasetName) + + def this() = { + this(Identifiable.randomUID("DatasetsUnionTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + val dfs = prepareUnion( + dataset.sqlContext.sql(s"select * from $getUnionDatasetName"), + dataset.toDF()) + dfs._1.unionByName(dfs._2) + } + + private def prepareUnion(df1: DataFrame, df2: DataFrame): (DataFrame, DataFrame) = { + validateUnion(df1, df2) + val colNames = df1.schema.fieldNames + Sorting.quickSort(colNames) + val newDf1 = df1.select( colNames map col:_*) + val newDf2 = df2.select( colNames map col:_*) + val returnVal = (newDf1, newDf2) + returnVal + } + + private def validateUnion(df1: DataFrame, df2: DataFrame): Unit = { + val df1Cols = df1.schema.fieldNames + Sorting.quickSort(df1Cols) + val df2Cols = df2.schema.fieldNames + Sorting.quickSort(df2Cols) + val df1SchemaString = df1.select(df1Cols map col:_*).schema.toString() + val df2SchemaString = df2.select(df2Cols map col:_*).schema.toString() + assert(df1SchemaString.equals(df2SchemaString), + s"Different schemas for union DFs. 
\n DF1 schema $df1SchemaString \n " + + s"DF2 schema $df2SchemaString \n") + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): DatasetsUnionTransformer = defaultCopy(extra) +} + +object DatasetsUnionTransformer extends DefaultParamsReadable[DatasetsUnionTransformer] { + override def load(path: String): DatasetsUnionTransformer = super.load(path) +} \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/DateFieldTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/DateFieldTransformer.scala new file mode 100644 index 00000000..8b220a14 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/DateFieldTransformer.scala @@ -0,0 +1,152 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.{ + AutoMlPipelineMlFlowUtils, + DataValidation, + SchemaUtils +} +import org.apache.spark.ml.param.{Param, ParamMap, StringArrayParam} +import org.apache.spark.ml.util.{ + DefaultParamsReadable, + DefaultParamsWritable, + Identifiable +} +import org.apache.spark.sql.types.{IntegerType, StructField, StructType} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer + +/** + * @author Jas Bali + * A transformer stage that breaks down date and time columns into the following feature columns: + * Date = day, month, year + * time = day, month, year, hour, minutes, seconds + */ +class DateFieldTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with DataValidation + with HasLabelColumn { + + def this() = { + this(Identifiable.randomUID("DateFieldTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setNewDateTimeFeatureColumns(Array.empty) + setOldDateTimeFeatureColumns(Array.empty) + setDebugEnabled(false) + } + + final val mode: 
Param[String] = new Param[String]( + this, + "mode", + "date/time conversion mode. Possible values 'split' and 'unix'" + ) + + final val newDateTimeFeatureColumns: StringArrayParam = new StringArrayParam( + this, + "newDateTimeFeatureColumns", + "New Columns that were added for converting date/time features " + ) + + final val oldDateTimeFeatureColumns: StringArrayParam = new StringArrayParam( + this, + "oldDateTimeFeatureColumns", + "Old Columns before converting date/time features" + ) + + def setMode(value: String): this.type = set(mode, value) + + def getMode: String = $(mode) + + def setNewDateTimeFeatureColumns(value: Array[String]): this.type = + set(newDateTimeFeatureColumns, value) + + def getNewDateTimeFeatureColumns: Array[String] = $(newDateTimeFeatureColumns) + + def setOldDateTimeFeatureColumns(value: Array[String]): this.type = + set(oldDateTimeFeatureColumns, value) + + def getOldDateTimeFeatureColumns: Array[String] = $(oldDateTimeFeatureColumns) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + val columnTypes = SchemaUtils.extractTypes( + dataset.select( + dataset.columns + .filterNot(item => getAutomlInternalId.equals(item)) map col: _* + ), + getLabelColumn + ) + if (columnTypes != null && + (SchemaUtils.isNotEmpty(columnTypes.dateFields) || SchemaUtils + .isNotEmpty(columnTypes.timeFields))) { + val dfWithDateTimeTransformedFeatures = convertDateAndTime( + dataset.toDF(), + columnTypes.dateFields, + columnTypes.timeFields, + getMode + ) + val newDateTimeFeatureColumns = + dfWithDateTimeTransformedFeatures._2.toArray[String] + val columnsConvertedFrom = new ArrayBuffer[String]() + if (SchemaUtils.isNotEmpty(columnTypes.dateFields)) { + columnsConvertedFrom ++= columnTypes.dateFields + } + if (SchemaUtils.isNotEmpty(columnTypes.timeFields)) { + columnsConvertedFrom ++= columnTypes.timeFields + } + setParamsIfEmptyInternal( + newDateTimeFeatureColumns, + columnsConvertedFrom.toArray + ) + return 
dfWithDateTimeTransformedFeatures._1.drop(columnsConvertedFrom: _*) + } + dataset.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + if (SchemaUtils.isNotEmpty(getOldDateTimeFeatureColumns)) { + val allCols = schema.fields.map(field => field.name) + val missingDateTimeCols = + getOldDateTimeFeatureColumns.filterNot(name => allCols.contains(name)) + if (missingDateTimeCols.nonEmpty) { + throw new RuntimeException( + s"""Following columns are missing: ${missingDateTimeCols.mkString( + ", " + )}""" + ) + } + } + if (SchemaUtils.isNotEmpty(getNewDateTimeFeatureColumns)) { + val newFields: Array[StructField] = getNewDateTimeFeatureColumns.map( + colName => StructField(colName, IntegerType) + ) + return StructType( + schema.fields + .filterNot(field => getOldDateTimeFeatureColumns.contains(field.name)) + ++ + newFields + ) + } + schema + } + + private def setParamsIfEmptyInternal( + newDateTimeFeatureColumns: Array[String], + oldDateTimeFeatureColumns: Array[String] + ): Unit = { + if (SchemaUtils.isEmpty(getNewDateTimeFeatureColumns)) { + setNewDateTimeFeatureColumns(newDateTimeFeatureColumns) + } + if (SchemaUtils.isEmpty(getOldDateTimeFeatureColumns)) { + setOldDateTimeFeatureColumns(oldDateTimeFeatureColumns) + } + } + + override def copy(extra: ParamMap): DateFieldTransformer = defaultCopy(extra) +} + +object DateFieldTransformer + extends DefaultParamsReadable[DateFieldTransformer] { + override def load(path: String): DateFieldTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/DropColumnsTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/DropColumnsTransformer.scala new file mode 100644 index 00000000..884f7fe8 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/DropColumnsTransformer.scala @@ -0,0 +1,50 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import 
org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.param.shared.HasInputCols +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * A transformer stage that can drop columns from an input Dataset. + * Necessary when there are intermediate stages that require columns to be + * removed from an input dataset because they aren't needed in the downstream + * stages anymore, such as input columns for SI are not needed for OHE, input cols for + * VA etc + */ +class DropColumnsTransformer (override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasInputCols { + + def this() = { + this(Identifiable.randomUID("DropColumnsTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + def setInputCols(value: Array[String]): this.type = set(inputCols, value) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + if(SchemaUtils.isNotEmpty(getInputCols)) { + return dataset.drop(getInputCols: _*) + } + dataset.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + if(SchemaUtils.isNotEmpty(getInputCols)) { + return StructType(schema.fields.filterNot(field => getInputCols.contains(field.name))) + } + schema + } + + override def copy(extra: ParamMap): DropColumnsTransformer = defaultCopy(extra) +} + +object DropColumnsTransformer extends DefaultParamsReadable[DropColumnsTransformer] { + override def load(path: String): DropColumnsTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/DropTempTableTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/DropTempTableTransformer.scala new file mode 100644 index 00000000..df74d196 --- /dev/null +++ 
b/src/main/scala/com/databricks/labs/automl/pipeline/DropTempTableTransformer.scala @@ -0,0 +1,47 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.types.StructType + +/** + * @author Jas Bali + * A [[WithNoopsStage]] transformer stage that is helpful when a previous stage + * registers a temp table and is no longer required for the rest of the pipeline. + * Supposed to be used with [[RegisterTempTableTransformer]] and [[DatasetsUnionTransformer]] + * @param uid + */ +class DropTempTableTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with WithNoopsStage { + + final val tempTableName = new Param[String](this, "tempTableName", "tempTableName") + + def setTempTableName(value: String): this.type = set(tempTableName, value) + + def getTempTableName: String = $(tempTableName) + + def this() = { + this(Identifiable.randomUID("DropTempTableTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + dataset.sqlContext.dropTempTable(getTempTableName) + dataset.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): DropTempTableTransformer = defaultCopy(extra) +} + +object DropTempTableTransformer extends DefaultParamsReadable[DropTempTableTransformer] { + override def load(path: String): DropTempTableTransformer = super.load(path) +} \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringPipelineContext.scala 
b/src/main/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringPipelineContext.scala new file mode 100644 index 00000000..937e0b0f --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringPipelineContext.scala @@ -0,0 +1,1049 @@ +package com.databricks.labs.automl.pipeline + +import java.util.UUID + +import com.databricks.labs.automl.exceptions.{ + DateFeatureConversionException, + FeatureConversionException, + TimeFeatureConversionException +} +import com.databricks.labs.automl.feature.FeatureInteraction +import com.databricks.labs.automl.params.{GroupedModelReturn, MainConfig} +import com.databricks.labs.automl.pipeline.PipelineVars._ +import com.databricks.labs.automl.sanitize.Scaler +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.log4j.Logger +import org.apache.spark.ml.feature._ +import org.apache.spark.ml.mleap.SparkUtil +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.ml.{Model, Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.sql.DataFrame + +/** + * @author Jas Bali + * This singleton encapsulates generation of the feature engineering pipeline as well as the inference pipeline, given + * [[MainConfig]] and input [[DataFrame]] + */ +import scala.collection.mutable.ArrayBuffer + +final case class VectorizationOutput(pipelineModel: PipelineModel, + vectorizedCols: Array[String]) + +final case class FeatureEngineeringOutput(pipelineModel: PipelineModel, + originalDfViewName: String, + decidedModel: String, + transformedForTrainingDf: DataFrame) + +object FeatureEngineeringPipelineContext { + + @transient lazy private val logger: Logger = Logger.getLogger(this.getClass) + + //TODO (Jas): verbose = true only works for the feature engineering pipeline; the full predict pipeline needs to be updated to support it. 
+ def generatePipelineModel( + originalInputDataset: DataFrame, + mainConfig: MainConfig, + verbose: Boolean = false, + isFeatureEngineeringOnly: Boolean = false + ): FeatureEngineeringOutput = { + val originalDfTempTableName = Identifiable.randomUID("zipWithId") + + val removeColumns = new ArrayBuffer[String] + + // First Transformation: Select required columns, convert date/time features and apply cardinality limit + val initialPipelineModel = selectFeaturesConvertTypesAndApplyCardLimit( + originalInputDataset, + mainConfig, + originalDfTempTableName + ) + val initialTransformationDf = + initialPipelineModel.transform(originalInputDataset) + + // Second Transformation: Apply string indexers, apply vector assembler, drop unnecessary columns + val secondTransformation = applyStngIndxVectAssembler( + initialTransformationDf, + mainConfig, + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + verbose + ) + var vectorizedColumns = secondTransformation.vectorizedCols + removeColumns ++= vectorizedColumns + val secondTransformationPipelineModel = secondTransformation.pipelineModel + val secondTransformationDf = + secondTransformationPipelineModel.transform(initialTransformationDf) + + val modelDecider = secondTransformationPipelineModel.stages + .find(item => item.isInstanceOf[DataSanitizerTransformer]) + .get + val decidedModel = modelDecider + .getOrDefault(modelDecider.getParam("decideModel")) + .asInstanceOf[String] + + val stages = new ArrayBuffer[PipelineStage]() + + // Apply Outlier Filtering + getAndAddStage(stages, outlierFilterStage(mainConfig)) + + // Apply Vector Assembler + getAndAddStage(stages, vectorAssemblerStage(mainConfig, vectorizedColumns)) + + // Apply Variance filter + getAndAddStage(stages, varianceFilterStage(mainConfig)) + + // Apply Covariance Filtering + getAndAddStage( + stages, + covarianceFilteringStage(mainConfig, vectorizedColumns) + ) + + // Apply Pearson Filtering + getAndAddStage( + stages, + pearsonFilteringStage(mainConfig, 
vectorizedColumns, decidedModel) + ) + + // Third Transformation + var thirdPipelineModel = + new Pipeline().setStages(stages.toArray).fit(secondTransformationDf) + val thirdTransformationDf = + thirdPipelineModel.transform(secondTransformationDf) + val oheInputCols = thirdTransformationDf.columns + .filter(item => item.endsWith(PipelineEnums.SI_SUFFIX.value)) + .filterNot( + item => + (mainConfig.labelCol + PipelineEnums.SI_SUFFIX.value).equals(item) + ) + + // Feature Interaction stages + thirdPipelineModel = if (mainConfig.featureInteractionFlag) { + + val featureInteractionTotalVectorFields = thirdTransformationDf.columns + .filterNot( + x => + mainConfig.labelCol.equals(x) || mainConfig.featuresCol + .equals(x) || mainConfig.fieldsToIgnoreInVector + .contains(x) || AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL + .equals(x) + ) + val featureInteractionNominalFields = oheInputCols + val featureInteractionContinuousFields = + featureInteractionTotalVectorFields.diff( + featureInteractionNominalFields + ) + + val featureInteractionStage = FeatureInteraction.interactionPipeline( + data = thirdTransformationDf, + nominalFields = featureInteractionNominalFields, + continuousFields = featureInteractionContinuousFields, + modelingType = decidedModel, + retentionMode = mainConfig.featureInteractionConfig.retentionMode, + labelCol = mainConfig.labelCol, + featureCol = mainConfig.featuresCol, + continuousDiscretizerBucketCount = + mainConfig.featureInteractionConfig.continuousDiscretizerBucketCount, + parallelism = mainConfig.featureInteractionConfig.parallelism, + targetInteractionPercentage = + mainConfig.featureInteractionConfig.targetInteractionPercentage + ) + + vectorizedColumns = featureInteractionStage.fullFeatureVectorColumns + + removeColumns ++= featureInteractionStage.fullFeatureVectorColumns + + val featureInteractionPipelineModel = + featureInteractionStage.pipeline.fit(thirdTransformationDf) + + mergePipelineModels( + ArrayBuffer(thirdPipelineModel, 
featureInteractionPipelineModel) + ) + + } else thirdPipelineModel + + val featureInteractionDf = + thirdPipelineModel.transform(secondTransformationDf) + + val finalOheCols = featureInteractionDf.columns + .filter(item => item.endsWith(PipelineEnums.SI_SUFFIX.value)) + .filterNot( + item => + (mainConfig.labelCol + PipelineEnums.SI_SUFFIX.value).equals(item) + ) + + //Get columns removed from above stages + val colsRemoved = getColumnsRemoved(thirdPipelineModel) + + // Ksampler stages + val ksampleStages = ksamplerStages( + mainConfig, + isFeatureEngineeringOnly, + vectorizedColumns.filterNot(colsRemoved.contains(_)) + ) + var ksampledDf = featureInteractionDf + if (ksampleStages.isDefined) { + val ksamplerPipelineModel = + new Pipeline().setStages(ksampleStages.get).fit(featureInteractionDf) + ksampledDf = ksamplerPipelineModel.transform(featureInteractionDf) + + // Save ksampler states in pipeline cache to be accessed later for logging to Mlflow + PipelineStateCache + .addToPipelineCache( + mainConfig.pipelineId, + PipelineVars.KSAMPLER_STAGES.key, + ksampleStages.get.map(item => item.getClass.getName).mkString(", ") + ) + } + + val lastStages = new ArrayBuffer[PipelineStage]() + // Roundup OHE input Cols + getAndAddStage( + lastStages, + Some(new RoundUpDoubleTransformer().setInputCols(finalOheCols)) + ) + + //TODO: When we figure out the metadata loss issue, remove this extra stage of StringIndexers. 
+ val oheModdedCols = finalOheCols.map( + x => + if (x.endsWith(PipelineEnums.SI_SUFFIX.value)) + x + PipelineEnums.SI_SUFFIX.value + else x + ) + val preOheCols = + finalOheCols.filter(_.endsWith(PipelineEnums.SI_SUFFIX.value)) + getAndAddStages(lastStages, stringIndexerStage(mainConfig, preOheCols)) + + removeColumns ++= oheModdedCols + removeColumns ++= preOheCols + removeColumns ++= finalOheCols + + // Apply OneHotEncoding Options + getAndAddStage(lastStages, oneHotEncodingStage(mainConfig, oheModdedCols)) + getAndAddStage( + lastStages, + dropColumns(Array(mainConfig.featuresCol), mainConfig) + ) + // Execute Vector Assembler Again + if (mainConfig.oneHotEncodeFlag) { + //Exclude columns removed by variance, covariance and pearson + val allVectCols = oheModdedCols.map( + SchemaUtils.generateOneHotEncodedColumn + ) ++ vectorizedColumns.filterNot( + _.endsWith(PipelineEnums.SI_SUFFIX.value) + ) + + val vectorCols = allVectCols.filterNot(colsRemoved.contains(_)) + + removeColumns ++= vectorCols + + getAndAddStage(lastStages, vectorAssemblerStage(mainConfig, vectorCols)) + } else { + //Exclude columns removed by variance, covariance and pearson + getAndAddStage( + lastStages, + vectorAssemblerStage( + mainConfig, + vectorizedColumns.filterNot(colsRemoved.contains(_)) + ) + ) + } + + // Apply Scaler option + getAndAddStages(lastStages, scalerStage(mainConfig)) + + // Drop Unnecessary columns - output of feature engineering stage should only contain automl_internal_id, label, features and synthetic from ksampler + removeColumns ++= finalOheCols.map(SchemaUtils.generateOneHotEncodedColumn) ++ oheModdedCols + .map(SchemaUtils.generateOneHotEncodedColumn) + + if (!verbose) { + getAndAddStage( + lastStages, + dropColumns(removeColumns.distinct.toArray, mainConfig) + ) + } + // final transformation + val fourthPipelineModel = + new Pipeline().setStages(lastStages.toArray).fit(ksampledDf) + val fourthTransformationDf = fourthPipelineModel.transform(ksampledDf) + + 
//Extract Decided model from DataSanitizer stage + + FeatureEngineeringOutput( + mergePipelineModels( + ArrayBuffer( + initialPipelineModel, + secondTransformationPipelineModel, + thirdPipelineModel, + fourthPipelineModel + ) + ), + originalDfTempTableName, + decidedModel, + fourthTransformationDf + ) + } + + private def getColumnsRemoved( + thirdPipelineModel: PipelineModel + ): Array[String] = { + val removedCols = new ArrayBuffer[String]() + val removedByVariance = thirdPipelineModel.stages + .filter(_.isInstanceOf[VarianceFilterTransformer]) + .map(_.asInstanceOf[VarianceFilterTransformer]) + + val removedByCovariance = thirdPipelineModel.stages + .filter(_.isInstanceOf[CovarianceFilterTransformer]) + .map(_.asInstanceOf[CovarianceFilterTransformer]) + + val removedByPearson = thirdPipelineModel.stages + .filter(_.isInstanceOf[PearsonFilterTransformer]) + .map(_.asInstanceOf[PearsonFilterTransformer]) + + if (removedByVariance != null && removedByVariance.nonEmpty) { + removedCols ++= removedByVariance.head.getRemovedColumns + } + if (removedByCovariance != null && removedByCovariance.nonEmpty) { + removedCols ++= removedByCovariance.head.getFieldsRemoved + } + if (removedByPearson != null && removedByPearson.nonEmpty) { + removedCols ++= removedByPearson.head.getFieldsRemoved + } + removedCols.toArray + } + + def buildFullPredictPipeline(featureEngOutput: FeatureEngineeringOutput, + modelReport: Array[GroupedModelReturn], + mainConfiguration: MainConfig, + originalDf: DataFrame): PipelineModel = { + val pipelineModelStages = new ArrayBuffer[PipelineModel]() + //Build Pipeline here + // get Feature eng. 
pipeline model + pipelineModelStages += featureEngOutput.pipelineModel + + val bestModel = + getBestModel(modelReport, mainConfiguration.scoringOptimizationStrategy) + val mlPipelineModel = SparkUtil.createPipelineModel( + Array(bestModel.model.asInstanceOf[Model[_]]) + ) + + pipelineModelStages += mlPipelineModel + val pipelinewithMlModel = + FeatureEngineeringPipelineContext.mergePipelineModels(pipelineModelStages) + val pipelinewithMlModelDf = + mlPipelineModel.transform(featureEngOutput.transformedForTrainingDf) + + // Add Index To String Stage + val pipelineModelWithLabelSi = addLabelIndexToString( + pipelinewithMlModel, + pipelinewithMlModelDf, + mainConfiguration + ) + val pipelineModelWithLabelSiDf = + pipelineModelWithLabelSi.transform(originalDf) + + val prefinalPipelineModel = addUserReturnViewStage( + pipelineModelWithLabelSi, + mainConfiguration, + pipelineModelWithLabelSiDf, + featureEngOutput.originalDfViewName + ) + + // Removes train-only stages, if present, such as OutlierTransformer and SyntheticDataTransformer + val finalPipelineModel = buildInferencePipelineStages(prefinalPipelineModel) + // log full pipeline stage names to toMlFlow, save pipeline and register with MlFlow + savePipelineLogToMlFLow( + mainConfiguration, + featureEngOutput, + finalPipelineModel, + prefinalPipelineModel, + originalDf + ) + finalPipelineModel + } + + private def savePipelineLogToMlFLow( + mainConfiguration: MainConfig, + featureEngOutput: FeatureEngineeringOutput, + finalPipelineModel: PipelineModel, + prefinalPipelineModel: PipelineModel, + originalDf: DataFrame + ): Unit = { + if (mainConfiguration.mlFlowLoggingFlag) { + AutoMlPipelineMlFlowUtils + .saveInferencePipelineDfAndLogToMlFlow( + mainConfiguration.pipelineId, + featureEngOutput.decidedModel, + mainConfiguration.modelFamily, + mainConfiguration.mlFlowConfig.mlFlowModelSaveDirectory, + finalPipelineModel, + originalDf + ) + val totalStagesExecuted = + if (mainConfiguration.geneticConfig.trainSplitMethod 
== "kSample") { + prefinalPipelineModel.stages.length + PipelineStateCache + .getFromPipelineByIdAndKey( + mainConfiguration.pipelineId, + PipelineVars.KSAMPLER_STAGES.key + ) + .asInstanceOf[String] + .split(", ") + .length + } else { + prefinalPipelineModel.stages.length + } + PipelineMlFlowProgressReporter.completed( + mainConfiguration.pipelineId, + totalStagesExecuted + ) + } + } + + private def buildInferencePipelineStages( + pipelineModel: PipelineModel + ): PipelineModel = { + val nonTrainingStages = + pipelineModel.stages.filterNot(_.isInstanceOf[IsTrainingStage]) + logger.debug( + s"""Retained the following non-training stages in inference-only pipeline ${nonTrainingStages + .map(_.uid) + .mkString(", ")}""" + ) + SparkUtil.createPipelineModel( + Identifiable.randomUID("final_linted_infer_pipeline"), + nonTrainingStages + ) + } + + private def getBestModel(runData: Array[GroupedModelReturn], + optimizationStrategy: String): GroupedModelReturn = { + optimizationStrategy match { + case "minimize" => runData.sortWith(_.score < _.score)(0) + case _ => runData.sortWith(_.score > _.score)(0) + } + } + + private def addLabelIndexToString(pipelineModel: PipelineModel, + dataFrame: DataFrame, + mainConfig: MainConfig): PipelineModel = { + if (SchemaUtils.isLabelRefactorNeeded(dataFrame.schema, mainConfig.labelCol) + || + PipelineStateCache + .getFromPipelineByIdAndKey( + mainConfig.pipelineId, + PIPELINE_LABEL_REFACTOR_NEEDED_KEY.key + ) + .asInstanceOf[Boolean]) { + //Find the label string indexer stage by its uid prefix + val stringIndexerLabels = + pipelineModel.stages + .find( + _.uid + .startsWith(PipelineEnums.LABEL_STRING_INDEXER_STAGE_NAME.value) + ) + .get + .asInstanceOf[StringIndexerModel] + .labels + + val labelRefactorPipelineModel = new Pipeline() + .setStages( + Array( + new IndexToString() + .setInputCol("prediction") + .setOutputCol("prediction_stng") + .setLabels(stringIndexerLabels), + new DropColumnsTransformer() + 
.setInputCols(Array("prediction")) + .setPipelineId(mainConfig.pipelineId), + new ColumnNameTransformer() + .setInputColumns(Array("prediction_stng")) + .setOutputColumns(Array("prediction")) + .setPipelineId(mainConfig.pipelineId) + ) + ) + .fit(dataFrame) + labelRefactorPipelineModel.transform(dataFrame) + + return mergePipelineModels( + ArrayBuffer(pipelineModel, labelRefactorPipelineModel) + ) + + } + pipelineModel + } + + private def getInputFeautureCols(inputDataFrame: DataFrame, + mainConfig: MainConfig): Array[String] = { + inputDataFrame.columns + .filterNot(mainConfig.fieldsToIgnoreInVector.contains) + .filterNot( + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL).contains + ) + .filterNot(Array(mainConfig.labelCol).contains) + } + + def addUserReturnViewStage(pipelineModel: PipelineModel, + mainConfig: MainConfig, + dataFrame: DataFrame, + originalDfTempTableName: String): PipelineModel = { + // Generate output dataset + val inputFeatures = getInputFeautureCols( + dataFrame.sqlContext.sql(s"select * from $originalDfTempTableName"), + mainConfig + ) + + val userViewPipelineModel = new Pipeline() + .setStages( + Array( + new AutoMlOutputDatasetTransformer() + .setTempViewOriginalDatasetName(originalDfTempTableName) + .setLabelColumn(mainConfig.labelCol) + .setFeatureColumns(inputFeatures) + .setPipelineId(mainConfig.pipelineId) + ) + ) + .fit(dataFrame) + + userViewPipelineModel.transform(dataFrame) + + mergePipelineModels(ArrayBuffer(pipelineModel, userViewPipelineModel)) + } + + /** + * Select feature columns, converting date/time features and applying cardinality limit stages + * + * @param dataFrame + * @param mainConfig + * @param originalDfTempTableName + * @return + */ + private def selectFeaturesConvertTypesAndApplyCardLimit( + dataFrame: DataFrame, + mainConfig: MainConfig, + originalDfTempTableName: String + ): PipelineModel = { + + // Stage to select only those columns that are needed in the downstream stages + // also creates a temp view 
of the original dataset which will then be used by the last stage + // to return user table + val inputFeatures = getInputFeautureCols(dataFrame, mainConfig) + + val zipRegisterTempTransformer = new ZipRegisterTempTransformer() + .setTempViewOriginalDatasetName(originalDfTempTableName) + .setLabelColumn(mainConfig.labelCol) + .setFeatureColumns(inputFeatures) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + val mlFlowLoggingValidationStageTransformer = + new MlFlowLoggingValidationStageTransformer() + .setMlFlowAPIToken(mainConfig.mlFlowConfig.mlFlowAPIToken) + .setMlFlowTrackingURI(mainConfig.mlFlowConfig.mlFlowTrackingURI) + .setMlFlowExperimentName(mainConfig.mlFlowConfig.mlFlowExperimentName) + .setMlFlowLoggingFlag(mainConfig.mlFlowLoggingFlag) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + val cardinalityLimitColumnPrunerTransformer = + new CardinalityLimitColumnPrunerTransformer() + .setLabelColumn(mainConfig.labelCol) + .setCardinalityLimit(mainConfig.fillConfig.cardinalityLimit) + .setCardinalityCheckMode(mainConfig.fillConfig.cardinalityCheckMode) + .setCardinalityPrecision(mainConfig.fillConfig.cardinalityPrecision) + .setCardinalityType(mainConfig.fillConfig.cardinalityType) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + val dateFieldTransformer = new DateFieldTransformer() + .setLabelColumn(mainConfig.labelCol) + .setMode(mainConfig.dateTimeConversionType) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + //TODO: Remove Date/time columns at this stage with drop transformer + new Pipeline() + .setStages( + Array( + zipRegisterTempTransformer, + mlFlowLoggingValidationStageTransformer, + cardinalityLimitColumnPrunerTransformer, + dateFieldTransformer + ) + ) + .fit(dataFrame) + } + + /** + * Apply string indexers, apply vector assembler, drop unnecessary columns + * @param 
dataFrame + * @param mainConfig + * @param ignoreCols + * @return + */ + private def applyStngIndxVectAssembler( + dataFrame: DataFrame, + mainConfig: MainConfig, + ignoreCols: Array[String], + verbose: Boolean + ): VectorizationOutput = { + val fields = SchemaUtils.extractTypes(dataFrame, mainConfig.labelCol) + val stringFields = fields.categoricalFields + .filterNot(ignoreCols.contains) + .filterNot(item => item.equals(mainConfig.labelCol)) + val vectorizableFields = + fields.numericFields.toArray.filterNot(ignoreCols.contains) + val dateFields = fields.dateFields.toArray.filterNot(ignoreCols.contains) + val timeFields = fields.timeFields.toArray.filterNot(ignoreCols.contains) + val booleanFields = + fields.booleanFields.toArray.filterNot(ignoreCols.contains) + + //Validate date and time fields have been removed and already featurized at this point + validateDateAndTimeFeatures(dateFields, timeFields) + + val stages = new ArrayBuffer[PipelineStage] + // Fill with Na + getAndAddStage(stages, fillNaStage(mainConfig)) + + // Label refactor + if (SchemaUtils.isLabelRefactorNeeded( + dataFrame.schema, + mainConfig.labelCol + )) { + getAndAddStage( + stages, + Some( + new StringIndexer( + PipelineEnums.LABEL_STRING_INDEXER_STAGE_NAME.value + Identifiable + .randomUID("strIdx") + ).setInputCol(mainConfig.labelCol) + .setOutputCol(mainConfig.labelCol + PipelineEnums.SI_SUFFIX.value) + ) + ) + if (!verbose) { + getAndAddStage( + stages, + dropColumns(Array(mainConfig.labelCol), mainConfig) + ) + getAndAddStage( + stages, + renameTransformerStage( + mainConfig.labelCol + PipelineEnums.SI_SUFFIX.value, + mainConfig.labelCol, + mainConfig + ) + ) + } + + // Register label refactor needed var for this pipeline context + // LabelRefactor needed + addToPipelineCacheInternal(mainConfig, refactorNeeded = true) + } else { + // Register label refactor needed var for this pipeline context + //Label refactor not required + addToPipelineCacheInternal(mainConfig, refactorNeeded = 
false) + } + stringFields.foreach(columnName => { + stages += new StringIndexer() + .setInputCol(columnName) + .setOutputCol(SchemaUtils.generateStringIndexedColumn(columnName)) + .setHandleInvalid("keep") + }) + stages += new DropColumnsTransformer() + .setInputCols(stringFields.toArray) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + val featureAssemblerInputCols: Array[String] = stringFields + .map(item => SchemaUtils.generateStringIndexedColumn(item)) + .toArray[String] ++ vectorizableFields + + VectorizationOutput( + new Pipeline().setStages(stages.toArray).fit(dataFrame), + featureAssemblerInputCols + ) + } + + private def addToPipelineCacheInternal(mainConfig: MainConfig, + refactorNeeded: Boolean): Unit = { + PipelineStateCache + .addToPipelineCache( + mainConfig.pipelineId, + PipelineVars.PIPELINE_LABEL_REFACTOR_NEEDED_KEY.key, + refactorNeeded + ) + } + + private def vectorAssemblerStage( + mainConfig: MainConfig, + featureAssemblerInputCols: Array[String] + ): Option[PipelineStage] = { + Some( + new VectorAssembler() + .setInputCols(featureAssemblerInputCols) + .setOutputCol(mainConfig.featuresCol) + .setHandleInvalid("keep") + ) + } + + private def validateDateAndTimeFeatures(dateFields: Array[String], + timeFields: Array[String]): Unit = { + throwFieldConversionException( + dateFields, + classOf[DateFeatureConversionException] + ) + throwFieldConversionException( + timeFields, + classOf[TimeFeatureConversionException] + ) + } + + private def throwFieldConversionException( + fields: Array[_ <: String], + clazz: Class[_ <: FeatureConversionException] + ): Unit = { + if (SchemaUtils.isNotEmpty(fields)) { + throw clazz.getConstructor(classOf[Array[String]]).newInstance(fields) + } + } + + private def fillNaStage(mainConfig: MainConfig): Option[PipelineStage] = { + val dataSanitizerTransformer = new DataSanitizerTransformer() + .setLabelColumn(mainConfig.labelCol) + .setFeatureCol(mainConfig.featuresCol) + 
.setModelSelectionDistinctThreshold( + mainConfig.fillConfig.modelSelectionDistinctThreshold + ) + .setNumericFillStat(mainConfig.fillConfig.numericFillStat) + .setCharacterFillStat(mainConfig.fillConfig.characterFillStat) + .setParallelism(mainConfig.geneticConfig.parallelism) + .setCategoricalNAFillMap(mainConfig.fillConfig.categoricalNAFillMap) + .setNumericNAFillMap( + mainConfig.fillConfig.numericNAFillMap.asInstanceOf[Map[String, Double]] + ) + .setFillMode(mainConfig.fillConfig.naFillMode) + .setFilterPrecision(mainConfig.fillConfig.filterPrecision) + .setNumericNABlanketFill(mainConfig.fillConfig.numericNABlanketFillValue) + .setCharacterNABlanketFill( + mainConfig.fillConfig.characterNABlanketFillValue + ) + .setNaFillFlag(mainConfig.naFillFlag) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + Some(dataSanitizerTransformer) + } + + private def varianceFilterStage( + mainConfig: MainConfig + ): Option[PipelineStage] = { + if (mainConfig.varianceFilterFlag) { + val varianceFilterTransformer = new VarianceFilterTransformer() + .setLabelColumn(mainConfig.labelCol) + .setFeatureCol(mainConfig.featuresCol) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + return Some(varianceFilterTransformer) + } + None + } + + private def outlierFilterStage( + mainConfig: MainConfig + ): Option[PipelineStage] = { + if (mainConfig.outlierFilterFlag) { + val outlierFilterTransformer = new OutlierFilterTransformer() + .setFilterBounds(mainConfig.outlierConfig.filterBounds) + .setLowerFilterNTile(mainConfig.outlierConfig.lowerFilterNTile) + .setUpperFilterNTile(mainConfig.outlierConfig.upperFilterNTile) + .setFilterPrecision(mainConfig.outlierConfig.filterPrecision) + .setContinuousDataThreshold( + mainConfig.outlierConfig.continuousDataThreshold + ) + .setParallelism(mainConfig.geneticConfig.parallelism) + .setFieldsToIgnore(Array.empty) + .setLabelColumn(mainConfig.labelCol) + 
.setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + return Some(outlierFilterTransformer) + } + None + } + + private def covarianceFilteringStage( + mainConfig: MainConfig, + featureCols: Array[String] + ): Option[PipelineStage] = { + if (mainConfig.covarianceFilteringFlag) { + val covarianceFilterTransformer = new CovarianceFilterTransformer() + .setLabelColumn(mainConfig.labelCol) + .setCorrelationCutoffLow( + mainConfig.covarianceConfig.correlationCutoffLow + ) + .setCorrelationCutoffHigh( + mainConfig.covarianceConfig.correlationCutoffHigh + ) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + .setFeatureColumns(featureCols) + .setFeatureCol(mainConfig.featuresCol) + return Some(covarianceFilterTransformer) + } + None + } + + private def pearsonFilteringStage( + mainConfig: MainConfig, + featureCols: Array[String], + modelType: String + ): Option[PipelineStage] = { + if (mainConfig.pearsonFilteringFlag) { + val pearsonFilterTransformer = new PearsonFilterTransformer() + .setModelType(modelType) + .setLabelColumn(mainConfig.labelCol) + .setFeatureCol(mainConfig.featuresCol) + .setAutoFilterNTile(mainConfig.pearsonConfig.autoFilterNTile) + .setFilterDirection(mainConfig.pearsonConfig.filterDirection) + .setFilterManualValue(mainConfig.pearsonConfig.filterManualValue) + .setFilterMode(mainConfig.pearsonConfig.filterMode) + .setFilterStatistic(mainConfig.pearsonConfig.filterStatistic) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + .setFeatureColumns(featureCols) + return Some(pearsonFilterTransformer) + } + None + } + + private def stringIndexerStage( + mainConfig: MainConfig, + stringIndexInputs: Array[String] + ): Option[Array[PipelineStage]] = { + if (mainConfig.oneHotEncodeFlag) { + val buffer = new ArrayBuffer[PipelineStage]() + val indexers = Some(stringIndexInputs.map { x => + new StringIndexer() + .setInputCol(x) + .setOutputCol(x + 
PipelineEnums.SI_SUFFIX.value) + }) + getAndAddStages(buffer, indexers) + getAndAddStage(buffer, dropColumns(stringIndexInputs, mainConfig)) + return Some(buffer.toArray) + } + None + } + + private def oneHotEncodingStage( + mainConfig: MainConfig, + stngIndxCols: Array[String] + ): Option[PipelineStage] = { + if (mainConfig.oneHotEncodeFlag) { + return Some( + new OneHotEncoderEstimator() + .setInputCols(stngIndxCols) + .setOutputCols( + stngIndxCols + .map(item => SchemaUtils.generateOneHotEncodedColumn(item)) + ) + .setHandleInvalid("keep") + ) + } + None + } + + private def scalerStage( + mainConfig: MainConfig + ): Option[Array[PipelineStage]] = { + if (mainConfig.scalingFlag) { + val arrayBuffer = new ArrayBuffer[PipelineStage]() + val renamedFeatureCol = mainConfig.featuresCol + PipelineEnums.FEATURE_NAME_TEMP_SUFFIX.value + getAndAddStage( + arrayBuffer, + renameTransformerStage( + mainConfig.featuresCol, + renamedFeatureCol, + mainConfig + ) + ) + val scaler = Some( + new Scaler() + .setFeaturesCol(mainConfig.featuresCol) + .setScalerType(mainConfig.scalingConfig.scalerType) + .setScalerMin(mainConfig.scalingConfig.scalerMin) + .setScalerMax(mainConfig.scalingConfig.scalerMax) + .setStandardScalerMeanMode( + mainConfig.scalingConfig.standardScalerMeanFlag + ) + .setStandardScalerStdDevMode( + mainConfig.scalingConfig.standardScalerStdDevFlag + ) + .setPNorm(mainConfig.scalingConfig.pNorm) + .scaleFeaturesForPipeline() + ) + getAndAddStage(arrayBuffer, scaler) + getAndAddStage( + arrayBuffer, + dropColumns(Array(renamedFeatureCol), mainConfig) + ) + return Some(arrayBuffer.toArray) + } + None + } + + private def ksamplerStages( + mainConfig: MainConfig, + isFeatureEngineeringOnly: Boolean, + vectorizedColumns: Array[String] + ): Option[Array[_ <: PipelineStage]] = { + val ksampleConfigString = "kSample" + if (isFeatureEngineeringOnly && mainConfig.geneticConfig.trainSplitMethod == ksampleConfigString) { + throw new RuntimeException( + "Ksampler should be 
disabled when generating only a feature engineering pipeline." + ) + } + if (mainConfig.geneticConfig.trainSplitMethod == ksampleConfigString && !isFeatureEngineeringOnly) { + val arrayBuffer = new ArrayBuffer[PipelineStage]() + // Apply Vector Assembler again + getAndAddStage( + arrayBuffer, + dropColumns(Array(mainConfig.featuresCol), mainConfig) + ) + getAndAddStage( + arrayBuffer, + vectorAssemblerStage(mainConfig, vectorizedColumns) + ) + + // Ksampler stage + arrayBuffer += new SyntheticFeatureGenTransformer() + .setFeatureCol(mainConfig.featuresCol) + .setLabelColumn(mainConfig.labelCol) + .setSyntheticCol(mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLshHashTables(mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLshSeed(mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLshOutputCol(mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + 
.setNumericRatio(mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + + //Repartition after Ksampler stage + arrayBuffer += new RepartitionTransformer() + .setPartitionScaleFactor( + mainConfig.geneticConfig.kSampleConfig.outputDfRepartitionScaleFactor + ) + + // Register temp table stage for registering non-synthetic dataset later needed for the union with synthetic dataset + val nonSyntheticFeatureGenTmpTable = + Identifiable.randomUID("nonSyntheticFeatureGenTransformer_") + getAndAddStage( + arrayBuffer, + Some( + new RegisterTempTableTransformer() + .setTempTableName(nonSyntheticFeatureGenTmpTable) + .setStatement( + s"select * from __THIS__ where !${mainConfig.geneticConfig.kSampleConfig.syntheticCol}" + ) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + ) + ) + + // Get synthetic dataset + getAndAddStage( + arrayBuffer, + Some( + new SQLWrapperTransformer() + .setStatement( + s"select * from __THIS__ where ${mainConfig.geneticConfig.kSampleConfig.syntheticCol}" + ) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + ) + ) + // If scaling is used, make sure that the synthetic data has the same scaling. 
+ if (mainConfig.scalingFlag) { + getAndAddStages(arrayBuffer, scalerStage(mainConfig)) + } + arrayBuffer += new DatasetsUnionTransformer() + .setUnionDatasetName(nonSyntheticFeatureGenTmpTable) + .setPipelineId(mainConfig.pipelineId) + arrayBuffer += new DropTempTableTransformer() + .setTempTableName(nonSyntheticFeatureGenTmpTable) + .setPipelineId(mainConfig.pipelineId) + + return Some(arrayBuffer.toArray) + } + None + } + + private def renameTransformerStage( + oldLabelName: String, + newLabelName: String, + mainConfig: MainConfig + ): Option[PipelineStage] = { + Some( + new ColumnNameTransformer() + .setInputColumns(Array(oldLabelName)) + .setOutputColumns(Array(newLabelName)) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + ) + } + + private def dropColumns(colNames: Array[String], + mainConfig: MainConfig): Option[PipelineStage] = { + Some( + new DropColumnsTransformer() + .setInputCols(colNames) + .setDebugEnabled(mainConfig.pipelineDebugFlag) + .setPipelineId(mainConfig.pipelineId) + ) + } + + private def mergePipelineModels( + pipelineModels: ArrayBuffer[PipelineModel] + ): PipelineModel = { + SparkUtil.createPipelineModel( + "final_ml_pipeline_" + UUID.randomUUID().toString, + pipelineModels.flatMap(item => item.stages).toArray + ) + } + + private def getAndAddStage[T](stages: ArrayBuffer[PipelineStage], + value: Option[_ <: PipelineStage]): Unit = { + if (value.isDefined) { + stages += value.get + } + } + + private def getAndAddStages[T]( + stages: ArrayBuffer[PipelineStage], + value: Option[Array[_ <: PipelineStage]] + ): Unit = { + if (value.isDefined) { + stages ++= value.get + } + } +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/FeaturePipeline.scala b/src/main/scala/com/databricks/labs/automl/pipeline/FeaturePipeline.scala new file mode 100644 index 00000000..c9ec4858 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/FeaturePipeline.scala @@ -0,0 +1,294 @@ +package 
com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.{DataValidation, SchemaUtils} +import com.databricks.labs.automl.utils.data.CategoricalHandler +import org.apache.log4j.{Level, Logger} +import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.feature.VectorAssembler +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +class FeaturePipeline(data: DataFrame, isInferenceRun: Boolean = false) + extends DataValidation { + + private var _labelCol = "label" + private var _featureCol = "features" + private var _dateTimeConversionType = "split" + + private var _cardinalityType: String = "exact" + private var _cardinalityLimit: Int = 200 + private var _cardinalityPrecision: Double = 0.05 + private var _cardinalityCheckMode: String = "silent" + private var _cardinalityCheckSwitch: Boolean = true + + private val logger: Logger = Logger.getLogger(this.getClass) + + final private val _dataFieldNames = data.schema.fieldNames + + def setLabelCol(value: String): this.type = { + if (!isInferenceRun) + assert( + _dataFieldNames.contains(value), + s"Label field $value is not in DataFrame!" 
+ ) + _labelCol = value + this + } + + def setFeatureCol(value: String): this.type = { + _featureCol = value + this + } + + def setDateTimeConversionType(value: String): this.type = { + assert( + _allowableDateTimeConversions.contains(value), + s"Supplied conversion type '$value' is not in: " + + s"${invalidateSelection(value, _allowableDateTimeConversions)}" + ) + _dateTimeConversionType = value + this + } + + def setCardinalityType(value: String): this.type = { + assert( + _allowableCardinalilties.contains(value), + s"Supplied CardinalityType '$value' is not in: " + + s"${invalidateSelection(value, _allowableCardinalilties)}" + ) + _cardinalityType = value + this + } + def setCardinalityLimit(value: Int): this.type = { + require(value > 0, s"Cardinality limit must be greater than 0") + _cardinalityLimit = value + this + } + + def setCardinalityPrecision(value: Double): this.type = { + require(value >= 0.0, s"Precision must be greater than or equal to 0.") + require(value <= 1.0, s"Precision must be less than or equal to 1.") + _cardinalityPrecision = value + this + } + + def setCardinalityCheckMode(value: String): this.type = { + assert( + _allowableCategoricalFilterModes.contains(value), + s"${invalidateSelection(value, _allowableCategoricalFilterModes)}" + ) + _cardinalityCheckMode = value + this + } + + def setCardinalityCheck(value: Boolean): this.type = { + _cardinalityCheckSwitch = value + this + } + + def getLabelCol: String = _labelCol + + def getFeatureCol: String = _featureCol + + def getDateTimeConversionType: String = _dateTimeConversionType + + def getCardinalityType: String = _cardinalityType + def getCardinalityLimit: Int = _cardinalityLimit + def getCardinalityPrecision: Double = _cardinalityPrecision + def getCardinalityCheckMode: String = _cardinalityCheckMode + def getCardinalitySwitchSetting: Boolean = _cardinalityCheckSwitch + + /** + * Public method for creating a feature vector. + * Tasks that are covered: + * 1. 
Checking types and ensuring that the label column specified in the config is present in the DataFrame + * 2. Separating numeric types from categorical types + * 3. Perform validation on categorical types for cardinality checks. + * 4. String Index available fields + * 5. Convert DateTime fields to numeric types + * 6. Assemble the indexers into a vector assembler to create the feature vector + * @param ignoreList Fields in the DataFrame to ignore for processing + * @return The Dataframe with a feature vector. + */ + def makeFeaturePipeline( + ignoreList: Array[String] + ): (DataFrame, Array[String], Array[String]) = { + + val dfSchema = data.schema + if (!isInferenceRun) + assert( + dfSchema.fieldNames.contains(_labelCol), + s"Dataframe does not contain label column named: ${_labelCol}" + ) + + // Extract all of the field types + val fields = SchemaUtils.extractTypes(data, _labelCol, ignoreList) + + val fieldsToConvertExclusionsSet = + fields.categoricalFields.filterNot(ignoreList.contains) + + val validatedStringFields = + validateCardinality(data, fieldsToConvertExclusionsSet) + + // Support exclusions of fields + val excludedFieldsReady = + fields.numericFields.filterNot(ignoreList.contains) + + val excludedFieldsToConvert = fields.categoricalFields + .filterNot(x => ignoreList.contains(x)) + .filterNot(x => validatedStringFields.invalidFields.contains(x)) + + // Restrict the fields based on the configured cardinality limits. + // Depending on settings: + // Silent mode - will silently remove the fields that are above the cardinality limit + // Warn mode - an exception will be thrown if the cardinality is too high. 
+ + val cardinalityValidatedConversionFields = if (_cardinalityCheckSwitch) { + if (excludedFieldsToConvert.nonEmpty) { + new CategoricalHandler(data, _cardinalityCheckMode) + .setCardinalityType(_cardinalityType) + .setPrecision(_cardinalityPrecision) + .validateCategoricalFields(excludedFieldsToConvert, _cardinalityLimit) + .toList + } else excludedFieldsToConvert + } else excludedFieldsToConvert + + val excludedDateFields = fields.dateFields.filterNot(ignoreList.contains) + + val excludedTimeFields = fields.timeFields.filterNot(ignoreList.contains) + + // Modify the Dataframe for datetime / date types + val (dateTimeModData, dateTimeFields) = convertDateAndTime( + data, + excludedDateFields, + excludedTimeFields, + _dateTimeConversionType + ) + + // Concatenate the numeric field listing with the new numeric converted datetime fields + val mergedFields = excludedFieldsReady ++ dateTimeFields + + val (indexers, assembledColumns, assembler) = + generateAssembly( + mergedFields, + cardinalityValidatedConversionFields, + _featureCol + ) + + val createPipe = new Pipeline() + .setStages(indexers :+ assembler) + + val fieldsToInclude = if (!isInferenceRun) { + assembledColumns ++ Array(_featureCol, _labelCol) + } else { + assembledColumns ++ Array(_featureCol) ++ ignoreList + } + + //DEBUG + logger.log( + Level.DEBUG, + s" MAKE FEATURE PIPELINE FIELDS TO INCLUDE: ${fieldsToInclude.mkString(", ")}" + ) + + val transformedData = createPipe + .fit(dateTimeModData) + .transform(dateTimeModData) + .select(fieldsToInclude ++ ignoreList map col: _*) + + val transformedExtract = if (fields.categoricalFields.contains(_labelCol)) { + transformedData + .drop(_labelCol) + .withColumnRenamed(s"${_labelCol}_si", _labelCol) + } else { + transformedData + } + + val assembledColumnsOutput = + if (fields.categoricalFields.contains(_labelCol)) { + assembledColumns.filterNot(x => x.contains(s"${_labelCol}_si")) + } else assembledColumns + + val fieldsToIncludeOutput = + if 
(fields.categoricalFields.contains(_labelCol)) { + fieldsToInclude.filterNot(x => x.contains(s"${_labelCol}_si")) + } else fieldsToInclude + + ( + transformedExtract, + assembledColumnsOutput, + fieldsToIncludeOutput.filterNot(_.contains(_featureCol)) + ) + + } + + def applyOneHotEncoding( + featureColumns: Array[String], + totalFields: Array[String] + ): (DataFrame, Array[String], Array[String]) = { + + // From the featureColumns collection, get the string indexed fields. + val stringIndexedFields = featureColumns + .filter(x => x.takeRight(3) == "_si") + .filterNot(x => x.contains(_labelCol)) + + // Get the fields that are not String Indexed. + val remainingFeatureFields = + featureColumns.filterNot(x => x.takeRight(3) == "_si") + + // Drop the feature field that has already been created. + val adjustedData = + if (data.schema.fieldNames.contains(_featureCol)) data.drop(_featureCol) + else data + + // One hot encode the StringIndexed fields, if present and generate the feature vector. 
+ val (outputData, featureFields) = if (stringIndexedFields.length > 0) { + + val (encoder, encodedColumns) = oneHotEncodeStrings( + stringIndexedFields.toList + ) + + val fullFeatureColumns = remainingFeatureFields ++ encodedColumns + + val assembler = new VectorAssembler() + .setInputCols(fullFeatureColumns) + .setOutputCol(_featureCol) + + val pipe = new Pipeline() + .setStages(Array(encoder) :+ assembler) + + val transformedData = pipe.fit(adjustedData).transform(adjustedData) + + (transformedData, fullFeatureColumns) + + } else { + + val assembler = new VectorAssembler() + .setInputCols(featureColumns) + .setOutputCol(_featureCol) + + val pipe = new Pipeline() + .setStages(Array(assembler)) + + val transformedData = pipe.fit(adjustedData).transform(adjustedData) + + (transformedData, featureColumns) + + } + + val fullFinalSchema = outputData.schema.fieldNames.diff(stringIndexedFields) + + val dataReturn = outputData.select(fullFinalSchema map col: _*) + + val dataSchema = fullFinalSchema.filterNot(_.contains(_featureCol)) + + //DEBUG + logger.log( + Level.DEBUG, + s" Post OneHotEncoding Fields: ${fullFinalSchema.mkString(", ")}" + ) + + (dataReturn, featureFields, dataSchema) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasAutoMlIdColumn.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasAutoMlIdColumn.scala new file mode 100644 index 00000000..02dd1c06 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasAutoMlIdColumn.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Param, Params} + +/** + * @author Jas Bali + * + */ +trait HasAutoMlIdColumn extends Params { + final val automlInternalId: Param[String] = + new Param[String](this, "automlInternalId", "unique identifier column internally generated by AutoML") + + def setAutomlInternalId(value: String): this.type = set(automlInternalId, value) + + def getAutomlInternalId: String = 
$(automlInternalId) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasDebug.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasDebug.scala new file mode 100644 index 00000000..0403a47b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasDebug.scala @@ -0,0 +1,110 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.params.MainConfig +import com.databricks.labs.automl.utils.{ + AutoMlPipelineMlFlowUtils, + PipelineStatus +} +import org.apache.log4j.Logger +import org.apache.spark.ml.param.{BooleanParam, Param, Params} +import org.apache.spark.sql.Dataset + +/** + * Base trait for setting/accessing debug flags. Meant to be extended by all pipeline stages, + * which inherit pipeline stage logging by default + * @author Jas Bali + */ +trait HasDebug extends Params { + + @transient private val logger: Logger = Logger.getLogger(this.getClass) + + final val isDebugEnabled: BooleanParam = + new BooleanParam(this, "isDebugEnabled", "Debug option flag") + + def setDebugEnabled(value: Boolean): this.type = set(isDebugEnabled, value) + + def getDebugEnabled: Boolean = $(isDebugEnabled) + + def logTransformation(inputDataset: Dataset[_], + outputDataset: Dataset[_], + stageExecutionTime: Long): Unit = { + if (getDebugEnabled) { + val stageExecTime = if (stageExecutionTime < 1000) { + s"$stageExecutionTime ms" + } else { + s"${stageExecutionTime.toDouble / 1000} seconds" + } + val pipelineId = paramValueAsString( + this + .extractParamMap() + .get(this.getParam("pipelineId")) + .get + ).asInstanceOf[String] + val mainConfig = PipelineStateCache + .getFromPipelineByIdAndKey(pipelineId, PipelineVars.MAIN_CONFIG.key) + .asInstanceOf[MainConfig] + //Log Dfs counts + val countLog = if (mainConfig.dataPrepCachingFlag) { + s"Input dataset count: ${inputDataset.count()} \n " + + s"Output dataset count: ${outputDataset.count()} \n " + } else { + "" + } + //TODO: Log Schema flag (required when schemas are 
large and need to be turned off from log) + val logStrng = s"\n \n" + + s"=== AutoML Pipeline Stage: ${this.getClass} log ==> \n" + + s"Stage Name: ${this.uid} \n" + + s"Total Stage Execution time: $stageExecTime \n" + + s"Stage Params: ${paramsAsString(this.params)} \n " + + s"$countLog" + + s"Input dataset schema: ${inputDataset.schema.treeString} \n " + + s"Output dataset schema: ${outputDataset.schema.treeString} " + "\n" + + s"=== End of ${this.getClass} Pipeline Stage log <==" + "\n" + // Keeping this INFO level, since debug level can easily pollute this important block of debug information + println(logStrng) + logger.info(logStrng) + //Log this stage to MLFlow with useful information + val pipelineStatus = try { + PipelineStateCache + .getFromPipelineByIdAndKey( + pipelineId, + PipelineVars.PIPELINE_STATUS.key + ) + .asInstanceOf[String] + } catch { + case ex: Exception => PipelineStatus.PIPELINE_FAILED.key + } + val isTrain = !pipelineStatus.equals( + PipelineStatus.PIPELINE_COMPLETED.key + ) && + !pipelineStatus.equals(PipelineStatus.PIPELINE_FAILED.key) + if (!inputDataset.sparkSession.sparkContext.isLocal && isTrain) { + AutoMlPipelineMlFlowUtils + .logTagsToMlFlow( + pipelineId, + Map(s"pipeline_stage_${this.getClass.getName}" -> logStrng) + ) + PipelineMlFlowProgressReporter.runningStage( + pipelineId, + this.getClass.getName + ) + } + } + } + + private def paramsAsString(params: Array[Param[_]]): String = { + params + .map { param => + s"\t${param.name}: ${paramValueAsString(this.extractParamMap().get(param).get)}" + } + .mkString("{\n", ",\n", "\n}") + } + + private def paramValueAsString(value: Any): Any = { + value match { + case v: Array[String] => + v.asInstanceOf[Array[String]].mkString(", ") + case _ => value + } + } +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasFeatureColumn.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasFeatureColumn.scala new file mode 100644 index 00000000..df02bc46 --- /dev/null +++ 
b/src/main/scala/com/databricks/labs/automl/pipeline/HasFeatureColumn.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Param, Params} + +/** + * @author Jas Bali + * + */ +trait HasFeatureColumn extends Params { + + final val featureCol: Param[String] = new Param[String](this, "featureCol", "Feature Column Name") + + def setFeatureCol(value: String): this.type = set(featureCol, value) + + def getFeatureCol: String = $(featureCol) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasFeaturesColumns.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasFeaturesColumns.scala new file mode 100644 index 00000000..c6f9a132 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasFeaturesColumns.scala @@ -0,0 +1,17 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Params, StringArrayParam} + +/** + * @author Jas Bali + * + */ +trait HasFeaturesColumns extends Params { + + final val featureColumns: StringArrayParam = new StringArrayParam(this, "featureColumns", "List of feature column names") + + def setFeatureColumns(value: Array[String]): this.type = set(featureColumns, value) + + def getFeatureColumns: Array[String] = $(featureColumns) + +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasFieldsRemoved.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasFieldsRemoved.scala new file mode 100644 index 00000000..7143c5b4 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasFieldsRemoved.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Params, StringArrayParam} + +/** + * @author Jas Bali + * + */ +trait HasFieldsRemoved extends Params { + + final val fieldsRemoved: StringArrayParam = new StringArrayParam(this, "fieldsRemoved", "fieldsRemoved") + + def setFieldsRemoved(value: Array[String]): this.type = set(fieldsRemoved, value) + + def 
getFieldsRemoved: Array[String] = $(fieldsRemoved) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasFieldsToIgnore.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasFieldsToIgnore.scala new file mode 100644 index 00000000..2ea6cd20 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasFieldsToIgnore.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Params, StringArrayParam} + +/** + * @author Jas Bali + * + */ +trait HasFieldsToIgnore extends Params { + + final val fieldsToIgnore: StringArrayParam = new StringArrayParam(this, "fieldsToIgnore", "Columns To Ignore") + + def setFieldsToIgnore(value: Array[String]): this.type = set(fieldsToIgnore, value) + + def getFieldsToIgnore: Array[String] = $(fieldsToIgnore) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasInteractionColumns.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasInteractionColumns.scala new file mode 100644 index 00000000..c1bdba9b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasInteractionColumns.scala @@ -0,0 +1,33 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Params, StringArrayParam} + +/** + * Trait for defining whether interaction columns have been set for the application of Feature Interactions + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ +trait HasInteractionColumns extends Params { + + final val leftColumns: StringArrayParam = new StringArrayParam( + this, + "leftColumns", + "Left side columns for interaction" + ) + final val rightColumns: StringArrayParam = new StringArrayParam( + this, + "rightColumns", + "Right side columns for interaction" + ) + + def setLeftColumns(value: Array[String]): this.type = set(leftColumns, value) + def setRightColumns(value: Array[String]): this.type = + set(rightColumns, value) + + def getLeftColumns: Array[String] = $(leftColumns) + def 
getRightColumns: Array[String] = $(rightColumns) + + def getInteractionColumns: Array[(String, String)] = + ($(leftColumns) zip $(rightColumns)) + +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasLabelColumn.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasLabelColumn.scala new file mode 100644 index 00000000..c67470cd --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasLabelColumn.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Param, Params} + +/** + * @author Jas Bali + * + */ +trait HasLabelColumn extends Params{ + + final val labelColumn: Param[String] = new Param[String](this, "labelColumn", "Label Column Name") + + def setLabelColumn(value: String): this.type = set(labelColumn, value) + + def getLabelColumn: String = $(labelColumn) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasPipelineId.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasPipelineId.scala new file mode 100644 index 00000000..3a93f887 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasPipelineId.scala @@ -0,0 +1,18 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{Param, Params} + +/** + * @author Jas Bali + * @since 0.6.1 + * trait for decorating all pipeline stages with pipeline ID. 
+ * Helpful when troubleshooting logs with a given pipeline ID (eg, fetched from MLflow) + */ +trait HasPipelineId extends Params { + + final val pipelineId: Param[String] = new Param[String](this, "pipelineId", "UUID for AutoML pipeline") + + def setPipelineId(value: String): this.type = set(pipelineId, value) + + def getPipelineId: String = $(pipelineId) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/HasTransformCalculated.scala b/src/main/scala/com/databricks/labs/automl/pipeline/HasTransformCalculated.scala new file mode 100644 index 00000000..dfd37a9c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/HasTransformCalculated.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + +import org.apache.spark.ml.param.{BooleanParam, Params} + +/** + * @author Jas Bali + * + */ +trait HasTransformCalculated extends Params { + + final val transformCalculated: BooleanParam = new BooleanParam(this, "transformCalculated", "Flag to help for predict pipeline to avoid calculating estimators again") + + def setTransformCalculated(value: Boolean): this.type = set(transformCalculated, value) + + def getTransformCalculated: Boolean = $(transformCalculated) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/InteractionTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/InteractionTransformer.scala new file mode 100644 index 00000000..b57e1a76 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/InteractionTransformer.scala @@ -0,0 +1,64 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.spark.ml.param.ParamMap +import org.apache.spark.ml.util.{ + DefaultParamsReadable, + DefaultParamsWritable, + Identifiable +} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.{DoubleType, StructField, StructType} +import org.apache.spark.sql.{DataFrame, Dataset} + 
+/** + * Transformer for creating interacted feature fields based on FeatureInteraction module + * @param uid Stage Identifier + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ +class InteractionTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasInteractionColumns { + + def this() = { + this(Identifiable.randomUID("InteractionTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + def setLeftCols(value: Array[String]): this.type = set(leftColumns, value) + def setRightCols(value: Array[String]): this.type = set(rightColumns, value) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + var data = dataset + transformSchemaInternal(dataset.schema) + if (SchemaUtils.isNotEmpty(getInteractionColumns)) { + getInteractionColumns.foreach { x => + data = data.withColumn(s"i_${x._1}_${x._2}", col(x._1) * col(x._2)) + } + } + data.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + + if (SchemaUtils.isNotEmpty(getInteractionColumns)) { + val newFields = getInteractionColumns.map(x => { + StructField(s"i_${x._1}_${x._2}", DoubleType) + }) + StructType(schema.fields ++ newFields) + } else schema + + } + + override def copy(extra: ParamMap): InteractionTransformer = + defaultCopy(extra) + +} + +object InteractionTransformer + extends DefaultParamsReadable[InteractionTransformer] { + override def load(path: String): InteractionTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/IsTrainingStage.scala b/src/main/scala/com/databricks/labs/automl/pipeline/IsTrainingStage.scala new file mode 100644 index 00000000..32c4475e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/IsTrainingStage.scala @@ -0,0 +1,14 @@ +package com.databricks.labs.automl.pipeline + +/** + * @author Jas Bali + * Marker interface to signify a pipeline stage is only 
applicable for the training phase + * This can be used to extract predict-only stages in a pipeline for debugging or logging purposes. + * It is not intended to use this trait for any other purposes. To add default behavior to transformer, + * look at [[AbstractTransformer]] + * + * An example [[IsTrainingStage]] is [[SyntheticFeatureGenTransformer]] which adds synthetic rows for + * post model optimization process and is not required for inference + */ +trait IsTrainingStage { +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/MlFlowLoggingValidationStageTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/MlFlowLoggingValidationStageTransformer.scala new file mode 100644 index 00000000..27752a03 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/MlFlowLoggingValidationStageTransformer.scala @@ -0,0 +1,85 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.exceptions.MlFlowValidationException +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, WorkspaceDirectoryValidation} +import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * A [[WithNoopsStage]] transformer stage that does MlFlow validation before + * continuing with the rest of the stages. 
This should be added in the earliest stages of a + * pipeline + * @param uid + */ +class MlFlowLoggingValidationStageTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with WithNoopsStage { + + def this() = { + this(Identifiable.randomUID("MlFlowLoggingValidationStageTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + final val mlFlowLoggingFlag: BooleanParam = new BooleanParam(this, "mlFlowLoggingFlag", "whether to log to MlFlow or not") + + final val mlFlowTrackingURI: Param[String] = new Param[String](this, "mlFlowTrackingURI", "MlFlow Tracking URI") + + final val mlFlowAPIToken: Param[String] = new Param[String](this, "mlFlowAPIToken", "MlFlow API token") + + final val mlFlowExperimentName: Param[String] = new Param[String](this, "mlFlowExperimentName", "MlFlow Experiment name") + + def setMlFlowLoggingFlag(value: Boolean): this.type = set(mlFlowLoggingFlag, value) + + def getMlFlowLoggingFlag: Boolean = $(mlFlowLoggingFlag) + + def setMlFlowTrackingURI(value: String): this.type = set(mlFlowTrackingURI, value) + + def getMlFlowTrackingURI: String = $(mlFlowTrackingURI) + + def setMlFlowAPIToken(value: String): this.type = set(mlFlowAPIToken, value) + + def getMlFlowAPIToken: String = $(mlFlowAPIToken) + + def setMlFlowExperimentName(value: String): this.type = set(mlFlowExperimentName, value) + + def getMlFlowExperimentName: String = $(mlFlowExperimentName) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + if (getMlFlowLoggingFlag) { + try { + val dirValidate = WorkspaceDirectoryValidation( + getMlFlowTrackingURI, + getMlFlowAPIToken, + getMlFlowExperimentName + ) + if (dirValidate) { + val rgx = "(\\/\\w+$)".r + val dir = + rgx.replaceFirstIn(getMlFlowExperimentName, "") + println( + s"MLFlow Logging Directory confirmed accessible at: " + + s"$dir" + ) + } + } catch { + case exception: Exception => throw 
MlFlowValidationException("Failed to validate MLflow configuration", exception) + } + } + dataset.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): MlFlowLoggingValidationStageTransformer = defaultCopy(extra) +} + +object MlFlowLoggingValidationStageTransformer extends DefaultParamsReadable[MlFlowLoggingValidationStageTransformer] { + override def load(path: String): MlFlowLoggingValidationStageTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/OutlierFilterTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/OutlierFilterTransformer.scala new file mode 100644 index 00000000..42ced64b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/OutlierFilterTransformer.scala @@ -0,0 +1,116 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.sanitize.OutlierFiltering +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.log4j.{Level, Logger} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * This transformer wraps [[OutlierFiltering]] in a transform method + * @param uid + */ +class OutlierFilterTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFieldsToIgnore + with IsTrainingStage { + + @transient lazy private val logger: Logger = Logger.getLogger(this.getClass) + + def this() = { + this(Identifiable.randomUID("OutlierFilterTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setFieldsToIgnore(Array.empty) + setDebugEnabled(false) + } + + final val filterBounds: Param[String] = new Param[String](this, "filterBounds", 
"Filter Bounds") + + final val lowerFilterNTile: DoubleParam = new DoubleParam(this, "lowerFilterNTile", "lowerFilterNTile") + + final val upperFilterNTile: DoubleParam = new DoubleParam(this, "upperFilterNTile", "upperFilterNTile") + + final val filterPrecision: DoubleParam = new DoubleParam(this, "filterPrecision", "filterPrecision") + + final val parallelism: IntParam = new IntParam(this, "parallelism", "parallelism") + + final val continuousDataThreshold: IntParam = new IntParam(this, "continuousDataThreshold", "continuousDataThreshold") + + def setFilterBounds(value: String): this.type = set(filterBounds, value) + + def getFilterBounds: String = $(filterBounds) + + def setLowerFilterNTile(value: Double): this.type = set(lowerFilterNTile, value) + + def getLowerFilterNTile: Double = $(lowerFilterNTile) + + def setUpperFilterNTile(value: Double): this.type = set(upperFilterNTile, value) + + def getUpperFilterNTile: Double = $(upperFilterNTile) + + def setFilterPrecision(value: Double): this.type = set(filterPrecision, value) + + def getFilterPrecision: Double = $(filterPrecision) + + def setParallelism(value: Int): this.type = set(parallelism, value) + + def getParallelism: Int = $(parallelism) + + def setContinuousDataThreshold(value: Int): this.type = set(continuousDataThreshold, value) + + def getContinuousDataThreshold: Int = $(continuousDataThreshold) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + if(dataset.columns.contains(getLabelColumn)) { + // Output has no feature vector + val outlierFiltering = new OutlierFiltering(dataset.toDF()) + .setLabelCol(getLabelColumn) + .setFilterBounds(getFilterBounds) + .setLowerFilterNTile(getLowerFilterNTile) + .setUpperFilterNTile(getUpperFilterNTile) + .setFilterPrecision(getFilterPrecision) + .setParallelism(getParallelism) + .setContinuousDataThreshold(getContinuousDataThreshold) + + val (outlierCleanedData, outlierRemovedData, filteringMap) = + 
outlierFiltering.filterContinuousOutliers(Array(getAutomlInternalId) ++ getFieldsToIgnore, getFieldsToIgnore) + val outlierRemovalInfo = + s"Removed outlier data. Total rows removed = ${outlierRemovedData.count()}" + logger.log(Level.INFO, outlierRemovalInfo) + println(outlierRemovalInfo) + //setInferenceOutlierMap(filteringMap) + // Validate mutated Dfs + validateMutatedf(dataset.toDF(), outlierCleanedData, outlierRemovedData) + outlierCleanedData + } else { + dataset.toDF() + } + } + + private def validateMutatedf(originalDf: DataFrame, + mutatedDf: DataFrame, + outlierDf: DataFrame): Unit = { + + val origCount = originalDf.count() + val mutatedCount = mutatedDf.count() + val outlierCount = outlierDf.count() + val warningMessage = s"Original DataFrame count ($origCount) does not match the sum of outlier filter data ($mutatedCount) and removed data ($outlierCount)" + if (origCount != mutatedCount + outlierCount) { println(warningMessage); logger.log(Level.WARN, warningMessage) } + + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): OutlierFilterTransformer = defaultCopy(extra) +} + +object OutlierFilterTransformer extends DefaultParamsReadable[OutlierFilterTransformer] { + override def load(path: String): OutlierFilterTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/PearsonFilterTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/PearsonFilterTransformer.scala new file mode 100644 index 00000000..fde25a10 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/PearsonFilterTransformer.scala @@ -0,0 +1,139 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.sanitize.PearsonFiltering +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap} +import 
org.apache.spark.ml.util.{ + DefaultParamsReadable, + DefaultParamsWritable, + Identifiable +} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * This transformer wraps [[PearsonFiltering]] in a transform method + * @param uid + */ +class PearsonFilterTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFeatureColumn + with HasFeaturesColumns + with HasFieldsRemoved + with HasTransformCalculated { + + private val logger: Logger = Logger.getLogger(this.getClass) + + def this() = { + this(Identifiable.randomUID("PearsonFilterTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setFieldsRemoved(Array.empty) + setTransformCalculated(false) + setDebugEnabled(false) + } + + final val modelType: Param[String] = + new Param[String](this, "modelType", "modelType") + + final val filterStatistic: Param[String] = + new Param[String](this, "filterStatistic", "filterStatistic") + + final val filterDirection: Param[String] = + new Param[String](this, "filterDirection", "filterDirection") + + final val filterManualValue: DoubleParam = + new DoubleParam(this, "filterManualValue", "filterManualValue") + + final val filterMode: Param[String] = + new Param[String](this, "filterMode", "filterMode") + + final val autoFilterNTile: DoubleParam = + new DoubleParam(this, "autoFilterNTile", "autoFilterNTile") + + def setModelType(value: String): this.type = set(modelType, value) + + def setFilterStatistic(value: String): this.type = set(filterStatistic, value) + + def getFilterStatistic: String = $(filterStatistic) + + def setFilterDirection(value: String): this.type = set(filterDirection, value) + + def getFilterDirection: String = $(filterDirection) + + def setFilterManualValue(value: Double): this.type = + set(filterManualValue, value) + + def getModelType: String = $(modelType) + + def 
getFilterManualValue: Double = $(filterManualValue) + + def setFilterMode(value: String): this.type = set(filterMode, value) + + def getFilterMode: String = $(filterMode) + + def setAutoFilterNTile(value: Double): this.type = set(autoFilterNTile, value) + + def getAutoFilterNTile: Double = $(autoFilterNTile) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + if (dataset.columns.contains(getLabelColumn)) { + if (SchemaUtils.isNotEmpty(getFeatureColumns)) { + setFeatureColumns( + dataset.columns.filterNot( + item => Array(getLabelColumn, getAutomlInternalId).contains(item) + ) + ) + } + if (!getTransformCalculated) { + // Requires a DataFrame that has a feature vector field. Output has no feature vector. + val pearsonFiltering = + new PearsonFiltering(dataset.toDF(), getFeatureColumns, getModelType) + .setLabelCol(getLabelColumn) + .setFeaturesCol(getFeatureCol) + .setFilterStatistic(getFilterStatistic) + .setFilterDirection(getFilterDirection) + .setFilterManualValue(getFilterManualValue) + .setFilterMode(getFilterMode) + .setAutoFilterNTile(getAutoFilterNTile) + .filterFields(Array(getAutomlInternalId)) + + val removedFields = getFeatureColumns + .filterNot( + field => pearsonFiltering.schema.fieldNames.contains(field) + ) + + val pearsonFilterLog = + s"Pearson Filtering completed.\n Removed fields: ${removedFields.mkString(", ")}" + logger.log(Level.INFO, pearsonFilterLog) + println(pearsonFilterLog) + + setFieldsRemoved(removedFields) + setTransformCalculated(true) + return pearsonFiltering + } + } + dataset.drop(getFieldsRemoved: _*) + } + + override def transformSchemaInternal(schema: StructType): StructType = { + if (schema.fieldNames.contains(getLabelColumn)) { + StructType( + schema.fields.filterNot(field => getFieldsRemoved.contains(field.name)) + ) + } else { + schema + } + } + + override def copy(extra: ParamMap): PearsonFilterTransformer = + defaultCopy(extra) +} + +object PearsonFilterTransformer + extends 
DefaultParamsReadable[PearsonFilterTransformer] { + override def load(path: String): PearsonFilterTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/PipelineEnums.scala b/src/main/scala/com/databricks/labs/automl/pipeline/PipelineEnums.scala new file mode 100644 index 00000000..8ff310f9 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/PipelineEnums.scala @@ -0,0 +1,30 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.params.{MLFlowConfig, MainConfig} + +object PipelineEnums extends Enumeration { + + type PipelineEnums = PipelineConstants + + val OHE_SUFFIX = PipelineConstants("_oh") + val SI_SUFFIX = PipelineConstants("_si") + val FEATURE_NAME_TEMP_SUFFIX = PipelineConstants("_r") + + val LABEL_STRING_INDEXER_STAGE_NAME = PipelineConstants("LabelStringIndexer") + + case class PipelineConstants(value: String) extends Val +} + + +object PipelineVars extends Enumeration { + + type PipelineVars = PipelineVarsPair + + val PIPELINE_LABEL_REFACTOR_NEEDED_KEY = PipelineVarsPair("labelRefactorNeeded", classOf[Boolean]) + val MLFLOW_RUN_ID = PipelineVarsPair("MlFlowRunId", classOf[String]) + val MAIN_CONFIG = PipelineVarsPair("MainConfig", classOf[MainConfig]) + val PIPELINE_STATUS = PipelineVarsPair("PipelineStatus", classOf[String]) + val KSAMPLER_STAGES = PipelineVarsPair("KSamplerStages", classOf[String]) + + case class PipelineVarsPair(key: String, keyType: Class[_]) extends Val +} \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/PipelineMlFlowProgressReporter.scala b/src/main/scala/com/databricks/labs/automl/pipeline/PipelineMlFlowProgressReporter.scala new file mode 100644 index 00000000..2a528300 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/PipelineMlFlowProgressReporter.scala @@ -0,0 +1,61 @@ +package com.databricks.labs.automl.pipeline + +import 
com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, PipelineMlFlowTagKeys, PipelineStatus} + +/** + * @author Jas Bali + * @since 0.6.1 + * Utility for reporting pipeline progress to MLflow + */ +object PipelineMlFlowProgressReporter { + + private def isMlFlowOn(pipelineId: String): Boolean = { + AutoMlPipelineMlFlowUtils.getMainConfigByPipelineId(pipelineId).mainConfig.mlFlowLoggingFlag + } + + private def addProgressToPipelineCache(pipelineId: String, progress: String): Unit = { + PipelineStateCache + .addToPipelineCache( + pipelineId, + PipelineVars.PIPELINE_STATUS.key, progress) + } + + private def addProgressToMLflowRun(pipelineId: String, message: String): Unit = { + if(isMlFlowOn(pipelineId)) { + AutoMlPipelineMlFlowUtils + .logTagsToMlFlow( + pipelineId, + Map(s"${PipelineMlFlowTagKeys.PIPELINE_STATUS}" -> message + )) + } + } + + def starting(pipelineId: String): Unit = { + addProgressToPipelineCache(pipelineId, PipelineStatus.PIPELINE_STARTED.key) + addProgressToMLflowRun( + pipelineId, + s"${PipelineStatus.PIPELINE_STARTED} (Building Pipeline from a given configuration)") + } + + def runningStage(pipelineId: String, stage: String): Unit = { + addProgressToPipelineCache(pipelineId, PipelineStatus.PIPELINE_RUNNING.key) + addProgressToMLflowRun( + pipelineId, + s"${PipelineStatus.PIPELINE_RUNNING} Stage: $stage") + } + + def completed(pipelineId: String, totalStagesRan: Int): Unit = { + addProgressToPipelineCache(pipelineId, PipelineStatus.PIPELINE_COMPLETED.key) + addProgressToMLflowRun( + pipelineId, + s"${PipelineStatus.PIPELINE_COMPLETED}. 
Total Stages Executed: $totalStagesRan") + } + + def failed(pipelineId: String, message: String): Unit = { + addProgressToPipelineCache(pipelineId, PipelineStatus.PIPELINE_FAILED.key) + addProgressToMLflowRun( + pipelineId, + s"${PipelineStatus.PIPELINE_FAILED} with message: $message (Search for pipeline ID $pipelineId in the cluster logs to find more details)") + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/PipelineStateCache.scala b/src/main/scala/com/databricks/labs/automl/pipeline/PipelineStateCache.scala new file mode 100644 index 00000000..6a331f90 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/PipelineStateCache.scala @@ -0,0 +1,29 @@ +package com.databricks.labs.automl.pipeline +import java.util.UUID + +import scala.collection.mutable + +/** + * @author Jas Bali + * A state cache for Pipeline context to maintain any internal state of a pipeline generation. + * Handy when there is a need for dynamic runtime exchange of information required between Pipeline stages and/or + * Pipeline context + */ +object PipelineStateCache { + + lazy private val pipelineStateCache = mutable.WeakHashMap[String, mutable.Map[String, Any]]() + + def addToPipelineCache(pipelineId: String, key: String, value: Any): Unit = { + if(!pipelineStateCache.contains(pipelineId)) { + pipelineStateCache += pipelineId -> mutable.Map.empty + } + pipelineStateCache += pipelineId -> (pipelineStateCache(pipelineId) += (key -> value)) + } + + def getFromPipelineByIdAndKey(pipelineId: String, key: String): Any = { + pipelineStateCache(pipelineId)(key) + } + + def generatePipelineId(): String = UUID.randomUUID().toString + +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/RegisterTempTableTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/RegisterTempTableTransformer.scala new file mode 100644 index 00000000..5a9d659f --- /dev/null +++ 
b/src/main/scala/com/databricks/labs/automl/pipeline/RegisterTempTableTransformer.scala @@ -0,0 +1,58 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.types.StructType + +/** + * @author Jas Bali + * A [[WithNoopsStage]] transformer stage that is helpful when a following pipeline stage needs access to more + * than a single dataset. This is to bypass pipeline semantics of only being able to pass a single dataset. + * Useful for [[SyntheticFeatureGenTransformer]] where original rows need to be appended with synthetic rows from + * [[com.databricks.labs.automl.feature.KSampling]] + * @param uid + */ +class RegisterTempTableTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable { + final val tempTableName = new Param[String](this, "tempTableName", "tempTableName") + final val statement = new Param[String](this, "statement", "statement") + + def setTempTableName(value: String): this.type = set(tempTableName, value) + + def getTempTableName: String = $(tempTableName) + + def setStatement(value: String): this.type = set(statement, value) + + def getStatement: String = $(statement) + + def this() = { + this(Identifiable.randomUID("RegisterTempTableTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + val tmpTableName = Identifiable.randomUID("InternalRegisterTempTableTransformer_") + dataset.createOrReplaceTempView(tmpTableName) + dataset + .sqlContext + .sql(getStatement.replaceAll("__THIS__", tmpTableName)) + .createOrReplaceTempView(getTempTableName) + dataset.sqlContext.dropTempTable(tmpTableName) + 
dataset.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): RegisterTempTableTransformer = defaultCopy(extra) + +} + +object RegisterTempTableTransformer extends DefaultParamsReadable[RegisterTempTableTransformer] { + override def load(path: String): RegisterTempTableTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/RepartitionTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/RepartitionTransformer.scala new file mode 100644 index 00000000..90403209 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/RepartitionTransformer.scala @@ -0,0 +1,52 @@ +package com.databricks.labs.automl.pipeline +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param.{IntParam, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * A [[WithNoopsStage]] transformer stage that is helpful to repartition a + * DataFrame coming out of any pipeline stage + * @author Jas Bali + * @param uid + */ +class RepartitionTransformer(override val uid: String) + extends AbstractTransformer + with WithNoopsStage + with DefaultParamsWritable + with HasDebug { + + def this() = { + this(Identifiable.randomUID("RepartitionTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setPartitionScaleFactor(1) + setDebugEnabled(false) + } + + final val partitionScaleFactor: IntParam = new IntParam(this, "partitionScaleFactor", "Scale factor of repartition (multiple of executors)") + + def setPartitionScaleFactor(value: Int): this.type = set(partitionScaleFactor, value) + + def getPartitionScaleFactor: Int = $(partitionScaleFactor) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + val 
executors: Int = try { + dataset.sparkSession.conf.get("spark.databricks.clusterUsageTags.clusterMaxWorkers").toInt + } catch { + case ex: Exception => dataset.rdd.getNumPartitions + } + dataset.repartition(executors * getPartitionScaleFactor).toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + assert(getPartitionScaleFactor > 0, "Repartition scale factor needs to be greater than 0. Default is 1.") + schema + } + + override def copy(extra: ParamMap): RepartitionTransformer = defaultCopy(extra) +} + +object RepartitionTransformer extends DefaultParamsReadable[RepartitionTransformer] { + override def load(path: String): RepartitionTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/RoundUpDoubleTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/RoundUpDoubleTransformer.scala new file mode 100644 index 00000000..3a5811e9 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/RoundUpDoubleTransformer.scala @@ -0,0 +1,48 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param.shared.HasInputCols +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions._ + +/** + * @author Jas Bali + * This transformer rounds up input columns of type Double to Whole Double. 
+ */ +class RoundUpDoubleTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasTransformCalculated + with HasInputCols { + + def setInputCols(value: Array[String]): this.type = set(inputCols, value) + + def this() = { + this(Identifiable.randomUID("RoundUpDoubleTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setTransformCalculated(false) + setDebugEnabled(false) + } + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + transformSchemaInternal(dataset.schema) + var tmpDf = dataset + getInputCols.foreach(item => + tmpDf = tmpDf.withColumn(item, round(col(item))) + ) + tmpDf.toDF() + } + + override def transformSchemaInternal(schema: StructType): StructType = { + schema + } + + override def copy(extra: ParamMap): RoundUpDoubleTransformer = defaultCopy(extra) +} + +object RoundUpDoubleTransformer extends DefaultParamsReadable[RoundUpDoubleTransformer] { + override def load(path: String): RoundUpDoubleTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/SQLWrapperTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/SQLWrapperTransformer.scala new file mode 100644 index 00000000..ae4a49af --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/SQLWrapperTransformer.scala @@ -0,0 +1,45 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.feature.SQLTransformer +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.types.StructType + +/** + * @author Jas Bali + * This transformer wraps [[SQLTransformer]] and is useful to add logging capability from [[AbstractTransformer]] + */ +class SQLWrapperTransformer(override val uid: 
String) + extends AbstractTransformer + with DefaultParamsWritable { + + final val statement = new Param[String](this, "statement", "statement") + + def setStatement(value: String): this.type = set(statement, value) + + def getStatement: String = $(statement) + + def this() = { + this(Identifiable.randomUID("SQLWrapperTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + transformSchemaInternal(dataset.schema) + new SQLTransformer().setStatement(getStatement).transform(dataset) + } + + override def transformSchemaInternal(schema: StructType): StructType = { + new SQLTransformer().setStatement(getStatement).transformSchema(schema) + } + + override def copy(extra: ParamMap): SQLWrapperTransformer = defaultCopy(extra) + +} + +object SQLWrapperTransformer extends DefaultParamsReadable[SQLWrapperTransformer] { + override def load(path: String): SQLWrapperTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/SyntheticFeatureGenTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/SyntheticFeatureGenTransformer.scala new file mode 100644 index 00000000..20a85f2c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/SyntheticFeatureGenTransformer.scala @@ -0,0 +1,108 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.feature.SyntheticFeatureGenerator +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param._ +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.types.{BooleanType, StructField, StructType} +import org.apache.spark.sql.{DataFrame, Dataset} + + +/** + * @author Jas Bali + * This transformer wraps [[SyntheticFeatureGenerator]] + */ +class SyntheticFeatureGenTransformer(override val uid: String) + extends 
AbstractTransformer + with HasLabelColumn + with HasFeatureColumn + with HasFieldsToIgnore + with DefaultParamsWritable + with IsTrainingStage { + + def this() = { + this(Identifiable.randomUID("SyntheticFeatureGenTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setFieldsToIgnore(Array(getAutomlInternalId)) + setDebugEnabled(false) + } + + final val syntheticCol: Param[String] = new Param[String](this, "syntheticCol", "syntheticCol") + final val kGroups: IntParam = new IntParam(this, "kGroups", "kGroups") + final val kMeansMaxIter: IntParam = new IntParam(this, "kMeansMaxIter", "kMeansMaxIter") + final val kMeansTolerance: DoubleParam = new DoubleParam(this, "kMeansTolerance", "kMeansTolerance") + final val kMeansDistanceMeasurement: Param[String] = new Param[String](this, "kMeansDistanceMeasurement", "kMeansDistanceMeasurement") + final val kMeansSeed: LongParam = new LongParam(this, "kMeansSeed", "kMeansSeed") + final val kMeansPredictionCol: Param[String] = new Param[String](this, "kMeansPredictionCol", "kMeansPredictionCol") + final val lshHashTables: IntParam = new IntParam(this, "lshHashTables", "lshHashTables") + final val lshSeed: LongParam = new LongParam(this, "lshSeed", "lshSeed") + final val lshOutputCol: Param[String] = new Param[String](this, "lshOutputCol", "lshOutputCol") + final val quorumCount: IntParam = new IntParam(this, "quorumCount", "quorumCount") + final val minimumVectorCountToMutate: IntParam = new IntParam(this, "minimumVectorCountToMutate", "minimumVectorCountToMutate") + final val vectorMutationMethod: Param[String] = new Param[String](this, "vectorMutationMethod", "vectorMutationMethod") + final val mutationMode: Param[String] = new Param[String](this, "mutationMode", "mutationMode") + final val mutationValue: DoubleParam = new DoubleParam(this, "mutationValue", "mutationValue") + final val labelBalanceMode: Param[String] = new Param[String](this, "labelBalanceMode", "labelBalanceMode") + final 
val cardinalityThreshold: IntParam = new IntParam(this, "cardinalityThreshold", "cardinalityThreshold") + final val numericRatio: DoubleParam = new DoubleParam(this, "numericRatio", "numericRatio") + final val numericTarget: IntParam = new IntParam(this, "numericTarget", "numericTarget") + + def setSyntheticCol(value: String): this.type = set(syntheticCol, value) + def setKGroups(value: Int): this.type = set(kGroups, value) + def setKMeansMaxIter(value: Int): this.type = set(kMeansMaxIter, value) + def setKMeansTolerance(value: Double): this.type = set(kMeansTolerance, value) + def setKMeansDistanceMeasurement(value: String): this.type = set(kMeansDistanceMeasurement, value) + def setKMeansSeed(value: Long): this.type = set(kMeansSeed, value) + def setKMeansPredictionCol(value: String): this.type = set(kMeansPredictionCol, value) + def setLshHashTables(value: Int): this.type = set(lshHashTables, value) + def setLshSeed(value: Long): this.type = set(lshSeed, value) + def setLshOutputCol(value: String): this.type = set(lshOutputCol, value) + def setQuorumCount(value: Int): this.type = set(quorumCount, value) + def setMinimumVectorCountToMutate(value: Int): this.type = set(minimumVectorCountToMutate, value) + def setVectorMutationMethod(value: String): this.type = set(vectorMutationMethod, value) + def setMutationMode(value: String): this.type = set(mutationMode, value) + def setMutationValue(value: Double): this.type = set(mutationValue, value) + def setLabelBalanceMode(value: String): this.type = set(labelBalanceMode, value) + def setCardinalityThreshold(value: Int): this.type = set(cardinalityThreshold, value) + def setNumericRatio(value: Double): this.type = set(numericRatio, value) + def setNumericTarget(value: Int): this.type = set(numericTarget, value) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + transformSchemaInternal(dataset.schema) + SyntheticFeatureGenerator( + dataset.toDF(), + getFeatureCol, + getLabelColumn, + 
$(syntheticCol), + getFieldsToIgnore, + $(kGroups), + $(kMeansMaxIter), + $(kMeansTolerance), + $(kMeansDistanceMeasurement), + $(kMeansSeed), + $(kMeansPredictionCol), + $(lshHashTables), + $(lshSeed), + $(lshOutputCol), + $(quorumCount), + $(minimumVectorCountToMutate), + $(vectorMutationMethod), + $(mutationMode), + $(mutationValue), + $(labelBalanceMode), + $(cardinalityThreshold), + $(numericRatio), + $(numericTarget) + ) + } + + override def transformSchemaInternal(schema: StructType): StructType = { + StructType(schema.fields ++ Array(StructField($(syntheticCol), BooleanType, nullable = true))) + } + + override def copy(extra: ParamMap): SyntheticFeatureGenTransformer = defaultCopy(extra) +} + +object SyntheticFeatureGenTransformer extends DefaultParamsReadable[SyntheticFeatureGenTransformer] { + override def load(path: String): SyntheticFeatureGenTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/VarianceFilterTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/VarianceFilterTransformer.scala new file mode 100644 index 00000000..39262175 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/VarianceFilterTransformer.scala @@ -0,0 +1,103 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, SchemaUtils} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable} +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Dataset} + +import scala.collection.mutable.ArrayBuffer + +/** + * @author Jas Bali + * Input: Vectorized feature columns + * Output: variance filtered DataFrame [[DataFrame]] + */ +class VarianceFilterTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFeatureColumn + with 
HasTransformCalculated { + + final val preserveColumns = new StringArrayParam(this, "preserveColumns", "Columns Preserved") + + final val removedColumns = new StringArrayParam(this, "removedColumns", "Columns Removed") + + def setPreserveColumns(value: Array[String]): this.type = set(preserveColumns, value) + + def getPreserveColumns: Array[String] = $(preserveColumns) + + def setRemovedColumns(value: Array[String]): this.type = set(removedColumns, value) + + def getRemovedColumns: Array[String] = $(removedColumns) + + def this() = { + this(Identifiable.randomUID("VarianceFilterTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setPreserveColumns(Array.empty) + setRemovedColumns(Array.empty) + setTransformCalculated(false) + setDebugEnabled(false) + } + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + // Get columns without label, feature column and automl_internal_id columns + val colsToIgnoreForVariance = if(dataset.columns.contains(getLabelColumn)) { + Array(getLabelColumn, getFeatureCol, getAutomlInternalId) + } else { + Array(getFeatureCol, getAutomlInternalId) + } + + if(!getTransformCalculated) { + val fields = dataset.columns.filterNot(field => colsToIgnoreForVariance.contains(field)) + + val dfParts = dataset.rdd.partitions.length.toDouble + val summaryParts = Math.max(32, Math.min(Math.ceil(dfParts / 20.0).toInt, 200)) + val stddevInformation = dataset.coalesce(summaryParts).summary("stddev") + .select(fields map col: _*).collect()(0).toSeq.toArray + + val stddevData = fields.zip(stddevInformation) + + val preserveColumns = new ArrayBuffer[String] + val removedColumns = new ArrayBuffer[String] + + stddevData.foreach { x => + if (x._2.toString.toDouble != 0.0){ + preserveColumns.append(x._1) + } else { + removedColumns.append(x._1) + } + } + + setPreserveColumns(preserveColumns.toArray) + setRemovedColumns(removedColumns.toArray) + setTransformCalculated(true) + + val finalFields = 
getPreserveColumns ++ colsToIgnoreForVariance + return dataset.select(finalFields map col:_*).toDF() + } else { + if(SchemaUtils.isNotEmpty(getPreserveColumns.toList)) { + val selectFields = getPreserveColumns ++ colsToIgnoreForVariance + return dataset.select(selectFields map col:_*).toDF() + } + } + dataset.drop(getRemovedColumns:_*) + } + + override def transformSchemaInternal(schema: StructType): StructType = { + if(schema.fieldNames.contains(getLabelColumn)) { + if (SchemaUtils.isNotEmpty(getRemovedColumns.toList)) { + return StructType(schema.fields.filterNot(field => getRemovedColumns.contains(field.name))) + } + } + schema + } + + override def copy(extra: ParamMap): VarianceFilterTransformer = defaultCopy(extra) +} + +object VarianceFilterTransformer extends DefaultParamsReadable[VarianceFilterTransformer] { + override def load(path: String): VarianceFilterTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/WithNoopsStage.scala b/src/main/scala/com/databricks/labs/automl/pipeline/WithNoopsStage.scala new file mode 100644 index 00000000..436c43dc --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/WithNoopsStage.scala @@ -0,0 +1,16 @@ +package com.databricks.labs.automl.pipeline + + +/** + * Marker interface to signify any transformer extending this trait will not alter + * an input dataset. This is only for the edge cases where it is required to do an external + * Ops before pipeline execution can continue. An example would be to do Mlflow params Validation + * before training continues. Helpful in scenarios where fail-fast feature is needed + * Example transformers are [[DropTempTableTransformer]], [[MlFlowLoggingValidationStageTransformer]]. 
+ * + * NOTE: Noops implies no changes to the input Dataset, but the implementation can result in a change + * to an external state + * @author Jas Bali + */ +trait WithNoopsStage { +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/ZipRegisterTempTransformer.scala b/src/main/scala/com/databricks/labs/automl/pipeline/ZipRegisterTempTransformer.scala new file mode 100644 index 00000000..a7322cbd --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/ZipRegisterTempTransformer.scala @@ -0,0 +1,100 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import org.apache.spark.ml.param.{Param, ParamMap} +import org.apache.spark.ml.util.{ + DefaultParamsReadable, + DefaultParamsWritable, + Identifiable +} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ + LongType, + StringType, + StructField, + StructType +} +import org.apache.spark.sql.{DataFrame, Dataset} + +/** + * @author Jas Bali + * This transformer stage is supposed to be the first stage of a pipeline and is useful for adding an + * ID column to the input dataset, dropping ignored columns, and registering the original dataset as a temp
view + * This is supposed to work with [[AutoMlOutputDatasetTransformer]] as the last stage of a pipeline + * which reverts the transformed dataset with the ignored fields using ID column + * @param uid + */ +class ZipRegisterTempTransformer(override val uid: String) + extends AbstractTransformer + with DefaultParamsWritable + with HasLabelColumn + with HasFeaturesColumns { + + def this() = { + this(Identifiable.randomUID("ZipRegisterTempTransformer")) + setAutomlInternalId(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + setDebugEnabled(false) + } + + final val tempViewOriginalDatasetName: Param[String] = + new Param[String](this, "tempViewOriginalDatasetName", "Temp table name") + + def setTempViewOriginalDatasetName(value: String): this.type = + set(tempViewOriginalDatasetName, value) + + def getTempViewOriginalDatasetName: String = $(tempViewOriginalDatasetName) + + override def transformInternal(dataset: Dataset[_]): DataFrame = { + val dfWithId = dataset + .filter(col($(labelColumn)).isNotNull) + .filter(!col($(labelColumn)).isNaN) + .withColumn( + AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, + monotonically_increasing_id() + ) + dfWithId.createOrReplaceTempView(getTempViewOriginalDatasetName) + val colsSelectTmp = if (dfWithId.columns.contains(getLabelColumn)) { + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, getLabelColumn) + } else { + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + } + val colsToSelect = + (colsSelectTmp ++ getFeatureColumns) + .map(field => col(field)) + dfWithId.select(colsToSelect: _*) + } + + override def transformSchemaInternal(schema: StructType): StructType = { + if (schema.fieldNames.contains(getLabelColumn)) { + StructType( + schema.fields.filter(field => $(featureColumns).contains(field.name)) + :+ + StructField( + AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, + LongType, + nullable = false + ) + :+ + StructField(getLabelColumn, StringType, nullable = false) + ) + } else { + StructType( + 
schema.fields.filter(field => $(featureColumns).contains(field.name)) + :+ + StructField( + AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, + LongType, + nullable = false + ) + ) + } + } + + override def copy(extra: ParamMap): ZipRegisterTempTransformer = + defaultCopy(extra) +} + +object ZipRegisterTempTransformer + extends DefaultParamsReadable[ZipRegisterTempTransformer] { + override def load(path: String): ZipRegisterTempTransformer = super.load(path) +} diff --git a/src/main/scala/com/databricks/labs/automl/pipeline/inference/PipelineModelInference.scala b/src/main/scala/com/databricks/labs/automl/pipeline/inference/PipelineModelInference.scala new file mode 100644 index 00000000..3559ff98 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pipeline/inference/PipelineModelInference.scala @@ -0,0 +1,48 @@ +package com.databricks.labs.automl.pipeline.inference + +import com.databricks.labs.automl.executor.config.LoggingConfig +import com.databricks.labs.automl.params.{MLFlowConfig, MainConfig} +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, InitDbUtils} +import org.apache.spark.ml.PipelineModel + +/** + * @author Jas Bali + * @since 0.6.1 + * Utility functions for running inference directly against an MlFlow Run ID + */ +object PipelineModelInference { + + + /** + * + * @param runId String of MLFlow runId to be used for Inference + * @param loggingConfig Deprecated -- logging config for older pipelines + * @return + */ + @deprecated("Only for legacy pipelines without main config tracked by MLFlow. 
Use " + + "signature (runId: String, mainConfig: mainConfig: MainConfig) or " + + "(runId: String)", "0.7.1") + def getPipelineModelByMlFlowRunId(runId: String, loggingConfig: LoggingConfig): PipelineModel = { + PipelineModel.load(AutoMlPipelineMlFlowUtils.getPipelinePathByRunId(runId, loggingConfig=Some(loggingConfig))) + } + + /*** + * String of MLFlow runId to be used for Inference + * @param runId + * @param mainConfig + * @return + */ + def getPipelineModelByMlFlowRunId(runId: String, mainConfig: MainConfig): PipelineModel = { + PipelineModel.load(AutoMlPipelineMlFlowUtils.getPipelinePathByRunId(runId, mainConfig=Some(mainConfig))) + } + + /** + * String of MLFlow runId to be used for Inference + * @param runId + * @return + */ + def getPipelineModelByMlFlowRunId(runId: String): PipelineModel = { + PipelineModel.load(AutoMlPipelineMlFlowUtils.getPipelinePathByRunId(runId, None)) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/pyspark/AutomationRunnerUtil.scala b/src/main/scala/com/databricks/labs/automl/pyspark/AutomationRunnerUtil.scala new file mode 100644 index 00000000..1d30b766 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pyspark/AutomationRunnerUtil.scala @@ -0,0 +1,73 @@ +package com.databricks.labs.automl.pyspark + +import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.apache.spark.ml.PipelineModel +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.DataFrame +import com.fasterxml.jackson.databind.ObjectMapper +import com.databricks.labs.automl.executor.config.{ConfigurationGenerator, InstanceConfig} +import org.apache.spark.sql.SparkSession +import com.databricks.labs.automl.AutomationRunner +import com.databricks.labs.automl.pyspark.utils.Utils + +object AutomationRunnerUtil { + lazy val objectMapper = new ObjectMapper() + + def runAutomationRunner(modelFamily: String, + predictionType: String, + configJson: String, + df: DataFrame, + runnerType: String, + defaultFlag: String): 
Unit = { + val instanceConfig = defaultConfigFlag(defaultFlag, + configJson, + modelFamily, + predictionType) + + val mainConfig = ConfigurationGenerator.generateMainConfig(instanceConfig) + if (runnerType == "run"){ + val AutomationRunner = new AutomationRunner(df).setMainConfig(mainConfig).run() + //create temp view of returns + AutomationRunner.generationReportDataFrame.createOrReplaceTempView("generationReport") + AutomationRunner.modelReportDataFrame.createOrReplaceTempView("modelReport") + } + else if (runnerType == "confusion"){ + val AutomationRunner = new AutomationRunner(df).setMainConfig(mainConfig).runWithConfusionReport() + // create temp view of the returns + AutomationRunner.confusionData.createOrReplaceTempView("confusionData") + AutomationRunner.predictionData.createOrReplaceTempView("predictionData") + AutomationRunner.generationReportDataFrame.createOrReplaceTempView("generationReport") + AutomationRunner.modelReportDataFrame.createOrReplaceTempView("modelReport") + + } + else if (runnerType == "prediction"){ + val AutomationRunner = new AutomationRunner(df).setMainConfig(mainConfig).runWithPrediction() + //create temp view of the returns + AutomationRunner.dataWithPredictions.createOrReplaceTempView("dataWithPredictions") + AutomationRunner.generationReportDataFrame.createOrReplaceTempView("generationReport") + AutomationRunner.modelReportDataFrame.createOrReplaceTempView("modelReportData") + } + } + + def defaultConfigFlag(defaultFlag: String, + configJson: String, + modelFamily: String, + predictionType: String): InstanceConfig = { + if (defaultFlag == "true"){ + // Generate default config if default flag is true + val instanceConfig = ConfigurationGenerator.generateDefaultConfig(modelFamily, predictionType) + return instanceConfig + } + else{ + // Generating config from the map of overrides if default configs aren't being used + val overrides = Utils.cleansNestedTypes(jsonToMap(configJson)) + 
ConfigurationGenerator.generateConfigFromMap(modelFamily,predictionType,overrides) + } + } + + def jsonToMap(message: String): Map[String, Any] = { + objectMapper.registerModule(DefaultScalaModule) + objectMapper.readValue(message, classOf[Map[String, Any]]) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/pyspark/FamilyRunnerUtil.scala b/src/main/scala/com/databricks/labs/automl/pyspark/FamilyRunnerUtil.scala new file mode 100644 index 00000000..a5527880 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pyspark/FamilyRunnerUtil.scala @@ -0,0 +1,133 @@ +package com.databricks.labs.automl.pyspark + +import com.fasterxml.jackson.databind.ObjectMapper +import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.apache.spark.sql.DataFrame +import com.databricks.labs.automl.executor.config.{ + ConfigurationGenerator, + InstanceConfig +} +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.pyspark.utils.Utils +import com.databricks.labs.automl.utils.SparkSessionWrapper +import com.databricks.labs.automl.pipeline.inference.PipelineModelInference +import org.apache.spark.ml.PipelineModel + +object FamilyRunnerUtil extends SparkSessionWrapper { + lazy val objectMapper = new ObjectMapper() + def runFamilyRunner(df: DataFrame, + configs: String, + predictionType: String): Unit = { + import spark.implicits._ + + val firstMap = jsonToMap(configs) + val familyRunnerConfigs = buildArray(firstMap, predictionType) + //run the family runner + val runner = FamilyRunner(df, familyRunnerConfigs).executeWithPipeline() + runner.familyFinalOutput.modelReportDataFrame + .createOrReplaceTempView("modelReportDataFrame") + runner.familyFinalOutput.generationReportDataFrame + .createOrReplaceTempView("generationReportDataFrame") + runner.bestMlFlowRunId.toSeq + .toDF("model_family", "run_id") + .createOrReplaceTempView("bestMlFlowRunId") + } + + def cleansNestedTypes(valuesMap: Map[String, Any]): Map[String, Any] = 
{ + val cleanMap: scala.collection.mutable.Map[String, Any] = + scala.collection.mutable.Map() + if (valuesMap.contains("fieldsToIgnoreInVector")) { + cleanMap("fieldsToIgnoreInVector") = valuesMap("fieldsToIgnoreInVector") + .asInstanceOf[List[String]] + .toArray + } + if (valuesMap.contains("outlierFieldsToIgnore")) { + cleanMap("outlierFieldsToIgnore") = valuesMap("outlierFieldsToIgnore") + .asInstanceOf[List[String]] + .toArray + } + if (valuesMap.contains("numericBoundaries")) { + cleanMap("numericBoundaries") = valuesMap("numericBoundaries") + .asInstanceOf[Map[String, List[Any]]] + .flatMap({ + case (k, v) => { + Map(k -> Tuple2(v.head.toString.toDouble, v(1).toString.toDouble)) + } + }) + } + cleanMap.toMap + } + + def runMlFlowInference(mlFlowRunId: String, + modelFamily: String, + predictionType: String, + labelCol: String, + configs: String, + df: DataFrame): Unit = { + + // TO DO add support for default configs + // generate the configs + val familyRunnerConfigs = ConfigurationGenerator.generateConfigFromMap( + modelFamily, + predictionType, + jsonToMap(configs) + ) + // get logging config + val loggingConfig = familyRunnerConfigs.loggingConfig + // get pipeline model + val pipelineModel = PipelineModelInference.getPipelineModelByMlFlowRunId( + mlFlowRunId, + loggingConfig + ) + // run inference on df and pipeline model from mlflow + pipelineModel + .transform(df.drop(labelCol)) + .createOrReplaceTempView("inferenceDF") + } + + def runPathInference(path: String, dataframe: DataFrame): Unit = { + // Read in the Pipeline + PipelineModel + .load(path) + .transform(dataframe) + .createOrReplaceTempView(viewName = "pathInferenceDF") + } + + def runFeatureEngPipeline(df: DataFrame, + modelFamily: String, + predictionType: String, + configs: String): Unit = { + import spark.implicits._ + + val firstMap = jsonToMap(configs) + val familyRunnerConfigs = buildArray(firstMap, predictionType) + //run the family runner + val featureEngPipelineModel = FamilyRunner(df, 
familyRunnerConfigs) + .generateFeatureEngineeredPipeline(verbose = true)(modelFamily) + featureEngPipelineModel + .transform(df) + .createOrReplaceTempView(viewName = "featEngDf") + } + + def buildArray(configs: Map[String, Any], + predictionType: String): Array[InstanceConfig] = { + + configs + .asInstanceOf[Map[String, Map[String, Any]]] + .map({ + case (key, rawValuesMap) => { + val valuesMap: Map[String, Any] = rawValuesMap ++ Utils + .cleansNestedTypes(rawValuesMap) + ConfigurationGenerator + .generateConfigFromMap(key, predictionType, valuesMap) + } + }) + .toArray + } + + def jsonToMap(message: String): Map[String, Any] = { + objectMapper.registerModule(DefaultScalaModule) + objectMapper.readValue(message, classOf[Map[String, Any]]) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/pyspark/FeatureImportanceUtil.scala b/src/main/scala/com/databricks/labs/automl/pyspark/FeatureImportanceUtil.scala new file mode 100644 index 00000000..987a5d1e --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pyspark/FeatureImportanceUtil.scala @@ -0,0 +1,65 @@ +package com.databricks.labs.automl.pyspark + +import com.databricks.labs.automl.executor.config.{ + ConfigurationGenerator, + InstanceConfig +} +import com.databricks.labs.automl.exploration.FeatureImportances +import com.databricks.labs.automl.pyspark.utils.Utils +import com.fasterxml.jackson.databind.ObjectMapper +import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.apache.spark.sql.DataFrame + +object FeatureImportanceUtil { + lazy val objectMapper = new ObjectMapper() + + def runFeatureImportance(modelFamily: String, + predictionType: String, + configJson: String, + df: DataFrame, + cutoffType: String, + cutoffValue: Float, + defaultFlag: String): Unit = { + + val fiConfig = + defaultConfigFlag(defaultFlag, configJson, modelFamily, predictionType) + + val mainConfig = + ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig) + val fImportances = + 
FeatureImportances(df, mainConfig, cutoffType, cutoffValue) + .generateFeatureImportances() + + //create temp importances df and top fields to get them later in python + fImportances.importances.createOrReplaceTempView("importances") + + } + + def defaultConfigFlag(defaultFlag: String, + configJson: String, + modelFamily: String, + predictionType: String): InstanceConfig = { + if (defaultFlag == "true") { + // Generate default config if default flag is true + val fiConfig = ConfigurationGenerator.generateDefaultConfig( + modelFamily, + predictionType + ) + return fiConfig + } else { + // Generating config from the map of overrides if default configs aren't being used + val overrides = Utils.cleansNestedTypes(jsonToMap(configJson)) + ConfigurationGenerator.generateConfigFromMap( + modelFamily, + predictionType, + overrides + ) + } + } + + def jsonToMap(message: String): Map[String, Any] = { + objectMapper.registerModule(DefaultScalaModule) + objectMapper.readValue(message, classOf[Map[String, Any]]) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/pyspark/utils/Utils.scala b/src/main/scala/com/databricks/labs/automl/pyspark/utils/Utils.scala new file mode 100644 index 00000000..6605c81b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/pyspark/utils/Utils.scala @@ -0,0 +1,22 @@ +package com.databricks.labs.automl.pyspark.utils + +object Utils { + + def cleansNestedTypes(valuesMap: Map[String, Any]): Map[String, Any] = { + val cleanMap: scala.collection.mutable.Map[String, Any] = scala.collection.mutable.Map() + if (valuesMap.contains("fieldsToIgnoreInVector")) { + cleanMap("fieldsToIgnoreInVector") = valuesMap("fieldsToIgnoreInVector").asInstanceOf[List[String]].toArray + } + if (valuesMap.contains("outlierFieldsToIgnore")) { + cleanMap("outlierFieldsToIgnore") = valuesMap("outlierFieldsToIgnore").asInstanceOf[List[String]].toArray + } + if (valuesMap.contains("numericBoundaries")) { + cleanMap("numericBoundaries") = 
valuesMap("numericBoundaries").asInstanceOf[Map[String, List[Any]]] + .flatMap({ case (k, v) => { + Map(k -> Tuple2(v.head.toString.toDouble, v(1).toString.toDouble)) + }}) + } + cleanMap.toMap + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/reports/DecisionTreeSplits.scala b/src/main/scala/com/databricks/labs/automl/reports/DecisionTreeSplits.scala new file mode 100644 index 00000000..3eb50fe7 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/reports/DecisionTreeSplits.scala @@ -0,0 +1,130 @@ +package com.databricks.labs.automl.reports + +import com.databricks.labs.automl.model.DecisionTreeTuner +import com.databricks.labs.automl.model.tools.split.{ + DataSplitCustodial, + DataSplitUtility +} +import com.databricks.labs.automl.params.{MainConfig, TreeSplitReport} +import org.apache.spark.ml.classification.DecisionTreeClassificationModel +import org.apache.spark.ml.regression.DecisionTreeRegressionModel +import org.apache.spark.sql.DataFrame + +class DecisionTreeSplits(data: DataFrame, + featConfig: MainConfig, + modelType: String) + extends ReportingTools { + + def runTreeSplitAnalysis(fields: Array[String]): TreeSplitReport = { + + val indexedFields = cleanupFieldArray(fields.zipWithIndex) + + val splitData = DataSplitUtility.split( + data, + featConfig.geneticConfig.kFold, + featConfig.geneticConfig.trainSplitMethod, + featConfig.labelCol, + featConfig.geneticConfig.deltaCacheBackingDirectory, + featConfig.geneticConfig.splitCachingStrategy, + featConfig.modelFamily, + featConfig.geneticConfig.parallelism, + featConfig.geneticConfig.trainPortion, + featConfig.geneticConfig.kSampleConfig.syntheticCol, + featConfig.geneticConfig.trainSplitChronologicalColumn, + featConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + featConfig.dataReductionFactor + ) + + val (modelResults, modelStats) = new DecisionTreeTuner( + data, + splitData, + modelType + ).setLabelCol(featConfig.labelCol) + .setFeaturesCol(featConfig.featuresCol) + 
.setTreesNumericBoundaries(featConfig.numericBoundaries) + .setTreesStringBoundaries(featConfig.stringBoundaries) + .setScoringMetric(featConfig.scoringMetric) + .setTrainPortion(featConfig.geneticConfig.trainPortion) + .setKFold(featConfig.geneticConfig.kFold) + .setSeed(featConfig.geneticConfig.seed) + .setOptimizationStrategy(featConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + featConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + featConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + featConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + featConfig.geneticConfig.numberOfParentsToRetain + ) + .setNumberOfMutationsPerGeneration( + featConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setGeneticMixing(featConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + featConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(featConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(featConfig.geneticConfig.fixedMutationValue) + .evolveWithScoringDF() + + val bestModelData = modelResults.head + + val treeModelBest = modelType match { + case "regressor" => + bestModelData.model.asInstanceOf[DecisionTreeRegressionModel] + case "classifier" => + bestModelData.model.asInstanceOf[DecisionTreeClassificationModel] + case _ => + throw new UnsupportedOperationException( + s"modelType $modelType is not supported for DecisionTrees." + ) + } + + val treeModelString = modelType match { + case "regressor" => + bestModelData.model + .asInstanceOf[DecisionTreeRegressionModel] + .toDebugString + case "classifier" => + bestModelData.model + .asInstanceOf[DecisionTreeClassificationModel] + .toDebugString + case _ => + throw new UnsupportedOperationException( + s"modelType $modelType is not supported for DecisionTrees." 
+ ) + } + + val featureImportances = modelType match { + case "regressor" => + bestModelData.model + .asInstanceOf[DecisionTreeRegressionModel] + .featureImportances + .toArray + case "classifier" => + bestModelData.model + .asInstanceOf[DecisionTreeClassificationModel] + .featureImportances + .toArray + case _ => + throw new UnsupportedOperationException( + s"modelType $modelType is not supported for DecisionTrees." + ) + } + + val importances = generateFrameReport(fields, featureImportances) + + val mappedModelString = + generateDecisionTextReport(treeModelString, indexedFields) + + DataSplitCustodial.cleanCachedInstances(splitData, featConfig) + + TreeSplitReport(mappedModelString, importances, treeModelBest) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/reports/RandomForestFeatureImportance.scala b/src/main/scala/com/databricks/labs/automl/reports/RandomForestFeatureImportance.scala new file mode 100644 index 00000000..a474e0dd --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/reports/RandomForestFeatureImportance.scala @@ -0,0 +1,163 @@ +package com.databricks.labs.automl.reports + +import com.databricks.labs.automl.model.RandomForestTuner +import com.databricks.labs.automl.model.tools.split.{ + DataSplitCustodial, + DataSplitUtility +} +import com.databricks.labs.automl.params.{ + MainConfig, + RandomForestModelsWithResults +} +import org.apache.spark.ml.classification.RandomForestClassificationModel +import org.apache.spark.ml.regression.RandomForestRegressionModel +import org.apache.spark.sql.DataFrame + +class RandomForestFeatureImportance(data: DataFrame, + featConfig: MainConfig, + modelType: String) + extends ReportingTools { + + final private val allowableCutoffTypes = List("none", "value", "count") + + private var _cutoffType = "count" + + private var _cutoffValue = 15.0 + + def setCutoffType(value: String): this.type = { + require( + allowableCutoffTypes.contains(value), + s"Cutoff type $value is not in 
${allowableCutoffTypes.mkString(", ")}" + ) + _cutoffType = value + this + } + + def setCutoffValue(value: Double): this.type = { + _cutoffValue = value + this + } + + def getCutoffType: String = _cutoffType + + def getCutoffValue: Double = _cutoffValue + + def runFeatureImportances( + fields: Array[String] + ): (RandomForestModelsWithResults, DataFrame, Array[String]) = { + + val splitData = DataSplitUtility.split( + data, + featConfig.geneticConfig.kFold, + featConfig.geneticConfig.trainSplitMethod, + featConfig.labelCol, + featConfig.geneticConfig.deltaCacheBackingDirectory, + featConfig.geneticConfig.splitCachingStrategy, + featConfig.modelFamily, + featConfig.geneticConfig.parallelism, + featConfig.geneticConfig.trainPortion, + featConfig.geneticConfig.kSampleConfig.syntheticCol, + featConfig.geneticConfig.trainSplitChronologicalColumn, + featConfig.geneticConfig.trainSplitChronologicalRandomPercentage, + featConfig.dataReductionFactor + ) + + val (modelResults, modelStats) = new RandomForestTuner( + data, + splitData, + modelType + ).setLabelCol(featConfig.labelCol) + .setFeaturesCol(featConfig.featuresCol) + .setRandomForestNumericBoundaries(featConfig.numericBoundaries) + .setRandomForestStringBoundaries(featConfig.stringBoundaries) + .setScoringMetric(featConfig.scoringMetric) + .setTrainPortion(featConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(featConfig.geneticConfig.trainSplitMethod) + .setTrainSplitChronologicalColumn( + featConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + featConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(featConfig.geneticConfig.parallelism) + .setKFold(featConfig.geneticConfig.kFold) + .setSeed(featConfig.geneticConfig.seed) + .setOptimizationStrategy(featConfig.scoringOptimizationStrategy) + .setFirstGenerationGenePool( + featConfig.geneticConfig.firstGenerationGenePool + ) + .setNumberOfMutationGenerations( + 
featConfig.geneticConfig.numberOfGenerations + ) + .setNumberOfMutationsPerGeneration( + featConfig.geneticConfig.numberOfMutationsPerGeneration + ) + .setNumberOfParentsToRetain( + featConfig.geneticConfig.numberOfParentsToRetain + ) + .setGeneticMixing(featConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + featConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(featConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(featConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingScore(featConfig.autoStoppingScore) + .setEarlyStoppingFlag(featConfig.autoStoppingFlag) + .setEvolutionStrategy(featConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionMaxIterations( + featConfig.geneticConfig.continuousEvolutionMaxIterations + ) + .setContinuousEvolutionStoppingScore( + featConfig.geneticConfig.continuousEvolutionStoppingScore + ) + .setContinuousEvolutionParallelism( + featConfig.geneticConfig.continuousEvolutionParallelism + ) + .setContinuousEvolutionMutationAggressiveness( + featConfig.geneticConfig.continuousEvolutionMutationAggressiveness + ) + .setContinuousEvolutionGeneticMixing( + featConfig.geneticConfig.continuousEvolutionGeneticMixing + ) + .setContinuousEvolutionRollingImporvementCount( + featConfig.geneticConfig.continuousEvolutionRollingImprovementCount + ) + .evolveWithScoringDF() + + val bestModelData = modelResults.head + val bestModelFeatureImportances = modelType match { + case "classifier" => + bestModelData.model + .asInstanceOf[RandomForestClassificationModel] + .featureImportances + .toArray + case "regressor" => + bestModelData.model + .asInstanceOf[RandomForestRegressionModel] + .featureImportances + .toArray + case _ => + throw new UnsupportedOperationException( + s"The model type provided, '$modelType', is not supported." 
+ ) + } + + val importances = generateFrameReport(fields, bestModelFeatureImportances) + + val extractedFields = _cutoffType match { + case "none" => fields + case "value" => extractTopFeaturesByImportance(importances, _cutoffValue) + case "count" => extractTopFeaturesByCount(importances, _cutoffValue.toInt) + case _ => + throw new UnsupportedOperationException( + s"Extraction mode ${_cutoffType} is not supported for feature importance reduction" + ) + } + + DataSplitCustodial.cleanCachedInstances(splitData, featConfig) + + (bestModelData, importances, extractedFields) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/reports/ReportingTools.scala b/src/main/scala/com/databricks/labs/automl/reports/ReportingTools.scala new file mode 100644 index 00000000..e9ce5e38 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/reports/ReportingTools.scala @@ -0,0 +1,63 @@ +package com.databricks.labs.automl.reports + +import com.databricks.labs.automl.utils.SparkSessionWrapper +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} + +trait ReportingTools extends SparkSessionWrapper { + + def generateFrameReport(columns: Array[String], importances: Array[Double]): DataFrame = { + import spark.sqlContext.implicits._ + sc.parallelize(columns zip importances).toDF("Feature", "Importance").orderBy($"Importance".desc) + .withColumn("Importance", col("Importance") * 100.0) + .withColumn("Feature", split(col("Feature"), "_si$")(0)) + } + + def cleanupFieldArray(indexedFields: Array[(String, Int)]): List[(String, Int)] = { + + val cleanedBuffer = new ListBuffer[(String, Int)] + indexedFields.map(x => { + cleanedBuffer += ((x._1.split("_si$")(0), x._2)) + }) + cleanedBuffer.result() + } + + def generateDecisionTextReport(modelDebugString: String, featureIndex: List[(String, Int)]): String = { + + val reparsedArray = new ArrayBuffer[(String, String)] + + 
featureIndex.toArray.foreach(x => { + reparsedArray += (("feature " + x._2.toString, x._1)) + }) + reparsedArray.result.toMap.foldLeft(modelDebugString){case(body, (k,v)) => body.replaceAll(java.util.regex.Pattern.quote(k) + "\\b", java.util.regex.Matcher.quoteReplacement(v))} + } + + def reportFields(fieldIndexArray: Array[(String, Int)]): String = { + + val stringConstructor = new ArrayBuffer[String] + cleanupFieldArray(fieldIndexArray).foreach(x => { + stringConstructor += s"Column ${x._1} is feature ${x._2}" + }) + stringConstructor.result.mkString("\n") + } + + def extractTopFeaturesByCount(featureFrame: DataFrame, topNCutoff: Int): Array[String] = { + // Ensure the DataFrame is sorted and take the top N rows + val sortedData = featureFrame.sort(col("Importance").desc).limit(topNCutoff).collect() + + sortedData.map(x => x(0).toString) + + } + + def extractTopFeaturesByImportance(featureFrame: DataFrame, importancePercentageCutoff: Double): Array[String] = { + + val sortedData = featureFrame.filter(col("Importance") >= importancePercentageCutoff) + .sort(col("Importance").desc).collect() + + sortedData.map(x => x(0).toString) + } + + +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/DataSanitizer.scala b/src/main/scala/com/databricks/labs/automl/sanitize/DataSanitizer.scala new file mode 100644 index 00000000..e458b386 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/sanitize/DataSanitizer.scala @@ -0,0 +1,706 @@ +package com.databricks.labs.automl.sanitize + +import com.databricks.labs.automl.exceptions.BooleanFieldFillException +import com.databricks.labs.automl.inference.{NaFillConfig, NaFillPayload} +import com.databricks.labs.automl.model.tools.split.PerformanceSettings +import com.databricks.labs.automl.utils.structures.FeatureEngineeringEnums.FeatureEngineeringEnums +import com.databricks.labs.automl.utils.structures.{ + FeatureEngineeringAllowables, + FeatureEngineeringEnums +} +import com.databricks.labs.automl.utils.{ + 
DataValidation, + SchemaUtils, + SparkSessionWrapper +} +import 
org.apache.spark.ml.feature.StringIndexer +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +import scala.collection.mutable.ArrayBuffer + +class DataSanitizer(data: DataFrame) + extends DataValidation + with SparkSessionWrapper { + + private var _labelCol = "label" + private var _featureCol = "features" + private var _numericFillStat = "mean" + private var _characterFillStat = "max" + private var _modelSelectionDistinctThreshold = 10 + private var _fieldsToIgnoreInVector = Array.empty[String] + private var _filterPrecision: Double = 0.01 + private var _parallelism: Int = 20 + + private var _categoricalNAFillMap: Map[String, String] = + Map.empty[String, String] + private var _numericNAFillMap: Map[String, AnyVal] = Map.empty[String, AnyVal] + private var _characterNABlanketFill: String = "" + private var _numericNABlanketFill: Double = 0.0 + private var _naFillMode: String = "auto" + + final private val _allowableNAFillModes: List[String] = + List( + "auto", + "mapFill", + "blanketFillAll", + "blanketFillCharOnly", + "blanketFillNumOnly" + ) + + def setLabelCol(value: String): this.type = { + _labelCol = value + this + } + + def setFeatureCol(value: String): this.type = { + _featureCol = value + this + } + + def setNumericFillStat(value: String): this.type = { + _numericFillStat = value + this + } + + def setCharacterFillStat(value: String): this.type = { + _characterFillStat = value + this + } + + def setModelSelectionDistinctThreshold(value: Int): this.type = { + _modelSelectionDistinctThreshold = value + this + } + + def setFieldsToIgnoreInVector(value: Array[String]): this.type = { + _fieldsToIgnoreInVector = value + this + } + + def setParallelism(value: Int): this.type = { + _parallelism = value + this + } + + def setFilterPrecision(value: Double): this.type = { + if (value == 0.0) + println( + "Warning! Precision of 0 is an exact calculation of quantiles and may not be performant!" 
+ ) + _filterPrecision = value + this + } + + def setCategoricalNAFillMap(value: Map[String, String]): this.type = { + _categoricalNAFillMap = value + this + } + + def setNumericNAFillMap(value: Map[String, AnyVal]): this.type = { + _numericNAFillMap = value + this + } + + def setCharacterNABlanketFillValue(value: String): this.type = { + _characterNABlanketFill = value + this + } + + def setNumericNABlanketFillValue(value: Double): this.type = { + _numericNABlanketFill = value + this + } + + /** + * Setter for determining the fill mode for handling na values. + * + * @param value Mode for na fill
+ * Available modes:
+ * auto : Stats-based na fill for fields. Usage of .setNumericFillStat and + * .setCharacterFillStat will inform the type of statistics that will be used to fill.
+ * mapFill : Custom by-column overrides to 'blanket fill' na values on a per-column + * basis. The categorical (string) fields are set via .setCategoricalNAFillMap while the + * numeric fields are set via .setNumericNAFillMap.
+ * blanketFillAll : Fills all fields based on the values specified by + * .setCharacterNABlanketFillValue and .setNumericNABlanketFillValue. All NA's for the + * appropriate types will be filled in accordingly throughout all columns.
+ * blanketFillCharOnly : Will use statistics to fill in numeric fields, but will replace + * the na values of all categorical (character) fields with a blanket fill value.
+ * blanketFillNumOnly : Will use statistics to fill in character fields, but will replace + * the na values of all numeric fields with a blanket value. + * @author Ben Wilson, Databricks + * @since 0.5.2 + * @throws IllegalArgumentException if mode is not supported + */ + @throws(classOf[IllegalArgumentException]) + def setNAFillMode(value: String): this.type = { + require( + _allowableNAFillModes.contains(value), + s"NA fill mode $value is not supported. Must be one of: " + + s"${_allowableNAFillModes.mkString(", ")}" + ) + _naFillMode = value + this + } + + def getLabel: String = _labelCol + + def getFeatureCol: String = _featureCol + + def getNumericFillStat: String = _numericFillStat + + def getCharacterFillStat: String = _characterFillStat + + def getModelSelectionDistinctThreshold: Int = _modelSelectionDistinctThreshold + + def getFieldsToIgnoreInVector: Array[String] = _fieldsToIgnoreInVector + + def getParallelism: Int = _parallelism + + def getFilterPrecision: Double = _filterPrecision + + def getCategoricalNAFillMap: Map[String, String] = _categoricalNAFillMap + + def getNumericNAFillMap: Map[String, AnyVal] = _numericNAFillMap + + def getCharacterNABlanketFillValue: String = _characterNABlanketFill + + def getNumericNABlanketFillValue: Double = _numericNABlanketFill + + def getNaFillMode: String = _naFillMode + + private var _labelValidation: Boolean = false + + def labelValidationOn(): this.type = { + _labelValidation = true + this + } + + private def convertLabel(df: DataFrame): DataFrame = { + + val stringIndexer = getLabelIndexer(df) + + stringIndexer + .fit(df) + .transform(df) + .withColumn(this._labelCol, col(s"${this._labelCol}_si")) + .drop(this._labelCol + "_si") + } + + def getLabelIndexer(df: DataFrame): StringIndexer = { + new StringIndexer() + .setInputCol(this._labelCol) + .setOutputCol(this._labelCol + "_si") + } + + private def refactorLabel(df: DataFrame, labelColumn: String): DataFrame = { + + SchemaUtils + .extractSchema(df.schema) + 
.foreach( + x => + x.fieldName match { + case `labelColumn` => + x.dataType match { + case StringType => labelValidationOn() + case BooleanType => labelValidationOn() + case BinaryType => labelValidationOn() + case _ => None + } + case _ => None + } + ) + if (_labelValidation) convertLabel(df) else df + } + + private def metricConversion(metric: String): String = { + + val allowableFillArray = Array("min", "25p", "mean", "median", "75p", "max") + + assert( + allowableFillArray.contains(metric), + s"The metric supplied, '$metric' is not in: " + + s"${invalidateSelection(metric, allowableFillArray)}" + ) + + val summaryMetric = metric match { + case "25p" => "25%" + case "median" => "50%" + case "75p" => "75%" + case _ => metric + } + summaryMetric + } + + private def getBatches(items: List[String]): Array[List[String]] = { + val batches = ArrayBuffer[List[String]]() + val batchSize = math.max(1, items.length / _parallelism) + for (i <- 0 until items.length by batchSize) { + batches.append(items.slice(i, i + batchSize)) + } + batches.toArray + } + + private def getFieldsAndFillable(df: DataFrame, + columnList: List[String], + statistics: String): DataFrame = { + + val readyDF = df.repartition(PerformanceSettings.parTasks).cache + readyDF.foreach(_ => ()) + val selectionColumns = "Summary" +: columnList + val x = if (statistics.isEmpty) { + val colBatches = getBatches(columnList) + colBatches + .map { batch => + readyDF + .select(batch map col: _*) + .summary() + .select("Summary" +: batch map col: _*) + } + .seq + .toArray + .reduce((x, y) => x.join(broadcast(y), Seq("Summary"))) + + } else { + readyDF + .summary(statistics.replaceAll(" ", "").split(","): _*) + .select(selectionColumns map col: _*) + } + readyDF.unpersist(true) + x + } + + private def assemblePayload(df: DataFrame, + fieldList: List[String], + filterCondition: String): Array[(String, Any)] = { + + val summaryStats = getFieldsAndFillable(df, fieldList, filterCondition) + .drop(col("Summary")) + val 
summaryColumns = 
summaryStats.columns + val summaryValues = summaryStats.collect()(0).toSeq.toArray + summaryColumns.zip(summaryValues) + } + + private def getCategoricalFillType(value: String): FeatureEngineeringEnums = { + + value match { + case "min" => FeatureEngineeringEnums.MIN + case "max" => FeatureEngineeringEnums.MAX + } + + } + + /** + * Boolean filling based on the filterCondition for categorical data (min or max) + * @param df DataFrame containing BooleanType fields that may have null values + * @param fieldList List of the Boolean Fields + * @param filterCondition The setting of whether to use min or max values based on the data to fill na's + * @return The fill config data for the Boolean type columns consisting of ColumnName, Boolean fill value + * @since 0.6.2 + * @author Ben Wilson, Databricks + * @throws BooleanFieldFillException if the mode setting for selecting the fill value is not supported. + */ + @throws(classOf[BooleanFieldFillException]) + private def getBooleanFill( + df: DataFrame, + fieldList: List[String], + filterCondition: String + ): Array[(String, Boolean)] = { + + val filterSelection = getCategoricalFillType(filterCondition) + + fieldList + .map(x => { + val booleanFieldStats = + df.select(x) + .groupBy(x) + .agg(count(col(x)).alias(FeatureEngineeringEnums.COUNT_COL.value)) + + val sortedStats = filterSelection match { + case FeatureEngineeringEnums.MIN => + booleanFieldStats + .orderBy(col(FeatureEngineeringEnums.COUNT_COL.value).asc) + .head(1) + case FeatureEngineeringEnums.MAX => + booleanFieldStats + .orderBy(col(FeatureEngineeringEnums.COUNT_COL.value).desc) + .head(1) + case _ => + throw BooleanFieldFillException( + x, + filterCondition, + FeatureEngineeringAllowables.ALLOWED_CATEGORICAL_FILL_MODES.values + ) + } + (x, sortedStats.head.getBoolean(0)) + + }) + .toArray + + } + + /** + * Helper method for extraction the fields types based on the schema and calculating the statistics to be used to + * determine fill values for the columns. 
+ * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillPayload : The per-column fill values calculated for the categorical, numeric, and boolean fields. + * @since 0.1.0 + * @author Ben Wilson, Databricks + */ + private def payloadExtraction(df: DataFrame): NaFillPayload = { + + val typeExtract = + SchemaUtils.extractTypes(df, _labelCol, _fieldsToIgnoreInVector) + + val numericPayload = + assemblePayload( + df, + typeExtract.numericFields, + metricConversion(_numericFillStat) + ) + val characterPayload = + assemblePayload( + df, + typeExtract.categoricalFields, + metricConversion(_characterFillStat) + ) + val booleanPayload = getBooleanFill( + df, + typeExtract.booleanFields, + metricConversion(_characterFillStat) + ) + + NaFillPayload(characterPayload, numericPayload, booleanPayload) + + } + + /** + * Helper method for ensuring that the label column isn't overridden + * + * @param payload Array of field name, value for overriding of numeric fields. + * @return Map of Field name to fill value converted to Double type. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def numericMapper( + payload: Array[(String, Any)] + ): Map[String, Double] = { + + val buffer = new ArrayBuffer[(String, Double)] + + payload.foreach( + x => + x._1 match { + case x._1 if x._1 != _labelCol => + try { + buffer += ((x._1, x._2.toString.toDouble)) + } catch { + case _: Exception => None + } + case _ => None + } + ) + buffer.toArray.toMap + } + + /** + * Helper method for ensuring that the label column isn't overridden + * + * @param payload Array of field name, value for overriding character fields. + * @return Map of Field name to fill value converted to String type. 
+ * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def characterMapper( + payload: Array[(String, Any)] + ): Map[String, String] = { + + val buffer = new ArrayBuffer[(String, String)] + + payload.map( + x => + x._1 match { + case x._1 if x._1 != _labelCol => + try { + buffer += ((x._1, x._2.toString)) + } catch { + case _: Exception => None + } + case _ => None + } + ) + + buffer.toArray.toMap + + } + + /** + * Helper method for generating a statistics-based approach for calculating 'smart fillable' values for na's in the + * feature vector fields. + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillConfig : A mapping for numeric and string fields that represents the values to put in for each column. + * @since 0.1.0 + * @author Ben Wilson, Databricks + */ + private def fillMissing(df: DataFrame): NaFillConfig = { + + val payloads = payloadExtraction(df) + + val numericMapping = numericMapper(payloads.numeric) + + val characterMapping = characterMapper(payloads.categorical) + + NaFillConfig( + numericColumns = numericMapping, + categoricalColumns = characterMapping, + booleanColumns = payloads.boolean.toMap + ) + + } + + /** + * Private method for applying a full blanket na fill on all fields to be involved in the feature vector. + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillConfig : A mapping for numeric and string fields that represents the values to put in for each column. 
+ * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def blanketNAFill(df: DataFrame): NaFillConfig = { + + val payloadTypes = + SchemaUtils.extractTypes(df, _labelCol, _fieldsToIgnoreInVector) + + val characterBuffer = new ArrayBuffer[(String, Any)] + val numericBuffer = new ArrayBuffer[(String, Any)] + + payloadTypes.numericFields.foreach( + x => numericBuffer += ((x, _numericNABlanketFill)) + ) + payloadTypes.categoricalFields.foreach( + x => characterBuffer += ((x, _characterNABlanketFill)) + ) + + //TODO: update Boolean overrides. + NaFillConfig( + characterMapper(characterBuffer.toArray), + numericMapper(numericBuffer.toArray), + payloadTypes.booleanFields.map(x => (x, false)).toMap + ) + + } + + /** + * Private method for applying a full blanket na fill on only character fields to be involved in the feature vector. + * Numeric fields will use the stats mode defined in .setNumericFillStat + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillConfig : A mapping for numeric and string fields that represents the values to put in for each column. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def blanketFillCharOnly(df: DataFrame): NaFillConfig = { + + val payloads = fillMissing(df) + + val buffer = new ArrayBuffer[(String, String)] + + payloads.categoricalColumns.map( + x => buffer += ((x._1, _characterNABlanketFill)) + ) + + NaFillConfig( + characterMapper(buffer.toArray), + payloads.numericColumns, + payloads.booleanColumns + ) + + } + + /** + * Private method for applying a full blanket na fill on only numeric fields to be involved in the feature vector. 
+ * Character fields will use the stats mode defined in .setCharacterFillStat + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillConfig : A mapping for numeric and string fields that represents the values to put in for each column. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def blanketFillNumOnly(df: DataFrame): NaFillConfig = { + + val payloads = fillMissing(df) + + val buffer = new ArrayBuffer[(String, Double)] + + payloads.numericColumns.map(x => buffer += ((x._1, _numericNABlanketFill))) + + NaFillConfig( + payloads.categoricalColumns, + numericMapper(buffer.toArray), + payloads.booleanColumns + ) + + } + + /** + * Validation run-time check for supplied maps, if used. + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @since 0.5.2 + * @author Ben Wilson, Databricks + * @throws IllegalArgumentException if a map value refers to a column not in the dataset + * @throws UnsupportedOperationException if no map overrides have been specified in the run configuration + */ + @throws(classOf[UnsupportedOperationException]) + @throws(classOf[IllegalArgumentException]) + private def validateMapSchemaMembership(df: DataFrame): Unit = { + val suppliedSchema = df.schema.names + + if (_numericNAFillMap.nonEmpty) + _numericNAFillMap.keys.foreach( + x => + require( + suppliedSchema.contains(x), + s"Field $x supplied in .setNumericNAFillMap() is not a valid column name in the DataFrame." + ) + ) + + if (_categoricalNAFillMap.nonEmpty) + _categoricalNAFillMap.keys.foreach( + x => + require( + suppliedSchema.contains(x), + s"Field $x supplied in .setCategoricalNAFillMap() is not a valid column name in the DataFrame." 
+ ) + ) + if (_categoricalNAFillMap.isEmpty && _numericNAFillMap.isEmpty) + throw new UnsupportedOperationException( + s"Map Fill mode has been defined for NA Fill but " + + s"no map overrides have been specified. Check configuration and ensure that either categoricalNAFillMap " + + s"or numericNAFillMap values have been set." + ) + } + + /** + * Private method for submitting a Map of categorical and numeric overrides for na fill based on column name -> value + * as set in .setCategoricalNAFillMap and .setNumericNAFillMap. Any fields not included in these maps will use the + * statistics-based approaches to fill na's. + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillConfig : A mapping for numeric and string fields that represents the values to put in for each column. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def mapNAFill(df: DataFrame): NaFillConfig = { + + validateMapSchemaMembership(df) + + val payloads = fillMissing(df) + + val numBuffer = new ArrayBuffer[(String, Double)] + val charBuffer = new ArrayBuffer[(String, String)] + + payloads.categoricalColumns.map( + x => + x._1 match { + case x._1 if _categoricalNAFillMap.contains(x._1) => + charBuffer += ((x._1, _categoricalNAFillMap(x._1).toString)) + case _ => charBuffer += x + } + ) + payloads.numericColumns.map( + x => + x._1 match { + case x._1 if _numericNAFillMap.contains(x._1) => + numBuffer += ((x._1, _numericNAFillMap(x._1).toString.toDouble)) + case _ => numBuffer += x + } + ) + + NaFillConfig( + characterMapper(charBuffer.toArray), + numericMapper(numBuffer.toArray), + payloads.booleanColumns.map(x => x._1 -> false) + ) + + } + + /** + * Private method for handling control logic depending on na fill mode selected + * + * @param df A DataFrame that has already had the label field converted to the appropriate (Double) Type + * @return NaFillConfig : A mapping for numeric and string fields that 
represents the values to put in for each column. + * @since 0.5.2 + * @author Ben Wilson, Databricks + */ + private def fillNA(df: DataFrame): NaFillConfig = { + + _naFillMode match { + case "auto" => fillMissing(df) + case "blanketFillAll" => + blanketNAFill(df) + case "blanketFillCharOnly" => blanketFillCharOnly(df) + case "blanketFillNumOnly" => blanketFillNumOnly(df) + case "mapFill" => mapNAFill(df) + case _ => + throw new UnsupportedOperationException( + s"The naFill Mode ${_naFillMode} is not supported. " + + s"Must be one of: ${_allowableNAFillModes.mkString(", ")}" + ) + } + + } + + def decideModel(): String = { + val uniqueLabelCounts = data + .select(approx_count_distinct(_labelCol, rsd = _filterPrecision)) + .rdd + .map(row => row.getLong(0)) + .take(1)(0) + val decision = uniqueLabelCounts match { + case x if x <= _modelSelectionDistinctThreshold => "classifier" + case _ => "regressor" + } + decision + } + + def generateCleanData( + naFillConfig: NaFillConfig = null, + refactorLabelFlag: Boolean = true, + decidedModel: String = "" + ): (DataFrame, NaFillConfig, String) = { + + val preFilter = if (refactorLabelFlag) { + refactorLabel(data, _labelCol) + } else { + data + } + + val fillMap = if (naFillConfig == null) { + fillNA(preFilter) + } else { + naFillConfig + } + val filledData = preFilter.na + .fill(fillMap.numericColumns) + .na + .fill(fillMap.categoricalColumns) + .na + .fill(fillMap.booleanColumns) + .filter(col(_labelCol).isNotNull) + .filter(!col(_labelCol).isNaN) + .toDF() + + if (decidedModel != null && decidedModel.nonEmpty) { + (filledData, fillMap, decidedModel) + } else { + (filledData, fillMap, decideModel()) + } + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/FeatureCorrelationDetection.scala b/src/main/scala/com/databricks/labs/automl/sanitize/FeatureCorrelationDetection.scala new file mode 100644 index 00000000..cd3f2779 --- /dev/null +++ 
b/src/main/scala/com/databricks/labs/automl/sanitize/FeatureCorrelationDetection.scala @@ -0,0 +1,305 @@ +package com.databricks.labs.automl.sanitize + +import com.databricks.labs.automl.exceptions.FeatureCorrelationException +import com.databricks.labs.automl.utils.SparkSessionWrapper +import com.databricks.labs.automl.utils.structures.{ + FieldCorrelationAggregationStats, + FieldCorrelationPayload, + FieldPairs, + FieldRemovalPayload +} +import org.apache.log4j.Logger +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DoubleType + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool + +class FeatureCorrelationDetection(data: DataFrame, fieldListing: Array[String]) + extends SparkSessionWrapper { + + private final val DEVIATION = "_deviation" + private final val SQUARED = "_squared" + private final val COV = "covariance_value" + private final val PRODUCT = "_product" + + private val logger: Logger = Logger.getLogger(this.getClass) + + private var _correlationCutoffHigh: Double = 0.0 + private var _correlationCutoffLow: Double = 0.0 + private var _labelCol: String = "label" + private var _parallelism: Int = 20 + + final private val _dataFieldNames = data.schema.fieldNames + + def setCorrelationCutoffHigh(value: Double): this.type = { + require( + value <= 1.0, + "Maximum range of Correlation Cutoff on the high end must be less than or equal to 1.0" + ) + _correlationCutoffHigh = value + this + } + + def setCorrelationCutoffLow(value: Double): this.type = { + require( + value >= -1.0, + "Minimum range of Correlation Cutoff on the low end must be greater than or equal to -1.0" + ) + _correlationCutoffLow = value + this + } + + def setLabelCol(value: String): this.type = { + require( + _dataFieldNames.contains(value), + s"Label field $value is not in DataFrame" + ) + _labelCol = value + this + } + + def 
setParallelism(value: Int): this.type = { + _parallelism = value + this + } + + def getCorrelationCutoffHigh: Double = _correlationCutoffHigh + + def getCorrelationCutoffLow: Double = _correlationCutoffLow + + def getLabelCol: String = _labelCol + def getParallelism: Int = _parallelism + + def filterFeatureCorrelation(): DataFrame = { + + assert( + _dataFieldNames.contains(_labelCol), + s"Label field ${_labelCol} is not in DataFrame" + ) + val featureCorrelation = determineFieldsToDrop + data.drop(featureCorrelation.dropFields: _*) + } + + /** + * Create the left/right testing pairs to be used in determining correlation between feature fields + * @return Array of distinct pairs of feature fields to test + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def buildFeaturePairs(): Array[FieldPairs] = { + + fieldListing + .combinations(2) + .map { case Array(x, y) => FieldPairs(x, y) } + .toArray + + } + + /** + * Method for calculating all of the pairwise correlation calculations for the feature fields + * @return Array of FieldCorrelationPayload data (left/right name pairs and the correlation value) + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def calculateFeatureCorrelation: Array[FieldCorrelationPayload] = { + + val aggregationStats = getAggregationStats + + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val pairs = buildFeaturePairs().par + pairs.tasksupport = taskSupport + + // Map over the parallel collection instead of appending to a shared, + // non-thread-safe ArrayBuffer from multiple threads. + val interactionData = pairs.map { x => + calculateCorrelation( + x, + aggregationStats.rowCounts, + aggregationStats.averageMap + ) + } + + val preSort = interactionData.toArray.groupBy(_.primaryColumn) + fieldListing.flatMap(x => preSort.get(x)).flatten + + } + + /** + * Private method for determining the interaction combinations for the recursive pairwise comparison and + * evaluating whether a particular column has positive or negative correlation to all other columns + * 
and therefore should be dropped from the dataset. + * @param data Array[FieldCorrelationPayload] that represents the pairwise correlation between fields + * @return Map[String, Double] with each field and the percentage of other fields it meets the criteria for + * filtering based on the correlation cutoff values. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def calculateGroupStats( + data: Array[FieldCorrelationPayload] + ): Map[String, Double] = { + data.groupBy(_.primaryColumn).map { + case (k, v) => + val positiveCounts = + v.count(_.correlation >= _correlationCutoffHigh).toDouble + val negativeCounts = + v.count(_.correlation <= _correlationCutoffLow).toDouble + k -> (positiveCounts + negativeCounts) / v.length + } + } + + /** + * Method for determining which columns need to be dropped from the feature set based on the correlation cutoff settings + * @return FieldRemovalPayload that contains the removal and retain fields. + * @since 0.6.2 + * @author Ben Wilson, Databricks + * @throws FeatureCorrelationException: totalFields, removedFields + */ + @throws(classOf[RuntimeException]) + def determineFieldsToDrop: FieldRemovalPayload = { + + val retainBuffer = mutable.SortedSet[String]() + val removeBuffer = mutable.SortedSet[String]() + + val correlationResult = calculateFeatureCorrelation + val groupData = calculateGroupStats(correlationResult) + + correlationResult.foreach(x => { + + if (!removeBuffer + .contains(x.primaryColumn) && groupData(x.primaryColumn) < 1.0) { + if (x.correlation >= _correlationCutoffHigh) { + retainBuffer += x.primaryColumn + removeBuffer += x.pairs.right + } else if (x.correlation <= _correlationCutoffLow) { + retainBuffer += x.primaryColumn + removeBuffer += x.pairs.right + } else { + retainBuffer += x.primaryColumn + } + } else { + removeBuffer += x.primaryColumn + } + + }) + + // Validation + if (retainBuffer.isEmpty) + throw FeatureCorrelationException( + fieldListing, + removeBuffer.result.toArray + ) + 
FieldRemovalPayload(removeBuffer.toArray, retainBuffer.toArray) + + } + + /** + * Debug method to allow for an inspection of the correlation between each feature value to one another + * @return DataFrame that contains the pair information and the correlation values of those pairs + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateFeatureCorrelationReport: DataFrame = { + + import spark.sqlContext.implicits._ + + sc.parallelize(calculateFeatureCorrelation).toDF + + } + + /** + * Private method for accessing the required data needed for correlation calculation (one-time calculation) + * @return FieldCorrelationAggregationStats of row counts and map of average values for each feature field + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def getAggregationStats: FieldCorrelationAggregationStats = { + + val summaryMap = data + .select(fieldListing map col: _*) + .summary("mean") + .filter(col("summary") === "mean") + .drop("summary") + .first() + .getValuesMap[Double](fieldListing) + + val rowCounts = data + .select(_labelCol) + .agg(count(_labelCol).alias(_labelCol)) + .withColumn(_labelCol, col(_labelCol).cast(DoubleType)) + .first() + .getAs[Double](_labelCol) + FieldCorrelationAggregationStats(rowCounts, summaryMap) + } + + /** + * Private method for executing a pair-wise correlation calculation between two feature fields + * @param pair the left/right pair of feature fields to calculate + * @param rowCount Number of rows within the data set + * @param averages average value of each feature as a Map + * @return FieldCorrelationPayload that contains the pair and the correlation value + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def calculateCorrelation( + pair: FieldPairs, + rowCount: Double, + averages: Map[String, Double] + ): FieldCorrelationPayload = { + + // Establish a subset DataFrame that contains only the two columns being tested + val subsetDataFrame = data + .select(pair.left, pair.right) + 
.withColumn(pair.left, col(pair.left).cast(DoubleType)) + .withColumn(pair.right, col(pair.right).cast(DoubleType)) + + // Get the map of the values required to support the linear correlation calculation + val covarianceMap = subsetDataFrame + .withColumn(pair.left + DEVIATION, col(pair.left) - averages(pair.left)) + .withColumn( + pair.right + DEVIATION, + col(pair.right) - averages(pair.right) + ) + .withColumn(pair.left + SQUARED, pow(col(pair.left), 2)) + .withColumn(pair.right + SQUARED, pow(col(pair.right), 2)) + .withColumn(COV, col(pair.left + DEVIATION) * col(pair.right + DEVIATION)) + .withColumn(PRODUCT, col(pair.left) * col(pair.right)) + .agg( + sum(pair.left).alias(pair.left), + sum(pair.right).alias(pair.right), + sum(PRODUCT).alias(PRODUCT), + sum(COV).alias(COV), + sum(pair.left + SQUARED).alias(pair.left + SQUARED), + sum(pair.right + SQUARED).alias(pair.right + SQUARED) + ) + .first() + .getValuesMap[Double]( + Seq( + COV, + pair.left, + pair.right, + PRODUCT, + pair.left + SQUARED, + pair.right + SQUARED + ) + ) + + // Calculate the correlation + val linearCorrelationCoefficient = (covarianceMap(PRODUCT) - (covarianceMap( + pair.left + ) * covarianceMap(pair.right) / rowCount)) / + math.sqrt( + (covarianceMap(pair.left + SQUARED) - math + .pow(covarianceMap(pair.left), 2.0) / rowCount) * (covarianceMap( + pair.right + SQUARED + ) - math.pow(covarianceMap(pair.right), 2.0) / rowCount) + ) + + FieldCorrelationPayload(pair.left, pair, linearCorrelationCoefficient) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/OutlierFiltering.scala b/src/main/scala/com/databricks/labs/automl/sanitize/OutlierFiltering.scala new file mode 100644 index 00000000..751cb992 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/sanitize/OutlierFiltering.scala @@ -0,0 +1,295 @@ +package com.databricks.labs.automl.sanitize + +import java.util.regex.Pattern + +import com.databricks.labs.automl.exceptions.ThreadPoolsBySize +import 
com.databricks.labs.automl.params.{FilterData, ManualFilters} +import com.databricks.labs.automl.utils.{ + DataValidation, + SchemaUtils, + SparkSessionWrapper +} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.{Column, DataFrame} + +import scala.collection.mutable +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool +import scala.concurrent.{Await, Future} + +/** + * + * @param df - Input DataFrame pre-feature vectorization + */ +class OutlierFiltering(df: DataFrame) + extends SparkSessionWrapper + with DataValidation { + + private case class OutlierFilteredDf(mutatedDf: DataFrame, + outlierDf: DataFrame) + + private lazy val LOWER = "lower" + private lazy val UPPER = "upper" + private lazy val BOTH = "both" + + private var _labelCol: String = "label" + private var _filterBounds: String = BOTH + private var _lowerFilterNTile: Double = 0.02 + private var _upperFilterNTile: Double = 0.98 + private var _filterPrecision: Double = 0.01 + private var _continuousDataThreshold: Int = 50 + private var _parallelism: Int = 20 + + final private val _filterBoundaryAllowances: Array[String] = + Array(LOWER, UPPER, BOTH) + final private val _dfSchema = df.schema.fieldNames + + def setLabelCol(value: String): this.type = { + require( + _dfSchema.contains(value), + s"DataFrame does not contain label column $value" + ) + this._labelCol = value + this + } + + def setFilterBounds(value: String): this.type = { + require( + _filterBoundaryAllowances.contains(value), + s"Filter Boundary Mode $value is not a valid member of " + + s"${invalidateSelection(value, _filterBoundaryAllowances)}" + ) + this._filterBounds = value + this + } + + def setLowerFilterNTile(value: Double): this.type = { + require( + value >= 0.0 & value <= 1.0, + s"Lower Filter NTile must be between 0.0 and 1.0" + ) + this._lowerFilterNTile = value + this + } + + def 
setUpperFilterNTile(value: Double): this.type = { + require( + value >= 0.0 & value <= 1.0, + s"Upper Filter NTile must be between 0.0 and 1.0" + ) + this._upperFilterNTile = value + this + } + + def setFilterPrecision(value: Double): this.type = { + if (value == 0.0) + println( + "Warning! Precision of 0 is an exact calculation of quantiles and may not be performant!" + ) + this._filterPrecision = value + this + } + + def setContinuousDataThreshold(value: Int): this.type = { + if (value < 50) + println("Warning! Values less than 50 may indicate ordinal data!") + this._continuousDataThreshold = value + this + } + + def setParallelism(value: Int): this.type = { + _parallelism = value + this + } + + def getLabelCol: String = _labelCol + def getFilterBounds: String = _filterBounds + def getLowerFilterNTile: Double = _lowerFilterNTile + def getUpperFilterNTile: Double = _upperFilterNTile + def getFilterPrecision: Double = _filterPrecision + def getContinuousDataThreshold: Int = _continuousDataThreshold + def getParallelism: Int = _parallelism + + private def filterBoundaries(field: String, ntile: Double): Double = { + df.stat.approxQuantile(field, Array(ntile), _filterPrecision)(0) + } + + private def getBatches(items: List[String]): Array[List[String]] = { + val batches = ArrayBuffer[List[String]]() + val batchSize = math.max(1, items.length / _parallelism) + for (i <- 0 until items.length by batchSize) { + batches.append(items.slice(i, i + batchSize)) + } + batches.toArray + } + + private def validateNumericFields( + ignoreList: Array[String] + ): (List[FilterData], List[String]) = { + + val numericFieldReport = new ListBuffer[FilterData] + val fields = SchemaUtils.extractTypes(df, _labelCol, ignoreList) + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + val numericFieldBatches = getBatches(fields.numericFields).par + numericFieldBatches.tasksupport = taskSupport + + numericFieldBatches.foreach { batch => + val countFields = ArrayBuffer[Column]() + 
batch.foreach(batchCol => { + countFields.append(approx_count_distinct(batchCol, _filterPrecision)) + }) + val countsByCol = batch zip df + .select(countFields: _*) + .collect()(0) + .toSeq + .toArray + .map(_.asInstanceOf[Long]) + // Report every field in the batch, not only the first, and guard the + // shared buffer because batches are processed in parallel. + numericFieldReport.synchronized { + countsByCol.foreach(x => numericFieldReport += FilterData(x._1, x._2)) + } + } + val totalFields = fields.numericFields ::: fields.categoricalFields + (numericFieldReport.result(), totalFields) + } + + private def filterLow(data: DataFrame, + field: String, + filterThreshold: Double): OutlierFilteredDf = { + OutlierFilteredDf( + data.filter(col(field) >= filterThreshold), + data.filter(col(field) < filterThreshold) + ) + } + + private def filterHigh(data: DataFrame, + field: String, + filterThreshold: Double): OutlierFilteredDf = { + OutlierFilteredDf( + data.filter(col(field) <= filterThreshold), + data.filter(col(field) > filterThreshold) + ) + } + + def filterContinuousOutliers( + vectorIgnoreList: Array[String], + ignoreList: Array[String] = Array.empty[String] + ): (DataFrame, DataFrame, Map[String, (Double, String)]) = { + val filteredNumericPayload = new ListBuffer[String] + val (numericPayload, totalFeatureFields) = validateNumericFields( + vectorIgnoreList ++ ignoreList + ) + + val totalFields = totalFeatureFields ++ List(_labelCol) ++ vectorIgnoreList.toList ++ ignoreList.toList + numericPayload.foreach { x => + if (!ignoreList.contains(x.field) & x.uniqueValues >= _continuousDataThreshold) + filteredNumericPayload += x.field + } + var mutatedDF = df + var outlierDF = df + val inferenceOutlierMap = + addToInferenceOutlierMap(filteredNumericPayload.toList, _filterBounds) + inferenceOutlierMap.foreach(item => { + val colName = item._1.split(Pattern.quote("||"))(0) + item._2._2 match { + case LOWER => + val outlierDfs = filterLow(mutatedDF, colName, item._2._1) + mutatedDF = outlierDfs.mutatedDf + outlierDF = outlierDfs.outlierDf + case UPPER => + val outlierDfs = filterHigh(mutatedDF, colName, 
item._2._1) + mutatedDF = outlierDfs.mutatedDf + outlierDF = if (BOTH.equals(_filterBounds)) { + outlierDfs.outlierDf.union(outlierDF) + } else { + outlierDfs.outlierDf + } + } + }) + ( + mutatedDF.select(totalFields.distinct map col: _*), + outlierDF.select(totalFields.distinct map col: _*), + inferenceOutlierMap.result().toMap + ) + } + + private def addToInferenceOutlierMap( + filteredData: List[String], + filterDirection: String + ): mutable.Map[String, (Double, String)] = { + val executionContext = + ThreadPoolsBySize.withScalaExecutionContext(_parallelism) + var outlierMap = mutable.Map[String, (Double, String)]() + val colFutures = ArrayBuffer[Future[(String, (Double, String))]]() + filteredData.foreach(colName => { + getFilterNTileByCase(filterDirection) + .foreach( + item => + colFutures += Future { + colName + "||" + item._1 -> (filterBoundaries(colName, item._2), item._1) + }(executionContext) + ) + }) + colFutures.foreach(item => { + val outlier = Await.result(item, scala.concurrent.duration.Duration.Inf) + outlierMap += outlier + }) + outlierMap + } + + private def getFilterNTileByCase( + filterDirection: String + ): Array[(String, Double)] = { + filterDirection match { + case LOWER => Array((LOWER, _lowerFilterNTile)) + case UPPER => Array((UPPER, _upperFilterNTile)) + case BOTH => Array((LOWER, _lowerFilterNTile), (UPPER, _upperFilterNTile)) + } + } + + def filterContinuousOutliers( + manualFilter: List[ManualFilters], + vectorIgnoreList: Array[String] + ): (DataFrame, DataFrame, Map[String, (Double, String)]) = { + + var mutatedDF = df + var outlierDF = df + val (numericPayload, totalFeatureFields) = validateNumericFields( + vectorIgnoreList + ) + val totalFields = totalFeatureFields ++ List(_labelCol) ++ vectorIgnoreList.toList + + val inferenceOutlierMap: mutable.Map[String, (Double, String)] = + mutable.Map.empty[String, (Double, String)] + + manualFilter.foreach { x => + _filterBounds match { + case LOWER => + inferenceOutlierMap.put(x.field, 
(x.threshold, LOWER)) + val outlierDfs = filterLow(mutatedDF, x.field, x.threshold) + mutatedDF = outlierDfs.mutatedDf + outlierDF = outlierDfs.outlierDf + case UPPER => + inferenceOutlierMap.put(x.field, (x.threshold, UPPER)) + val outlierDfs = filterHigh(mutatedDF, x.field, x.threshold) + mutatedDF = outlierDfs.mutatedDf + outlierDF = if (BOTH.equals(_filterBounds)) { + outlierDfs.outlierDf.union(outlierDF) + } else { + outlierDfs.outlierDf + } + case _ => + throw new UnsupportedOperationException( + s"Filter mode '${_filterBounds}' is not supported. Please use either '$LOWER' or '$UPPER'" + ) + } + } + ( + mutatedDF.select(totalFields.distinct map col: _*), + outlierDF.select(totalFields.distinct map col: _*), + inferenceOutlierMap.result.toMap + ) + } +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/PearsonFiltering.scala b/src/main/scala/com/databricks/labs/automl/sanitize/PearsonFiltering.scala new file mode 100644 index 00000000..45854464 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/sanitize/PearsonFiltering.scala @@ -0,0 +1,541 @@ +package com.databricks.labs.automl.sanitize + +import com.databricks.labs.automl.params.PearsonPayload +import com.databricks.labs.automl.utils.DataValidation +import org.apache.spark.ml.feature.VectorAssembler +import org.apache.spark.ml.linalg.Vector +import org.apache.spark.ml.stat.ChiSquareTest +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DoubleType + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool + +/** + * + * @param df : DataFrame -> Dataset with a vectorized field of features, + * the feature columns, and a label column. 
+ * @param featureColumnListing : Array[String] -> List of all fields that make up the feature vector + * + * Usage: + * val autoFiltered = new PearsonFiltering(featurizedData, fields) + * .setLabelCol("label") + * .setFeaturesCol("features") + * .setFilterStatistic("pearsonStat") + * .setFilterDirection("greater") + * .setFilterMode("auto") + * .setAutoFilterNTile(0.5) + * .filterFields + */ +class PearsonFiltering(df: DataFrame, + featureColumnListing: Array[String], + modelType: String) + extends DataValidation + with SanitizerDefaults { + + private final val PRODUCT = "product" + private final val COV_VALUE = "cov_calculation" + private final val DEVIATION = "_deviation" + private final val SQUARED = "_squared" + + private var _labelCol: String = defaultLabelCol + private var _featuresCol: String = defaultFeaturesCol + private var _filterStatistic: String = defaultPearsonFilterStatistic + private var _filterDirection: String = defaultPearsonFilterDirection + + private var _filterManualValue: Double = defaultPearsonFilterManualValue + private var _filterMode: String = defaultPearsonFilterMode + private var _autoFilterNTile: Double = defaultPearsonAutoFilterNTile + private var _parallelism: Int = 20 + + final private val _dataFieldNames = df.schema.fieldNames + final private val _dataFieldTypes = df.schema.fields + + def setLabelCol(value: String): this.type = { + require( + _dataFieldNames.contains(value), + s"Label Field $value is not in DataFrame Schema." + ) + _labelCol = value + this + } + + def setFeaturesCol(value: String): this.type = { + require( + _dataFieldNames.contains(value), + s"Feature Field $value is not in DataFrame Schema." + ) + require( + _dataFieldTypes.filter(_.name == value)(0).dataType.typeName == "vector", + s"Feature Field $value is not of vector type." 
+ ) + _featuresCol = value + this + } + + def setFilterStatistic(value: String): this.type = { + require( + _allowedStats.contains(value), + s"Pearson Filtering Statistic '$value' is not a valid member of ${invalidateSelection(value, _allowedStats)}" + ) + _filterStatistic = value + this + } + + def setFilterDirection(value: String): this.type = { + require( + _allowedFilterDirections.contains(value), + s"Filter Direction '$value' is not a valid member of ${invalidateSelection(value, _allowedFilterDirections)}" + ) + _filterDirection = value + this + } + + def setFilterManualValue(value: Double): this.type = { + _filterManualValue = value + this + } + + def setFilterManualValue(value: Int): this.type = { + _filterManualValue = value.toDouble + this + } + + def setFilterMode(value: String): this.type = { + require( + _allowedFilterModes.contains(value), + s"Filter Mode $value is not a valid member of ${invalidateSelection(value, _allowedFilterModes)}" + ) + _filterMode = value + this + } + + def setAutoFilterNTile(value: Double): this.type = { + require(value <= 1.0 & value >= 0.0, "NTile value must be between 0 and 1.") + _autoFilterNTile = value + this + } + + def setParallelism(value: Int): this.type = { + _parallelism = value + this + } + + def getLabelCol: String = _labelCol + def getFeaturesCol: String = _featuresCol + def getFilterStatistic: String = _filterStatistic + def getFilterDirection: String = _filterDirection + def getFilterManualValue: Double = _filterManualValue + def getFilterMode: String = _filterMode + def getAutoFilterNTile: Double = _autoFilterNTile + def getParallelism: Int = _parallelism + + private var _pearsonVectorFields: Array[String] = Array.empty + private var _pearsonNonCategoricalFields: Array[String] = Array.empty + + private def setPearsonNonCategoricalFields( + value: Array[String] + ): this.type = { + _pearsonNonCategoricalFields = value + this + } + + private def setPearsonVectorFields(value: Array[String]): this.type = { + 
_pearsonVectorFields = value + this + } + + /** + * Private method for calculating the ChiSq relation of each feature to the label column. + * @param data DataFrame that contains the vector to test and the label column. + * @param featureColumn the name of the feature column vector to be used in the test. + * @return List of the stats from the comparison calculated. + */ + private def buildChiSq(data: DataFrame, + featureColumn: String): List[PearsonPayload] = { + val reportBuffer = new ListBuffer[PearsonPayload] + + val chi = ChiSquareTest.test(data, featureColumn, _labelCol).head + val pvalues = chi.getAs[Vector](0).toArray + val degreesFreedom = chi.getSeq[Int](1).toArray + val pearsonStat = chi.getAs[Vector](2).toArray + + for (i <- _pearsonVectorFields.indices) { + reportBuffer += PearsonPayload( + _pearsonVectorFields(i), + pvalues(i), + degreesFreedom(i), + pearsonStat(i) + ) + } + reportBuffer.result + } + + /** + * Method for getting the exact cardinality (count of distinct values) of a given column. 
+ * @param column Name of the column that is being tested for cardinality + * @return [Long] the number of unique entries in the column + */ + private def acquireCardinality(column: String): Long = { + + val aggregateData = + df.select(col(column)).groupBy(col(column)).agg(count(col(column))) + aggregateData.count() + } + + /** + * Private method for running through all of the fields included in the base feature vector and calculating their + * cardinality in parallel (concurrency is controlled by .setParallelism) + * @return An Array of Field Name, Distinct Count + */ + private def featuresCardinality(): Array[(String, Long)] = { + + val featurePool = featureColumnListing.par + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(_parallelism)) + featurePool.tasksupport = taskSupport + + // Map over the parallel collection instead of appending to a shared, + // non-thread-safe ArrayBuffer from multiple threads. + val cardinalityOfFields = featurePool.map { x => + Tuple2(x, acquireCardinality(x)) + } + + cardinalityOfFields.toArray + } + + /** + * Private method for analyzing the input feature vector columns, determining their cardinality, and updating + * the private var's to use these new lists. + * @return Nothing - it updates the class-scoped variables when called. + */ + private def restrictFeatureSet(): this.type = { + + // Empty ArrayBuffer to hold the fields to build the PearsonFeature Vector + val pearsonVectorBuffer = new ArrayBuffer[String] + val pearsonNonCategoricalBuffer = new ArrayBuffer[String] + + val determineCardinality = featuresCardinality() + + determineCardinality.foreach { x => + if (x._2 < 50) pearsonVectorBuffer += x._1 + else pearsonNonCategoricalBuffer += x._1 + } + + setPearsonNonCategoricalFields(pearsonNonCategoricalBuffer.result.toArray) + setPearsonVectorFields(pearsonVectorBuffer.result.toArray) + + } + + /** + * Method for creating a new temporary feature vector that will be used for Pearson Filtering evaluation, removing + * the high cardinality fields from this test. 
+ * @return [DataFrame] the DataFrame with a new vector entitled "pearsonVector" that is used for removing + * fields from the feature vector that are either highly positively or negatively correlated to the label + * field. + */ + private def reVectorize(): DataFrame = { + + // Create a new feature vector based on the fields that will be evaluated in PearsonFiltering + restrictFeatureSet() + + require( + _pearsonVectorFields.nonEmpty, + s"Pearson Filtering contains all continuous variables in the feature" + + s" vector, or cardinality of all features is at or above the threshold of 50 unique entries. " + + s"Please turn off pearson filtering for this data set by defining the main class with the setter: " + + s".pearsonFilterOff() to continue." + ) + + val assembler = new VectorAssembler() + .setInputCols(_pearsonVectorFields) + .setOutputCol("pearsonVector") + + assembler.transform(df) + } + + /** + * Method for manually filtering out fields from the feature vector based on a user-supplied or + * automation-calculated threshold cutoff. + * @param statPayload the calculated correlation stats from feature elements in the vector to the label column. + * @param filterValue the cut-off value specified by the user, or calculated through the quantile generator + * methodology. + * @return A list of fields that will be persisted and included in the feature vector going forward. 
+ */ + private def filterChiSq(statPayload: List[PearsonPayload], + filterValue: Double): List[String] = { + val fieldRestriction = new ListBuffer[String] + _filterDirection match { + case "greater" => + statPayload.foreach(x => { + x.getClass.getDeclaredFields foreach { f => + f.setAccessible(true) + if (f.getName == _filterStatistic) + if (f.get(x).asInstanceOf[Double] >= filterValue) + fieldRestriction += x.fieldName + else None + else None + } + }) + case "lesser" => + statPayload.foreach(x => { + x.getClass.getDeclaredFields foreach { f => + f.setAccessible(true) + if (f.getName == _filterStatistic) + if (f.get(x).asInstanceOf[Double] <= filterValue) + fieldRestriction += x.fieldName + else None + else None + } + }) + case _ => + throw new UnsupportedOperationException( + s"${_filterDirection} is not supported for manualFilterChiSq" + ) + } + fieldRestriction.result + } + + /** + * Method for automatically detecting the quantile values for the filter statistic to cull fields automatically + * based on the distribution of correlation amongst the feature vector and the label. + * @param pearsonResults The pearson (and other) stats that have been calculated between each element of the + * feature vector and the label. + * @return The PearsonPayload results for each field, filtering out those elements that are either above / below + * the threshold configured. 
+ */ + private def quantileGenerator( + pearsonResults: List[PearsonPayload] + ): Double = { + + val statBuffer = new ListBuffer[Double] + pearsonResults.foreach(x => { + x.getClass.getDeclaredFields foreach { f => + f.setAccessible(true) + if (f.getName == _filterStatistic) + statBuffer += f.get(x).asInstanceOf[Double] + } + }) + + val statSorted = statBuffer.result.sortWith(_ < _) + if (statSorted.size % 2 == 1) + statSorted((statSorted.size * _autoFilterNTile).toInt) + else { + val splitLoc = math.floor(statSorted.size * _autoFilterNTile).toInt + val splitCheck = if (splitLoc < 1) 1 else splitLoc.toInt + val (high, low) = statSorted.splitAt(splitCheck) + (high.last + low.head) / 2 + } + + } + + private def filterClassifier( + ignoreFields: Array[String] = Array.empty[String] + ): DataFrame = { + + val revectoredData = reVectorize() + + val chiSqData = buildChiSq(revectoredData, "pearsonVector") + val featureFields: List[String] = _filterMode match { + case "manual" => + filterChiSq(chiSqData, _filterManualValue) + case _ => + filterChiSq(chiSqData, quantileGenerator(chiSqData)) + } + require( + featureFields.nonEmpty, + "All feature fields have been filtered out. Adjust parameters." + ) + val fieldListing = featureFields ::: List(_labelCol) ::: ignoreFields.toList ::: _pearsonNonCategoricalFields.toList + df.select(fieldListing.map(col): _*) + + } + + /** + * Main entry point for Pearson Filtering + * @param ignoreFields Fields that will be ignored from running a Pearson filter against. 
+ * @return DataFrame restricted to the retained feature fields, the label, and the ignored fields + */ + def filterFields( + ignoreFields: Array[String] = Array.empty[String] + ): DataFrame = { + + // Dispatch on the configured model type (classification vs regression) + modelType match { + case "classifier" => filterClassifier(ignoreFields) + case _ => filterRegressor(ignoreFields) + } + + } + + /** + * Method for manually filtering out values whose linear correlation coefficient is greater than the _filterManualValue setting. + * @param correlationData The mapping of each feature field's covariance and linear correlation coefficient values to the label + * @return Field names that have not been filtered out + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def regressorManualFilter( + correlationData: Map[String, (Double, Double)] + ): Array[String] = { + + val fieldBuffer = new ArrayBuffer[String] + correlationData.keys.foreach { x => + if (correlationData(x)._2 < _filterManualValue) fieldBuffer += x + } + fieldBuffer.toArray + } + + /** + * Method for doing quantile filtering, using the autoFilterNTile in automatic mode, to filter out features that + * show a linear correlation coefficient that is greater than the autoFilterNTile value.
(1.0 == perfect correlation) + * @param correlationData The mapping of each feature field's covariance and linear correlation coefficient values to the label + * @return Field names that have not been filtered out + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def regressionAutoFilter( + correlationData: Map[String, (Double, Double)] + ): Array[String] = { + + val fieldBuffer = new ArrayBuffer[String] + correlationData.keys.foreach { x => + if (correlationData(x)._2 < _autoFilterNTile) fieldBuffer += x + } + fieldBuffer.toArray + } + + /** + * Method for filtering out a regression data set (detect extremely high collinearity in features compared to the label values) + * @param ignoreFields Fields to ignore from the test + * @return Dataframe that has the highly correlated feature fields removed + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def filterRegressor( + ignoreFields: Array[String] = Array.empty[String] + ): DataFrame = { + + val featureFields = _filterMode match { + case "manual" => + regressorManualFilter(calculateRegressionCovariance(ignoreFields)) + case _ => + regressionAutoFilter(calculateRegressionCovariance(ignoreFields)) + } + + require( + featureFields.nonEmpty, + "All feature fields have been filtered out. Adjust parameters." 
+ ) + val fieldListing = featureFields.toList ::: List(_labelCol) ::: ignoreFields.toList + df.select(fieldListing.map(col): _*) + + } + + /** + * Private method for calculating the covariance and linear correlation coefficient for each feature field to the label + * @param ignoreFields Fields to ignore in the analysis + * @return Map of [FieldName, (Covariance value, Linear Correlation Coefficient)] + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def calculateRegressionCovariance( + ignoreFields: Array[String] = Array.empty[String] + ) = { + + val summaryData = df + .select(featureColumnListing ++ Array(_labelCol) map col: _*) + .summary("mean") + + val rowCount = df + .select(col(_labelCol)) + .agg(count(_labelCol).alias(_labelCol)) + .withColumn(_labelCol, col(_labelCol).cast(DoubleType)) + .first() + .getAs[Double](_labelCol) + + val meanValues = + summaryData.filter(col("summary") === "mean").drop("summary") + val meanData = + meanValues.first().getValuesMap[Double](meanValues.schema.fieldNames) + + val buffer = new ArrayBuffer[Map[String, (Double, Double)]] + + meanData.keys.foreach { x => + if (x != _labelCol) + buffer += covarianceCalculation(x, meanData, rowCount) + } + + buffer.result.flatten.toMap + + } + + /** + * Private method for calculating the covariance and linear correlation coefficient between a field and the label + * @param field Field to compare + * @param avgMap Mapping of the average values of each field and the label (calculated only once) + * @param rowCount Double of the row count of the Dataframe (calculated only once) + * @return Map of FieldName -> (covariance, linear correlation coefficient) + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def covarianceCalculation( + field: String, + avgMap: Map[String, Double], + rowCount: Double + ): Map[String, (Double, Double)] = { + + val tempDF = df + .withColumn(field, col(field).cast(DoubleType)) + .select(field, _labelCol) + .withColumn(field + DEVIATION, 
col(field) - avgMap(field)) + .withColumn(field + SQUARED, col(field) * col(field)) + .withColumn(_labelCol + DEVIATION, col(_labelCol) - avgMap(_labelCol)) + .withColumn(_labelCol + SQUARED, col(_labelCol) * col(_labelCol)) + .withColumn( + COV_VALUE, + col(field + DEVIATION) * col(_labelCol + DEVIATION) + ) + .withColumn(PRODUCT, col(field) * col(_labelCol)) + + val summed = tempDF + .agg( + sum(field).alias(field), + sum(_labelCol).alias(_labelCol), + sum(PRODUCT).alias(PRODUCT), + sum(COV_VALUE).alias(COV_VALUE), + sum(field + SQUARED).alias(field + SQUARED), + sum(_labelCol + SQUARED).alias(_labelCol + SQUARED) + ) + .first() + .getValuesMap[Double]( + Seq( + COV_VALUE, + field, + _labelCol, + PRODUCT, + _labelCol + SQUARED, + field + SQUARED + ) + ) + + val linearCorrelationCoefficient = (summed(PRODUCT) - (summed(field) * summed( + _labelCol + ) / rowCount)) / math.sqrt( + (summed(field + SQUARED) - math + .pow(summed(field), 2.0) / rowCount) * (summed(_labelCol + SQUARED) - math + .pow(summed(_labelCol), 2.0) / rowCount) + ) + + Map(field -> (summed(COV_VALUE) / rowCount, linearCorrelationCoefficient)) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/SanitizerDefaults.scala b/src/main/scala/com/databricks/labs/automl/sanitize/SanitizerDefaults.scala new file mode 100644 index 00000000..a5c433ff --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/sanitize/SanitizerDefaults.scala @@ -0,0 +1,41 @@ +package com.databricks.labs.automl.sanitize + +trait SanitizerDefaults { + + //TODO: fill in the rest of the default values here from the other packages within sanitize. 
+ + /** + * Global Defaults + */ + def defaultLabelCol = "label" + def defaultFeaturesCol = "features" + + /** + * Pearson Defaults + */ + final val _allowedStats: Array[String] = + Array("pvalue", "degreesFreedom", "pearsonStat") + final val _allowedFilterDirections: Array[String] = Array("greater", "lesser") + final val _allowedFilterModes: Array[String] = Array("auto", "manual") + + def defaultPearsonFilterStatistic = "pvalue" + def defaultPearsonFilterDirection = "greater" + def defaultPearsonFilterManualValue = 0.0 + def defaultPearsonFilterMode = "auto" + def defaultPearsonAutoFilterNTile = 0.99 + + /** + * Scaler Defaults + */ + final val allowableScalers: Array[String] = + Array("minMax", "standard", "normalize", "maxAbs") + + def defaultRenamedFeaturesCol = "features_r" + def defaultScalerType = "minMax" + def defaultScalerMin = 0.0 + def defaultScalerMax = 1.0 + def defaultStandardScalerMeanFlag = false + def defaultStandardScalerStdDevFlag = true + def defaultPNorm = 2.0 + +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/Scaler.scala b/src/main/scala/com/databricks/labs/automl/sanitize/Scaler.scala new file mode 100644 index 00000000..6486bcc2 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/sanitize/Scaler.scala @@ -0,0 +1,189 @@ +package com.databricks.labs.automl.sanitize + +import com.databricks.labs.automl.utils.DataValidation +import org.apache.spark.ml.{Estimator, Model, PipelineStage, Transformer} +import org.apache.spark.ml.feature.{MaxAbsScaler, MinMaxScaler, Normalizer, StandardScaler} +import org.apache.spark.sql.DataFrame + +class Scaler(df: DataFrame = null) extends DataValidation with SanitizerDefaults { + + private var _featuresCol: String = defaultFeaturesCol + + private var _renamedFeaturesCol: String = defaultRenamedFeaturesCol + + private var _scalerType: String = defaultScalerType + + private var _scalerMin: Double = defaultScalerMin + + private var _scalerMax: Double = defaultScalerMax + + private var 
_standardScalerMeanFlag: Boolean = defaultStandardScalerMeanFlag + + private var _standardScalerStdDevFlag: Boolean = defaultStandardScalerStdDevFlag + + private var _pNorm: Double = defaultPNorm + + private val _dfFieldNames: Array[String] = if(df != null) df.columns else Array.empty + + private def renameFeaturesCol(): this.type = { + _renamedFeaturesCol = _featuresCol + "_r" + this + } + + def setFeaturesCol(value: String): this.type = { + if(_dfFieldNames.nonEmpty) { + require(_dfFieldNames.contains(value), s"Feature Column '$value' is not present in Dataframe schema.") + } + _featuresCol = value + renameFeaturesCol() + this + } + + def setScalerType(value: String): this.type = { + require(allowableScalers.contains(value), s"Scaler Type '$value' is not a valid member of ${ + invalidateSelection(value, allowableScalers) + }") + _scalerType = value + this + } + + def setScalerMin(value: Double): this.type = { + _scalerMin = value + this + } + + def setScalerMax(value: Double): this.type = { + _scalerMax = value + this + } + + def setStandardScalerMeanMode(value: Boolean): this.type = { + _standardScalerMeanFlag = value + this + } + + def setStandardScalerStdDevMode(value: Boolean): this.type = { + _standardScalerStdDevFlag = value + this + } + + def setPNorm(value: Double): this.type = { + require(value >= 1.0, s"pNorm value must be greater than or equal to 1.0. 
'$value' is invalid.") + _pNorm = value + this + } + + def getFeaturesCol: String = _featuresCol + + def getScalerType: String = _scalerType + + def getScalerMin: Double = _scalerMin + + def getScalerMax: Double = _scalerMax + + def getStandardScalerMeanFlag: Boolean = _standardScalerMeanFlag + + def getStandardScalerStdDevFlag: Boolean = _standardScalerStdDevFlag + + def getAllowableScalers: Array[String] = allowableScalers + + def getPNorm: Double = _pNorm + + private def normalizeFeatures(): DataFrame = { + + val normalizer = normalizeFeaturesStage() + + normalizer.transform( + df.withColumnRenamed(_featuresCol, _renamedFeaturesCol) + ) + .drop(_renamedFeaturesCol) + + } + + private def normalizeFeaturesStage() : Transformer = { + new Normalizer() + .setInputCol(_renamedFeaturesCol) + .setOutputCol(_featuresCol) + .setP(_pNorm) + } + + private def minMaxFeatures(): DataFrame = { + + require(_scalerMax > _scalerMin, s"Scaler Max (${_scalerMax}) must be greater than Scaler Min (${_scalerMin})") + val minMaxScaler = minMaxFeaturesStage() + + val dfRenamed = df.withColumnRenamed(_featuresCol, _renamedFeaturesCol) + + val fitScaler = minMaxScaler.fit(dfRenamed) + + fitScaler.transform(dfRenamed).drop(_renamedFeaturesCol) + } + + private def minMaxFeaturesStage(): Estimator[_ <: Model[_]] = { + new MinMaxScaler() + .setInputCol(_renamedFeaturesCol) + .setOutputCol(_featuresCol) + .setMin(_scalerMin) + .setMax(_scalerMax) + } + + private def standardScaleFeatures(): DataFrame = { + + val standardScaler = standardScaleFeaturesStage() + + val dfRenamed = df.withColumnRenamed(_featuresCol, _renamedFeaturesCol) + + val fitStandardScaler = standardScaler.fit(dfRenamed) + + fitStandardScaler.transform(dfRenamed).drop(_renamedFeaturesCol) + + } + + private def standardScaleFeaturesStage(): Estimator[_ <: Model[_]] = { + new StandardScaler() + .setInputCol(_renamedFeaturesCol) + .setOutputCol(_featuresCol) + .setWithMean(_standardScalerMeanFlag) + 
.setWithStd(_standardScalerStdDevFlag) + } + + private def maxAbsScaleFeatures(): DataFrame = { + + val maxAbsScaler = maxAbsScaleFeaturesStage() + + val dfRenamed = df.withColumnRenamed(_featuresCol, _renamedFeaturesCol) + + val fitMaxAbsScaler = maxAbsScaler.fit(dfRenamed) + + fitMaxAbsScaler.transform(dfRenamed).drop(_renamedFeaturesCol) + + } + + private def maxAbsScaleFeaturesStage(): Estimator[_ <: Model[_]] = { + new MaxAbsScaler() + .setInputCol(_renamedFeaturesCol) + .setOutputCol(_featuresCol) + } + + def scaleFeatures(): DataFrame = { + _scalerType match { + case "minMax" => minMaxFeatures() + case "standard" => standardScaleFeatures() + case "normalize" => normalizeFeatures() + case "maxAbs" => maxAbsScaleFeatures() + case _ => throw new UnsupportedOperationException(s"Scaler '${_scalerType}' is not supported.") + } + } + + + def scaleFeaturesForPipeline(): PipelineStage = { + _scalerType match { + case "minMax" => minMaxFeaturesStage() + case "standard" => standardScaleFeaturesStage() + case "normalize" => normalizeFeaturesStage() + case "maxAbs" => maxAbsScaleFeaturesStage() + case _ => throw new UnsupportedOperationException(s"Scaler '${_scalerType}' is not supported.") + } + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/sanitize/VarianceFiltering.scala b/src/main/scala/com/databricks/labs/automl/sanitize/VarianceFiltering.scala new file mode 100644 index 00000000..874e23a4 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/sanitize/VarianceFiltering.scala @@ -0,0 +1,111 @@ +package com.databricks.labs.automl.sanitize + +import com.databricks.labs.automl.pipeline.FeaturePipeline +import org.apache.log4j.Logger +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ArrayBuffer + +class VarianceFiltering(data: DataFrame) { + + private var _labelCol = "label" + private var _featureCol = "features" + private var _dateTimeConversionType = "split" + private var 
_parallelism = 20 + + private val logger: Logger = Logger.getLogger(this.getClass) + + private final val dfSchema = data.schema.fieldNames + + def setLabelCol(value: String): this.type = { + require( + dfSchema.contains(value), + s"Label Column $value does not exist in Dataframe" + ) + _labelCol = value + this + } + + def setFeatureCol(value: String): this.type = { + _featureCol = value + this + } + + def setDateTimeConversionType(value: String): this.type = { + _dateTimeConversionType = value + this + } + + def setParallelism(value: Int): this.type = { + _parallelism = value + this + } + + def getLabelCol: String = _labelCol + + def getFeatureCol: String = _featureCol + + def getDateTimeConversionType: String = _dateTimeConversionType + + def getParallelism: Int = _parallelism + + private def regenerateSchema(fieldSchema: Array[String]): Array[String] = { + fieldSchema.map { x => + x.split("_si$")(0) + } + } + + private def getBatches(items: List[String]): Array[List[String]] = { + // Guard against a zero step size when there are fewer items than _parallelism + val batchSize = math.max(1, items.length / _parallelism) + items.grouped(batchSize).toArray + } + + def filterZeroVariance( + fieldsToIgnore: Array[String] = Array.empty[String] + ): (DataFrame, Array[String]) = { + + val (featurizedData, fields, allFields) = new FeaturePipeline(data) + .setLabelCol(_labelCol) + .setFeatureCol(_featureCol) + .setDateTimeConversionType(_dateTimeConversionType) + .makeFeaturePipeline(fieldsToIgnore) + + val dfParts = featurizedData.rdd.partitions.length.toDouble + val summaryParts = Math.max(32, Math.min(Math.ceil(dfParts / 20.0).toInt, 200)) + val stddevInformation = featurizedData + .coalesce(summaryParts) + .summary("stddev") + .select(fields map col: _*) + .collect()(0) + .toSeq + .toArray + + val stddevData = fields.zip(stddevInformation) + + val preserveColumns = new ArrayBuffer[String] + val removedColumns = new ArrayBuffer[String] + + 
stddevData.foreach { x => + if (x._2.toString.toDouble != 0.0) { + preserveColumns.append(x._1) + } else { + removedColumns.append(x._1) + } + } + + // val removedColumnsString = removedColumns.toArray.mkString(", ") + // println(s"The following columns were removed due to zero variance: $removedColumnsString") + // + // logger.log(Level.WARN, s"The following columns were removed due to zero variance: $removedColumnsString") + val finalFields = preserveColumns.result ++ Array(_labelCol) + + (data.select(finalFields map col: _*), removedColumns.toArray) + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/tracking/MLFlowStructures.scala b/src/main/scala/com/databricks/labs/automl/tracking/MLFlowStructures.scala new file mode 100644 index 00000000..5f618b50 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/tracking/MLFlowStructures.scala @@ -0,0 +1,9 @@ +package com.databricks.labs.automl.tracking + +import org.mlflow.tracking.MlflowClient + +case class MLFlowReturn(client: MlflowClient, + experimentId: String, + runIdPayload: Array[(String, Double)]) + +case class MLFlowReportStructure(fullLog: MLFlowReturn, bestLog: MLFlowReturn) diff --git a/src/main/scala/com/databricks/labs/automl/tracking/MLFlowTracker.scala b/src/main/scala/com/databricks/labs/automl/tracking/MLFlowTracker.scala new file mode 100644 index 00000000..76247557 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/tracking/MLFlowTracker.scala @@ -0,0 +1,954 @@ +package com.databricks.labs.automl.tracking + +import java.io.{File, PrintWriter} +import java.nio.file.Paths + +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.inference.InferenceConfig._ +import com.databricks.labs.automl.inference.{InferenceModelConfig, InferenceTools} +import com.databricks.labs.automl.params.{GenericModelReturn, MLFlowConfig, MainConfig} +import com.databricks.labs.automl.utils.{AutoMlPipelineMlFlowUtils, InitDbUtils, 
PipelineMlFlowTagKeys} +import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostRegressionModel} +import org.apache.spark.ml.classification._ +import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, GBTRegressionModel, LinearRegressionModel, RandomForestRegressionModel} +import org.mlflow.api.proto.Service +import org.mlflow.api.proto.Service.CreateRun +import org.mlflow.tracking.MlflowClient +import org.mlflow.tracking.creds._ +import org.apache.log4j.{Level, Logger} + +import scala.collection.mutable +import scala.collection.mutable.ArrayBuffer +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool +import scala.io.Source + +class MLFlowTracker extends InferenceTools { + + private var _mainConfig: MainConfig = _ + private var _mlFlowTrackingURI: String = _ + private var _mlFlowExperimentName: String = "default" + private var _mlFlowHostedAPIToken: String = _ + private var _modelSaveDirectory: String = _ + private var _logArtifacts: Boolean = false + private var _mlFlowLoggingMode: String = _ + private var _mlFlowBestSuffix: String = _ + private var _mlFlowCustomRunTags: Map[String, String] = Map.empty + private var _mlFlowClient: MlflowClient = _ + + private val logger: Logger = Logger.getLogger(this.getClass) + final private val HOSTED_NAMESPACE = List("databricks.com", "databricks.net") + + def setMainConfig(value: MainConfig): this.type = { + _mainConfig = value + this + } + def setMlFlowTrackingURI(value: String): this.type = { + _mlFlowTrackingURI = value + this + } + + def setMlFlowHostedAPIToken(value: String): this.type = { + _mlFlowHostedAPIToken = value + this + } + + def setMlFlowExperimentName(value: String): this.type = { + _mlFlowExperimentName = value + this + } + + def setModelSaveDirectory(value: String): this.type = { + _modelSaveDirectory = value + this + } + + def logArtifactsOn(): this.type = { + _logArtifacts = true + this + } + + def logArtifactsOff(): this.type = { 
+ _logArtifacts = false + this + } + + def setMlFlowLoggingMode(value: String): this.type = { + _mlFlowLoggingMode = value + this + } + + def setMlFlowBestSuffix(value: String): this.type = { + _mlFlowBestSuffix = value + this + } + + def setMlFlowCustomRunTags(value: Map[String, String]): this.type = { + _mlFlowCustomRunTags = value + this + } + + //Intentionally not providing a getter for an API token. + + def getMlFlowTrackingURI: String = _mlFlowTrackingURI + def getMlFlowExperimentName: String = _mlFlowExperimentName + def getModelSaveDirectory: String = _modelSaveDirectory + def getArtifactLogSetting: Boolean = _logArtifacts + def getMlFlowLoggingMode: String = _mlFlowLoggingMode + def getMlFlowBestSuffix: String = _mlFlowBestSuffix + def getMlFlowCustomRunTags: Map[String, String] = _mlFlowCustomRunTags + + /** + * Get a single MLFlow Client for the instance of the object. Reduce garbage collection by not creating + * a version each time the object is called. + * As of 0.7.1 + * @return + */ + def getMLFlowClient: MlflowClient = { + if (_mlFlowClient == null) { + _mlFlowClient = createHostedMlFlowClient() + _mlFlowClient + } else { + _mlFlowClient + } + } + + /** + * Method for either getting an existing experiment by name, or creating a new one by name and returning the id + * + * @param client: MlflowClient to get access to the mlflow service agent + * @return the experiment id from either an existing run or the newly created one. 
+ */ + private def getOrCreateExperimentId(client: MlflowClient, + experimentName: String = + _mlFlowExperimentName): String = { + + val experiment = client.getExperimentByName(experimentName) + if (experiment.isPresent) experiment.get().getExperimentId + else client.createExperiment(experimentName) + + } + + def createHostedMlFlowClient(): MlflowClient = { + + val hosted: Boolean = HOSTED_NAMESPACE.exists(_mlFlowTrackingURI.contains) + + if (hosted) { + new MlflowClient( + new BasicMlflowHostCreds(_mlFlowTrackingURI, _mlFlowHostedAPIToken) + ) + } else { + new MlflowClient(_mlFlowTrackingURI) + } + } + + private def generateMlFlowRun(client: MlflowClient, + experimentID: String, + runIdentifier: String, + runName: String, + sourceVer: String): String = { + val request: CreateRun.Builder = CreateRun + .newBuilder() + .setExperimentId(experimentID) + .setStartTime(System.currentTimeMillis()) + .addTags(Service.RunTag.newBuilder().setKey("mlflow.runName").setValue(runName).build()) + .addTags(Service.RunTag.newBuilder().setKey("mlflow.source.name").setValue(runIdentifier).build()) + .addTags(Service.RunTag.newBuilder().setKey("mlflow.source.version").setValue(sourceVer).build()) + val run = client.createRun(request.build()) + run.getRunId + } + + def generateMlFlowRunId(): String = { + val client = getMLFlowClient + val experimentId = getOrCreateExperimentId(client, _mlFlowExperimentName + _mlFlowBestSuffix).toString + client.createRun(experimentId).getRunId + } + + private def createFusePath(path: String): String = { + path.replace("dbfs:", "/dbfs") + } + + def logCustomTags(client: MlflowClient, + runId: String, + tags: Map[String, String]): Unit = { + if (tags.nonEmpty) { + tags.foreach { case (k, v) => client.setTag(runId, k, v) } + } + } + + def deleteCustomTags(client: MlflowClient, + runId: String, + tagKeys: Seq[String]): Unit = { + if (tagKeys.nonEmpty) { + tagKeys.foreach(k => client.deleteTag(runId, k)) + } + } + + /** + * Save the entire _mainConfig as a 
json string and add it as an artifact in MLFlow + * @param runId MLFlow run id + * @param configDir The parent directory of where the config file will be stored in /dbfs fuse path + * @return + */ + private def saveConfig(runId: String, configDir: String): String = { + val configPath = s"${configDir}/config_${runId}.json" + if (! new File(createFusePath(configDir)).exists()) new File(createFusePath(configDir)).mkdirs + val cleansedTokenConfig = _mainConfig.mlFlowConfig.copy(mlFlowAPIToken = "[REDACTED]") + val cleansedMainConfig = _mainConfig.copy(mlFlowConfig = cleansedTokenConfig) + logger.log(Level.DEBUG, s"DEBUG: ConfigPath = $configPath") + logger.log(Level.DEBUG, convertMainConfigToJson(cleansedMainConfig)) + + val pw = new PrintWriter(new File(createFusePath(configPath))) + pw.write(convertMainConfigToJson(cleansedMainConfig).compactJson) + pw.close() + + // Add tag for config file location + getMLFlowClient.setTag( + runId, + "MainConfigLocation", + configPath + ) + createFusePath(configPath) + } + + /** + * Private method for saving an individual model, creating a Fuse mount for it, and registering the artifact. + * + * @param client MlFlow client that has been registered. + * @param path Path in blob store for saving the SparkML Model + * @param runId Unique runID for the run to log model artifacts to. + * @param modelReturn Modeling payload for the run in order to extract the specific model type + * @param modelDescriptor Text Assignment for the model family + type of model that was run + * @param modelId Unique uuid identifier for the model. 
+ */ + private def saveModel(client: MlflowClient, + path: String, + runId: String, + modelReturn: GenericModelReturn, + modelDescriptor: String, + modelId: String): Unit = { + + println(s"Model will be saved to path $path") + modelDescriptor match { + case "regressor_RandomForest" => + modelReturn.model + .asInstanceOf[RandomForestRegressionModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "classifier_RandomForest" => + modelReturn.model + .asInstanceOf[RandomForestClassificationModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "regressor_XGBoost" => + modelReturn.model + .asInstanceOf[XGBoostRegressionModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "classifier_XGBoost" => + modelReturn.model + .asInstanceOf[XGBoostClassificationModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "regressor_GBT" => + modelReturn.model + .asInstanceOf[GBTRegressionModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "classifier_GBT" => + modelReturn.model + .asInstanceOf[GBTClassificationModel] + .write + 
.overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "classifier_MLPC" => + modelReturn.model + .asInstanceOf[MultilayerPerceptronClassificationModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "regressor_LinearRegression" => + modelReturn.model + .asInstanceOf[LinearRegressionModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "classifier_LogisticRegression" => + modelReturn.model + .asInstanceOf[LogisticRegressionModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "regressor_SVM" => + modelReturn.model + .asInstanceOf[LinearSVCModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "regressor_Trees" => + modelReturn.model + .asInstanceOf[DecisionTreeRegressionModel] + .write + .overwrite() + .save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case "classifier_Trees" => + modelReturn.model + .asInstanceOf[DecisionTreeClassificationModel] + .write + .overwrite() + 
.save(path) + if (_logArtifacts) + client.logArtifacts(runId, new File(createFusePath(path))) + client.setTag(runId, "ModelSaveLocation", path) + client.setTag(runId, "TrainingPayload", modelReturn.toString) + case _ => + throw new UnsupportedOperationException( + s"Model Type $modelDescriptor is not supported for mlflow logging." + ) + } + } + + /** + * Public method for logging a model, parameters, and metrics to MlFlow + * + * @param runData Full collection of parameters, results, and models for the autoML experiment + * @param modelFamily Type of Model Family used (e.g. "RandomForest") + * @param modelType Type of Model used (e.g. "regressor") + */ + def logMlFlowDataAndModels( + runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String, + inferenceSaveLocation: String, + optimizationStrategy: String + ): MLFlowReportStructure = { + + val dummyLog = + MLFlowReturn(getMLFlowClient, "none", Array(("none", 0.0))) + + logger.log(Level.INFO, s"DEBUG: mlFlowLoggingMode: ${_mlFlowLoggingMode}") + + val bestLog = _mlFlowLoggingMode match { + case "tuningOnly" => + dummyLog + case _ => + logBest( + runData, + modelFamily, + modelType, + inferenceSaveLocation, + optimizationStrategy + ) + } + + val fullLog = _mlFlowLoggingMode match { + case "bestOnly" => dummyLog + case _ => + logTuning(runData, modelFamily, modelType, inferenceSaveLocation) + } + + MLFlowReportStructure(fullLog = fullLog, bestLog = bestLog) + + } + + /** + * This method does not save any artifacts or inference configs. 
+ * For the Best Model logging mode, it logs params and metrics to a given mlFlowRunId + * For the tuning logging mode, it logs params and metrics to separate mlFlowRunIds + * @param mlFlowRunId + * @param runData + * @param modelFamily + * @param modelType + * @param optimizationStrategy + * @return + */ + def logMlFlowForPipeline(mlFlowRunId: String, + runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String, + optimizationStrategy: String + ): MLFlowReportStructure = { + val dummyLog = + MLFlowReturn(getMLFlowClient, "none", Array(("none", 0.0))) + + val bestLog = _mlFlowLoggingMode match { + case "tuningOnly" => + dummyLog + case _ => + logBestForPipeline( + mlFlowRunId, + runData, + modelFamily, + modelType, + optimizationStrategy + ) + } + + val fullLog = _mlFlowLoggingMode match { + case "bestOnly" => dummyLog + case _ => + logTuningForPipeline(runData, modelFamily, modelType) + } + MLFlowReportStructure(fullLog = fullLog, bestLog = bestLog) + } + + private def logBestForPipeline( + runId: String, + runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String, + optimizationStrategy: String): MLFlowReturn = { + val mlflowLoggingClient = getMLFlowClient + + val experimentId = getOrCreateExperimentId( + mlflowLoggingClient, + _mlFlowExperimentName + _mlFlowBestSuffix + ).toString + + val bestModel = getBestModel(optimizationStrategy, runData) + + val runIdPayload = Array((runId, bestModel.score)) + + val modelHyperParams = bestModel.hyperParams.keys + val metrics = bestModel.metrics.keys + + modelHyperParams.foreach { x => + val valueData = bestModel.hyperParams(x) + mlflowLoggingClient.logParam(runId, x, valueData.toString) + } + metrics.foreach { x => + val valueData = bestModel.metrics(x) + mlflowLoggingClient.logMetric(runId, x, valueData.toString.toDouble) + } + + val modelDescriptor = s"${modelType}_$modelFamily" + mlflowLoggingClient.logParam(runId, "modelType", modelDescriptor) + + 
mlflowLoggingClient.logParam(runId, "generation", "Best") + + // Save main config to MLFlow + val baseDirectory = Paths.get(s"${_modelSaveDirectory}/BestRun").toString + val configDir = s"${baseDirectory}/${modelDescriptor}_${runId}/config" + val configPath = saveConfig(runId, configDir) + mlflowLoggingClient.logArtifact(runId, new File(configPath)) + + // Log custom tags if present + if (_mlFlowCustomRunTags.nonEmpty) { + logCustomTags(mlflowLoggingClient, runId, _mlFlowCustomRunTags) + } + + MLFlowReturn( + mlflowLoggingClient, + experimentId, + runIdPayload) + } + + private def getBestModel(optimizationStrategy: String, + runData: Array[GenericModelReturn]): GenericModelReturn = { + optimizationStrategy match { + case "minimize" => runData.sortWith(_.score < _.score)(0) + case _ => runData.sortWith(_.score > _.score)(0) + } + } + + private def logBest(runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String, + inferenceSaveLocation: String, + optimizationStrategy: String): MLFlowReturn = { + + val bestModel = getBestModel(optimizationStrategy, runData) + val mlflowLoggingClient = getMLFlowClient + val experimentId = getOrCreateExperimentId( + mlflowLoggingClient, + _mlFlowExperimentName + _mlFlowBestSuffix + ).toString + + val totalVersion = + mlflowLoggingClient.getExperiment(experimentId).getRunsCount + + val baseDirectory = Paths.get(s"${_modelSaveDirectory}/BestRun").toString + + val modelDescriptor = s"${modelType}_$modelFamily" + + //TODO(Jas): This needs to be synchronized to make sure a true runVersion is generated + val runVersion: Int = totalVersion + 1 + + val runId = generateMlFlowRun( + mlflowLoggingClient, + experimentId, + modelDescriptor, + "BestRun_", + runVersion.toString + ) + + val runIdPayload = Array((runId, bestModel.score)) + + val modelHyperParams = bestModel.hyperParams.keys + val metrics = bestModel.metrics.keys + + modelHyperParams.foreach { x => + val valueData = bestModel.hyperParams(x) + 
mlflowLoggingClient.logParam(runId, x, valueData.toString) + } + metrics.foreach { x => + val valueData = bestModel.metrics(x) + mlflowLoggingClient.logMetric(runId, x, valueData.toString.toDouble) + } + + mlflowLoggingClient.logParam(runId, "modelType", modelDescriptor) + + val modelDir = s"$baseDirectory/${modelDescriptor}_$runId/bestModel" + + saveModel( + mlflowLoggingClient, + modelDir, + runId, + bestModel, + modelDescriptor, + "BestRun" + ) + mlflowLoggingClient.logParam(runId, "generation", "Best") + + // Save main config to MLFlow + val configDir = s"${baseDirectory}/${modelDescriptor}_${runId}/config" + val configPath = saveConfig(runId, configDir) + mlflowLoggingClient.logArtifact(runId, new File(configPath)) + + // Log custom tags if present + if (_mlFlowCustomRunTags.nonEmpty) { + logCustomTags(mlflowLoggingClient, runId, _mlFlowCustomRunTags) + } + + //Inference data save + val inferencePath = Paths + .get(s"$inferenceSaveLocation/$experimentId/${_mlFlowBestSuffix}") + .toString + val inferenceLocation = inferencePath + "/" + runId + _mlFlowBestSuffix + val inferenceMlFlowConfig = getInternalMlFlowConfig(baseDirectory) + val inferenceModelConfig = getInferenceModelConfig( + modelFamily, + modelType, + "mlflow", + inferenceMlFlowConfig, + runId, + modelDir) + setInferenceModelConfig(inferenceModelConfig) + setInferenceConfigStorageLocation(inferenceLocation) + + val inferenceConfig = getInferenceConfig + + val inferenceConfigAsJSON = convertInferenceConfigToJson(inferenceConfig) + + val inferenceConfigAsDF = convertInferenceConfigToDataFrame(inferenceConfig) + + //Save the inference config to the save location + println(s"Inference DF will be saved to $inferenceLocation") + inferenceConfigAsDF.write.save(inferenceLocation) + + mlflowLoggingClient.setTag( + runId, + "InferenceConfig", + inferenceConfigAsJSON.compactJson + ) + + mlflowLoggingClient.setTag( + runId, + "InferenceDataFrameLocation", + inferenceLocation + ) + + 
MLFlowReturn(mlflowLoggingClient, experimentId, runIdPayload) + + } + + private def getInferenceModelConfig(modelFamily: String, + modelType: String, + modelLoadMethod: String, + inferenceMlFlowConfig: MLFlowConfig, + mlFlowRunId: String, + modelPathLocation: String): InferenceModelConfig = { + InferenceModelConfig( + modelFamily = modelFamily, + modelType = modelType, + modelLoadMethod = modelLoadMethod, + mlFlowConfig = inferenceMlFlowConfig, + mlFlowRunId = mlFlowRunId, + modelPathLocation = modelPathLocation + ) + } + private def getInternalMlFlowConfig(baseDirectory: String): MLFlowConfig = { + MLFlowConfig( + mlFlowTrackingURI = _mlFlowTrackingURI, + mlFlowExperimentName = _mlFlowExperimentName, + mlFlowAPIToken = _mlFlowHostedAPIToken, + mlFlowModelSaveDirectory = baseDirectory, + mlFlowLoggingMode = _mlFlowLoggingMode, + mlFlowBestSuffix = _mlFlowBestSuffix, + mlFlowCustomRunTags = _mlFlowCustomRunTags + ) + } + + private def logTuningForPipeline( + runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String): MLFlowReturn = { + + val runIdPayloadBuffer = ArrayBuffer[(String, Double)]() + + val mlflowLoggingClient = getMLFlowClient + val experimentId = getOrCreateExperimentId(mlflowLoggingClient).toString + + var totalVersion = + mlflowLoggingClient.getExperiment(experimentId).getRunsCount + + val generationSet = mutable.Set[Int]() + runData.foreach(x => generationSet += x.generation) + val uniqueGenerations = generationSet.result.toArray.sortWith(_ < _) + + val modelDescriptor = s"${modelType}_$modelFamily" + + // loop through each generation and log the data + uniqueGenerations.foreach { g => + // get the runs from this generation + val currentGen = runData.filter(x => x.generation == g) + var withinRunId = 0 + // Execute these writes in parallel. 
+ val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(10)) + val generations = currentGen.par + generations.tasksupport = taskSupport + generations.foreach { x => + totalVersion += 1 + val uniqueRunIdent = + s"${modelFamily}_${modelType}_${x.generation.toString}_${withinRunId.toString}_${x.score.toString}" + val runName = "run_" + x.generation.toString + "_" + withinRunId.toString + val runId = generateMlFlowRun( + mlflowLoggingClient, + experimentId, + uniqueRunIdent, + runName, + totalVersion.toString + ) + runIdPayloadBuffer += Tuple2(runId, x.score) + val hyperParamKeys = x.hyperParams.keys + hyperParamKeys.foreach { k => + val valueData = modelFamily match { + case "MLPC" => + x.hyperParams(k) match { + case "layers" => + x.hyperParams(k).asInstanceOf[Array[Int]].mkString(",") + case _ => x.hyperParams(k) + } + case _ => x.hyperParams(k) + } + //val valueData = x.hyperParams(k) + mlflowLoggingClient.logParam(runId, k, valueData.toString) + } + val metricKeys = x.metrics.keys + metricKeys.foreach { k => + val valueData = x.metrics(k) + mlflowLoggingClient.logMetric(runId, k, valueData.toString.toDouble) + } + + // Save main config to MLFlow + val baseDirectory = Paths.get(s"${_modelSaveDirectory}/BestRun").toString + val configDir = s"${baseDirectory}/${modelDescriptor}_${runId}/config" + val configPath = saveConfig(runId, configDir) + mlflowLoggingClient.logArtifact(runId, new File(configPath)) + mlflowLoggingClient.logParam(runId, "modelType", modelDescriptor) + + // log the generation + mlflowLoggingClient.logParam(runId, "generation", x.generation.toString) + // Log custom tags if present + if (_mlFlowCustomRunTags.nonEmpty) { + logCustomTags(mlflowLoggingClient, runId, _mlFlowCustomRunTags) + } + withinRunId += 1 + } + } + MLFlowReturn( + mlflowLoggingClient, + experimentId, + runIdPayloadBuffer.result().toArray + ) + } + + private def logTuning(runData: Array[GenericModelReturn], + modelFamily: String, + modelType: String, + inferenceSaveLocation: 
String): MLFlowReturn = { + + val runIdPayloadBuffer = ArrayBuffer[(String, Double)]() + + val mlflowLoggingClient = getMLFlowClient + + val experimentId = getOrCreateExperimentId(mlflowLoggingClient).toString + + var totalVersion = + mlflowLoggingClient.getExperiment(experimentId).getRunsCount + + val generationSet = mutable.Set[Int]() + runData.map(x => generationSet += x.generation) + val uniqueGenerations = generationSet.result.toArray.sortWith(_ < _) + + val baseDirectory = Paths.get(s"${_modelSaveDirectory}").toString + + val modelDescriptor = s"${modelType}_$modelFamily" + + // loop through each generation and log the data + uniqueGenerations.foreach { g => + // get the runs from this generation + val currentGen = runData.filter(x => x.generation == g) + + var withinRunId = 0 + + // Execute these writes in parallel. + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(10)) + val generations = currentGen.par + generations.tasksupport = taskSupport + + generations.foreach { x => + totalVersion += 1 + + val uniqueRunIdent = + s"${modelFamily}_${modelType}_${x.generation.toString}_${withinRunId.toString}_${x.score.toString}" + + val runName = "run_" + x.generation.toString + "_" + withinRunId.toString + + val runId = generateMlFlowRun( + mlflowLoggingClient, + experimentId, + uniqueRunIdent, + runName, + totalVersion.toString + ) + + runIdPayloadBuffer += Tuple2(runId, x.score) + + val hyperParamKeys = x.hyperParams.keys + + hyperParamKeys.foreach { k => + val valueData = modelFamily match { + case "MLPC" => + x.hyperParams(k) match { + case "layers" => + x.hyperParams(k).asInstanceOf[Array[Int]].mkString(",") + case _ => x.hyperParams(k) + } + case _ => x.hyperParams(k) + } + + //val valueData = x.hyperParams(k) + mlflowLoggingClient.logParam(runId, k, valueData.toString) + } + val metricKeys = x.metrics.keys + + metricKeys.foreach { k => + val valueData = x.metrics(k) + mlflowLoggingClient.logMetric(runId, k, valueData.toString.toDouble) + } + + 
mlflowLoggingClient.logParam(runId, "modelType", modelDescriptor) + + // Generate a new unique uuid for the model to ensure there are no overwrites. + val uniqueModelId = java.util.UUID.randomUUID().toString.replace("-", "") + + // Set a location to write the model to + val modelDir = s"$baseDirectory/${modelDescriptor}_$runId/$uniqueModelId" + + // log the model artifact + saveModel( + mlflowLoggingClient, + modelDir, + runId, + x, + modelDescriptor, + uniqueModelId + ) + + // log the generation + mlflowLoggingClient.logParam(runId, "generation", x.generation.toString) + + // Save the Config and add to MLFlow artifacts + val configDir = s"${baseDirectory}/${modelDescriptor}_${runId}/config" + val configPath = saveConfig(runId, configDir) + mlflowLoggingClient.logArtifact(runId, new File(configPath)) + + // Log custom tags if present + if (_mlFlowCustomRunTags.nonEmpty) { + logCustomTags(mlflowLoggingClient, runId, _mlFlowCustomRunTags) + } + + /** + * Set the remaining aspect of InferenceConfig for this run + */ + // set the model save directory + val inferencePath = inferenceSaveLocation.takeRight(1) match { + case "/" => s"$inferenceSaveLocation$experimentId/" + case _ => s"$inferenceSaveLocation/$experimentId/" + } + + val inferenceLocation = inferencePath + runId + + val inferenceMlFlowConfig = getInternalMlFlowConfig(baseDirectory) + val inferenceModelConfig = getInferenceModelConfig( + modelFamily, + modelType, + "mlflow", + inferenceMlFlowConfig, + runId, + modelDir) + + setInferenceModelConfig(inferenceModelConfig) + setInferenceConfigStorageLocation(inferenceLocation) + + val inferenceConfig = getInferenceConfig + + val inferenceConfigAsJSON = + convertInferenceConfigToJson(inferenceConfig) + + val inferenceConfigAsDF = + convertInferenceConfigToDataFrame(inferenceConfig) + + //Save the inference config to the save location + inferenceConfigAsDF.write.save(inferenceLocation) + + mlflowLoggingClient.setTag( + runId, + "InferenceConfig", + 
inferenceConfigAsJSON.compactJson + ) + + mlflowLoggingClient.setTag( + runId, + "InferenceDataFrameLocation", + inferenceLocation + ) + + withinRunId += 1 + + } + } + + MLFlowReturn( + mlflowLoggingClient, + experimentId, + runIdPayloadBuffer.result().toArray + ) + + } + +} +object MLFlowTracker { + /** + * Normal method for instantiating MLFlowTracker. Must be used for training tuning and can be used for + * Inference + * @param mainConfig + * @return + */ + def apply(mainConfig: MainConfig): MLFlowTracker = { + new MLFlowTracker() + .setMlFlowTrackingURI(mainConfig.mlFlowConfig.mlFlowTrackingURI) + .setMlFlowHostedAPIToken(mainConfig.mlFlowConfig.mlFlowAPIToken) + .setMlFlowExperimentName(mainConfig.mlFlowConfig.mlFlowExperimentName) + .setModelSaveDirectory(mainConfig.mlFlowConfig.mlFlowModelSaveDirectory) + .setMlFlowLoggingMode(mainConfig.mlFlowConfig.mlFlowLoggingMode) + .setMlFlowBestSuffix(mainConfig.mlFlowConfig.mlFlowBestSuffix) + .setMainConfig(mainConfig) + } + + /** + * WARNING -- ONLY FOR INFERENCE PIPELINE + * Create MLFlow tracker with only a required RunID. 
Used for Inference Runs referencing a main config + * tracked by MLFlow + * @param runId String of MLFlow runId to be used for Inference + * @param trackingURI Optional input to use an external tracking server + * @param apiToken Optional input to use non-default user token + * @return MLFlowTracker configured from the main config recorded for the given run + */ + def apply(runId: String, trackingURI: Option[String] = None, apiToken: Option[String] = None): MLFlowTracker = { + def createFusePath(path: String): String = { + path.replace("dbfs:", "/dbfs") + } + + val _uri = trackingURI.getOrElse(InitDbUtils.getTrackingURI) + val _apiToken = apiToken.getOrElse(InitDbUtils.getAPIToken) + + val client = new MlflowClient(new BasicMlflowHostCreds(_uri, _apiToken)) + val mainConfigPath = AutoMlPipelineMlFlowUtils.getMlFlowTagByKey(client, runId, "MainConfigLocation") + val sourceBuffer = Source.fromFile(createFusePath(mainConfigPath)) + val jString = sourceBuffer.getLines.mkString + sourceBuffer.close() + val mainConfig = ConfigurationGenerator.generateMainConfigFromJson(jString) + new MLFlowTracker() + .setMlFlowTrackingURI(mainConfig.mlFlowConfig.mlFlowTrackingURI) + .setMlFlowHostedAPIToken(mainConfig.mlFlowConfig.mlFlowAPIToken) + .setMlFlowExperimentName(mainConfig.mlFlowConfig.mlFlowExperimentName) + .setModelSaveDirectory(mainConfig.mlFlowConfig.mlFlowModelSaveDirectory) + .setMlFlowLoggingMode(mainConfig.mlFlowConfig.mlFlowLoggingMode) + .setMlFlowBestSuffix(mainConfig.mlFlowConfig.mlFlowBestSuffix) + .setMainConfig(mainConfig) + } + + @deprecated("No Main Config tracking available with this method. " + + "Only pass in logging config for old pipelines. 
For new pipelines only call " + + "MLFlowTracker(runId, optional trackingURI, optional apiToken)", "0.7.1") + def apply(mlFlowConfig: MLFlowConfig): MLFlowTracker = { + new MLFlowTracker() + .setMlFlowTrackingURI(mlFlowConfig.mlFlowTrackingURI) + .setMlFlowHostedAPIToken(mlFlowConfig.mlFlowAPIToken) + .setMlFlowExperimentName(mlFlowConfig.mlFlowExperimentName) + .setModelSaveDirectory(mlFlowConfig.mlFlowModelSaveDirectory) + .setMlFlowLoggingMode(mlFlowConfig.mlFlowLoggingMode) + .setMlFlowBestSuffix(mlFlowConfig.mlFlowBestSuffix) + } +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/AutoMlPipelineMlFlowUtils.scala b/src/main/scala/com/databricks/labs/automl/utils/AutoMlPipelineMlFlowUtils.scala new file mode 100644 index 00000000..c616c261 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/AutoMlPipelineMlFlowUtils.scala @@ -0,0 +1,209 @@ +package com.databricks.labs.automl.utils + +import java.nio.file.Paths + +import com.databricks.labs.automl.executor.config.LoggingConfig +import com.databricks.labs.automl.params.{MLFlowConfig, MainConfig} +import com.databricks.labs.automl.pipeline.{PipelineStateCache, PipelineVars} +import com.databricks.labs.automl.tracking.MLFlowTracker +import org.apache.log4j.Logger +import org.apache.spark.ml.{PipelineModel, PredictionModel} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.types.StructType +import org.mlflow.api.proto.Service +import org.mlflow.tracking.MlflowClient + +/** + * @author Jas Bali + * @since 0.6.1 + * Mlflow Utility for Pipeline tasks + */ +object AutoMlPipelineMlFlowUtils { + + @transient private val logger: Logger = Logger.getLogger(this.getClass) + + lazy final val AUTOML_INTERNAL_ID_COL = "automl_internal_id" + + case class ConfigByPipelineIdOutput(mainConfig: MainConfig, + mlFlowRunId: String) + + def extractTopLevelColNames(schema: StructType): Array[String] = + schema.fields.map(field => field.name) + + def getMainConfigByPipelineId( + pipelineId: 
String + ): ConfigByPipelineIdOutput = { + val mainConfig = PipelineStateCache + .getFromPipelineByIdAndKey(pipelineId, PipelineVars.MAIN_CONFIG.key) + .asInstanceOf[MainConfig] + if (mainConfig.mlFlowLoggingFlag) { + val mlFlowRunId = PipelineStateCache + .getFromPipelineByIdAndKey(pipelineId, PipelineVars.MLFLOW_RUN_ID.key) + .asInstanceOf[String] + ConfigByPipelineIdOutput(mainConfig, mlFlowRunId) + } else { + ConfigByPipelineIdOutput(mainConfig, null) + } + } + + def logTagsToMlFlow(pipelineId: String, tags: Map[String, String]): Unit = { + val mlFlowRunIdAndConfig = + AutoMlPipelineMlFlowUtils.getMainConfigByPipelineId(pipelineId) + if (mlFlowRunIdAndConfig.mainConfig.mlFlowLoggingFlag) { + val mlflowTracker = MLFlowTracker( + mlFlowRunIdAndConfig.mainConfig + ) + val client = mlflowTracker.getMLFlowClient + // Delete a tag first + try { + mlflowTracker + .deleteCustomTags( + client, + mlFlowRunIdAndConfig.mlFlowRunId, + tags.keys.toSet.toSeq + ) + } catch { + case ex: org.mlflow.tracking.MlflowHttpException => { + logger.debug(s"MlFlow Tag deletion failed: ${ex.getBodyMessage}") + } + } + //Create a new tag + mlflowTracker + .logCustomTags(client, mlFlowRunIdAndConfig.mlFlowRunId, tags) + } + } + + def getMlFlowTagByKey(client: MlflowClient, runId: String, tag: String): String = { + client.getRun(runId) + .getData.getTagsList + .toArray() + .map(item => item.asInstanceOf[Service.RunTag]) + .filter(item => item.getKey.equals(tag)) + .head.getValue + } + + def getPipelinePathByRunId(runId: String, + loggingConfig: Option[LoggingConfig] = None, + mainConfig: Option[MainConfig] = None): String = { + try { + if (loggingConfig.isDefined) { + val client = MLFlowTracker(MLFlowConfig( + loggingConfig.get.mlFlowTrackingURI, + loggingConfig.get.mlFlowExperimentName, + loggingConfig.get.mlFlowAPIToken, + loggingConfig.get.mlFlowModelSaveDirectory, + loggingConfig.get.mlFlowLoggingMode, + loggingConfig.get.mlFlowBestSuffix, + loggingConfig.get.mlFlowCustomRunTags + 
)).getMLFlowClient + getMlFlowTagByKey(client, runId, PipelineMlFlowTagKeys.PIPELINE_MODEL_SAVE_PATH_KEY) + } else if (mainConfig.isDefined) { + val client = MLFlowTracker(mainConfig.get) + .getMLFlowClient + getMlFlowTagByKey(client, runId, PipelineMlFlowTagKeys.PIPELINE_MODEL_SAVE_PATH_KEY) + } else { + val client = MLFlowTracker(runId) + .getMLFlowClient + getMlFlowTagByKey(client, runId, PipelineMlFlowTagKeys.PIPELINE_MODEL_SAVE_PATH_KEY) + } + } catch { + case e: Exception => { + throw new RuntimeException( + s"Exception in fetching Pipeline model path by MlFlow Run ID $runId", + e + ) + } + } + } + + def saveInferencePipelineDfAndLogToMlFlow(pipelineId: String, + decidedModel: String, + modelFamily: String, + mlFlowModelSaveDirectory: String, + finalPipelineModel: PipelineModel, + originalDf: DataFrame): Unit = { + val mlFlowRunIdAndConfig = getMainConfigByPipelineId(pipelineId) + if (mlFlowRunIdAndConfig.mainConfig.mlFlowLoggingFlag) { + // Log inference pipeline stages' names to MLFlow + saveAllPipelineStagesToMlFlow( + pipelineId, + finalPipelineModel, + mlFlowRunIdAndConfig.mainConfig + ) + // Save Pipeline and log to MlFlow + val modelDescriptor = s"$decidedModel" + "_" + s"$modelFamily" + val baseDirectory = Paths.get(s"$mlFlowModelSaveDirectory/BestRun/") + val pipelineDir = + s"$baseDirectory${modelDescriptor}_${mlFlowRunIdAndConfig.mlFlowRunId}/BestPipeline/" + val finalPipelineSavePath = Paths.get(pipelineDir).toString + logger.info( + s"Saving pipeline id $pipelineId to path $finalPipelineSavePath" + ) + finalPipelineModel.save(finalPipelineSavePath) + logger.info( + s"Saved pipeline id $pipelineId to path $finalPipelineSavePath" + ) + logTagsToMlFlow( + pipelineId, + Map( + PipelineMlFlowTagKeys.PIPELINE_MODEL_SAVE_PATH_KEY -> finalPipelineSavePath + ) + ) + // Save TrainingDf and log to MlFlow + val trainDfBaseDirectory = + Paths.get(s"$mlFlowModelSaveDirectory/FeatureEngineeredDataset/") + val trainDfDir = + 
s"$trainDfBaseDirectory${modelDescriptor}_${mlFlowRunIdAndConfig.mlFlowRunId}/data/" + val finalFeatEngDfPath = Paths.get(trainDfDir).toString + finalPipelineModel + .transform(originalDf) + .write + .mode("overwrite") + .format("delta") + .save(finalFeatEngDfPath) + logger.info(s"Saved feature engineered df to path $finalFeatEngDfPath") + logTagsToMlFlow( + pipelineId, + Map( + PipelineMlFlowTagKeys.PIPELINE_TRAIN_DF_PATH_KEY -> finalFeatEngDfPath + ) + ) + } + } + + private def saveAllPipelineStagesToMlFlow(pipelineId: String, + finalPipelineModel: PipelineModel, + mainConfig: MainConfig): Unit = { + val finalPipelineStages = + if (mainConfig.geneticConfig.trainSplitMethod == "kSample") { + val ksamplerStagesPipelineHolder = "KSAMPLER_STAGER_PLACEHOLDER" + val ksamplerPipelineStages = PipelineStateCache + .getFromPipelineByIdAndKey( + pipelineId, + PipelineVars.KSAMPLER_STAGES.key + ) + .asInstanceOf[String] + // Interpolate to enter ksampler pipeline stages just before the modeling stage + // to make sure pipeline stages are stringified in the order of their execution + finalPipelineModel.stages + .map(item => { + if (item.isInstanceOf[PredictionModel[_, _]]) { + ksamplerStagesPipelineHolder + ", \n" + item.getClass.getName + } else { + item.getClass.getName + } + }) + .mkString(", \n") + .replace(ksamplerStagesPipelineHolder, ksamplerPipelineStages) + } else { + finalPipelineModel.stages.map(_.getClass.getName).mkString(", \n") + } + AutoMlPipelineMlFlowUtils + .logTagsToMlFlow( + pipelineId, + Map(s"All_Stages_For_Pipeline_${pipelineId}" -> finalPipelineStages) + ) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/AutomationTools.scala b/src/main/scala/com/databricks/labs/automl/utils/AutomationTools.scala new file mode 100644 index 00000000..db79b810 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/AutomationTools.scala @@ -0,0 +1,281 @@ +package com.databricks.labs.automl.utils + +import 
com.databricks.labs.automl.inference.{ + InferenceDataConfig, + InferenceSwitchSettings +} +import com.databricks.labs.automl.params.{ + GenerationalReport, + GenericModelReturn, + MLPCConfig, + MainConfig +} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.storage.StorageLevel +import org.json4s.{Formats, _} +import org.json4s.jackson.Serialization +import org.json4s.jackson.Serialization.writePretty + +import scala.collection.mutable.{ArrayBuffer, ListBuffer} + +trait AutomationTools extends SparkSessionWrapper { + + def extractPayload(cc: Product): Map[String, Any] = { + val values = cc.productIterator + cc.getClass.getDeclaredFields.map { + _.getName -> (values.next() match { + case p: Product if p.productArity > 0 => extractPayload(p) + case x => x + }) + }.toMap + } + + def extractMLPCPayload(payload: MLPCConfig): Map[String, Any] = { + + Map( + "layers" -> payload.layers.mkString(","), + "maxIter" -> payload.maxIter, + "solver" -> payload.solver, + "stepSize" -> payload.stepSize, + "tolerance" -> payload.tolerance + ) + + } + + def extractGenerationData( + payload: Array[GenericModelReturn] + ): Map[Int, (Double, Double)] = { + + val scoreBuffer = new ListBuffer[(Int, Double)] + payload.foreach { x => + scoreBuffer += ((x.generation, x.score)) + } + scoreBuffer.groupBy(_._1).map { + case (k, v) => + val values = v.map(_._2) + val mean = values.sum / values.size + val res = values.map(x => scala.math.pow(x - mean, 2)) + k -> (mean, scala.math.sqrt(res.sum / res.size)) + } + } + + def dataPersist(preDF: DataFrame, + postDF: DataFrame, + cacheLevel: StorageLevel, + blockUnpersist: Boolean): (DataFrame, String) = { + + postDF.persist(cacheLevel) + val newDFRowCount = s"Row count of data: ${postDF.count()}" + preDF.unpersist(blockUnpersist) + (postDF, newDFRowCount) + } + + def fieldRemovalCompare(preFilterFields: Array[String], + postFilterFields: Array[String]): List[String] = { + + 
preFilterFields.toList.filterNot(postFilterFields.toList.contains(_)) + + } + + def extractGenerationalScores( + payload: Array[GenericModelReturn], + scoringOptimizationStrategy: String, + modelFamily: String, + modelType: String + ): Array[GenerationalReport] = { + + val uniqueGenerations = payload + .map(x => x.generation) + .toList + .foldLeft(Nil: List[Int]) { (curr, next) => + if (curr contains next) curr else next :: curr + } + .sortWith(_ < _) + + val outputPayload = new ArrayBuffer[GenerationalReport] + + val generationScoringData = extractGenerationData(payload) + + for (g <- uniqueGenerations) { + + val generationData = payload.filter(_.generation == g) + + val generationSummaryData = generationScoringData(g) + + val bestGenerationRun = scoringOptimizationStrategy match { + case "maximize" => generationData.sortWith(_.score > _.score)(0) + case "minimize" => generationData.sortWith(_.score < _.score)(0) + case _ => + throw new UnsupportedOperationException( + s"Optimization Strategy $scoringOptimizationStrategy is not supported." + ) + } + + val bestModel = bestGenerationRun.model + val bestParams = bestGenerationRun.hyperParams + val bestScores = bestGenerationRun.metrics + + outputPayload += GenerationalReport( + modelFamily = modelFamily, + modelType = modelType, + generation = g, + generationMeanScore = generationSummaryData._1, + generationStddevScore = generationSummaryData._2 + ) + } + scoringOptimizationStrategy match { + case "maximize" => + outputPayload.toArray.sortWith( + _.generationMeanScore > _.generationMeanScore + ) + case "minimize" => + outputPayload.toArray.sortWith( + _.generationMeanScore < _.generationMeanScore + ) + case _ => + throw new UnsupportedOperationException( + s"Optimization Strategy $scoringOptimizationStrategy is not supported." 
+ ) + } + } + + def generationDataFrameReport(generationalData: Array[GenerationalReport], + sortingStrategy: String): DataFrame = { + + import spark.sqlContext.implicits._ + + val rawDf = spark.sparkContext + .parallelize(generationalData) + .toDF( + "model_family", + "model_type", + "generation", + "generation_mean_score", + "generation_std_dev_score" + ) + + sortingStrategy match { + case "maximize" => rawDf.orderBy(col("generation_mean_score").desc) + case "minimize" => rawDf.orderBy(col("generation_mean_score").asc) + } + + } + + def printSchema(df: DataFrame, dataName: String): String = { + + s"Schema for $dataName is: \n ${df.schema.fieldNames.mkString(", ")}" + + } + + def printSchema(schema: Array[String], dataName: String): String = { + + s"Schema for $dataName is: \n ${schema.mkString(", ")}" + + } + + def trainSplitValidation(trainSplitMethod: String, + modelSelection: String): String = { + + modelSelection match { + case "regressor" => + trainSplitMethod match { + case "stratified" => + println( + "[WARNING] Stratified Method is NOT ALLOWED on Regressors. Setting to Random." + ) + "random" + case _ => trainSplitMethod + } + case _ => trainSplitMethod + + } + + } + + /** + * Single-pass method for recording all switch settings to the InferenceConfig Object. 
+ * @param config MainConfig used for starting the training AutoML run + */ + def recordInferenceSwitchSettings( + config: MainConfig + ): InferenceSwitchSettings = { + + // Set the switch settings + InferenceSwitchSettings( + naFillFlag = config.naFillFlag, + varianceFilterFlag = config.varianceFilterFlag, + outlierFilterFlag = config.outlierFilterFlag, + pearsonFilterFlag = config.pearsonFilteringFlag, + covarianceFilterFlag = config.covarianceFilteringFlag, + oneHotEncodeFlag = config.oneHotEncodeFlag, + scalingFlag = config.scalingFlag, + featureInteractionFlag = config.featureInteractionFlag + ) + } + + /** + * Helper method for removing any of the mutators that have occurred during pre-processing of the field types + * @param names Array: The collection of column names from the DataFrame immediately after data pre-processing + * tasks of type validation and conversion. + * @since 0.5.1 + * @author Ben Wilson, Databricks + * @return A wrapped Array of distinct field names to use for reproducibility of the model for inference runs + * that is cleaned of the _si or _oh suffixes as a result of feature engineering tasks. + */ + private[utils] def cleanFieldNames(names: Array[String]): Array[String] = { + + names.map { x => + x.takeRight(3) match { + case "_si" => x.dropRight(3) + case "_oh" => x.dropRight(3) + case _ => x + } + }.distinct + + } + + /** + * Helper method for generating the Inference Config object for the data configuration steps needed to + * reproduce the modeling for subsequent inference runs. + * @param config The full main Config that is utilized for the execution of the run. + * @param startingFields The fields that are returned from type casting and validation (may contain artificial + * suffixes for StringIndexer (_si) and OneHotEncoder (_oh)). These will be removed before + * recording. 
+ * @since 0.4.0 + * @return an instance of InferenceDataConfig + */ + def recordInferenceDataConfig( + config: MainConfig, + startingFields: Array[String] + ): InferenceDataConfig = { + + // Strip out any of the trailing encoding modifications that may have been done to the starting fields. + + val cleanedStartingFields = cleanFieldNames(startingFields) + + InferenceDataConfig( + labelCol = config.labelCol, + featuresCol = config.featuresCol, + startingColumns = cleanedStartingFields, + fieldsToIgnore = config.fieldsToIgnoreInVector, + dateTimeConversionType = config.dateTimeConversionType + ) + + } + + /** + * Provide a human-readable report to stdout and to the logs that shows the configuration for a model run + * with the key -> value relationship shown as json + * @param config AnyRef -> a defined case class + * @return String in the form of pretty print syntax + */ + def prettyPrintConfig(config: AnyRef): String = { + + implicit val formats: Formats = + Serialization.formats(hints = FullTypeHints(List(config.getClass))) + writePretty(config) + .replaceAll(": \\{", "\\{") + .replaceAll(":", "->") + } +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/ConfigParser.scala b/src/main/scala/com/databricks/labs/automl/utils/ConfigParser.scala new file mode 100644 index 00000000..ee8c08ff --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/ConfigParser.scala @@ -0,0 +1,25 @@ +package com.databricks.labs.automl.utils + +import com.databricks.labs.automl.executor.config.{ + ConfigurationDefaults, + InstanceConfig +} +import com.databricks.labs.automl.params.{Defaults, MainConfig} + +//object ConfigParser extends Defaults with ConfigurationDefaults with InferenceTools { +//// Holding +//} + +object DefaultConfigAccessor extends Defaults { + + def getMainConfig: MainConfig = _mainConfigDefaults + +} + +object DefaultInstanceConfigAccessor extends ConfigurationDefaults { + + def getInstanceConfig(modelFamily: String, + predictionType:
String): InstanceConfig = + getDefaultConfig(modelFamily, predictionType) + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/DBUtilsHelper.scala b/src/main/scala/com/databricks/labs/automl/utils/DBUtilsHelper.scala new file mode 100644 index 00000000..30316174 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/DBUtilsHelper.scala @@ -0,0 +1,86 @@ +package com.databricks.labs.automl.utils +import com.databricks.dbutils_v1.DBUtilsHolder.dbutils0 +import org.apache.log4j.Logger +import org.apache.spark.sql.SparkSession + +import scala.reflect.runtime.universe._ + +/** + * Reflection Object brought to you courtesy of Jas Bali + */ +object DBUtilsHelper { + + private val logger: Logger = Logger.getLogger(this.getClass) + + private val ERROR_RETURN = "NA" + + protected def reflectedDBUtilsMethod(methodName: String): Array[String] = { + Array( + dbutils0 + .get() + .notebook + .getContext() + .getClass + .getMethods + .map(_.getName) + .filter(_.equals(methodName)) + ).head + } + protected def hijackProtectedMethods(methodName: String): String = { + val ctx = dbutils0.get().notebook.getContext() + val mirrorContext = runtimeMirror(getClass.getClassLoader) + .reflect(dbutils0.get().notebook.getContext()) + val result = reflectedDBUtilsMethod(methodName) + .map(x => { + scala.reflect.runtime.universe + .typeOf[ctx.type] + .decl(TermName(x)) + .asMethod + }) + .map(mirrorContext.reflectMethod(_).apply()) + result(0).asInstanceOf[Option[_]].get.toString + } + + private def wrapWithException(methodName: String): String = { + try { + if(!isLocalSparkSession) { + return hijackProtectedMethods(methodName) + } + } catch { + case e: Exception => { + logger.debug(s"Method name $methodName not present on dbutils") + } + } + ERROR_RETURN + } + + /** + * Gets the current running notebook path + * @return + */ + def getNotebookPath: String = { + wrapWithException("notebookPath") + } + + def getNotebookDirectory: String = { + val notebookPath = 
getNotebookPath + val notebookPathFinal = if(!notebookPath.equals(ERROR_RETURN)) notebookPath.substring(0, notebookPath.lastIndexOf("/")) + "/" else ERROR_RETURN + notebookPathFinal + } + + def getTrackingURI: String = { + wrapWithException("apiUrl") + } + + def getAPIToken: String = { + wrapWithException("apiToken") + } + + def isLocalSparkSession: Boolean = { + SparkSession + .builder() + .getOrCreate() + .sparkContext + .isLocal + } +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/DataValidation.scala b/src/main/scala/com/databricks/labs/automl/utils/DataValidation.scala new file mode 100644 index 00000000..0cf8072c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/DataValidation.scala @@ -0,0 +1,223 @@ +package com.databricks.labs.automl.utils + +import org.apache.log4j.Logger +import org.apache.spark.ml.feature.{ + OneHotEncoderEstimator, + StringIndexer, + VectorAssembler +} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +import scala.collection.mutable.ListBuffer +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool + +trait DataValidation { + + def _allowableDateTimeConversions = List("unix", "split") + def _allowableCategoricalFilterModes = List("silent", "warn") + def _allowableCardinalilties = List("approx", "exact") + + @transient lazy private val logger: Logger = Logger.getLogger(this.getClass) + + def invalidateSelection(value: String, allowances: Seq[String]): String = { + s"${allowances.foldLeft("")((a, b) => a + " " + b)}" + } + + def oneHotEncodeStrings( + stringIndexedFields: List[String] + ): (OneHotEncoderEstimator, Array[String]) = { + + var encodedColumns = new ListBuffer[String] + var oneHotEncoders = new ListBuffer[OneHotEncoderEstimator] + + stringIndexedFields.foreach { x => + encodedColumns += x.dropRight(3) + "_oh" + } + + val oneHotEncodeObj = new OneHotEncoderEstimator() + .setHandleInvalid("keep") + 
.setInputCols(stringIndexedFields.toArray) + .setOutputCols(encodedColumns.result.toArray) + + (oneHotEncodeObj, encodedColumns.result.toArray) + + } + + def indexStrings( + categoricalFields: List[String] + ): (Array[StringIndexer], Array[String]) = { + + var indexedColumns = new ListBuffer[String] + var stringIndexers = new ListBuffer[StringIndexer] + + categoricalFields.map(x => { + val stringIndexedColumnName = x + "_si" + val stringIndexerObj = new StringIndexer() + .setHandleInvalid("keep") + .setInputCol(x) + .setOutputCol(stringIndexedColumnName) + indexedColumns += stringIndexedColumnName + stringIndexers += stringIndexerObj + }) + + (stringIndexers.result.toArray, indexedColumns.result.toArray) + + } + + private def splitDateTimeParts( + df: DataFrame, + dateFields: List[String], + timeFields: List[String] + ): (DataFrame, List[String]) = { + + var resultFields = new ListBuffer[String] + + var data = df + dateFields.map(x => { + data = data + .withColumn(x + "_year", year(col(x))) + .withColumn(x + "_month", month(col(x))) + .withColumn(x + "_day", dayofmonth(col(x))) + resultFields ++= List(x + "_year", x + "_month", x + "_day") + }) + timeFields.map(x => { + data = data + .withColumn(x + "_year", year(col(x))) + .withColumn(x + "_month", month(col(x))) + .withColumn(x + "_day", dayofmonth(col(x))) + .withColumn(x + "_hour", hour(col(x))) + .withColumn(x + "_minute", minute(col(x))) + .withColumn(x + "_second", second(col(x))) + resultFields ++= List( + x + "_year", + x + "_month", + x + "_day", + x + "_hour", + x + "_minute", + x + "_second" + ) + }) + + (data, resultFields.result) + + } + + private def convertToUnix( + df: DataFrame, + dateFields: List[String], + timeFields: List[String] + ): (DataFrame, List[String]) = { + + var resultFields = new ListBuffer[String] + + var data = df + + dateFields.map(x => { + data = data.withColumn(x + "_unix", unix_timestamp(col(x)).cast("Double")) + resultFields += x + "_unix" + }) + + timeFields.map(x => { + data 
= data.withColumn(x + "_unix", unix_timestamp(col(x)).cast("Double")) + resultFields += x + "_unix" + }) + + (data, resultFields.result) + + } + + def convertDateAndTime(df: DataFrame, + dateFields: List[String], + timeFields: List[String], + mode: String): (DataFrame, List[String]) = { + + val (data, fieldList) = mode match { + case "split" => splitDateTimeParts(df, dateFields, timeFields) + case "unix" => convertToUnix(df, dateFields, timeFields) + } + + (data, fieldList) + + } + + def generateAssembly( + numericColumns: List[String], + characterColumns: List[String], + featureCol: String + ): (Array[StringIndexer], Array[String], VectorAssembler) = { + + val assemblerColumns = new ListBuffer[String] + numericColumns.map(x => assemblerColumns += x) + + val (indexers, indexedColumns) = indexStrings(characterColumns) + indexedColumns.map(x => assemblerColumns += x) + + val assembledColumns = assemblerColumns.result.toArray + + val assembler = new VectorAssembler() + .setInputCols(assembledColumns) + .setOutputCol(featureCol) + + (indexers, assembledColumns, assembler) + } + + def validateLabelAndFeatures(df: DataFrame, + labelCol: String, + featureCol: String): Unit = { + val dfSchema = df.schema + assert( + dfSchema.fieldNames.contains(labelCol), + s"Dataframe does not contain label column named: $labelCol" + ) + assert( + dfSchema.fieldNames.contains(featureCol), + s"Dataframe does not contain features column named: $featureCol" + ) + } + + def validateFieldPresence(df: DataFrame, column: String): Unit = { + val dfSchema = df.schema + assert( + dfSchema.fieldNames.contains(column), + s"Dataframe does not contain column named: '$column'" + ) + } + + def validateInputDataframe(df: DataFrame): Unit = { + require(df != null, "Input dataset cannot be null") + require(df.count() > 0, "Input dataset cannot be empty") + } + + def validateCardinality(df: DataFrame, + stringFields: List[String], + cardinalityLimit: Int = 500, + parallelism: Int = 20): 
ValidatedCategoricalFields = { + + var validStringFields = ListBuffer[String]() + var invalidStringFields = ListBuffer[String]() + + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(parallelism)) + val collection = stringFields.par + collection.tasksupport = taskSupport + + collection.foreach { x => + val uniqueValues = df.select(x).distinct().count() + if (uniqueValues <= cardinalityLimit) { + validStringFields += x + } else { + invalidStringFields += x + } + } + + ValidatedCategoricalFields( + validStringFields.toList, + invalidStringFields.toList + ) + + } +} + +case class ValidatedCategoricalFields(validFields: List[String], + invalidFields: List[String]) diff --git a/src/main/scala/com/databricks/labs/automl/utils/InitDbUtils.scala b/src/main/scala/com/databricks/labs/automl/utils/InitDbUtils.scala new file mode 100644 index 00000000..c776773b --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/InitDbUtils.scala @@ -0,0 +1,38 @@ +package com.databricks.labs.automl.utils + +import java.nio.file.Paths + +/** + * This util initializes required Dbutils params so that when invoked from + * pyspark, it doesn't result in NPE due to runtime proxy injections + */ +object InitDbUtils { + + case class LogggingConfigType(mlFlowTrackingURI: String, mlFlowExperimentName: String, mlFlowAPIToken: String, mlFlowModelSaveDirectory: String) + + def getNotebookPath: String = DBUtilsHelper.getNotebookPath + def getNotebookDirectory: String = DBUtilsHelper.getNotebookDirectory + def getTrackingURI: String = DBUtilsHelper.getTrackingURI + def getAPIToken: String = DBUtilsHelper.getAPIToken + + def validate(): Unit = { + assert(getNotebookPath != null && !"".equals(getNotebookPath), "NotebookPath cannot be null") + assert(getNotebookDirectory != null && !"".equals(getNotebookDirectory), "NotebookDirectory cannot be null") + assert(getTrackingURI != null && !"".equals(getTrackingURI), "TrackingURI cannot be null") + assert(getAPIToken != null &&
!"".equals(getAPIToken), "APIToken cannot be null") + } + + def getMlFlowLoggingConfig(mlFlowLoggingFlag: Boolean): LogggingConfigType = { + if(mlFlowLoggingFlag) { + validate() + LogggingConfigType( + getTrackingURI, + Paths.get(InitDbUtils.getNotebookDirectory + "/MLFlowLogs" ).toString, + InitDbUtils.getAPIToken, + Paths.get("dbfs:/tmp/automl/AutoML_Artifacts").toString + ) + } else { + LogggingConfigType("http://localhost:5000/", "/tmp/local_mlflow_exp", "", "/tmp/local_mlflow_exp/artifacts") + } + } +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/ModelFamily.scala b/src/main/scala/com/databricks/labs/automl/utils/ModelFamily.scala new file mode 100644 index 00000000..523cdbc4 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/ModelFamily.scala @@ -0,0 +1,13 @@ +package com.databricks.labs.automl.utils + +object ModelFamily extends Enumeration { + + val RANDOM_FOREST = Value("RandomForest") + +} + +object ModelType extends Enumeration { + + val CLASSIFIER = Value("classifier") + +} \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/utils/PipelineMlFlowTagKeys.scala b/src/main/scala/com/databricks/labs/automl/utils/PipelineMlFlowTagKeys.scala new file mode 100644 index 00000000..f87514cc --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/PipelineMlFlowTagKeys.scala @@ -0,0 +1,27 @@ +package com.databricks.labs.automl.utils + +import com.databricks.labs.automl.params.MainConfig + +/** + * @author Jas Bali + * @since 0.6.1 + * Enums for reporting pipeline to MLflow + */ +object PipelineMlFlowTagKeys { + + lazy final val PIPELINE_MODEL_SAVE_PATH_KEY = "BestPipelineModelSavePath" + lazy final val PIPELINE_TRAIN_DF_PATH_KEY = "FeatureEngineeredTrainDfPath" + lazy final val PIPELINE_STATUS = "PipelineExecutionCurrentStatus" + lazy final val PIPELINE_ID = "PipelineExecutionCurrentStatus" +} +object PipelineStatus extends Enumeration { + + type PipelineStatus = PipelineStatusEnum + + val
PIPELINE_STARTED = PipelineStatusEnum("STARTED") + val PIPELINE_RUNNING = PipelineStatusEnum("RUNNING") + val PIPELINE_COMPLETED = PipelineStatusEnum("COMPLETED") + val PIPELINE_FAILED = PipelineStatusEnum("FAILED") + + case class PipelineStatusEnum(key: String) extends Val +} \ No newline at end of file diff --git a/src/main/scala/com/databricks/labs/automl/utils/SchemaUtils.scala b/src/main/scala/com/databricks/labs/automl/utils/SchemaUtils.scala new file mode 100644 index 00000000..2cfd85a9 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/SchemaUtils.scala @@ -0,0 +1,189 @@ +package com.databricks.labs.automl.utils + +import com.databricks.labs.automl.pipeline.PipelineEnums +import com.databricks.labs.automl.utils.structures.{ + FieldDefinitions, + FieldTypes +} +import org.apache.log4j.{Level, Logger} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.types.StructType + +import scala.collection.mutable +import scala.collection.mutable.ListBuffer +import scala.collection.parallel.ForkJoinTaskSupport +import scala.concurrent.forkjoin.ForkJoinPool + +object SchemaUtils { + + private val logger: Logger = Logger.getLogger(this.getClass) + +// private def extractSchema(schema: StructType): List[(DataType, String)] = { +// +// var preParsedFields = new ListBuffer[(DataType, String)] +// +// schema.map(x => preParsedFields += ((x.dataType, x.name))) +// +// preParsedFields.result +// } + + /** + * Method for extracting the data type and field name from the StructType of a DataFrame schema + * @param schema Schema of the DataFrame + * @return Array[FieldDefinitions] with the payload of (dataType: DataType, fieldName: String) + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def extractSchema(schema: StructType): Array[FieldDefinitions] = { + schema.map(x => FieldDefinitions(x.dataType, x.name)).toArray + } + + /** + * Standardized Type Extraction and assignment to different collections for handling of the various primitive 
types + * @param data DataFrame that is in need of analysis + * @param labelColumn Label Column of the DataFrame + * @return FieldTypes which contains the Lists of the column names based on their collective handling types + * @since 0.1.0 + * @author Ben Wilson, DataBricks + * @throws UnsupportedOperationException if the data type seen in the DataFrame is not currently supported. + */ + @throws(classOf[UnsupportedOperationException]) + def extractTypes( + data: DataFrame, + labelColumn: String, + fieldsToIgnore: Array[String] = Array.empty[String] + ): FieldTypes = { + + val fieldExtraction = extractSchema(data.schema) + .filterNot(x => fieldsToIgnore.contains(x.fieldName)) + + logger.log( + Level.DEBUG, + s"EXTRACT TYPES field listing: ${fieldExtraction.map(x => x.fieldName).mkString(", ")}" + ) + + var categoricalFields = new ListBuffer[String] + var dateFields = new ListBuffer[String] + var timeFields = new ListBuffer[String] + var numericFields = new ListBuffer[String] + var booleanFields = new ListBuffer[String] + + fieldExtraction.map( + x => + x.dataType.typeName match { + case "string" => categoricalFields += x.fieldName + case "integer" => numericFields += x.fieldName + case "double" => numericFields += x.fieldName + case "float" => numericFields += x.fieldName + case "long" => numericFields += x.fieldName + case "byte" => categoricalFields += x.fieldName + case "boolean" => booleanFields += x.fieldName + case "binary" => booleanFields += x.fieldName + case "date" => dateFields += x.fieldName + case "timestamp" => timeFields += x.fieldName + case z if z.take(7) == "decimal" => numericFields += x.fieldName + case _ => + throw new UnsupportedOperationException( + s"Field '${x.fieldName}' is of type ${x.dataType} (${x.dataType.typeName}) which is not supported." 
+ ) + } + ) + + numericFields -= labelColumn + + FieldTypes( + numericFields.result, + categoricalFields.result, + dateFields.result, + timeFields.result, + booleanFields.result + ) + } + + def isLabelRefactorNeeded(schema: StructType, labelCol: String): Boolean = { + val labelDataType = schema.fields.find(_.name.equals(labelCol)).get.dataType + labelDataType.typeName match { + case "string" => true + case "integer" => false + case "double" => false + case "float" => false + case "long" => false + case "byte" => true + case "boolean" => false + case "binary" => false + case "date" => true + case "timestamp" => true + case z if z.take(7) == "decimal" => true + case _ => + throw new UnsupportedOperationException( + s"Field '$labelCol' is of type $labelDataType, which is not supported." + ) + } + } + + def validateCardinality(df: DataFrame, + stringFields: List[String], + cardinalityLimit: Int = 500, + parallelism: Int = 20): ValidatedCategoricalFields = { + + var validStringFields = ListBuffer[String]() + var invalidStringFields = ListBuffer[String]() + + val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(parallelism)) + val collection = stringFields.par + collection.tasksupport = taskSupport + + collection.foreach { x => + val uniqueValues = df.select(x).distinct().count() + if (uniqueValues <= cardinalityLimit) { + validStringFields += x + } else { + invalidStringFields += x + } + } + + ValidatedCategoricalFields( + validStringFields.toList, + invalidStringFields.toList + ) + + } + + def isNotEmpty[A](list: Array[A]): Boolean = { + list != null && list.nonEmpty + } + + def isNotEmpty[A](list: List[A]): Boolean = { + list != null && list.nonEmpty + } + + def isEmpty[A](list: Array[A]): Boolean = { + list == null || list.isEmpty + } + + def generateStringIndexedColumn(columnName: String): String = { + columnName + PipelineEnums.SI_SUFFIX.value + } + + def generateOneHotEncodedColumn(columnName: String): String = { + val oheSuffix = 
PipelineEnums.OHE_SUFFIX.value + if (columnName.endsWith(PipelineEnums.SI_SUFFIX.value)) { + columnName.dropRight(3) + oheSuffix + } else { + columnName + oheSuffix + } + } + + def generateMapFromKeysValues[T](keys: Array[String], + values: Array[T]): Map[String, T] = { + assert( + keys.length == values.length, + "Keys and Values lists cannot be different in size" + ) + var map = mutable.Map[String, T]() + for ((key, i) <- keys.view.zipWithIndex) { + map += (key -> values(i)) + } + map.toMap + } +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/SeedConverters.scala b/src/main/scala/com/databricks/labs/automl/utils/SeedConverters.scala new file mode 100644 index 00000000..de855626 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/SeedConverters.scala @@ -0,0 +1,147 @@ +package com.databricks.labs.automl.utils + +import com.databricks.labs.automl.params._ + +import scala.collection.mutable.ListBuffer + +trait SeedConverters { + + def generateXGBoostConfig(configMap: Map[String, Any]): XGBoostConfig = { + XGBoostConfig( + alpha = configMap("alpha").asInstanceOf[String].toDouble, + eta = configMap("eta").asInstanceOf[String].toDouble, + gamma = configMap("gamma").asInstanceOf[String].toDouble, + lambda = configMap("lambda").asInstanceOf[String].toDouble, + maxDepth = configMap("maxDepth").asInstanceOf[String].toInt, + maxBins = configMap("maxBins").asInstanceOf[String].toInt, + subSample = configMap("subSample").asInstanceOf[String].toDouble, + minChildWeight = configMap("minChildWeight").asInstanceOf[String].toDouble, + numRound = configMap("numRound").asInstanceOf[String].toInt, + trainTestRatio = configMap("trainTestRatio").asInstanceOf[String].toDouble + ) + } + + def generateRandomForestConfig( + configMap: Map[String, Any] + ): RandomForestConfig = { + RandomForestConfig( + numTrees = configMap("numTrees").asInstanceOf[String].toInt, + impurity = configMap("impurity").asInstanceOf[String], + maxBins = 
configMap("maxBins").asInstanceOf[String].toInt, + maxDepth = configMap("maxDepth").asInstanceOf[String].toInt, + minInfoGain = configMap("minInfoGain").asInstanceOf[String].toDouble, + subSamplingRate = + configMap("subSamplingRate").asInstanceOf[String].toDouble, + featureSubsetStrategy = + configMap("featureSubsetStrategy").asInstanceOf[String] + ) + } + + def generateMLPCConfig(configMap: Map[String, Any]): MLPCConfig = { + + var layers = ListBuffer[Int]() + val stringLayers = configMap("layers").asInstanceOf[Array[String]] + stringLayers.foreach { x => + layers += x.toInt + } + + MLPCConfig( + layers = layers.result.toArray, + maxIter = configMap("maxIter").asInstanceOf[String].toInt, + solver = configMap("solver").asInstanceOf[String], + stepSize = configMap("stepSize").asInstanceOf[String].toDouble, + tolerance = configMap("tolerance").asInstanceOf[String].toDouble + ) + } + + def generateTreesConfig(configMap: Map[String, Any]): TreesConfig = { + TreesConfig( + impurity = configMap("impurity").asInstanceOf[String], + maxBins = configMap("maxBins").asInstanceOf[String].toInt, + maxDepth = configMap("maxDepth").asInstanceOf[String].toInt, + minInfoGain = configMap("minInfoGain").asInstanceOf[String].toDouble, + minInstancesPerNode = + configMap("minInstancesPerNode").asInstanceOf[String].toInt + ) + } + + def generateGBTConfig(configMap: Map[String, Any]): GBTConfig = { + GBTConfig( + impurity = configMap("impurity").asInstanceOf[String], + lossType = configMap("lossType").asInstanceOf[String], + maxBins = configMap("maxBins").asInstanceOf[String].toInt, + maxDepth = configMap("maxDepth").asInstanceOf[String].toInt, + maxIter = configMap("maxIter").asInstanceOf[String].toInt, + minInfoGain = configMap("minInfoGain").asInstanceOf[String].toDouble, + minInstancesPerNode = + configMap("minInstancesPerNode").asInstanceOf[String].toInt, + stepSize = configMap("stepSize").asInstanceOf[String].toDouble + ) + } + + def generateLogisticRegressionConfig( + configMap: 
Map[String, Any] + ): LogisticRegressionConfig = { + LogisticRegressionConfig( + elasticNetParams = + configMap("elasticNetParams").asInstanceOf[String].toDouble, + fitIntercept = configMap("fitIntercept").asInstanceOf[String].toBoolean, + maxIter = configMap("maxIter").asInstanceOf[String].toInt, + regParam = configMap("regParam").asInstanceOf[String].toDouble, + standardization = + configMap("standardization").asInstanceOf[String].toBoolean, + tolerance = configMap("tolerance").asInstanceOf[String].toDouble + ) + } + + def generateLinearRegressionConfig( + configMap: Map[String, Any] + ): LinearRegressionConfig = { + LinearRegressionConfig( + elasticNetParams = + configMap("elasticNetParams").asInstanceOf[String].toDouble, + fitIntercept = configMap("fitIntercept").asInstanceOf[String].toBoolean, + loss = configMap("loss").asInstanceOf[String], + maxIter = configMap("maxIter").asInstanceOf[String].toInt, + regParam = configMap("regParam").asInstanceOf[String].toDouble, + standardization = + configMap("standardization").asInstanceOf[String].toBoolean, + tolerance = configMap("tolerance").asInstanceOf[String].toDouble + ) + } + + def generateSVMConfig(configMap: Map[String, Any]): SVMConfig = { + SVMConfig( + fitIntercept = configMap("fitIntercept").asInstanceOf[String].toBoolean, + maxIter = configMap("maxIter").asInstanceOf[String].toInt, + regParam = configMap("regParam").asInstanceOf[String].toDouble, + standardization = + configMap("standardization").asInstanceOf[String].toBoolean, + tolerance = configMap("tolerance").asInstanceOf[String].toDouble + ) + } + + def generateLightGBMConfig(configMap: Map[String, Any]): LightGBMConfig = { + LightGBMConfig( + baggingFraction = + configMap("baggingFraction").asInstanceOf[String].toDouble, + baggingFreq = configMap("baggingFreq").asInstanceOf[String].toInt, + featureFraction = + configMap("featureFraction").asInstanceOf[String].toDouble, + learningRate = configMap("learningRate").asInstanceOf[String].toDouble, + 
maxBin = configMap("maxBin").asInstanceOf[String].toInt, + maxDepth = configMap("maxDepth").asInstanceOf[String].toInt, + minSumHessianInLeaf = + configMap("minSumHessianInLeaf").asInstanceOf[String].toDouble, + numIterations = configMap("numIterations").asInstanceOf[String].toInt, + numLeaves = configMap("numLeaves").asInstanceOf[String].toInt, + boostFromAverage = + configMap("boostFromAverage").asInstanceOf[String].toBoolean, + lambdaL1 = configMap("lambdaL1").asInstanceOf[String].toDouble, + lambdaL2 = configMap("lambdaL2").asInstanceOf[String].toDouble, + alpha = configMap("alpha").asInstanceOf[String].toDouble, + boostingType = configMap("boostingType").asInstanceOf[String] + ) + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/SparkSessionWrapper.scala b/src/main/scala/com/databricks/labs/automl/utils/SparkSessionWrapper.scala new file mode 100644 index 00000000..766e073c --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/SparkSessionWrapper.scala @@ -0,0 +1,15 @@ +package com.databricks.labs.automl.utils + +import org.apache.spark.SparkContext +import org.apache.spark.sql.SparkSession + +trait SparkSessionWrapper extends Serializable { + + lazy val spark: SparkSession = SparkSession + .builder() + .appName("Databricks Automated ML") + .getOrCreate() + + lazy val sc: SparkContext = SparkContext.getOrCreate() + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/WorkspaceDirectoryValidation.scala b/src/main/scala/com/databricks/labs/automl/utils/WorkspaceDirectoryValidation.scala new file mode 100644 index 00000000..f95909f1 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/WorkspaceDirectoryValidation.scala @@ -0,0 +1,137 @@ +package com.databricks.labs.automl.utils + +import scala.sys.process._ + +/** + * Class for performing pre-check validation of mlflow working directories by interfacing (safely) with the + * Workspace API. 
+ * Without performing these checks, on a sufficiently large and complex run, if the MlFlow logging project + * directory does not yet exist in the Workspace, the job will fail to log to MlFlow. + * Accessing the apply method on the object will: + * 1. Check if the directory exists. If it does, return Boolean true. + * 2. If the path does not exist, attempt to make a mkdir POST to recursively create the pathing to the + * target directory in the Workspace. + * 3. Re-validate that the directory has been created and is set up correctly. There is a geometric back-off + * sleep statement to make sure that there is a pause between requests to ensure that, if the REST service + * is a bit overloaded, there is enough time to get the successful return confirmation of directory creation. + * @param apiURL The shard URL + * @param apiToken The user-specified token from the notebook context for authorization validation. + */ +class WorkspaceDirectoryValidation(apiURL: String, + apiToken: String, + path: String) { + + final private val statusAPI = s"$apiURL/api/2.0/workspace/get-status" + final private val mkdirAPI = s"$apiURL/api/2.0/workspace/mkdirs" + final private val header = s"Authorization: Bearer $apiToken" + final private val baseCurl = Seq("curl", "-H", header, "-X") + private val directoryMatch = "(\\/\\w+$)".r + final private val adjustedPath = directoryMatch.replaceFirstIn(path, "") + + /** + * Private method for generating the REST body statement for both requests. + * @param adjPath String path in the Workspace for where to store the experimental results + * @return The body statement + */ + private def createPathBody(adjPath: String): String = + s""" + |{ + | "path": "$adjPath" + |} + """.stripMargin + + /** + * Private method for executing a recursive mkdir command to the Workspace + * @param adjPath The path in the Workspace to create.
+ * @return REST return statement (should be empty JSON) + */ + private def createDir(adjPath: String): String = { + val createCall = baseCurl ++ Seq( + "POST", + mkdirAPI, + "-d", + createPathBody(adjPath) + ) + // Eat the stdout nonsense from the REST API call + val buffer = new StringBuffer() + createCall.lineStream_!(ProcessLogger(buffer append _)).toString() + } + + /** + * Helper method for performing a geometric-back-off sleep based on the effective retry policy. + * + * @example val waitTimes = (1 to 6).map(x => geomSleep(x, 1000)) + * waitTimes: scala.collection.immutable.IndexedSeq[Int] = Vector(1000, 1617, 3344, 6834, 13334, 24790) + * @param counter the iteration of retry + * @param pauseTime The amount of base wait time to apply for a back-off calculation. + */ + private def geomSleep(counter: Int, pauseTime: Int): Unit = { + val sleepTime = scala.math + .ceil(pauseTime * scala.math.pow(counter, scala.math.log(counter))) + .toInt + Thread.sleep(sleepTime) + } + + /** + * Main method for checking whether the mlflow path exists to log run results to and if it does not, + * attempts to create it as specified by the configuration. + * @param cnt Loop counter (used in the recursive call) + * @return Boolean: true if directory exists. + */ + def validate(cnt: Int = 0): Boolean = { + + var attemptCounter = cnt + + val statusCall = baseCurl ++ Seq( + "GET", + statusAPI, + "-d", + createPathBody(adjustedPath) + ) + + val statusBuffer = new StringBuffer() + val statusReturn = + statusCall.lineStream_!(ProcessLogger(statusBuffer append _)).toString() + + val statusAnswer = try { + statusReturn.split("\"")(1) + } catch { + case e: java.lang.ArrayIndexOutOfBoundsException => + println( + s"The directory that you are attempting to log mlflow results to in your Workspace does not have " + + s"the correct permissions for your account to create this directory. Please provide a valid location " + + s"in the Workspace. 
Invalid access for path: $adjustedPath" + ) + println(s"\n\n ${e.printStackTrace()}") + throw e + } + + statusAnswer match { + case "error_code" => + attemptCounter += 1 + if (attemptCounter < 6) { + createDir(adjustedPath) + + geomSleep(attemptCounter, 1000) + + validate(attemptCounter) + + } else { + throw new RuntimeException( + s"Unable to validate or create Workspace path to $adjustedPath. Ensure permissions" + + s" are sufficient to have write access to Workspace Location. " + + s"\n\nSee: https://docs.databricks.com/user-guide/workspace.html for further information." + ) + } + case _ => true + } + } + +} + +object WorkspaceDirectoryValidation { + + def apply(apiURL: String, apiToken: String, path: String): Boolean = + new WorkspaceDirectoryValidation(apiURL, apiToken, path).validate() + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/data/CategoricalHandler.scala b/src/main/scala/com/databricks/labs/automl/utils/data/CategoricalHandler.scala new file mode 100644 index 00000000..f12cd7fd --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/data/CategoricalHandler.scala @@ -0,0 +1,54 @@ +package com.databricks.labs.automl.utils.data + +import org.apache.spark.sql.DataFrame + +class CategoricalHandler(data: DataFrame, mode: String = "silent") { + + final val CATEGORICAL_MODES = List("silent", "warn") + final val CARDINALITIES = List("approx", "exact") + + private var _cardinalityType: String = "exact" + private var _precision = 0.05 + + def setCardinalityType(value: String): this.type = { + require( + CARDINALITIES.contains(value), + s"Specified cardinality type $value is not a member of ${CARDINALITIES + .mkString(", ")}" + ) + _cardinalityType = value + this + } + + def setPrecision(value: Double): this.type = { + require(value >= 0.0, s"Precision must be greater than or equal to 0.") + require(value <= 1.0, s"Precision must be less than or equal to 1.") + _precision = value + this + } + + def validateCategoricalFields(fields: 
List[String], + cardinalityLimit: Int): Array[String] = { + + mode match { + case "silent" => + FieldValidation.restrictFields( + data, + fields.toArray, + _cardinalityType, + cardinalityLimit.toLong, + _precision + ) + case _ => + FieldValidation.confirmCardinalityCheck( + data, + fields.toArray, + _cardinalityType, + cardinalityLimit.toLong, + _precision + ) + } + + } + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/data/FieldValidation.scala b/src/main/scala/com/databricks/labs/automl/utils/data/FieldValidation.scala new file mode 100644 index 00000000..81b215be --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/data/FieldValidation.scala @@ -0,0 +1,181 @@ +package com.databricks.labs.automl.utils.data + +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StructType + +class FieldValidation(data: DataFrame) { + + final val CARDINALITIES: Array[String] = Array("approx", "exact") + + final val schema: StructType = data.schema + final val fieldNames: Array[String] = schema.names + + /** + * Private method for validating the presence of fields in the DataFrame's schema + * + * @param fields Array[String] a list of the fields to validate against the DataFrame's schema + * @throws IllegalArgumentException if the fields to be tested are not in the DataFrame's schema + * @author Ben Wilson + * @since 0.5.2 + */ + @throws(classOf[IllegalArgumentException]) + private def validateMembership(fields: Array[String]): Unit = { + + fields.foreach( + x => + require( + fieldNames.contains(x), + s"ERROR! Schema does not contain field $x" + ) + ) + } + + /** + * Private method for determining the cardinality for a particular Array of fields + * + * @param fields Array of field names to validate their cardinality before executing a potentially expensive + * operation. + * @param cardinalityType The type of distinct to use for each column. 
[Either "approx" or "exact"] + * @param cardinalityLimit The limit, above which, an exception will be thrown for attempting to send + * a DataFrame with categorical data that exceeds the desired threshold for inclusion in a + * feature vector. + * @param precision For approx distinct, the precision by which the approximate distinct will be performed. + * @throws AssertionError if the cardinality is too high for a field + * @author Ben Wilson + * @since 0.5.2 + */ + @throws(classOf[AssertionError]) + private def checkCardinality(fields: Array[String], + cardinalityType: String, + cardinalityLimit: Long, + precision: Double = 0.05): Unit = { + + fields.foreach { x => + val cardinality = + calculateCardinality(data, x, cardinalityType, precision).rdd + .map(r => r.getLong(0)) + .take(1)(0) + assert( + cardinality <= cardinalityLimit, + s"Field $x has a cardinality of $cardinality which exceeds the " + + s"limit of: $cardinalityLimit" + ) + } + + } + + /** + * Private method for switching between the cardinality methodologies (either exact or approximate) + * + * @param df The Dataframe for which a cardinality will be applied for a particular field + * @param field The field to calculate the cardinality for + * @param cardinalityType The type of cardinality check to use [either approx or exact] + * @param precision The precision for an approx distinct check + * @return The Dataframe (1 row) that contains the cardinality distinct check + * @author Ben Wilson + * @since 0.5.2 + */ + private def calculateCardinality(df: DataFrame, + field: String, + cardinalityType: String, + precision: Double = 0.05): DataFrame = { + + cardinalityType match { + case "exact" => df.select(countDistinct(field)) + case _ => df.select(approx_count_distinct(field, rsd = precision)) + } + + } + + /** + * Validation method for ensuring that the fields specified have a cardinality below a set threshold + * + * @param fields Fields to test as an Array of Column Names + * @param cardinalityType 
The type of distinct check to perform to calculate the cardinality + * [either 'exact' or 'approx'] + * @param cardinalityLimit The limit, above which, the check will fail. + * @throws AssertionError if the cardinality of a field exceeds the threshold + * @author Ben Wilson + * @since 0.5.2 + */ + @throws(classOf[AssertionError]) + def validateCardinality(fields: Array[String], + cardinalityType: String, + cardinalityLimit: Long, + precision: Double = 0.05): Array[String] = { + + validateMembership(fields) + checkCardinality(fields, cardinalityType, cardinalityLimit, precision) + fields + } + + /** + * Method for filtering out any fields that are above a certain cardinality threshold to protect against + * creating unmanageably large feature vectors or computationally extreme StringIndexed values + * + * @param fields Fields to validate cardinality for + * @param cardinalityType The mode of cardinality checking [either "approx" for approximate distinct or "exact"] + * @param cardinalityLimit The limitation above which any field's cardinality will cause the field to be culled + * from the collection of fields to perform an operation on + * @param precision The precision set point for approx_distinct calculations for expected high cardinality fields + * or large data sets. 
+ * @return Array[String] of column names whose cardinality is below the threshold specified by cardinalityLimit + * @author Ben Wilson + * @since 0.5.2 + */ + def restrictFieldsBasedOnCardinality( + fields: Array[String], + cardinalityType: String, + cardinalityLimit: Long, + precision: Double = 0.05 + ): Array[String] = { + + validateMembership(fields) + + fields + .map { x => + val cardinality = + calculateCardinality(data, x, cardinalityType, precision).rdd + .map(r => r.getAs[Long](0)) + .take(1)(0) + cardinality match { + case y if y <= cardinalityLimit => x + case _ => "" + } + } + .filterNot(x => x.equals("")) + } +} + +/** + * Companion Object + */ +object FieldValidation { + + def apply(data: DataFrame): FieldValidation = new FieldValidation(data) + + def confirmCardinalityCheck(data: DataFrame, + fields: Array[String], + cardinalityType: String, + cardinalityLimit: Long, + precision: Double = 0.05): Array[String] = + this + .apply(data) + .validateCardinality(fields, cardinalityType, cardinalityLimit, precision) + + def restrictFields(data: DataFrame, + fields: Array[String], + cardinalityType: String, + cardinalityLimit: Long, + precision: Double = 0.05): Array[String] = + this + .apply(data) + .restrictFieldsBasedOnCardinality( + fields, + cardinalityType, + cardinalityLimit, + precision + ) + +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/structures/FeatureEngineeringEnums.scala b/src/main/scala/com/databricks/labs/automl/utils/structures/FeatureEngineeringEnums.scala new file mode 100644 index 00000000..4644fd86 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/structures/FeatureEngineeringEnums.scala @@ -0,0 +1,33 @@ +package com.databricks.labs.automl.utils.structures + +object FeatureEngineeringEnums extends Enumeration { + + type FeatureEngineeringEnums = FeatureEngineeringConstants + + val MIN = FeatureEngineeringConstants("min") + val MAX = FeatureEngineeringConstants("max") + val COUNT_COL = 
FeatureEngineeringConstants("count") + + case class FeatureEngineeringConstants(value: String) extends Val +} + +object FeatureEngineeringAllowables extends Enumeration { + + type FeatureEngineeringAllowables = FeatureEngineeringAllowableConstants + + val ALLOWED_CATEGORICAL_FILL_MODES = FeatureEngineeringAllowableConstants( + Array("min", "max") + ) + + case class FeatureEngineeringAllowableConstants(values: Array[String]) + extends Val + +} + +object ArrayGeneratorMode extends Enumeration { + val ASC = ArrayMode("ascending") + val DESC = ArrayMode("descending") + val RAND = ArrayMode("random") + protected case class ArrayMode(arrayMode: String) extends super.Val() + implicit def convert(value: Value): ArrayMode = value.asInstanceOf[ArrayMode] +} diff --git a/src/main/scala/com/databricks/labs/automl/utils/structures/FeatureEngineeringStructures.scala b/src/main/scala/com/databricks/labs/automl/utils/structures/FeatureEngineeringStructures.scala new file mode 100644 index 00000000..67ea0f61 --- /dev/null +++ b/src/main/scala/com/databricks/labs/automl/utils/structures/FeatureEngineeringStructures.scala @@ -0,0 +1,23 @@ +package com.databricks.labs.automl.utils.structures + +import org.apache.spark.sql.types.DataType + +case class FieldTypes(numericFields: List[String], + categoricalFields: List[String], + dateFields: List[String], + timeFields: List[String], + booleanFields: List[String]) + +case class FieldDefinitions(dataType: DataType, fieldName: String) + +case class FieldPairs(left: String, right: String) + +case class FieldCorrelationPayload(primaryColumn: String, + pairs: FieldPairs, + correlation: Double) + +case class FieldCorrelationAggregationStats(rowCounts: Double, + averageMap: Map[String, Double]) + +case class FieldRemovalPayload(dropFields: Array[String], + retainFields: Array[String]) diff --git a/src/main/scala/org/apache/spark/ml/automl/feature/BinaryEncoder.scala b/src/main/scala/org/apache/spark/ml/automl/feature/BinaryEncoder.scala new file 
mode 100644 index 00000000..b340e57b --- /dev/null +++ b/src/main/scala/org/apache/spark/ml/automl/feature/BinaryEncoder.scala @@ -0,0 +1,718 @@ +package org.apache.spark.ml.automl.feature + +import org.apache.hadoop.fs.Path +import org.apache.spark.SparkException +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{ + HasHandleInvalid, + HasInputCols, + HasOutputCols +} +import org.apache.spark.ml.util._ +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, lit, udf} +import org.apache.spark.sql.types.{DoubleType, StructField, StructType} +import org.apache.spark.sql.{DataFrame, Dataset} + +trait BinaryEncoderBase + extends Params + with HasHandleInvalid + with HasInputCols + with HasOutputCols { + + /** + * Configuration of the Parameter for handling invalid entries in a previously modeled feature column. + */ + override val handleInvalid: Param[String] = new Param[String]( + this, + "handleInvalid", + "Handling invalid data flag for utilizing the BinaryEncoderModel during transform method call. " + + "Options are: 'keep' (encodes unknown values as all Binary 1's as an unknown categorical class) or " + + "'error' (throw error if unknown value is introduced).", + ParamValidators.inArray(BinaryEncoder.supportedHandleInvalids) + ) + + setDefault(handleInvalid, BinaryEncoder.ERROR_INVALID) + + /** + * Method for validating the resultant schema from the application of building and transforming using this + * encoder package. The purpose of validation is to ensure that the supplied input columns are of the correct + * binary or nominal (ordinal numeric) type and that the output columns will contain the correct number of columns + * based on the configuration set. 
+ * @param schema The schema of the dataset supplied for training of the model or used in transforming using the model + * @param keepInvalid Boolean flag for whether to allow for an additional binary encoding value to be used for + * any values that were unknown at the time of model training, which will summarily be + * converted to a 'max binary value' of the encoding length + 1 with maximum n * "1" values. + * @return StructType that represents the transformed schema with additional output columns appended to the + * dataset structure. + * @since 0.5.3 + * @throws IllegalArgumentException if the configured input cols and output cols do not match one another in + * length. + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + protected def validateAndTransformSchema(schema: StructType, + keepInvalid: Boolean): StructType = { + + val inputColNames = $(inputCols) + val outputColNames = $(outputCols) + + require( + inputColNames.length == outputColNames.length, + s"The supplied number of input columns " + + s"${inputColNames.length} to BinaryEncoder " + s"does not match the output columns count ${outputColNames.length}.\n InputCols: ${inputColNames + .mkString(", ")}" + + s"\n OutputCols: ${outputColNames.mkString(", ")}" + ) + + // Validate that the supplied input columns are of numeric type + inputColNames.foreach(SchemaUtils.checkNumericType(schema, _)) + + val inputFields = $(inputCols).map(schema(_)) + + val outputFields = inputFields.zip(outputColNames).map { + case (inputField, outputColName) => + BinaryEncoderCommon + .transformOutputColumnSchema(inputField, outputColName, keepInvalid) + } + outputFields.foldLeft(schema) { + case (newSchema, outputField) => + StructType(newSchema.fields :+ outputField) + } + } + +} + +class BinaryEncoder(override val uid: String) + extends Estimator[BinaryEncoderModel] + with DefaultParamsWritable + with HasInputCols + with HasOutputCols + with BinaryEncoderBase { + + def this() = 
this(Identifiable.randomUID("binaryEncoder")) + + /** + * Setter for supplying the array of input columns to be encoded with the BinaryEncoder type + * @param values Array of column names + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** + * Setter for supplying the array of output columns that are the result of running a .transform from a trained + * model on an appropriate dataset of compatible schema + * @param values Array of column names that will be generated through a .transform + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** + * Setter for supplying an optional 'keep' or 'error' (Default: 'error') for un-seen values that arrive into a + * pre-trained model. With the 'keep' setting, an additional vector position is added to the output column + * to ensure no collisions may exist with real data and the values throughout each of the Array[Double] locations + * in the DenseVector output will all be set to '1' + * @param value String: either 'keep' or 'error' (Default: 'error') + * @throws SparkException if the configuration value supplied is not either 'keep' or 'error' + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + @throws(classOf[SparkException]) + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + override def transformSchema(schema: StructType): StructType = { + val keepInvalid = $(handleInvalid) == BinaryEncoder.KEEP_INVALID + validateAndTransformSchema(schema, keepInvalid = keepInvalid) + } + + /** + * Main fit method that will build a BinaryEncoder model from the data set and the configured input and output columns + * specified in the setters. + * The primary principle at work here is dimensionality reduction for the encoding of extremely high-cardinality + * StringIndexed columns. 
OneHotEncoding works extremely well for this purpose, but has the side-effect of + * requiring an extremely large number of columns to be generated, which increases memory pressure. + * This package allows for a lossy reduction in this space by distilling the information into a binary string + * encoding space that is dynamic based on the encoded length of the maximum nominal space as represented in binary. + * + * @example e.g. if the cardinality of a nominal column is 113, the binary representation of that is 1110001. + * When using OHE, this would result in 113 (or 114 if allowing invalids) binary positions within a sparse + * vector, creating 113 or 114 columns in the dataset. However, using BinaryEncoder, we are left with 7 + * (or 8, if allowing invalids) dense vector positions to capture the same amount of information. + * + * @note Due to the nature of this encoding and how the majority of models learn, this is seen as an information + * loss encoding. However, considering that high cardinality non-numeric nominal fields are frequently + * discarded due to the explosion of the data set, this is providing the ability to utilize high cardinality + * fields that otherwise would not be able to be included. + * @param dataset The dataset (or DataFrame) used in training the model + * @return BinaryEncoderModel - a serializable artifact that has the output schema and encoding embedded within it. 
+ * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + override def fit(dataset: Dataset[_]): BinaryEncoderModel = { + + transformSchema(dataset.schema) + + val mutatedSchema = + validateAndTransformSchema(dataset.schema, keepInvalid = false) + + // Build an array of the size of the targeted binary arrays to be encoded + val categoricalColumnSizes = new Array[Int]($(outputCols).length) + + // populate the Array of categorical column sizes with the size information for fitting of the model + val columnToScanIndices = $(outputCols).zipWithIndex.flatMap { + case (outputColName, idx) => + val classCount = + AttributeGroup.fromStructField(mutatedSchema(outputColName)).size + if (classCount < 0) { + Some(idx) + } else { + categoricalColumnSizes(idx) = classCount + None + } + } + + // If the metadata doesn't have the attribute information, extract it manually. + if (columnToScanIndices.length > 0) { + val inputColNames = columnToScanIndices.map($(inputCols)(_)) + val outputColNames = columnToScanIndices.map($(outputCols)(_)) + + val attrGroups = BinaryEncoderCommon.getOutputAttrGroupFromData( + dataset, + inputColNames, + outputColNames + ) + attrGroups.zip(columnToScanIndices).foreach { + case (attrGroup, idx) => + categoricalColumnSizes(idx) = attrGroup.size + } + } + + val model = + new BinaryEncoderModel(uid, categoricalColumnSizes).setParent(this) + copyValues(model) + + } + + override def copy(extra: ParamMap): BinaryEncoder = defaultCopy(extra) + +} + +object BinaryEncoder extends DefaultParamsReadable[BinaryEncoder] { + + private[feature] val KEEP_INVALID: String = "keep" + private[feature] val ERROR_INVALID: String = "error" + private[feature] val supportedHandleInvalids: Array[String] = + Array(KEEP_INVALID, ERROR_INVALID) + + override def load(path: String): BinaryEncoder = super.load(path) + +} + +class BinaryEncoderModel(override val uid: String, + val categorySizes: Array[Int]) + extends Model[BinaryEncoderModel] + with BinaryEncoderBase + with 
MLWritable { + + import BinaryEncoderModel._ + + /** + * Helper method for adjusting the sizes that the Binary Encoder will need to encode to capture the BinaryString + * representation of the Nominal data, adjusted for the keepInvalid flag. + * @since 0.5.3 + */ + private def getConfigedCategorySizes: Array[Int] = { + + val keepInvalid = getHandleInvalid == BinaryEncoder.KEEP_INVALID + + if (keepInvalid) { + categorySizes.map(_ + 1) + } else { + categorySizes + } + + } + + /** + * Main UDF for performing the conversion of nominal / binary encoded data from StringIndexer to a BinaryString format + * as a Spark ML DenseVector + * @throws SparkException if the value for encoding is not within the column index during training + * (keepInvalid set to 'error') + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + @throws(classOf[SparkException]) + private def encoder: UserDefinedFunction = { + + val keepInvalid = getHandleInvalid == BinaryEncoder.KEEP_INVALID + + val localCategorySizes = categorySizes + + udf { (label: Any, colIdx: Int) => + val origCategorySize = localCategorySizes(colIdx) + + val maxCategorySize = + BinaryEncoderCommon.convertToBinaryString(Some(origCategorySize)).length + + // Add additional index position to vector if the keepInvalid flag is set + val idxLength = if (keepInvalid) maxCategorySize + 1 else maxCategorySize + + val encodedData = + BinaryEncoderCommon.convertToBinary(Some(label), idxLength) + + val idx = if (encodedData.length <= origCategorySize) { + encodedData + } else { + if (keepInvalid) { + Array.fill(maxCategorySize)(1.0) + } else { + throw new SparkException( + s"The value specified for Binary Encoding (${label.toString}) in column index " + + s"$colIdx has not been seen " + s"during training of this model. 
To enable reclassification of unseen values, set the handleInvalid " + s"parameter to ${BinaryEncoder.KEEP_INVALID}" + ) + } + } + Vectors.dense(idx) + } + } + + /** + * Setter for specifying the column names in Array format for the columns intended to be Binary Indexed. + * @param values Array of column names + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** + * Setter for specifying the desired output columns in Array format for the columns to be generated as Spark ML + * DenseVectors when the model is used to transform a dataset + * @param values Array of output column names + * @note the index position relationship between setInputCols and setOutputCols is a 1 to 1 relationship. The + * positional order and length must be congruent and match. + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** + * Setter for whether to allow for unseen indexed nominal values to be used in the transformation of a dataset with + * the generated BinaryEncoderModel. 
+ * @note Default: 'error'. Allowed settings: 'keep' or 'error' + * @param value The setting to be used: either 'keep' or 'error' + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + /** + * Method for mutating the dataset schema to support the addition of BinaryEncoded columns + * @param schema the schema of the dataset + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + override def transformSchema(schema: StructType): StructType = { + + val inputColNames = $(inputCols) + + require( + inputColNames.length == categorySizes.length, + s"The number of input columns specified " + + s"(${inputColNames.length}) must equal the number of feature columns used during the fit (${categorySizes.length})" + ) + + val keepInvalid = $(handleInvalid) == BinaryEncoder.KEEP_INVALID + + val transformedSchema = + validateAndTransformSchema(schema, keepInvalid = keepInvalid) + verifyNumOfValues(transformedSchema) + + } + + /** + * Private method for validating that the schema metadata matches the expected cardinality for the column to be encoded + * @param schema schema of the input dataset + * @throws IllegalArgumentException if the metadata information does not match the data cardinality + * @return the schema of the dataset (pass-through method) + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + @throws(classOf[IllegalArgumentException]) + private def verifyNumOfValues(schema: StructType): StructType = { + val configedSizes = getConfigedCategorySizes + $(outputCols).zipWithIndex.foreach { + case (outputColName, idx) => + val inputColName = $(inputCols)(idx) + val attrGroup = AttributeGroup.fromStructField(schema(outputColName)) + + if (attrGroup.attributes.nonEmpty) { + val numCategories = configedSizes(idx) + require( + attrGroup.size == numCategories, + s"The number of distinct values in column $inputColName " + s"was expected to be $numCategories, but the metadata shows 
${attrGroup.size} distinct values." + ) + } + } + schema + } + + /** + * Main transformation method that will apply the model's configured encoding through a udf to the input dataset + * and add encoded columns. + * @param dataset input dataset for the model to mutate + * @return a DataFrame with added BinaryEncoded columns + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + override def transform(dataset: Dataset[_]): DataFrame = { + + val transformedSchema = transformSchema(dataset.schema, logging = true) + + val keepInvalid = $(handleInvalid) == BinaryEncoder.KEEP_INVALID + + val encodedColumns = $(inputCols).indices.map { idx => + val inputColName = $(inputCols)(idx) + val outputColName = $(outputCols)(idx) + + val metadata = BinaryEncoderCommon + .createAttrGroupForAttrNames( + outputColName, + categorySizes(idx), + keepInvalid + ) + .toMetadata() + + encoder(col(inputColName).cast(DoubleType), lit(idx)) + .as(outputColName, metadata) + } + + dataset.withColumns($(outputCols), encodedColumns) + + } + + override def copy(extra: ParamMap): BinaryEncoderModel = { + val copied = new BinaryEncoderModel(uid, categorySizes) + copyValues(copied, extra).setParent(parent) + } + + override def write: MLWriter = new BinaryEncoderModelWriter(this) + +} + +object BinaryEncoderModel extends MLReadable[BinaryEncoderModel] { + private[BinaryEncoderModel] class BinaryEncoderModelWriter( + instance: BinaryEncoderModel + ) extends MLWriter { + + private case class Data(categorySizes: Array[Int]) + + override protected def saveImpl(path: String): Unit = { + + DefaultParamsWriter.saveMetadata(instance, path, sc) + val data = Data(instance.categorySizes) + val dataPath = new Path(path, "data").toString + sparkSession + .createDataFrame(Seq(data)) + .repartition(1) + .write + .parquet(dataPath) + } + } + + private class BinaryEncoderModelReader extends MLReader[BinaryEncoderModel] { + + private val className = classOf[BinaryEncoderModel].getName + + override def load(path: 
String): BinaryEncoderModel = { + val metadata = DefaultParamsReader.loadMetadata(path, sc, className) + val dataPath = new Path(path, "data").toString + val data = sparkSession.read + .parquet(dataPath) + .select("categorySizes") + .head() + val categorySizes = data.getAs[Seq[Int]](0).toArray + val model = new BinaryEncoderModel(metadata.uid, categorySizes) + metadata.getAndSetParams(model) + model + } + + } + + override def read: MLReader[BinaryEncoderModel] = new BinaryEncoderModelReader + + override def load(path: String): BinaryEncoderModel = super.load(path) + +} + +private[feature] object BinaryEncoderCommon { + + /** + * Helper method for appropriately padding leading zero's based on the BinaryString Length value and the + * target encoding length + * @param inputString A supplied BinaryString encoded value + * @throws IllegalArgumentException if the target encoding size is smaller than the data size after encoding, + * since this would result in significant data loss and issues with modeling. + * @return Padded string to the prescribed encoding length + */ + @throws(classOf[IllegalArgumentException]) + private[feature] def padZeros(inputString: String, + encodingSize: Int): String = { + + val deltaLength = encodingSize - inputString.length + + deltaLength match { + case 0 => inputString + case x if x > 0 => "0" * x + inputString + case _ => + throw new IllegalArgumentException( + s"Target encoding size $encodingSize of BinaryString is less " + + s"than total encoded information. Information loss of a substantial degree would be generated. " + + s"Adjust encodingSize higher." + ) + } + + } + + /** + * Private encoding method. + * @param value takes in an Option of Any type and casts it to a BinaryString representation + * @note Conversions from non-whole-number values of numeric types WILL incur information loss as decimal values + * cannot be represented by simple BinaryString. It will incur a rounding to the nearest whole number. 
+ * @tparam A Any type + * @return Binary String conversion of the input data + * @throws UnsupportedOperationException if the data type being input is not of the correct type for conversion. + */ + @throws(classOf[UnsupportedOperationException]) + private[feature] def convertToBinaryString[A <: Any]( + value: Option[A] + ): String = { + + value.get match { + case a: Boolean => if (a) "1" else "0" + case a: Byte => a.toByte.toBinaryString + case a: Char => a.toChar.toBinaryString + case a: Int => a.toInt.toBinaryString + case a: Long => a.toLong.toBinaryString + case a: Float => a.toFloat.toByte.toBinaryString + case a: Double => a.toString.toDouble.toByte.toBinaryString + case a: String => + a.toString.toCharArray + .flatMap(_.toBinaryString) + .mkString("") + case a: BigDecimal => a.toString.toByte.toBinaryString + case _ => + throw new UnsupportedOperationException( + s"ordinalToBinary does not support type: " + + s"${value.getClass.getSimpleName}" + ) + } + + } + + /** + * Private method for converting Any type to an Array of Doubles with appropriate prefix padding to ensure + * constant vector length of the output DenseVector from transformation. 
+ * @param ordinalValue The value to encode + * @param encodingSize The size of the Array of Binary Double values for output + * @tparam A Any type (primitives are only supported - collections will throw an exception) + * @return Array of Doubles, representing the Binary values of the encoded value + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + private[feature] def convertToBinary[A <: Any]( + ordinalValue: Option[A], + encodingSize: Int + ): Array[Double] = { + + val binaryString = convertToBinaryString(ordinalValue) + val padded = padZeros(binaryString, encodingSize) + binaryStringToDoubleArray(padded) + } + + /** + * Private method for converting a binary string to Array[Double] + * @param binary Binary Encoded String + * @return Array[Double] for the numeric vector-representation of an encoded value in Binary format + * @since 0.5.3 + * @author Ben Wilson, Databricks + */ + private[feature] def binaryStringToDoubleArray( + binary: String + ): Array[Double] = { + + binary.toCharArray.map(_.toString.toDouble) + + } + + /** + * Pulled from OneHotEncoder with slight modifications. 
+ * @param dataset Input dataset + * @param inputColNames The names that are to be encoded + * @param outputColNames the output columns after encoding + * @return Seq[AttributeGroup] for the schema metadata associated with the Model + * @since 0.5.3 + */ + def getOutputAttrGroupFromData( + dataset: Dataset[_], + inputColNames: Seq[String], + outputColNames: Seq[String] + ): Seq[AttributeGroup] = { + + val columns = inputColNames.map { inputColName => + col(inputColName).cast(DoubleType) + } + val numOfColumns = columns.length + + val numAttrsArray = dataset + .select(columns: _*) + .rdd + .map { row => + (0 until numOfColumns).map(idx => row.getDouble(idx)).toArray + } + .treeAggregate(new Array[Double](numOfColumns))( + (maxValues: Array[Double], curValues: Array[Double]) => { + (0 until numOfColumns).foreach { + idx => + val x = curValues(idx) + assert( + x <= Int.MaxValue, + s"Index out of range for maximum ordinal Value for 32bit int space. " + + s"Value: $x is larger than maximum of ${Int.MaxValue} in field ${inputColNames(idx)}" + ) + assert( + x >= 0.0 && x == x.toInt, + s"Values from column ${inputColNames(idx)} must be indices, but got value $x, which is invalid." + ) + maxValues(idx) = math.max(maxValues(idx), x) + } + maxValues + }, + (m0, m1) => { + (0 until numOfColumns).foreach { idx => + m0(idx) = math.max(m0(idx), m1(idx)) + } + m0 + } + ) + .map(_.toInt + 1) + + outputColNames.zip(numAttrsArray).map { + case (outputColName, numAttrs) => + createAttrGroupForAttrNames( + outputColName, + numAttrs, + keepInvalid = false + ) + } + + } + + /** + * Pulled from OneHotEncoder with slight modifications. 
+ * @param outputColName Output Column to be created by the BinaryEncoderModel + * @param numAttrs The number of unique values for the output column + * @param keepInvalid Boolean flag for whether to allow for unseen values to be grouped into an unknown binary + * representation during transformation + * @return AttributeGroup + * @since 0.5.3 + */ + def createAttrGroupForAttrNames(outputColName: String, + numAttrs: Int, + keepInvalid: Boolean): AttributeGroup = { + + val maxAttributeSize = if (keepInvalid) { + BinaryEncoderCommon.convertToBinaryString(Some(numAttrs)).length + 1 + } else { + BinaryEncoderCommon.convertToBinaryString(Some(numAttrs)).length + } + + val outputAttrNames = Array.tabulate(maxAttributeSize)(_.toString) + + genOutputAttrGroup(Some(outputAttrNames), outputColName) + } + + /** + * Pulled from OneHotEncoder with slight modifications. + * @param outputAttrNames Attributes for the metadata + * @param outputColName The name of the output column to be added + * @return AttributeGroup + * @since 0.5.3 + */ + private def genOutputAttrGroup(outputAttrNames: Option[Array[String]], + outputColName: String): AttributeGroup = { + outputAttrNames + .map { attrNames => + val attrs: Array[Attribute] = attrNames.map { name => + BinaryAttribute.defaultAttr.withName(name) + } + new AttributeGroup(outputColName, attrs) + } + .getOrElse { + new AttributeGroup(outputColName) + } + } + + /** + * Pulled from OneHotEncoder with slight modifications. 
+ * @param inputCol An input column to extract the encoding lengths for the output column during transformation + * @return Output names for attributes + */ + private def genOutputAttrNames( + inputCol: StructField + ): Option[Array[String]] = { + val inputAttr = Attribute.fromStructField(inputCol) + + inputAttr match { + case nominal: NominalAttribute => + val outputCardinality = + BinaryEncoderCommon + .convertToBinaryString(Some(nominal.values.get.length)) + .length + Some((0 to outputCardinality).toArray.map(_.toString)) + case binary: BinaryAttribute => + if (binary.values.isDefined) { + binary.values + } else { + Some(Array.tabulate(2)(_.toString)) + } + case _: NumericAttribute => + throw new RuntimeException( + s"The input column ${inputCol.name} cannot be continuous-value." + ) + case _ => + None // optimistic about unknown attributes + } + } + + /** + * Pulled from OneHotEncoder with slight modifications. + * @param inputCol An input column that will be used to validate the output column corresponding to it.
+ * @param outputColName An output column to match to the input column + * @param keepInvalid invalid retention flag + * @return the Output column struct field definition + */ + def transformOutputColumnSchema(inputCol: StructField, + outputColName: String, + keepInvalid: Boolean = false): StructField = { + + val outputAttrNames = genOutputAttrNames(inputCol) + val filteredOutputAttrNames = outputAttrNames.map { names => + if (keepInvalid) { + names ++ Seq("invalidValues") + } else { + names + } + } + genOutputAttrGroup(filteredOutputAttrNames, outputColName).toStructField() + } + +} diff --git a/src/test/resources/AirQualityUCI.csv b/src/test/resources/AirQualityUCI.csv new file mode 100644 index 00000000..ace78ebf --- /dev/null +++ b/src/test/resources/AirQualityUCI.csv @@ -0,0 +1,1001 @@ +Date,Time,label,PT08_S1_CO,NMHC_GT,C6H6_GT,PT08_S2_NMHC,Nox_GT,PT08_S3(NOx),NO2_GT,PT08_S4_NO2,PT08_S5_O3,T,RH,AH +3/10/04,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578 +3/10/04,19:00:00,2,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255 +3/10/04,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502 +3/10/04,21:00:00,2.2,1376,80,9.2,948,172,1092,122,1584,1203,11.0,60.0,0.7867 +3/10/04,22:00:00,1.6,1272,51,6.5,836,131,1205,116,1490,1110,11.2,59.6,0.7888 +3/10/04,23:00:00,1.2,1197,38,4.7,750,89,1337,96,1393,949,11.2,59.2,0.7848 +3/11/04,0:00:00,1.2,1185,31,3.6,690,62,1462,77,1333,733,11.3,56.8,0.7603 +3/11/04,1:00:00,1,1136,31,3.3,672,62,1453,76,1333,730,10.7,60.0,0.7702 +3/11/04,2:00:00,0.9,1094,24,2.3,609,45,1579,60,1276,620,10.7,59.7,0.7648 +3/11/04,3:00:00,0.6,1010,19,1.7,561,200,1705,200,1235,501,10.3,60.2,0.7517 +3/11/04,4:00:00,200,1011,14,1.3,527,21,1818,34,1197,445,10.1,60.5,0.7465 +3/11/04,5:00:00,0.7,1066,8,1.1,512,16,1918,28,1182,422,11.0,56.2,0.7366 +3/11/04,6:00:00,0.7,1052,16,1.6,553,34,1738,48,1221,472,10.5,58.1,0.7353 +3/11/04,7:00:00,1.1,1144,29,3.2,667,98,1490,82,1339,730,10.2,59.6,0.7417 
+3/11/04,8:00:00,2,1333,64,8.0,900,174,1136,112,1517,1102,10.8,57.4,0.7408 +3/11/04,9:00:00,2.2,1351,87,9.5,960,129,1079,101,1583,1028,10.5,60.6,0.7691 +3/11/04,10:00:00,1.7,1233,77,6.3,827,112,1218,98,1446,860,10.8,58.4,0.7552 +3/11/04,11:00:00,1.5,1179,43,5.0,762,95,1328,92,1362,671,10.5,57.9,0.7352 +3/11/04,12:00:00,1.6,1236,61,5.2,774,104,1301,95,1401,664,9.5,66.8,0.7951 +3/11/04,13:00:00,1.9,1286,63,7.3,869,146,1162,112,1537,799,8.3,76.4,0.8393 +3/11/04,14:00:00,2.9,1371,164,11.5,1034,207,983,128,1730,1037,8.0,81.1,0.8736 +3/11/04,15:00:00,2.2,1310,79,8.8,933,184,1082,126,1647,946,8.3,79.8,0.8778 +3/11/04,16:00:00,2.2,1292,95,8.3,912,193,1103,131,1591,957,9.7,71.2,0.8569 +3/11/04,17:00:00,2.9,1383,150,11.2,1020,243,1008,135,1719,1104,9.8,67.6,0.8185 +3/11/04,18:00:00,4.8,1581,307,20.8,1319,281,799,151,2083,1409,10.3,64.2,0.8065 +3/11/04,19:00:00,6.9,1776,461,27.4,1488,383,702,172,2333,1704,9.7,69.3,0.8319 +3/11/04,20:00:00,6.1,1640,401,24.0,1404,351,743,165,2191,1654,9.6,67.8,0.8133 +3/11/04,21:00:00,3.9,1313,197,12.8,1076,240,957,136,1707,1285,9.1,64.0,0.7419 +3/11/04,22:00:00,1.5,965,61,4.7,749,94,1325,85,1333,821,8.2,63.4,0.6905 +3/11/04,23:00:00,1,913,26,2.6,629,47,1565,53,1252,552,8.2,60.8,0.6657 +3/12/04,0:00:00,1.7,1080,55,5.9,805,122,1254,97,1375,816,8.3,58.5,0.6438 +3/12/04,1:00:00,1.9,1044,53,6.4,829,133,1247,110,1378,832,7.7,59.7,0.6308 +3/12/04,2:00:00,1.4,988,40,4.1,718,82,1396,91,1304,692,7.1,61.8,0.6276 +3/12/04,3:00:00,0.8,889,21,1.9,574,200,1680,200,1187,512,7.0,62.3,0.6261 +3/12/04,4:00:00,200,831,10,1.1,506,21,1893,32,1134,384,6.1,65.9,0.6248 +3/12/04,5:00:00,0.6,847,7,1.0,501,30,1895,44,1155,394,6.3,65.0,0.6233 +3/12/04,6:00:00,0.8,927,17,1.8,571,56,1685,71,1223,487,6.8,62.9,0.6234 +3/12/04,7:00:00,1.4,1091,33,4.4,730,109,1387,104,1361,748,6.4,65.1,0.6316 +3/12/04,8:00:00,4.4,1587,202,17.9,1236,307,897,141,1900,1400,7.3,63.1,0.6499 +3/12/04,9:00:00,200,1545,200,22.1,1353,200,767,200,2058,1588,9.2,56.2,0.6561 
+3/12/04,10:00:00,3.1,1350,208,14.0,1118,187,912,122,1712,1237,13.2,41.7,0.6320 +3/12/04,11:00:00,2.7,1263,166,11.6,1037,216,969,143,1598,1167,14.3,38.4,0.6243 +3/12/04,12:00:00,2.1,1206,114,10.2,986,143,1035,113,1537,959,15.0,36.5,0.6195 +3/12/04,13:00:00,2.5,1252,140,11.0,1016,160,1008,116,1593,983,16.1,34.5,0.6262 +3/12/04,14:00:00,2.7,1287,169,12.8,1078,163,949,123,1660,1061,16.3,35.7,0.6560 +3/12/04,15:00:00,2.9,1353,185,14.2,1122,190,922,126,1740,1139,15.8,37.0,0.6610 +3/12/04,16:00:00,2.8,1309,165,12.7,1073,178,954,120,1657,1112,15.9,37.2,0.6657 +3/12/04,17:00:00,2.4,1274,133,11.7,1041,150,1006,119,1610,994,16.9,34.3,0.6549 +3/12/04,18:00:00,3.9,1510,233,19.3,1277,206,812,149,1910,1410,15.1,39.6,0.6766 +3/12/04,19:00:00,3.7,1525,242,18.2,1246,202,821,145,1847,1448,14.4,43.4,0.7084 +3/12/04,20:00:00,6.6,1843,488,32.6,1610,340,624,170,2390,1887,12.9,50.5,0.7478 +3/12/04,21:00:00,4.4,1598,333,20.1,1299,274,752,149,1941,1627,12.1,53.3,0.7536 +3/12/04,22:00:00,3.5,1484,215,14.3,1127,253,839,139,1723,1491,11.0,59.1,0.7740 +3/12/04,23:00:00,5.4,1677,367,21.8,1346,300,741,134,2062,1657,9.7,64.6,0.7771 +3/13/04,0:00:00,2.7,1280,122,9.6,964,193,963,113,1544,1285,9.5,64.1,0.7597 +3/13/04,1:00:00,1.9,1196,67,7.4,873,139,1071,97,1463,1144,9.1,63.9,0.7423 +3/13/04,2:00:00,1.6,1184,43,5.4,782,83,1176,82,1365,1043,8.8,63.9,0.7256 +3/13/04,3:00:00,1.7,1172,46,5.4,783,200,1179,200,1380,996,7.8,67.5,0.7173 +3/13/04,4:00:00,200,1147,56,6.2,821,109,1132,83,1412,992,7.0,71.1,0.7158 +3/13/04,5:00:00,1,978,30,2.6,625,62,1420,65,1274,819,8.3,63.6,0.6982 +3/13/04,6:00:00,1.2,1100,27,2.9,646,53,1406,60,1268,835,7.2,67.5,0.6887 +3/13/04,7:00:00,1.5,1112,47,5.1,770,139,1228,77,1409,940,6.3,71.9,0.6932 +3/13/04,8:00:00,2.7,1336,132,11.8,1043,256,935,96,1678,1192,6.5,71.6,0.6945 +3/13/04,9:00:00,3.7,1408,239,15.1,1153,295,830,119,1777,1411,9.6,59.7,0.7124 +3/13/04,10:00:00,3.2,1447,160,12.9,1081,250,869,126,1667,1465,12.4,51.2,0.7335 
+3/13/04,11:00:00,4.1,1542,283,16.1,1184,296,808,158,1780,1583,15.6,42.2,0.7451 +3/13/04,12:00:00,3.6,1451,210,14.0,1117,239,875,161,1679,1387,18.4,33.8,0.7090 +3/13/04,13:00:00,2.8,1328,154,12.3,1059,153,987,124,1600,1101,19.4,31.3,0.6950 +3/13/04,14:00:00,2,1207,112,8.6,924,118,1088,102,1488,850,18.0,34.8,0.7127 +3/13/04,15:00:00,2,1240,108,9.2,947,119,1049,116,1532,947,18.4,33.6,0.7042 +3/13/04,16:00:00,2.5,1306,111,10.2,987,138,1004,124,1554,1078,17.6,35.1,0.7012 +3/13/04,17:00:00,2.3,1326,97,10.6,1000,148,976,125,1602,1084,16.7,37.8,0.7117 +3/13/04,18:00:00,3.2,1473,191,15.5,1163,227,831,148,1779,1395,16.1,41.0,0.7451 +3/13/04,19:00:00,4.2,1609,258,19.6,1286,277,758,165,1922,1612,15.8,42.4,0.7569 +3/13/04,20:00:00,4.2,1611,284,19.2,1274,279,754,161,1915,1697,15.7,44.1,0.7786 +3/13/04,21:00:00,4.2,1621,269,18.3,1247,283,762,159,1860,1886,15.3,46.8,0.8091 +3/13/04,22:00:00,3.1,1444,180,13.1,1089,214,844,143,1748,1624,14.6,48.6,0.8060 +3/13/04,23:00:00,2.6,1418,116,10.9,1010,172,892,130,1603,1536,14.7,49.3,0.8193 +3/14/04,0:00:00,2.9,1534,93,11.0,1013,190,889,129,1611,1535,13.9,53.6,0.8498 +3/14/04,1:00:00,2.8,1484,131,11.9,1045,174,880,119,1624,1530,14.6,51.5,0.8536 +3/14/04,2:00:00,2.5,1367,92,8.6,925,128,953,104,1543,1337,12.5,58.9,0.8537 +3/14/04,3:00:00,2.4,1344,132,9.7,968,200,921,200,1620,1278,11.6,63.4,0.8674 +3/14/04,4:00:00,200,1130,56,5.2,773,70,1130,82,1452,1051,12.1,61.1,0.8603 +3/14/04,5:00:00,1.2,1062,32,3.7,691,53,1272,70,1377,929,11.5,63.1,0.8533 +3/14/04,6:00:00,1,1076,29,2.5,618,44,1395,63,1333,872,11.6,62.2,0.8473 +3/14/04,7:00:00,0.9,1028,27,2.4,615,74,1384,67,1340,853,10.4,67.6,0.8530 +3/14/04,8:00:00,1.4,1155,36,4.2,722,101,1225,84,1414,959,11.6,62.7,0.8530 +3/14/04,9:00:00,1.6,1235,57,6.4,828,118,1055,83,1527,1093,12.4,60.0,0.8627 +3/14/04,10:00:00,2.2,1332,129,8.6,923,144,952,98,1614,1225,14.5,53.1,0.8728 +3/14/04,11:00:00,2.8,1445,148,10.9,1009,176,878,114,1696,1355,16.9,46.1,0.8789 
+3/14/04,12:00:00,2.8,1416,145,10.7,1002,161,907,119,1677,1262,19.3,38.3,0.8474 +3/14/04,13:00:00,2,1281,93,7.5,880,113,1084,104,1525,980,21.2,31.4,0.7812 +3/14/04,14:00:00,1.8,1207,84,7.5,879,103,1104,102,1490,872,21.4,30.2,0.7616 +3/14/04,15:00:00,1.9,1258,99,8.2,906,112,1081,107,1511,900,21.9,29.0,0.7525 +3/14/04,16:00:00,3,1458,150,11.9,1045,170,974,129,1646,1099,22.2,28.4,0.7516 +3/14/04,17:00:00,2.9,1438,156,12.0,1051,180,943,128,1668,1206,21.3,30.8,0.7696 +3/14/04,18:00:00,2.5,1478,122,12.2,1055,160,929,121,1671,1262,19.7,36.7,0.8307 +3/14/04,19:00:00,4.6,1808,262,20.6,1312,261,753,157,1993,1698,18.4,41.7,0.8732 +3/14/04,20:00:00,5.9,1898,341,23.1,1381,325,681,173,2103,1905,17.6,46.1,0.9210 +3/14/04,21:00:00,3.4,1560,214,14.7,1140,217,784,146,1818,1648,16.7,49.6,0.9320 +3/14/04,22:00:00,2.1,1324,100,9.0,940,146,924,121,1587,1423,16.3,51.0,0.9341 +3/14/04,23:00:00,2.2,1349,79,8.8,933,152,933,119,1617,1349,14.7,55.9,0.9314 +3/15/04,0:00:00,1.8,1239,66,7.4,872,104,985,99,1547,1250,14.8,54.7,0.9164 +3/15/04,1:00:00,1.8,1239,73,6.9,853,106,1010,93,1543,1174,14.0,57.0,0.9094 +3/15/04,2:00:00,1.8,1224,66,7.0,855,108,998,88,1566,1149,13.4,61.3,0.9361 +3/15/04,3:00:00,1.1,1078,44,4.4,734,200,1128,200,1487,1021,12.6,63.5,0.9230 +3/15/04,4:00:00,200,1078,44,4.0,711,66,1150,71,1468,1013,12.3,65.4,0.9351 +3/15/04,5:00:00,1,1075,39,3.9,703,88,1156,74,1464,1010,11.9,67.4,0.9375 +3/15/04,6:00:00,1.4,1157,51,6.4,830,138,1030,80,1584,1083,11.4,70.5,0.9475 +3/15/04,7:00:00,2.2,1314,107,9.7,966,228,897,89,1710,1235,11.3,70.2,0.9401 +3/15/04,8:00:00,5.5,1797,336,25.9,1451,360,652,114,2323,1680,12.4,63.9,0.9170 +3/15/04,9:00:00,8.1,1961,618,36.7,1701,478,537,149,2665,2184,14.8,54.3,0.9076 +3/15/04,10:00:00,5.8,1771,438,26.6,1470,394,622,157,2262,1973,17.4,45.6,0.8993 +3/15/04,11:00:00,4.2,1564,334,20.1,1300,319,710,155,2029,1798,19.8,38.5,0.8801 +3/15/04,12:00:00,3.1,1430,221,14.1,1120,201,831,134,1783,1522,22.0,34.1,0.8904 
+3/15/04,13:00:00,2.9,1417,207,14.9,1146,171,830,119,1831,1404,23.3,32.2,0.9096 +3/15/04,14:00:00,2.9,1400,191,15.4,1162,159,838,111,1829,1263,23.9,30.0,0.8757 +3/15/04,15:00:00,2.5,1317,185,12.1,1053,153,926,104,1707,1137,24.4,28.9,0.8736 +3/15/04,16:00:00,2.3,1318,141,11.5,1033,143,950,99,1675,1068,24.4,29.4,0.8848 +3/15/04,17:00:00,2.8,1445,214,14.8,1141,156,857,110,1824,1252,23.8,31.3,0.9137 +3/15/04,18:00:00,6.1,1917,471,32.1,1601,314,631,162,2447,1843,22.5,35.4,0.9547 +3/15/04,19:00:00,8,2040,685,39.2,1754,404,542,187,2679,2122,20.4,42.5,1.0086 +3/15/04,20:00:00,6.5,1895,538,31.0,1573,320,565,165,2443,1992,18.3,52.6,1.0945 +3/15/04,21:00:00,4.2,1595,319,19.9,1294,256,678,145,2058,1707,16.7,57.4,1.0786 +3/15/04,22:00:00,3.2,1439,224,15.3,1159,193,764,125,1856,1494,15.7,60.2,1.0654 +3/15/04,23:00:00,1.4,1142,67,6.9,852,89,1008,101,1547,1164,15.3,61.4,1.0580 +3/16/04,0:00:00,2.1,1304,155,11.1,1019,127,852,103,1731,1272,14.1,65.7,1.0494 +3/16/04,1:00:00,1.2,1074,49,5.4,784,79,1098,88,1483,1040,14.8,60.6,1.0148 +3/16/04,2:00:00,0.8,968,29,2.8,640,43,1320,61,1386,867,14.8,59.2,0.9879 +3/16/04,3:00:00,0.7,929,25,2.3,608,200,1376,200,1364,826,13.6,62.1,0.9636 +3/16/04,4:00:00,200,941,25,2.6,626,59,1316,59,1373,840,12.3,66.2,0.9450 +3/16/04,5:00:00,0.6,937,17,2.0,585,38,1412,52,1348,793,12.8,63.2,0.9283 +3/16/04,6:00:00,0.9,1017,27,3.5,681,82,1246,64,1418,870,11.2,68.5,0.9081 +3/16/04,7:00:00,1.3,1171,50,5.1,770,99,1162,70,1467,930,11.0,66.5,0.8705 +3/16/04,8:00:00,3.4,1541,218,16.2,1185,263,770,97,1889,1407,11.7,63.7,0.8719 +3/16/04,9:00:00,3.7,1539,285,19.7,1287,229,698,95,2055,1507,13.6,56.3,0.8743 +3/16/04,10:00:00,5.3,1735,437,25.1,1431,396,628,150,2211,1843,17.8,42.9,0.8672 +3/16/04,11:00:00,4.1,1571,327,20.0,1297,314,730,162,1973,1729,21.4,33.3,0.8417 +3/16/04,12:00:00,3.3,1452,283,18.3,1250,217,776,154,1868,1583,24.4,27.4,0.8231 +3/16/04,13:00:00,4,1579,366,22.3,1359,252,724,161,1998,1671,25.3,26.1,0.8264 
+3/16/04,14:00:00,3.8,1466,318,20.4,1309,263,773,161,1897,1491,25.8,23.2,0.7589 +3/16/04,15:00:00,2.8,1280,228,14.6,1136,180,893,128,1675,1240,27.0,20.2,0.7094 +3/16/04,16:00:00,2.9,1407,201,16.6,1197,184,905,129,1759,1313,28.2,18.6,0.7014 +3/16/04,17:00:00,2.9,1389,199,15.8,1173,190,898,133,1739,1363,28.0,19.1,0.7098 +3/16/04,18:00:00,3.4,1447,237,17.8,1235,184,859,139,1778,1296,23.9,25.7,0.7519 +3/16/04,19:00:00,3.9,1551,261,19.1,1271,181,800,137,1875,1432,21.3,34.8,0.8730 +3/16/04,20:00:00,3.2,1474,230,15.8,1173,166,854,143,1776,1432,20.4,36.7,0.8704 +3/16/04,21:00:00,5.1,1800,349,24.9,1426,317,700,177,2106,2034,19.0,41.3,0.9007 +3/16/04,22:00:00,2.6,1379,183,13.5,1101,184,818,138,1710,1602,17.9,45.9,0.9342 +3/16/04,23:00:00,1.7,1201,88,9.1,943,130,935,117,1560,1362,16.7,48.9,0.9226 +3/17/04,0:00:00,1.7,1205,85,8.6,925,132,922,107,1547,1314,15.5,52.9,0.9271 +3/17/04,1:00:00,1.2,1072,47,5.4,784,95,1066,90,1442,1114,15.5,51.9,0.9059 +3/17/04,2:00:00,0.9,998,34,4.1,714,70,1169,79,1383,992,14.1,55.6,0.8910 +3/17/04,3:00:00,0.7,933,26,2.6,625,200,1292,200,1332,884,13.1,57.9,0.8666 +3/17/04,4:00:00,200,883,17,1.9,577,54,1396,60,1303,808,12.7,57.9,0.8447 +3/17/04,5:00:00,0.5,869,11,1.6,554,28,1460,40,1268,667,11.8,58.0,0.8026 +3/17/04,6:00:00,0.5,891,18,1.9,576,46,1417,53,1289,733,11.9,57.4,0.7978 +3/17/04,7:00:00,1.6,1173,84,7.5,878,160,1038,84,1556,1052,9.9,65.2,0.7956 +3/17/04,8:00:00,4.1,1587,260,21.4,1334,320,702,108,1999,1534,11.1,60.2,0.7958 +3/17/04,9:00:00,6.6,1852,534,36.4,1696,377,553,127,2535,1931,14.1,50.0,0.8020 +3/17/04,10:00:00,4.3,1522,368,21.3,1331,280,703,134,2026,1734,17.7,40.1,0.8048 +3/17/04,11:00:00,2.9,1438,200,15.4,1161,221,819,135,1782,1595,21.1,33.4,0.8265 +3/17/04,12:00:00,2.5,1393,145,12.5,1067,210,905,142,1654,1490,24.3,28.3,0.8471 +3/17/04,13:00:00,2.8,1452,188,15.1,1152,204,861,153,1747,1508,25.6,25.6,0.8254 +3/17/04,14:00:00,2.6,1389,152,13.7,1108,161,887,123,1712,1334,25.9,25.9,0.8503 
+3/17/04,15:00:00,2,1207,103,10.4,994,135,1013,104,1520,1045,26.8,18.7,0.6500 +3/17/04,16:00:00,2.9,1365,193,15.2,1154,186,940,129,1679,1221,29.3,15.8,0.6309 +3/17/04,17:00:00,2.5,1247,134,12.3,1060,147,1032,114,1525,1069,28.5,14.9,0.5708 +3/17/04,18:00:00,5,1557,386,27.0,1478,299,793,158,1981,1569,25.9,16.0,0.5237 +3/17/04,19:00:00,7.6,1973,577,38.4,1737,411,617,194,2414,2306,23.1,26.5,0.7403 +3/17/04,20:00:00,6.7,1975,523,35.1,1667,347,597,182,2416,2359,20.5,38.2,0.9133 +3/17/04,21:00:00,5.7,1795,472,27.2,1485,336,653,180,2174,2050,19.1,42.6,0.9294 +3/17/04,22:00:00,2.8,1444,206,15.0,1150,202,770,136,1727,1727,17.2,44.1,0.8558 +3/17/04,23:00:00,2.6,1488,216,15.7,1171,178,731,127,1778,1705,16.0,50.9,0.9206 +3/18/04,0:00:00,2.3,1371,159,13.0,1083,154,796,116,1669,1551,14.8,53.9,0.9024 +3/18/04,1:00:00,1.4,1161,70,8.1,904,92,947,107,1502,1240,14.3,55.4,0.8975 +3/18/04,2:00:00,1,1064,44,5.5,787,61,1057,88,1407,1115,14.8,52.1,0.8686 +3/18/04,3:00:00,0.7,970,42,3.6,687,200,1196,200,1351,969,13.9,53.9,0.8501 +3/18/04,4:00:00,200,954,28,2.9,645,60,1260,78,1334,925,11.6,61.9,0.8442 +3/18/04,5:00:00,0.6,931,20,2.5,623,37,1293,57,1307,828,12.0,58.9,0.8233 +3/18/04,6:00:00,0.7,938,26,3.0,650,68,1251,71,1333,893,10.9,62.1,0.8113 +3/18/04,7:00:00,1.5,1166,78,7.7,885,139,1005,85,1551,1075,10.6,63.3,0.8067 +3/18/04,8:00:00,4.7,1643,319,23.3,1384,339,667,124,2094,1610,11.5,60.0,0.8117 +3/18/04,9:00:00,6.6,1934,506,35.8,1682,421,541,151,2468,2051,14.3,50.6,0.8186 +3/18/04,10:00:00,4.5,1617,200,21.3,1333,349,686,150,2010,1819,17.8,40.5,0.8210 +3/18/04,11:00:00,2.8,1473,200,14.3,1127,224,831,152,1752,1568,20.8,34.4,0.8365 +3/18/04,12:00:00,2.2,1379,200,12.5,1068,171,899,139,1663,1374,23.8,28.2,0.8219 +3/18/04,13:00:00,2.2,1385,200,12.2,1056,149,891,133,1648,1268,24.2,28.7,0.8515 +3/18/04,14:00:00,2.3,1379,200,13.1,1087,137,901,126,1660,1144,25.2,24.9,0.7829 +3/18/04,15:00:00,2.2,1322,200,14.4,1129,149,934,128,1639,1109,27.0,17.8,0.6275 
+3/18/04,16:00:00,2.8,1496,200,16.8,1205,172,822,169,1767,1347,27.1,23.1,0.8180 +3/18/04,17:00:00,2.7,1409,200,14.5,1131,166,873,149,1689,1206,25.8,23.9,0.7832 +3/18/04,18:00:00,3.7,1513,200,21.5,1338,214,764,156,1957,1397,23.0,26.8,0.7446 +3/18/04,19:00:00,5.1,1667,200,26.4,1464,280,683,168,2118,1588,20.7,31.1,0.7523 +3/18/04,20:00:00,5.1,1667,200,26.0,1453,276,684,176,2051,1569,18.6,36.2,0.7676 +3/18/04,21:00:00,3.2,1430,200,14.1,1121,178,814,135,1732,1322,16.0,48.4,0.8768 +3/18/04,22:00:00,2.1,1333,200,10.3,989,129,885,121,1621,1194,14.5,57.8,0.9456 +3/18/04,23:00:00,1.7,1262,200,8.3,911,95,948,99,1545,1062,13.1,64.2,0.9606 +3/19/04,0:00:00,2,1287,200,8.9,936,126,918,106,1606,1040,12.0,69.7,0.9735 +3/19/04,1:00:00,1.6,1134,200,6.6,840,103,1021,96,1501,917,11.9,71.1,0.9915 +3/19/04,2:00:00,0.9,999,200,3.6,688,48,1229,63,1380,594,12.5,69.3,1.0029 +3/19/04,3:00:00,0.7,961,200,2.5,622,200,1382,200,1301,430,12.5,67.5,0.9768 +3/19/04,4:00:00,200,934,200,1.8,569,20,1440,32,1280,397,12.3,67.7,0.9665 +3/19/04,5:00:00,0.5,913,200,1.3,525,18,1620,28,1260,370,12.5,66.8,0.9620 +3/19/04,6:00:00,0.7,969,200,2.3,607,56,1373,61,1324,438,12.3,66.3,0.9487 +3/19/04,7:00:00,1.5,1182,200,6.7,845,115,1054,99,1539,704,12.4,64.9,0.9332 +3/19/04,8:00:00,4.8,1740,200,22.8,1372,320,671,157,2144,1476,13.0,61.6,0.9209 +3/19/04,9:00:00,6.2,1819,200,31.3,1582,357,575,166,2456,1716,13.6,58.8,0.9120 +3/19/04,10:00:00,4,1427,200,19.2,1275,253,701,149,1980,1398,13.9,57.1,0.9032 +3/19/04,11:00:00,3.3,1390,200,16.4,1191,218,759,135,1879,1268,14.5,54.4,0.8917 +3/19/04,12:00:00,2.8,1283,200,14.0,1117,192,813,127,1751,1188,15.5,50.1,0.8744 +3/19/04,13:00:00,3,1304,200,15.3,1157,176,793,122,1801,1171,16.1,47.6,0.8655 +3/19/04,14:00:00,3.3,1364,200,16.7,1202,198,771,135,1875,1250,16.3,47.1,0.8640 +3/19/04,15:00:00,3.5,1410,200,19.0,1268,212,724,139,1957,1344,16.4,47.5,0.8791 +3/19/04,16:00:00,4,1476,200,19.4,1280,249,708,147,1955,1378,16.1,50.2,0.9094 
+3/19/04,17:00:00,4.6,1522,200,20.9,1320,270,690,158,2005,1435,16.0,50.7,0.9142 +3/19/04,18:00:00,4.1,1442,200,20.2,1301,231,704,146,1952,1372,15.8,50.3,0.8989 +3/19/04,19:00:00,4.5,1469,200,21.7,1344,251,677,144,1997,1471,15.5,50.6,0.8840 +3/19/04,20:00:00,3.9,1467,200,19.8,1292,210,698,138,1942,1439,15.4,51.4,0.8956 +3/19/04,21:00:00,4,1421,200,16.7,1202,249,748,143,1854,1429,15.2,53.1,0.9117 +3/19/04,22:00:00,2.2,1175,200,9.1,945,143,904,116,1604,1081,14.7,57.6,0.9573 +3/19/04,23:00:00,2.1,1215,200,8.3,912,127,948,109,1547,993,14.2,58.3,0.9380 +3/20/04,0:00:00,1.7,1127,200,5.8,802,104,1064,92,1447,837,13.8,57.9,0.9085 +3/20/04,1:00:00,1.6,1090,200,5.2,773,86,1105,83,1429,761,13.9,55.9,0.8842 +3/20/04,2:00:00,1.3,1017,200,4.1,718,74,1182,81,1382,650,13.9,55.6,0.8765 +3/20/04,3:00:00,1.3,997,200,3.4,677,200,1252,200,1359,591,13.8,55.1,0.8666 +3/20/04,4:00:00,200,945,200,2.9,646,44,1308,55,1332,505,13.8,54.6,0.8574 +3/20/04,5:00:00,0.8,956,200,3.1,661,42,1260,53,1351,518,13.6,55.1,0.8560 +3/20/04,6:00:00,0.8,966,200,2.5,622,63,1317,67,1342,571,13.6,59.0,0.9129 +3/20/04,7:00:00,1.1,1064,200,4.3,728,115,1161,91,1461,842,13.5,58.4,0.8960 +3/20/04,8:00:00,2.1,1300,200,8.9,935,170,902,113,1623,1145,13.8,58.0,0.9083 +3/20/04,9:00:00,2.4,1348,200,10.5,998,147,834,102,1689,1218,14.6,53.1,0.8765 +3/20/04,10:00:00,2.6,1425,200,12.5,1068,166,776,107,1764,1341,15.0,52.3,0.8881 +3/20/04,11:00:00,2.8,1474,200,12.3,1061,195,768,113,1768,1406,15.4,52.1,0.9046 +3/20/04,12:00:00,2.6,1490,200,11.7,1039,182,782,108,1757,1422,16.3,50.0,0.9158 +3/20/04,13:00:00,2.6,1495,200,11.7,1040,168,809,105,1735,1373,17.1,47.9,0.9281 +3/20/04,14:00:00,2.1,1376,200,9.3,953,125,900,93,1636,1153,19.0,42.9,0.9370 +3/20/04,15:00:00,1.7,1305,200,7.6,884,95,980,87,1573,927,19.5,42.3,0.9476 +3/20/04,16:00:00,1.6,1283,200,6.7,844,79,1036,76,1539,766,19.5,42.3,0.9493 +3/20/04,17:00:00,2.1,1392,200,9.7,967,119,930,94,1649,941,19.1,43.8,0.9556 
+3/20/04,18:00:00,2.3,1452,200,12.4,1063,142,856,110,1774,1146,18.5,46.3,0.9744 +3/20/04,19:00:00,3.5,1633,200,16.6,1199,215,734,136,1919,1417,17.4,51.6,1.0159 +3/20/04,20:00:00,3.9,1625,200,16.4,1191,229,733,139,1885,1442,17.1,51.3,0.9914 +3/20/04,21:00:00,3.3,1535,200,13.7,1108,206,785,125,1820,1323,16.4,53.8,0.9964 +3/20/04,22:00:00,2.3,1323,200,9.9,973,147,864,111,1712,1125,16.2,55.4,1.0132 +3/20/04,23:00:00,2.1,1309,200,8.9,937,122,905,97,1672,1036,15.9,57.7,1.0356 +3/21/04,0:00:00,2.8,1456,200,10.6,999,175,835,107,1714,1202,15.7,58.6,1.0396 +3/21/04,1:00:00,2.1,1300,200,7.4,872,133,941,90,1577,1054,15.1,60.9,1.0398 +3/21/04,2:00:00,1.6,1203,200,6.2,823,89,1004,84,1526,903,15.6,56.6,0.9951 +3/21/04,3:00:00,1.6,1195,200,6.5,836,200,987,200,1554,885,15.2,57.9,0.9941 +3/21/04,4:00:00,200,1117,200,5.8,802,85,1029,87,1539,818,15.4,57.8,1.0021 +3/21/04,5:00:00,1.1,1062,200,4.0,711,55,1136,66,1461,700,14.9,60.6,1.0196 +3/21/04,6:00:00,0.7,947,200,2.1,594,24,1300,36,1390,580,14.5,63.0,1.0316 +3/21/04,7:00:00,0.8,1065,200,3.6,687,72,1143,62,1492,815,14.2,66.2,1.0630 +3/21/04,8:00:00,1.2,1155,200,4.4,734,122,1062,91,1528,1035,14.5,65.3,1.0704 +3/21/04,9:00:00,1.4,1220,200,5.4,782,111,996,89,1560,1067,15.3,62.1,1.0730 +3/21/04,10:00:00,1.7,1301,200,7.1,862,123,914,93,1633,1114,16.3,59.0,1.0868 +3/21/04,11:00:00,1.8,1279,200,7.3,870,106,911,81,1646,1000,18.9,49.6,1.0700 +3/21/04,12:00:00,1.8,1267,200,7.6,883,103,953,82,1617,906,19.4,46.1,1.0249 +3/21/04,13:00:00,1.9,1239,200,6.8,848,102,981,80,1568,822,20.8,41.6,1.0091 +3/21/04,14:00:00,1.3,1128,200,4.6,742,48,1144,48,1507,610,21.4,40.3,1.0126 +3/21/04,15:00:00,1.6,1243,200,8.4,916,63,962,61,1669,657,21.2,39.3,0.9791 +3/21/04,16:00:00,1.9,1208,200,6.9,853,98,1060,89,1506,647,20.3,38.0,0.8942 +3/21/04,17:00:00,2.3,1306,200,9.0,938,132,961,105,1607,762,19.4,41.3,0.9206 +3/21/04,18:00:00,3.8,1546,200,15.1,1151,173,769,116,1874,1122,18.5,43.8,0.9252 
+3/21/04,19:00:00,3.5,1501,200,12.6,1071,185,823,119,1744,1130,17.8,48.1,0.9689 +3/21/04,20:00:00,4.3,1605,200,15.1,1153,266,769,144,1840,1637,17.9,46.9,0.9520 +3/21/04,21:00:00,2.8,1316,200,9.9,974,188,883,123,1673,1272,17.4,49.1,0.9688 +3/21/04,22:00:00,1.9,1195,200,8.0,898,122,933,105,1594,1098,17.0,51.7,0.9914 +3/21/04,23:00:00,1.9,1211,200,7.9,896,112,920,93,1624,1066,16.4,55.1,1.0198 +3/22/04,0:00:00,1.7,1161,200,6.1,815,93,995,86,1582,909,16.1,60.0,1.0919 +3/22/04,1:00:00,1.5,1095,200,5.1,767,74,1050,76,1547,818,15.8,60.5,1.0805 +3/22/04,2:00:00,0.6,897,200,1.7,563,23,1417,33,1355,472,16.3,57.0,1.0491 +3/22/04,3:00:00,0.4,842,200,0.7,468,200,1813,200,1274,394,16.9,53.9,1.0309 +3/22/04,4:00:00,200,854,200,0.8,481,17,1756,27,1304,396,16.1,55.9,1.0153 +3/22/04,5:00:00,0.3,845,200,0.7,472,15,1786,23,1278,378,16.5,53.9,1.0049 +3/22/04,6:00:00,0.6,942,200,2.0,586,41,1480,46,1373,432,16.4,54.0,0.9979 +3/22/04,7:00:00,1.2,1090,200,5.1,768,86,1072,76,1538,699,15.7,57.6,1.0195 +3/22/04,8:00:00,3.6,1514,200,17.7,1230,226,732,116,1996,1252,15.8,53.7,0.9571 +3/22/04,9:00:00,3.7,1379,200,18.4,1252,214,711,115,1962,1237,17.4,42.1,0.8281 +3/22/04,10:00:00,1.8,1074,200,6.9,854,115,1032,83,1484,772,17.9,38.4,0.7793 +3/22/04,11:00:00,1.6,1090,200,7.3,868,124,1026,87,1509,766,18.5,36.7,0.7764 +3/22/04,12:00:00,1.9,1110,200,9.6,964,122,947,90,1577,829,21.0,29.6,0.7263 +3/22/04,13:00:00,2,1102,200,10.4,994,107,955,87,1599,783,21.3,28.1,0.7067 +3/22/04,14:00:00,2.2,1100,200,11.2,1021,133,952,99,1607,822,21.3,26.1,0.6539 +3/22/04,15:00:00,2.1,1094,200,10.7,1003,130,965,99,1574,813,21.3,26.8,0.6681 +3/22/04,16:00:00,2.3,1076,451,11.6,1034,150,964,101,1544,813,21.3,24.2,0.6060 +3/22/04,17:00:00,2.6,1152,185,12.4,1062,138,928,103,1606,850,20.2,28.5,0.6682 +3/22/04,18:00:00,3.8,1407,426,18.3,1249,213,720,129,1932,1276,17.0,43.1,0.8264 +3/22/04,19:00:00,4.4,1438,672,21.0,1325,232,678,143,1993,1401,15.6,47.4,0.8347 
+3/22/04,20:00:00,4.5,1425,624,21.0,1325,238,678,146,1965,1356,15.3,47.4,0.8178 +3/22/04,21:00:00,3.3,1279,325,14.5,1131,180,775,124,1758,1243,14.5,49.7,0.8177 +3/22/04,22:00:00,2,1132,143,8.4,916,128,920,113,1561,1083,13.9,52.0,0.8231 +3/22/04,23:00:00,1.1,1006,89,4.3,729,65,1142,80,1370,795,14.5,49.4,0.8106 +3/23/04,0:00:00,0.9,982,73,3.8,700,53,1204,72,1359,735,13.8,51.4,0.8071 +3/23/04,1:00:00,0.8,969,48,2.7,636,47,1271,64,1309,684,14.0,51.3,0.8163 +3/23/04,2:00:00,0.7,941,47,2.5,621,51,1298,70,1330,714,13.0,56.8,0.8444 +3/23/04,3:00:00,0.5,892,36,1.2,522,200,1522,200,1285,456,11.0,70.0,0.9215 +3/23/04,4:00:00,200,855,25,0.8,478,20,1701,31,1261,401,10.1,76.8,0.9483 +3/23/04,5:00:00,0.3,834,19,0.6,459,13,1908,22,1212,361,8.8,80.5,0.9147 +3/23/04,6:00:00,0.5,909,26,1.8,567,33,1542,47,1287,418,9.6,75.4,0.9004 +3/23/04,7:00:00,1.1,1061,60,4.6,745,103,1178,88,1429,616,9.8,75.0,0.9093 +3/23/04,8:00:00,3,1441,248,14.2,1124,224,851,130,1829,1089,10.0,74.2,0.9117 +3/23/04,9:00:00,3.8,1498,535,21.0,1325,244,684,132,2095,1329,11.2,69.7,0.9232 +3/23/04,10:00:00,3.4,1343,340,16.2,1185,237,760,132,1857,1311,12.0,62.9,0.8803 +3/23/04,11:00:00,3,1308,325,15.5,1166,220,792,131,1794,1236,12.8,55.8,0.8210 +3/23/04,12:00:00,2.7,1209,267,13.4,1097,207,837,129,1746,1147,13.1,54.8,0.8234 +3/23/04,13:00:00,3.5,1310,325,16.9,1207,208,770,132,1847,1301,13.8,53.4,0.8351 +3/23/04,14:00:00,2.8,1171,268,12.4,1064,160,881,110,1639,1097,16.0,42.4,0.7674 +3/23/04,15:00:00,1.9,1112,136,10.1,980,131,924,99,1589,900,14.1,49.8,0.7971 +3/23/04,16:00:00,2,1132,118,8.7,927,128,944,100,1570,781,12.0,61.1,0.8537 +3/23/04,17:00:00,3,1225,276,13.8,1111,176,818,122,1719,1013,13.6,54.6,0.8465 +3/23/04,18:00:00,1.9,1065,151,9.3,953,107,962,97,1514,790,13.7,49.5,0.7695 +3/23/04,19:00:00,2.7,1236,236,13.0,1084,131,854,112,1688,1059,11.7,57.1,0.7822 +3/23/04,20:00:00,3.3,1309,220,14.3,1126,166,812,128,1722,1205,11.5,58.2,0.7867 +3/23/04,21:00:00,1.7,1089,87,7.6,884,89,1006,101,1487,975,12.6,54.2,0.7881 
+3/23/04,22:00:00,1.3,1000,81,5.3,776,81,1141,96,1375,896,12.0,54.9,0.7664 +3/23/04,23:00:00,1.3,1041,74,5.4,782,78,1105,86,1425,911,11.0,61.4,0.8057 +3/24/04,0:00:00,1.2,1045,55,4.9,756,70,1137,85,1411,902,11.1,62.2,0.8199 +3/24/04,1:00:00,1.1,997,74,4.6,744,76,1151,79,1405,832,10.8,62.5,0.8082 +3/24/04,2:00:00,0.9,948,35,2.7,634,43,1320,61,1326,715,11.3,60.0,0.8025 +3/24/04,3:00:00,0.6,878,29,1.6,551,200,1493,200,1272,683,10.6,61.6,0.7856 +3/24/04,4:00:00,200,908,29,1.7,559,44,1477,66,1287,721,10.2,62.7,0.7813 +3/24/04,5:00:00,0.6,892,20,1.2,522,34,1572,54,1255,648,10.9,58.6,0.7628 +3/24/04,6:00:00,0.8,954,32,1.7,561,43,1496,57,1255,612,10.8,57.2,0.7382 +3/24/04,7:00:00,1.2,1045,84,4.8,751,79,1186,75,1403,800,10.4,59.2,0.7455 +3/24/04,8:00:00,3.1,1383,245,13.8,1109,174,832,97,1706,1173,11.0,57.3,0.7501 +3/24/04,9:00:00,4.4,1597,553,23.7,1395,235,652,110,2080,1519,11.2,56.9,0.7565 +3/24/04,10:00:00,3.4,1318,450,14.2,1124,218,762,113,1770,1287,12.1,54.9,0.7734 +3/24/04,11:00:00,1.7,1138,118,8.0,898,136,950,88,1505,966,15.1,44.3,0.7535 +3/24/04,12:00:00,1.5,1098,118,6.7,842,117,1012,85,1440,828,15.6,40.6,0.7146 +3/24/04,13:00:00,1.7,1118,144,8.7,927,119,959,96,1477,825,16.9,34.6,0.6597 +3/24/04,14:00:00,2.2,1170,159,10.8,1009,126,879,106,1566,872,16.6,35.6,0.6697 +3/24/04,15:00:00,200,1112,122,6.7,842,100,1015,91,1494,680,11.3,63.9,0.8554 +3/24/04,16:00:00,2.7,1390,230,11.4,1028,197,829,126,1730,1002,9.6,81.1,0.9720 +3/24/04,17:00:00,2.8,1353,320,13.4,1096,195,774,130,1783,1159,11.7,69.1,0.9497 +3/24/04,18:00:00,2.7,1260,263,12.0,1050,156,828,111,1722,979,10.7,68.5,0.8794 +3/24/04,19:00:00,4.5,1501,556,20.2,1302,247,681,134,2008,1311,10.7,67.3,0.8643 +3/24/04,20:00:00,3.5,1273,458,14.4,1130,190,795,136,1744,1071,10.8,65.7,0.8486 +3/24/04,21:00:00,2.6,1179,214,9.4,955,169,928,125,1571,899,10.3,66.3,0.8336 +3/24/04,22:00:00,1.7,1047,97,5.9,809,118,1064,108,1435,766,10.1,66.9,0.8248 +3/24/04,23:00:00,1.3,973,79,4.5,739,81,1186,87,1354,589,10.4,61.1,0.7717 
+3/25/04,0:00:00,1.3,997,81,4.5,737,80,1168,89,1341,631,10.3,59.9,0.7516 +3/25/04,1:00:00,1,970,52,2.5,621,42,1400,58,1249,479,11.0,58.4,0.7686 +3/25/04,2:00:00,0.7,902,38,1.8,568,38,1450,54,1253,479,9.1,68.1,0.7890 +3/25/04,3:00:00,0.7,921,35,1.4,538,200,1564,200,1224,417,9.5,68.2,0.8091 +3/25/04,4:00:00,0.5,858,29,0.9,490,18,1707,28,1212,378,9.3,72.3,0.8496 +3/25/04,5:00:00,0.5,878,21,0.6,457,12,1935,20,1158,344,9.9,66.8,0.8152 +3/25/04,6:00:00,0.6,893,46,1.4,540,43,1604,51,1268,413,7.6,83.2,0.8695 +3/25/04,7:00:00,1.1,1038,55,3.8,697,84,1260,82,1383,605,8.1,79.0,0.8544 +3/25/04,8:00:00,2.7,1320,271,11.6,1035,184,902,112,1721,991,8.7,74.6,0.8429 +3/25/04,9:00:00,3.5,1353,434,17.8,1235,202,740,119,1937,1175,8.6,75.1,0.8415 +3/25/04,10:00:00,2.3,1133,300,8.8,933,133,969,99,1552,923,9.7,69.7,0.8389 +3/25/04,11:00:00,1.6,1066,116,6.8,849,130,1039,98,1468,829,11.0,64.4,0.8447 +3/25/04,12:00:00,1.3,1027,95,5.9,809,106,1081,88,1442,733,13.6,53.0,0.8209 +3/25/04,13:00:00,2,1106,211,8.9,935,132,958,96,1570,819,14.2,48.4,0.7780 +3/25/04,14:00:00,1.9,1038,168,8.2,907,106,976,81,1492,723,14.2,46.3,0.7438 +3/25/04,15:00:00,1.9,1084,154,8.6,924,125,986,92,1501,786,16.0,41.3,0.7462 +3/25/04,16:00:00,2.2,1105,267,10.1,982,138,939,105,1545,850,17.7,35.4,0.7094 +3/25/04,17:00:00,2,1102,143,9.4,957,120,968,88,1527,820,17.1,37.4,0.7225 +3/25/04,18:00:00,2.9,1240,374,14.6,1137,158,836,103,1748,991,14.9,40.5,0.6835 +3/25/04,19:00:00,5.2,1443,797,24.6,1418,253,666,141,2060,1454,13.4,47.8,0.7293 +3/25/04,20:00:00,4.6,1389,698,21.6,1341,231,692,133,1982,1488,12.6,52.5,0.7642 +3/25/04,21:00:00,2.5,1183,234,10.3,989,150,904,119,1539,1205,12.2,53.2,0.7555 +3/25/04,22:00:00,1.5,1014,104,5.7,798,99,1083,106,1376,909,10.6,56.1,0.7140 +3/25/04,23:00:00,1.2,979,67,4.5,736,75,1154,92,1324,769,9.9,59.5,0.7276 +3/26/04,0:00:00,1.7,1048,88,5.5,787,93,1098,97,1373,835,9.6,61.1,0.7323 +3/26/04,1:00:00,1.4,1024,79,4.8,753,79,1131,91,1357,793,9.7,61.6,0.7425 
+3/26/04,2:00:00,1.2,974,61,3.6,690,67,1208,83,1313,733,9.4,62.6,0.7398 +3/26/04,3:00:00,0.6,862,43,1.7,561,200,1449,200,1219,502,8.2,68.5,0.7455 +3/26/04,4:00:00,0.7,925,40,2.2,602,45,1398,59,1263,562,8.0,69.4,0.7459 +3/26/04,5:00:00,0.8,952,52,3.0,654,72,1210,73,1312,832,7.7,72.4,0.7655 +3/26/04,6:00:00,0.9,1005,64,4.0,713,103,1130,78,1363,924,8.0,71.8,0.7711 +3/26/04,7:00:00,1.6,1122,88,6.7,842,132,1015,90,1438,974,8.5,66.3,0.7375 +3/26/04,8:00:00,3.4,1351,375,16.7,1201,239,783,110,1828,1307,9.8,59.6,0.7214 +3/26/04,9:00:00,3.8,1408,592,19.3,1278,275,682,114,1928,1481,11.2,57.0,0.7544 +3/26/04,10:00:00,3.1,1304,357,14.8,1142,232,772,119,1747,1405,14.8,44.1,0.7407 +3/26/04,11:00:00,2.7,1207,296,13.4,1098,180,826,121,1664,1195,16.8,35.3,0.6729 +3/26/04,12:00:00,2,1099,181,11.0,1014,112,948,97,1503,868,19.2,27.1,0.5976 +3/26/04,13:00:00,2.3,1106,211,12.5,1068,116,915,99,1535,874,19.1,26.2,0.5741 +3/26/04,14:00:00,1.9,1009,199,8.4,916,103,1015,85,1391,738,16.3,30.8,0.5683 +3/26/04,15:00:00,1.3,962,81,5.3,776,89,1152,75,1279,550,14.2,36.8,0.5922 +3/26/04,16:00:00,1.9,1093,143,8.8,933,112,976,87,1454,777,13.4,41.0,0.6292 +3/26/04,17:00:00,2.3,1180,247,11.2,1023,127,886,102,1588,966,14.2,43.3,0.6982 +3/26/04,18:00:00,2.4,1168,239,11.6,1037,125,868,105,1580,990,12.1,45.6,0.6403 +3/26/04,19:00:00,2.7,1168,267,12.4,1063,120,876,98,1540,944,10.4,48.0,0.6046 +3/26/04,20:00:00,2.6,1140,261,10.6,1000,120,916,93,1490,962,10.0,51.1,0.6280 +3/26/04,21:00:00,1.5,1001,97,6.0,811,99,1097,91,1308,778,9.8,50.8,0.6172 +3/26/04,22:00:00,1.2,953,66,4.6,743,79,1185,85,1252,690,9.7,50.4,0.6063 +3/26/04,23:00:00,1.1,972,60,4.1,715,66,1237,74,1258,644,9.9,50.2,0.6103 +3/27/04,0:00:00,1.5,1021,77,5.2,776,94,1144,87,1307,753,9.4,53.4,0.6300 +3/27/04,1:00:00,1,940,57,3.2,664,70,1290,79,1232,646,8.6,57.8,0.6488 +3/27/04,2:00:00,1.2,1046,65,4.5,736,75,1142,78,1312,838,8.1,62.1,0.6746 +3/27/04,3:00:00,1.1,1026,59,3.8,699,200,1191,200,1292,851,7.9,63.5,0.6777 
+3/27/04,4:00:00,200,983,48,3.8,698,57,1209,63,1296,801,7.4,64.8,0.6683 +3/27/04,5:00:00,0.8,915,27,1.9,578,32,1417,48,1212,665,8.2,58.9,0.6444 +3/27/04,6:00:00,0.9,935,25,2.4,615,54,1373,59,1250,717,6.3,66.6,0.6410 +3/27/04,7:00:00,1.1,976,42,3.3,669,68,1270,65,1266,795,7.2,62.9,0.6396 +3/27/04,8:00:00,1.5,1124,78,6.7,842,98,1050,71,1418,936,7.2,63.9,0.6500 +3/27/04,9:00:00,1.8,1163,128,8.5,919,128,922,84,1530,1076,11.1,50.5,0.6654 +3/27/04,10:00:00,2.1,1167,184,9.7,966,129,942,93,1520,1016,15.5,36.1,0.6313 +3/27/04,11:00:00,2.1,1121,156,9.4,955,130,992,99,1451,834,17.8,29.2,0.5920 +3/27/04,12:00:00,1.9,1071,176,9.0,940,111,994,93,1449,756,17.6,30.1,0.6013 +3/27/04,13:00:00,2.1,1065,232,10.0,979,106,950,96,1467,766,19.1,26.8,0.5861 +3/27/04,14:00:00,2.5,1157,305,12.6,1072,137,888,114,1596,905,17.7,28.6,0.5735 +3/27/04,15:00:00,1.9,1046,150,7.6,883,113,1051,101,1357,765,17.8,28.4,0.5737 +3/27/04,16:00:00,2.2,1147,188,11.8,1042,122,934,108,1538,830,18.7,26.0,0.5555 +3/27/04,17:00:00,2.3,1119,221,11.2,1022,137,916,117,1508,853,17.1,31.3,0.6038 +3/27/04,18:00:00,2.7,1210,219,12.4,1062,165,886,125,1586,981,15.8,36.1,0.6440 +3/27/04,19:00:00,3,1306,306,12.9,1079,196,827,133,1621,1286,14.7,40.1,0.6657 +3/27/04,20:00:00,2.8,1258,270,12.2,1057,174,841,119,1633,1244,13.6,44.9,0.6991 +3/27/04,21:00:00,2.2,1148,231,8.8,930,140,948,107,1475,1056,12.5,48.1,0.6946 +3/27/04,22:00:00,1.6,1108,125,6.8,848,102,1023,96,1421,1003,11.8,50.8,0.7042 +3/27/04,23:00:00,2.1,1231,122,8.6,924,130,906,105,1483,1245,10.9,56.1,0.7326 +3/28/04,0:00:00,2.3,1293,161,8.9,937,121,930,102,1464,1200,10.9,55.9,0.7277 +3/28/04,1:00:00,2.3,1274,101,8.3,912,111,939,97,1451,1188,10.3,58.1,0.7278 +3/28/04,2:00:00,1.7,1135,95,6.3,827,87,1013,79,1395,1070,9.3,62.2,0.7294 +3/28/04,3:00:00,2.2,1151,129,8.3,911,200,935,200,1501,1075,8.2,66.7,0.7276 +3/28/04,4:00:00,1.3,1010,96,5.1,769,77,1110,69,1396,895,8.3,64.9,0.7124 +3/28/04,5:00:00,0.8,870,54,2.6,626,39,1325,52,1284,755,9.6,58.5,0.7000 
+3/28/04,6:00:00,1.1,983,63,3.6,688,77,1251,62,1342,806,7.7,66.4,0.6990 +3/28/04,7:00:00,1.4,995,72,3.6,691,91,1219,69,1321,836,7.4,68.0,0.7043 +3/28/04,8:00:00,1.3,1024,91,4.7,747,105,1136,75,1382,897,9.7,58.6,0.7035 +3/28/04,9:00:00,1.9,1163,127,7.5,880,133,959,83,1486,1073,12.5,50.1,0.7242 +3/28/04,10:00:00,2.3,1218,193,9.0,938,128,918,94,1501,1088,16.8,36.9,0.7018 +3/28/04,11:00:00,2.3,1210,188,9.5,958,126,933,96,1514,1036,19.2,31.3,0.6887 +3/28/04,12:00:00,1.8,1114,151,7.7,887,92,1020,86,1438,768,21.3,26.4,0.6607 +3/28/04,13:00:00,1.4,995,103,5.7,796,66,1159,72,1287,587,21.2,23.1,0.5768 +3/28/04,14:00:00,1,927,55,3.8,700,50,1285,53,1189,452,18.6,25.0,0.5319 +3/28/04,15:00:00,1.4,1002,104,4.8,754,70,1189,65,1260,470,18.0,27.3,0.5594 +3/28/04,16:00:00,1.3,987,116,4.3,727,73,1213,70,1244,502,18.2,28.3,0.5848 +3/28/04,17:00:00,1.3,993,93,4.1,714,77,1217,75,1238,516,17.0,31.4,0.6048 +3/28/04,18:00:00,1.4,1012,93,4.7,746,86,1174,86,1301,555,15.6,35.0,0.6170 +3/28/04,19:00:00,1.9,1114,155,6.2,823,93,1068,94,1357,671,14.1,40.2,0.6438 +3/28/04,20:00:00,1.8,1093,115,5.5,789,105,1102,103,1329,788,13.4,43.2,0.6623 +3/28/04,21:00:00,1.1,968,75,3.3,669,67,1261,81,1246,538,12.3,46.8,0.6686 +3/28/04,22:00:00,1.1,960,65,3.2,666,63,1266,80,1224,544,11.9,46.5,0.6449 +3/28/04,23:00:00,1.1,942,57,2.9,648,63,1300,77,1204,500,11.5,46.6,0.6319 +3/29/04,0:00:00,0.9,888,40,2.2,598,46,1396,58,1156,417,10.6,48.0,0.6135 +3/29/04,1:00:00,0.6,818,27,1.3,524,21,1588,31,1071,341,10.7,45.8,0.5895 +3/29/04,2:00:00,0.5,840,23,1.1,509,22,1643,38,1076,352,10.4,46.6,0.5901 +3/29/04,3:00:00,0.7,894,28,1.3,524,200,1557,200,1095,397,10.1,47.8,0.5913 +3/29/04,4:00:00,0.6,864,21,1.2,516,39,1496,59,1153,502,11.3,46.9,0.6280 +3/29/04,5:00:00,0.7,898,33,1.7,559,55,1441,70,1180,648,10.1,51.2,0.6341 +3/29/04,6:00:00,0.9,946,40,2.9,643,76,1309,75,1232,760,9.9,50.7,0.6167 +3/29/04,7:00:00,2.9,1309,279,14.3,1125,181,844,103,1687,1183,8.1,56.8,0.6150 
+3/29/04,8:00:00,4.1,1326,743,19.7,1289,259,714,134,1806,1431,10.9,44.1,0.5748 +3/29/04,9:00:00,1.5,966,147,5.5,786,118,1111,98,1258,705,11.8,37.6,0.5184 +3/29/04,10:00:00,1.5,975,97,5.6,793,119,1133,97,1254,637,13.9,33.2,0.5222 +3/29/04,11:00:00,1.5,988,118,5.8,803,123,1105,98,1293,658,15.6,30.1,0.5285 +3/29/04,12:00:00,1.4,971,91,5.5,786,92,1129,78,1272,546,16.5,29.0,0.5406 +3/29/04,13:00:00,1.6,1010,146,6.5,833,91,1070,83,1332,561,17.5,27.9,0.5517 +3/29/04,14:00:00,1.5,984,139,5.5,790,103,1131,92,1268,532,17.9,26.8,0.5446 +3/29/04,15:00:00,1.4,962,155,5.2,776,102,1131,81,1247,535,18.2,26.1,0.5387 +3/29/04,16:00:00,1.5,1012,128,5.8,801,94,1098,90,1289,525,18.0,27.1,0.5522 +3/29/04,17:00:00,1.9,1058,166,7.9,896,98,996,95,1358,603,17.0,28.9,0.5558 +3/29/04,18:00:00,2.5,1173,299,10.2,985,126,905,114,1477,802,16.1,31.3,0.5680 +3/29/04,19:00:00,2.1,1127,163,8.2,907,108,957,106,1404,796,14.5,35.4,0.5839 +3/29/04,20:00:00,1.6,1020,154,5.7,797,95,1066,99,1302,689,13.2,37.9,0.5733 +3/29/04,21:00:00,1.2,949,80,3.8,700,78,1221,90,1178,553,13.1,34.4,0.5170 +3/29/04,22:00:00,1.1,930,58,4.0,710,61,1278,82,1137,509,13.5,28.5,0.4405 +3/29/04,23:00:00,1,900,55,2.8,643,55,1339,74,1089,455,12.4,31.9,0.4583 +3/30/04,0:00:00,1,899,33,2.6,630,57,1418,76,1050,425,13.1,26.8,0.4023 +3/30/04,1:00:00,0.7,866,33,1.8,569,41,1468,66,1055,436,11.6,33.6,0.4595 +3/30/04,2:00:00,0.7,900,32,1.7,563,46,1402,73,1130,618,10.5,42.9,0.5442 +3/30/04,3:00:00,0.8,949,25,1.4,540,200,1457,200,1104,747,11.7,38.7,0.5298 +3/30/04,4:00:00,200,897,29,1.3,529,41,1462,69,1101,693,12.2,37.2,0.5277 +3/30/04,5:00:00,0.7,956,26,2.3,605,52,1341,71,1117,691,12.9,33.9,0.5037 +3/30/04,6:00:00,1.1,1054,86,5.3,779,111,1080,98,1266,1013,11.3,40.3,0.5376 +3/30/04,7:00:00,2.6,1343,294,13.4,1096,191,839,115,1587,1311,11.9,38.4,0.5341 +3/30/04,8:00:00,4,1584,664,23.8,1399,244,642,130,1947,1675,13.3,35.6,0.5416 +3/30/04,9:00:00,4.2,1556,695,21.5,1337,283,674,150,1852,1689,16.3,28.8,0.5305 
+3/30/04,10:00:00,4.7,1565,735,21.0,1324,320,695,159,1872,1688,17.9,28.2,0.5741 +3/30/04,11:00:00,3.9,1429,649,18.4,1251,249,734,146,1754,1548,18.3,26.6,0.5529 +3/30/04,12:00:00,3.7,1419,586,18.9,1267,219,742,138,1763,1490,20.8,22.5,0.5463 +3/30/04,13:00:00,3.4,1373,546,17.1,1213,200,791,128,1711,1354,22.4,20.0,0.5352 +3/30/04,14:00:00,2.2,1185,245,10.4,992,138,942,97,1448,1014,20.9,21.7,0.5295 +3/30/04,15:00:00,1.9,1122,178,8.0,897,118,991,91,1372,831,19.1,26.5,0.5790 +3/30/04,16:00:00,1.6,1085,130,6.2,820,99,1060,81,1325,699,18.1,30.6,0.6295 +3/30/04,17:00:00,2.1,1249,151,9.7,968,112,874,99,1527,926,17.3,34.7,0.6788 +3/30/04,18:00:00,2.2,1236,272,9.6,962,117,853,101,1532,928,15.8,40.4,0.7228 +3/30/04,19:00:00,2.7,1334,301,11.9,1047,129,815,106,1638,1038,15.9,42.0,0.7541 +3/30/04,20:00:00,2.4,1264,237,8.7,928,132,884,102,1534,1035,15.9,43.5,0.7790 +3/30/04,21:00:00,1.3,1073,95,4.7,746,73,1072,84,1360,789,15.0,46.1,0.7833 +3/30/04,22:00:00,1.2,1053,68,3.8,701,68,1147,81,1323,692,15.1,44.6,0.7574 +3/30/04,23:00:00,1.5,1098,101,4.7,747,80,1094,91,1360,764,14.9,43.5,0.7302 +3/31/04,0:00:00,1.3,1029,81,3.8,702,68,1180,82,1287,639,14.8,41.0,0.6850 +3/31/04,1:00:00,1,973,50,2.7,631,44,1287,64,1240,541,14.8,40.8,0.6814 +3/31/04,2:00:00,0.9,972,66,2.8,639,47,1222,70,1256,678,14.8,40.8,0.6834 +3/31/04,3:00:00,0.5,832,22,1.0,496,200,1613,200,1066,381,14.2,36.1,0.5823 +3/31/04,4:00:00,0.5,851,18,0.8,477,15,1699,26,1064,343,14.5,36.1,0.5928 +3/31/04,5:00:00,0.6,909,31,1.5,545,44,1473,58,1131,436,14.6,36.0,0.5947 +3/31/04,6:00:00,1,1038,57,3.5,684,85,1220,93,1279,688,14.4,38.5,0.6291 +3/31/04,7:00:00,3.1,1636,342,15.6,1168,207,721,110,1926,1438,13.8,58.2,0.9129 +3/31/04,8:00:00,4.1,1580,644,19.9,1295,230,639,115,1957,1489,13.2,61.1,0.9256 +3/31/04,9:00:00,2.2,1262,216,8.6,924,181,872,125,1576,1093,10.4,74.4,0.9384 +3/31/04,10:00:00,1.7,1197,117,6.5,837,144,957,118,1498,965,11.5,71.0,0.9582 +3/31/04,11:00:00,1.9,1277,156,7.7,886,140,902,109,1579,1004,12.2,70.0,0.9924 
+3/31/04,12:00:00,2.9,1430,332,11.3,1025,204,779,123,1772,1166,12.3,73.3,1.0442 +3/31/04,13:00:00,2.2,1242,232,9.1,944,149,846,114,1638,991,11.3,78.1,1.0411 +3/31/04,14:00:00,2.6,1349,295,11.2,1023,171,804,114,1748,1054,13.6,69.2,1.0746 +3/31/04,15:00:00,3.1,1390,386,13.5,1100,195,755,119,1799,1192,14.9,59.3,0.9974 +3/31/04,16:00:00,2.3,1224,278,10.1,980,152,850,111,1637,1055,15.7,50.9,0.8997 +3/31/04,17:00:00,2.4,1253,263,11.3,1026,128,810,103,1681,1048,15.2,52.9,0.9087 +3/31/04,18:00:00,2.8,1312,322,11.8,1041,121,821,103,1712,1073,15.5,50.8,0.8881 +3/31/04,19:00:00,2.7,1254,226,11.5,1033,142,819,107,1669,1068,14.3,52.9,0.8581 +3/31/04,20:00:00,2.3,1198,221,9.3,953,131,882,105,1588,1019,13.2,56.0,0.8446 +3/31/04,21:00:00,1.5,1060,109,5.5,787,81,1044,91,1418,833,12.6,57.0,0.8315 +3/31/04,22:00:00,1.4,1050,105,5.1,769,60,1078,79,1409,759,12.3,57.8,0.8224 +3/31/04,23:00:00,1.2,1029,84,4.8,754,57,1098,79,1395,749,12.0,58.4,0.8164 +4/1/04,0:00:00,1.6,1143,106,6.3,825,96,986,86,1477,978,12.0,61.6,0.8593 +4/1/04,1:00:00,1.2,1044,100,5.1,770,85,1031,70,1425,944,11.5,63.9,0.8652 +4/1/04,2:00:00,1.1,1034,71,4.1,716,50,1085,55,1405,891,10.7,67.2,0.8630 +4/1/04,3:00:00,0.9,956,72,4.0,713,200,1099,200,1422,849,9.0,73.1,0.8394 +4/1/04,4:00:00,0.7,909,44,2.4,615,57,1237,49,1322,790,10.2,66.6,0.8299 +4/1/04,5:00:00,0.9,996,45,2.9,648,64,1176,50,1340,852,11.0,63.7,0.8325 +4/1/04,6:00:00,1.7,1154,134,7.4,876,153,1002,67,1561,987,9.6,68.8,0.8243 +4/1/04,7:00:00,4.2,1510,505,19.8,1291,342,675,94,1949,1435,9.5,69.6,0.8273 +4/1/04,8:00:00,6.2,1722,1042,31.9,1595,378,539,119,2439,1798,11.9,60.9,0.8455 +4/1/04,9:00:00,4.6,1512,737,21.0,1323,304,631,139,2001,1677,16.2,48.6,0.8892 +4/1/04,10:00:00,2.8,1258,375,11.7,1040,192,800,113,1701,1282,19.2,39.6,0.8693 +4/1/04,11:00:00,1.6,1094,124,6.8,850,130,970,88,1489,871,19.7,38.4,0.8743 +4/1/04,12:00:00,1.7,1129,137,7.4,875,106,974,73,1524,759,20.6,36.7,0.8797 +4/1/04,13:00:00,1.7,1125,179,8.6,924,102,937,73,1542,790,21.8,33.9,0.8771 
+4/1/04,14:00:00,1.7,200,222,200.0,200,99,200,72,200,200,200,200,200 +4/1/04,15:00:00,1.9,200,197,200.0,200,108,200,81,200,200,200,200,200 +4/1/04,16:00:00,2.3,200,319,200.0,200,131,200,93,200,200,200,200,200 +4/1/04,17:00:00,3.3,1308,388,16.4,1193,163,793,106,1193,1187,23.6,27.5,0.7913 +4/1/04,18:00:00,4.2,1529,589,21.2,1331,211,700,132,2028,1399,22.8,32.1,0.8814 +4/1/04,19:00:00,5.5,1592,840,25.0,1429,267,624,164,2089,1644,20.8,36.7,0.8878 +4/1/04,20:00:00,4.9,1536,655,20.2,1302,269,666,161,1922,1599,18.3,40.9,0.8493 +4/1/04,21:00:00,2.5,1192,254,10.8,1007,154,827,124,1604,1223,16.8,44.0,0.8341 +4/1/04,22:00:00,2,1186,188,9.9,976,127,844,113,1570,1200,15.8,48.0,0.8569 +4/1/04,23:00:00,2,1203,120,8.5,920,122,898,103,1529,1118,14.9,52.0,0.8757 +4/2/04,0:00:00,2,1139,157,8.0,899,126,921,104,1514,1067,14.1,54.4,0.8701 +4/2/04,1:00:00,1.3,1072,88,5.6,795,84,986,93,1442,1048,13.3,58.8,0.8922 +4/2/04,2:00:00,1,954,68,3.2,667,58,1180,60,1350,718,13.3,60.9,0.9229 +4/2/04,3:00:00,0.9,951,57,3.3,673,200,1137,200,1381,797,11.8,64.3,0.8869 +4/2/04,4:00:00,200,926,36,2.8,638,37,1195,52,1342,749,11.2,65.9,0.8775 +4/2/04,5:00:00,0.7,968,51,2.6,628,56,1197,66,1372,737,11.3,72.9,0.9738 +4/2/04,6:00:00,1.1,1067,93,4.3,727,91,1055,82,1448,903,11.4,71.8,0.9671 +4/2/04,7:00:00,2.6,1386,284,13.0,1083,159,798,109,1833,1159,11.9,73.7,1.0254 +4/2/04,8:00:00,3.9,1474,486,20.3,1306,190,634,114,2060,1365,12.6,70.2,1.0221 +4/2/04,9:00:00,5,1569,798,20.7,1316,249,634,139,2054,1510,13.6,65.3,1.0136 +4/2/04,10:00:00,3.3,1375,524,15.9,1177,196,706,127,1898,1386,16.9,52.3,1.0013 +4/2/04,11:00:00,2.9,1328,468,14.2,1123,180,738,127,1821,1368,19.4,44.1,0.9855 +4/2/04,12:00:00,3.1,1341,454,15.9,1176,167,748,135,1825,1347,22.9,33.8,0.9322 +4/2/04,13:00:00,3,1297,461,16.1,1183,159,765,130,1773,1242,23.6,31.3,0.9001 +4/2/04,14:00:00,2.7,1303,391,14.7,1139,152,790,126,1761,1138,24.7,30.5,0.9371 +4/2/04,15:00:00,2.7,1281,337,14.0,1116,149,809,121,1703,1080,26.1,27.3,0.9101 
+4/2/04,16:00:00,2.6,1248,297,13.6,1103,143,872,122,1632,1002,27.6,22.8,0.8288 +4/2/04,17:00:00,3.7,1382,588,19.7,1288,185,778,138,1799,1212,26.9,22.3,0.7792 +4/2/04,18:00:00,4.5,1491,721,23.3,1385,223,685,165,1915,1486,24.3,30.1,0.9012 +4/2/04,19:00:00,4.7,1620,710,24.1,1407,215,641,154,2052,1567,21.2,40.1,0.9978 +4/2/04,20:00:00,5.5,1693,787,25.4,1439,271,618,168,2083,1723,19.4,42.5,0.9483 +4/2/04,21:00:00,3,1374,415,14.1,1121,180,771,146,1685,1399,18.5,43.0,0.9074 +4/2/04,22:00:00,1.9,1287,245,10.1,982,115,841,124,1582,1291,17.8,47.0,0.9467 +4/2/04,23:00:00,2.3,1408,294,12.1,1052,149,761,124,1684,1443,16.8,53.2,1.0100 +4/3/04,0:00:00,1.6,1254,139,9.0,941,98,828,106,1566,1270,17.1,51.8,0.9990 +4/3/04,1:00:00,1.3,1141,98,6.3,827,73,936,88,1488,1110,16.2,54.2,0.9933 +4/3/04,2:00:00,1.2,1141,88,5.3,777,69,986,83,1455,1060,15.1,58.2,0.9886 +4/3/04,3:00:00,0.9,1042,66,3.8,697,200,1056,200,1410,965,15.1,57.6,0.9796 +4/3/04,4:00:00,0.8,986,57,3.0,651,60,1145,74,1380,891,14.5,58.0,0.9488 +4/3/04,5:00:00,0.8,1005,57,3.3,673,56,1103,64,1387,905,13.7,61.6,0.9631 +4/3/04,6:00:00,0.9,1041,60,3.5,683,73,1112,69,1398,899,13.9,59.6,0.9408 +4/3/04,7:00:00,2,1272,200,9.5,959,159,846,91,1660,1136,13.3,61.7,0.9399 +4/3/04,8:00:00,3,1436,451,15.2,1154,203,704,96,1872,1391,15.1,55.8,0.9486 +4/3/04,9:00:00,3.1,1472,422,14.3,1126,211,720,106,1816,1460,17.0,50.5,0.9684 +4/3/04,10:00:00,200,1418,200,11.2,1023,200,785,200,1720,1410,19.7,43.3,0.9837 +4/3/04,11:00:00,200,1410,200,11.4,1030,200,781,200,1743,1372,20.7,41.7,1.0092 +4/3/04,12:00:00,200,1438,200,13.4,1096,200,781,200,1779,1266,23.3,35.8,1.0098 +4/3/04,13:00:00,200,1278,200,10.6,1000,200,866,200,1621,1014,23.8,30.5,0.8897 +4/3/04,14:00:00,200,1257,200,10.8,1006,200,915,200,1573,870,24.8,27.0,0.8325 +4/3/04,15:00:00,200,1331,200,11.9,1046,200,882,200,1635,905,25.0,27.5,0.8575 +4/3/04,16:00:00,200,1361,200,13.5,1102,200,834,200,1704,991,25.8,26.5,0.8675 +4/3/04,17:00:00,200,1398,200,15.2,1156,200,812,200,1722,1076,25.1,27.1,0.8496 
+4/3/04,18:00:00,200,1483,200,17.1,1213,200,764,200,1798,1249,23.7,28.6,0.8273 +4/3/04,19:00:00,200,1697,200,26.7,1471,200,625,200,2153,1548,22.1,36.8,0.9681 +4/3/04,20:00:00,200,1459,200,14.2,1124,200,736,200,1823,1300,19.9,47.5,1.0912 +4/3/04,21:00:00,200,1268,200,8.5,918,200,873,200,1602,1029,18.7,50.4,1.0772 +4/3/04,22:00:00,200,1469,200,14.3,1128,200,762,200,1870,1207,18.1,51.3,1.0587 +4/3/04,23:00:00,200,1316,200,9.3,952,200,847,200,1644,1047,17.5,53.8,1.0648 +4/4/04,0:00:00,200,1224,200,7.8,892,200,884,200,1580,923,16.7,56.5,1.0634 +4/4/04,1:00:00,200,1215,200,6.7,843,200,929,200,1551,862,15.9,59.2,1.0635 +4/4/04,2:00:00,200,1115,200,5.4,782,200,980,200,1500,752,15.2,62.4,1.0726 +4/4/04,3:00:00,200,1124,200,5.6,793,200,965,200,1521,791,14.7,65.0,1.0766 +4/4/04,4:00:00,200,1028,200,3.5,682,200,1090,200,1448,697,14.3,65.3,1.0594 +4/4/04,5:00:00,200,1010,200,3.0,650,200,1171,200,1434,664,13.7,66.5,1.0393 +4/4/04,6:00:00,200,1074,200,4.4,731,200,1020,200,1488,891,12.9,69.1,1.0264 +4/4/04,7:00:00,200,1034,200,3.5,684,200,1110,200,1423,813,13.7,64.8,1.0135 +4/4/04,8:00:00,200,1130,200,4.2,720,200,1063,200,1461,828,15.3,59.0,1.0181 +4/4/04,9:00:00,200,1275,200,8.0,900,200,868,200,1634,1057,18.1,49.8,1.0272 +4/4/04,10:00:00,200,1324,200,9.0,939,200,846,200,1658,1011,21.5,40.7,1.0301 +4/4/04,11:00:00,200,1268,200,8.1,902,200,906,200,1589,932,23.3,37.1,1.0457 +4/4/04,12:00:00,200,1272,200,8.6,924,200,895,200,1603,881,25.1,33.8,1.0594 +4/4/04,13:00:00,200,1160,200,7.1,860,200,969,200,1531,646,25.6,32.1,1.0393 +4/4/04,14:00:00,200,1136,200,5.6,792,200,1057,200,1476,536,25.8,31.1,1.0143 +4/4/04,15:00:00,200,1296,200,10.0,977,200,863,200,1642,746,26.1,30.8,1.0253 +4/4/04,16:00:00,200,1345,200,10.5,996,200,841,200,1678,833,24.4,36.0,1.0850 +4/4/04,17:00:00,200,1296,200,10.1,983,200,845,200,1626,802,23.0,36.2,1.0053 +4/4/04,18:00:00,200,1258,200,10.0,977,200,838,200,1636,838,22.1,39.3,1.0336 +4/4/04,19:00:00,200,1420,200,13.0,1085,200,755,200,1756,1051,19.8,44.6,1.0208 
+4/4/04,20:00:00,200,1366,200,10.5,997,200,830,200,1673,1041,18.5,48.9,1.0323 +4/4/04,21:00:00,200,1113,200,5.2,772,200,989,200,1493,646,16.6,56.1,1.0481 +4/4/04,22:00:00,200,1196,200,6.9,853,200,909,200,1556,769,15.8,58.8,1.0464 +4/4/04,23:00:00,200,1188,200,6.4,829,200,930,200,1540,785,15.3,60.8,1.0501 +4/5/04,0:00:00,200,1102,200,4.7,750,200,1007,200,1470,613,14.7,64.4,1.0717 +4/5/04,1:00:00,200,969,200,2.4,611,200,1239,200,1330,430,14.8,63.3,1.0584 +4/5/04,2:00:00,200,983,200,2.2,600,200,1233,200,1375,485,15.0,61.5,1.0418 +4/5/04,3:00:00,200,876,200,1.1,505,200,1456,200,1294,467,15.5,56.9,0.9961 +4/5/04,4:00:00,200,884,200,0.9,489,200,1536,200,1278,436,16.0,54.0,0.9742 +4/5/04,5:00:00,200,1024,200,3.1,658,200,1175,200,1417,682,13.5,63.9,0.9809 +4/5/04,6:00:00,200,1126,200,6.2,822,200,945,200,1541,927,12.9,65.3,0.9688 +4/5/04,7:00:00,200,1415,200,13.2,1089,200,767,200,1795,1192,13.4,62.5,0.9582 +4/5/04,8:00:00,200,1631,200,25.8,1449,200,570,200,2299,1615,14.5,58.8,0.9630 +4/5/04,9:00:00,200,1499,200,18.2,1247,200,647,200,1972,1505,15.3,56.7,0.9773 +4/5/04,10:00:00,3.1,1461,424,14.3,1127,236,703,124,1836,1451,15.6,57.4,1.0075 +4/5/04,11:00:00,3,1430,339,13.7,1107,228,713,121,1853,1409,16.6,55.2,1.0382 +4/5/04,12:00:00,3.2,1498,445,15.7,1171,206,684,114,1942,1434,17.0,54.3,1.0434 +4/5/04,13:00:00,3.1,1454,535,15.4,1162,188,682,114,1915,1457,18.0,51.5,1.0552 +4/5/04,14:00:00,2.5,1402,278,12.6,1069,158,733,106,1793,1325,19.5,47.5,1.0653 +4/5/04,15:00:00,2.1,1327,256,9.8,971,124,803,89,1705,1120,21.6,41.7,1.0606 +4/5/04,16:00:00,2,1273,245,9.6,963,98,832,82,1662,931,22.2,38.3,1.0106 +4/5/04,17:00:00,3.2,1457,458,16.3,1190,160,678,111,1877,1212,20.3,42.8,1.0053 +4/5/04,18:00:00,4.1,1543,710,19.8,1290,194,631,132,2011,1353,19.7,45.4,1.0312 +4/5/04,19:00:00,3.9,1499,678,18.0,1239,174,651,121,1957,1359,18.8,47.5,1.0199 +4/5/04,20:00:00,3.1,1378,436,13.6,1102,157,742,116,1796,1273,17.9,49.4,1.0064 +4/5/04,21:00:00,2,1243,178,8.7,928,109,852,100,1622,1079,17.5,50.9,1.0090 
+4/5/04,22:00:00,1.7,1196,160,7.7,885,85,884,89,1589,973,17.4,51.8,1.0230 +4/5/04,23:00:00,1.6,1137,177,7.3,869,90,890,92,1554,951,17.0,53.9,1.0363 +4/6/04,0:00:00,1,1015,101,4.1,718,61,1065,71,1428,714,16.6,55.4,1.0408 +4/6/04,1:00:00,0.8,978,66,2.7,635,36,1188,51,1382,579,15.9,57.0,1.0224 +4/6/04,2:00:00,0.5,933,40,1.5,547,25,1357,39,1336,542,15.0,59.2,1.0059 +4/6/04,3:00:00,0.5,927,54,1.9,579,200,1267,200,1370,562,14.5,61.4,1.0041 +4/6/04,4:00:00,0.5,908,39,1.8,569,24,1303,35,1360,526,14.1,61.6,0.9883 +4/6/04,5:00:00,0.8,980,62,3.4,677,73,1141,64,1435,659,14.6,58.8,0.9711 +4/6/04,6:00:00,1.7,1236,120,8.5,920,164,880,94,1641,1012,14.4,58.9,0.9594 +4/6/04,7:00:00,4.5,1664,599,24.1,1407,329,594,134,2198,1564,15.3,55.2,0.9538 +4/6/04,8:00:00,6.2,1757,1084,33.7,1636,302,494,129,2560,1806,16.2,52.5,0.9576 +4/6/04,9:00:00,4.4,1524,776,22.0,1350,256,588,125,2096,1604,18.4,46.3,0.9734 +4/6/04,10:00:00,2.9,1329,470,14.1,1119,182,707,107,1823,1288,19.6,41.8,0.9407 +4/6/04,11:00:00,1.6,1167,169,7.6,884,105,881,78,1580,931,20.2,40.1,0.9370 +4/6/04,12:00:00,2.6,1369,304,14.5,1132,136,715,91,1818,1162,20.9,36.6,0.8939 +4/6/04,13:00:00,2.9,1284,405,13.6,1103,167,755,116,1704,1064,20.4,34.7,0.8229 +4/6/04,14:00:00,2.5,1221,297,11.5,1032,166,811,131,1630,913,19.3,35.4,0.7813 +4/6/04,15:00:00,1.9,1096,220,9.2,947,115,872,113,1519,684,18.6,36.9,0.7829 +4/6/04,16:00:00,1.9,1143,226,10.3,988,109,849,98,1590,710,18.8,36.8,0.7883 +4/6/04,17:00:00,2.7,1251,456,14.0,1116,143,760,116,1716,923,17.5,40.6,0.8068 +4/6/04,18:00:00,3,1268,470,15.0,1148,159,728,120,1776,1006,16.8,44.3,0.8428 +4/6/04,19:00:00,3.3,1314,503,15.1,1150,181,726,126,1784,1083,16.3,47.7,0.8746 +4/6/04,20:00:00,3.2,1315,459,14.8,1142,178,724,126,1807,1161,15.6,50.7,0.8947 +4/6/04,21:00:00,1.8,1095,227,8.5,918,91,896,90,1527,800,15.7,49.4,0.8768 +4/6/04,22:00:00,1.3,1027,152,5.9,808,76,996,81,1458,647,15.8,48.6,0.8653 +4/6/04,23:00:00,1.1,1012,94,4.8,752,66,1053,74,1393,649,15.7,47.0,0.8325 
+4/7/04,0:00:00,0.9,948,93,4.0,710,56,1127,67,1352,575,15.7,45.9,0.8121 +4/7/04,1:00:00,0.7,889,49,2.3,603,35,1308,46,1271,455,15.6,44.9,0.7920 +4/7/04,2:00:00,0.4,830,30,1.4,538,21,1475,30,1214,392,15.6,44.4,0.7827 +4/7/04,3:00:00,0.3,801,30,0.7,464,200,1749,200,1154,340,16.1,42.4,0.7718 +4/7/04,4:00:00,0.3,804,30,0.7,468,12,1826,19,1166,320,15.7,45.4,0.8063 +4/7/04,5:00:00,0.3,824,32,0.8,478,22,1705,32,1200,331,15.3,50.1,0.8617 +4/7/04,6:00:00,0.8,1000,59,3.6,688,55,1257,61,1398,426,14.2,58.6,0.9443 +4/7/04,7:00:00,2.8,1307,277,12.9,1082,152,777,111,1769,830,14.3,58.2,0.9431 +4/7/04,8:00:00,3.1,1286,454,15.7,1171,153,716,102,1821,1006,15.0,52.0,0.8817 +4/7/04,9:00:00,2.3,1150,211,8.6,925,152,865,116,1576,820,13.6,61.4,0.9519 +4/7/04,10:00:00,1.6,1087,196,7.4,876,112,911,96,1554,737,16.1,53.1,0.9675 +4/7/04,11:00:00,1.5,1016,110,6.6,839,119,991,97,1448,790,16.6,44.3,0.8294 +4/7/04,12:00:00,1.6,1029,164,7.8,892,122,937,110,1474,760,17.4,40.3,0.7922 +4/7/04,13:00:00,1.8,1109,151,9.1,943,115,883,103,1604,767,18.1,42.2,0.8679 +4/7/04,14:00:00,1.4,997,112,6.1,817,84,1030,79,1426,626,14.4,51.8,0.8433 +4/7/04,15:00:00,1.9,1177,117,7.5,880,146,927,104,1552,776,10.9,76.4,0.9953 +4/7/04,16:00:00,1.2,1045,120,5.9,808,65,1038,69,1421,589,16.4,47.7,0.8819 +4/7/04,17:00:00,2.3,1171,251,12.7,1073,120,847,96,1623,872,18.0,37.1,0.7598 +4/7/04,18:00:00,3.3,1221,435,13.5,1101,185,799,121,1700,996,14.7,48.7,0.8089 +4/7/04,19:00:00,3.1,1262,345,14.5,1133,157,774,129,1748,1033,11.3,71.3,0.9514 +4/7/04,20:00:00,3.3,1207,343,11.0,1013,174,876,116,1632,982,10.0,74.0,0.9087 +4/7/04,21:00:00,1.2,956,62,3.2,666,62,1235,71,1282,518,9.8,71.8,0.8702 +4/7/04,22:00:00,1.3,1020,70,4.3,727,81,1153,85,1391,631,9.3,75.2,0.8796 +4/7/04,23:00:00,1.9,1107,114,7.1,861,118,957,101,1510,879,9.6,76.0,0.9088 +4/8/04,0:00:00,1.4,1020,88,4.8,752,95,1074,97,1398,839,9.3,75.0,0.8799 +4/8/04,1:00:00,1.5,1051,95,5.7,798,84,1016,89,1455,876,9.0,76.7,0.8825 
+4/8/04,2:00:00,1.1,965,83,4.1,716,56,1087,72,1383,827,8.2,79.6,0.8672 +4/8/04,3:00:00,0.8,890,63,3.3,672,200,1183,200,1338,781,7.6,80.6,0.8437 +4/8/04,4:00:00,200,823,38,1.8,568,43,1366,57,1263,699,8.3,75.6,0.8302 +4/8/04,5:00:00,0.8,869,68,2.8,642,69,1291,62,1288,723,8.1,72.1,0.7798 +4/8/04,6:00:00,1.1,939,59,4.8,754,74,1169,64,1384,797,8.1,70.4,0.7609 +4/8/04,7:00:00,2.6,1274,226,13.8,1109,161,820,80,1723,1138,8.1,70.5,0.7627 +4/8/04,8:00:00,5.1,1512,802,27.1,1482,308,596,118,2189,1605,9.9,64.3,0.7859 +4/8/04,9:00:00,4.2,1435,585,19.1,1271,300,676,130,1885,1569,13.0,54.2,0.8090 +4/8/04,10:00:00,2.6,1207,324,11.7,1041,189,813,113,1620,1319,16.5,42.6,0.7912 +4/8/04,11:00:00,2.4,1223,308,12.7,1075,164,818,112,1678,1198,19.3,34.2,0.7566 +4/8/04,12:00:00,2.1,1121,238,11.3,1026,122,865,96,1541,958,21.5,27.4,0.6941 +4/8/04,13:00:00,2.6,1176,301,14.2,1123,166,816,125,1609,1068,20.4,26.7,0.6323 +4/8/04,14:00:00,2.8,1144,294,13.9,1114,181,845,128,1551,1010,20.6,22.6,0.5432 +4/8/04,15:00:00,2.5,1136,353,13.2,1089,127,856,101,1552,928,20.6,24.8,0.5931 +4/8/04,16:00:00,1.9,1085,209,10.7,1004,106,888,95,1454,816,21.0,25.4,0.6264 +4/8/04,17:00:00,3.6,1297,538,19.7,1287,192,723,138,1798,1216,21.5,24.5,0.6203 +4/8/04,18:00:00,4.6,1408,808,24.0,1404,241,646,154,1972,1463,19.3,28.6,0.6357 +4/8/04,19:00:00,6.3,1618,974,29.1,1530,326,579,171,2167,1791,18.0,32.9,0.6737 +4/8/04,20:00:00,4.3,1319,544,15.8,1172,232,746,136,1699,1425,15.8,39.8,0.7096 +4/8/04,21:00:00,1.6,1045,138,7.0,857,92,968,93,1410,922,15.3,41.5,0.7194 +4/8/04,22:00:00,1.4,1058,92,6.3,824,95,993,99,1411,971,14.1,46.0,0.7363 +4/8/04,23:00:00,2,200,137,200.0,200,129,200,106,200,200,200,200,200 +4/9/04,0:00:00,2.4,200,189,200.0,200,154,200,109,200,200,200,200,200 +4/9/04,1:00:00,1.8,200,159,200.0,200,118,200,97,200,200,200,200,200 +4/9/04,2:00:00,1,200,80,200.0,200,69,200,83,200,200,200,200,200 +4/9/04,3:00:00,1,200,66,200.0,200,200,200,200,200,200,200,200,200 
+4/9/04,4:00:00,1,200,87,200.0,200,97,200,79,200,200,200,200,200 +4/9/04,5:00:00,0.9,200,79,200.0,200,145,200,84,200,200,200,200,200 +4/9/04,6:00:00,1.5,200,150,200.0,200,169,200,86,200,200,200,200,200 +4/9/04,7:00:00,2.6,200,196,200.0,200,250,200,111,200,200,200,200,200 +4/9/04,8:00:00,2.9,200,299,200.0,200,215,200,117,200,200,200,200,200 +4/9/04,9:00:00,2.7,200,254,200.0,200,237,200,122,200,200,200,200,200 +4/9/04,10:00:00,2.4,200,226,200.0,200,190,200,119,200,200,200,200,200 +4/9/04,11:00:00,2.6,200,262,200.0,200,219,200,121,200,200,200,200,200 +4/9/04,12:00:00,4.1,200,505,200.0,200,294,200,127,200,200,200,200,200 +4/9/04,13:00:00,4.3,200,512,200.0,200,253,200,135,200,200,200,200,200 +4/9/04,14:00:00,2.2,200,159,200.0,200,176,200,119,200,200,200,200,200 +4/9/04,15:00:00,3.4,200,337,200.0,200,243,200,137,200,200,200,200,200 +4/9/04,16:00:00,3.6,200,357,200.0,200,214,200,139,200,200,200,200,200 +4/9/04,17:00:00,2.5,200,194,200.0,200,141,200,117,200,200,200,200,200 +4/9/04,18:00:00,2.8,200,278,200.0,200,166,200,127,200,200,200,200,200 +4/9/04,19:00:00,3.4,200,303,200.0,200,197,200,137,200,200,200,200,200 +4/9/04,20:00:00,3,200,234,200.0,200,191,200,127,200,200,200,200,200 +4/9/04,21:00:00,2.1,200,128,200.0,200,159,200,107,200,200,200,200,200 +4/9/04,22:00:00,2.4,200,160,200.0,200,175,200,102,200,200,200,200,200 +4/9/04,23:00:00,2.3,1176,169,8.4,917,157,866,95,1560,1151,12.0,67.9,0.9523 +4/10/04,0:00:00,1.3,1047,77,5.4,783,83,1014,79,1449,975,12.0,66.9,0.9336 +4/10/04,1:00:00,1.7,1116,98,6.9,854,118,929,78,1533,1034,11.5,70.1,0.9503 +4/10/04,2:00:00,1.7,1074,102,6.4,832,101,951,75,1507,980,11.4,69.8,0.9390 +4/10/04,3:00:00,1,922,65,3.6,687,200,1142,200,1390,825,11.1,69.3,0.9128 +4/10/04,4:00:00,0.8,898,69,3.3,669,46,1163,52,1386,807,10.8,70.3,0.9092 +4/10/04,5:00:00,1,970,89,4.0,708,81,1121,55,1435,842,10.2,73.4,0.9129 +4/10/04,6:00:00,1.4,1073,134,6.0,812,112,988,65,1500,950,10.8,70.5,0.9142 
+4/10/04,7:00:00,1.7,1081,157,8.2,906,150,895,72,1591,1024,11.0,71.0,0.9279 +4/10/04,8:00:00,2,1159,167,9.9,976,138,837,76,1663,1098,12.4,64.0,0.9206 +4/10/04,9:00:00,2,1112,157,8.8,931,124,869,84,1588,1011,15.0,53.1,0.8993 +4/10/04,10:00:00,1.9,1112,172,8.4,917,124,897,90,1542,931,16.6,44.4,0.8313 +4/10/04,11:00:00,2.1,1063,114,7.5,877,115,971,89,1481,811,18.0,38.4,0.7862 +4/10/04,12:00:00,1.8,1032,166,7.8,892,101,973,87,1457,756,20.6,30.0,0.7213 +4/10/04,13:00:00,1.5,936,94,6.4,829,90,1054,88,1336,607,19.9,26.7,0.6137 +4/10/04,14:00:00,1.3,883,108,5.1,766,90,1152,81,1238,498,19.8,23.1,0.5283 +4/10/04,15:00:00,1.4,942,98,6.9,852,89,1055,89,1318,540,21.4,22.7,0.5722 +4/10/04,16:00:00,1.5,943,132,6.8,850,92,1084,91,1296,544,22.1,21.3,0.5605 +4/10/04,17:00:00,1.7,1007,144,7.4,875,103,1027,104,1357,620,21.1,24.5,0.6040 +4/10/04,18:00:00,2,1037,165,8.7,927,122,949,111,1426,740,18.3,31.6,0.6562 +4/10/04,19:00:00,2.5,1127,210,11.2,1022,159,871,129,1534,921,16.1,35.0,0.6372 +4/10/04,20:00:00,2.4,1123,277,10.5,996,172,861,133,1551,1033,14.8,41.5,0.6929 +4/10/04,21:00:00,2,1110,168,8.2,909,125,916,115,1487,984,13.2,50.3,0.7577 +4/10/04,22:00:00,2.9,1247,248,11.6,1037,187,812,125,1618,1193,12.3,55.5,0.7916 +4/10/04,23:00:00,2.5,1220,235,9.7,967,174,844,117,1545,1219,11.8,56.6,0.7833 +4/11/04,0:00:00,1.4,1028,84,5.1,770,75,1047,84,1326,913,11.2,56.2,0.7469 +4/11/04,1:00:00,1.2,978,75,4.0,711,48,1144,63,1304,750,11.0,56.6,0.7418 +4/11/04,2:00:00,1,949,62,3.7,694,61,1160,68,1310,772,9.3,64.3,0.7531 +4/11/04,3:00:00,1,910,66,4.0,709,200,1162,200,1329,712,8.8,66.0,0.7501 +4/11/04,4:00:00,200,863,49,3.0,651,38,1224,52,1286,712,7.8,70.4,0.7463 +4/11/04,5:00:00,0.7,836,38,2.9,643,38,1244,49,1306,727,7.7,70.7,0.7463 +4/11/04,6:00:00,0.8,856,44,2.9,644,41,1259,49,1301,746,7.4,71.0,0.7346 +4/11/04,7:00:00,0.9,888,60,3.0,651,51,1240,52,1298,733,8.2,66.6,0.7283 +4/11/04,8:00:00,1.1,912,88,3.7,694,61,1196,58,1343,727,8.7,64.3,0.7267 
+4/11/04,9:00:00,1,938,70,3.9,706,47,1177,51,1320,642,13.0,47.9,0.7116 +4/11/04,10:00:00,1.6,1053,94,6.2,823,87,1040,77,1386,810,17.4,35.4,0.6990 +4/11/04,11:00:00,1.8,1087,126,7.3,870,107,1014,90,1412,851,20.2,29.2,0.6830 +4/11/04,12:00:00,2.6,1153,181,9.2,945,161,969,118,1425,911,22.7,22.9,0.6224 +4/11/04,13:00:00,1.1,894,66,4.0,710,61,1209,70,1200,611,22.3,22.4,0.5955 +4/11/04,14:00:00,0.6,797,54,2.3,606,26,1452,35,1076,368,22.2,18.9,0.4993 +4/11/04,15:00:00,0.9,854,48,3.1,658,42,1409,46,1108,369,22.0,19.0,0.4969 +4/11/04,16:00:00,1,892,66,3.2,665,55,1348,58,1150,394,21.0,23.6,0.5788 +4/11/04,17:00:00,1.2,935,65,3.3,672,65,1299,61,1183,415,19.6,27.4,0.6162 +4/11/04,18:00:00,1.4,989,68,4.0,709,75,1225,80,1233,469,17.9,30.8,0.6258 +4/11/04,19:00:00,2,1063,118,5.7,797,93,1079,93,1285,607,16.4,35.3,0.6532 +4/11/04,20:00:00,1.1,922,65,3.4,674,63,1220,78,1205,498,15.6,37.5,0.6575 +4/11/04,21:00:00,1.1,925,63,3.5,680,60,1223,76,1217,481,15.5,36.7,0.6438 +4/11/04,22:00:00,1.1,924,61,2.9,648,63,1272,74,1186,465,14.5,39.0,0.6428 +4/11/04,23:00:00,1.1,914,63,3.1,661,66,1256,81,1187,478,14.2,38.2,0.6163 +4/12/04,0:00:00,0.7,840,31,1.6,556,41,1474,58,1094,366,13.7,38.3,0.5957 +4/12/04,1:00:00,0.6,799,29,1.1,514,42,1615,59,1036,332,14.1,34.0,0.5442 +4/12/04,2:00:00,0.7,856,34,2.0,581,36,1448,54,1111,346,14.6,34.0,0.5605 +4/12/04,3:00:00,0.6,830,34,1.7,560,200,1452,200,1066,350,14.4,32.8,0.5364 +4/12/04,4:00:00,0.3,753,9,0.7,462,16,1822,25,955,274,13.9,30.9,0.4868 +4/12/04,5:00:00,0.3,785,14,0.5,448,16,1923,26,1003,263,13.5,37.9,0.5852 +4/12/04,6:00:00,0.4,835,23,0.7,472,33,1756,49,1117,314,13.9,43.6,0.6871 +4/12/04,7:00:00,0.5,865,36,0.9,494,39,1613,53,1185,367,13.5,50.1,0.7706 +4/12/04,8:00:00,0.6,922,39,1.5,541,45,1453,57,1246,432,13.3,53.6,0.8169 +4/12/04,9:00:00,0.8,975,38,2.2,598,63,1268,73,1293,536,13.4,56.0,0.8576 +4/12/04,10:00:00,1,1040,55,2.9,646,71,1210,78,1325,555,13.0,59.6,0.8860 +4/12/04,11:00:00,1.3,1115,75,3.9,707,104,1105,93,1384,677,12.7,62.7,0.9173 
+4/12/04,12:00:00,1.1,1004,64,2.8,641,85,1201,71,1305,557,12.1,60.5,0.8505 +4/12/04,13:00:00,0.6,895,39,1.4,536,53,1415,56,1229,421,11.2,64.4,0.8537 +4/12/04,14:00:00,0.7,917,42,1.7,563,58,1415,60,1257,423,10.7,67.6,0.8703 +4/12/04,15:00:00,1.2,1025,78,3.6,689,89,1189,77,1325,519,11.4,61.2,0.8236 +4/12/04,16:00:00,1.4,1035,69,3.1,660,96,1222,82,1301,533,9.9,70.5,0.8581 +4/12/04,17:00:00,1.2,1025,67,3.5,682,88,1197,77,1313,537,10.3,65.2,0.8151 +4/12/04,18:00:00,1.3,1040,83,3.9,705,100,1149,91,1324,612,11.0,58.9,0.7714 +4/12/04,19:00:00,1.8,1070,79,5.2,776,127,1065,93,1369,697,10.6,59.5,0.7595 +4/12/04,20:00:00,1.2,964,68,3.7,692,88,1176,82,1292,573,10.6,57.7,0.7362 +4/12/04,21:00:00,0.8,904,56,2.7,635,66,1269,74,1233,489,10.6,56.1,0.7149 +4/12/04,22:00:00,0.8,913,49,2.6,626,61,1271,68,1234,474,10.3,58.0,0.7239 +4/12/04,23:00:00,0.8,914,56,2.4,610,54,1322,65,1230,455,10.4,58.4,0.7352 +4/13/04,0:00:00,0.7,893,48,2.0,587,46,1368,58,1219,433,10.7,58.2,0.7459 +4/13/04,1:00:00,0.5,844,27,1.3,532,25,1499,36,1176,376,10.5,59.9,0.7624 +4/13/04,2:00:00,0.3,837,29,1.2,518,18,1581,29,1182,360,10.5,60.8,0.7704 +4/13/04,3:00:00,0.4,868,32,1.3,530,200,1482,200,1215,384,10.6,63.1,0.8068 +4/13/04,4:00:00,0.3,825,47,0.9,489,21,1565,28,1171,363,11.1,60.1,0.7925 +4/13/04,5:00:00,0.7,956,66,3.1,656,76,1251,68,1356,525,10.1,68.4,0.8443 +4/13/04,6:00:00,1.6,1124,163,6.9,853,149,960,99,1531,873,10.0,68.7,0.8456 +4/13/04,7:00:00,3.9,1496,524,19.1,1272,328,667,130,2011,1399,11.0,64.2,0.8398 +4/13/04,8:00:00,4.5,1452,657,22.1,1354,282,620,128,2057,1501,12.0,58.1,0.8138 +4/13/04,9:00:00,2.7,1172,324,11.8,1042,206,797,120,1641,1186,12.8,52.0,0.7643 +4/13/04,10:00:00,1.6,1020,144,7.3,871,133,956,93,1426,911,12.6,50.8,0.7387 +4/13/04,11:00:00,1.6,1044,135,8.0,900,137,934,96,1471,919,13.8,47.4,0.7435 +4/13/04,12:00:00,1.6,1020,140,7.2,866,116,972,84,1429,791,15.1,42.7,0.7294 +4/13/04,13:00:00,1.5,1009,141,7.7,885,112,994,87,1439,751,16.3,38.1,0.7006 
+4/13/04,14:00:00,1.8,1073,181,9.7,966,128,935,87,1537,805,16.6,36.7,0.6867 +4/13/04,15:00:00,2.1,1098,227,9.8,971,145,877,103,1527,966,15.8,40.5,0.7241 +4/13/04,16:00:00,1.7,1061,149,7.7,887,110,920,88,1462,869,16.3,41.8,0.7712 +4/13/04,17:00:00,3,1289,425,15.3,1158,184,755,118,1791,1196,16.1,42.8,0.7783 +4/13/04,18:00:00,4.6,1442,669,21.5,1337,243,630,134,2004,1523,15.3,46.8,0.8074 +4/13/04,19:00:00,5,1479,680,21.7,1343,259,631,144,1995,1579,14.7,47.0,0.7815 +4/13/04,20:00:00,3.5,1266,446,14.6,1138,205,726,129,1747,1370,14.1,49.6,0.7931 +4/13/04,21:00:00,2.1,1149,205,9.3,952,144,845,108,1577,1204,13.5,54.1,0.8351 +4/13/04,22:00:00,2.1,1168,194,9.4,954,152,843,104,1586,1201,12.3,59.2,0.8461 +4/13/04,23:00:00,1.5,1079,115,6.8,847,108,925,90,1463,1122,11.6,62.2,0.8510 +4/14/04,0:00:00,1.1,1047,74,4.9,760,64,1032,74,1379,1003,11.5,61.4,0.8303 +4/14/04,1:00:00,0.9,936,85,3.9,705,73,1086,73,1364,918,10.9,64.6,0.8395 +4/14/04,2:00:00,0.7,900,66,2.7,635,47,1207,60,1307,836,11.1,63.2,0.8331 +4/14/04,3:00:00,0.4,839,60,1.9,576,200,1325,200,1285,759,10.7,64.4,0.8284 +4/14/04,4:00:00,200,840,40,1.7,563,45,1349,57,1275,755,10.7,63.8,0.8174 +4/14/04,5:00:00,0.9,905,47,3.3,670,79,1191,62,1346,820,10.2,65.8,0.8196 +4/14/04,6:00:00,1.6,1079,116,7.4,875,113,947,71,1535,998,10.3,65.6,0.8225 +4/14/04,7:00:00,3.9,1417,478,18.2,1246,263,669,94,1921,1403,10.8,64.0,0.8307 +4/14/04,8:00:00,5,1529,836,27.7,1497,275,569,99,2278,1641,13.3,55.6,0.8453 +4/14/04,9:00:00,4.3,1397,655,18.3,1248,263,659,119,1906,1534,15.8,46.8,0.8328 +4/14/04,10:00:00,2.7,1214,312,12.3,1061,183,780,103,1662,1257,19.5,34.4,0.7738 +4/14/04,11:00:00,2.1,1109,195,8.9,935,147,882,95,1518,1026,21.1,30.2,0.7450 +4/14/04,12:00:00,2.1,1122,238,10.4,993,132,859,95,1571,993,18.4,34.2,0.7185 +4/14/04,13:00:00,200,1094,200,9.7,966,200,877,200,1519,919,21.0,30.6,0.7538 +4/14/04,14:00:00,200,1164,200,12.8,1076,200,788,200,1653,1032,21.4,28.7,0.7231 +4/14/04,15:00:00,200,1188,200,12.9,1082,200,790,200,1653,1077,21.3,27.3,0.6856 
+4/14/04,16:00:00,200,1098,200,8.9,934,200,895,200,1497,883,20.4,30.4,0.7193 +4/14/04,17:00:00,200,1180,200,11.0,1014,200,826,200,1595,1036,18.7,34.8,0.7436 +4/14/04,18:00:00,200,1195,200,12.6,1072,200,799,200,1609,1144,18.6,34.7,0.7369 +4/14/04,19:00:00,200,1234,200,13.3,1095,200,789,200,1638,1212,17.9,35.8,0.7276 +4/14/04,20:00:00,200,1425,200,18.1,1242,200,675,200,1858,1486,16.5,42.1,0.7856 +4/14/04,21:00:00,200,1320,200,13.0,1085,200,730,200,1689,1440,14.8,49.7,0.8321 +4/14/04,22:00:00,200,1158,200,7.8,892,200,870,200,1494,1242,13.4,54.3,0.8291 +4/14/04,23:00:00,200,1060,200,5.8,805,200,964,200,1418,1089,13.4,54.2,0.8319 +4/15/04,0:00:00,200,1036,200,4.7,746,200,1034,200,1364,993,12.3,57.6,0.8223 +4/15/04,1:00:00,200,943,200,3.1,660,200,1163,200,1297,862,11.6,58.8,0.7996 +4/15/04,2:00:00,200,899,200,2.1,591,200,1243,200,1259,815,10.7,62.3,0.7988 +4/15/04,3:00:00,200,889,200,1.7,559,200,1331,200,1241,714,10.2,63.2,0.7865 +4/15/04,4:00:00,200,847,200,1.3,527,200,1416,200,1216,682,10.5,61.3,0.7761 +4/15/04,5:00:00,200,918,200,2.6,625,200,1255,200,1298,744,9.3,65.8,0.7722 +4/15/04,6:00:00,200,1105,200,7.9,897,200,906,200,1510,1050,9.1,67.4,0.7814 +4/15/04,7:00:00,200,1491,200,21.4,1335,200,625,200,2001,1522,10.2,62.9,0.7831 +4/15/04,8:00:00,200,1508,200,22.1,1355,200,609,200,2034,1623,11.6,58.7,0.7999 +4/15/04,9:00:00,3.9,1489,536,19.1,1270,309,633,114,1889,1638,13.5,53.4,0.8196 +4/15/04,10:00:00,3.8,1451,481,17.3,1219,327,661,127,1804,1633,15.0,49.3,0.8364 +4/15/04,11:00:00,3.8,1454,460,17.5,1225,299,661,135,1809,1654,17.1,42.1,0.8120 +4/15/04,12:00:00,3.9,1392,597,17.0,1210,257,689,135,1792,1559,18.4,36.9,0.7758 +4/15/04,13:00:00,3.5,1351,465,17.0,1210,195,709,120,1783,1374,19.3,35.4,0.7824 +4/15/04,14:00:00,4,1418,542,19.5,1282,231,681,129,1890,1418,19.2,35.7,0.7826 +4/15/04,15:00:00,3,1303,437,14.8,1143,170,748,114,1734,1228,21.5,30.4,0.7677 +4/15/04,16:00:00,3.1,1292,383,14.0,1116,192,746,129,1676,1293,19.6,34.6,0.7807 
+4/15/04,17:00:00,5.2,1560,778,24.4,1413,251,594,149,2046,1642,19.6,35.5,0.8017 +4/15/04,18:00:00,5.9,1765,926,28.1,1507,289,533,149,2271,1895,18.3,44.1,0.9183 +4/15/04,19:00:00,7.3,1875,1189,33.7,1637,351,482,158,2446,2086,17.4,47.8,0.9392 +4/15/04,20:00:00,6.1,1746,824,24.2,1409,319,553,148,2119,1898,16.8,49.8,0.9446 +4/15/04,21:00:00,4.6,1607,660,19.8,1290,264,587,134,1960,1758,16.2,53.2,0.9723 +4/15/04,22:00:00,2.5,1277,203,9.3,952,144,802,111,1607,1284,14.9,59.1,0.9928 +4/15/04,23:00:00,2.2,1207,136,7.3,868,119,877,106,1532,1075,13.9,64.0,1.0130 +4/16/04,0:00:00,2,1209,114,7.1,863,120,882,108,1547,1084,13.7,66.8,1.0410 +4/16/04,1:00:00,1.9,1189,110,6.7,844,96,887,93,1566,1049,13.4,69.6,1.0688 +4/16/04,2:00:00,1.6,1132,127,6.0,810,77,899,83,1532,1021,12.8,71.9,1.0632 +4/16/04,3:00:00,1,974,68,2.7,635,200,1159,200,1387,778,12.9,68.5,1.0155 +4/16/04,4:00:00,1,1033,70,3.6,686,63,1030,69,1420,948,11.9,74.8,1.0380 +4/16/04,5:00:00,1.2,1075,93,5.4,782,97,941,77,1505,988,11.6,76.4,1.0415 +4/16/04,6:00:00,2,1242,179,10.0,978,180,778,87,1686,1184,11.7,75.3,1.0325 +4/16/04,7:00:00,4.5,1657,523,23.2,1384,352,579,109,2176,1600,12.8,71.0,1.0428 +4/16/04,8:00:00,4.1,1406,546,18.7,1258,279,636,116,1890,1495,13.8,61.1,0.9595 +4/16/04,9:00:00,2.6,1150,251,9.9,973,186,827,105,1590,1188,15.3,51.4,0.8886 +4/16/04,10:00:00,2.3,1184,222,10.2,983,169,823,106,1604,1166,16.6,48.7,0.9146 +4/16/04,11:00:00,2.6,1223,212,11.0,1014,172,781,108,1666,1219,17.6,46.7,0.9324 +4/16/04,12:00:00,2.4,1196,263,10.5,997,127,802,100,1644,1128,19.2,42.7,0.9425 +4/16/04,13:00:00,4.3,1559,535,18.9,1267,230,653,149,2047,1373,15.7,62.5,1.1092 +4/16/04,14:00:00,4.6,1544,609,18.9,1266,280,613,148,2066,1415,14.5,75.7,1.2425 +4/16/04,15:00:00,4,1485,449,17.5,1223,242,650,126,1980,1302,14.4,77.4,1.2654 +4/16/04,16:00:00,4.7,1498,536,17.7,1230,250,649,140,1963,1308,14.6,74.9,1.2391 +4/16/04,17:00:00,5.7,1593,743,23.0,1378,309,582,142,2159,1449,14.6,76.0,1.2573 
+4/16/04,18:00:00,5.2,1529,575,20.4,1309,270,627,141,2083,1385,14.0,76.3,1.2166 +4/16/04,19:00:00,6.4,1689,783,26.8,1474,323,560,149,2298,1492,13.9,78.0,1.2303 +4/16/04,20:00:00,6,1535,706,19.9,1295,307,627,140,2039,1380,13.8,78.0,1.2229 +4/16/04,21:00:00,3.4,1304,299,12.5,1067,224,737,117,1811,1167,13.5,79.4,1.2224 +4/16/04,22:00:00,2.7,1235,159,9.6,963,166,803,103,1680,1095,13.1,80.1,1.2016 +4/16/04,23:00:00,3.1,1281,171,10.0,977,171,794,99,1695,1105,12.8,81.0,1.1915 +4/17/04,0:00:00,2.6,1167,162,8.1,904,169,846,100,1627,1051,12.7,80.6,1.1819 +4/17/04,1:00:00,2.2,1134,111,7.1,863,120,869,88,1599,1009,12.5,81.8,1.1838 +4/17/04,2:00:00,1.9,1100,134,6.4,829,103,887,83,1565,972,12.7,81.2,1.1849 +4/17/04,3:00:00,200,1057,200,5.1,768,200,958,200,1527,884,12.4,82.4,1.1848 +4/17/04,4:00:00,200,902,200,3.0,652,200,1158,200,1437,661,12.6,79.9,1.1619 +4/17/04,5:00:00,200,982,200,4.2,724,200,1055,200,1508,774,12.8,80.0,1.1761 +4/17/04,6:00:00,200,997,200,5.1,769,200,1002,200,1524,868,12.6,79.9,1.1596 +4/17/04,7:00:00,200,1141,200,8.1,902,200,905,200,1672,965,12.5,79.6,1.1538 +4/17/04,8:00:00,200,1163,200,8.8,930,200,849,200,1691,985,12.8,80.8,1.1863 +4/17/04,9:00:00,200,1107,200,7.2,864,200,911,200,1604,963,13.5,78.1,1.1997 +4/17/04,10:00:00,200,1228,200,9.2,949,200,838,200,1706,1069,14.6,74.9,1.2348 +4/17/04,11:00:00,200,1033,200,6.0,810,200,969,200,1526,864,16.8,61.0,1.1599 +4/17/04,12:00:00,200,984,200,5.0,763,200,1054,200,1472,641,18.8,50.1,1.0769 +4/17/04,13:00:00,200,1066,200,7.5,879,200,939,200,1536,740,19.3,47.1,1.0445 +4/17/04,14:00:00,200,926,200,3.9,707,200,1130,200,1429,532,19.3,46.6,1.0298 +4/17/04,15:00:00,200,979,200,5.0,764,200,1054,200,1462,572,19.7,44.4,1.0081 +4/17/04,16:00:00,200,993,200,5.4,783,200,1033,200,1482,630,19.3,45.7,1.0161 +4/17/04,17:00:00,200,970,200,5.0,763,200,1043,200,1462,608,17.9,49.9,1.0135 +4/17/04,18:00:00,200,997,200,5.5,788,200,1014,200,1482,644,16.9,53.2,1.0179 
+4/17/04,19:00:00,200,1097,200,7.8,892,200,909,200,1594,833,16.1,56.0,1.0151 +4/17/04,20:00:00,200,1105,200,8.2,908,200,900,200,1566,923,16.5,52.8,0.9860 +4/17/04,21:00:00,200,993,200,5.3,777,200,1022,200,1482,790,16.9,51.1,0.9743 +4/17/04,22:00:00,200,1136,200,7.9,896,200,919,200,1623,961,17.0,50.8,0.9751 +4/17/04,23:00:00,200,1297,200,10.7,1003,200,798,200,1700,1177,16.9,54.8,1.0455 +4/18/04,0:00:00,200,1110,200,6.9,855,200,897,200,1568,943,16.3,57.2,1.0492 +4/18/04,1:00:00,200,1027,200,6.3,827,200,942,200,1529,742,16.0,56.2,1.0164 +4/18/04,2:00:00,200,1024,200,5.6,795,200,982,200,1501,705,15.6,60.2,1.0617 +4/18/04,3:00:00,200,992,200,5.4,783,200,988,200,1527,663,14.6,68.2,1.1282 +4/18/04,4:00:00,200,929,200,3.2,664,200,1144,200,1449,536,14.3,69.6,1.1257 +4/18/04,5:00:00,200,892,200,1.9,579,200,1309,200,1396,441,13.8,71.9,1.1312 +4/18/04,6:00:00,200,895,200,2.3,609,200,1268,200,1410,477,13.8,72.7,1.1402 +4/18/04,7:00:00,200,915,200,2.4,612,200,1266,200,1411,506,14.2,72.3,1.1632 +4/18/04,8:00:00,200,947,200,2.9,649,200,1154,200,1459,608,14.5,72.9,1.1950 +4/18/04,9:00:00,200,1032,200,4.2,721,200,1052,200,1546,663,14.2,75.6,1.2201 +4/18/04,10:00:00,200,1219,200,6.8,849,200,901,200,1664,912,14.5,77.2,1.2695 +4/18/04,11:00:00,200,1217,200,7.3,869,200,873,200,1660,963,14.9,74.5,1.2530 +4/18/04,12:00:00,200,1097,200,5.3,779,200,980,200,1565,803,15.5,67.4,1.1780 +4/18/04,13:00:00,200,938,200,3.4,674,200,1136,200,1433,564,16.7,56.0,1.0544 +4/18/04,14:00:00,200,944,200,3.4,677,200,1140,200,1436,501,17.1,55.5,1.0724 +4/18/04,15:00:00,200,1139,200,6.8,847,200,929,200,1604,701,16.2,62.4,1.1423 +4/18/04,16:00:00,200,1163,200,7.0,859,200,921,200,1603,783,17.4,58.9,1.1567 +4/18/04,17:00:00,200,1157,200,7.7,885,200,916,200,1590,784,17.5,54.6,1.0800 +4/18/04,18:00:00,200,1154,200,8.5,919,200,876,200,1623,861,17.5,54.2,1.0722 +4/18/04,19:00:00,200,1342,200,13.5,1102,200,744,200,1824,1156,17.0,55.1,1.0621 +4/18/04,20:00:00,200,1178,200,9.2,949,200,847,200,1653,1047,16.4,57.8,1.0669 
+4/18/04,21:00:00,200,1152,200,8.7,926,200,854,200,1669,1034,15.8,60.4,1.0744 +4/18/04,22:00:00,200,1078,200,6.6,841,200,920,200,1595,962,15.5,63.3,1.1071 +4/18/04,23:00:00,200,1167,200,9.1,945,200,826,200,1702,1100,15.0,65.2,1.1045 +4/19/04,0:00:00,200,1111,200,7.8,890,200,871,200,1658,1047,14.5,67.3,1.1054 +4/19/04,1:00:00,200,988,200,4.2,722,200,1029,200,1500,874,14.1,70.1,1.1224 +4/19/04,2:00:00,200,917,200,3.4,677,200,1103,200,1494,747,13.8,73.3,1.1518 +4/19/04,3:00:00,200,850,200,1.7,562,200,1275,200,1407,589,13.1,78.3,1.1716 +4/19/04,4:00:00,200,885,200,2.1,589,200,1238,200,1443,635,13.0,78.2,1.1664 +4/19/04,5:00:00,200,888,200,2.2,597,200,1242,200,1428,671,14.3,70.1,1.1328 +4/19/04,6:00:00,200,1101,200,6.9,852,200,913,200,1655,992,14.0,70.9,1.1306 +4/19/04,7:00:00,200,1399,200,16.5,1195,200,680,200,1976,1308,15.4,64.2,1.1123 +4/19/04,8:00:00,200,1325,200,18.7,1260,200,652,200,1969,1303,16.5,57.9,1.0826 +4/19/04,9:00:00,200,1058,200,8.5,921,200,876,200,1594,893,18.8,48.0,1.0323 +4/19/04,10:00:00,200,990,200,6.8,846,200,947,200,1503,715,19.5,43.3,0.9706 +4/19/04,11:00:00,200,1018,200,7.5,879,200,950,200,1574,685,18.5,45.5,0.9606 +4/19/04,12:00:00,200,1105,200,10.9,1010,200,811,200,1709,947,18.0,46.7,0.9569 +4/19/04,13:00:00,200,1194,200,13.4,1097,200,743,200,1847,1049,16.5,58.2,1.0813 +4/19/04,14:00:00,200,1088,200,7.9,893,200,892,200,1565,741,12.1,75.6,1.0662 +4/19/04,15:00:00,200,1084,200,8.8,933,200,872,200,1602,801,11.1,75.6,0.9974 +4/19/04,16:00:00,200,1207,200,11.5,1032,200,797,200,1701,969,10.5,79.0,1.0016 +4/19/04,17:00:00,200,1186,200,11.8,1042,200,808,200,1684,1038,10.0,80.5,0.9853 +4/19/04,18:00:00,200,1053,200,8.3,911,200,916,200,1528,889,11.3,70.5,0.9389 +4/19/04,19:00:00,200,1032,200,7.9,895,200,921,200,1508,879,10.7,71.5,0.9197 +4/19/04,20:00:00,200,989,200,6.7,846,200,960,200,1476,846,9.9,74.0,0.9009 +4/19/04,21:00:00,200,930,200,4.6,745,200,1085,200,1391,712,10.8,68.2,0.8806 
+4/19/04,22:00:00,200,951,200,5.0,766,200,1056,200,1431,747,10.2,74.0,0.9176 +4/19/04,23:00:00,200,985,200,5.3,780,200,1017,200,1444,791,10.1,77.5,0.9579 +4/20/04,0:00:00,200,937,200,3.9,703,200,1136,200,1377,637,9.9,77.7,0.9497 +4/20/04,1:00:00,200,887,200,2.9,648,200,1222,200,1357,553,9.5,80.2,0.9533 +4/20/04,2:00:00,200,806,200,1.5,544,200,1429,200,1290,449,9.5,79.7,0.9489 +4/20/04,3:00:00,200,771,200,0.8,483,200,1587,200,1277,485,9.7,78.9,0.9487 +4/20/04,4:00:00,200,802,200,1.2,519,200,1523,200,1315,540,9.8,80.0,0.9670 +4/20/04,5:00:00,200,878,200,2.9,644,200,1260,200,1410,685,10.0,79.6,0.9800 +4/20/04,6:00:00,200,1070,200,7.6,883,200,942,200,1625,925,10.3,79.4,0.9927 +4/20/04,7:00:00,200,1430,200,20.7,1316,200,638,200,2069,1382,11.0,77.4,1.0135 +4/20/04,8:00:00,200,1565,200,29.9,1549,200,540,200,2396,1645,12.3,69.2,0.9906 +4/20/04,9:00:00,200,1365,200,20.1,1299,200,637,200,1999,1514,14.5,60.4,0.9936 +4/20/04,10:00:00,200,1348,200,17.1,1214,200,678,200,1874,1509,16.0,56.3,1.0185 +4/20/04,11:00:00,200,1311,200,19.4,1279,200,648,200,1991,1520,18.1,51.1,1.0551 +4/20/04,12:00:00,200,1302,200,18.7,1260,200,658,200,1970,1471,23.0,37.5,1.0389 +4/20/04,13:00:00,200,1250,200,15.8,1173,200,720,200,1870,1282,21.8,39.6,1.0209 +4/20/04,14:00:00,200,1083,200,10.3,988,200,859,200,1643,1004,21.9,37.2,0.9683 +4/20/04,15:00:00,200,1005,200,7.9,894,200,956,200,1530,714,23.8,31.4,0.9132 +4/20/04,16:00:00,200,1129,200,13.7,1107,200,844,200,1742,940,25.9,26.9,0.8845 +4/20/04,17:00:00,200,1327,200,19.7,1289,200,712,200,1938,1272,25.8,29.9,0.9808 +4/20/04,18:00:00,200,1428,200,25.3,1437,200,599,200,2156,1526,22.6,39.5,1.0679 +4/20/04,19:00:00,200,1426,200,23.9,1400,200,612,200,2102,1488,19.9,45.0,1.0364 +4/20/04,20:00:00,200,1310,200,20.6,1313,200,663,200,1958,1475,17.9,47.4,0.9657 +4/20/04,21:00:00,200,1114,200,12.3,1059,200,792,200,1704,1182,16.8,50.4,0.9545 +4/20/04,22:00:00,200,991,200,7.7,889,200,932,200,1550,933,16.0,52.0,0.9375 
+4/20/04,23:00:00,200,1089,200,9.0,940,200,866,200,1589,1001,14.2,59.1,0.9520 +4/21/04,0:00:00,200,911,200,4.5,740,200,1071,200,1412,756,14.2,56.7,0.9111 +4/21/04,1:00:00,200,1005,200,5.8,802,200,976,200,1474,952,12.2,67.7,0.9625 +4/21/04,2:00:00,200,863,200,3.2,664,200,1172,200,1381,808,13.2,61.7,0.9328 +4/21/04,3:00:00,200,857,200,3.0,653,200,1197,200,1375,779,10.9,71.0,0.9249 +4/21/04,4:00:00,200,804,200,2.0,585,200,1301,200,1319,717,11.3,67.6,0.9038 +4/21/04,5:00:00,200,871,200,3.7,694,200,1160,200,1421,779,10.1,71.8,0.8880 +4/21/04,6:00:00,200,1035,200,8.7,929,200,887,200,1599,1009,10.2,70.9,0.8841 +4/21/04,7:00:00,200,1513,200,27.0,1480,200,572,200,2249,1595,11.8,66.0,0.9085 +4/21/04,8:00:00,200,1726,200,40.3,1776,200,478,200,2684,1940,14.0,58.2,0.9271 +4/21/04,10:00:00,200,1333,200,17.1,1212,200,668,200,1854,1573,22.3,36.9,0.9838 \ No newline at end of file diff --git a/src/test/resources/adult_data.csv b/src/test/resources/adult_data.csv new file mode 100644 index 00000000..f38d1844 --- /dev/null +++ b/src/test/resources/adult_data.csv @@ -0,0 +1,10001 @@ +age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class +39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K +50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K +38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K +37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 
160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K +52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K +42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 40, United-States, >50K +37, Private, 280464, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 80, United-States, >50K +30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K +23, Private, 122272, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K +32, Private, 205019, Assoc-acdm, 12, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K +40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K +34, Private, 245487, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 45, Mexico, <=50K +25, Self-emp-not-inc, 176756, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 35, United-States, <=50K +32, Private, 186824, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 28887, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +43, Self-emp-not-inc, 292175, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, >50K +40, Private, 193524, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +54, Private, 302146, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K +35, 
Federal-gov, 76845, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, Black, Male, 0, 0, 40, United-States, <=50K +43, Private, 117037, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2042, 40, United-States, <=50K +59, Private, 109015, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +56, Local-gov, 216851, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +19, Private, 168294, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K +39, Private, 367260, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +49, Private, 193366, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Local-gov, 190709, Assoc-acdm, 12, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K +20, Private, 266015, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 44, United-States, <=50K +45, Private, 386940, Bachelors, 13, Divorced, Exec-managerial, Own-child, White, Male, 0, 1408, 40, United-States, <=50K +30, Federal-gov, 59951, Some-college, 10, Married-civ-spouse, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, State-gov, 311512, Some-college, 10, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 15, United-States, <=50K +48, Private, 242406, 11th, 7, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, Puerto-Rico, <=50K +21, Private, 197200, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 544091, HS-grad, 9, Married-AF-spouse, Adm-clerical, Wife, White, Female, 0, 0, 25, United-States, <=50K +31, Private, 84154, Some-college, 10, 
Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 38, ?, >50K +48, Self-emp-not-inc, 265477, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 507875, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 43, United-States, <=50K +53, Self-emp-not-inc, 88506, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 172987, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, United-States, <=50K +49, Private, 94638, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 289980, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +57, Federal-gov, 337895, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, >50K +53, Private, 144361, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 38, United-States, <=50K +44, Private, 128354, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +41, State-gov, 101603, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 271466, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 43, United-States, <=50K +25, Private, 32275, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, Other, Female, 0, 0, 40, United-States, <=50K +18, Private, 226956, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, ?, <=50K +47, Private, 51835, Prof-school, 15, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1902, 60, Honduras, >50K +50, Federal-gov, 251585, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +47, Self-emp-inc, 109832, HS-grad, 9, Divorced, 
Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +43, Private, 237993, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 216666, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +35, Private, 56352, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Puerto-Rico, <=50K +41, Private, 147372, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 48, United-States, <=50K +30, Private, 188146, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 5013, 0, 40, United-States, <=50K +30, Private, 59496, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 2407, 0, 40, United-States, <=50K +32, ?, 293936, 7th-8th, 4, Married-spouse-absent, ?, Not-in-family, White, Male, 0, 0, 40, ?, <=50K +48, Private, 149640, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 116632, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +29, Private, 105598, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Male, 0, 0, 58, United-States, <=50K +36, Private, 155537, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 183175, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 169846, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +49, Self-emp-inc, 191681, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +25, ?, 200681, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 101509, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 32, 
United-States, <=50K +31, Private, 309974, Bachelors, 13, Separated, Sales, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +29, Self-emp-not-inc, 162298, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, >50K +23, Private, 211678, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +79, Private, 124744, Some-college, 10, Married-civ-spouse, Prof-specialty, Other-relative, White, Male, 0, 0, 20, United-States, <=50K +27, Private, 213921, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +40, Private, 32214, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +67, ?, 212759, 10th, 6, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 2, United-States, <=50K +18, Private, 309634, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 22, United-States, <=50K +31, Local-gov, 125927, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 446839, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +52, Private, 276515, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Cuba, <=50K +46, Private, 51618, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 159937, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, <=50K +44, Private, 343591, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Female, 14344, 0, 40, United-States, >50K +53, Private, 346253, HS-grad, 9, Divorced, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +49, Local-gov, 268234, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 202051, Masters, 14, Married-civ-spouse, Prof-specialty, 
Husband, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 54334, 9th, 5, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +43, Federal-gov, 410867, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +57, Private, 249977, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 286730, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 212563, Some-college, 10, Divorced, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 25, United-States, <=50K +30, Private, 117747, HS-grad, 9, Married-civ-spouse, Sales, Wife, Asian-Pac-Islander, Female, 0, 1573, 35, ?, <=50K +34, Local-gov, 226296, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Local-gov, 115585, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +48, Self-emp-not-inc, 191277, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 60, United-States, >50K +37, Private, 202683, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, >50K +48, Private, 171095, Assoc-acdm, 12, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, England, <=50K +32, Federal-gov, 249409, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +76, Private, 124191, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 198282, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +47, Self-emp-not-inc, 149116, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +20, Private, 188300, Some-college, 10, Never-married, 
Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 103432, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Self-emp-inc, 317660, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 40, United-States, >50K +17, ?, 304873, 10th, 6, Never-married, ?, Own-child, White, Female, 34095, 0, 32, United-States, <=50K +30, Private, 194901, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Local-gov, 189265, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 124692, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 432376, Bachelors, 13, Never-married, Sales, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 65324, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +56, Self-emp-not-inc, 335605, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 1887, 50, Canada, >50K +28, Private, 377869, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 4064, 0, 25, United-States, <=50K +36, Private, 102864, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 95647, 9th, 5, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, United-States, <=50K +56, Self-emp-inc, 303090, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +49, Local-gov, 197371, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +55, Private, 247552, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 56, United-States, <=50K +22, Private, 102632, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 
0, 41, United-States, <=50K +21, Private, 199915, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 118853, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 77143, Bachelors, 13, Never-married, Exec-managerial, Own-child, Black, Male, 0, 0, 40, Germany, <=50K +29, State-gov, 267989, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +19, Private, 301606, Some-college, 10, Never-married, Other-service, Own-child, Black, Male, 0, 0, 35, United-States, <=50K +47, Private, 287828, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +20, Private, 111697, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 1719, 28, United-States, <=50K +31, Private, 114937, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +35, ?, 129305, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 365739, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 69621, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +24, Private, 43323, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 1762, 40, United-States, <=50K +38, Self-emp-not-inc, 120985, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 4386, 0, 35, United-States, <=50K +37, Private, 254202, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +46, Private, 146195, Assoc-acdm, 12, Divorced, Tech-support, Not-in-family, Black, Female, 0, 0, 36, United-States, <=50K +38, Federal-gov, 125933, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, Iran, >50K +43, 
Self-emp-not-inc, 56920, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +27, Private, 163127, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, <=50K +20, Private, 34310, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +49, Private, 81973, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +61, Self-emp-inc, 66614, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 232782, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 316868, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, Mexico, <=50K +45, Private, 196584, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1564, 40, United-States, >50K +70, Private, 105376, Some-college, 10, Never-married, Tech-support, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 185814, HS-grad, 9, Never-married, Transport-moving, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +22, Private, 175374, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 24, United-States, <=50K +36, Private, 108293, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 24, United-States, <=50K +64, Private, 181232, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2179, 40, United-States, <=50K +43, ?, 174662, Some-college, 10, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Local-gov, 186009, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, Mexico, <=50K +34, Private, 198183, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 163003, Bachelors, 13, 
Never-married, Exec-managerial, Other-relative, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +21, Private, 296158, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 35, United-States, <=50K +52, ?, 252903, HS-grad, 9, Divorced, ?, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +48, Private, 187715, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 46, United-States, <=50K +23, Private, 214542, Bachelors, 13, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +71, Self-emp-not-inc, 494223, Some-college, 10, Separated, Sales, Unmarried, Black, Male, 0, 1816, 2, United-States, <=50K +29, Private, 191535, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +42, Private, 228456, Bachelors, 13, Separated, Other-service, Other-relative, Black, Male, 0, 0, 50, United-States, <=50K +68, ?, 38317, 1st-4th, 2, Divorced, ?, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +25, Private, 252752, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +44, Self-emp-inc, 78374, Masters, 14, Divorced, Exec-managerial, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +28, Private, 88419, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, England, <=50K +45, Self-emp-not-inc, 201080, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 207157, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +39, Federal-gov, 235485, Assoc-acdm, 12, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +46, State-gov, 102628, Masters, 14, Widowed, Protective-serv, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 25828, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, 
Male, 0, 0, 16, United-States, <=50K +66, Local-gov, 54826, Assoc-voc, 11, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +27, Private, 124953, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 1980, 40, United-States, <=50K +28, State-gov, 175325, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 96062, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 1977, 40, United-States, >50K +27, Private, 428030, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +28, State-gov, 149624, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 253814, HS-grad, 9, Married-spouse-absent, Sales, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +21, Private, 312956, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +34, Private, 483777, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +18, Private, 183930, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 12, United-States, <=50K +33, Private, 37274, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 65, United-States, <=50K +44, Local-gov, 181344, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 38, United-States, >50K +43, Private, 114580, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 633742, Some-college, 10, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 45, United-States, <=50K +40, Private, 286370, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, >50K +37, Federal-gov, 29054, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 
0, 42, United-States, >50K +34, Private, 304030, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +41, Self-emp-not-inc, 143129, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, ?, 135105, Bachelors, 13, Divorced, ?, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +31, Private, 99928, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, <=50K +58, State-gov, 109567, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 1, United-States, >50K +38, Private, 155222, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 28, United-States, <=50K +24, Private, 159567, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Local-gov, 523910, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +47, Private, 120939, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, <=50K +41, Federal-gov, 130760, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 24, United-States, <=50K +23, Private, 197387, 5th-6th, 3, Married-civ-spouse, Transport-moving, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +36, Private, 99374, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Federal-gov, 56795, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 14084, 0, 55, United-States, >50K +35, Private, 138992, Masters, 14, Married-civ-spouse, Prof-specialty, Other-relative, White, Male, 7298, 0, 40, United-States, >50K +24, Self-emp-not-inc, 32921, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 397317, Masters, 14, Never-married, Prof-specialty, Not-in-family, 
White, Female, 0, 1876, 40, United-States, <=50K +19, ?, 170653, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 40, Italy, <=50K +51, Private, 259323, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +42, Local-gov, 254817, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1340, 40, United-States, <=50K +37, State-gov, 48211, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +18, Private, 140164, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 128757, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Black, Male, 7298, 0, 36, United-States, >50K +35, Private, 36270, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +58, Self-emp-inc, 210563, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 15024, 0, 35, United-States, >50K +17, Private, 65368, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 12, United-States, <=50K +44, Local-gov, 160943, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +37, Private, 208358, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 153790, Some-college, 10, Never-married, Sales, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +60, Private, 85815, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-inc, 125417, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +37, Private, 635913, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 60, United-States, >50K +50, Private, 313321, Assoc-acdm, 12, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, 
Private, 182609, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, Poland, <=50K +45, Private, 109434, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, <=50K +25, Private, 255004, 10th, 6, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 197860, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +64, ?, 187656, 1st-4th, 2, Divorced, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +90, Private, 51744, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 0, 2206, 40, United-States, <=50K +54, Private, 176681, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 20, United-States, <=50K +53, Local-gov, 140359, Preschool, 1, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +18, Private, 243313, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +60, ?, 24215, 10th, 6, Divorced, ?, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 10, United-States, <=50K +66, Self-emp-not-inc, 167687, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 1409, 0, 50, United-States, <=50K +75, Private, 314209, Assoc-voc, 11, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 20, Columbia, <=50K +65, Private, 176796, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 538583, 11th, 7, Separated, Transport-moving, Not-in-family, Black, Male, 3674, 0, 40, United-States, <=50K +41, Private, 130408, HS-grad, 9, Divorced, Sales, Unmarried, Black, Female, 0, 0, 38, United-States, <=50K +25, Private, 159732, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +33, Private, 110978, Some-college, 10, Divorced, Craft-repair, Other-relative, 
Other, Female, 0, 0, 40, United-States, <=50K +28, Private, 76714, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +59, State-gov, 268700, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, State-gov, 170525, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +41, Private, 180138, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, Iran, >50K +38, Local-gov, 115076, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, >50K +23, Private, 115458, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 347890, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Self-emp-not-inc, 196001, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 20, United-States, <=50K +24, State-gov, 273905, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 50, United-States, <=50K +20, ?, 119156, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 20, United-States, <=50K +38, Private, 179488, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 1741, 40, United-States, <=50K +56, Private, 203580, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 35, ?, <=50K +58, Private, 236596, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, United-States, >50K +32, Private, 183916, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 34, United-States, <=50K +40, Private, 207578, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1977, 60, United-States, >50K +45, Private, 153141, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, ?, 
<=50K +41, Private, 112763, Prof-school, 15, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +42, Private, 390781, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +59, Local-gov, 171328, 10th, 6, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +19, Local-gov, 27382, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 259014, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +42, Self-emp-not-inc, 303044, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Cambodia, >50K +20, Private, 117789, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 172579, HS-grad, 9, Separated, Other-service, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +45, Private, 187666, Assoc-voc, 11, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +50, Private, 204518, 7th-8th, 4, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 150042, Bachelors, 13, Divorced, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +17, Private, 245918, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 12, United-States, <=50K +59, Private, 146013, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 40, United-States, <=50K +26, Private, 378322, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-inc, 257295, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 75, Thailand, >50K +19, ?, 218956, 
Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 24, Canada, <=50K +64, Private, 21174, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 185480, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +33, Private, 222205, HS-grad, 9, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 40, United-States, >50K +61, Private, 69867, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +17, Private, 191260, 9th, 5, Never-married, Other-service, Own-child, White, Male, 1055, 0, 24, United-States, <=50K +50, Self-emp-not-inc, 30653, Masters, 14, Married-civ-spouse, Farming-fishing, Husband, White, Male, 2407, 0, 98, United-States, <=50K +27, Local-gov, 209109, Masters, 14, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 35, United-States, <=50K +30, Private, 70377, HS-grad, 9, Divorced, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 477983, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +44, Private, 170924, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +35, Private, 190174, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 193787, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 279472, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 7298, 0, 48, United-States, >50K +22, Private, 34918, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 15, Germany, <=50K +42, Local-gov, 97688, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 5178, 0, 40, United-States, >50K +34, Private, 175413, 
Assoc-acdm, 12, Divorced, Sales, Unmarried, Black, Female, 0, 0, 45, United-States, <=50K +60, Private, 173960, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 42, United-States, <=50K +21, Private, 205759, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +57, Federal-gov, 425161, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 40, United-States, >50K +41, Private, 220531, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +50, Private, 176609, Some-college, 10, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +25, Private, 371987, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, Private, 193884, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Ecuador, <=50K +36, Private, 200352, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +31, Private, 127595, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Local-gov, 220419, Bachelors, 13, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +21, Private, 231931, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 45, United-States, <=50K +27, Private, 248402, Bachelors, 13, Never-married, Tech-support, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +65, Private, 111095, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 16, United-States, <=50K +37, Self-emp-inc, 57424, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +39, ?, 157443, Masters, 14, Married-civ-spouse, ?, Wife, Asian-Pac-Islander, Female, 3464, 0, 40, ?, <=50K +24, Private, 278130, HS-grad, 9, Never-married, Craft-repair, Own-child, 
White, Male, 0, 0, 40, United-States, <=50K +38, Private, 169469, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +48, Private, 146268, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7688, 0, 40, United-States, >50K +21, Private, 153718, Some-college, 10, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 25, United-States, <=50K +31, Private, 217460, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, >50K +55, Private, 238638, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 4386, 0, 40, United-States, >50K +24, Private, 303296, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, Asian-Pac-Islander, Female, 0, 0, 40, Laos, <=50K +43, Private, 173321, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 193945, Assoc-acdm, 12, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +46, Private, 83082, Assoc-acdm, 12, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 33, United-States, <=50K +35, Private, 193815, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Self-emp-inc, 34987, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 54, United-States, >50K +26, Private, 59306, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 142897, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 7298, 0, 35, Taiwan, >50K +19, ?, 860348, Some-college, 10, Never-married, ?, Own-child, Black, Female, 0, 0, 25, United-States, <=50K +36, Self-emp-not-inc, 205607, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, >50K +22, Private, 199698, Some-college, 10, Never-married, Sales, Own-child, White, 
Male, 0, 0, 15, United-States, <=50K +24, Private, 191954, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +77, Self-emp-not-inc, 138714, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 399087, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, Mexico, <=50K +29, Private, 423158, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +62, Private, 159841, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +39, Self-emp-not-inc, 174308, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 50356, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1485, 50, United-States, <=50K +35, Private, 186110, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +29, Private, 200381, 11th, 7, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +76, Self-emp-not-inc, 174309, Masters, 14, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K +63, Self-emp-not-inc, 78383, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 45, United-States, <=50K +23, ?, 211601, Assoc-voc, 11, Never-married, ?, Own-child, Black, Female, 0, 0, 15, United-States, <=50K +43, Private, 187728, Some-college, 10, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1887, 50, United-States, >50K +58, Self-emp-not-inc, 321171, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, Private, 127921, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 2050, 0, 55, United-States, <=50K +41, Private, 206565, Some-college, 10, 
Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 45, United-States, <=50K +26, Private, 224563, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 178686, Assoc-voc, 11, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +55, Local-gov, 98545, 10th, 6, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 242606, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 270942, 5th-6th, 3, Never-married, Other-service, Other-relative, White, Male, 0, 0, 48, Mexico, <=50K +30, Private, 94235, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 71195, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +19, Private, 104112, HS-grad, 9, Never-married, Sales, Unmarried, Black, Male, 0, 0, 30, Haiti, <=50K +45, Private, 261192, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, United-States, <=50K +26, Private, 94936, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 296478, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +36, State-gov, 119272, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K +33, Private, 85043, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +22, State-gov, 293364, Some-college, 10, Never-married, Protective-serv, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 241895, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +67, ?, 36135, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 
8, United-States, <=50K +30, ?, 151989, Assoc-voc, 11, Divorced, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +56, Private, 101128, Assoc-acdm, 12, Married-spouse-absent, Other-service, Not-in-family, White, Male, 0, 0, 25, Iran, <=50K +31, Private, 156464, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 25, United-States, <=50K +33, Private, 117963, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 192262, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 111363, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Local-gov, 329752, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 30, United-States, <=50K +59, ?, 372020, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, >50K +38, Federal-gov, 95432, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +65, Private, 161400, 11th, 7, Widowed, Other-service, Unmarried, Other, Male, 0, 0, 40, United-States, <=50K +40, Private, 96129, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +42, Private, 111949, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, <=50K +26, Self-emp-not-inc, 117125, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Portugal, <=50K +36, Private, 348022, 10th, 6, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 24, United-States, <=50K +62, Private, 270092, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 180609, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +43, Private, 174575, Bachelors, 13, Divorced, 
Exec-managerial, Not-in-family, White, Male, 0, 1564, 45, United-States, >50K +22, Private, 410439, HS-grad, 9, Married-spouse-absent, Sales, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +28, Private, 92262, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, Self-emp-not-inc, 183081, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +22, Private, 362589, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +57, Private, 212448, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, >50K +39, Private, 481060, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +26, Federal-gov, 185885, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 15, United-States, <=50K +17, Private, 89821, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, United-States, <=50K +40, State-gov, 184018, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 38, United-States, >50K +45, Private, 256649, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +44, Private, 160323, HS-grad, 9, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +20, Local-gov, 350845, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 10, United-States, <=50K +33, Private, 267404, HS-grad, 9, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 35633, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Self-emp-not-inc, 80914, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +38, Private, 172927, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, 
Male, 0, 0, 50, United-States, <=50K +54, Private, 174319, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 214955, 5th-6th, 3, Divorced, Craft-repair, Not-in-family, White, Female, 0, 2339, 45, United-States, <=50K +25, Private, 344991, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 108699, Some-college, 10, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +36, Local-gov, 117312, Some-college, 10, Married-civ-spouse, Transport-moving, Wife, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 396099, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +29, Private, 134152, HS-grad, 9, Separated, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 162028, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 2415, 6, United-States, >50K +19, Private, 25429, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 16, United-States, <=50K +19, Private, 232392, HS-grad, 9, Never-married, Other-service, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 220098, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, >50K +27, Private, 301302, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +46, Self-emp-not-inc, 277946, Assoc-acdm, 12, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, State-gov, 98101, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 45, ?, >50K +34, Private, 196164, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +44, Private, 115562, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 
0, 0, 40, United-States, <=50K +45, Private, 96975, Some-college, 10, Divorced, Handlers-cleaners, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +20, ?, 137300, HS-grad, 9, Never-married, ?, Other-relative, White, Female, 0, 0, 35, United-States, <=50K +25, Private, 86872, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +52, Self-emp-inc, 132178, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +20, Private, 416103, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 108574, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, State-gov, 288353, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 227689, Assoc-voc, 11, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 64, United-States, <=50K +28, Private, 166481, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, Other, Male, 0, 2179, 40, Puerto-Rico, <=50K +41, Private, 445382, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 65, United-States, >50K +28, Private, 110145, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Self-emp-not-inc, 317253, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 25, United-States, <=50K +28, ?, 123147, Some-college, 10, Married-civ-spouse, ?, Wife, White, Female, 0, 1887, 40, United-States, >50K +32, Private, 364657, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +41, Local-gov, 42346, Some-college, 10, Divorced, Other-service, Not-in-family, Black, Female, 0, 0, 24, United-States, <=50K +24, Private, 241951, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, Black, Female, 0, 
0, 40, United-States, <=50K +33, Private, 118500, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 188386, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +31, State-gov, 1033222, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 92440, 12th, 8, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +52, Private, 190762, 1st-4th, 2, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +30, Private, 426017, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 19, United-States, <=50K +34, Local-gov, 243867, 11th, 7, Separated, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +34, State-gov, 240283, HS-grad, 9, Divorced, Transport-moving, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 61777, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 30, United-States, <=50K +17, Private, 175024, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 2176, 0, 18, United-States, <=50K +32, State-gov, 92003, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +29, Private, 188401, HS-grad, 9, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 228528, 10th, 6, Never-married, Craft-repair, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +25, Private, 133373, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +36, Federal-gov, 255191, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 1408, 40, United-States, <=50K +23, Private, 204653, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 72, Dominican-Republic, 
<=50K +63, Self-emp-inc, 222289, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +47, Local-gov, 287480, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +80, ?, 107762, HS-grad, 9, Widowed, ?, Not-in-family, White, Male, 0, 0, 24, United-States, <=50K +17, ?, 202521, 11th, 7, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Self-emp-not-inc, 204116, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 2174, 0, 40, United-States, <=50K +30, Private, 29662, Assoc-acdm, 12, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 25, United-States, >50K +27, Private, 116358, Some-college, 10, Never-married, Craft-repair, Own-child, Asian-Pac-Islander, Male, 0, 1980, 40, Philippines, <=50K +33, Private, 208405, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +34, Local-gov, 284843, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, Black, Male, 594, 0, 60, United-States, <=50K +34, Local-gov, 117018, Some-college, 10, Never-married, Protective-serv, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 81281, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Local-gov, 340148, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 363425, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 45857, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 28, United-States, <=50K +24, Federal-gov, 191073, HS-grad, 9, Never-married, Armed-Forces, Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 116632, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, 
>50K +27, Private, 405855, 9th, 5, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +20, Private, 298227, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +44, Private, 290521, HS-grad, 9, Widowed, Exec-managerial, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +51, Private, 56915, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +20, Private, 146538, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, ?, 258872, 11th, 7, Never-married, ?, Own-child, White, Female, 0, 0, 5, United-States, <=50K +19, Private, 206399, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +45, Self-emp-inc, 197332, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, >50K +60, Private, 245062, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +42, Private, 197583, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, ?, >50K +44, Self-emp-not-inc, 234885, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +40, Private, 72887, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +30, Private, 180374, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 351299, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 50, United-States, <=50K +23, Private, 54012, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +32, ?, 115745, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 116632, Assoc-acdm, 12, 
Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +54, Local-gov, 288825, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +32, Private, 132601, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +50, Private, 193374, 1st-4th, 2, Married-spouse-absent, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 170070, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +37, Private, 126708, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 60, United-States, <=50K +52, Private, 35598, HS-grad, 9, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 33983, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 192776, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 45, United-States, >50K +30, Private, 118551, Bachelors, 13, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 16, United-States, >50K +60, Private, 201965, Some-college, 10, Never-married, Prof-specialty, Unmarried, White, Male, 0, 0, 40, United-States, >50K +22, ?, 139883, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 285020, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 303990, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +67, Private, 49401, Assoc-voc, 11, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +46, Private, 279196, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 211870, 9th, 5, Never-married, Other-service, 
Not-in-family, White, Male, 0, 0, 6, United-States, <=50K +22, Private, 281432, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +27, Private, 161155, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 197904, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +33, Private, 111746, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, Portugal, <=50K +43, Self-emp-not-inc, 170721, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +28, State-gov, 70100, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +41, Private, 193626, HS-grad, 9, Married-spouse-absent, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +52, ?, 271749, 12th, 8, Never-married, ?, Other-relative, Black, Male, 594, 0, 40, United-States, <=50K +25, Private, 189775, Some-college, 10, Married-spouse-absent, Adm-clerical, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +63, ?, 401531, 1st-4th, 2, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 35, United-States, <=50K +59, Local-gov, 286967, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +45, Local-gov, 164427, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 91039, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 60, United-States, >50K +40, Private, 347934, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +46, Federal-gov, 371373, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 32220, Assoc-acdm, 12, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 
0, 60, United-States, <=50K +34, Private, 187251, HS-grad, 9, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +33, Private, 178107, Bachelors, 13, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 20, United-States, <=50K +41, Private, 343121, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +20, Private, 262749, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 403107, 5th-6th, 3, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, El-Salvador, <=50K +26, Private, 64293, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +72, ?, 303588, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, <=50K +23, Local-gov, 324960, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, Poland, <=50K +62, Local-gov, 114060, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 48925, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 180980, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 42, France, <=50K +25, Private, 181054, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 388093, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +19, Private, 249609, Some-college, 10, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 8, United-States, <=50K +43, Private, 112131, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Local-gov, 543162, HS-grad, 9, Separated, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +39, Private, 91996, 
HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 141944, Assoc-voc, 11, Married-spouse-absent, Handlers-cleaners, Unmarried, White, Male, 0, 1380, 42, United-States, <=50K +53, ?, 251804, 5th-6th, 3, Widowed, ?, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +32, Private, 37070, Assoc-voc, 11, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +34, Private, 337587, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +28, Private, 189346, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +57, ?, 222216, Assoc-voc, 11, Widowed, ?, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +25, Private, 267044, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 20, United-States, <=50K +20, ?, 214635, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 24, United-States, <=50K +21, ?, 204226, Some-college, 10, Never-married, ?, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +34, Private, 108116, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +38, Self-emp-inc, 99146, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 80, United-States, >50K +50, Private, 196232, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 50, United-States, >50K +24, Local-gov, 248344, Some-college, 10, Divorced, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K +37, Local-gov, 186035, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, >50K +44, Private, 177905, Some-college, 10, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 0, 58, United-States, >50K +28, Private, 85812, Some-college, 10, Married-civ-spouse, Sales, Wife, 
White, Female, 0, 0, 40, United-States, <=50K +42, Private, 221172, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +74, Private, 99183, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 9, United-States, <=50K +38, Self-emp-not-inc, 190387, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +44, Self-emp-not-inc, 202692, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 109339, 11th, 7, Divorced, Machine-op-inspct, Unmarried, Other, Female, 0, 0, 46, Puerto-Rico, <=50K +26, Private, 108658, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 197202, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 101739, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 50, United-States, >50K +67, Private, 231559, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 20051, 0, 48, United-States, >50K +39, Local-gov, 207853, 12th, 8, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, Private, 190942, 1st-4th, 2, Widowed, Priv-house-serv, Not-in-family, Black, Female, 0, 0, 30, United-States, <=50K +29, Private, 102345, Assoc-voc, 11, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Self-emp-inc, 41493, Bachelors, 13, Never-married, Farming-fishing, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +34, ?, 190027, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 210525, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 133937, Doctorate, 16, Never-married, Prof-specialty, Own-child, White, 
Male, 0, 0, 40, United-States, <=50K +30, Private, 237903, Some-college, 10, Never-married, Handlers-cleaners, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 163862, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 201872, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 84179, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +58, Private, 51662, 10th, 6, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 8, United-States, <=50K +35, Local-gov, 233327, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 259510, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 36, United-States, <=50K +28, Private, 184831, Some-college, 10, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +46, Self-emp-not-inc, 245724, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +36, Self-emp-not-inc, 27053, HS-grad, 9, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +72, Private, 205343, 11th, 7, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 229328, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 40, United-States, <=50K +33, Federal-gov, 319560, Assoc-voc, 11, Divorced, Craft-repair, Unmarried, Black, Female, 0, 0, 40, United-States, >50K +69, Private, 136218, 11th, 7, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 54576, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 323069, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 20, 
?, <=50K +34, Private, 148291, HS-grad, 9, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 32, United-States, <=50K +30, Private, 152453, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +28, Private, 114053, Bachelors, 13, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +54, Private, 212960, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 35, United-States, >50K +47, Private, 264052, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 82804, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +52, Self-emp-not-inc, 334273, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +20, Private, 27337, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 48, United-States, <=50K +43, Self-emp-inc, 188436, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5013, 0, 45, United-States, <=50K +45, Private, 433665, 7th-8th, 4, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +29, Self-emp-not-inc, 110663, HS-grad, 9, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +47, Private, 87490, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 42, United-States, <=50K +24, Private, 354351, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 95469, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 242718, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 12, United-States, <=50K +37, Private, 22463, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1977, 40, United-States, >50K +27, Private, 158156, Doctorate, 
16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 70, United-States, <=50K +29, Private, 350162, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Male, 0, 0, 40, United-States, >50K +18, ?, 165532, 12th, 8, Never-married, ?, Own-child, White, Male, 0, 0, 25, United-States, <=50K +36, Self-emp-not-inc, 28738, Assoc-acdm, 12, Divorced, Sales, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +58, Local-gov, 283635, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Self-emp-not-inc, 86646, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +65, ?, 195733, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, >50K +57, Private, 69884, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +59, Private, 199713, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 181659, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 340939, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 197747, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 24, United-States, <=50K +29, Private, 34292, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +18, Private, 156764, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 25826, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1887, 47, United-States, >50K +57, Self-emp-inc, 103948, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +42, ?, 137390, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, 
United-States, <=50K +55, ?, 105138, HS-grad, 9, Married-civ-spouse, ?, Wife, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +60, Private, 39352, 7th-8th, 4, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 48, United-States, >50K +31, Private, 168387, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, Canada, >50K +23, Private, 117789, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 267147, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, ?, 99399, Some-college, 10, Never-married, ?, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 25, United-States, <=50K +42, Self-emp-not-inc, 214242, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 50, United-States, >50K +25, Private, 200408, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K +49, Private, 136455, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +32, Private, 239824, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 217039, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 28, United-States, <=50K +60, Private, 51290, 7th-8th, 4, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Local-gov, 175674, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 194404, Assoc-acdm, 12, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 45612, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 37, United-States, <=50K +51, Private, 410114, Masters, 14, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, 
>50K +29, Private, 182521, HS-grad, 9, Never-married, Craft-repair, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +36, Local-gov, 339772, HS-grad, 9, Separated, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +17, Private, 169658, 10th, 6, Never-married, Other-service, Own-child, White, Female, 0, 0, 21, United-States, <=50K +52, Private, 200853, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 6849, 0, 60, United-States, <=50K +24, Private, 247564, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 249909, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Local-gov, 208122, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 1055, 0, 40, United-States, <=50K +27, Private, 109881, Bachelors, 13, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +39, Private, 207824, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 369027, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 45, United-States, <=50K +50, Self-emp-not-inc, 114117, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 32, United-States, <=50K +52, Self-emp-inc, 51048, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +46, Private, 102388, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 45, United-States, >50K +23, Private, 190483, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +45, Private, 462440, 11th, 7, Widowed, Other-service, Not-in-family, Black, Female, 0, 0, 20, United-States, <=50K +65, Private, 109351, 9th, 5, Widowed, Priv-house-serv, Unmarried, Black, Female, 0, 0, 24, United-States, <=50K +29, Private, 
34383, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, <=50K +47, Private, 241832, 9th, 5, Married-spouse-absent, Handlers-cleaners, Unmarried, White, Male, 0, 0, 40, El-Salvador, <=50K +30, Private, 124187, HS-grad, 9, Never-married, Farming-fishing, Own-child, Black, Male, 0, 0, 60, United-States, <=50K +34, Private, 153614, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +38, Self-emp-not-inc, 267556, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 64, United-States, <=50K +33, Private, 205469, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Private, 268090, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 26, United-States, >50K +47, Self-emp-not-inc, 165039, Some-college, 10, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +49, Local-gov, 120451, 10th, 6, Separated, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 154374, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 15024, 0, 60, United-States, >50K +30, Private, 103649, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, >50K +58, Self-emp-not-inc, 35723, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +19, Private, 262601, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 14, United-States, <=50K +21, Private, 226181, Bachelors, 13, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 175697, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +47, Self-emp-inc, 248145, 5th-6th, 3, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, Cuba, <=50K +52, Self-emp-not-inc, 
289436, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +26, Private, 75654, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +60, Private, 199378, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 160968, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 188563, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 5178, 0, 50, United-States, >50K +31, Private, 55849, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +50, Self-emp-inc, 195322, Doctorate, 16, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +31, Local-gov, 402089, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +71, Private, 78277, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 15, United-States, <=50K +58, ?, 158611, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, <=50K +30, State-gov, 169496, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 130959, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +24, Private, 556660, HS-grad, 9, Never-married, Exec-managerial, Other-relative, White, Male, 4101, 0, 50, United-States, <=50K +35, Private, 292472, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Taiwan, >50K +38, State-gov, 143774, Some-college, 10, Separated, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +27, Private, 288341, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 32, United-States, <=50K +29, State-gov, 71592, Some-college, 10, 
Never-married, Adm-clerical, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +70, ?, 167358, 9th, 5, Widowed, ?, Unmarried, White, Female, 1111, 0, 15, United-States, <=50K +34, Private, 106742, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +44, Private, 219288, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +43, Private, 174524, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Self-emp-not-inc, 335183, 12th, 8, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, >50K +35, Private, 261293, Masters, 14, Never-married, Sales, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +27, Private, 111900, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +43, Local-gov, 194360, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +20, Private, 81145, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +42, Private, 341204, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, White, Female, 8614, 0, 40, United-States, >50K +27, State-gov, 249362, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 3411, 0, 40, United-States, <=50K +42, Private, 247019, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +20, ?, 114746, 11th, 7, Married-spouse-absent, ?, Own-child, Asian-Pac-Islander, Female, 0, 1762, 40, South, <=50K +24, Private, 172146, 9th, 5, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 1721, 40, United-States, <=50K +48, Federal-gov, 110457, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +17, ?, 80077, 11th, 7, Never-married, ?, Own-child, White, 
Female, 0, 0, 20, United-States, <=50K +17, Self-emp-not-inc, 368700, 11th, 7, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 10, United-States, <=50K +33, Private, 182556, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Self-emp-inc, 219420, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +22, Private, 240817, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 2597, 0, 40, United-States, <=50K +17, Private, 102726, 12th, 8, Never-married, Other-service, Own-child, White, Male, 0, 0, 16, United-States, <=50K +32, Private, 226267, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, Mexico, <=50K +31, Private, 125457, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +58, Self-emp-not-inc, 204021, HS-grad, 9, Widowed, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +29, Local-gov, 92262, HS-grad, 9, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 48, United-States, <=50K +37, Private, 161141, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Portugal, >50K +34, Self-emp-not-inc, 190290, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Local-gov, 430828, Some-college, 10, Separated, Exec-managerial, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +18, State-gov, 59342, 11th, 7, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 5, United-States, <=50K +34, Private, 136721, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +66, ?, 149422, 7th-8th, 4, Never-married, ?, Not-in-family, White, Male, 0, 0, 4, United-States, <=50K +45, Local-gov, 86644, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 55, United-States, <=50K +41, Private, 195124, 
Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 35, Dominican-Republic, <=50K +26, Private, 167350, HS-grad, 9, Never-married, Other-service, Other-relative, White, Male, 0, 0, 30, United-States, <=50K +54, Local-gov, 113000, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 140027, Some-college, 10, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 0, 45, United-States, <=50K +42, Private, 262425, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 316702, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 20, United-States, <=50K +23, State-gov, 335453, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +25, ?, 202480, Assoc-acdm, 12, Never-married, ?, Other-relative, White, Male, 0, 0, 45, United-States, <=50K +35, Private, 203628, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 60, United-States, >50K +31, Private, 118710, Masters, 14, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1902, 40, United-States, >50K +30, Private, 189620, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, Poland, <=50K +19, Private, 475028, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +36, Local-gov, 110866, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +31, Private, 243605, Bachelors, 13, Widowed, Sales, Unmarried, White, Female, 0, 1380, 40, Cuba, <=50K +21, Private, 163870, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +31, Self-emp-not-inc, 80145, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 295566, Doctorate, 
16, Divorced, Prof-specialty, Unmarried, White, Female, 25236, 0, 65, United-States, >50K +44, Private, 63042, Bachelors, 13, Divorced, Exec-managerial, Own-child, White, Female, 0, 0, 50, United-States, >50K +40, Private, 229148, 12th, 8, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, Jamaica, <=50K +45, Private, 242552, Some-college, 10, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +60, Private, 177665, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +18, Private, 208103, 11th, 7, Never-married, Other-service, Other-relative, White, Male, 0, 0, 25, United-States, <=50K +28, Private, 296450, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 70282, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +36, Private, 271767, Bachelors, 13, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, ?, <=50K +40, Private, 144995, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 4386, 0, 40, United-States, <=50K +36, Local-gov, 382635, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, Honduras, <=50K +31, Private, 295697, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +33, Private, 194141, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, State-gov, 378418, HS-grad, 9, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 214399, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +34, Private, 217460, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +33, Private, 182556, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, 
United-States, <=50K +41, Private, 125831, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2051, 60, United-States, <=50K +29, Private, 271328, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 4650, 0, 40, United-States, <=50K +50, Local-gov, 50459, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 42, United-States, >50K +42, Private, 162140, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 45, United-States, >50K +43, Private, 177937, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, ?, >50K +44, Private, 111502, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 299047, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +31, Private, 223212, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +65, Self-emp-not-inc, 118474, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 9386, 0, 59, ?, >50K +23, Private, 352139, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +55, Private, 173093, Some-college, 10, Divorced, Adm-clerical, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +26, Private, 181655, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 2377, 45, United-States, <=50K +25, Private, 332702, Assoc-voc, 11, Never-married, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +45, ?, 51164, Some-college, 10, Married-civ-spouse, ?, Wife, Black, Female, 0, 0, 40, United-States, <=50K +35, Private, 234901, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 2407, 0, 40, United-States, <=50K +36, Private, 131414, Some-college, 10, Never-married, Sales, Not-in-family, Black, Female, 0, 0, 36, United-States, 
<=50K +43, State-gov, 260960, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +56, Private, 156052, HS-grad, 9, Widowed, Other-service, Unmarried, Black, Female, 594, 0, 20, United-States, <=50K +42, Private, 279914, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +19, Private, 192453, Some-college, 10, Never-married, Other-service, Other-relative, White, Female, 0, 0, 25, United-States, <=50K +55, Self-emp-not-inc, 200939, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 72, United-States, <=50K +42, Private, 151408, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K +26, Private, 112847, Assoc-voc, 11, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 316929, 12th, 8, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +42, Local-gov, 126319, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +55, Private, 197422, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 7688, 0, 40, United-States, >50K +32, Private, 267736, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +29, Private, 267034, 11th, 7, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, Haiti, <=50K +46, State-gov, 193047, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 37, United-States, <=50K +29, State-gov, 356089, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +22, Private, 223515, Bachelors, 13, Never-married, Prof-specialty, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +58, Self-emp-not-inc, 87510, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, 
<=50K +23, Private, 145111, HS-grad, 9, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +39, Private, 48093, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 31757, Assoc-voc, 11, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 38, United-States, <=50K +54, Private, 285854, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Local-gov, 120064, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +46, Federal-gov, 167381, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +37, Private, 103408, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +36, Private, 101460, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 18, United-States, <=50K +59, Local-gov, 420537, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 38, United-States, >50K +34, Local-gov, 119411, HS-grad, 9, Divorced, Protective-serv, Unmarried, White, Male, 0, 0, 40, Portugal, <=50K +53, Self-emp-inc, 128272, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, >50K +51, Private, 386773, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +32, Private, 283268, 10th, 6, Separated, Other-service, Unmarried, White, Female, 0, 0, 42, United-States, <=50K +31, State-gov, 301526, Some-college, 10, Married-spouse-absent, Other-service, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 151790, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 30, Germany, <=50K +47, Self-emp-not-inc, 106252, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +32, Private, 188557, 
HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 171114, Some-college, 10, Never-married, Farming-fishing, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +37, Private, 327323, 5th-6th, 3, Separated, Farming-fishing, Not-in-family, White, Male, 0, 0, 32, Guatemala, <=50K +31, Private, 244147, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 55, United-States, <=50K +37, Private, 280282, Assoc-voc, 11, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 24, United-States, >50K +55, Private, 116442, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +23, Local-gov, 282579, Assoc-voc, 11, Divorced, Tech-support, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +36, Private, 51838, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 73585, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, <=50K +43, Private, 226902, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +54, Private, 279129, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +43, State-gov, 146908, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, ?, <=50K +28, Private, 196690, Assoc-voc, 11, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 1669, 42, United-States, <=50K +40, Private, 130760, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Self-emp-not-inc, 49572, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +40, Private, 237601, Bachelors, 13, Never-married, Sales, Not-in-family, Other, Female, 0, 0, 55, United-States, >50K +42, Private, 169628, Some-college, 10, Divorced, 
Adm-clerical, Not-in-family, Black, Female, 0, 0, 38, United-States, <=50K +61, Self-emp-not-inc, 36671, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 2352, 50, United-States, <=50K +18, Private, 231193, 12th, 8, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 30, United-States, <=50K +59, ?, 192130, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 16, United-States, <=50K +21, ?, 149704, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 1055, 0, 40, United-States, <=50K +48, Private, 102102, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Self-emp-inc, 32185, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, ?, 196061, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 33, United-States, <=50K +23, Private, 211046, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 2463, 0, 40, United-States, <=50K +60, Private, 31577, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +22, Private, 162343, Some-college, 10, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 20, United-States, <=50K +61, Private, 128831, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 316688, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +46, Private, 90758, Masters, 14, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 35, United-States, >50K +43, Private, 274363, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 40, England, >50K +43, Private, 154538, Assoc-acdm, 12, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 106085, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 1721, 30, 
United-States, <=50K +68, Self-emp-not-inc, 315859, 11th, 7, Never-married, Farming-fishing, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +31, Private, 51471, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +17, Private, 193830, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +32, Private, 231043, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 5178, 0, 48, United-States, >50K +50, ?, 23780, Masters, 14, Married-spouse-absent, ?, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 169879, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 47, United-States, >50K +64, Private, 270333, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 138768, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 30, United-States, <=50K +30, Private, 191571, HS-grad, 9, Separated, Other-service, Own-child, White, Female, 0, 0, 36, United-States, <=50K +22, ?, 219941, Some-college, 10, Never-married, ?, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +43, Private, 94113, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +22, Private, 137510, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 32607, 10th, 6, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 20, United-States, <=50K +47, Self-emp-not-inc, 93208, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 75, Italy, <=50K +41, Private, 254440, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +56, Private, 186556, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +64, Private, 169871, HS-grad, 9, 
Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +47, Private, 191277, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +48, Private, 167159, Assoc-voc, 11, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 171871, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 46, United-States, <=50K +29, Private, 154411, Assoc-voc, 11, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 129227, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 110331, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1672, 60, United-States, <=50K +57, Private, 34269, HS-grad, 9, Widowed, Transport-moving, Unmarried, White, Male, 0, 653, 42, United-States, >50K +62, Private, 174355, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 680390, HS-grad, 9, Separated, Machine-op-inspct, Unmarried, White, Female, 0, 0, 24, United-States, <=50K +43, Private, 233130, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +24, Self-emp-inc, 165474, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, ?, 257780, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +53, Private, 194259, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 4386, 0, 40, United-States, >50K +26, Private, 280093, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +73, Self-emp-not-inc, 177387, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +72, ?, 28929, 11th, 7, Widowed, ?, Not-in-family, White, Female, 0, 0, 24, 
United-States, <=50K +55, Private, 105304, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 499233, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 180572, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +24, Private, 321435, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +63, Private, 86108, HS-grad, 9, Widowed, Farming-fishing, Not-in-family, White, Male, 0, 0, 6, United-States, <=50K +17, Private, 198124, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +35, Private, 135162, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 146813, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +62, Local-gov, 291175, Bachelors, 13, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 48, United-States, <=50K +55, Private, 387569, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 4386, 0, 40, United-States, >50K +43, Private, 102895, Some-college, 10, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Local-gov, 33274, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +37, Private, 86551, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +39, Private, 138192, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 118966, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 18, United-States, <=50K +61, Private, 99784, Masters, 14, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +26, Private, 90980, 
Assoc-voc, 11, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +46, Self-emp-not-inc, 177407, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 96467, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, State-gov, 327886, Doctorate, 16, Divorced, Prof-specialty, Own-child, White, Male, 0, 0, 50, United-States, >50K +34, Private, 111567, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Local-gov, 166545, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 142182, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 188798, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 38563, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 56, United-States, >50K +18, Private, 216284, 11th, 7, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +43, Private, 191547, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, Mexico, <=50K +48, Private, 285335, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +28, Self-emp-inc, 142712, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, <=50K +33, Private, 80945, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 309055, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +21, Private, 62339, 10th, 6, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 368700, 11th, 7, Never-married, 
Sales, Own-child, White, Male, 0, 0, 28, United-States, <=50K +39, Private, 176186, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, >50K +29, Self-emp-not-inc, 266855, Bachelors, 13, Separated, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 48087, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +24, Private, 121313, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 50, United-States, <=50K +71, Self-emp-not-inc, 143437, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 10605, 0, 40, United-States, >50K +51, Self-emp-not-inc, 160724, Bachelors, 13, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 2415, 40, China, >50K +55, Private, 282753, 5th-6th, 3, Divorced, Other-service, Unmarried, Black, Male, 0, 0, 25, United-States, <=50K +41, Private, 194636, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +23, Private, 153044, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, Black, Female, 0, 0, 7, United-States, <=50K +38, Private, 411797, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 117683, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +19, Private, 376540, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +49, Private, 72393, 9th, 5, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 270335, Bachelors, 13, Married-civ-spouse, Adm-clerical, Other-relative, White, Male, 0, 0, 40, Philippines, >50K +27, Private, 96226, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 70, United-States, <=50K +38, Private, 95336, Bachelors, 13, Married-civ-spouse, Sales, 
Husband, White, Male, 0, 0, 50, United-States, >50K +33, Private, 258498, Some-college, 10, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 60, United-States, <=50K +63, ?, 149698, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +23, Private, 205865, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 28, United-States, <=50K +33, Self-emp-inc, 155781, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, ?, <=50K +54, Self-emp-not-inc, 406468, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 40, United-States, <=50K +29, Private, 177119, Assoc-voc, 11, Divorced, Tech-support, Not-in-family, White, Female, 2174, 0, 45, United-States, <=50K +48, ?, 144397, Some-college, 10, Divorced, ?, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +35, Self-emp-not-inc, 372525, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 164170, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, Asian-Pac-Islander, Female, 0, 0, 40, India, <=50K +37, Private, 183800, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 50, United-States, >50K +42, Self-emp-not-inc, 177307, Prof-school, 15, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 65, United-States, >50K +40, Private, 170108, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 341995, Some-college, 10, Divorced, Sales, Own-child, White, Male, 0, 0, 55, United-States, <=50K +22, Private, 226508, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 50, United-States, <=50K +30, Private, 87418, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +28, Private, 109165, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 
40, United-States, <=50K +63, Local-gov, 28856, 7th-8th, 4, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 55, United-States, <=50K +51, Self-emp-not-inc, 175897, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +22, Private, 99697, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, ?, 90270, Assoc-acdm, 12, Married-civ-spouse, ?, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +35, Private, 152375, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +46, Private, 171550, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +37, Private, 211154, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K +24, Private, 202570, Bachelors, 13, Never-married, Prof-specialty, Own-child, Black, Male, 0, 0, 15, United-States, <=50K +37, Self-emp-not-inc, 168496, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, White, Male, 0, 0, 10, United-States, <=50K +53, Private, 68898, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 93235, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +38, Private, 278924, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +53, Self-emp-not-inc, 311020, 10th, 6, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +34, Private, 175878, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 543028, HS-grad, 9, Never-married, Sales, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +39, Private, 202027, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 45, United-States, >50K 
+43, Private, 158926, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, Asian-Pac-Islander, Female, 0, 0, 50, South, <=50K +67, Self-emp-inc, 76860, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +81, Self-emp-not-inc, 136063, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 30, United-States, <=50K +21, Private, 186648, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +23, Private, 257509, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +25, Private, 98155, Some-college, 10, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 274198, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 38, Mexico, <=50K +38, Private, 97083, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +64, ?, 29825, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 5, United-States, <=50K +32, Private, 262153, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 214738, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 138022, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +22, Private, 91842, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 42, United-States, <=50K +33, Private, 373662, 1st-4th, 2, Married-spouse-absent, Priv-house-serv, Not-in-family, White, Female, 0, 0, 40, Guatemala, <=50K +42, Private, 162003, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +19, ?, 52114, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 10, United-States, <=50K +51, Local-gov, 
241843, Preschool, 1, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 375871, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, Mexico, <=50K +37, Private, 186934, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3103, 0, 44, United-States, >50K +37, Private, 176900, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 99, United-States, >50K +47, Private, 21906, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +41, Private, 132222, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2415, 40, United-States, >50K +33, Private, 143653, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, United-States, <=50K +31, Private, 111567, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +31, Private, 78602, Assoc-acdm, 12, Divorced, Other-service, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +35, Private, 465507, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +38, Self-emp-inc, 196373, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +18, Private, 293227, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +20, Private, 241752, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Local-gov, 166398, Some-college, 10, Divorced, Exec-managerial, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +40, Private, 184682, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Self-emp-inc, 108293, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 1977, 45, United-States, >50K +43, Private, 
250802, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 35, United-States, <=50K +44, Self-emp-not-inc, 325159, Some-college, 10, Divorced, Farming-fishing, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +44, State-gov, 174675, HS-grad, 9, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +43, Private, 227065, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 43, United-States, >50K +51, Private, 269080, 7th-8th, 4, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +18, Private, 177722, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +51, Private, 133461, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 239683, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, ?, <=50K +44, Self-emp-inc, 398473, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, >50K +33, Local-gov, 298785, 10th, 6, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Self-emp-not-inc, 123424, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +42, Private, 176286, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 150062, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +32, Private, 169240, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +32, Private, 288273, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 70, Mexico, <=50K +36, Private, 526968, 10th, 6, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 57066, HS-grad, 9, Married-civ-spouse, 
Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 323573, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +35, Self-emp-inc, 368825, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +55, Self-emp-not-inc, 189721, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +48, Private, 164966, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K +36, ?, 94954, Assoc-voc, 11, Widowed, ?, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +34, Private, 202046, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, >50K +28, Private, 161538, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +67, Private, 105252, Bachelors, 13, Widowed, Exec-managerial, Not-in-family, White, Male, 0, 2392, 40, United-States, >50K +37, Private, 200153, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 32185, HS-grad, 9, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 70, United-States, <=50K +25, Private, 178326, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 255957, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 4101, 0, 40, United-States, <=50K +40, State-gov, 188693, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +78, Private, 182977, HS-grad, 9, Widowed, Other-service, Not-in-family, Black, Female, 2964, 0, 40, United-States, <=50K +34, Private, 159929, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 123207, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, 
White, Female, 0, 0, 44, United-States, <=50K +22, Private, 284317, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, ?, 184699, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 154474, HS-grad, 9, Never-married, Farming-fishing, Unmarried, White, Male, 0, 0, 42, United-States, <=50K +45, Local-gov, 318280, HS-grad, 9, Widowed, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +63, Private, 254907, Assoc-voc, 11, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +41, Private, 349221, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Female, 0, 0, 35, United-States, <=50K +47, Private, 335973, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 126701, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 122159, Some-college, 10, Widowed, Prof-specialty, Not-in-family, White, Female, 3325, 0, 40, United-States, <=50K +46, Private, 187370, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 1504, 40, United-States, <=50K +41, Private, 194636, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Self-emp-not-inc, 124793, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, United-States, <=50K +47, Private, 192835, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, >50K +35, Private, 290226, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +56, Private, 112840, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +45, Private, 89325, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +48, 
Federal-gov, 33109, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 58, United-States, >50K +40, Private, 82465, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 2580, 0, 40, United-States, <=50K +39, Self-emp-inc, 329980, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +20, Private, 148294, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 168212, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 65, United-States, >50K +38, State-gov, 343642, HS-grad, 9, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +23, Local-gov, 115244, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 60, United-States, <=50K +31, Private, 162572, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 16, United-States, <=50K +58, Private, 356067, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, Private, 271567, HS-grad, 9, Separated, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +39, Self-emp-inc, 180804, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +54, Self-emp-not-inc, 123011, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 15024, 0, 52, United-States, >50K +26, Private, 109186, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, Germany, <=50K +51, Private, 220537, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 124827, Assoc-voc, 11, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 767403, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3103, 0, 40, 
United-States, >50K +42, Private, 118494, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 44, United-States, >50K +38, Private, 173208, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 25, United-States, <=50K +48, Private, 107373, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 26973, Assoc-voc, 11, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 40, United-States, >50K +51, Private, 191965, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +22, Private, 122346, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, ?, 117201, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 30, United-States, <=50K +41, Private, 198316, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, Japan, <=50K +48, Local-gov, 123075, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +42, Private, 209370, HS-grad, 9, Separated, Sales, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +34, Private, 33117, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 129042, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +56, Private, 169133, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 50, Yugoslavia, <=50K +30, Private, 201624, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 45, ?, <=50K +45, Private, 368561, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +48, Private, 207848, 10th, 6, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +48, Self-emp-inc, 138370, Masters, 14, Married-spouse-absent, 
Sales, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 50, India, <=50K +31, Private, 93106, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, State-gov, 223515, Assoc-acdm, 12, Never-married, Other-service, Own-child, White, Male, 0, 1719, 20, United-States, <=50K +27, Private, 389713, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 206365, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +76, ?, 431192, 7th-8th, 4, Widowed, ?, Not-in-family, White, Male, 0, 0, 2, United-States, <=50K +19, ?, 241616, HS-grad, 9, Never-married, ?, Unmarried, White, Male, 0, 2001, 40, United-States, <=50K +66, Self-emp-inc, 150726, 9th, 5, Married-civ-spouse, Exec-managerial, Husband, White, Male, 1409, 0, 1, ?, <=50K +37, Private, 123785, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 75, United-States, <=50K +34, Private, 289984, HS-grad, 9, Divorced, Priv-house-serv, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +34, ?, 164309, 11th, 7, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 8, United-States, <=50K +90, Private, 137018, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 137994, Some-college, 10, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 341204, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 167005, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 60, United-States, >50K +24, Private, 34446, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 37, United-States, <=50K +28, Private, 187160, Prof-school, 15, Divorced, Prof-specialty, Unmarried, White, Male, 0, 0, 55, United-States, <=50K +64, ?, 
196288, Assoc-acdm, 12, Never-married, ?, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +23, Private, 217961, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 74631, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +36, Private, 156667, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 50, United-States, >50K +61, Private, 125155, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Self-emp-not-inc, 263925, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, Canada, >50K +30, Private, 296453, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7298, 0, 40, United-States, >50K +52, Self-emp-not-inc, 44728, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K +38, Private, 193026, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Iran, <=50K +32, Private, 87643, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Self-emp-not-inc, 106742, 12th, 8, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 75, United-States, <=50K +41, Private, 302122, Assoc-voc, 11, Divorced, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Local-gov, 193960, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 40, United-States, >50K +45, Private, 185385, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 47, United-States, >50K +43, Self-emp-not-inc, 277647, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, <=50K +61, Private, 128848, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3471, 0, 40, United-States, <=50K +54, Private, 377701, 
HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 32, Mexico, <=50K +34, Private, 157886, Assoc-acdm, 12, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 175958, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 80, United-States, >50K +38, Private, 223004, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 199352, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 80, United-States, >50K +36, Private, 29984, 12th, 8, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 181651, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +36, Private, 117312, Assoc-acdm, 12, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +22, Local-gov, 34029, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 20, United-States, <=50K +38, Private, 132879, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1902, 40, United-States, >50K +37, Private, 215310, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +48, State-gov, 55863, Doctorate, 16, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1902, 46, United-States, >50K +17, Private, 220384, 11th, 7, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 15, United-States, <=50K +19, Self-emp-not-inc, 36012, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 20, United-States, <=50K +27, Private, 137645, Bachelors, 13, Never-married, Sales, Not-in-family, Black, Female, 0, 1590, 40, United-States, <=50K +22, Private, 191342, Bachelors, 13, Never-married, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 50, Taiwan, <=50K +49, Private, 31339, HS-grad, 9, Married-civ-spouse, 
Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, State-gov, 227910, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +43, Private, 173728, Bachelors, 13, Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +19, Local-gov, 167816, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +58, Self-emp-not-inc, 81642, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +41, Local-gov, 195258, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 232475, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 241259, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 118161, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 201954, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +42, Private, 150533, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 52, United-States, >50K +38, Private, 412296, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 28, United-States, <=50K +41, Federal-gov, 133060, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Self-emp-not-inc, 120539, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +31, Private, 196025, Doctorate, 16, Married-spouse-absent, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 60, China, <=50K +34, Private, 107793, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 163870, 
Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Self-emp-not-inc, 361280, Bachelors, 13, Never-married, Prof-specialty, Own-child, Asian-Pac-Islander, Male, 0, 0, 20, India, <=50K +62, Private, 92178, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, ?, 80710, HS-grad, 9, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Self-emp-inc, 260729, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 1977, 25, United-States, >50K +43, Private, 182254, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +68, ?, 140282, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, United-States, <=50K +45, Self-emp-inc, 149865, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 60, United-States, >50K +39, Self-emp-inc, 218184, 9th, 5, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1651, 40, Mexico, <=50K +41, Private, 118619, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 50, United-States, <=50K +34, Self-emp-not-inc, 196791, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 25, United-States, >50K +34, Local-gov, 167999, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 33, United-States, <=50K +31, Private, 51259, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 47, United-States, <=50K +29, Private, 131088, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 25, United-States, <=50K +41, Private, 118212, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 3103, 0, 40, United-States, >50K +41, Private, 293791, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, <=50K +35, Self-emp-inc, 289430, Masters, 14, 
Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, Mexico, >50K +33, Private, 35378, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 45, United-States, >50K +37, State-gov, 60227, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +69, Private, 168139, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 290763, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +60, Self-emp-inc, 226355, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 2415, 70, ?, >50K +36, Private, 51100, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 227644, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +58, Local-gov, 205267, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +53, Private, 288020, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Japan, <=50K +29, Private, 140863, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Federal-gov, 170915, HS-grad, 9, Divorced, Tech-support, Not-in-family, White, Female, 4865, 0, 40, United-States, <=50K +34, State-gov, 50178, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 38, United-States, <=50K +36, Private, 112497, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 95244, Some-college, 10, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +20, Private, 117606, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 89508, HS-grad, 9, Married-civ-spouse, Transport-moving, 
Husband, White, Male, 0, 0, 50, United-States, >50K +63, Federal-gov, 124244, HS-grad, 9, Widowed, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +41, Self-emp-not-inc, 154374, Some-college, 10, Divorced, Other-service, Unmarried, White, Male, 0, 0, 45, United-States, <=50K +28, Private, 294936, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 347132, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +34, ?, 181934, HS-grad, 9, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 316672, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +37, Private, 189382, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 38, United-States, <=50K +42, ?, 184018, Some-college, 10, Divorced, ?, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 184307, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, Jamaica, >50K +46, Self-emp-not-inc, 246212, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +35, Federal-gov, 250504, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 60, United-States, >50K +27, Private, 138705, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 53, United-States, <=50K +41, Private, 328447, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, Mexico, <=50K +19, Private, 194608, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +20, Private, 230891, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +59, Federal-gov, 212448, HS-grad, 9, Widowed, Sales, Unmarried, White, Female, 0, 0, 40, Germany, <=50K +40, Private, 214010, 
Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 37, United-States, <=50K +56, Self-emp-not-inc, 200235, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +33, Private, 354573, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 44, United-States, >50K +30, Self-emp-inc, 205733, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +46, Private, 185041, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +61, Self-emp-inc, 84409, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +50, Self-emp-inc, 293196, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +25, Private, 241626, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +40, Private, 520586, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 39, United-States, <=50K +24, ?, 35633, Some-college, 10, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, ?, <=50K +51, Private, 302847, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 54, United-States, <=50K +43, State-gov, 165309, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 117529, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 54, Mexico, <=50K +46, Private, 106092, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +28, State-gov, 445824, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +26, Private, 227332, Bachelors, 13, Never-married, Transport-moving, Unmarried, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +20, Private, 275691, HS-grad, 9, 
Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 28, United-States, <=50K +44, Private, 193459, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 3411, 0, 40, United-States, <=50K +51, Private, 284329, HS-grad, 9, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 114691, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +54, Private, 96062, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Private, 133963, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1977, 40, United-States, >50K +33, Private, 178506, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +65, Private, 350498, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 10605, 0, 20, United-States, >50K +22, ?, 131573, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 8, United-States, <=50K +88, Self-emp-not-inc, 206291, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 182302, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 241346, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +50, Private, 157043, 11th, 7, Divorced, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +25, Private, 404616, Masters, 14, Married-civ-spouse, Farming-fishing, Not-in-family, White, Male, 0, 0, 99, United-States, >50K +20, Private, 411862, Assoc-voc, 11, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +47, Private, 183013, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, ?, 169982, Some-college, 10, Married-civ-spouse, ?, Husband, White, 
Male, 0, 0, 40, United-States, >50K +22, Private, 188544, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K +50, State-gov, 356619, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 48, United-States, >50K +47, Private, 45857, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Local-gov, 289886, 11th, 7, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 45, United-States, <=50K +50, ?, 146015, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 216237, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, >50K +36, Private, 416745, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 202952, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 167725, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, ?, 165637, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Federal-gov, 43280, Some-college, 10, Never-married, Exec-managerial, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +65, Private, 118779, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, United-States, <=50K +24, State-gov, 191269, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 65, United-States, <=50K +27, Local-gov, 247507, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 35, United-States, <=50K +51, Private, 239155, Assoc-voc, 11, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 182862, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +39, 
Private, 33886, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +28, Private, 444304, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 187161, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +49, Local-gov, 116892, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +51, Local-gov, 176813, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +59, Private, 151616, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, <=50K +18, Private, 240747, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, Dominican-Republic, <=50K +50, Private, 75472, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 4386, 0, 40, ?, <=50K +45, Federal-gov, 320818, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 80, United-States, >50K +30, Local-gov, 235271, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +37, Private, 166497, Bachelors, 13, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +44, Private, 344060, Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +33, Private, 221196, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +61, Self-emp-inc, 113544, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +61, Local-gov, 321117, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 79619, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 42, United-States, >50K +22, ?, 42004, 
Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 30, United-States, <=50K +36, Private, 135289, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +44, Self-emp-inc, 320984, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 5178, 0, 60, United-States, >50K +37, Private, 203070, Some-college, 10, Separated, Adm-clerical, Own-child, White, Male, 0, 0, 62, United-States, <=50K +31, Private, 32406, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +54, Private, 99185, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 45, United-States, >50K +20, Private, 205839, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 16, United-States, <=50K +63, ?, 150389, Bachelors, 13, Widowed, ?, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +48, Self-emp-not-inc, 243631, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 7688, 0, 40, United-States, >50K +33, ?, 163003, HS-grad, 9, Divorced, ?, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 41, China, <=50K +31, Private, 231263, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 4650, 0, 45, United-States, <=50K +38, Private, 200818, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Self-emp-not-inc, 247379, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +48, Private, 349151, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 22154, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 176317, HS-grad, 9, Widowed, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 22245, Masters, 14, Married-civ-spouse, 
Transport-moving, Husband, White, Male, 0, 0, 72, ?, >50K +29, Private, 236436, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 354078, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +42, Self-emp-not-inc, 166813, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +50, Private, 358740, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, England, <=50K +75, Self-emp-not-inc, 208426, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, <=50K +46, Private, 265266, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 40, United-States, >50K +52, Federal-gov, 31838, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 175034, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 413297, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +31, Private, 106347, 11th, 7, Separated, Other-service, Not-in-family, Black, Female, 0, 0, 42, United-States, <=50K +23, Private, 174754, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +34, Private, 441454, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +41, Self-emp-not-inc, 209344, HS-grad, 9, Married-civ-spouse, Sales, Other-relative, White, Female, 0, 0, 40, Cuba, <=50K +31, Private, 185732, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +42, Private, 65372, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 33975, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, 
Male, 0, 0, 45, United-States, >50K +55, Private, 326297, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +36, State-gov, 194630, HS-grad, 9, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +65, Self-emp-not-inc, 167414, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 59, United-States, >50K +38, Local-gov, 165799, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 12, United-States, <=50K +62, Private, 192866, Some-college, 10, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +54, Self-emp-inc, 166459, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 60, United-States, >50K +49, Private, 148995, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 190040, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 209432, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +51, Self-emp-inc, 229465, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +48, Self-emp-not-inc, 397466, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 283767, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, ?, <=50K +52, Federal-gov, 202452, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +28, Self-emp-not-inc, 218555, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 1762, 40, United-States, <=50K +29, Private, 128604, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +38, Private, 65466, Bachelors, 13, Never-married, 
Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +57, Private, 141326, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +43, Federal-gov, 369468, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +37, State-gov, 136137, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 236770, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 89534, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 48, United-States, >50K +69, ?, 195779, Assoc-voc, 11, Widowed, ?, Not-in-family, White, Female, 0, 0, 1, United-States, <=50K +73, Private, 29778, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 37, United-States, <=50K +22, Self-emp-inc, 153516, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 35, United-States, <=50K +31, Private, 163594, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +38, Private, 189623, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1887, 40, United-States, >50K +50, Self-emp-not-inc, 343748, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +37, Private, 387430, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 37, United-States, <=50K +44, Local-gov, 409505, Bachelors, 13, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 200734, Bachelors, 13, Never-married, Exec-managerial, Unmarried, Black, Female, 0, 0, 45, United-States, <=50K +27, Private, 115831, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 150296, Assoc-acdm, 12, Never-married, Other-service, 
Not-in-family, White, Female, 0, 0, 80, United-States, <=50K +25, Private, 323545, HS-grad, 9, Never-married, Tech-support, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +20, Private, 232577, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +51, Local-gov, 152754, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 129007, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1977, 40, United-States, >50K +67, Private, 171584, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 6514, 0, 7, United-States, >50K +47, Private, 386136, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 35, United-States, <=50K +42, Private, 342865, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 186785, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 1876, 50, United-States, <=50K +42, Federal-gov, 158926, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, >50K +65, ?, 36039, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 164019, Some-college, 10, Never-married, Farming-fishing, Own-child, Black, Male, 0, 0, 10, United-States, <=50K +50, Private, 88926, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 5178, 0, 40, United-States, >50K +46, Private, 188861, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 370119, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 50, United-States, >50K +57, Private, 182062, 10th, 6, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 37238, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, 
Male, 0, 0, 45, United-States, <=50K +50, Private, 421132, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, ?, 178660, 12th, 8, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, Self-emp-not-inc, 795830, 1st-4th, 2, Widowed, Other-service, Unmarried, White, Female, 0, 0, 30, El-Salvador, <=50K +39, Private, 278403, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 65, United-States, <=50K +46, Private, 279661, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 35, United-States, <=50K +36, Private, 113397, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 280093, 5th-6th, 3, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1628, 50, United-States, <=50K +21, Private, 236696, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 57, United-States, <=50K +41, Private, 265266, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Local-gov, 34935, Some-college, 10, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +22, Private, 58222, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Federal-gov, 301010, Some-college, 10, Never-married, Armed-Forces, Not-in-family, Black, Male, 0, 0, 60, United-States, <=50K +29, Private, 419721, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, Japan, <=50K +58, Self-emp-inc, 186791, Some-college, 10, Married-civ-spouse, Transport-moving, Wife, White, Female, 0, 0, 40, United-States, >50K +36, Self-emp-not-inc, 180686, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 209103, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 
48, United-States, <=50K +37, Private, 32668, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 43, United-States, >50K +29, Private, 256956, Assoc-voc, 11, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 202203, 5th-6th, 3, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, Mexico, <=50K +43, Private, 85995, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +49, Private, 125421, HS-grad, 9, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 40, United-States, >50K +45, Federal-gov, 283037, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 192932, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, ?, 244689, Some-college, 10, Never-married, ?, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +51, Private, 179646, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 509350, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, Canada, >50K +24, Private, 96279, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 119098, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7298, 0, 40, United-States, >50K +35, ?, 327120, Assoc-acdm, 12, Never-married, ?, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +41, State-gov, 144928, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +48, Private, 55237, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +61, Local-gov, 101265, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White, Female, 1471, 0, 35, United-States, <=50K +20, Private, 114874, 
Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 30, United-States, <=50K +27, Private, 190525, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +55, Private, 121912, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 24, United-States, >50K +39, Private, 83893, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +17, ?, 138507, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 20, United-States, <=50K +47, Private, 256522, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, ?, <=50K +52, Private, 168381, HS-grad, 9, Widowed, Other-service, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, India, >50K +24, Private, 293579, HS-grad, 9, Never-married, Sales, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +29, Private, 285290, 11th, 7, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +25, Private, 188488, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 324469, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 275244, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 35, United-States, <=50K +57, Private, 265099, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +51, Private, 146767, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 40681, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 3674, 0, 16, United-States, <=50K +39, Private, 174938, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 240124, HS-grad, 9, Married-civ-spouse, 
Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +71, Private, 269708, Bachelors, 13, Divorced, Tech-support, Own-child, White, Female, 2329, 0, 16, United-States, <=50K +38, State-gov, 34180, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, State-gov, 225904, Prof-school, 15, Never-married, Prof-specialty, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 89392, Masters, 14, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 46857, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +59, State-gov, 105363, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 195105, HS-grad, 9, Never-married, Sales, Not-in-family, Other, Male, 0, 0, 40, United-States, <=50K +35, Private, 184117, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +61, Self-emp-inc, 134768, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, Germany, >50K +17, ?, 145886, 11th, 7, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K +36, Private, 153078, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 60, ?, >50K +62, ?, 225652, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 3411, 0, 50, United-States, <=50K +34, Private, 467108, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +32, Self-emp-inc, 199765, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 50, United-States, >50K +42, Private, 173938, HS-grad, 9, Separated, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 191161, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, 
United-States, <=50K +58, Private, 132606, 5th-6th, 3, Divorced, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +61, Self-emp-not-inc, 30073, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 1848, 60, United-States, >50K +40, Private, 155190, 10th, 6, Never-married, Craft-repair, Other-relative, Black, Male, 0, 0, 55, United-States, <=50K +31, Private, 42900, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 37, United-States, <=50K +36, Private, 191161, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +23, Private, 181820, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 105974, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 41, United-States, <=50K +52, Private, 146378, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 103440, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +51, Private, 203435, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, Italy, <=50K +31, Federal-gov, 168312, Assoc-voc, 11, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Self-emp-inc, 257764, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 171301, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 40, United-States, <=50K +53, Federal-gov, 225339, Some-college, 10, Widowed, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 152234, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 99999, 0, 40, Japan, >50K +20, Private, 444554, 10th, 6, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K 
+26, Private, 403788, Assoc-acdm, 12, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +61, ?, 190997, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 6, United-States, <=50K +43, Private, 221550, Masters, 14, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 30, Poland, <=50K +46, Self-emp-inc, 98929, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, <=50K +43, Local-gov, 169203, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +41, Private, 102332, HS-grad, 9, Divorced, Sales, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +44, Self-emp-not-inc, 230684, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, <=50K +54, Private, 449257, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +65, Private, 198766, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 20051, 0, 40, United-States, >50K +32, Private, 97429, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, Canada, <=50K +25, Private, 208999, Some-college, 10, Separated, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 37072, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +25, Local-gov, 163101, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +19, Private, 119075, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 50, United-States, <=50K +37, Self-emp-not-inc, 137314, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +45, Private, 127303, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 45, United-States, <=50K +37, Private, 349116, HS-grad, 9, Never-married, 
Sales, Not-in-family, Black, Male, 0, 0, 44, United-States, <=50K +40, Self-emp-not-inc, 266324, Some-college, 10, Divorced, Exec-managerial, Other-relative, White, Male, 0, 1564, 70, Iran, >50K +19, ?, 194095, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 46496, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 5, United-States, <=50K +27, Private, 29904, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +40, Local-gov, 289403, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 1887, 40, ?, >50K +59, Private, 226922, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 1762, 30, United-States, <=50K +19, Federal-gov, 234151, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 238287, 10th, 6, Never-married, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +42, Private, 230624, 10th, 6, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K +54, Private, 398212, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 5013, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 114758, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 246519, 10th, 6, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 2105, 0, 45, United-States, <=50K +50, Private, 137815, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +40, Private, 260696, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, Private, 325007, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 25, United-States, <=50K +50, Private, 113176, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, 
<=50K +31, Private, 66815, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +42, ?, 51795, HS-grad, 9, Divorced, ?, Unmarried, Black, Female, 0, 0, 32, United-States, <=50K +24, Private, 241523, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 45, United-States, >50K +30, Private, 30226, 11th, 7, Divorced, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +39, Local-gov, 352628, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 50, United-States, >50K +37, Private, 143912, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +33, Private, 130021, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 329778, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-inc, 196945, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 78, Thailand, <=50K +39, Private, 24342, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +53, Private, 34368, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +52, Self-emp-not-inc, 173839, 10th, 6, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +28, State-gov, 73211, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 20, United-States, <=50K +32, Private, 86723, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K +31, Private, 179186, Bachelors, 13, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 90, United-States, >50K +31, Private, 127610, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +47, Private, 115070, Some-college, 10, 
Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, ?, 172582, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 256202, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +40, Private, 202872, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 45, United-States, <=50K +41, Private, 184102, 11th, 7, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, Federal-gov, 130703, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +46, Private, 134727, 11th, 7, Divorced, Machine-op-inspct, Unmarried, Amer-Indian-Eskimo, Male, 0, 0, 43, Germany, <=50K +45, Self-emp-inc, 36228, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 4386, 0, 35, United-States, >50K +39, Private, 297847, 9th, 5, Married-civ-spouse, Other-service, Wife, Black, Female, 3411, 0, 34, United-States, <=50K +19, Private, 213644, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 173796, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1887, 40, United-States, >50K +49, Private, 147322, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, Peru, <=50K +59, Private, 296253, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 180871, Assoc-voc, 11, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +18, ?, 169882, Some-college, 10, Never-married, ?, Own-child, White, Female, 594, 0, 15, United-States, <=50K +35, State-gov, 211115, Some-college, 10, Never-married, Protective-serv, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +58, Self-emp-inc, 183870, 10th, 6, Married-civ-spouse, Transport-moving, Wife, 
White, Female, 0, 0, 40, United-States, <=50K +28, Private, 441620, Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 43, Mexico, <=50K +36, Federal-gov, 218542, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +41, Self-emp-not-inc, 141327, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +47, Private, 67716, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Self-emp-inc, 175339, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1672, 60, United-States, <=50K +61, ?, 347089, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 16, United-States, <=50K +36, Private, 336595, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +38, Private, 27997, Assoc-voc, 11, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, Self-emp-not-inc, 145574, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 1902, 60, United-States, >50K +50, Private, 30447, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +45, Self-emp-not-inc, 256866, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 5013, 0, 40, United-States, <=50K +44, Self-emp-not-inc, 120837, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 66, United-States, <=50K +51, Private, 185283, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +44, Self-emp-inc, 229466, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +25, Private, 298225, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +60, Private, 185749, 11th, 7, Widowed, Transport-moving, Unmarried, Black, Male, 0, 0, 
40, United-States, <=50K +17, ?, 333100, 10th, 6, Never-married, ?, Own-child, White, Male, 1055, 0, 30, United-States, <=50K +49, Self-emp-inc, 125892, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +46, Private, 563883, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 60, United-States, >50K +56, Private, 311249, HS-grad, 9, Widowed, Adm-clerical, Unmarried, Black, Female, 0, 0, 38, United-States, <=50K +25, Private, 221757, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 3325, 0, 45, United-States, <=50K +22, Private, 310152, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +76, ?, 211453, HS-grad, 9, Widowed, ?, Not-in-family, Black, Female, 0, 0, 2, United-States, <=50K +41, Self-emp-inc, 94113, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +48, Self-emp-inc, 192945, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +46, Private, 161508, 10th, 6, Never-married, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +30, Private, 177675, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +39, Private, 51100, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 100584, 10th, 6, Divorced, Craft-repair, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +70, Federal-gov, 163003, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 67728, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2051, 45, United-States, <=50K +49, Private, 101320, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 75, United-States, <=50K +24, Private, 
42706, Assoc-voc, 11, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +40, Private, 228535, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7298, 0, 36, United-States, >50K +61, Private, 120939, Prof-school, 15, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 5, United-States, >50K +25, Private, 98283, Bachelors, 13, Never-married, Prof-specialty, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 216481, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +69, State-gov, 208869, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 11, United-States, <=50K +22, Private, 207940, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 36, United-States, <=50K +47, Private, 34248, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 38, United-States, <=50K +38, Private, 83727, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +26, Private, 183077, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 197850, 11th, 7, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 24, United-States, <=50K +33, Self-emp-not-inc, 235271, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +43, Self-emp-not-inc, 35236, HS-grad, 9, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 255822, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +53, Self-emp-inc, 263925, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 99999, 0, 40, United-States, >50K +26, Private, 256263, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 
25, United-States, <=50K +43, Local-gov, 293535, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, >50K +31, Private, 209448, 10th, 6, Married-civ-spouse, Farming-fishing, Husband, White, Male, 2105, 0, 40, Mexico, <=50K +30, Private, 57651, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Male, 0, 2001, 42, United-States, <=50K +25, Private, 174592, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +57, Federal-gov, 278763, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 175232, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 60, United-States, >50K +32, Private, 402812, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, <=50K +26, Private, 101150, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 41, United-States, <=50K +45, Private, 103538, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +53, State-gov, 156877, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 15024, 0, 35, United-States, >50K +27, Private, 23940, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +28, Self-emp-inc, 210295, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +32, Private, 80058, 11th, 7, Divorced, Sales, Not-in-family, White, Male, 0, 0, 43, United-States, >50K +35, Private, 187119, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 1980, 65, United-States, <=50K +36, Self-emp-not-inc, 105021, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +19, Private, 225775, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, 
Self-emp-inc, 395831, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 80, United-States, >50K +49, Private, 50282, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 3325, 0, 45, United-States, <=50K +20, Private, 32732, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +64, Self-emp-inc, 179436, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 55, United-States, >50K +60, ?, 290593, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 123253, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 42, United-States, <=50K +58, State-gov, 48433, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 245317, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 431745, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 14, United-States, <=50K +42, State-gov, 436006, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +25, Private, 224943, Some-college, 10, Married-spouse-absent, Prof-specialty, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +30, Self-emp-not-inc, 167990, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 65, United-States, >50K +37, Self-emp-inc, 217054, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +66, Self-emp-not-inc, 298834, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +59, Self-emp-inc, 125000, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, England, >50K +44, Private, 123983, Bachelors, 13, Divorced, Other-service, Not-in-family, 
Asian-Pac-Islander, Male, 0, 0, 40, China, <=50K +46, Private, 155489, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 58, United-States, >50K +59, Private, 284834, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 2885, 0, 30, United-States, <=50K +25, Private, 212495, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 1340, 40, United-States, <=50K +17, Local-gov, 32124, 9th, 5, Never-married, Other-service, Own-child, Black, Male, 0, 0, 9, United-States, <=50K +47, Local-gov, 246891, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, State-gov, 141483, 9th, 5, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 31985, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 170800, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Female, 0, 0, 40, United-States, <=50K +26, Local-gov, 166295, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 2339, 55, United-States, <=50K +20, Private, 231286, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 15, United-States, <=50K +33, Private, 159322, HS-grad, 9, Divorced, Other-service, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 176026, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 118025, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 99999, 0, 50, United-States, >50K +37, Private, 26898, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 12, United-States, <=50K +47, Private, 232628, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +40, Private, 85995, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 
0, 0, 40, United-States, >50K +48, Private, 125421, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, >50K +49, Private, 245305, 10th, 6, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 42, United-States, >50K +50, Private, 73493, Some-college, 10, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 197058, Assoc-acdm, 12, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 122116, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 75742, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 214731, 10th, 6, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 265954, HS-grad, 9, Separated, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +26, State-gov, 197156, HS-grad, 9, Divorced, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K +62, Private, 162245, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1628, 70, United-States, <=50K +39, Local-gov, 203070, HS-grad, 9, Separated, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +59, Local-gov, 165695, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +69, ?, 473040, 5th-6th, 3, Divorced, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 168107, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 163494, 10th, 6, Never-married, Sales, Own-child, White, Male, 0, 0, 30, United-States, <=50K +38, Private, 180342, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 122381, 
Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1887, 50, United-States, >50K +27, Private, 148069, 10th, 6, Never-married, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 200973, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 130806, 10th, 6, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 24, United-States, <=50K +56, Private, 117148, 7th-8th, 4, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 213977, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +62, Private, 134768, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, ?, >50K +44, Private, 139338, 12th, 8, Divorced, Transport-moving, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +23, Private, 315877, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +41, Self-emp-not-inc, 195124, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, ?, <=50K +25, Private, 352057, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +21, Private, 236684, Some-college, 10, Never-married, Other-service, Other-relative, Black, Female, 0, 0, 8, United-States, <=50K +18, Private, 208447, 12th, 8, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 6, United-States, <=50K +45, Private, 149640, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +75, ?, 111177, Bachelors, 13, Widowed, ?, Not-in-family, White, Female, 25124, 0, 16, United-States, >50K +51, Private, 154342, 7th-8th, 4, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Federal-gov, 141459, HS-grad, 9, Separated, Other-service, 
Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +47, Private, 111797, Some-college, 10, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 35, Outlying-US(Guam-USVI-etc), <=50K +29, Private, 111900, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 78707, 11th, 7, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +43, Local-gov, 160574, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, ?, 174714, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 16, United-States, <=50K +19, ?, 62534, Bachelors, 13, Never-married, ?, Own-child, Black, Female, 0, 0, 40, Jamaica, <=50K +44, Private, 216907, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1848, 40, United-States, >50K +24, Private, 198148, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +19, Private, 124265, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +49, ?, 261059, 10th, 6, Separated, ?, Own-child, White, Male, 2176, 0, 40, United-States, <=50K +52, Private, 208137, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 257250, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K +24, State-gov, 147253, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +32, Local-gov, 244268, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +72, ?, 213255, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 8, United-States, <=50K +26, Private, 266912, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +31, 
Private, 169104, Bachelors, 13, Never-married, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +29, Private, 200511, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +39, Private, 128715, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Male, 10520, 0, 40, United-States, >50K +48, Self-emp-not-inc, 65535, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 103395, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +51, Private, 71046, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 45, Scotland, <=50K +28, Self-emp-not-inc, 125442, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 169188, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 20, United-States, <=50K +23, Private, 121471, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +65, Private, 207281, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 16, United-States, <=50K +26, Local-gov, 46097, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, ?, 206671, Some-college, 10, Never-married, ?, Own-child, White, Male, 1055, 0, 50, United-States, <=50K +55, Private, 98361, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, ?, >50K +38, Self-emp-not-inc, 322143, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 10, United-States, <=50K +33, Private, 149184, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 60, United-States, >50K +33, Local-gov, 119829, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 60, United-States, <=50K +37, Private, 910398, Bachelors, 13, 
Never-married, Sales, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +19, Private, 176570, 11th, 7, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 60, United-States, <=50K +24, Private, 216129, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 27207, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +57, State-gov, 68830, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +22, State-gov, 178818, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 20, United-States, <=50K +57, Private, 236944, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, >50K +46, State-gov, 273771, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +67, Private, 318533, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 35, United-States, <=50K +35, ?, 451940, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +47, Private, 102318, HS-grad, 9, Separated, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 379350, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 21095, Some-college, 10, Divorced, Other-service, Unmarried, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +58, Self-emp-not-inc, 211547, 12th, 8, Divorced, Sales, Not-in-family, White, Female, 0, 0, 52, United-States, <=50K +36, Private, 85272, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 30, United-States, >50K +45, Private, 46406, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 36, England, >50K +54, Private, 53833, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, 
United-States, >50K +26, Private, 161007, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 30, United-States, <=50K +60, Private, 53707, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 370119, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 60, United-States, >50K +26, Private, 310907, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 35, United-States, <=50K +32, Private, 375833, 11th, 7, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +38, Local-gov, 107513, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Self-emp-not-inc, 58683, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, >50K +44, Self-emp-not-inc, 179557, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 45, United-States, >50K +37, Private, 70240, HS-grad, 9, Never-married, Other-service, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +44, Private, 147206, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 175548, HS-grad, 9, Never-married, Other-service, Not-in-family, Other, Female, 0, 0, 35, United-States, <=50K +61, Self-emp-not-inc, 163174, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +51, Private, 126010, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 147876, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 15024, 0, 60, United-States, >50K +45, Private, 428350, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 1740, 40, United-States, <=50K +36, ?, 200904, Assoc-acdm, 12, Married-civ-spouse, ?, Wife, Black, Female, 0, 
0, 21, Haiti, <=50K +39, Private, 328466, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 2407, 0, 70, Mexico, <=50K +67, Local-gov, 258973, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +40, State-gov, 345969, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 127796, 5th-6th, 3, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 35, Mexico, <=50K +37, Private, 405723, 1st-4th, 2, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +57, Private, 175942, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +27, Private, 284196, 10th, 6, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 89718, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 2202, 0, 48, United-States, <=50K +34, Self-emp-inc, 175761, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +54, Private, 206369, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 5178, 0, 50, United-States, >50K +52, Private, 158993, HS-grad, 9, Divorced, Other-service, Other-relative, Black, Female, 0, 0, 38, United-States, <=50K +42, Private, 285066, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +48, Private, 126754, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 40, United-States, >50K +65, State-gov, 209280, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 6514, 0, 35, United-States, >50K +55, Self-emp-not-inc, 52888, Prof-school, 15, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 10, United-States, <=50K +71, Self-emp-inc, 133821, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 20, United-States, 
>50K +33, Private, 240763, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +30, Private, 39054, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 119272, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +59, Private, 143372, 10th, 6, Divorced, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +19, Private, 323421, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +36, Self-emp-not-inc, 136028, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, United-States, <=50K +26, Self-emp-not-inc, 163189, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Local-gov, 202729, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 421871, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +44, Private, 120277, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, Italy, >50K +26, ?, 211798, HS-grad, 9, Separated, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 198901, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 48, United-States, <=50K +18, Private, 214617, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 16, United-States, <=50K +55, Self-emp-not-inc, 179715, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 18, United-States, <=50K +49, Local-gov, 107231, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2002, 40, United-States, <=50K +44, Private, 110355, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 184378, Bachelors, 13, 
Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +62, Private, 273454, 7th-8th, 4, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, Cuba, <=50K +44, Private, 443040, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +39, ?, 71701, HS-grad, 9, Divorced, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +50, Self-emp-inc, 160151, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +35, Private, 107991, 11th, 7, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +52, Private, 94391, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 99835, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +44, Private, 43711, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 40, United-States, >50K +43, Private, 83756, Some-college, 10, Never-married, Exec-managerial, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 120914, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 2961, 0, 40, United-States, <=50K +20, Private, 180052, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +47, Private, 170846, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, Italy, >50K +43, Private, 37937, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +64, ?, 168340, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, ?, >50K +24, Private, 38455, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Federal-gov, 128059, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 420895, 
HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 166744, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 12, United-States, <=50K +26, Private, 238768, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 60, United-States, <=50K +43, Private, 176270, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 99999, 0, 60, United-States, >50K +50, Private, 140592, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +20, Self-emp-not-inc, 211466, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 80, United-States, <=50K +37, Private, 188540, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 45, United-States, >50K +43, Private, 39581, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 45, United-States, <=50K +37, Private, 171150, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 50, United-States, >50K +53, Private, 117496, 9th, 5, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 36, Canada, <=50K +44, Private, 145160, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 28520, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 103851, 11th, 7, Never-married, Adm-clerical, Own-child, White, Female, 1055, 0, 20, United-States, <=50K +19, Private, 375077, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 50, United-States, <=50K +53, State-gov, 281590, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 15024, 0, 40, United-States, >50K +44, Private, 151504, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 415287, HS-grad, 9, Married-civ-spouse, 
Machine-op-inspct, Husband, Black, Male, 0, 1902, 40, United-States, >50K +49, Private, 32212, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 43, United-States, <=50K +35, Private, 123606, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 202565, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +54, Private, 177927, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +37, Private, 256723, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +18, Private, 46247, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +24, Private, 266926, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 112031, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +22, ?, 376277, Some-college, 10, Divorced, ?, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +35, Private, 168817, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 187487, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 48, United-States, >50K +32, ?, 158784, 7th-8th, 4, Widowed, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 67222, Bachelors, 13, Never-married, Machine-op-inspct, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 45, China, <=50K +43, Private, 201723, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 40, United-States, >50K +73, Private, 267408, HS-grad, 9, Widowed, Sales, Other-relative, White, Female, 0, 0, 15, United-States, <=50K +47, Federal-gov, 168191, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 
0, 40, ?, <=50K +49, Private, 105444, 12th, 8, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 39, United-States, <=50K +38, Private, 156728, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 148600, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +39, Private, 19914, Some-college, 10, Divorced, Adm-clerical, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +42, Private, 190767, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 233955, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 45, China, >50K +35, Private, 30381, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +38, Private, 187069, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +31, Private, 367314, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Local-gov, 101119, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 70, United-States, <=50K +38, Private, 86551, Bachelors, 13, Divorced, Sales, Not-in-family, White, Male, 0, 0, 48, United-States, >50K +40, Local-gov, 218995, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 42, United-States, >50K +21, Private, 57711, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +44, Private, 303521, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 199067, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 247445, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +49, Private, 186078, Masters, 
14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +31, Private, 77634, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 42, United-States, <=50K +24, Private, 180060, Masters, 14, Never-married, Exec-managerial, Own-child, White, Male, 6849, 0, 90, United-States, <=50K +46, Private, 56482, Some-college, 10, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +26, Private, 314177, HS-grad, 9, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 239755, Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +27, Private, 377680, Assoc-voc, 11, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +64, Self-emp-not-inc, 134960, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 35, United-States, >50K +26, Private, 294493, Bachelors, 13, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 32616, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 1719, 16, United-States, <=50K +45, Private, 182655, Bachelors, 13, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 45, ?, >50K +57, Local-gov, 52267, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 72, United-States, <=50K +30, Private, 117963, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 98881, 11th, 7, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 32, United-States, <=50K +50, Private, 196963, 7th-8th, 4, Divorced, Craft-repair, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +38, Private, 166988, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 193459, Some-college, 10, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +42, Private, 182342, Some-college, 10, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +32, Private, 496743, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 154781, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 219371, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +45, Private, 99179, 11th, 7, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 224910, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 304651, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +37, Private, 349689, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +60, Private, 106850, 10th, 6, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, Self-emp-not-inc, 196328, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 45, United-States, >50K +25, Private, 169323, Bachelors, 13, Married-civ-spouse, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 162924, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 60, Japan, <=50K +40, Self-emp-not-inc, 34037, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +51, ?, 167651, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 197384, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 10, United-States, <=50K +42, Private, 251795, Some-college, 10, 
Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +65, ?, 266081, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 165309, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 215873, 10th, 6, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 45, United-States, <=50K +46, Private, 133938, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Female, 27828, 0, 50, United-States, >50K +49, Private, 159816, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 99999, 0, 20, United-States, >50K +24, Private, 228424, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +32, Private, 195576, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +71, Private, 105200, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 6767, 0, 20, United-States, <=50K +26, Private, 167350, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 3103, 0, 40, United-States, >50K +29, Private, 52199, HS-grad, 9, Married-spouse-absent, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 171338, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 99999, 0, 50, United-States, >50K +51, Private, 120173, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 3103, 0, 50, United-States, >50K +17, ?, 158762, 10th, 6, Never-married, ?, Own-child, White, Female, 0, 0, 20, United-States, <=50K +49, Private, 169818, HS-grad, 9, Married-civ-spouse, Other-service, Wife, Black, Female, 0, 0, 40, United-States, >50K +31, Private, 288419, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 207546, 11th, 7, Married-civ-spouse, Machine-op-inspct, 
Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Local-gov, 147707, HS-grad, 9, Widowed, Farming-fishing, Unmarried, White, Male, 0, 2339, 40, United-States, <=50K +17, ?, 228373, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 30, United-States, <=50K +43, Private, 193882, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 40, United-States, >50K +38, Private, 31033, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K +37, Private, 272950, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 183523, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 238415, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +31, Private, 19302, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Male, 2202, 0, 38, United-States, <=50K +42, Local-gov, 339671, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 8614, 0, 45, United-States, >50K +35, Local-gov, 103260, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 35, United-States, >50K +39, Private, 79331, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 15024, 0, 40, United-States, >50K +40, Private, 135056, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +66, Private, 142723, 5th-6th, 3, Married-spouse-absent, Handlers-cleaners, Unmarried, White, Female, 0, 0, 40, Puerto-Rico, <=50K +30, Federal-gov, 188569, 9th, 5, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 57322, Assoc-acdm, 12, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 178309, 9th, 5, Never-married, Other-service, Unmarried, White, Female, 0, 0, 50, 
United-States, <=50K +45, Private, 166107, Masters, 14, Never-married, Adm-clerical, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +31, Private, 53042, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, Trinadad&Tobago, <=50K +33, Private, 155343, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 3103, 0, 40, United-States, >50K +32, Private, 35595, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 429507, Assoc-acdm, 12, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +50, Federal-gov, 159670, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +63, Private, 151210, 7th-8th, 4, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 186792, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 204640, Some-college, 10, Widowed, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 87205, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +38, Self-emp-inc, 112847, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +41, Private, 107306, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K +50, State-gov, 211319, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +59, Private, 183606, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 205390, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 49, United-States, <=50K +73, Local-gov, 232871, 7th-8th, 4, Married-civ-spouse, Protective-serv, Husband, White, Male, 2228, 0, 
10, United-States, <=50K +52, Self-emp-inc, 101017, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 38, United-States, <=50K +57, Private, 114495, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +35, Private, 183898, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 50, United-States, >50K +51, Private, 163921, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 56, United-States, >50K +22, Private, 311764, 11th, 7, Widowed, Sales, Own-child, Black, Female, 0, 0, 35, United-States, <=50K +49, Private, 188330, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 267174, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +46, Local-gov, 36228, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1902, 40, United-States, <=50K +48, Private, 199739, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 185407, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, ?, <=50K +43, State-gov, 206139, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, >50K +25, Private, 282063, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +31, Private, 332379, 7th-8th, 4, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 418324, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 36, United-States, <=50K +19, ?, 263338, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 45, United-States, <=50K +51, Private, 158948, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 84, United-States, >50K +51, Private, 221532, Bachelors, 13, Divorced, Exec-managerial, 
Unmarried, White, Male, 0, 0, 40, United-States, >50K +22, Self-emp-not-inc, 202920, HS-grad, 9, Never-married, Prof-specialty, Unmarried, White, Female, 99999, 0, 40, Dominican-Republic, >50K +37, Local-gov, 118909, HS-grad, 9, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +19, Private, 286469, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +45, Private, 191914, HS-grad, 9, Divorced, Transport-moving, Unmarried, White, Female, 0, 0, 55, United-States, <=50K +21, State-gov, 142766, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, United-States, <=50K +52, Private, 198744, HS-grad, 9, Widowed, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Local-gov, 272780, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 24, United-States, <=50K +42, State-gov, 219553, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +56, Private, 261232, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 64292, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 312131, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +70, Private, 30713, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +30, Private, 246439, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 338105, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +23, Private, 228243, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +34, Local-gov, 62463, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, 
White, Male, 0, 1579, 40, United-States, <=50K +38, Private, 31603, Bachelors, 13, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 165054, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 121618, 7th-8th, 4, Never-married, Transport-moving, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +45, Federal-gov, 273194, HS-grad, 9, Never-married, Transport-moving, Not-in-family, Black, Male, 3325, 0, 40, United-States, <=50K +21, ?, 163665, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 538319, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, Puerto-Rico, <=50K +34, Private, 238246, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-inc, 244665, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 5178, 0, 45, United-States, >50K +21, Private, 131811, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, ?, 231777, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, <=50K +23, Private, 156807, 9th, 5, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 36, United-States, <=50K +28, Private, 236861, Bachelors, 13, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +29, Self-emp-not-inc, 229842, HS-grad, 9, Never-married, Transport-moving, Unmarried, Black, Male, 0, 0, 45, United-States, <=50K +25, Local-gov, 190057, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +44, State-gov, 55076, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +18, Private, 152545, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 
0, 0, 8, United-States, <=50K +26, Private, 153434, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 24, United-States, <=50K +47, Local-gov, 171095, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, >50K +23, Private, 239322, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 138999, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, Local-gov, 95450, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 50, United-States, >50K +25, Private, 176520, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +38, Local-gov, 72338, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, Asian-Pac-Islander, Male, 0, 0, 54, United-States, >50K +60, ?, 386261, Bachelors, 13, Married-spouse-absent, ?, Unmarried, Black, Female, 0, 0, 15, United-States, <=50K +23, Private, 235722, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +36, Federal-gov, 128884, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 48, United-States, <=50K +46, Private, 187226, 9th, 5, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +32, Self-emp-not-inc, 298332, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +40, Private, 173607, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 226756, HS-grad, 9, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +31, Private, 157887, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, United-States, >50K +32, State-gov, 171111, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 37, 
United-States, <=50K +21, Private, 126314, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, United-States, <=50K +63, Private, 174018, Some-college, 10, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 40, United-States, >50K +44, Private, 144778, Some-college, 10, Separated, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +42, Self-emp-not-inc, 201522, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +23, ?, 22966, Bachelors, 13, Never-married, ?, Own-child, White, Male, 0, 0, 35, United-States, <=50K +30, Private, 399088, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +24, Private, 282202, HS-grad, 9, Never-married, Other-service, Unmarried, White, Male, 0, 0, 40, El-Salvador, <=50K +42, Private, 102606, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Self-emp-not-inc, 246862, HS-grad, 9, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, Italy, >50K +27, Federal-gov, 508336, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 48, United-States, <=50K +27, Local-gov, 263431, Some-college, 10, Never-married, Exec-managerial, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 235733, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +68, Private, 107910, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, Self-emp-not-inc, 184425, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 99, United-States, >50K +22, Self-emp-not-inc, 143062, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, Greece, <=50K +25, Private, 199545, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 15, United-States, <=50K 
+68, Self-emp-not-inc, 197015, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 45, United-States, <=50K +62, Private, 149617, Some-college, 10, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 16, United-States, <=50K +26, Private, 33610, HS-grad, 9, Divorced, Other-service, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 192002, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +68, Private, 67791, Some-college, 10, Widowed, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Local-gov, 445382, Bachelors, 13, Separated, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +45, Private, 112283, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +26, Private, 157249, 11th, 7, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 109872, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +23, Private, 119838, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 50, United-States, <=50K +29, Private, 149943, Some-college, 10, Never-married, Other-service, Not-in-family, Other, Male, 0, 1590, 40, ?, <=50K +65, Without-pay, 27012, 7th-8th, 4, Widowed, Farming-fishing, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +31, Private, 91666, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 270276, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 179271, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +44, Private, 161819, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Local-gov, 339681, 
Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 1506, 0, 45, United-States, <=50K +26, Self-emp-not-inc, 219897, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +26, Private, 91683, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 35, United-States, <=50K +36, Private, 188834, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 187046, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +39, Private, 191807, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 48, United-States, <=50K +52, Self-emp-inc, 179951, Prof-school, 15, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 324420, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, Mexico, <=50K +41, Self-emp-not-inc, 66632, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Local-gov, 121718, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 60, United-States, >50K +47, Private, 162034, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, <=50K +28, Local-gov, 218990, Assoc-voc, 11, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 46, United-States, <=50K +25, Local-gov, 125863, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 35, United-States, <=50K +35, Private, 225330, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 120426, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 119741, Masters, 14, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +44, Private, 32000, Some-college, 10, 
Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 18, United-States, >50K +21, ?, 124242, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 278581, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 230224, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +30, Private, 204374, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 1741, 48, United-States, <=50K +45, Private, 188386, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 1628, 45, United-States, <=50K +20, Private, 164922, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 195176, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +43, Private, 166740, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +50, ?, 156008, 11th, 7, Married-civ-spouse, ?, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +28, Private, 162551, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Other-relative, Asian-Pac-Islander, Female, 0, 0, 48, China, <=50K +25, Private, 211231, HS-grad, 9, Married-civ-spouse, Tech-support, Other-relative, White, Female, 0, 0, 48, United-States, >50K +25, Private, 169990, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +90, Private, 221832, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +38, Local-gov, 255454, Bachelors, 13, Separated, Prof-specialty, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 28160, Bachelors, 13, Married-spouse-absent, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +50, State-gov, 159219, Bachelors, 13, Married-civ-spouse, Exec-managerial, 
Husband, White, Male, 0, 0, 40, Canada, >50K +26, Local-gov, 103148, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 165186, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 45, United-States, <=50K +56, Private, 31782, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Local-gov, 249101, HS-grad, 9, Divorced, Protective-serv, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 243190, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 7688, 0, 40, United-States, >50K +18, Local-gov, 153405, 11th, 7, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 25, United-States, <=50K +37, Private, 329980, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2415, 60, United-States, >50K +57, Private, 176079, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, State-gov, 218542, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +29, State-gov, 303446, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 25, Nicaragua, <=50K +40, Private, 102606, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Self-emp-not-inc, 483201, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +77, Local-gov, 144608, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 6, United-States, <=50K +30, Private, 226013, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +21, Private, 165475, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, Private, 263637, 10th, 6, Married-civ-spouse, Protective-serv, Husband, White, Male, 
0, 0, 40, United-States, <=50K +40, Private, 201495, 11th, 7, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 35, United-States, <=50K +68, Private, 213720, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +64, Private, 170483, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +26, Private, 214303, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +32, Private, 190511, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Private, 242150, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 38, United-States, <=50K +51, Local-gov, 159755, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Private, 147629, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 45, United-States, >50K +49, Private, 268022, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +28, Private, 188711, Bachelors, 13, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +29, Private, 452205, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +21, Private, 260847, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 30, United-States, <=50K +28, Private, 291374, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +55, Private, 189933, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Self-emp-not-inc, 133969, HS-grad, 9, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 50, South, >50K +35, Private, 330664, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, ?, 672412, 11th, 7, 
Separated, ?, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +26, Private, 122999, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 8614, 0, 40, United-States, >50K +30, Private, 111415, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 55, Germany, <=50K +33, Private, 217235, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 20, United-States, <=50K +40, Private, 121956, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 13550, 0, 40, Cambodia, >50K +23, Private, 120172, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 343403, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Self-emp-not-inc, 104790, 11th, 7, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, >50K +39, Local-gov, 473547, 10th, 6, Divorced, Handlers-cleaners, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +53, Local-gov, 260106, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +49, Federal-gov, 168232, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 348491, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +36, Private, 24106, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3103, 0, 40, United-States, >50K +60, Self-emp-inc, 197553, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 50, United-States, >50K +29, Private, 421065, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +54, Self-emp-inc, 138852, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +28, ?, 169631, Assoc-acdm, 12, 
Married-AF-spouse, ?, Wife, White, Female, 0, 0, 3, United-States, <=50K +34, Private, 379412, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 181992, Some-college, 10, Never-married, Sales, Not-in-family, Black, Female, 0, 0, 35, United-States, <=50K +19, Private, 365640, HS-grad, 9, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 45, ?, <=50K +26, Private, 236564, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 363418, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 70, United-States, >50K +50, Private, 112351, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 38, United-States, <=50K +30, Private, 204704, Bachelors, 13, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, >50K +44, Private, 54611, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +49, Private, 128132, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +75, Self-emp-not-inc, 30599, Masters, 14, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +37, Private, 379522, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K +51, State-gov, 196504, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 38, United-States, <=50K +35, Private, 82552, HS-grad, 9, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 35, United-States, <=50K +28, Private, 104024, Some-college, 10, Never-married, Sales, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +66, Self-emp-not-inc, 293114, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 1409, 0, 40, United-States, <=50K +72, Private, 74141, 9th, 5, Married-civ-spouse, Exec-managerial, Wife, 
Asian-Pac-Islander, Female, 0, 0, 48, United-States, >50K +39, Private, 192337, Bachelors, 13, Separated, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 262478, HS-grad, 9, Never-married, Farming-fishing, Own-child, Black, Male, 0, 0, 30, United-States, <=50K +57, Private, 185072, Some-college, 10, Never-married, Adm-clerical, Other-relative, Black, Female, 0, 0, 40, Jamaica, <=50K +24, Private, 296045, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 2635, 0, 38, United-States, <=50K +28, Private, 246595, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 70, United-States, <=50K +23, Private, 54472, Some-college, 10, Married-spouse-absent, Other-service, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +31, Private, 331065, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 1408, 40, United-States, <=50K +23, Private, 161708, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +31, Private, 264936, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Local-gov, 113545, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 212237, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1740, 45, United-States, <=50K +31, Private, 170430, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 80, ?, <=50K +34, Private, 173806, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 4865, 0, 60, United-States, <=50K +57, Federal-gov, 370890, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 2258, 40, United-States, <=50K +39, Private, 505119, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, Cuba, >50K +23, Private, 193089, 11th, 7, Never-married, Craft-repair, 
Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Local-gov, 33432, Assoc-acdm, 12, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 103110, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, England, <=50K +32, Private, 160362, Some-college, 10, Divorced, Other-service, Other-relative, White, Male, 0, 0, 40, Nicaragua, <=50K +35, Private, 204621, Assoc-acdm, 12, Divorced, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 35309, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, ?, 154373, Bachelors, 13, Never-married, ?, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +47, Private, 194772, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 154410, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Federal-gov, 220563, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, State-gov, 253354, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 211699, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 1485, 40, United-States, >50K +63, Self-emp-not-inc, 167501, Some-college, 10, Married-civ-spouse, Prof-specialty, Wife, White, Female, 20051, 0, 10, United-States, >50K +34, Private, 229732, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 185465, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +27, Private, 335764, 11th, 7, Married-civ-spouse, Sales, Own-child, Black, Male, 0, 0, 35, United-States, <=50K +23, Private, 460046, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, 
Female, 0, 0, 42, United-States, <=50K +19, ?, 33487, Some-college, 10, Never-married, ?, Other-relative, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +50, Private, 176924, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 38, United-States, <=50K +49, State-gov, 213307, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 83893, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Private, 194102, Bachelors, 13, Never-married, Exec-managerial, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 238611, 7th-8th, 4, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 38, United-States, <=50K +41, Private, 113597, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 16, United-States, <=50K +27, Self-emp-not-inc, 208406, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +53, Private, 274528, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +17, Self-emp-not-inc, 60116, 10th, 6, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 10, United-States, <=50K +23, ?, 196816, HS-grad, 9, Never-married, ?, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +53, Private, 166368, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 303954, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1848, 42, United-States, >50K +24, Private, 99386, Bachelors, 13, Married-spouse-absent, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 188569, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +53, Private, 302868, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 
0, 50, United-States, >50K +18, Private, 283342, 11th, 7, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 20, United-States, <=50K +24, Private, 233777, Some-college, 10, Never-married, Sales, Unmarried, White, Male, 0, 0, 50, Mexico, <=50K +20, Private, 170038, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Local-gov, 261319, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, State-gov, 367237, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Male, 8614, 0, 40, United-States, >50K +34, Private, 126838, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 354104, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 176321, 12th, 8, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, Mexico, <=50K +47, Private, 85129, HS-grad, 9, Divorced, Other-service, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +20, ?, 376474, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 32, United-States, <=50K +22, Private, 62507, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +42, Local-gov, 111252, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 40, United-States, >50K +60, Private, 156889, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 549430, HS-grad, 9, Never-married, Priv-house-serv, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +46, Private, 29696, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +66, Private, 98837, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 86150, Bachelors, 13, 
Married-civ-spouse, Other-service, Wife, Asian-Pac-Islander, Female, 0, 0, 30, United-States, >50K +34, Private, 204991, Some-college, 10, Divorced, Exec-managerial, Own-child, White, Male, 0, 0, 44, United-States, <=50K +45, Private, 371886, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 46, United-States, <=50K +35, Private, 103605, 11th, 7, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, ?, 54851, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Local-gov, 133050, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, United-States, >50K +36, Local-gov, 126569, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Federal-gov, 144259, HS-grad, 9, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +51, Private, 161482, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +25, Self-emp-not-inc, 305449, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 125010, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 45, United-States, <=50K +47, Private, 304133, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +59, Local-gov, 120617, HS-grad, 9, Separated, Protective-serv, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +34, Private, 157747, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 297396, Some-college, 10, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 60, United-States, <=50K +42, Private, 121287, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +28, ?, 308493, HS-grad, 9, 
Married-civ-spouse, ?, Wife, White, Female, 0, 0, 17, Honduras, <=50K +37, Private, 49115, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +51, Self-emp-inc, 208302, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 38, United-States, >50K +25, Private, 304032, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 36, United-States, <=50K +31, Federal-gov, 207301, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +37, Private, 123211, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 44, United-States, >50K +42, Private, 33521, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +29, ?, 410351, Bachelors, 13, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 410034, Some-college, 10, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 175339, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 47, United-States, >50K +22, ?, 27937, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 36, United-States, <=50K +49, Private, 168211, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1485, 40, United-States, >50K +26, Private, 125680, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 16, Japan, <=50K +56, Local-gov, 160829, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 46, United-States, <=50K +52, Private, 266529, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +61, Self-emp-not-inc, 115023, Masters, 14, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 4, ?, <=50K +47, State-gov, 224149, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +52, 
Private, 150930, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 343699, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-inc, 172826, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 99999, 0, 55, United-States, >50K +35, Private, 163392, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +17, ?, 103810, 12th, 8, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 213821, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1887, 40, United-States, >50K +26, Private, 211265, Some-college, 10, Married-spouse-absent, Craft-repair, Other-relative, Black, Female, 0, 0, 35, Dominican-Republic, <=50K +58, Local-gov, 160586, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, Private, 146454, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 5556, 0, 40, United-States, >50K +30, Private, 203277, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 60, United-States, >50K +46, Private, 309895, Some-college, 10, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 26522, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 1902, 35, United-States, >50K +57, Private, 103809, HS-grad, 9, Never-married, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 90291, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +21, State-gov, 181761, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 10, United-States, <=50K +37, Private, 35330, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 1669, 55, United-States, <=50K +45, Local-gov, 135776, Masters, 14, 
Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +61, ?, 188172, Doctorate, 16, Widowed, ?, Not-in-family, White, Female, 0, 0, 5, United-States, <=50K +39, Private, 179579, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 193626, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 53, United-States, <=50K +20, Private, 108887, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 199070, HS-grad, 9, Never-married, Protective-serv, Own-child, Black, Male, 0, 0, 16, United-States, <=50K +25, Private, 441591, Bachelors, 13, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 185254, 5th-6th, 3, Never-married, Priv-house-serv, Own-child, White, Female, 0, 0, 40, El-Salvador, <=50K +24, Private, 109307, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 45, United-States, <=50K +20, ?, 81853, Some-college, 10, Never-married, ?, Own-child, Asian-Pac-Islander, Female, 0, 0, 15, United-States, <=50K +35, Private, 23621, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +44, Local-gov, 145178, HS-grad, 9, Married-civ-spouse, Other-service, Wife, Black, Female, 0, 0, 38, Jamaica, >50K +47, State-gov, 30575, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +28, State-gov, 130620, 11th, 7, Separated, Adm-clerical, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, India, <=50K +41, Local-gov, 22155, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 60, United-States, <=50K +31, Private, 106437, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 79787, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 25, United-States, 
<=50K +47, Private, 326857, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 5013, 0, 40, United-States, <=50K +44, Private, 81853, HS-grad, 9, Never-married, Sales, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +61, Private, 120933, Some-college, 10, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Federal-gov, 153143, Some-college, 10, Divorced, Adm-clerical, Other-relative, White, Female, 0, 0, 40, Puerto-Rico, <=50K +46, Private, 27669, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 28, United-States, <=50K +46, Private, 105444, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +54, Local-gov, 169785, Masters, 14, Widowed, Prof-specialty, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +49, Private, 122493, HS-grad, 9, Widowed, Tech-support, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +56, Local-gov, 242670, Some-college, 10, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 54933, Masters, 14, Divorced, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 209317, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, Puerto-Rico, <=50K +25, Self-emp-not-inc, 282631, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 98044, 11th, 7, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 187487, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, State-gov, 60186, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 75648, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, <=50K +28, 
Private, 201175, 11th, 7, Never-married, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +30, Private, 19302, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +21, ?, 300812, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 30, United-States, <=50K +44, Private, 146659, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1887, 35, United-States, >50K +75, Private, 101887, 10th, 6, Widowed, Priv-house-serv, Not-in-family, White, Female, 0, 0, 70, United-States, <=50K +66, ?, 117778, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 60726, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +33, Self-emp-inc, 201763, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +57, Self-emp-inc, 119253, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 65, United-States, >50K +47, Self-emp-not-inc, 121124, 5th-6th, 3, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, Italy, >50K +41, Private, 220132, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +21, Private, 60639, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 37, United-States, <=50K +17, Private, 195262, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 17, United-States, <=50K +61, ?, 113544, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 55, United-States, <=50K +47, ?, 331650, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 8, United-States, >50K +22, Private, 100587, Some-college, 10, Never-married, Other-service, Own-child, Black, Female, 0, 0, 15, United-States, <=50K +47, Private, 298130, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, 
<=50K +45, Private, 242391, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Self-emp-not-inc, 197867, Assoc-voc, 11, Divorced, Sales, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +59, Private, 151977, 10th, 6, Separated, Priv-house-serv, Not-in-family, Black, Female, 0, 0, 30, United-States, <=50K +38, Private, 277347, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +33, Private, 125249, HS-grad, 9, Separated, Protective-serv, Own-child, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 222142, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 270194, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 169995, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, <=50K +27, Private, 359155, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +60, Private, 123992, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +64, Local-gov, 266080, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +37, Private, 201531, Assoc-acdm, 12, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +54, Self-emp-not-inc, 179704, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +36, Private, 393673, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 50, United-States, >50K +34, Private, 244147, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, >50K +41, Self-emp-not-inc, 438696, Masters, 14, Divorced, Sales, Unmarried, White, Male, 0, 0, 5, United-States, >50K +35, Self-emp-not-inc, 207568, 7th-8th, 
4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 75, United-States, <=50K +63, Self-emp-inc, 54052, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 68, United-States, >50K +46, Private, 187581, HS-grad, 9, Divorced, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 77102, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 353010, Bachelors, 13, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +29, Private, 54131, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +74, Federal-gov, 39890, Some-college, 10, Widowed, Transport-moving, Not-in-family, White, Female, 0, 0, 18, United-States, <=50K +50, Private, 156877, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 70, United-States, >50K +22, Private, 355686, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 300168, 12th, 8, Separated, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +30, Private, 488720, 9th, 5, Married-civ-spouse, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +32, Private, 157287, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 184659, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 214169, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 192149, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +50, Private, 137253, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +44, Private, 373050, HS-grad, 9, Married-civ-spouse, 
Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +65, Private, 90377, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 6767, 0, 60, United-States, <=50K +28, Federal-gov, 183151, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 60, United-States, <=50K +55, Private, 227158, Bachelors, 13, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +49, Local-gov, 34021, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 50, United-States, <=50K +31, Private, 165148, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 12, United-States, <=50K +47, Private, 211668, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 40, United-States, >50K +45, Private, 358886, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 47707, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +34, Self-emp-not-inc, 306982, Bachelors, 13, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, <=50K +49, Local-gov, 52590, HS-grad, 9, Widowed, Protective-serv, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +39, ?, 179352, Some-college, 10, Divorced, ?, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +27, Private, 158156, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 42, United-States, <=50K +42, Private, 70055, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, ?, 131852, 5th-6th, 3, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, >50K +64, Self-emp-not-inc, 177825, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 1055, 0, 40, United-States, <=50K +33, Private, 127215, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 
0, 0, 48, United-States, >50K +23, Private, 175183, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 142287, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +34, Private, 221324, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +53, Private, 227602, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 37, Mexico, <=50K +22, Private, 228452, 10th, 6, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +57, State-gov, 39380, HS-grad, 9, Separated, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +20, ?, 96862, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 8, United-States, <=50K +23, Private, 336360, 7th-8th, 4, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +31, Private, 257644, 11th, 7, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +23, State-gov, 235853, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 22, United-States, <=50K +30, Private, 270577, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +32, Local-gov, 222900, Bachelors, 13, Separated, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +42, Private, 99254, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, >50K +51, Private, 224763, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, Cuba, <=50K +59, Self-emp-not-inc, 174056, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 40, United-States, >50K +36, Private, 127306, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 339506, HS-grad, 9, 
Never-married, Sales, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 178322, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, Germany, >50K +33, Private, 189843, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 160815, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +60, Private, 207665, HS-grad, 9, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 40, United-States, >50K +37, State-gov, 160402, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +35, Private, 170263, Some-college, 10, Never-married, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 184659, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 52, United-States, <=50K +38, Federal-gov, 338320, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +54, Private, 101017, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 204322, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 241350, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +63, Federal-gov, 217994, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 128143, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +58, Self-emp-not-inc, 164065, Masters, 14, Divorced, Sales, Not-in-family, White, Male, 0, 0, 18, United-States, <=50K +64, Local-gov, 78866, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 236769, Some-college, 10, Never-married, 
Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Federal-gov, 239539, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, >50K +39, Private, 34028, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +45, State-gov, 207847, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 175935, Doctorate, 16, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 55, United-States, >50K +22, Federal-gov, 218445, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +63, Self-emp-inc, 215833, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 156976, Assoc-voc, 11, Separated, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +42, Self-emp-not-inc, 220647, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 45, United-States, <=50K +20, Private, 218343, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +29, Private, 241431, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K +38, Local-gov, 123983, Bachelors, 13, Never-married, Exec-managerial, Unmarried, Asian-Pac-Islander, Male, 0, 1741, 40, Vietnam, <=50K +25, Private, 73289, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 408623, Bachelors, 13, Married-civ-spouse, Craft-repair, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +46, Private, 169180, Assoc-voc, 11, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 54929, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +24, Private, 306779, 
Assoc-voc, 11, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 35, United-States, <=50K +43, Private, 159549, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +23, Private, 482082, 12th, 8, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 21, Mexico, <=50K +32, Local-gov, 286101, HS-grad, 9, Never-married, Transport-moving, Unmarried, Black, Female, 0, 0, 37, United-States, <=50K +44, Private, 167955, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Poland, <=50K +40, Self-emp-not-inc, 209040, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 105017, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 27776, Assoc-voc, 11, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 242941, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 1602, 10, United-States, <=50K +41, Private, 118853, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 119565, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 196827, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 1902, 40, United-States, <=50K +47, Private, 275361, Assoc-acdm, 12, Widowed, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +42, Private, 225193, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 329783, 10th, 6, Never-married, Sales, Other-relative, White, Female, 0, 0, 10, United-States, <=50K +29, Local-gov, 107411, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 70, United-States, <=50K +21, State-gov, 258490, Some-college, 10, Never-married, 
Prof-specialty, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +18, ?, 120243, 11th, 7, Never-married, ?, Own-child, White, Male, 0, 0, 27, United-States, <=50K +31, Private, 219509, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, >50K +27, Local-gov, 29174, Bachelors, 13, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 40083, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, Canada, <=50K +23, Private, 87528, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +41, Private, 116379, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 55, Taiwan, >50K +46, Local-gov, 216214, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +34, Private, 268051, Some-college, 10, Married-civ-spouse, Protective-serv, Other-relative, Black, Female, 0, 0, 25, Haiti, <=50K +42, Self-emp-not-inc, 121718, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 24, United-States, <=50K +18, Private, 201901, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 1719, 15, United-States, <=50K +46, Private, 109089, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 37, United-States, <=50K +18, ?, 346382, 11th, 7, Never-married, ?, Own-child, White, Male, 0, 0, 15, United-States, <=50K +52, Private, 284129, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +56, Private, 143030, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 212619, Assoc-voc, 11, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Self-emp-not-inc, 199011, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 20, 
United-States, <=50K +31, Private, 118901, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +41, Self-emp-not-inc, 129865, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 60, United-States, <=50K +25, Private, 157900, Some-college, 10, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 349341, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +45, Private, 158685, HS-grad, 9, Separated, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 386585, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +90, Private, 52386, Some-college, 10, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 35, United-States, <=50K +45, Private, 246891, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 40, United-States, >50K +30, Private, 190385, Bachelors, 13, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, >50K +42, Private, 37869, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 217807, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 35, United-States, <=50K +53, Private, 149784, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, United-States, >50K +64, State-gov, 201293, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +56, Private, 128764, 7th-8th, 4, Widowed, Transport-moving, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +42, Private, 27444, Some-college, 10, Married-spouse-absent, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +26, Private, 62438, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, 
United-States, >50K +31, Local-gov, 151726, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +40, Private, 29841, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 131608, Some-college, 10, Widowed, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 110562, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Self-emp-inc, 190541, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 47, United-States, <=50K +62, State-gov, 33142, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +65, Self-emp-inc, 139272, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 99999, 0, 60, United-States, >50K +40, Private, 234633, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Local-gov, 238386, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 460835, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 55, United-States, <=50K +23, ?, 243190, Some-college, 10, Never-married, ?, Own-child, Asian-Pac-Islander, Male, 0, 0, 20, China, <=50K +63, Federal-gov, 97855, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +39, Private, 77146, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 50, United-States, >50K +37, Private, 200863, Some-college, 10, Widowed, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +25, ?, 41107, Bachelors, 13, Married-spouse-absent, ?, Not-in-family, White, Male, 0, 0, 40, Canada, <=50K +56, Private, 77415, 11th, 7, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, 
Private, 236770, Some-college, 10, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +53, Federal-gov, 173093, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, Asian-Pac-Islander, Female, 0, 1887, 40, Philippines, >50K +32, Private, 235124, Assoc-voc, 11, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Self-emp-not-inc, 282604, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 7688, 0, 60, United-States, >50K +35, Private, 199288, 11th, 7, Separated, Transport-moving, Not-in-family, White, Male, 0, 0, 90, United-States, <=50K +51, Private, 191659, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 65, United-States, >50K +19, Private, 43285, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +41, Private, 160837, 11th, 7, Married-spouse-absent, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, Guatemala, <=50K +22, Private, 230574, 10th, 6, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 25, United-States, <=50K +23, Private, 176178, HS-grad, 9, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 116358, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Taiwan, >50K +27, ?, 253873, Some-college, 10, Divorced, ?, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +45, Private, 107787, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Canada, <=50K +23, Self-emp-not-inc, 519627, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 25, Mexico, <=50K +21, Private, 191460, 11th, 7, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 198282, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 40, United-States, >50K +29, Private, 
214858, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Self-emp-not-inc, 64875, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 60, United-States, <=50K +18, Private, 675421, 9th, 5, Never-married, Handlers-cleaners, Own-child, White, Male, 594, 0, 40, United-States, <=50K +62, Self-emp-not-inc, 134768, 12th, 8, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Federal-gov, 207342, Some-college, 10, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +34, Private, 64830, Assoc-acdm, 12, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +31, Private, 220066, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 14344, 0, 50, United-States, >50K +37, Private, 82521, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 40, United-States, >50K +33, Private, 176711, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, England, <=50K +22, ?, 217421, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K +28, Private, 111900, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 30, United-States, <=50K +22, ?, 196943, Some-college, 10, Separated, ?, Own-child, White, Male, 0, 0, 25, United-States, <=50K +47, Private, 481987, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +67, ?, 184506, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 419, 3, United-States, <=50K +20, ?, 121313, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 158420, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, <=50K +26, Private, 256000, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 
99999, 0, 60, United-States, >50K +36, Private, 183892, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 44, United-States, >50K +28, Private, 42734, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 181773, HS-grad, 9, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +47, Private, 184945, Some-college, 10, Separated, Other-service, Not-in-family, Black, Female, 0, 0, 35, United-States, <=50K +33, Private, 107248, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +34, Self-emp-inc, 215382, Masters, 14, Separated, Prof-specialty, Not-in-family, White, Female, 4787, 0, 40, United-States, >50K +25, Private, 122999, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 758700, 9th, 5, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 3781, 0, 50, Mexico, <=50K +36, State-gov, 166606, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, Local-gov, 192060, Bachelors, 13, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 30, ?, <=50K +74, ?, 340939, 9th, 5, Married-civ-spouse, ?, Husband, White, Male, 3471, 0, 40, United-States, <=50K +57, Private, 205708, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Poland, <=50K +55, Private, 67450, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, England, <=50K +20, Private, 242077, HS-grad, 9, Divorced, Sales, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 129573, HS-grad, 9, Never-married, Sales, Not-in-family, Black, Female, 0, 0, 44, United-States, <=50K +54, Private, 181132, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, England, >50K +25, Private, 212302, Some-college, 10, 
Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 83411, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 1408, 40, United-States, <=50K +23, ?, 148751, Some-college, 10, Never-married, ?, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +17, Private, 317681, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 10, United-States, <=50K +39, ?, 103986, HS-grad, 9, Never-married, ?, Not-in-family, White, Male, 0, 1590, 40, United-States, <=50K +63, Private, 30602, 7th-8th, 4, Married-spouse-absent, Other-service, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +19, Private, 172893, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 30, United-States, <=50K +56, Self-emp-inc, 211804, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 50, United-States, >50K +33, Self-emp-not-inc, 312055, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +37, Private, 65390, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 200500, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +36, Local-gov, 241962, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Self-emp-inc, 78530, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, Canada, >50K +22, Private, 189950, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +35, Private, 111387, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 1579, 40, United-States, <=50K +20, Private, 241951, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +18, Private, 343059, Some-college, 10, 
Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 302465, 12th, 8, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 1741, 40, United-States, <=50K +53, Private, 156843, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1564, 54, United-States, >50K +21, ?, 79728, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 55284, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +34, Private, 509364, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +32, State-gov, 117927, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +20, Private, 137651, Some-college, 10, Never-married, Machine-op-inspct, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +70, Private, 131060, 7th-8th, 4, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 25, United-States, <=50K +57, Private, 346963, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +54, Private, 183611, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 3137, 0, 50, United-States, <=50K +34, Private, 134737, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 36503, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +42, Private, 250121, 11th, 7, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 330535, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +27, Private, 387776, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 41474, 10th, 6, Married-civ-spouse, 
Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +36, Local-gov, 318972, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 65, United-States, <=50K +33, Private, 86143, Some-college, 10, Never-married, Exec-managerial, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +50, Private, 181139, Some-college, 10, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 326232, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Male, 0, 2547, 50, United-States, >50K +39, Local-gov, 153976, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +55, Self-emp-not-inc, 59469, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 25, United-States, <=50K +24, Private, 127139, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 136343, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Self-emp-not-inc, 350624, 12th, 8, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, ?, 177351, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 2174, 40, United-States, >50K +68, Private, 166149, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 2206, 30, United-States, <=50K +29, Private, 121523, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +24, Self-emp-not-inc, 267396, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 83045, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 160449, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 44, United-States, >50K +55, Self-emp-inc, 124137, Prof-school, 15, Married-civ-spouse, Prof-specialty, 
Husband, White, Male, 0, 2415, 35, Greece, >50K +20, ?, 287681, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 36, United-States, <=50K +41, Private, 154194, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 295127, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 84, United-States, <=50K +60, Private, 240521, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +61, Self-emp-not-inc, 244087, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 52, United-States, >50K +35, Private, 356250, Prof-school, 15, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 35, China, <=50K +42, State-gov, 293791, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +26, Private, 44308, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Local-gov, 210527, Some-college, 10, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +31, State-gov, 151763, Masters, 14, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +39, State-gov, 267581, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +20, Private, 100188, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 24, United-States, <=50K +32, Self-emp-inc, 111746, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Self-emp-not-inc, 171091, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 355645, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 20, Trinadad&Tobago, <=50K +54, Local-gov, 137678, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, 
Female, 0, 0, 40, United-States, <=50K +23, Private, 70894, Assoc-acdm, 12, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 171306, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 3, United-States, <=50K +31, Private, 100997, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +35, Private, 63921, 5th-6th, 3, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +29, Private, 32897, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +29, Local-gov, 251854, HS-grad, 9, Never-married, Protective-serv, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +25, Private, 345121, 10th, 6, Separated, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +46, Private, 86220, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 172845, Assoc-voc, 11, Never-married, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +20, Private, 171398, 10th, 6, Never-married, Sales, Not-in-family, Other, Male, 0, 0, 40, United-States, <=50K +24, Self-emp-not-inc, 174391, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 207058, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 37, United-States, <=50K +37, Private, 291251, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 224377, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 105813, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Local-gov, 180916, Some-college, 10, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +31, 
Self-emp-not-inc, 122749, Assoc-voc, 11, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 20, United-States, <=50K +38, Private, 31069, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 4386, 0, 40, United-States, >50K +26, Self-emp-not-inc, 284343, Assoc-acdm, 12, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +64, Private, 319371, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 40, United-States, >50K +46, Private, 174224, Assoc-voc, 11, Divorced, Protective-serv, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +69, ?, 183958, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, United-States, <=50K +39, Private, 127772, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 3103, 0, 44, United-States, >50K +48, Private, 80651, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +46, Private, 62793, HS-grad, 9, Divorced, Sales, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 191712, Bachelors, 13, Divorced, Sales, Not-in-family, White, Male, 0, 1590, 40, United-States, <=50K +39, Self-emp-not-inc, 237532, HS-grad, 9, Married-civ-spouse, Sales, Wife, Black, Female, 0, 0, 54, Dominican-Republic, >50K +50, Federal-gov, 20179, Masters, 14, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +24, Private, 311376, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 432565, Assoc-voc, 11, Married-civ-spouse, Tech-support, Other-relative, White, Female, 0, 0, 40, Canada, >50K +39, Self-emp-inc, 329980, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 2415, 60, United-States, >50K +29, Self-emp-not-inc, 125190, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 342946, 11th, 7, 
Never-married, Transport-moving, Own-child, White, Female, 0, 0, 38, United-States, <=50K +21, ?, 219835, Assoc-voc, 11, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 123429, 10th, 6, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +69, Self-emp-inc, 69209, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3818, 0, 30, United-States, <=50K +55, Private, 66356, HS-grad, 9, Separated, Handlers-cleaners, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 195897, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +44, Self-emp-inc, 153132, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 52, United-States, >50K +18, Private, 230875, 11th, 7, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +74, Self-emp-not-inc, 92298, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 10, United-States, <=50K +40, Private, 185145, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 297296, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +75, ?, 164849, 9th, 5, Married-civ-spouse, ?, Husband, Black, Male, 1409, 0, 5, United-States, <=50K +55, Private, 145214, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +52, Self-emp-not-inc, 242341, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +54, Private, 240542, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +36, Private, 104772, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 48, United-States, <=50K +76, ?, 152802, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, 
United-States, <=50K +26, Private, 181666, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 415520, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 20, United-States, <=50K +38, Private, 258761, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +50, Private, 88842, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 7298, 0, 40, United-States, >50K +19, ?, 356717, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 25, United-States, <=50K +32, Private, 158438, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +57, Private, 206206, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 51816, HS-grad, 9, Never-married, Protective-serv, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +27, Private, 253814, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 161745, Bachelors, 13, Married-spouse-absent, Exec-managerial, Not-in-family, White, Male, 0, 1980, 60, United-States, <=50K +60, Private, 162947, 5th-6th, 3, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, Puerto-Rico, <=50K +52, Private, 163027, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +61, Private, 146788, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +57, Self-emp-not-inc, 73309, HS-grad, 9, Widowed, Craft-repair, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +19, ?, 143867, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +59, Self-emp-not-inc, 104216, Prof-school, 15, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 25, United-States, <=50K +34, 
Self-emp-not-inc, 345705, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +31, Private, 133770, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 50, United-States, >50K +42, Private, 209392, HS-grad, 9, Divorced, Protective-serv, Not-in-family, Black, Male, 0, 0, 35, United-States, <=50K +70, Private, 262345, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 6, United-States, <=50K +47, Private, 277545, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K +47, ?, 174525, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 3942, 0, 40, ?, <=50K +29, Private, 490332, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, >50K +27, Private, 211570, 11th, 7, Never-married, Handlers-cleaners, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +25, Private, 374918, 12th, 8, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 106728, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 60, United-States, >50K +28, Private, 173649, HS-grad, 9, Never-married, Other-service, Own-child, Black, Female, 0, 0, 40, ?, <=50K +35, Private, 174597, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 233533, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, >50K +54, ?, 169785, Masters, 14, Never-married, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 133169, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 198824, Assoc-voc, 11, Separated, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +65, Private, 174056, 7th-8th, 4, 
Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 188696, Assoc-voc, 11, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Local-gov, 90692, HS-grad, 9, Divorced, Prof-specialty, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +34, Private, 271933, Bachelors, 13, Never-married, Exec-managerial, Other-relative, White, Female, 0, 1741, 45, United-States, <=50K +47, Self-emp-not-inc, 102359, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 30, United-States, <=50K +49, Federal-gov, 213668, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 56, United-States, >50K +21, Private, 294789, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +20, Private, 157599, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +18, Local-gov, 134935, 12th, 8, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 466224, Some-college, 10, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +34, Self-emp-not-inc, 111985, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 264627, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 213427, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 279015, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +47, Private, 165937, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, >50K +27, Federal-gov, 188343, HS-grad, 9, Separated, Exec-managerial, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +63, 
Private, 158609, Assoc-voc, 11, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 8, United-States, <=50K +34, Private, 193036, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, >50K +25, Private, 198632, Some-college, 10, Married-spouse-absent, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +54, Private, 175912, HS-grad, 9, Widowed, Machine-op-inspct, Unmarried, White, Male, 914, 0, 40, United-States, <=50K +19, ?, 192773, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +35, Private, 101387, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +24, Private, 60783, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 70, United-States, >50K +26, Private, 183224, Some-college, 10, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 35, United-States, <=50K +59, Local-gov, 100776, Assoc-voc, 11, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 57600, Doctorate, 16, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, ?, <=50K +20, Private, 174063, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 306495, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 249741, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 93021, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Other, Female, 0, 0, 40, United-States, <=50K +36, Private, 49626, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 63062, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 60, United-States, <=50K +55, Private, 320835, Bachelors, 13, 
Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +22, Local-gov, 123727, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 21, United-States, <=50K +58, State-gov, 110517, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 4064, 0, 40, India, <=50K +43, Private, 149670, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 4064, 0, 15, United-States, <=50K +39, Private, 172425, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 20, United-States, >50K +40, Private, 216116, 9th, 5, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, Haiti, <=50K +46, Private, 174209, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +54, Federal-gov, 175083, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +19, Private, 129059, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 30, United-States, <=50K +24, Private, 121313, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, ?, 181317, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, State-gov, 166851, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 13, United-States, <=50K +29, Self-emp-not-inc, 29616, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 65, United-States, <=50K +56, Self-emp-inc, 105582, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 50, United-States, >50K +54, ?, 124993, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, >50K +21, ?, 148509, Some-college, 10, Never-married, ?, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +34, Private, 230246, 9th, 5, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 
40, ?, <=50K +56, Private, 117881, 11th, 7, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 203408, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 50, United-States, >50K +19, Private, 446219, 10th, 6, Never-married, Sales, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-inc, 110331, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 65, United-States, >50K +48, Private, 207946, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 52, United-States, <=50K +67, ?, 45537, Masters, 14, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 40, United-States, >50K +47, Private, 188330, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 25, United-States, <=50K +52, Private, 147629, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +40, Private, 153799, 1st-4th, 2, Married-spouse-absent, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, Dominican-Republic, <=50K +28, Private, 203776, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +41, Private, 168071, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 50, United-States, >50K +57, Private, 348430, 1st-4th, 2, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, Portugal, <=50K +51, Private, 103407, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, ?, 152046, 11th, 7, Never-married, ?, Not-in-family, White, Female, 0, 0, 35, Germany, <=50K +36, Private, 153205, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 45, ?, <=50K +33, Private, 326104, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +46, Private, 238162, Bachelors, 13, Married-civ-spouse, Craft-repair, 
Husband, White, Male, 0, 0, 45, United-States, >50K +50, Private, 221336, HS-grad, 9, Divorced, Adm-clerical, Other-relative, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +33, Private, 180656, Some-college, 10, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, ?, <=50K +77, Self-emp-not-inc, 145329, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 401, 0, 20, United-States, <=50K +39, Private, 315776, Masters, 14, Never-married, Exec-managerial, Not-in-family, Black, Male, 8614, 0, 52, United-States, >50K +67, ?, 150516, HS-grad, 9, Widowed, ?, Unmarried, White, Male, 0, 0, 3, United-States, <=50K +35, Private, 325802, Assoc-acdm, 12, Divorced, Handlers-cleaners, Unmarried, White, Female, 0, 0, 24, United-States, <=50K +23, Private, 133985, 10th, 6, Never-married, Craft-repair, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +37, Private, 269329, Assoc-voc, 11, Divorced, Prof-specialty, Not-in-family, White, Female, 8614, 0, 45, United-States, >50K +41, Private, 183203, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +60, Private, 76127, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 35, United-States, >50K +32, Private, 195891, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +56, Federal-gov, 162137, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +45, State-gov, 37672, Assoc-voc, 11, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 161708, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, >50K +18, Private, 80616, 10th, 6, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 27, United-States, <=50K +31, Private, 209276, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Other, Male, 
0, 0, 40, United-States, <=50K +21, ?, 34443, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 50, United-States, <=50K +45, Private, 192835, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 55, United-States, >50K +23, Private, 203240, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, State-gov, 102308, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 40829, 11th, 7, Never-married, Sales, Other-relative, Amer-Indian-Eskimo, Female, 0, 0, 25, United-States, <=50K +25, Private, 60726, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 30, United-States, <=50K +31, State-gov, 116677, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 57067, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 45, United-States, <=50K +41, Private, 304906, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +74, Private, 101590, Prof-school, 15, Widowed, Adm-clerical, Not-in-family, Black, Female, 0, 0, 20, United-States, <=50K +27, Private, 258102, 5th-6th, 3, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +23, Private, 241185, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 124827, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Self-emp-inc, 76625, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Federal-gov, 263339, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 135645, Masters, 14, Never-married, Sales, Not-in-family, White, Male, 0, 0, 20, United-States, 
<=50K +42, Private, 245626, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 0, 0, 60, United-States, <=50K +24, Private, 210781, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 235786, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +45, Self-emp-not-inc, 160167, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 70, United-States, <=50K +52, Federal-gov, 30731, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +34, Private, 314375, Assoc-voc, 11, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 81528, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 60, United-States, <=50K +54, Private, 182854, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +42, Federal-gov, 296798, 11th, 7, Never-married, Tech-support, Not-in-family, White, Male, 0, 1340, 40, United-States, <=50K +32, Private, 194426, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 15024, 0, 40, United-States, >50K +40, ?, 70645, Some-college, 10, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 20, United-States, <=50K +55, Self-emp-inc, 141807, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +66, ?, 112871, 11th, 7, Never-married, ?, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +52, State-gov, 71344, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, State-gov, 341410, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +33, Private, 118941, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +52, ?, 159755, 
Assoc-voc, 11, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, >50K +28, Private, 128509, 5th-6th, 3, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, ?, <=50K +27, Self-emp-not-inc, 229125, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Local-gov, 142756, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +27, Self-emp-inc, 243871, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 45, United-States, <=50K +47, Private, 213140, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 40, United-States, >50K +19, Private, 196857, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 138626, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Self-emp-not-inc, 161334, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 25, Nicaragua, <=50K +50, Private, 273536, 7th-8th, 4, Married-civ-spouse, Sales, Husband, Other, Male, 0, 0, 49, Dominican-Republic, <=50K +32, Private, 115631, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 4101, 0, 50, United-States, <=50K +28, Private, 185957, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 334357, HS-grad, 9, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 96102, Masters, 14, Married-spouse-absent, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +34, Private, 213226, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, Iran, >50K +19, Private, 115248, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +37, Private, 185061, 
Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 55, United-States, <=50K +27, Private, 147638, Bachelors, 13, Never-married, Adm-clerical, Other-relative, Asian-Pac-Islander, Female, 0, 0, 40, Hong, <=50K +18, Private, 280298, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 24, United-States, <=50K +31, Private, 163516, Some-college, 10, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 277434, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Federal-gov, 206983, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, Columbia, <=50K +48, Private, 108993, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +39, Private, 288551, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 176069, HS-grad, 9, Separated, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +48, State-gov, 183486, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 56, United-States, >50K +40, Private, 163215, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 10520, 0, 40, United-States, >50K +70, Private, 94692, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, >50K +20, Private, 118462, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 43, United-States, <=50K +38, Private, 407068, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 75, Mexico, <=50K +37, Self-emp-not-inc, 243587, Some-college, 10, Separated, Other-service, Own-child, White, Female, 0, 0, 40, Cuba, <=50K +49, Private, 23074, Some-college, 10, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +51, Private, 237735, HS-grad, 9, 
Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 3103, 0, 40, United-States, >50K +43, Private, 188291, 1st-4th, 2, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 284166, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +18, ?, 423460, 11th, 7, Never-married, ?, Own-child, White, Male, 0, 0, 36, United-States, <=50K +23, Private, 287681, 7th-8th, 4, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 25, Mexico, <=50K +34, Private, 509364, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +62, ?, 139391, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 24, United-States, <=50K +33, Private, 91964, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 117526, Some-college, 10, Never-married, Farming-fishing, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +64, Private, 91343, Some-college, 10, Widowed, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +26, Local-gov, 336969, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 28, El-Salvador, <=50K +55, Private, 255364, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, Local-gov, 167670, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 211494, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +78, Local-gov, 136198, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 15, United-States, <=50K +27, Federal-gov, 409815, Some-college, 10, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +49, Private, 188823, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, 
White, Male, 0, 0, 42, United-States, <=50K +55, State-gov, 146326, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 45, United-States, >50K +42, Private, 154374, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 58, United-States, <=50K +22, ?, 216563, HS-grad, 9, Never-married, ?, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +61, Private, 197286, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +64, Self-emp-not-inc, 100722, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 5, United-States, <=50K +46, Local-gov, 377622, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 145964, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 358636, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 2829, 0, 70, United-States, <=50K +47, Private, 155489, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 7688, 0, 55, United-States, >50K +18, Private, 57413, Some-college, 10, Divorced, Other-service, Own-child, White, Male, 0, 0, 15, United-States, <=50K +48, Private, 320421, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +50, Self-emp-not-inc, 174752, HS-grad, 9, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, State-gov, 229364, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +56, Self-emp-not-inc, 157486, 10th, 6, Divorced, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 92682, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 4865, 0, 40, United-States, <=50K +56, Federal-gov, 101338, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, 
United-States, <=50K +18, Private, 132652, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +21, Private, 34616, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +40, Private, 218903, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +27, Local-gov, 204098, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +52, Self-emp-not-inc, 64045, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 45, United-States, >50K +46, Private, 189763, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +23, Private, 26248, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +50, Private, 92079, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, >50K +19, Private, 280071, Some-college, 10, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 224059, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 185520, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 8614, 0, 40, United-States, >50K +24, Private, 265567, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 35, United-States, <=50K +72, Private, 106890, Assoc-voc, 11, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, State-gov, 39586, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 20, United-States, >50K +42, Private, 153132, Bachelors, 13, Divorced, Sales, Unmarried, White, Male, 0, 0, 45, ?, <=50K +51, Private, 209912, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 50, United-States, <=50K +39, 
Private, 144169, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +40, Local-gov, 50442, Some-college, 10, Never-married, Adm-clerical, Unmarried, Amer-Indian-Eskimo, Female, 2977, 0, 35, United-States, <=50K +34, Private, 89644, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K +19, Private, 275889, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +26, Private, 231638, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +45, Local-gov, 224474, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 4934, 0, 50, United-States, >50K +28, Private, 355259, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +30, Federal-gov, 68330, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +32, Private, 185410, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 87653, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +21, Private, 286853, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, <=50K +54, Private, 96710, HS-grad, 9, Married-civ-spouse, Priv-house-serv, Other-relative, Black, Female, 0, 0, 20, United-States, <=50K +62, Private, 160143, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 30, United-States, >50K +25, Private, 186925, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 2597, 0, 48, United-States, <=50K +49, Self-emp-inc, 109705, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +32, Private, 94235, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +26, 
Private, 225279, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 1602, 40, ?, <=50K +37, Local-gov, 297449, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +58, Private, 205896, HS-grad, 9, Divorced, Sales, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 93717, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 7298, 0, 45, United-States, >50K +41, Private, 194710, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 236391, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +47, State-gov, 189123, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 358677, HS-grad, 9, Divorced, Other-service, Unmarried, Black, Male, 0, 0, 35, United-States, <=50K +30, State-gov, 199539, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1902, 40, United-States, <=50K +43, Private, 128170, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 40, United-States, >50K +34, Private, 231238, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +57, Private, 296152, Some-college, 10, Divorced, Exec-managerial, Other-relative, White, Female, 594, 0, 10, United-States, <=50K +46, Private, 166003, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 281437, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, <=50K +20, Private, 190231, 9th, 5, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 11, Nicaragua, <=50K +47, Private, 122026, Assoc-voc, 11, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +55, ?, 205527, HS-grad, 9, 
Divorced, ?, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +53, Self-emp-not-inc, 174102, 7th-8th, 4, Married-civ-spouse, Exec-managerial, Husband, White, Male, 4386, 0, 50, Greece, >50K +43, Private, 125461, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 65, United-States, >50K +80, Self-emp-not-inc, 184335, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +24, Private, 211345, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, Mexico, <=50K +43, Local-gov, 147328, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 60, United-States, >50K +22, Private, 222993, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 225978, Some-college, 10, Separated, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +48, Private, 121124, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +56, ?, 656036, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 25, United-States, <=50K +34, ?, 346762, 11th, 7, Divorced, ?, Own-child, White, Male, 0, 0, 84, United-States, <=50K +51, Private, 234057, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +24, Federal-gov, 306515, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 116562, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 171159, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +24, Private, 199011, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 443508, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 48, Canada, >50K 
+24, Private, 29810, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +22, Local-gov, 238831, Some-college, 10, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, Federal-gov, 566117, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 255044, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +20, Private, 436253, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, <=50K +31, Private, 300687, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +55, Private, 144071, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 18, United-States, >50K +49, State-gov, 133917, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 1902, 60, ?, >50K +26, Private, 188767, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Self-emp-not-inc, 300777, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +35, Private, 26987, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 174395, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 60, Greece, <=50K +59, Private, 90290, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 34, United-States, <=50K +61, Private, 183735, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +31, Private, 123273, HS-grad, 9, Never-married, Sales, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +43, Federal-gov, 186916, Masters, 14, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 60, United-States, >50K +61, Private, 43554, 5th-6th, 3, Never-married, 
Handlers-cleaners, Not-in-family, Black, Male, 0, 2339, 40, United-States, <=50K +54, Private, 178251, Assoc-acdm, 12, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +30, Private, 255885, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 64292, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +27, State-gov, 194773, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, Germany, <=50K +44, Self-emp-inc, 133060, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +64, Private, 258006, Some-college, 10, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, Cuba, <=50K +55, Private, 92215, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 33945, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 6849, 0, 55, United-States, <=50K +61, Private, 153048, HS-grad, 9, Widowed, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +28, Private, 192200, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, <=50K +34, Private, 355571, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +47, Self-emp-inc, 139268, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 60, United-States, >50K +26, Private, 34402, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +35, Private, 25955, 11th, 7, Never-married, Other-service, Unmarried, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +36, Private, 209609, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, <=50K +47, Private, 168283, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, 
Male, 0, 0, 40, United-States, >50K +17, Private, 295488, 11th, 7, Never-married, Other-service, Own-child, Black, Female, 0, 0, 25, United-States, <=50K +35, Private, 190895, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +33, Private, 164190, Masters, 14, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 20, United-States, <=50K +25, Private, 216010, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 387568, 10th, 6, Never-married, Sales, Own-child, White, Male, 0, 0, 10, United-States, <=50K +47, State-gov, 188386, Masters, 14, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +44, Private, 174491, HS-grad, 9, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +41, Private, 31221, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +30, Private, 272451, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, Self-emp-not-inc, 152652, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +53, Private, 104413, HS-grad, 9, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K +40, Private, 105936, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 5013, 0, 20, United-States, <=50K +24, Private, 379066, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 2205, 24, United-States, <=50K +27, Private, 214858, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 237735, 5th-6th, 3, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 37, Mexico, <=50K +36, Private, 158592, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +41, Private, 237321, 1st-4th, 
2, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, >50K +41, Private, 23646, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 169240, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Federal-gov, 454508, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 130356, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 48, United-States, <=50K +22, Private, 427686, 10th, 6, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Local-gov, 36411, 12th, 8, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 30, United-States, <=50K +39, Private, 548510, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 30, United-States, <=50K +38, Private, 187264, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 55, United-States, <=50K +35, State-gov, 140752, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 325596, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 175804, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 107302, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +63, Local-gov, 41161, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +39, Private, 401832, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, <=50K +57, Self-emp-not-inc, 353808, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +58, Self-emp-inc, 349910, 
Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +29, Private, 161478, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Japan, <=50K +17, Private, 400225, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +40, Private, 367533, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +69, Self-emp-not-inc, 69306, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 15, United-States, <=50K +28, Private, 270366, 10th, 6, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 103751, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 75227, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 14084, 0, 40, United-States, >50K +45, Local-gov, 132563, Prof-school, 15, Divorced, Prof-specialty, Unmarried, Black, Female, 0, 1726, 40, United-States, <=50K +33, State-gov, 79580, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +41, Local-gov, 344624, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 1485, 40, United-States, >50K +37, Self-emp-inc, 186359, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 60, United-States, >50K +50, Private, 121685, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +48, Private, 75104, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, ?, 188343, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +36, Private, 246449, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +21, Private, 85088, Some-college, 10, Never-married, 
Other-service, Own-child, White, Female, 0, 0, 37, United-States, <=50K +37, Private, 545483, Assoc-acdm, 12, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, State-gov, 243986, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 20, United-States, <=50K +54, Self-emp-not-inc, 32778, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 30, United-States, <=50K +28, Private, 369114, HS-grad, 9, Separated, Sales, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 217200, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 149220, Assoc-voc, 11, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +46, ?, 162034, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, ?, 157813, 11th, 7, Divorced, ?, Unmarried, White, Female, 0, 0, 58, Canada, <=50K +17, ?, 179715, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Self-emp-not-inc, 335549, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 2444, 45, United-States, >50K +47, Private, 102308, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 367749, 1st-4th, 2, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, El-Salvador, <=50K +25, Private, 98281, 12th, 8, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 43, United-States, <=50K +35, Private, 115792, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, <=50K +29, Private, 277788, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 25, United-States, <=50K +30, Private, 103435, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 37646, 
Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, Self-emp-not-inc, 385632, 7th-8th, 4, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Self-emp-not-inc, 210278, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 30, United-States, <=50K +28, Private, 335357, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 272165, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Local-gov, 148995, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 60, United-States, >50K +46, Self-emp-not-inc, 113434, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, State-gov, 132551, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +38, Federal-gov, 115433, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 7688, 0, 33, United-States, >50K +29, Private, 227890, HS-grad, 9, Never-married, Protective-serv, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +25, Private, 503012, 5th-6th, 3, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +56, Private, 250873, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +31, Private, 407930, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 148187, 11th, 7, Never-married, Other-service, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 159322, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 334368, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 
196328, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, United-States, <=50K +45, Private, 270842, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +71, Private, 235079, Preschool, 1, Widowed, Craft-repair, Unmarried, Black, Male, 0, 0, 10, United-States, <=50K +65, ?, 327154, HS-grad, 9, Widowed, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 188391, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 50, United-States, >50K +19, Federal-gov, 30559, HS-grad, 9, Married-AF-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +34, Local-gov, 255098, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 248010, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, United-States, <=50K +40, Private, 174515, HS-grad, 9, Married-spouse-absent, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +90, Private, 171956, Some-college, 10, Separated, Adm-clerical, Own-child, White, Female, 0, 0, 40, Puerto-Rico, <=50K +56, Private, 193130, Prof-school, 15, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 16, United-States, <=50K +21, Private, 108670, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 186172, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +45, Private, 348854, Some-college, 10, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 27, United-States, <=50K +46, Private, 271828, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +64, Private, 148606, 10th, 6, Separated, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +29, Local-gov, 123983, Masters, 14, Never-married, Prof-specialty, 
Own-child, Asian-Pac-Islander, Male, 0, 0, 40, Taiwan, <=50K +22, Private, 24896, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 30, Germany, <=50K +47, Private, 573583, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 48, Italy, >50K +67, Self-emp-inc, 106175, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2392, 75, United-States, >50K +43, Private, 307767, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 200574, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 59083, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 1672, 50, United-States, <=50K +53, Private, 358056, 11th, 7, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +81, Private, 114670, 9th, 5, Widowed, Priv-house-serv, Not-in-family, Black, Female, 2062, 0, 5, United-States, <=50K +33, Local-gov, 262042, HS-grad, 9, Divorced, Adm-clerical, Own-child, White, Female, 0, 1138, 40, United-States, <=50K +17, Private, 206010, 12th, 8, Never-married, Other-service, Own-child, White, Female, 0, 0, 8, United-States, <=50K +55, Self-emp-inc, 183869, Prof-school, 15, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, ?, >50K +28, Private, 159001, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +24, Private, 155818, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 96055, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +30, Local-gov, 131776, Masters, 14, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 228613, 11th, 7, Never-married, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +26, 
Private, 198163, Masters, 14, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 37028, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +30, Private, 177304, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 144064, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 146659, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +63, Self-emp-not-inc, 26904, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 98, United-States, <=50K +23, Private, 238917, 7th-8th, 4, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 36, United-States, <=50K +56, Private, 170148, HS-grad, 9, Divorced, Craft-repair, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +42, Self-emp-not-inc, 27821, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +40, Private, 220460, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Canada, <=50K +49, Private, 101320, Assoc-acdm, 12, Married-civ-spouse, Sales, Wife, White, Female, 0, 1902, 40, United-States, >50K +35, Private, 173858, HS-grad, 9, Married-spouse-absent, Craft-repair, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +52, Private, 91048, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, Black, Female, 0, 0, 35, United-States, <=50K +28, Private, 298696, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 207202, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 70, United-States, <=50K +21, ?, 230397, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 5, United-States, <=50K +43, Self-emp-not-inc, 180599, HS-grad, 9, Never-married, Craft-repair, 
Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +32, ?, 199046, Assoc-voc, 11, Never-married, ?, Unmarried, White, Female, 0, 0, 2, United-States, <=50K +29, Self-emp-not-inc, 132686, Prof-school, 15, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 50, Italy, >50K +23, Private, 240063, Bachelors, 13, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 25, United-States, <=50K +50, Local-gov, 177705, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1740, 48, United-States, <=50K +34, Private, 511361, Some-college, 10, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +19, Private, 89397, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 239439, 11th, 7, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 36989, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 76978, HS-grad, 9, Never-married, Sales, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +75, Private, 200068, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +24, Private, 454941, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, State-gov, 107218, Bachelors, 13, Never-married, Tech-support, Own-child, Asian-Pac-Islander, Male, 0, 0, 20, United-States, <=50K +17, Local-gov, 182070, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 16, United-States, <=50K +31, Private, 176360, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +31, Private, 452405, Preschool, 1, Never-married, Other-service, Other-relative, White, Female, 0, 0, 35, Mexico, <=50K +18, ?, 297396, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 10, 
United-States, <=50K +45, Private, 84790, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +31, Private, 186787, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 42, United-States, <=50K +27, Private, 169662, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 42, United-States, >50K +48, Private, 125933, Some-college, 10, Widowed, Exec-managerial, Unmarried, Black, Female, 0, 1669, 38, United-States, <=50K +22, ?, 35448, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 22, United-States, <=50K +34, Private, 225548, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 30, United-States, <=50K +26, Private, 240842, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +53, Private, 103931, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, Private, 232618, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +49, Local-gov, 288548, Masters, 14, Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +40, Private, 220609, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Self-emp-inc, 26145, 10th, 6, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 80, United-States, <=50K +23, Private, 268525, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +68, ?, 133758, 7th-8th, 4, Widowed, ?, Not-in-family, Black, Male, 0, 0, 10, United-States, <=50K +42, Private, 121264, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +37, Self-emp-not-inc, 29814, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 85, United-States, <=50K +27, Private, 193701, 
HS-grad, 9, Never-married, Craft-repair, Own-child, White, Female, 0, 0, 45, United-States, <=50K +38, Private, 183279, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 44, United-States, >50K +27, Private, 163942, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 35, Ireland, <=50K +75, Private, 188612, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Self-emp-inc, 102771, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, >50K +27, Private, 85625, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +36, Self-emp-not-inc, 245090, Bachelors, 13, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, Mexico, <=50K +36, Private, 131239, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 45, United-States, >50K +35, Private, 182074, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, White, Male, 0, 0, 35, United-States, <=50K +36, Private, 187046, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +53, Private, 90624, 11th, 7, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +27, Private, 37933, Some-college, 10, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +34, Private, 182177, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 3325, 0, 35, United-States, <=50K +61, Private, 716416, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 44, United-States, >50K +29, Private, 190562, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 0, 56, United-States, <=50K +40, State-gov, 141583, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +37, Private, 98941, Masters, 14, Never-married, Exec-managerial, Not-in-family, 
White, Male, 0, 0, 40, United-States, >50K +22, Private, 201729, 9th, 5, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 30, United-States, <=50K +43, Self-emp-inc, 175485, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Self-emp-not-inc, 149168, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +28, Private, 115971, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 161708, Bachelors, 13, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +64, Local-gov, 244903, 11th, 7, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +46, Private, 155664, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 112754, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 178385, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 48, India, <=50K +20, Private, 44064, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 25, United-States, <=50K +62, Self-emp-not-inc, 120939, Some-college, 10, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 165134, Assoc-voc, 11, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 35, Columbia, <=50K +29, Private, 100405, 10th, 6, Married-civ-spouse, Farming-fishing, Wife, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 361888, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, Japan, <=50K +39, Local-gov, 167864, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 30, United-States, <=50K +39, Private, 202950, Bachelors, 13, Never-married, 
Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +37, Private, 218188, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 32, United-States, <=50K +38, Private, 234962, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 2829, 0, 30, Mexico, <=50K +72, ?, 177226, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, United-States, <=50K +31, Private, 259931, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 189528, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +38, Private, 34996, Some-college, 10, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 112584, Some-college, 10, Divorced, Sales, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +25, Private, 117589, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, ?, 145234, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 30, United-States, <=50K +37, Private, 267086, Assoc-voc, 11, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 52, United-States, <=50K +49, Private, 44434, Some-college, 10, Divorced, Tech-support, Other-relative, White, Male, 0, 0, 35, United-States, <=50K +26, Private, 96130, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 181382, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 43, United-States, <=50K +44, Self-emp-inc, 168845, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 60, United-States, <=50K +37, Private, 271767, Masters, 14, Separated, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +42, Private, 194636, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K 
+64, State-gov, 194894, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 4787, 0, 40, United-States, >50K +28, Private, 132686, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Self-emp-not-inc, 185848, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 4650, 0, 50, United-States, <=50K +40, State-gov, 184378, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Federal-gov, 270859, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 231866, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 65, United-States, <=50K +49, Private, 36032, Some-college, 10, Never-married, Exec-managerial, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +51, State-gov, 172962, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +57, Private, 98350, Prof-school, 15, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 1902, 40, Philippines, >50K +51, Private, 24185, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +38, Private, 53930, 10th, 6, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, ?, <=50K +24, Private, 85088, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 1762, 32, United-States, <=50K +45, Self-emp-not-inc, 94962, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, England, <=50K +28, Private, 480861, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +42, Self-emp-inc, 187702, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2415, 60, United-States, >50K +22, Private, 52262, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 
40, United-States, <=50K +28, State-gov, 52636, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, Private, 175273, HS-grad, 9, Widowed, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 327825, HS-grad, 9, Separated, Machine-op-inspct, Unmarried, White, Female, 0, 2238, 40, United-States, <=50K +47, Private, 125892, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 75, United-States, >50K +40, ?, 78255, HS-grad, 9, Divorced, ?, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +30, Private, 398827, HS-grad, 9, Married-AF-spouse, Adm-clerical, Husband, White, Male, 0, 0, 60, United-States, <=50K +61, Private, 208919, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +71, Local-gov, 365996, Bachelors, 13, Widowed, Prof-specialty, Unmarried, White, Female, 0, 0, 6, United-States, <=50K +42, Private, 307638, HS-grad, 9, Divorced, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +44, Local-gov, 33068, 12th, 8, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, <=50K +46, Self-emp-not-inc, 254291, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +50, Local-gov, 125417, Prof-school, 15, Never-married, Exec-managerial, Not-in-family, Black, Female, 0, 0, 52, United-States, >50K +27, State-gov, 28848, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 9, United-States, <=50K +40, ?, 273425, Assoc-voc, 11, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +21, Private, 194723, Some-college, 10, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 40, Mexico, <=50K +25, Private, 195118, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 35, United-States, <=50K +61, Private, 123273, 5th-6th, 3, Divorced, 
Transport-moving, Not-in-family, White, Male, 0, 1876, 56, United-States, <=50K +54, Private, 220115, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +31, Private, 265706, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, Self-emp-not-inc, 279129, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +39, Self-emp-inc, 122742, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 55, United-States, >50K +57, Self-emp-inc, 172654, Prof-school, 15, Married-civ-spouse, Transport-moving, Husband, White, Male, 15024, 0, 50, United-States, >50K +48, Private, 119199, Bachelors, 13, Divorced, Sales, Unmarried, White, Female, 0, 0, 44, United-States, <=50K +30, Private, 107793, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 56, United-States, >50K +35, Private, 237943, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 60, United-States, <=50K +42, Self-emp-not-inc, 64632, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +34, Self-emp-not-inc, 96245, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Private, 361494, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +69, Local-gov, 122850, 10th, 6, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 20, United-States, <=50K +29, Private, 173652, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 164663, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 98678, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 15, United-States, <=50K +40, Private, 245529, 
Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 55294, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 140583, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 79797, HS-grad, 9, Married-spouse-absent, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, Japan, >50K +72, ?, 113044, HS-grad, 9, Widowed, ?, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 283499, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 30, United-States, <=50K +41, Local-gov, 51111, Bachelors, 13, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 232475, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +48, Private, 176140, 11th, 7, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +27, Private, 301654, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 376455, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +28, ?, 192569, HS-grad, 9, Never-married, ?, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +27, Private, 229803, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +20, Private, 337639, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 130849, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +32, Private, 296282, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 266645, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 
0, 60, United-States, >50K +23, State-gov, 110128, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 90196, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +40, State-gov, 40024, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, >50K +35, Private, 144322, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +74, Self-emp-inc, 162340, Some-college, 10, Widowed, Exec-managerial, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 169069, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 113601, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +20, Self-emp-not-inc, 157145, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 2258, 10, United-States, <=50K +44, Private, 111275, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 56, United-States, <=50K +46, Local-gov, 102076, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +20, ?, 182117, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Self-emp-not-inc, 145409, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 50, United-States, >50K +40, Private, 190122, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 331482, Prof-school, 15, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1977, 40, United-States, >50K +60, Self-emp-not-inc, 170114, 9th, 5, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 1672, 84, United-States, <=50K +48, Self-emp-inc, 193188, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 
0, 0, 40, United-States, >50K +46, Local-gov, 267588, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 70, United-States, <=50K +48, Self-emp-inc, 200471, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +22, ?, 175586, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +24, Local-gov, 322658, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, State-gov, 263982, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 266287, 12th, 8, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +39, Private, 278187, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +65, Self-emp-inc, 81413, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 2352, 65, United-States, <=50K +22, Private, 221745, Some-college, 10, Divorced, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 140764, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +28, Private, 206351, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 176814, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 50, United-States, >50K +42, Local-gov, 245307, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 48, United-States, >50K +61, State-gov, 124971, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, >50K +28, Private, 119545, Some-college, 10, Married-civ-spouse, Exec-managerial, Own-child, White, Male, 7688, 0, 50, United-States, >50K +18, Private, 179203, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, 
United-States, <=50K +24, Federal-gov, 44075, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +45, Private, 178319, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 56, United-States, >50K +24, Private, 219754, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 198316, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +20, Private, 168165, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +35, Private, 356838, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 2829, 0, 55, Poland, <=50K +52, Self-emp-inc, 210736, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +25, Private, 173212, Assoc-acdm, 12, Never-married, Farming-fishing, Not-in-family, White, Male, 2354, 0, 45, United-States, <=50K +19, Private, 130431, 5th-6th, 3, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 36, Mexico, <=50K +35, ?, 169809, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, <=50K +54, Private, 197481, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 155066, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 31290, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +42, Private, 54102, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 181546, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +55, Private, 153484, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 50, United-States, >50K 
+44, State-gov, 351228, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 131976, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 55, United-States, <=50K +26, Private, 200639, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +64, Federal-gov, 267546, Assoc-acdm, 12, Separated, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +41, Private, 179875, 11th, 7, Divorced, Other-service, Unmarried, Other, Female, 0, 0, 40, United-States, <=50K +25, ?, 237865, Some-college, 10, Never-married, ?, Own-child, Black, Male, 0, 0, 40, ?, <=50K +43, Private, 300528, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 67716, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 10520, 0, 48, United-States, >50K +48, Federal-gov, 326048, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 44, United-States, >50K +60, Private, 191188, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Self-emp-not-inc, 32172, Some-college, 10, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 252903, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 1977, 40, United-States, >50K +37, Federal-gov, 334314, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 83704, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 30, United-States, <=50K +44, Private, 160574, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 65, United-States, >50K +27, Private, 203776, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +47, Local-gov, 328610, HS-grad, 9, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 295589, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 1977, 40, United-States, >50K +40, Private, 174373, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Private, 247752, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, ?, 199244, 10th, 6, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 139992, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 95680, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +55, Self-emp-inc, 189933, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +38, Private, 498785, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, State-gov, 177526, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 15, United-States, <=50K +64, Self-emp-not-inc, 150121, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 25, United-States, >50K +56, Federal-gov, 130454, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Private, 119079, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 49, United-States, >50K +33, Private, 220939, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 45, United-States, >50K +33, Private, 94235, Prof-school, 15, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 42, United-States, >50K +21, Private, 305874, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Local-gov, 62020, HS-grad, 9, Widowed, 
Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +58, Private, 235624, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, Germany, >50K +43, Local-gov, 247514, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 275726, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +45, Private, 72896, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, Local-gov, 110510, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 50, United-States, >50K +41, Private, 173938, Prof-school, 15, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, ?, >50K +27, Private, 200641, 10th, 6, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, Mexico, <=50K +53, Private, 211654, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, ?, >50K +38, Private, 242720, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +31, Private, 111567, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, >50K +41, Private, 179533, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 35, United-States, <=50K +22, State-gov, 334693, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 198096, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +41, State-gov, 355756, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 19395, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Wife, White, Female, 0, 0, 35, United-States, <=50K +41, Local-gov, 242586, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, 
White, Male, 0, 0, 40, United-States, <=50K +36, Private, 208358, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Male, 99999, 0, 45, United-States, >50K +49, Private, 160647, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +20, Private, 227943, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 45, United-States, <=50K +58, Self-emp-not-inc, 197665, HS-grad, 9, Married-spouse-absent, Other-service, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +35, Self-emp-not-inc, 216129, 12th, 8, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, Trinadad&Tobago, <=50K +30, Local-gov, 326104, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 57211, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 100219, Assoc-acdm, 12, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 45, United-States, <=50K +40, Private, 291192, HS-grad, 9, Widowed, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +54, State-gov, 93415, Bachelors, 13, Never-married, Prof-specialty, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, United-States, >50K +35, Private, 191502, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +35, Private, 261382, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 170230, Bachelors, 13, Married-spouse-absent, Other-service, Not-in-family, White, Female, 0, 0, 40, ?, <=50K +59, Private, 374924, HS-grad, 9, Separated, Sales, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-inc, 320984, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +39, Private, 338320, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 
0, 60, United-States, >50K +51, Private, 135190, 7th-8th, 4, Separated, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 30, United-States, <=50K +71, Private, 157909, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 2964, 0, 60, United-States, <=50K +33, Private, 637222, 12th, 8, Divorced, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +28, Private, 430084, HS-grad, 9, Divorced, Other-service, Own-child, Black, Male, 0, 0, 35, United-States, <=50K +30, Private, 125279, HS-grad, 9, Married-spouse-absent, Sales, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 221955, 5th-6th, 3, Married-spouse-absent, Farming-fishing, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +51, Self-emp-inc, 180195, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +39, Private, 208778, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, >50K +62, Private, 81534, Some-college, 10, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +37, Private, 325538, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 60, ?, <=50K +28, Private, 142264, 9th, 5, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 50, Dominican-Republic, <=50K +23, Private, 128604, HS-grad, 9, Never-married, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 48, South, <=50K +39, Private, 277886, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 30, United-States, <=50K +50, Self-emp-inc, 100029, Bachelors, 13, Widowed, Sales, Unmarried, White, Male, 0, 0, 65, United-States, >50K +31, Private, 169269, 7th-8th, 4, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +45, Local-gov, 160472, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 50, United-States, >50K +23, ?, 123983, Bachelors, 13, 
Never-married, ?, Own-child, Other, Male, 0, 0, 40, United-States, <=50K +47, Private, 297884, 10th, 6, Widowed, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +59, Private, 99131, HS-grad, 9, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 18, United-States, <=50K +32, Private, 44392, Assoc-acdm, 12, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +82, ?, 29441, 7th-8th, 4, Widowed, ?, Not-in-family, White, Male, 0, 0, 5, United-States, <=50K +49, Private, 199029, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 2415, 55, United-States, >50K +74, Federal-gov, 181508, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Male, 0, 0, 17, United-States, <=50K +22, Private, 190625, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 35, United-States, <=50K +32, Private, 194740, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, Greece, <=50K +34, Private, 27380, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, >50K +59, Private, 160631, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, >50K +36, Private, 224531, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +59, Private, 283005, 11th, 7, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +47, Self-emp-inc, 101926, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 70, United-States, >50K +53, Local-gov, 135102, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 2002, 45, United-States, <=50K +25, Self-emp-not-inc, 113436, Some-college, 10, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 35, United-States, <=50K +44, Private, 248973, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, Black, Male, 0, 0, 65, United-States, 
<=50K +57, Self-emp-not-inc, 225334, Prof-school, 15, Married-civ-spouse, Sales, Wife, White, Female, 15024, 0, 35, United-States, >50K +42, Self-emp-not-inc, 157562, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 1902, 80, United-States, >50K +58, Local-gov, 310085, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 129597, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 3464, 0, 40, United-States, <=50K +32, ?, 53042, HS-grad, 9, Never-married, ?, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +45, Private, 204205, 7th-8th, 4, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +47, Private, 169324, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 35, United-States, >50K +52, ?, 134447, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, <=50K +56, Self-emp-not-inc, 236731, 1st-4th, 2, Separated, Exec-managerial, Not-in-family, White, Male, 0, 0, 25, ?, <=50K +52, Private, 141301, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 235124, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +36, Self-emp-not-inc, 367020, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +41, Private, 149102, HS-grad, 9, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Poland, <=50K +30, Private, 423770, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +44, Private, 211759, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Other, Male, 0, 0, 40, Puerto-Rico, <=50K +17, ?, 110998, Some-college, 10, Never-married, ?, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +34, Private, 56883, Some-college, 10, 
Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 223062, Some-college, 10, Separated, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +29, Private, 406662, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 206600, 9th, 5, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 48, Mexico, <=50K +42, Local-gov, 147510, Bachelors, 13, Separated, Prof-specialty, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 235646, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 3103, 0, 40, United-States, >50K +26, Private, 187577, Assoc-voc, 11, Never-married, Sales, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +64, Self-emp-inc, 132832, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 20051, 0, 40, ?, >50K +46, Self-emp-inc, 278322, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +38, Private, 278924, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 50, United-States, >50K +49, State-gov, 203039, 11th, 7, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 145651, Some-college, 10, Never-married, Sales, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +46, Local-gov, 144531, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +30, Private, 91145, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 55, United-States, <=50K +49, Self-emp-not-inc, 211762, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +47, ?, 111563, Assoc-voc, 11, Divorced, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 180985, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, 
Male, 0, 0, 50, ?, >50K +31, Private, 207537, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 1669, 50, United-States, <=50K +19, Private, 417657, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 50, United-States, <=50K +45, Private, 189890, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, White, Female, 5455, 0, 38, United-States, <=50K +34, Private, 223212, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1848, 40, Peru, >50K +26, Private, 108658, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 190023, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 222130, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +36, Self-emp-inc, 164866, Assoc-acdm, 12, Separated, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +31, Private, 170983, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +30, Private, 186269, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 286026, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +30, Private, 403433, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 50, United-States, >50K +21, ?, 224209, HS-grad, 9, Married-civ-spouse, ?, Wife, Black, Female, 0, 0, 30, United-States, <=50K +73, Private, 123160, 10th, 6, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 10, United-States, <=50K +38, Federal-gov, 99527, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +30, Private, 123178, 10th, 6, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +33, Private, 
231043, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +52, Local-gov, 317733, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, United-States, >50K +58, Private, 241056, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 46, United-States, <=50K +34, Local-gov, 220066, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +35, Private, 180342, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Federal-gov, 31840, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 183168, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 386036, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +31, Local-gov, 446358, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, Mexico, >50K +45, Private, 28035, Some-college, 10, Never-married, Farming-fishing, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 282155, HS-grad, 9, Separated, Other-service, Other-relative, White, Female, 0, 0, 25, United-States, <=50K +27, Private, 192384, Prof-school, 15, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 383637, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +29, Private, 457402, 5th-6th, 3, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 25, Mexico, <=50K +34, Self-emp-inc, 80249, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 72, United-States, <=50K +32, State-gov, 159537, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 
240859, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Cuba, <=50K +33, Private, 83446, 11th, 7, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, >50K +74, ?, 29866, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 2, United-States, <=50K +62, Private, 185503, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +39, Self-emp-not-inc, 68781, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 220589, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 51136, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +24, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +76, ?, 28221, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, Canada, >50K +25, Private, 201413, Some-college, 10, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +19, Private, 40425, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 28, United-States, <=50K +31, Private, 189461, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 41, United-States, <=50K +53, Private, 200576, 11th, 7, Divorced, Craft-repair, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +61, Private, 92691, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 3, United-States, <=50K +47, Private, 664821, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, El-Salvador, <=50K +37, Private, 175130, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +50, Self-emp-not-inc, 391016, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 30, 
United-States, <=50K +27, Private, 249315, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +58, Private, 111169, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 334946, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +39, Private, 352248, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 173804, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +56, Private, 155449, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 73689, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 55, United-States, <=50K +23, Private, 227594, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 38, United-States, <=50K +47, Private, 161676, 11th, 7, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +68, Private, 75913, 12th, 8, Widowed, Sales, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +47, Local-gov, 242552, Some-college, 10, Never-married, Protective-serv, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +45, Federal-gov, 352094, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 7688, 0, 40, Guatemala, >50K +26, Private, 159732, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +20, Private, 131230, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 1590, 40, United-States, <=50K +46, Private, 180695, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 189922, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, 
United-States, >50K +37, Private, 409189, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +43, Private, 111252, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 42, United-States, <=50K +59, Private, 294395, Masters, 14, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 172718, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 43403, Some-college, 10, Divorced, Farming-fishing, Not-in-family, White, Female, 0, 1590, 54, United-States, <=50K +63, Private, 111963, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 16, United-States, <=50K +45, Private, 247869, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +59, Private, 114032, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, ?, 356838, 12th, 8, Never-married, ?, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +26, Private, 179633, HS-grad, 9, Never-married, Tech-support, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 19847, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 231689, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 209942, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +53, Private, 197492, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 262439, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, >50K +46, Private, 283037, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +79, ?, 144533, HS-grad, 9, 
Widowed, ?, Not-in-family, Black, Female, 0, 0, 30, United-States, <=50K +31, Private, 83446, HS-grad, 9, Widowed, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 215443, HS-grad, 9, Separated, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +57, Local-gov, 268252, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 62, United-States, <=50K +40, Private, 181015, HS-grad, 9, Separated, Other-service, Unmarried, White, Female, 0, 0, 47, United-States, <=50K +41, Self-emp-inc, 139916, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, Other, Male, 0, 2179, 84, Mexico, <=50K +20, Private, 195770, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 26, United-States, <=50K +45, Private, 125194, 11th, 7, Never-married, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +27, Private, 58654, Assoc-voc, 11, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 252327, 5th-6th, 3, Married-spouse-absent, Craft-repair, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +30, Private, 116508, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Germany, <=50K +36, Private, 166988, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +25, Private, 374163, HS-grad, 9, Married-spouse-absent, Farming-fishing, Not-in-family, Other, Male, 0, 0, 40, Mexico, <=50K +30, ?, 96851, Some-college, 10, Never-married, ?, Not-in-family, White, Female, 0, 1719, 25, United-States, <=50K +31, Private, 196788, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 186172, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 45, United-States, >50K +26, Private, 245628, 11th, 7, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +25, 
Private, 159732, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 129856, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 182812, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 3325, 0, 52, Dominican-Republic, <=50K +41, Private, 314322, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 102976, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +57, Self-emp-inc, 42959, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +21, Private, 256356, 11th, 7, Never-married, Priv-house-serv, Other-relative, White, Female, 0, 0, 40, Mexico, <=50K +29, Private, 136277, 10th, 6, Never-married, Other-service, Own-child, Black, Female, 0, 0, 32, United-States, <=50K +36, Private, 284616, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 185554, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 25, United-States, <=50K +51, Private, 138847, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 33487, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 84306, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5013, 0, 50, United-States, <=50K +40, Self-emp-not-inc, 223881, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 70, United-States, >50K +61, Private, 149653, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 348739, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, 
<=50K +20, ?, 235442, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 35, United-States, <=50K +21, Private, 34506, HS-grad, 9, Separated, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +40, Private, 346964, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 192208, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +21, Private, 305874, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 54, United-States, <=50K +35, Self-emp-not-inc, 462890, 10th, 6, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 50, United-States, <=50K +39, Private, 89508, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 200153, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 179446, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 208965, 9th, 5, Never-married, Machine-op-inspct, Unmarried, Other, Male, 0, 0, 40, Mexico, <=50K +32, Private, 40142, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Self-emp-not-inc, 57452, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 327573, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 151267, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 15024, 0, 40, United-States, >50K +44, Private, 265266, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 203836, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 3464, 0, 40, Columbia, <=50K +51, ?, 163998, HS-grad, 9, Married-spouse-absent, ?, 
Not-in-family, White, Male, 0, 0, 20, United-States, >50K +46, Self-emp-not-inc, 28281, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Private, 293196, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 60, Iran, >50K +45, Private, 214627, Doctorate, 16, Widowed, Prof-specialty, Unmarried, White, Male, 15020, 0, 40, Iran, >50K +20, Private, 368852, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +44, Private, 353396, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +33, Private, 161745, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 45, United-States, <=50K +18, Private, 97963, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +61, Self-emp-inc, 156542, Prof-school, 15, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +50, State-gov, 198103, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Federal-gov, 55377, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +34, Private, 173730, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +53, Private, 374588, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 60, United-States, <=50K +39, Self-emp-not-inc, 174330, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 78141, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +66, ?, 190324, HS-grad, 9, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 18, United-States, <=50K +26, Private, 31350, 11th, 7, Never-married, Farming-fishing, Not-in-family, 
White, Male, 0, 0, 50, United-States, <=50K +41, Private, 243607, 5th-6th, 3, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, Mexico, <=50K +47, Local-gov, 134671, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 197023, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +52, Private, 117674, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 169815, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +43, Private, 598606, 9th, 5, Separated, Handlers-cleaners, Unmarried, Black, Female, 0, 0, 50, United-States, <=50K +42, Federal-gov, 122861, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 166235, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 30, United-States, <=50K +41, Private, 187821, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 2885, 0, 40, United-States, <=50K +34, Private, 340940, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 60, United-States, >50K +52, Self-emp-not-inc, 194791, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 231323, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Local-gov, 305597, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 25429, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +46, State-gov, 192779, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +39, Private, 346478, Some-college, 10, Never-married, 
Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +22, Private, 341368, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +30, State-gov, 295612, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 168936, Assoc-voc, 11, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 32, United-States, <=50K +43, Private, 218558, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +37, Private, 336598, 5th-6th, 3, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 36, Mexico, <=50K +23, Private, 308205, Assoc-acdm, 12, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +39, Local-gov, 357173, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 59, United-States, <=50K +54, Private, 457237, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Self-emp-inc, 284799, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +20, Private, 179423, Some-college, 10, Never-married, Transport-moving, Own-child, White, Female, 0, 0, 40, United-States, <=50K +50, Self-emp-not-inc, 363405, Bachelors, 13, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 50, United-States, >50K +17, Private, 139183, 10th, 6, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +36, Private, 203482, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 112554, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +53, Private, 99476, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +50, Private, 93690, HS-grad, 9, Married-civ-spouse, Transport-moving, 
Husband, White, Male, 0, 0, 40, United-States, >50K +38, Private, 220585, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, Self-emp-not-inc, 194638, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 32, United-States, <=50K +53, Private, 154785, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +40, ?, 162108, Bachelors, 13, Divorced, ?, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +23, Self-emp-inc, 214542, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 161922, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 43, United-States, <=50K +46, Private, 207940, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +28, Private, 259351, 10th, 6, Never-married, Other-service, Other-relative, Amer-Indian-Eskimo, Male, 0, 0, 40, Mexico, <=50K +59, Private, 208395, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +41, Private, 116391, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 239781, Preschool, 1, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, Mexico, <=50K +56, Private, 174351, 12th, 8, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Italy, <=50K +50, Self-emp-not-inc, 44368, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 55, El-Salvador, >50K +31, Local-gov, 188798, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +41, Private, 50122, Assoc-voc, 11, Divorced, Sales, Own-child, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 111398, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 7688, 0, 40, 
United-States, >50K +25, State-gov, 152035, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +18, ?, 139003, HS-grad, 9, Never-married, ?, Other-relative, Other, Female, 0, 0, 12, United-States, <=50K +49, Local-gov, 249289, Bachelors, 13, Divorced, Prof-specialty, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +39, Private, 257726, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, ?, 113175, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 20, United-States, <=50K +21, Private, 151158, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 25, United-States, <=50K +35, Private, 465326, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, ?, 356772, HS-grad, 9, Never-married, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 364782, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +55, Private, 198385, 7th-8th, 4, Widowed, Other-service, Unmarried, White, Female, 0, 0, 20, ?, <=50K +31, Private, 329301, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +17, Self-emp-inc, 254859, 11th, 7, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 20, United-States, <=50K +31, Private, 203488, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7298, 0, 50, United-States, >50K +25, Local-gov, 222800, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 96452, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +50, Private, 170050, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Local-gov, 116580, Bachelors, 13, Married-civ-spouse, 
Prof-specialty, Wife, White, Female, 0, 0, 20, United-States, >50K +50, Private, 400004, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +63, Private, 183608, 10th, 6, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 194055, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +23, Private, 210443, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 43272, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +43, Local-gov, 108945, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +34, Private, 114691, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +18, Private, 304169, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 35, United-States, <=50K +46, Private, 503923, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 4386, 0, 40, United-States, >50K +35, Private, 340428, Bachelors, 13, Never-married, Sales, Unmarried, White, Female, 0, 0, 40, United-States, >50K +46, State-gov, 106705, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +59, Private, 146391, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 7298, 0, 40, United-States, >50K +31, Private, 235389, 7th-8th, 4, Never-married, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 30, Portugal, <=50K +27, Private, 39665, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 37, United-States, <=50K +41, Private, 113823, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, England, <=50K +42, Private, 217826, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, ?, <=50K +55, 
Private, 349304, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +34, ?, 197688, HS-grad, 9, Never-married, ?, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 54507, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 117833, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 1669, 50, United-States, <=50K +36, Private, 163396, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +69, Private, 88566, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 1424, 0, 35, United-States, <=50K +33, Private, 323619, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 75755, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 148903, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 16, United-States, >50K +25, Private, 40915, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +21, Private, 182606, Some-college, 10, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, ?, <=50K +18, Private, 131033, 11th, 7, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 15, United-States, <=50K +35, Self-emp-not-inc, 168475, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, >50K +20, Private, 121568, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 139098, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5013, 0, 40, United-States, <=50K +46, Private, 357338, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 283268, 
Bachelors, 13, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +40, Private, 572751, Prof-school, 15, Married-civ-spouse, Craft-repair, Husband, White, Male, 5178, 0, 40, Mexico, >50K +40, Private, 315321, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 625, 52, United-States, <=50K +31, Private, 120461, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-not-inc, 65278, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Self-emp-not-inc, 208503, Some-college, 10, Divorced, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Local-gov, 112835, Masters, 14, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 265038, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +18, Private, 89478, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +55, Private, 276229, HS-grad, 9, Widowed, Sales, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +52, Private, 366232, 9th, 5, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, Cuba, <=50K +26, Private, 152035, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 205339, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +39, Private, 75995, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +62, Self-emp-not-inc, 192236, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +19, ?, 188618, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 25, United-States, <=50K +47, Private, 229737, HS-grad, 9, Married-civ-spouse, Craft-repair, 
Husband, White, Male, 0, 0, 45, United-States, >50K +51, Local-gov, 199688, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +55, Private, 52953, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 221043, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +59, Federal-gov, 115389, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 36, United-States, <=50K +45, Self-emp-not-inc, 204205, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 65, United-States, <=50K +52, Private, 338816, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 45, United-States, >50K +21, Private, 197387, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +31, Private, 42485, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 55, United-States, <=50K +29, Private, 367706, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +24, Private, 102493, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 263746, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 24, United-States, <=50K +47, Private, 115358, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 189680, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +32, ?, 282622, HS-grad, 9, Divorced, ?, Unmarried, White, Female, 0, 0, 28, United-States, <=50K +34, Private, 127651, 10th, 6, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 44, ?, <=50K +63, Private, 230823, 12th, 8, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, Cuba, <=50K +21, Private, 
300812, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 25, United-States, <=50K +18, Private, 174732, HS-grad, 9, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 36, United-States, <=50K +49, State-gov, 183710, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +81, Self-emp-not-inc, 137018, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +36, Self-emp-inc, 213008, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +47, Private, 357848, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 165799, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +39, Self-emp-not-inc, 188571, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +46, Private, 97883, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +43, Local-gov, 105862, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 1902, 40, United-States, >50K +39, Local-gov, 57424, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +29, Private, 151476, Some-college, 10, Separated, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 129583, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Female, 0, 0, 16, United-States, <=50K +57, Private, 180920, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 43, United-States, <=50K +38, Self-emp-not-inc, 182416, HS-grad, 9, Never-married, Sales, Unmarried, Black, Female, 0, 0, 42, United-States, <=50K +25, Private, 251915, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +39, Local-gov, 187127, Some-college, 
10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 69045, Some-college, 10, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 40, Jamaica, <=50K +56, Private, 192869, Masters, 14, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1977, 44, United-States, >50K +39, Private, 74163, 12th, 8, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 60847, Assoc-voc, 11, Never-married, Sales, Unmarried, White, Female, 0, 0, 60, United-States, <=50K +17, ?, 213055, 11th, 7, Never-married, ?, Not-in-family, Other, Female, 0, 0, 20, United-States, <=50K +67, Self-emp-not-inc, 116057, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 3273, 0, 16, United-States, <=50K +41, Private, 82393, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 50, United-States, <=50K +24, Local-gov, 134181, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 50, United-States, <=50K +51, Private, 159910, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, Black, Male, 10520, 0, 40, United-States, >50K +30, Self-emp-inc, 117570, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +47, Self-emp-inc, 214169, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 15024, 0, 40, United-States, >50K +56, Private, 56331, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 32, United-States, <=50K +51, Private, 35576, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +57, Self-emp-not-inc, 149168, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +34, Private, 157165, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 278130, Bachelors, 13, Never-married, 
Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 257200, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 283122, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Private, 580248, HS-grad, 9, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 230054, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +58, Private, 519006, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 52, United-States, <=50K +19, ?, 37332, HS-grad, 9, Never-married, ?, Own-child, White, Female, 1055, 0, 12, United-States, <=50K +19, ?, 365871, 7th-8th, 4, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +68, State-gov, 235882, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2377, 60, United-States, >50K +43, Private, 336513, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 40, United-States, >50K +17, Private, 115551, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +53, State-gov, 50048, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 50, United-States, >50K +37, Self-emp-inc, 382802, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 99, United-States, >50K +21, ?, 180303, Bachelors, 13, Never-married, ?, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 25, ?, <=50K +63, Private, 106023, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 332379, Some-college, 10, Married-spouse-absent, Transport-moving, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +29, Private, 95465, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, 
United-States, <=50K +43, Local-gov, 96102, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 1887, 40, United-States, >50K +27, Private, 36440, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 65, United-States, >50K +25, Self-emp-not-inc, 209384, HS-grad, 9, Never-married, Other-service, Other-relative, White, Male, 0, 0, 32, United-States, <=50K +28, Private, 50814, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +54, Private, 143865, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +74, ?, 104661, Some-college, 10, Widowed, ?, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +31, Local-gov, 50442, Some-college, 10, Never-married, Exec-managerial, Own-child, Amer-Indian-Eskimo, Female, 0, 0, 32, United-States, <=50K +23, Private, 236601, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +19, Private, 100999, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 30, United-States, <=50K +39, ?, 362685, Preschool, 1, Widowed, ?, Not-in-family, White, Female, 0, 0, 20, El-Salvador, <=50K +61, Self-emp-not-inc, 32423, HS-grad, 9, Married-civ-spouse, Farming-fishing, Wife, White, Female, 22040, 0, 40, United-States, <=50K +59, ?, 154236, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 7688, 0, 40, United-States, >50K +27, Self-emp-inc, 153546, Assoc-voc, 11, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 36, United-States, >50K +19, Private, 182355, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +23, ?, 191444, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Local-gov, 44216, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 35, United-States, <=50K +40, Private, 
97688, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 48, United-States, >50K +53, Private, 209022, 11th, 7, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 37, United-States, <=50K +32, Private, 96016, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +72, Self-emp-not-inc, 52138, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2392, 25, United-States, >50K +61, Private, 159046, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 138634, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 130125, 10th, 6, Never-married, Other-service, Own-child, Amer-Indian-Eskimo, Female, 1055, 0, 20, United-States, <=50K +73, Private, 247355, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 16, Canada, <=50K +41, Self-emp-not-inc, 227065, Some-college, 10, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 244771, Some-college, 10, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 0, 20, Jamaica, <=50K +23, Private, 215616, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, Canada, <=50K +65, Private, 386672, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 15, United-States, <=50K +45, Self-emp-inc, 177543, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 50, United-States, <=50K +52, Federal-gov, 617021, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, Black, Male, 7688, 0, 40, United-States, >50K +24, Local-gov, 117109, Bachelors, 13, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 27, United-States, <=50K +23, Private, 373550, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 19847, 
Some-college, 10, Divorced, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 189590, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 58343, HS-grad, 9, Divorced, Farming-fishing, Unmarried, White, Male, 0, 0, 56, United-States, <=50K +17, Private, 354201, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 119422, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 363405, HS-grad, 9, Separated, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +63, Private, 181863, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 27, United-States, <=50K +27, Private, 194472, HS-grad, 9, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +31, Private, 247328, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3137, 0, 40, Mexico, <=50K +71, Self-emp-not-inc, 130731, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +35, Private, 236910, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +44, Private, 378251, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 38, United-States, <=50K +36, Private, 120760, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, China, <=50K +22, Private, 203182, Bachelors, 13, Never-married, Exec-managerial, Other-relative, White, Female, 0, 0, 20, United-States, <=50K +32, Private, 130304, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1485, 48, United-States, <=50K +30, Local-gov, 352542, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +60, ?, 191024, HS-grad, 9, Married-civ-spouse, 
?, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 197728, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +76, Private, 316185, 7th-8th, 4, Widowed, Protective-serv, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +41, Private, 89226, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 292353, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, Other, Male, 0, 0, 40, United-States, <=50K +45, Private, 304570, 12th, 8, Married-civ-spouse, Machine-op-inspct, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +32, Private, 180296, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 361487, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 218490, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1848, 40, United-States, >50K +63, Self-emp-not-inc, 231777, Bachelors, 13, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 189832, Assoc-acdm, 12, Never-married, Transport-moving, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +61, Private, 232308, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, State-gov, 33308, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 333677, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +33, Private, 170651, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 1055, 0, 40, United-States, <=50K +39, Private, 343403, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 36, United-States, <=50K +53, Private, 166386, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Wife, 
Asian-Pac-Islander, Female, 0, 0, 40, China, <=50K +26, Federal-gov, 48099, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 143062, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 32, United-States, <=50K +18, Private, 104704, HS-grad, 9, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +34, Private, 30497, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +44, State-gov, 174325, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 7688, 0, 40, United-States, >50K +31, Private, 286675, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 50, United-States, <=50K +44, Private, 59474, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +42, Private, 378384, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 60, United-States, >50K +43, Private, 245842, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 44, Mexico, <=50K +33, Private, 274222, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 7688, 0, 38, United-States, >50K +21, Private, 342575, Some-college, 10, Never-married, Sales, Own-child, Black, Female, 0, 0, 30, United-States, <=50K +30, Private, 206051, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +55, Private, 234213, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +57, Private, 145189, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 233490, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 344129, HS-grad, 9, Married-civ-spouse, 
Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +62, Self-emp-not-inc, 171315, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +31, Self-emp-not-inc, 181485, Bachelors, 13, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 40, United-States, >50K +51, Private, 255412, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, France, >50K +37, Private, 262409, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Female, 0, 213, 45, United-States, <=50K +45, Private, 199590, 5th-6th, 3, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 38, Mexico, <=50K +47, Private, 84726, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, ?, 226883, HS-grad, 9, Divorced, ?, Own-child, White, Male, 0, 0, 75, United-States, <=50K +75, Self-emp-not-inc, 184335, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 30, United-States, <=50K +43, Private, 102025, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Other, Male, 0, 0, 50, United-States, <=50K +39, Private, 183898, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 60, Germany, >50K +30, Private, 55291, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 150025, 5th-6th, 3, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, Guatemala, <=50K +44, Private, 100584, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +53, Local-gov, 181755, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, >50K +40, Private, 150528, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 107277, 11th, 7, Never-married, Sales, Own-child, White, 
Female, 0, 0, 20, United-States, <=50K +33, Private, 247205, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 40, England, <=50K +20, Private, 291979, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 270985, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 50, United-States, <=50K +48, Private, 62605, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +46, Self-emp-not-inc, 176863, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 53197, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 267776, HS-grad, 9, Never-married, Other-service, Other-relative, White, Female, 0, 0, 30, United-States, <=50K +24, Private, 308205, 7th-8th, 4, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +30, Private, 306383, Some-college, 10, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +70, Private, 35494, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 30, United-States, <=50K +26, Private, 291968, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 44, United-States, <=50K +34, Private, 80933, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1672, 40, United-States, <=50K +46, Private, 271828, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +70, Private, 121993, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 5, United-States, <=50K +37, Local-gov, 31023, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 36425, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 35, 
United-States, <=50K +23, Private, 407684, 9th, 5, Never-married, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, Mexico, <=50K +28, Private, 241895, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 1628, 40, United-States, <=50K +44, Self-emp-not-inc, 158555, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +58, Private, 140363, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 3325, 0, 30, United-States, <=50K +53, Private, 123429, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +23, Private, 40060, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 290286, HS-grad, 9, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +21, ?, 249271, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Local-gov, 106169, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +43, Private, 76487, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 437994, Some-college, 10, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 20, United-States, <=50K +41, Private, 113555, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 7298, 0, 50, United-States, >50K +36, Private, 160120, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +41, Local-gov, 343079, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1740, 20, United-States, <=50K +27, Private, 406662, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 4416, 0, 40, United-States, <=50K +42, Self-emp-not-inc, 37618, Some-college, 10, Married-civ-spouse, Farming-fishing, 
Husband, White, Male, 0, 0, 60, United-States, <=50K +27, Private, 114158, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 35, United-States, <=50K +41, Private, 115562, HS-grad, 9, Divorced, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 353994, Bachelors, 13, Married-civ-spouse, Exec-managerial, Other-relative, Asian-Pac-Islander, Female, 0, 0, 40, China, >50K +21, Private, 344891, Some-college, 10, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Male, 0, 0, 20, United-States, <=50K +44, Private, 286750, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 50, United-States, >50K +29, Private, 194197, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Self-emp-not-inc, 206599, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 22, United-States, <=50K +21, Local-gov, 596776, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, Guatemala, <=50K +46, Private, 56841, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 112561, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +43, Private, 147110, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 48, United-States, >50K +54, Self-emp-inc, 175339, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +38, Private, 234901, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 60, United-States, >50K +18, ?, 298133, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 217083, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +30, Private, 97757, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Wife, 
White, Female, 0, 0, 36, United-States, >50K +30, Private, 151868, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 25864, HS-grad, 9, Never-married, Exec-managerial, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 35, United-States, <=50K +26, Private, 109419, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +37, Federal-gov, 203070, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 43, United-States, <=50K +32, Private, 107843, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 5178, 0, 50, United-States, >50K +64, State-gov, 264544, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 5, United-States, >50K +18, Private, 148644, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 28, United-States, <=50K +30, Private, 125762, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 30, United-States, <=50K +36, ?, 53606, Assoc-voc, 11, Married-civ-spouse, ?, Wife, White, Female, 3908, 0, 8, United-States, <=50K +18, Private, 193741, 11th, 7, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 30, United-States, <=50K +27, Private, 588905, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 115613, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, State-gov, 222374, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 43, United-States, >50K +37, Private, 185359, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 173647, Some-college, 10, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 31166, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, Other, Female, 0, 0, 30, Germany, <=50K 
+22, ?, 517995, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +25, Self-emp-not-inc, 189027, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +38, Private, 296125, HS-grad, 9, Separated, Priv-house-serv, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +32, ?, 640383, Bachelors, 13, Divorced, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 334291, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +56, Private, 318450, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 80, United-States, >50K +29, Private, 174163, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 119721, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 142719, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 162593, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +46, Self-emp-not-inc, 236852, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +28, Local-gov, 154863, HS-grad, 9, Never-married, Protective-serv, Other-relative, Black, Male, 0, 1876, 40, United-States, <=50K +39, Private, 168894, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +42, Self-emp-not-inc, 344920, Some-college, 10, Married-civ-spouse, Farming-fishing, Wife, White, Female, 0, 0, 50, United-States, <=50K +39, Private, 33355, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 7298, 0, 48, United-States, >50K +68, ?, 196782, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, <=50K +37, Self-emp-inc, 291518, Bachelors, 13, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 15024, 0, 55, United-States, >50K +57, Private, 170244, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 369549, Some-college, 10, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 30, United-States, <=50K +24, Private, 23438, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, >50K +19, Private, 202673, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 50, United-States, <=50K +55, Private, 171780, Assoc-acdm, 12, Divorced, Sales, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +37, Local-gov, 264503, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Local-gov, 244341, Some-college, 10, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +28, Private, 209109, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 187392, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +40, State-gov, 119578, Bachelors, 13, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +51, Private, 195105, HS-grad, 9, Divorced, Priv-house-serv, Own-child, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 101752, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 56, United-States, <=50K +74, ?, 95825, Some-college, 10, Widowed, ?, Not-in-family, White, Female, 0, 0, 3, United-States, <=50K +49, Self-emp-inc, 362654, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +20, ?, 29810, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +40, Federal-gov, 77332, Some-college, 10, Divorced, 
Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +80, Private, 87518, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 1816, 60, United-States, <=50K +63, Private, 113324, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, Private, 96299, HS-grad, 9, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 45, United-States, >50K +51, Private, 237729, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 200973, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +66, Self-emp-not-inc, 212456, HS-grad, 9, Widowed, Craft-repair, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +33, Self-emp-not-inc, 131568, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 66, United-States, <=50K +49, Private, 185859, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 43, United-States, <=50K +20, Private, 231981, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 32, United-States, <=50K +33, Self-emp-inc, 117963, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 60, United-States, >50K +26, Private, 78172, Some-college, 10, Married-AF-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 164135, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +33, Private, 171216, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +47, Private, 140664, Bachelors, 13, Divorced, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +23, Private, 249277, HS-grad, 9, Never-married, Exec-managerial, Own-child, Black, Male, 0, 0, 75, United-States, <=50K +53, Federal-gov, 117847, Assoc-acdm, 12, 
Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 52372, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +26, Federal-gov, 95806, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +53, Private, 137428, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +65, Private, 169047, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 10, United-States, <=50K +68, Private, 339168, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 30, United-States, <=50K +30, Private, 504725, 10th, 6, Never-married, Sales, Other-relative, White, Male, 0, 0, 18, Guatemala, <=50K +28, Private, 132870, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +54, Local-gov, 135840, 10th, 6, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 35644, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, United-States, <=50K +22, Private, 198148, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 50, United-States, <=50K +25, Private, 220098, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +19, Private, 262515, 11th, 7, Never-married, Other-service, Other-relative, White, Male, 0, 0, 20, United-States, <=50K +19, ?, 423863, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +32, Federal-gov, 111567, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Private, 194096, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +51, Local-gov, 420917, Bachelors, 13, Married-civ-spouse, Prof-specialty, 
Husband, White, Male, 0, 0, 45, United-States, <=50K +25, Private, 197871, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 44, United-States, >50K +46, Local-gov, 253116, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +38, Private, 206535, Some-college, 10, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +26, State-gov, 70447, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +46, Private, 201217, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 209970, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 196745, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 594, 0, 16, United-States, <=50K +29, Local-gov, 175262, Masters, 14, Married-civ-spouse, Prof-specialty, Other-relative, White, Male, 0, 0, 35, United-States, <=50K +51, Self-emp-inc, 304955, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +40, Private, 181265, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 52, United-States, <=50K +24, Private, 200973, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Self-emp-not-inc, 37440, Bachelors, 13, Never-married, Farming-fishing, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +31, Private, 395170, Assoc-voc, 11, Married-civ-spouse, Other-service, Wife, Amer-Indian-Eskimo, Female, 0, 0, 24, Mexico, <=50K +54, ?, 32385, HS-grad, 9, Divorced, ?, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +34, Private, 353213, Assoc-acdm, 12, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +19, Private, 38619, Some-college, 10, Never-married, 
Farming-fishing, Own-child, White, Male, 0, 0, 66, United-States, <=50K +21, Private, 177711, HS-grad, 9, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +21, Private, 190761, 10th, 6, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 70, United-States, <=50K +23, Private, 27776, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 24, United-States, <=50K +37, Federal-gov, 470663, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 71738, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 46, United-States, >50K +57, Private, 74156, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 30, United-States, <=50K +48, Private, 202467, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1485, 40, United-States, >50K +24, Private, 123983, 11th, 7, Married-civ-spouse, Transport-moving, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +43, Private, 193494, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, ?, 169886, Bachelors, 13, Never-married, ?, Not-in-family, White, Female, 0, 0, 20, ?, <=50K +40, Private, 130571, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +52, Self-emp-inc, 90363, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 35, United-States, >50K +49, Private, 83444, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Self-emp-not-inc, 239093, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Amer-Indian-Eskimo, Male, 3137, 0, 40, United-States, <=50K +62, Local-gov, 151369, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 56630, Assoc-voc, 11, 
Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 117095, HS-grad, 9, Separated, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +55, Federal-gov, 189985, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +20, ?, 34862, Some-college, 10, Never-married, ?, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 72, United-States, <=50K +37, Self-emp-inc, 126675, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +43, State-gov, 199806, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 57596, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 103459, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +28, Private, 282398, Some-college, 10, Separated, Tech-support, Unmarried, White, Male, 0, 0, 40, United-States, >50K +38, Private, 298841, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +45, Private, 33300, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 1977, 50, United-States, >50K +22, ?, 306031, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 306467, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +20, Private, 189888, 12th, 8, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +60, Private, 83861, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 117393, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 129934, Some-college, 10, Never-married, Adm-clerical, Not-in-family, 
Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +51, Private, 179010, Some-college, 10, Divorced, Sales, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +31, Private, 375680, Bachelors, 13, Never-married, Prof-specialty, Unmarried, Black, Female, 0, 0, 40, ?, <=50K +48, Private, 316101, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 293305, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1887, 40, United-States, >50K +51, Local-gov, 175750, HS-grad, 9, Divorced, Transport-moving, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 121718, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1848, 48, United-States, >50K +62, ?, 94931, Assoc-voc, 11, Married-civ-spouse, ?, Husband, White, Male, 3411, 0, 40, United-States, <=50K +50, State-gov, 229272, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +46, Private, 142828, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +54, Private, 22743, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 15024, 0, 60, United-States, >50K +68, Private, 76371, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 45, United-States, >50K +23, Self-emp-not-inc, 216129, Assoc-acdm, 12, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +49, Private, 107425, Masters, 14, Never-married, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +24, Private, 611029, 10th, 6, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Local-gov, 363032, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 40, United-States, <=50K +38, Private, 170020, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3137, 0, 45, United-States, <=50K +34, 
Private, 137900, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +22, Private, 322674, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 23778, 7th-8th, 4, Separated, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 147845, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 31, United-States, <=50K +36, Private, 175759, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +51, Self-emp-inc, 166459, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 128212, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Wife, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, >50K +54, Federal-gov, 127455, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 48, United-States, >50K +63, Private, 134699, HS-grad, 9, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +51, Private, 254230, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +63, Self-emp-not-inc, 159715, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Local-gov, 116286, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +27, Private, 146719, HS-grad, 9, Divorced, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +35, Private, 361888, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +31, ?, 26553, Bachelors, 13, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 25, United-States, >50K +46, Self-emp-not-inc, 32825, HS-grad, 9, Separated, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 225768, Assoc-voc, 11, 
Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +26, Federal-gov, 393728, Some-college, 10, Divorced, Adm-clerical, Own-child, White, Male, 0, 0, 24, United-States, <=50K +43, Private, 160369, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 191807, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 50, United-States, >50K +50, Federal-gov, 176969, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 1590, 40, United-States, <=50K +54, Federal-gov, 33863, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +62, ?, 182687, Some-college, 10, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 45, United-States, >50K +57, State-gov, 141459, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +19, ?, 174233, Some-college, 10, Never-married, ?, Own-child, Black, Male, 0, 0, 24, United-States, <=50K +29, Local-gov, 95393, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +20, Private, 221095, HS-grad, 9, Never-married, Craft-repair, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +53, Private, 104501, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 55, United-States, >50K +18, ?, 437851, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, ?, 131230, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 495888, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, El-Salvador, <=50K +69, Private, 185691, 11th, 7, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 20, United-States, <=50K +56, Private, 201822, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 2002, 40, United-States, <=50K +53, 
Local-gov, 549341, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 35, United-States, <=50K +28, Private, 247445, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 199566, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, United-States, <=50K +33, Self-emp-inc, 139057, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 84, Taiwan, >50K +48, Private, 185039, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +61, Private, 166124, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Private, 82649, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5013, 0, 45, United-States, <=50K +48, Private, 109275, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 408328, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 186338, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +27, ?, 130856, Bachelors, 13, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 251579, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 14, United-States, <=50K +47, Private, 76612, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +25, Private, 22546, Bachelors, 13, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 60, United-States, <=50K +72, Private, 53684, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 40, United-States, <=50K +29, Private, 183627, 11th, 7, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 73203, HS-grad, 9, 
Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +57, Private, 108426, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 48, England, <=50K +50, Private, 116287, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 60, Columbia, <=50K +45, Self-emp-inc, 145697, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 30, United-States, <=50K +52, Private, 326156, Some-college, 10, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +53, Private, 201127, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, >50K +36, Private, 250791, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 48, United-States, <=50K +46, Private, 328216, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 400443, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +75, Private, 95985, 5th-6th, 3, Widowed, Other-service, Unmarried, Black, Male, 0, 0, 10, United-States, <=50K +32, Local-gov, 127651, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 250679, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 103950, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 200199, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +46, State-gov, 295791, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +39, Private, 191841, Assoc-acdm, 12, Separated, Prof-specialty, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +38, Private, 82622, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, 
<=50K +36, Private, 160728, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 60, United-States, <=50K +63, Local-gov, 109849, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 21, United-States, <=50K +28, Private, 339897, 1st-4th, 2, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 43, Mexico, <=50K +28, ?, 37215, Bachelors, 13, Never-married, ?, Own-child, White, Male, 0, 0, 45, United-States, <=50K +49, Private, 371299, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +43, Private, 421837, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +38, Private, 29702, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +39, Private, 117381, HS-grad, 9, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 62, England, <=50K +42, ?, 240027, HS-grad, 9, Divorced, ?, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +40, Private, 338740, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, ?, 28359, HS-grad, 9, Separated, ?, Unmarried, White, Female, 0, 0, 10, United-States, <=50K +29, ?, 315026, HS-grad, 9, Divorced, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Federal-gov, 314525, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 1741, 45, United-States, <=50K +30, Private, 173005, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, United-States, >50K +44, Private, 286750, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +40, Private, 163985, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 24, United-States, <=50K +30, Private, 219318, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Female, 0, 
0, 35, Puerto-Rico, <=50K +42, Private, 44121, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 1876, 40, United-States, <=50K +52, Self-emp-not-inc, 103794, Assoc-voc, 11, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +42, Private, 310632, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +39, Private, 153976, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 52, United-States, >50K +43, Private, 174575, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Male, 0, 0, 45, United-States, <=50K +62, Self-emp-not-inc, 82388, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 10566, 0, 40, United-States, <=50K +30, Private, 207253, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, England, <=50K +83, ?, 251951, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +39, Private, 746786, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +41, Private, 308296, HS-grad, 9, Married-civ-spouse, Transport-moving, Wife, White, Female, 0, 0, 20, United-States, <=50K +49, Private, 101825, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1977, 40, United-States, >50K +25, Private, 109009, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 413363, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2002, 40, United-States, <=50K +59, ?, 117751, Assoc-acdm, 12, Divorced, ?, Not-in-family, White, Male, 0, 0, 8, United-States, <=50K +44, State-gov, 296326, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 208358, 9th, 5, Divorced, Handlers-cleaners, Not-in-family, White, Male, 4650, 0, 56, United-States, <=50K +40, Private, 120277, 
HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, Ireland, <=50K +21, Private, 193219, HS-grad, 9, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 35, Jamaica, <=50K +41, Private, 86399, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +24, Private, 215251, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +67, Self-emp-not-inc, 124470, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +24, Private, 228649, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 38, United-States, <=50K +50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +48, Private, 96798, Masters, 14, Divorced, Sales, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +55, ?, 106707, Assoc-acdm, 12, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 20, United-States, >50K +29, Private, 159768, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 3325, 0, 40, Ecuador, <=50K +50, Private, 139464, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 36, Ireland, <=50K +64, State-gov, 550848, 10th, 6, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Private, 68505, 9th, 5, Divorced, Other-service, Not-in-family, Black, Male, 0, 0, 37, United-States, <=50K +20, Private, 122215, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 52, United-States, <=50K +30, Private, 159442, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 80638, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 30, China, <=50K +52, Private, 192390, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 
0, 40, United-States, >50K +22, Private, 191324, Some-college, 10, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 25, United-States, <=50K +77, ?, 147284, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 14, United-States, <=50K +19, State-gov, 73009, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 15, United-States, <=50K +52, Private, 177858, HS-grad, 9, Divorced, Craft-repair, Other-relative, White, Male, 0, 0, 55, United-States, >50K +42, Private, 163003, Bachelors, 13, Married-spouse-absent, Tech-support, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +35, Private, 95551, HS-grad, 9, Separated, Exec-managerial, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +27, Private, 125298, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +54, State-gov, 198186, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 38, United-States, <=50K +37, Private, 295949, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1628, 40, United-States, <=50K +37, Private, 182668, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 35, United-States, <=50K +28, Private, 124905, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, United-States, <=50K +63, Private, 171635, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 376240, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 42, United-States, <=50K +28, Private, 157391, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +23, ?, 114357, Some-college, 10, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 178134, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +31, Private, 
207201, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 124483, Bachelors, 13, Never-married, Sales, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 50, ?, >50K +64, Private, 102103, HS-grad, 9, Divorced, Priv-house-serv, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +40, Private, 92036, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +59, Local-gov, 236426, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +22, Private, 400966, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 404573, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 44, United-States, <=50K +35, Private, 227571, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +20, Private, 145917, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +35, Local-gov, 190226, HS-grad, 9, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 356555, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, <=50K +28, Private, 66473, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, ?, 172256, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, <=50K +72, ?, 118902, Doctorate, 16, Married-civ-spouse, ?, Husband, White, Male, 0, 2392, 6, United-States, >50K +25, Self-emp-inc, 163039, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +37, Private, 89559, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +19, ?, 35507, Some-college, 10, Never-married, ?, Own-child, White, Female, 
0, 0, 45, United-States, <=50K +31, Private, 163303, Assoc-voc, 11, Divorced, Sales, Own-child, White, Female, 0, 0, 38, United-States, <=50K +41, Private, 192712, HS-grad, 9, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 381153, 10th, 6, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 222434, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 34706, Some-college, 10, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 47, United-States, <=50K +57, Self-emp-not-inc, 47857, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 195216, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +44, Self-emp-inc, 103643, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5013, 0, 60, Greece, <=50K +29, Local-gov, 329426, HS-grad, 9, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 183612, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 40, United-States, >50K +40, Private, 184105, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 211385, Assoc-acdm, 12, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 35, Jamaica, <=50K +21, Private, 61777, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +34, Self-emp-not-inc, 320194, Prof-school, 15, Separated, Prof-specialty, Unmarried, White, Male, 0, 0, 48, United-States, >50K +24, Private, 199444, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 15, United-States, <=50K +28, Private, 312588, 10th, 6, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, 
United-States, <=50K +35, Private, 168675, HS-grad, 9, Separated, Transport-moving, Own-child, White, Male, 0, 0, 50, United-States, <=50K +35, Private, 87556, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, State-gov, 220421, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Federal-gov, 404599, HS-grad, 9, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +39, Private, 99065, Bachelors, 13, Married-civ-spouse, Handlers-cleaners, Wife, White, Female, 0, 0, 40, Poland, >50K +57, Local-gov, 109973, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 246652, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 57423, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +23, Private, 291248, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +50, Private, 163708, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +51, Self-emp-not-inc, 240358, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +28, Private, 25955, Assoc-voc, 11, Divorced, Craft-repair, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +44, Private, 101593, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +29, Self-emp-not-inc, 227890, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +31, Private, 225053, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +27, Private, 228472, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, 
>50K +34, Private, 245378, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +50, Self-emp-inc, 156623, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 7688, 0, 50, Philippines, >50K +27, Private, 35032, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 258849, Assoc-voc, 11, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 190115, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 63910, Some-college, 10, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +40, Private, 510072, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K +28, Private, 210867, 11th, 7, Divorced, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 263024, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 306785, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +58, Self-emp-inc, 104333, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +66, Private, 340734, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 288585, HS-grad, 9, Married-civ-spouse, Other-service, Wife, Asian-Pac-Islander, Female, 0, 0, 20, South, <=50K +38, Private, 241765, 11th, 7, Divorced, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +25, Private, 111058, Assoc-acdm, 12, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 104662, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 22, United-States, <=50K +90, Private, 313986, 
11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +41, Local-gov, 52037, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +34, ?, 146589, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +33, Private, 131776, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 914, 0, 40, Germany, <=50K +33, Private, 254221, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 152909, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 45, United-States, >50K +39, Self-emp-not-inc, 211785, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +59, Private, 160362, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +19, Private, 387215, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 1719, 16, United-States, <=50K +39, Private, 187046, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 38, United-States, <=50K +19, ?, 208874, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 169631, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 202956, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Self-emp-not-inc, 80467, HS-grad, 9, Divorced, Other-service, Own-child, White, Female, 0, 0, 24, United-States, <=50K +28, Private, 407672, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 243425, HS-grad, 9, Divorced, Other-service, Other-relative, White, Female, 0, 0, 50, Peru, <=50K +50, ?, 174964, 10th, 6, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 99, United-States, <=50K +36, 
Private, 347491, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +34, Private, 146161, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +23, Private, 449432, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, ?, 175499, 11th, 7, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K +33, Private, 288825, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 2258, 84, United-States, <=50K +27, Local-gov, 134813, Masters, 14, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 52, United-States, <=50K +31, Local-gov, 190401, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 260617, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 36, United-States, <=50K +31, Private, 45604, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 54, United-States, <=50K +59, Private, 67841, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Local-gov, 244522, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 48, United-States, >50K +19, Private, 430471, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 194698, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +34, Private, 94235, Bachelors, 13, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 188330, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 78, United-States, <=50K +51, Local-gov, 146181, 9th, 5, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +21, Private, 177125, Some-college, 10, 
Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +30, Self-emp-inc, 68330, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +46, Private, 95636, Some-college, 10, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +40, Private, 238329, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +52, Private, 416129, Preschool, 1, Married-civ-spouse, Other-service, Not-in-family, White, Male, 0, 0, 40, El-Salvador, <=50K +23, Private, 285004, Bachelors, 13, Never-married, Sales, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 50, Taiwan, <=50K +90, ?, 256514, Bachelors, 13, Widowed, ?, Other-relative, White, Female, 991, 0, 10, United-States, <=50K +25, Private, 186294, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +43, Private, 188786, Some-college, 10, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +38, State-gov, 31352, Some-college, 10, Divorced, Protective-serv, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, >50K +22, Private, 197613, Assoc-voc, 11, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +33, Local-gov, 161942, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 1055, 0, 40, United-States, <=50K +34, Private, 275438, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5178, 0, 40, United-States, >50K +65, Private, 361721, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 20, United-States, <=50K +50, Private, 144968, HS-grad, 9, Never-married, Tech-support, Own-child, White, Male, 0, 0, 15, United-States, <=50K +29, Private, 190539, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 6849, 0, 48, United-States, <=50K +25, Private, 178037, HS-grad, 
9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 306985, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 87928, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 242619, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 154165, 9th, 5, Divorced, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 511331, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +65, Local-gov, 221026, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 20, United-States, <=50K +56, Self-emp-not-inc, 222182, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 45, United-States, <=50K +39, Self-emp-not-inc, 126569, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 60, United-States, >50K +23, Private, 202344, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 190423, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +24, Private, 238917, 5th-6th, 3, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, El-Salvador, <=50K +41, Private, 221947, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 50, United-States, >50K +40, Self-emp-inc, 37997, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +55, Private, 147098, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +38, Private, 278253, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +23, Private, 195411, 
HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 76196, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +45, Private, 120131, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1887, 40, United-States, >50K +20, Self-emp-not-inc, 186014, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 12, Germany, <=50K +29, Private, 205903, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +43, State-gov, 125405, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 219838, 12th, 8, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, State-gov, 19395, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +31, Private, 223327, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +52, Private, 114062, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 95654, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, Iran, >50K +38, Private, 177305, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +66, ?, 299616, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, Self-emp-not-inc, 117681, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 237651, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, <=50K +33, State-gov, 150570, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, State-gov, 106705, Some-college, 10, Divorced, 
Prof-specialty, Unmarried, White, Female, 1506, 0, 50, United-States, <=50K +20, ?, 174714, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 20, United-States, <=50K +47, Self-emp-inc, 175958, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 3325, 0, 60, United-States, <=50K +33, Private, 144064, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +66, ?, 107112, 7th-8th, 4, Never-married, ?, Other-relative, Black, Male, 0, 0, 30, United-States, <=50K +20, Private, 54152, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, ?, <=50K +28, Private, 152951, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 190487, HS-grad, 9, Divorced, Priv-house-serv, Unmarried, White, Female, 0, 0, 28, Ecuador, <=50K +25, Private, 306666, Some-college, 10, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 45, United-States, <=50K +37, Private, 195148, HS-grad, 9, Married-civ-spouse, Craft-repair, Own-child, White, Male, 3137, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 226624, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +49, Private, 157569, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, State-gov, 22966, Some-college, 10, Married-spouse-absent, Tech-support, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +52, Private, 379682, Assoc-voc, 11, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 20, United-States, >50K +29, Private, 446559, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 50, United-States, <=50K +18, Private, 41794, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +31, Local-gov, 90409, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, 
United-States, <=50K +23, Private, 125491, Some-college, 10, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 35, Vietnam, <=50K +27, ?, 129661, Assoc-voc, 11, Married-civ-spouse, ?, Wife, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, >50K +54, Self-emp-not-inc, 104748, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 65, United-States, <=50K +50, Local-gov, 169182, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 49, Dominican-Republic, <=50K +46, Private, 324655, Masters, 14, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1902, 40, ?, >50K +24, Private, 122272, Bachelors, 13, Never-married, Farming-fishing, Own-child, White, Female, 0, 0, 40, United-States, <=50K +17, ?, 114798, 11th, 7, Never-married, ?, Own-child, White, Female, 0, 0, 18, United-States, <=50K +49, Self-emp-inc, 289707, HS-grad, 9, Separated, Other-service, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +54, Local-gov, 137691, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Private, 84610, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 60, United-States, >50K +49, Private, 166789, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Local-gov, 348728, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +23, Private, 348092, HS-grad, 9, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, Haiti, <=50K +63, Private, 154526, Some-college, 10, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +67, Private, 288371, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, Canada, >50K +23, Private, 182342, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 244366, Bachelors, 13, 
Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +66, Private, 102423, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 30, United-States, <=50K +25, Private, 259688, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 98733, Some-college, 10, Divorced, Machine-op-inspct, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +35, Private, 174856, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 2885, 0, 40, United-States, <=50K +67, Self-emp-not-inc, 141797, 7th-8th, 4, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 327202, 12th, 8, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +26, Private, 76996, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 38, United-States, <=50K +34, Private, 260560, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 370990, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 129010, 12th, 8, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 10, United-States, <=50K +21, Private, 452640, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +76, Self-emp-inc, 120796, 9th, 5, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Federal-gov, 45334, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, Asian-Pac-Islander, Male, 0, 0, 70, ?, <=50K +26, Private, 229523, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 56, United-States, <=50K +18, Private, 127388, 12th, 8, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +18, ?, 395567, 11th, 7, Never-married, ?, 
Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 119422, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 1672, 50, United-States, <=50K +59, Private, 193895, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, Private, 163083, Bachelors, 13, Separated, Exec-managerial, Not-in-family, White, Male, 14084, 0, 45, United-States, >50K +33, Self-emp-not-inc, 155343, Some-college, 10, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 72, United-States, <=50K +25, Private, 73895, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 10, United-States, <=50K +48, Private, 107682, HS-grad, 9, Widowed, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 10, United-States, <=50K +64, Private, 321166, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 0, 5, United-States, <=50K +47, Local-gov, 154940, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, >50K +26, Private, 103700, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 63509, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 48, United-States, >50K +21, Private, 243842, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, ?, 187221, 7th-8th, 4, Never-married, ?, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +30, Private, 58597, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 44, United-States, <=50K +41, Self-emp-not-inc, 190290, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, ?, 158352, Masters, 14, Never-married, ?, Not-in-family, White, Female, 8614, 0, 35, United-States, >50K +34, Private, 62165, Some-college, 10, Never-married, Sales, Other-relative, Black, Male, 0, 0, 
30, United-States, <=50K +20, ?, 307149, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +24, Private, 280134, 10th, 6, Never-married, Sales, Not-in-family, White, Male, 0, 0, 49, El-Salvador, <=50K +26, Private, 118736, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +25, Private, 171114, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 48, United-States, <=50K +35, Private, 169638, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +41, Private, 125461, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +33, Private, 145434, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 152182, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +27, Self-emp-inc, 233724, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +32, Private, 153963, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +51, Local-gov, 88120, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +38, Private, 96330, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +41, Local-gov, 66118, Some-college, 10, Married-civ-spouse, Transport-moving, Wife, White, Female, 0, 0, 25, United-States, <=50K +26, Private, 182178, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 2829, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 53628, Assoc-voc, 11, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 35, United-States, <=50K +54, Private, 174865, 9th, 5, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +30, Private, 66194, Bachelors, 13, 
Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 60, Outlying-US(Guam-USVI-etc), <=50K +31, Private, 73796, Some-college, 10, Widowed, Exec-managerial, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +26, State-gov, 28366, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +75, Self-emp-not-inc, 231741, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 4931, 0, 3, United-States, <=50K +29, Private, 237865, Masters, 14, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +61, Private, 195453, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 116934, Some-college, 10, Separated, Sales, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +22, ?, 87867, 12th, 8, Never-married, ?, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +34, Private, 456399, Assoc-acdm, 12, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 263608, Some-college, 10, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 263498, 11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Self-emp-not-inc, 183765, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, ?, <=50K +27, Federal-gov, 469705, HS-grad, 9, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 1980, 40, United-States, <=50K +39, Local-gov, 113253, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +20, Private, 138768, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 302146, 11th, 7, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +68, Private, 253866, Bachelors, 13, Married-civ-spouse, Sales, 
Husband, White, Male, 0, 0, 30, United-States, <=50K +28, Federal-gov, 214858, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 48, United-States, <=50K +43, Private, 243476, HS-grad, 9, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 169104, Some-college, 10, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 103218, HS-grad, 9, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +41, Private, 57233, Bachelors, 13, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 228320, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, >50K +20, Private, 217421, HS-grad, 9, Married-civ-spouse, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 185041, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 75, United-States, >50K +27, Self-emp-not-inc, 37302, Assoc-acdm, 12, Married-civ-spouse, Transport-moving, Husband, White, Male, 7688, 0, 70, United-States, >50K +32, Private, 261059, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +46, Private, 59767, Some-college, 10, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 333541, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 24, United-States, <=50K +20, Private, 133352, Some-college, 10, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, <=50K +36, Private, 99270, HS-grad, 9, Married-civ-spouse, Farming-fishing, Wife, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 204629, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Private, 34104, HS-grad, 9, Married-civ-spouse, 
Exec-managerial, Husband, White, Male, 3103, 0, 55, United-States, >50K +32, Private, 312667, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, United-States, >50K +49, Private, 329603, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1902, 40, United-States, >50K +36, Private, 281021, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +22, Private, 275385, Some-college, 10, Never-married, Other-service, Other-relative, White, Male, 0, 0, 25, United-States, <=50K +52, Federal-gov, 129177, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 385591, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +22, ?, 201179, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +72, Private, 38360, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 16, United-States, <=50K +30, Local-gov, 73796, Bachelors, 13, Separated, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 67671, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 257621, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +22, Private, 180052, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +59, Private, 656036, Bachelors, 13, Separated, Adm-clerical, Unmarried, White, Male, 0, 0, 60, United-States, <=50K +46, Private, 215943, HS-grad, 9, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 488720, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +64, Federal-gov, 199298, 7th-8th, 4, Widowed, Other-service, Unmarried, White, Female, 0, 0, 30, 
Puerto-Rico, <=50K +31, Private, 305692, Some-college, 10, Married-civ-spouse, Sales, Wife, Black, Female, 0, 0, 40, United-States, <=50K +64, Private, 114994, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 20, United-States, <=50K +45, Private, 88265, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 168569, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 1887, 40, United-States, >50K +32, Private, 175413, HS-grad, 9, Never-married, Adm-clerical, Other-relative, Black, Female, 0, 0, 40, Jamaica, <=50K +43, Private, 161226, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +66, ?, 160995, 10th, 6, Divorced, ?, Not-in-family, White, Female, 1086, 0, 20, United-States, <=50K +23, Private, 208598, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +49, Self-emp-not-inc, 200471, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 256609, 12th, 8, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +49, Private, 176684, Assoc-voc, 11, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 206512, Some-college, 10, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 212640, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 85, United-States, <=50K +47, Private, 148724, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 266510, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Local-gov, 240252, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 
358975, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +20, ?, 124242, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 434710, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +25, Private, 204338, HS-grad, 9, Never-married, Farming-fishing, Unmarried, White, Male, 0, 0, 30, ?, <=50K +46, Private, 241844, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 191342, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Cambodia, <=50K +41, Private, 221947, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 56, United-States, >50K +44, Private, 111483, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 1504, 50, United-States, <=50K +30, Private, 65278, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +54, Private, 133403, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 166416, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 99, United-States, <=50K +58, ?, 142158, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +21, Private, 221480, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 25, Ecuador, <=50K +35, Self-emp-not-inc, 189878, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 278403, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 80, United-States, >50K +19, Private, 184710, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 20, United-States, <=50K +48, Private, 177775, HS-grad, 9, Divorced, Exec-managerial, 
Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +56, ?, 275943, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, Nicaragua, <=50K +65, Self-emp-not-inc, 225473, Some-college, 10, Widowed, Craft-repair, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +40, Private, 289403, Bachelors, 13, Separated, Adm-clerical, Unmarried, Black, Male, 0, 0, 35, United-States, <=50K +26, Private, 269060, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 449354, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 214413, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 80058, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 202027, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 27828, 0, 50, United-States, >50K +22, Self-emp-not-inc, 123440, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +37, Private, 191524, Assoc-voc, 11, Separated, Prof-specialty, Own-child, White, Female, 0, 0, 38, United-States, <=50K +25, Private, 308144, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +64, Private, 164204, 1st-4th, 2, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 53, ?, <=50K +46, Private, 205100, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, >50K +30, Private, 195750, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 27, United-States, <=50K +63, Private, 149756, Assoc-acdm, 12, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +51, Local-gov, 240358, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 
0, 40, United-States, >50K +68, Self-emp-not-inc, 241174, 7th-8th, 4, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 16, United-States, <=50K +36, Private, 356838, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, Canada, <=50K +28, Self-emp-inc, 115705, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +41, Local-gov, 137142, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 296066, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 401335, Some-college, 10, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +33, ?, 182771, Bachelors, 13, Never-married, ?, Own-child, Asian-Pac-Islander, Male, 0, 0, 80, Philippines, <=50K +34, Self-emp-inc, 186824, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 40, United-States, >50K +46, Federal-gov, 162187, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 98010, Some-college, 10, Married-spouse-absent, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 172538, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 52, United-States, >50K +18, Private, 80163, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +33, Local-gov, 43959, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 50, United-States, >50K +51, Private, 162632, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 7298, 0, 60, United-States, >50K +56, Self-emp-not-inc, 115422, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, <=50K +54, Private, 100933, HS-grad, 9, Never-married, Adm-clerical, 
Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 270379, HS-grad, 9, Never-married, Exec-managerial, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +40, Private, 20109, Some-college, 10, Divorced, Handlers-cleaners, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 84, United-States, <=50K +53, Private, 114758, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 65, United-States, >50K +22, Private, 100345, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 35, United-States, <=50K +33, Private, 184901, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 87239, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +63, Private, 127363, Some-college, 10, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 12, United-States, <=50K +53, Federal-gov, 199720, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 60, Germany, >50K +37, Private, 143058, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +50, Federal-gov, 36489, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 141698, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Federal-gov, 26358, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 195532, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 8614, 0, 40, United-States, >50K +21, Private, 30039, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 125159, Assoc-acdm, 12, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, Jamaica, <=50K +20, Private, 246250, Some-college, 10, Never-married, 
Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Federal-gov, 77370, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 355569, Assoc-voc, 11, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +32, Private, 180603, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +42, Private, 201785, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +33, Private, 256211, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +27, Private, 146764, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +22, ?, 211968, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, Iran, <=50K +29, Private, 200515, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 38, United-States, <=50K +29, Private, 52636, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 27049, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 20, United-States, <=50K +35, Private, 111128, 10th, 6, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 348038, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 50, Puerto-Rico, >50K +33, Private, 93930, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +67, Private, 397831, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1539, 40, United-States, <=50K +46, Private, 33794, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 10, United-States, <=50K +45, Private, 178215, 9th, 5, Married-civ-spouse, 
Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, >50K +17, Local-gov, 191910, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +35, Private, 340110, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1848, 70, United-States, >50K +48, Self-emp-not-inc, 133694, Bachelors, 13, Married-spouse-absent, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, Jamaica, >50K +49, Private, 148398, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +20, Private, 133515, Some-college, 10, Never-married, Sales, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 181667, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 5013, 0, 46, Canada, <=50K +64, Private, 159715, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +53, Federal-gov, 174040, Some-college, 10, Separated, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +52, Private, 117700, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 37215, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, <=50K +32, Self-emp-inc, 46807, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 99999, 0, 40, United-States, >50K +48, Self-emp-not-inc, 317360, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 20, United-States, >50K +30, Private, 425627, Some-college, 10, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +34, Private, 82623, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 20, United-States, <=50K +19, ?, 63574, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 50, United-States, <=50K +39, Private, 140854, HS-grad, 9, Never-married, Prof-specialty, 
Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +28, Private, 185061, 11th, 7, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 160118, 12th, 8, Never-married, Sales, Not-in-family, White, Female, 0, 0, 10, ?, <=50K +54, Private, 282680, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +24, Private, 137591, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 1762, 40, United-States, <=50K +25, Private, 198163, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 132749, 11th, 7, Divorced, Other-service, Unmarried, White, Female, 0, 0, 12, United-States, <=50K +48, Local-gov, 31264, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 5178, 0, 40, United-States, >50K +24, Private, 399449, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 27494, Some-college, 10, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 50, Taiwan, <=50K +47, Private, 368561, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 102096, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, ?, >50K +19, Private, 406078, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 25, United-States, <=50K +52, Self-emp-inc, 100506, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 50, United-States, >50K +52, Private, 29658, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, ?, 20469, HS-grad, 9, Never-married, ?, Other-relative, Asian-Pac-Islander, Female, 0, 0, 12, South, <=50K +60, Private, 181953, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 28, United-States, <=50K +43, 
Private, 304175, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 170070, Assoc-acdm, 12, Divorced, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +20, ?, 193416, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 194908, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 357962, 9th, 5, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 214716, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 10, United-States, <=50K +40, Self-emp-inc, 207578, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, >50K +54, Private, 146409, Some-college, 10, Widowed, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 341643, Bachelors, 13, Never-married, Other-service, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +52, Private, 131631, 11th, 7, Separated, Machine-op-inspct, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +53, Private, 88842, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 99999, 0, 40, United-States, >50K +56, ?, 128900, Some-college, 10, Widowed, ?, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +35, Private, 417136, HS-grad, 9, Divorced, Craft-repair, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 336763, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 880, 42, United-States, <=50K +29, Private, 209301, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Canada, <=50K +29, Private, 120986, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 65, United-States, <=50K +27, Private, 51025, HS-grad, 9, Never-married, Other-service, 
Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 218281, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, Mexico, <=50K +64, Private, 114994, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 18, United-States, <=50K +53, Private, 335481, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 32, United-States, <=50K +21, Private, 174503, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +40, Self-emp-not-inc, 230478, Assoc-acdm, 12, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +52, State-gov, 149650, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, Iran, >50K +38, Private, 149419, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +40, ?, 341539, Some-college, 10, Divorced, ?, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +39, Private, 185099, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, ?, 132930, Masters, 14, Never-married, ?, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +68, Private, 128472, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +24, Private, 124971, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +40, Self-emp-inc, 344060, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +43, Self-emp-inc, 286750, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 99, United-States, >50K +38, Private, 296999, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 70, United-States, <=50K +45, Private, 123681, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, 
United-States, >50K +18, Private, 232024, 11th, 7, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 55, United-States, <=50K +57, Local-gov, 52267, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +49, Private, 119182, HS-grad, 9, Separated, Other-service, Not-in-family, Black, Female, 0, 0, 35, United-States, <=50K +25, Private, 191230, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, Yugoslavia, <=50K +52, Federal-gov, 23780, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 184553, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +26, Self-emp-inc, 242651, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +19, Private, 246226, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Self-emp-inc, 86745, Bachelors, 13, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +25, Private, 106889, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 30, United-States, <=50K +21, Private, 460835, HS-grad, 9, Never-married, Sales, Other-relative, White, Male, 0, 0, 45, United-States, <=50K +48, Self-emp-not-inc, 213140, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Italy, <=50K +33, State-gov, 37070, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, Canada, <=50K +31, State-gov, 93589, HS-grad, 9, Divorced, Protective-serv, Own-child, Other, Male, 0, 0, 40, United-States, <=50K +26, Self-emp-not-inc, 213258, HS-grad, 9, Divorced, Farming-fishing, Unmarried, White, Male, 0, 0, 65, United-States, <=50K +37, State-gov, 46814, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +29, ?, 
168873, Some-college, 10, Divorced, ?, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +20, Private, 284737, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +28, Private, 309620, Some-college, 10, Married-civ-spouse, Sales, Husband, Other, Male, 0, 0, 60, ?, <=50K +49, Private, 197418, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +73, ?, 132737, 10th, 6, Never-married, ?, Not-in-family, White, Male, 0, 0, 4, United-States, <=50K +49, Private, 185041, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 40, United-States, >50K +51, Private, 159604, Some-college, 10, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +40, Private, 123557, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 275421, Assoc-voc, 11, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 167147, 12th, 8, Never-married, Sales, Own-child, White, Male, 0, 0, 24, United-States, <=50K +41, Private, 197583, 10th, 6, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, >50K +46, Private, 175109, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1485, 40, United-States, >50K +46, Private, 117502, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +64, Private, 180401, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Self-emp-not-inc, 146603, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +53, State-gov, 143822, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 36, United-States, >50K +21, Private, 51985, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, 
United-States, <=50K +22, State-gov, 48121, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 25, United-States, <=50K +37, Private, 234807, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 7430, 0, 45, United-States, >50K +39, Federal-gov, 65324, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +30, Private, 302149, Bachelors, 13, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, <=50K +25, Private, 168403, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 1741, 40, United-States, <=50K +26, Private, 159897, Some-college, 10, Never-married, Exec-managerial, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 416338, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +59, Private, 370615, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 60, United-States, <=50K +27, Private, 219371, HS-grad, 9, Married-spouse-absent, Adm-clerical, Unmarried, White, Female, 0, 0, 40, Jamaica, <=50K +55, Private, 120970, 10th, 6, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +20, Private, 22966, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 12, Canada, <=50K +25, Private, 34541, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 36, Canada, <=50K +28, Private, 191027, Assoc-acdm, 12, Separated, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 107458, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +60, Private, 121832, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +37, Local-gov, 233825, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 15024, 0, 50, United-States, >50K +25, Private, 73839, 
11th, 7, Divorced, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 109165, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +50, State-gov, 103063, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Self-emp-not-inc, 29762, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 5013, 0, 70, United-States, <=50K +46, Private, 111979, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 47, United-States, <=50K +35, Private, 150125, Assoc-voc, 11, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +21, ?, 301853, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, ?, 296738, 11th, 7, Separated, ?, Not-in-family, White, Female, 6849, 0, 60, United-States, <=50K +40, Private, 118001, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +49, Private, 149337, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 36601, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +43, Local-gov, 118600, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 625, 40, United-States, <=50K +39, Private, 279272, Assoc-acdm, 12, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 60, United-States, <=50K +35, Private, 181020, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 60, United-States, <=50K +52, Private, 165998, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 218136, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, Outlying-US(Guam-USVI-etc), <=50K +20, Self-emp-inc, 182200, 
HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 30, United-States, <=50K +46, Private, 39363, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 10, ?, <=50K +24, Private, 140001, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 193260, Bachelors, 13, Married-civ-spouse, Craft-repair, Other-relative, Asian-Pac-Islander, Male, 0, 0, 30, India, <=50K +21, Private, 191243, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Federal-gov, 207887, Bachelors, 13, Divorced, Exec-managerial, Other-relative, White, Female, 0, 0, 50, United-States, <=50K +43, Federal-gov, 211450, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 184759, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 26, United-States, <=50K +47, Private, 197836, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +61, Private, 232308, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, >50K +21, ?, 189888, Assoc-acdm, 12, Never-married, ?, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +35, Private, 301614, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 60, United-States, <=50K +60, Private, 146674, 10th, 6, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +27, Private, 225291, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +57, Local-gov, 148509, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 35, India, <=50K +56, Private, 136413, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 126060, HS-grad, 9, 
Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 73064, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 35, United-States, <=50K +19, Private, 39026, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +28, Self-emp-not-inc, 33035, 12th, 8, Divorced, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +43, Private, 193494, 10th, 6, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +63, Local-gov, 147440, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +22, ?, 153131, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 64671, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +35, Self-emp-not-inc, 225399, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 8614, 0, 40, United-States, >50K +20, Private, 174391, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 377757, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, <=50K +30, Local-gov, 364310, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, Germany, <=50K +31, Private, 110643, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 70240, HS-grad, 9, Never-married, Sales, Own-child, Asian-Pac-Islander, Female, 0, 0, 24, Philippines, <=50K +57, State-gov, 32694, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Private, 95047, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, White, Female, 7688, 0, 44, United-States, >50K +33, Private, 264936, HS-grad, 9, Divorced, Adm-clerical, 
Other-relative, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 367329, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 56026, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, >50K +22, Private, 186452, 10th, 6, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +50, Private, 125417, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 40, United-States, >50K +40, Self-emp-not-inc, 242082, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +37, Private, 31023, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 51, United-States, <=50K +40, ?, 397346, Assoc-acdm, 12, Divorced, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 424079, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 55, United-States, >50K +38, Self-emp-not-inc, 108947, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 7688, 0, 40, United-States, >50K +25, State-gov, 261979, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 55507, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +22, ?, 291407, 12th, 8, Never-married, ?, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +18, Private, 353358, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 16, United-States, <=50K +41, Private, 67339, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 7688, 0, 40, United-States, >50K +33, Private, 235109, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 208180, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, 
United-States, <=50K +67, State-gov, 423561, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 145290, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2415, 50, United-States, >50K +24, Private, 403671, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Local-gov, 49325, 7th-8th, 4, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 370494, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +25, Private, 267012, Assoc-voc, 11, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +33, Private, 191856, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, Private, 80445, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 379798, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Local-gov, 168387, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +18, Private, 301948, HS-grad, 9, Never-married, Protective-serv, Own-child, White, Male, 34095, 0, 3, United-States, <=50K +36, Private, 274809, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +58, Private, 233193, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 27, United-States, <=50K +34, Private, 299635, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 50, United-States, >50K +19, Private, 236396, 11th, 7, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 688355, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K 
+36, Self-emp-inc, 37019, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +37, Private, 148015, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 15024, 0, 40, United-States, >50K +43, Private, 122975, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 21, Trinadad&Tobago, <=50K +52, State-gov, 349795, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 229846, Assoc-voc, 11, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, ?, <=50K +43, Private, 108945, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +22, Private, 237498, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 188872, 5th-6th, 3, Divorced, Transport-moving, Unmarried, White, Male, 6497, 0, 40, United-States, <=50K +37, Private, 324019, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 82488, Some-college, 10, Divorced, Sales, Unmarried, Asian-Pac-Islander, Female, 0, 0, 38, United-States, <=50K +54, Private, 206964, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 37088, Assoc-acdm, 12, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 152540, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +65, Private, 143554, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +30, Private, 126242, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 127185, 9th, 5, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, 
Private, 164018, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 50, United-States, <=50K +25, Private, 210184, 11th, 7, Separated, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, ?, 117528, Assoc-voc, 11, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 40, United-States, >50K +47, Private, 124973, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +23, Private, 182117, Assoc-acdm, 12, Never-married, Other-service, Own-child, White, Male, 0, 0, 60, United-States, <=50K +42, Private, 220049, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 40, United-States, >50K +39, Self-emp-not-inc, 247975, Some-college, 10, Never-married, Craft-repair, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 30, United-States, <=50K +55, Private, 50164, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +24, State-gov, 123160, Masters, 14, Married-spouse-absent, Prof-specialty, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 10, China, <=50K +46, Self-emp-inc, 219962, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 7298, 0, 40, ?, >50K +53, Private, 79324, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 129100, 11th, 7, Separated, Other-service, Unmarried, Black, Female, 0, 0, 60, United-States, <=50K +40, Private, 210275, HS-grad, 9, Separated, Transport-moving, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +48, Private, 189462, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 40, United-States, >50K +26, Private, 171114, Assoc-voc, 11, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +22, Private, 201799, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, ?, 200426, HS-grad, 9, Married-civ-spouse, 
?, Wife, White, Female, 0, 0, 12, United-States, <=50K +20, ?, 24395, Some-college, 10, Never-married, ?, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +43, Private, 191149, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Local-gov, 34173, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 25, United-States, <=50K +30, Private, 350979, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Laos, <=50K +41, Private, 147314, HS-grad, 9, Married-civ-spouse, Sales, Husband, Amer-Indian-Eskimo, Male, 0, 0, 50, United-States, <=50K +38, Private, 136081, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +77, ?, 232894, 9th, 5, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 40, United-States, <=50K +42, Self-emp-not-inc, 373403, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 120601, HS-grad, 9, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +36, Private, 130926, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Female, 3674, 0, 40, United-States, <=50K +32, Federal-gov, 72338, Assoc-voc, 11, Never-married, Prof-specialty, Other-relative, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +27, Private, 129624, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +25, State-gov, 328697, Some-college, 10, Divorced, Protective-serv, Other-relative, White, Male, 0, 0, 45, United-States, <=50K +40, Private, 191196, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +18, ?, 191117, 11th, 7, Never-married, ?, Own-child, White, Male, 0, 0, 25, United-States, <=50K +49, Private, 110243, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, 
United-States, >50K +17, Private, 181580, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 16, United-States, <=50K +29, Private, 89030, HS-grad, 9, Never-married, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 345493, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 99999, 0, 55, Taiwan, >50K +24, Self-emp-not-inc, 277700, Some-college, 10, Separated, Handlers-cleaners, Own-child, White, Male, 0, 0, 45, United-States, <=50K +58, ?, 198478, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 24, United-States, <=50K +29, Private, 250679, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 168837, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 24, Canada, >50K +30, Private, 142675, 12th, 8, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +19, Private, 299050, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 20, United-States, <=50K +59, Private, 107833, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1485, 40, United-States, >50K +47, Private, 121958, 7th-8th, 4, Married-spouse-absent, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +41, Private, 282948, Some-college, 10, Married-civ-spouse, Tech-support, Husband, Black, Male, 3137, 0, 40, United-States, <=50K +28, Private, 176683, Assoc-acdm, 12, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, France, <=50K +46, Private, 34377, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Self-emp-not-inc, 209833, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +66, State-gov, 41506, 10th, 6, Divorced, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +30, Local-gov, 
125159, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Male, 14084, 0, 45, ?, >50K +44, Self-emp-not-inc, 147206, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 12, United-States, <=50K +58, Self-emp-not-inc, 93664, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +21, Private, 315065, 7th-8th, 4, Never-married, Other-service, Other-relative, White, Male, 0, 0, 48, Mexico, <=50K +59, Private, 381851, 9th, 5, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +35, Local-gov, 185769, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 186272, 9th, 5, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 5178, 0, 40, United-States, >50K +30, Private, 312667, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 343925, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, Jamaica, <=50K +26, Private, 195994, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +48, Private, 398843, Some-college, 10, Separated, Sales, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +31, Private, 73514, HS-grad, 9, Never-married, Sales, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +36, Private, 288049, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +48, Private, 54759, HS-grad, 9, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +30, Private, 155343, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 5013, 0, 40, United-States, <=50K +33, Private, 401104, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, United-States, >50K +19, ?, 124884, 9th, 5, Never-married, ?, 
Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +37, Local-gov, 287306, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 99999, 0, 40, ?, >50K +53, Private, 113995, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +18, Private, 146378, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, ?, <=50K +38, Private, 111499, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 7298, 0, 50, United-States, >50K +34, Private, 34374, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +45, Private, 162187, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, >50K +33, Local-gov, 147654, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +35, Private, 182467, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 44, United-States, <=50K +22, Private, 183970, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 332588, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +45, Private, 26781, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, Amer-Indian-Eskimo, Male, 0, 0, 8, United-States, <=50K +17, Private, 48610, 11th, 7, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 45, United-States, <=50K +50, Private, 162632, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +38, Local-gov, 91711, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +33, Private, 198003, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 50, United-States, >50K +46, Private, 179048, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, 
Male, 0, 0, 35, ?, <=50K +25, Private, 262778, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 6849, 0, 50, United-States, <=50K +64, Private, 102470, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +62, Self-emp-not-inc, 123170, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 10, United-States, <=50K +32, Private, 164243, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 60, United-States, >50K +17, Private, 262511, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +61, Private, 51170, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, <=50K +40, State-gov, 91949, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +21, Private, 123727, HS-grad, 9, Never-married, Exec-managerial, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 173175, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +35, Self-emp-not-inc, 120301, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +29, Private, 250967, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Federal-gov, 285432, Assoc-acdm, 12, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +20, Private, 36235, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, ?, 317219, Some-college, 10, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 20, United-States, >50K +51, Local-gov, 110965, Masters, 14, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 123283, HS-grad, 9, Separated, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 
15, United-States, <=50K +20, ?, 249087, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 152940, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 376680, HS-grad, 9, Never-married, Tech-support, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +56, Private, 231232, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 30, Canada, <=50K +55, Self-emp-not-inc, 168625, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 12, United-States, >50K +26, Private, 33939, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 0, 20, United-States, <=50K +46, Private, 155659, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 15024, 0, 45, United-States, >50K +32, Local-gov, 190228, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 216178, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 587310, 7th-8th, 4, Never-married, Other-service, Other-relative, White, Male, 0, 0, 35, Guatemala, <=50K +23, Private, 155919, 9th, 5, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +59, Private, 227386, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 138152, 12th, 8, Never-married, Craft-repair, Other-relative, Other, Male, 0, 0, 48, Guatemala, <=50K +36, Private, 167482, 10th, 6, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 57957, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +33, Private, 157747, 9th, 5, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +60, Self-emp-not-inc, 88570, Assoc-voc, 11, 
Married-civ-spouse, Transport-moving, Wife, White, Female, 0, 0, 15, Germany, >50K +40, Private, 273308, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, White, Female, 0, 0, 48, Mexico, <=50K +48, Private, 216292, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 43, United-States, <=50K +27, Self-emp-not-inc, 131298, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 48, United-States, <=50K +19, Private, 386378, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +38, Private, 179668, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +26, Private, 210812, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 43, United-States, <=50K +45, Federal-gov, 311671, Assoc-voc, 11, Married-civ-spouse, Protective-serv, Husband, White, Male, 3908, 0, 40, United-States, <=50K +20, Private, 215247, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +32, Federal-gov, 125856, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 74631, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 13, United-States, <=50K +22, Private, 24008, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, State-gov, 354591, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 114, 0, 38, United-States, <=50K +34, Private, 155343, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1848, 50, United-States, >50K +46, Private, 308334, 1st-4th, 2, Widowed, Other-service, Unmarried, Other, Female, 0, 0, 30, Mexico, <=50K +39, Private, 245361, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +79, Self-emp-not-inc, 158319, HS-grad, 9, Widowed, Other-service, 
Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +60, ?, 204486, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, United-States, >50K +24, Private, 314823, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, Dominican-Republic, <=50K +31, Private, 211334, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 2407, 0, 65, United-States, <=50K +37, Self-emp-not-inc, 73199, Bachelors, 13, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 3137, 0, 77, Vietnam, <=50K +23, Private, 126550, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 260782, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1579, 45, El-Salvador, <=50K +29, Private, 114224, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +22, State-gov, 64292, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 43, United-States, <=50K +69, ?, 628797, Some-college, 10, Widowed, ?, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +55, Local-gov, 219775, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +43, Private, 212894, HS-grad, 9, Divorced, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 260019, 7th-8th, 4, Never-married, Farming-fishing, Unmarried, Other, Male, 0, 0, 36, Mexico, <=50K +29, Private, 228075, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 35, Mexico, <=50K +22, Private, 239806, Assoc-voc, 11, Never-married, Other-service, Other-relative, White, Female, 0, 0, 40, Mexico, <=50K +22, Private, 324637, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +25, Private, 163620, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 84, United-States, >50K +29, 
Private, 194200, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 45, United-States, <=50K +25, State-gov, 129200, Some-college, 10, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +33, Federal-gov, 207172, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 135312, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +31, Private, 100734, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 28, United-States, <=50K +30, Local-gov, 226443, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1887, 45, United-States, >50K +55, Private, 110871, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 192704, 12th, 8, Never-married, Exec-managerial, Not-in-family, White, Male, 4650, 0, 50, United-States, <=50K +47, ?, 224108, HS-grad, 9, Widowed, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 78870, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 8614, 0, 40, United-States, >50K +42, Private, 107762, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Private, 183611, Assoc-acdm, 12, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 55, Germany, <=50K +62, Local-gov, 249078, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +65, Self-emp-inc, 208452, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 35, United-States, >50K +23, Private, 302195, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, ?, 199947, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 32, United-States, <=50K +47, Private, 379118, 7th-8th, 4, 
Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 60, United-States, >50K +50, Self-emp-inc, 174855, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +70, ?, 173736, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 6, United-States, <=50K +32, Self-emp-not-inc, 39369, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Federal-gov, 196348, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 340917, HS-grad, 9, Never-married, Craft-repair, Unmarried, White, Male, 14344, 0, 40, United-States, >50K +76, Private, 97077, 10th, 6, Widowed, Sales, Unmarried, Black, Female, 0, 0, 12, United-States, <=50K +54, Private, 200098, Bachelors, 13, Divorced, Sales, Not-in-family, Black, Female, 0, 0, 60, United-States, <=50K +32, Federal-gov, 127651, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 315128, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 52, United-States, <=50K +31, Federal-gov, 206823, Bachelors, 13, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +65, Self-emp-not-inc, 316093, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 1668, 40, United-States, <=50K +30, Private, 112115, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 60, Ireland, >50K +63, ?, 203821, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 250051, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 10, United-States, <=50K +40, Federal-gov, 298635, Masters, 14, Married-civ-spouse, Tech-support, Husband, Asian-Pac-Islander, Male, 0, 1902, 40, Philippines, >50K +26, State-gov, 109193, Some-college, 10, Never-married, 
Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 130849, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 8, United-States, <=50K +34, Local-gov, 43959, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 50, United-States, <=50K +66, Local-gov, 222810, Some-college, 10, Divorced, Other-service, Other-relative, White, Female, 7896, 0, 40, ?, >50K +44, Self-emp-not-inc, 27242, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 53158, Assoc-acdm, 12, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 206520, Bachelors, 13, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 164190, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +22, Private, 287988, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +28, ?, 200819, 7th-8th, 4, Divorced, ?, Own-child, White, Male, 0, 0, 84, United-States, <=50K +23, Private, 83891, HS-grad, 9, Never-married, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +49, Private, 65087, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +39, Self-emp-not-inc, 363418, Bachelors, 13, Separated, Craft-repair, Own-child, White, Male, 0, 0, 35, United-States, <=50K +67, ?, 182378, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 9386, 0, 60, United-States, >50K +19, Private, 278870, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 16, United-States, <=50K +30, Private, 174789, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1848, 50, United-States, >50K +25, Private, 228608, Some-college, 10, Never-married, Craft-repair, Other-relative, 
Asian-Pac-Islander, Female, 0, 0, 40, Cambodia, <=50K +24, Private, 184400, HS-grad, 9, Never-married, Transport-moving, Own-child, Asian-Pac-Islander, Male, 0, 0, 30, ?, <=50K +46, Private, 263568, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 117381, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +41, Federal-gov, 83411, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Self-emp-not-inc, 49156, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 20, United-States, <=50K +44, Private, 421449, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 238944, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 188982, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 20, United-States, >50K +48, Private, 175925, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +34, Private, 164190, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 232914, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +46, Self-emp-inc, 120121, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +36, Local-gov, 180805, HS-grad, 9, Never-married, Transport-moving, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +59, Local-gov, 161944, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +29, Private, 319149, 12th, 8, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, Mexico, <=50K +50, ?, 22428, Masters, 14, Never-married, ?, Not-in-family, 
White, Male, 0, 0, 40, United-States, <=50K +25, Private, 290528, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 123984, Assoc-acdm, 12, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 35, Philippines, <=50K +48, Private, 34186, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 70, United-States, <=50K +51, Federal-gov, 282680, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1564, 70, United-States, >50K +36, Private, 183892, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 8614, 0, 45, United-States, >50K +42, Local-gov, 195124, 11th, 7, Divorced, Sales, Unmarried, White, Male, 7430, 0, 50, Puerto-Rico, >50K +49, State-gov, 55938, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 209900, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 0, 20, United-States, <=50K +40, Private, 179717, Bachelors, 13, Divorced, Sales, Not-in-family, White, Male, 0, 1564, 60, United-States, >50K +26, Private, 150361, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +69, ?, 164102, HS-grad, 9, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +59, Private, 252714, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 30, Italy, <=50K +30, Private, 205204, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +31, Local-gov, 168906, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, <=50K +30, Private, 112115, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +27, Private, 116531, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, 
United-States, <=50K +20, ?, 202994, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 16, United-States, <=50K +56, Private, 191917, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Female, 4101, 0, 40, United-States, <=50K +24, Private, 341294, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 216734, Bachelors, 13, Divorced, Sales, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +51, Private, 182187, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 35, United-States, <=50K +34, Private, 424988, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +47, Private, 379118, HS-grad, 9, Divorced, Other-service, Unmarried, Black, Male, 0, 0, 9, United-States, <=50K +47, Private, 168232, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 44, United-States, >50K +20, Private, 147171, Some-college, 10, Never-married, Adm-clerical, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, <=50K +34, Self-emp-inc, 207668, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 54, ?, >50K +31, Private, 193650, 11th, 7, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 200187, Assoc-voc, 11, Divorced, Other-service, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +52, Private, 188644, 5th-6th, 3, Married-spouse-absent, Craft-repair, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +56, Private, 398067, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 29658, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 154966, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +81, Private, 364099, Some-college, 10, Divorced, 
Adm-clerical, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +28, ?, 291374, 10th, 6, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +57, Federal-gov, 97837, Some-college, 10, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 48, United-States, >50K +34, Private, 117983, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, ?, 345497, 10th, 6, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 64167, Assoc-voc, 11, Never-married, Tech-support, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +20, Private, 315877, HS-grad, 9, Never-married, Other-service, Unmarried, White, Male, 0, 2001, 40, United-States, <=50K +68, Federal-gov, 232151, Some-college, 10, Divorced, Adm-clerical, Other-relative, Black, Female, 2346, 0, 40, United-States, <=50K +60, Private, 225526, HS-grad, 9, Separated, Sales, Not-in-family, White, Female, 0, 0, 32, United-States, <=50K +37, Federal-gov, 289653, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 179462, 7th-8th, 4, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +36, Federal-gov, 67317, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 77764, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 253438, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +31, Private, 150309, Bachelors, 13, Separated, Exec-managerial, Not-in-family, White, Female, 0, 0, 70, United-States, <=50K +47, Self-emp-not-inc, 83064, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +60, Self-emp-not-inc, 376973, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, 
White, Male, 0, 0, 42, United-States, >50K +75, Private, 311184, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 24, United-States, <=50K +43, Local-gov, 159449, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 168288, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 74883, Bachelors, 13, Never-married, Tech-support, Not-in-family, Asian-Pac-Islander, Female, 0, 1092, 40, Philippines, <=50K +20, Private, 275190, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 189838, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +57, Self-emp-inc, 101338, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 20, United-States, <=50K +43, Private, 331894, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +18, Self-emp-not-inc, 40293, HS-grad, 9, Never-married, Farming-fishing, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +41, Local-gov, 88904, Bachelors, 13, Separated, Prof-specialty, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +48, Private, 145041, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 40, Dominican-Republic, <=50K +35, Private, 46385, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 5178, 0, 90, United-States, >50K +41, State-gov, 363591, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 183327, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Female, 0, 1594, 20, United-States, <=50K +32, State-gov, 182556, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 1887, 45, United-States, >50K +33, Private, 267859, HS-grad, 9, 
Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, El-Salvador, <=50K +58, Private, 190747, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 162869, Some-college, 10, Never-married, Sales, Other-relative, White, Male, 0, 0, 65, United-States, <=50K +33, Private, 141229, Some-college, 10, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +42, Self-emp-not-inc, 174216, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, >50K +25, Private, 366416, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 172538, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +35, Private, 193026, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +50, Private, 184424, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 1902, 38, United-States, >50K +49, Local-gov, 337768, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +25, Local-gov, 179059, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +47, Federal-gov, 99549, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 72619, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +42, State-gov, 55764, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +37, Private, 30267, 11th, 7, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 60, United-States, >50K +25, Private, 308144, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +29, Private, 206351, 10th, 6, Married-civ-spouse, 
Craft-repair, Husband, White, Male, 5013, 0, 40, United-States, <=50K +26, Private, 282304, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +26, ?, 176077, Some-college, 10, Never-married, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +45, Self-emp-inc, 142719, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 114973, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +33, Federal-gov, 159548, Assoc-acdm, 12, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 91209, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 196564, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +51, Self-emp-not-inc, 149220, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 75, United-States, <=50K +21, Private, 169699, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +23, Private, 218215, Assoc-acdm, 12, Never-married, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +30, Private, 156718, HS-grad, 9, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 55720, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +38, Self-emp-inc, 257250, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 194630, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 398931, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1485, 50, United-States, >50K +37, Self-emp-not-inc, 362062, Prof-school, 15, 
Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 50, United-States, >50K +44, Local-gov, 101593, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 1876, 42, United-States, <=50K +33, Private, 196266, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Local-gov, 197332, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 97842, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 86837, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 1902, 40, United-States, >50K +17, Private, 57324, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +43, Private, 116852, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 36, Portugal, >50K +45, Private, 154430, Bachelors, 13, Widowed, Prof-specialty, Not-in-family, White, Female, 10520, 0, 50, United-States, >50K +37, Private, 38468, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Local-gov, 188808, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +55, Local-gov, 177163, Masters, 14, Widowed, Prof-specialty, Unmarried, White, Female, 914, 0, 50, United-States, <=50K +41, Private, 187322, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +23, Private, 107578, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 2174, 0, 40, United-States, <=50K +38, Private, 168680, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +23, Private, 256755, Bachelors, 13, Never-married, Handlers-cleaners, Other-relative, White, Female, 0, 0, 40, Cuba, <=50K +35, Private, 360799, HS-grad, 
9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +18, Private, 188476, 11th, 7, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 20, United-States, <=50K +47, Private, 30457, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 252752, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 8, United-States, <=50K +41, Self-emp-not-inc, 443508, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +23, Private, 244408, Some-college, 10, Never-married, Adm-clerical, Other-relative, Asian-Pac-Islander, Female, 0, 0, 24, Vietnam, <=50K +41, Private, 178983, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 143068, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 2407, 0, 50, United-States, <=50K +30, Local-gov, 247328, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 201732, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 246829, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, ?, 290267, Bachelors, 13, Never-married, ?, Not-in-family, White, Male, 0, 0, 18, United-States, <=50K +29, Private, 119170, Some-college, 10, Separated, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 207923, Some-college, 10, Married-spouse-absent, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +48, State-gov, 170142, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +44, Self-emp-not-inc, 187164, HS-grad, 9, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 60, United-States, <=50K +34, Local-gov, 303867, 9th, 5, 
Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 291429, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +32, Private, 213179, Some-college, 10, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, >50K +31, State-gov, 111843, Assoc-acdm, 12, Separated, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +25, Private, 297154, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 2407, 0, 40, United-States, <=50K +47, Federal-gov, 68493, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, >50K +46, Federal-gov, 340718, 11th, 7, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 194059, 12th, 8, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 47296, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 1740, 20, United-States, <=50K +28, State-gov, 286310, HS-grad, 9, Married-civ-spouse, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 207202, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, >50K +33, Self-emp-inc, 132601, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +17, ?, 139183, 10th, 6, Never-married, ?, Own-child, White, Female, 0, 0, 15, United-States, <=50K +41, Private, 160785, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 117849, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 40, United-States, >50K +38, Local-gov, 225605, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 44, United-States, <=50K +24, Private, 190290, Bachelors, 13, Never-married, Sales, Not-in-family, White, 
Male, 0, 0, 60, United-States, <=50K +49, Private, 164799, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +60, Federal-gov, 21876, Some-college, 10, Divorced, Prof-specialty, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +44, Private, 160785, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +63, Self-emp-inc, 272425, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 168538, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +45, Self-emp-inc, 204205, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +49, Private, 142287, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1902, 50, United-States, >50K +36, Private, 169926, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +65, Local-gov, 205024, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 8, United-States, <=50K +41, Private, 374764, Bachelors, 13, Widowed, Exec-managerial, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +25, Private, 108779, Masters, 14, Separated, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +20, ?, 293136, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +60, Private, 227332, Assoc-voc, 11, Widowed, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +17, Local-gov, 246308, 11th, 7, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 20, Puerto-Rico, <=50K +28, Private, 51331, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 16, United-States, >50K +31, Private, 153078, Assoc-acdm, 12, Never-married, Craft-repair, Own-child, Other, Male, 0, 0, 50, United-States, <=50K +47, 
Private, 169180, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +45, Self-emp-not-inc, 193451, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +51, Private, 305147, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 138892, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +44, Self-emp-not-inc, 402397, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 1902, 60, United-States, >50K +34, Private, 223267, HS-grad, 9, Never-married, Exec-managerial, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +19, Private, 29250, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 10, United-States, <=50K +51, ?, 203953, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, >50K +46, State-gov, 29696, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 315640, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 1977, 40, China, >50K +37, Private, 632613, 10th, 6, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 35, Mexico, <=50K +56, Private, 282023, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +29, Private, 77760, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, United-States, <=50K +46, Self-emp-not-inc, 148599, Masters, 14, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +55, Private, 414994, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 339863, Bachelors, 13, Divorced, Sales, Not-in-family, White, Male, 8614, 0, 48, United-States, >50K +34, 
Private, 499249, HS-grad, 9, Married-spouse-absent, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, Guatemala, <=50K +45, ?, 144354, 9th, 5, Separated, ?, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 252058, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, ?, 99543, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 20, United-States, <=50K +34, Private, 117963, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +27, Private, 194652, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +29, Private, 299705, Some-college, 10, Never-married, Handlers-cleaners, Unmarried, Black, Male, 0, 0, 37, United-States, <=50K +19, Federal-gov, 27433, 12th, 8, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +47, Local-gov, 39986, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Self-emp-inc, 135342, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +52, Private, 270142, Assoc-voc, 11, Separated, Exec-managerial, Unmarried, Black, Female, 0, 0, 60, United-States, <=50K +33, Self-emp-not-inc, 118267, Assoc-acdm, 12, Divorced, Sales, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +29, Private, 266043, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 35633, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 74568, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 214816, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +43, Private, 222971, 5th-6th, 3, 
Never-married, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +31, Private, 259425, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +47, Self-emp-inc, 212120, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 245880, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +58, Local-gov, 54947, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +47, Self-emp-inc, 79627, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Male, 27828, 0, 50, United-States, >50K +55, Private, 151474, Bachelors, 13, Never-married, Tech-support, Other-relative, White, Female, 0, 1590, 38, United-States, <=50K +26, Private, 132661, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 5013, 0, 40, United-States, <=50K +28, Private, 161674, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 62346, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +40, Private, 227236, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +19, Private, 283033, 11th, 7, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +63, Self-emp-not-inc, 298249, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 10605, 0, 40, United-States, >50K +42, Private, 251229, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +76, Private, 199949, 9th, 5, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 13, United-States, <=50K +23, State-gov, 305498, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 203836, 5th-6th, 3, 
Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, State-gov, 79440, Masters, 14, Never-married, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 30, Japan, <=50K +48, Local-gov, 142719, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +56, Private, 119859, Some-college, 10, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +32, Private, 141410, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +44, Local-gov, 202872, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +27, Private, 198813, HS-grad, 9, Divorced, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +33, Federal-gov, 129707, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +22, Private, 445758, 5th-6th, 3, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +18, ?, 30246, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 20, United-States, <=50K +44, Private, 173981, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 108506, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 60, United-States, <=50K +34, Private, 134886, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, Federal-gov, 181970, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1672, 40, United-States, <=50K +57, Self-emp-inc, 282913, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, Cuba, <=50K +59, Local-gov, 196013, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Federal-gov, 348491, Some-college, 
10, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 40, United-States, >50K +52, Private, 416164, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, Other, Male, 0, 0, 49, Mexico, <=50K +17, Private, 121037, 12th, 8, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +29, Private, 103111, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, Canada, <=50K +63, Self-emp-not-inc, 147589, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 30, United-States, >50K +20, Private, 24008, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 24, United-States, <=50K +42, Self-emp-inc, 123838, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +50, Self-emp-not-inc, 175456, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +55, Private, 84774, HS-grad, 9, Married-civ-spouse, Priv-house-serv, Wife, White, Female, 0, 0, 30, United-States, <=50K +27, Private, 194590, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 25, United-States, <=50K +28, Private, 134566, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +55, Private, 211678, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +44, Federal-gov, 44822, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, State-gov, 144586, Some-college, 10, Never-married, Protective-serv, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 119156, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 371987, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, State-gov, 144125, Bachelors, 
13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, Private, 31905, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 40, United-States, >50K +48, Self-emp-not-inc, 121124, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +46, Private, 58126, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 318518, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 296509, 7th-8th, 4, Separated, Farming-fishing, Not-in-family, White, Male, 0, 0, 45, Mexico, <=50K +32, Private, 473133, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +52, Private, 155434, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, <=50K +52, Private, 99185, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 50, United-States, >50K +39, Private, 56648, HS-grad, 9, Separated, Sales, Not-in-family, White, Female, 0, 0, 47, United-States, <=50K +57, Local-gov, 118481, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 40, United-States, >50K +21, Private, 321666, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 594, 0, 40, United-States, <=50K +22, State-gov, 119838, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, United-States, <=50K +26, Private, 330695, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, State-gov, 58039, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 313022, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, United-States, >50K +42, Private, 178134, Some-college, 10, 
Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +40, Private, 165309, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 43, United-States, <=50K +22, Private, 216181, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 45, United-States, <=50K +62, Private, 178745, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +44, Private, 111067, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, ?, 163788, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +33, Self-emp-not-inc, 295591, 1st-4th, 2, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +45, Private, 123075, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +18, Private, 78045, 11th, 7, Married-civ-spouse, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, Local-gov, 255004, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 254221, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 60, United-States, >50K +20, Private, 174714, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 15, United-States, <=50K +68, Self-emp-not-inc, 450580, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 20, United-States, <=50K +61, Private, 128230, 7th-8th, 4, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +48, Private, 192894, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 325390, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 20333, HS-grad, 9, Married-civ-spouse, 
Sales, Wife, White, Female, 7688, 0, 40, United-States, >50K +32, Federal-gov, 128714, HS-grad, 9, Never-married, Other-service, Own-child, Black, Female, 0, 0, 32, United-States, <=50K +35, Private, 170797, Bachelors, 13, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 269186, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 127671, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 211840, Some-college, 10, Separated, Sales, Unmarried, Black, Female, 0, 0, 16, United-States, <=50K +37, Private, 163392, HS-grad, 9, Never-married, Transport-moving, Other-relative, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +40, Private, 201495, Bachelors, 13, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +25, Private, 251854, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, Jamaica, <=50K +41, Private, 279297, HS-grad, 9, Never-married, Sales, Not-in-family, Black, Female, 0, 0, 60, United-States, <=50K +52, Self-emp-not-inc, 195462, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 98, United-States, >50K +33, Private, 170769, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 142443, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Self-emp-not-inc, 182809, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +53, Private, 121441, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +44, Private, 275094, 1st-4th, 2, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, United-States, <=50K +35, Private, 170263, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 
40, United-States, <=50K +38, Private, 172571, Some-college, 10, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 58, Poland, <=50K +34, Private, 178615, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 279524, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, State-gov, 165201, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 45, United-States, <=50K +65, Local-gov, 323006, HS-grad, 9, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 25, United-States, <=50K +29, Private, 235168, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +39, Self-emp-inc, 114844, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 65, United-States, >50K +46, Local-gov, 216414, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +37, Private, 34378, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 2580, 0, 60, United-States, <=50K +47, State-gov, 80914, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 47, United-States, >50K +62, Private, 73292, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 212165, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +90, Private, 52386, Some-college, 10, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 35, United-States, <=50K +33, Private, 205649, Assoc-acdm, 12, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +57, Private, 109638, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1672, 45, United-States, <=50K +25, Private, 200408, Assoc-acdm, 12, Never-married, Craft-repair, 
Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Self-emp-inc, 187720, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +52, Private, 236180, Bachelors, 13, Married-spouse-absent, Other-service, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +21, Private, 118693, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 363130, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Male, 0, 0, 18, United-States, <=50K +39, Private, 225544, Masters, 14, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, Poland, <=50K +59, Federal-gov, 243612, HS-grad, 9, Widowed, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, Self-emp-not-inc, 160786, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +49, Private, 234320, 7th-8th, 4, Never-married, Prof-specialty, Other-relative, Black, Male, 0, 0, 45, United-States, <=50K +34, Private, 314646, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 124971, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 209184, Bachelors, 13, Married-civ-spouse, Sales, Husband, Other, Male, 0, 0, 40, Puerto-Rico, <=50K +39, State-gov, 121838, HS-grad, 9, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 265275, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +50, Private, 71417, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3103, 0, 40, United-States, >50K +34, Private, 45522, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 250135, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1902, 
55, United-States, <=50K +18, Private, 120283, 12th, 8, Never-married, Sales, Own-child, White, Female, 0, 0, 24, United-States, <=50K +20, Private, 216972, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 116791, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +55, State-gov, 26290, Assoc-voc, 11, Widowed, Exec-managerial, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 38, United-States, <=50K +22, Private, 216134, Some-college, 10, Never-married, Sales, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 143932, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 217120, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +47, State-gov, 223944, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 30, United-States, <=50K +23, Private, 185452, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 35, Canada, <=50K +57, Local-gov, 44273, HS-grad, 9, Widowed, Transport-moving, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 178983, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 219288, 7th-8th, 4, Widowed, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 349190, Assoc-acdm, 12, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +49, Self-emp-inc, 158685, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 2377, 40, United-States, >50K +41, Federal-gov, 57924, Some-college, 10, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, State-gov, 270324, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 30, United-States, 
<=50K +38, Private, 33001, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +58, Private, 204021, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, Canada, <=50K +26, Private, 192506, Bachelors, 13, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 35, United-States, <=50K +57, Private, 372967, 10th, 6, Divorced, Adm-clerical, Other-relative, White, Female, 0, 0, 70, Germany, <=50K +28, Private, 273929, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 1628, 60, United-States, <=50K +42, Private, 195821, HS-grad, 9, Separated, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 56179, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 2174, 0, 55, United-States, <=50K +17, ?, 127003, 9th, 5, Never-married, ?, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +39, Self-emp-not-inc, 124090, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 199600, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +42, Private, 255847, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 4386, 0, 48, United-States, >50K +51, Self-emp-not-inc, 218311, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +27, Private, 167336, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 39, United-States, <=50K +41, Private, 59938, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 43, United-States, <=50K +28, Private, 263728, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 278230, Some-college, 10, Divorced, Farming-fishing, Unmarried, White, Female, 10520, 0, 30, United-States, >50K +73, ?, 180603, HS-grad, 9, Widowed, ?, 
Not-in-family, White, Female, 0, 0, 8, United-States, <=50K +49, Private, 43910, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 30, United-States, <=50K +47, Private, 190139, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 109001, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 49, United-States, <=50K +42, Local-gov, 159931, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 45, United-States, >50K +32, Private, 194987, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +32, Local-gov, 87310, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 41, United-States, <=50K +27, Private, 133937, Masters, 14, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 207064, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 36011, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +41, Federal-gov, 168294, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 5178, 0, 40, United-States, >50K +49, Local-gov, 194895, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 7298, 0, 40, United-States, >50K +58, Self-emp-not-inc, 49884, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +41, Self-emp-not-inc, 27305, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 7688, 0, 40, United-States, >50K +26, Private, 229977, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +21, Private, 64520, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +32, ?, 134886, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 2, 
United-States, >50K +37, Private, 305379, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +23, Private, 202284, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +42, Self-emp-not-inc, 99185, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 159662, HS-grad, 9, Married-civ-spouse, Sales, Own-child, White, Male, 0, 0, 26, United-States, >50K +67, Private, 197865, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Local-gov, 175149, HS-grad, 9, Divorced, Transport-moving, Not-in-family, Black, Female, 0, 0, 38, United-States, <=50K +49, Local-gov, 349633, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +36, Private, 135293, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 1506, 0, 45, ?, <=50K +18, Private, 242893, 11th, 7, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 35, United-States, <=50K +25, Private, 218667, 5th-6th, 3, Married-civ-spouse, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +43, State-gov, 144811, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 146091, Doctorate, 16, Married-civ-spouse, Exec-managerial, Wife, White, Female, 99999, 0, 36, United-States, >50K +21, Private, 206861, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 20, ?, <=50K +65, Self-emp-not-inc, 226215, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 15, United-States, <=50K +66, Private, 114447, Assoc-voc, 11, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +33, Private, 124187, 11th, 7, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 60, United-States, <=50K +51, Private, 147954, 
HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 3411, 0, 38, United-States, <=50K +27, Self-emp-inc, 64379, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1740, 40, United-States, <=50K +17, Private, 156501, 12th, 8, Never-married, Other-service, Own-child, White, Female, 0, 0, 16, United-States, <=50K +32, Private, 207668, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 45, United-States, >50K +61, ?, 161279, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +38, Private, 225707, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Cuba, >50K +43, Local-gov, 115603, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +40, State-gov, 506329, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Taiwan, >50K +63, Private, 275034, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1740, 35, United-States, <=50K +76, ?, 172637, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, >50K +42, Private, 56483, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Federal-gov, 144778, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +76, Self-emp-not-inc, 33213, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 30, ?, >50K +41, Local-gov, 297248, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2415, 45, United-States, >50K +17, Private, 137042, 10th, 6, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 20, United-States, <=50K +30, Self-emp-not-inc, 33308, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 158420, Bachelors, 13, Never-married, Prof-specialty, 
Not-in-family, White, Male, 0, 0, 50, Iran, <=50K +22, Private, 41763, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +53, ?, 220640, Bachelors, 13, Divorced, ?, Other-relative, Other, Female, 0, 0, 20, United-States, <=50K +28, Private, 149734, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 52, United-States, <=50K +25, ?, 262245, Assoc-voc, 11, Never-married, ?, Own-child, White, Female, 3418, 0, 40, United-States, <=50K +24, Private, 349691, Some-college, 10, Never-married, Sales, Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +47, Private, 185385, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Self-emp-not-inc, 174463, Assoc-voc, 11, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +26, Private, 236068, Some-college, 10, Never-married, Sales, Other-relative, White, Female, 0, 0, 20, United-States, <=50K +63, ?, 445168, Bachelors, 13, Widowed, ?, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 56, United-States, <=50K +25, Private, 91334, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 75, United-States, <=50K +28, Private, 33895, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +36, Private, 214816, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 229773, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +51, Self-emp-inc, 166386, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, Asian-Pac-Islander, Female, 0, 0, 35, Taiwan, <=50K +44, Private, 266135, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +18, Private, 300379, 12th, 8, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 12, 
United-States, <=50K +54, Federal-gov, 392502, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +61, Private, 73809, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 193720, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +43, Private, 316183, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 162944, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Local-gov, 186888, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 40, United-States, >50K +27, ?, 330132, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +24, Private, 192017, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 30, United-States, <=50K +20, State-gov, 161978, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 20, United-States, <=50K +52, Private, 202930, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +57, Local-gov, 323309, 7th-8th, 4, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, Self-emp-inc, 197332, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +42, Self-emp-inc, 204033, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, ?, <=50K +22, Private, 271274, 11th, 7, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 174242, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 209483, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +39, 
Federal-gov, 99146, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 1887, 60, United-States, >50K +52, Self-emp-not-inc, 102346, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 35, United-States, <=50K +25, Private, 181666, Assoc-acdm, 12, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +50, Private, 207367, Some-college, 10, Married-spouse-absent, Other-service, Not-in-family, White, Female, 0, 0, 40, Cuba, <=50K +35, State-gov, 82622, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 80, United-States, <=50K +50, Private, 202296, Assoc-voc, 11, Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 142182, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 25, United-States, <=50K +48, Federal-gov, 94342, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 41493, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, Canada, <=50K +18, Private, 181712, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 12, United-States, <=50K +29, Self-emp-not-inc, 164607, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Self-emp-not-inc, 41496, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +63, Private, 143098, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 40, United-States, <=50K +36, Local-gov, 196529, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +24, Private, 157332, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Female, 0, 0, 42, United-States, <=50K +30, Local-gov, 154935, Assoc-acdm, 12, Never-married, Protective-serv, Not-in-family, Black, Male, 0, 0, 40, 
United-States, <=50K +23, Private, 223231, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Other, Male, 0, 0, 40, Mexico, <=50K +35, ?, 253860, HS-grad, 9, Divorced, ?, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +21, Private, 362589, Bachelors, 13, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +28, Private, 94880, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 43, Mexico, <=50K +20, Private, 309580, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 130389, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, Scotland, <=50K +21, Private, 349365, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +27, Private, 376936, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 179557, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 105577, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +51, Private, 224207, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Federal-gov, 47907, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Self-emp-not-inc, 191283, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 20953, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 40, United-States, >50K +22, State-gov, 186569, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 12, United-States, <=50K +59, Private, 43221, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +38, Private, 161141, HS-grad, 9, 
Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 203003, HS-grad, 9, Never-married, Transport-moving, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +90, Private, 141758, 9th, 5, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 113322, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 343847, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 38, United-States, >50K +45, Private, 214068, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +44, Private, 116632, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +23, Private, 240160, Assoc-acdm, 12, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 516337, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +23, Self-emp-inc, 284651, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +39, State-gov, 141420, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 42750, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +54, Private, 165278, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 167265, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 43, United-States, <=50K +44, Private, 139907, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 50, United-States, <=50K +31, Self-emp-inc, 236415, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 20, United-States, >50K +25, Private, 312966, 9th, 5, Separated, 
Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, El-Salvador, <=50K +33, Private, 118941, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 32, United-States, >50K +32, Private, 198068, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +36, Private, 373952, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 236111, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, Other, Male, 0, 0, 55, United-States, >50K +80, Private, 157778, Masters, 14, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 10, United-States, <=50K +21, Private, 143604, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 8, United-States, <=50K +35, Self-emp-not-inc, 319831, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +77, ?, 132728, Masters, 14, Divorced, ?, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 137606, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 5013, 0, 40, United-States, <=50K +35, ?, 61343, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 268234, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 100135, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 1740, 25, United-States, <=50K +53, Self-emp-not-inc, 34973, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 45, United-States, <=50K +41, Private, 323790, HS-grad, 9, Divorced, Handlers-cleaners, Unmarried, White, Male, 0, 0, 55, United-States, <=50K +57, Private, 319733, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Poland, >50K +21, ?, 180339, Some-college, 10, Never-married, ?, Own-child, White, 
Female, 0, 0, 25, United-States, <=50K +19, Private, 125591, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 60772, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 45, United-States, <=50K +42, Federal-gov, 74680, Masters, 14, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 2001, 60, United-States, <=50K +29, Self-emp-not-inc, 141185, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 55, United-States, <=50K +38, ?, 204668, Assoc-voc, 11, Separated, ?, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +26, Private, 273792, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +41, Private, 70037, Some-college, 10, Never-married, Craft-repair, Unmarried, White, Male, 0, 3004, 60, ?, >50K +40, Private, 343068, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 177907, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +28, Private, 144063, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Self-emp-not-inc, 257574, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 20, United-States, <=50K +42, Self-emp-not-inc, 67065, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +32, Private, 183356, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 152940, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 30, United-States, <=50K +37, Private, 227128, 5th-6th, 3, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Local-gov, 45607, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, 
Male, 0, 0, 56, United-States, <=50K +49, Private, 155489, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, ?, 230704, HS-grad, 9, Never-married, ?, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +24, ?, 267955, 9th, 5, Never-married, ?, Own-child, White, Female, 0, 0, 20, United-States, <=50K +19, Private, 165115, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 49923, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 272240, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 255476, 7th-8th, 4, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, Mexico, <=50K +59, Private, 194290, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 48, United-States, <=50K +52, Private, 145548, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 175262, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +45, Local-gov, 37306, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +58, Private, 137547, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +53, Private, 276515, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, Cuba, <=50K +23, Private, 174626, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 25, United-States, <=50K +35, Private, 215310, 11th, 7, Divorced, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 332355, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 204057, Assoc-acdm, 12, 
Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 391591, 12th, 8, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 169092, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 50, United-States, >50K +28, Private, 230743, Assoc-acdm, 12, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 190963, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +74, ?, 204840, 5th-6th, 3, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 56, Mexico, <=50K +19, Private, 169853, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 24, United-States, <=50K +28, Private, 212091, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 2580, 0, 40, United-States, <=50K +31, Private, 202822, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +61, ?, 226989, Some-college, 10, Married-spouse-absent, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 140011, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 53, United-States, <=50K +20, ?, 432376, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, Germany, <=50K +35, Private, 90273, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, ?, >50K +23, Private, 224424, Bachelors, 13, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 168943, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 30, United-States, >50K +19, Private, 571853, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +30, Private, 156464, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, 
>50K +26, Private, 108542, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 35, United-States, <=50K +34, Local-gov, 194325, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +49, Private, 114797, Bachelors, 13, Divorced, Exec-managerial, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +35, Private, 40135, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 2042, 40, United-States, <=50K +38, Private, 204756, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 228190, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 20, United-States, <=50K +33, Private, 163392, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, Amer-Indian-Eskimo, Male, 0, 0, 48, United-States, >50K +54, Private, 138845, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Local-gov, 169853, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +18, Never-worked, 206359, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +60, Private, 224097, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Self-emp-not-inc, 160786, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 190044, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Local-gov, 145290, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 120268, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 70, United-States, <=50K +17, Private, 327434, 10th, 6, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +41, Self-emp-inc, 218302, 
Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +30, Private, 1184622, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 35, United-States, <=50K +90, Local-gov, 227796, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 20051, 0, 60, United-States, >50K +25, Private, 206343, HS-grad, 9, Never-married, Protective-serv, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 36851, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 35, United-States, <=50K +29, Private, 148550, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 157079, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 40, ?, >50K +31, Federal-gov, 142470, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +43, Private, 86750, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 99, United-States, <=50K +63, Private, 361631, Masters, 14, Separated, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +46, Private, 163229, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Private, 179594, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 254773, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 50, United-States, >50K +26, Private, 58065, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +26, Private, 205428, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +20, ?, 41183, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K +19, ?, 308064, HS-grad, 9, Never-married, ?, Not-in-family, 
Black, Female, 0, 0, 40, United-States, <=50K +61, Private, 173924, 9th, 5, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, Puerto-Rico, >50K +23, State-gov, 142547, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 119704, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 275364, Bachelors, 13, Divorced, Tech-support, Unmarried, White, Male, 7430, 0, 40, Germany, >50K +42, Self-emp-not-inc, 207392, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 12, United-States, <=50K +31, Private, 147215, 12th, 8, Divorced, Other-service, Unmarried, White, Female, 0, 0, 21, United-States, <=50K +31, Private, 101562, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 55, United-States, <=50K +63, Private, 216413, Bachelors, 13, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +29, State-gov, 188986, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Female, 0, 1590, 64, United-States, <=50K +43, State-gov, 52849, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 304710, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 10, Vietnam, <=50K +17, Private, 265657, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 25, United-States, <=50K +23, Self-emp-not-inc, 258298, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Male, 0, 2231, 40, United-States, >50K +35, Private, 360814, 9th, 5, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +32, Private, 53260, HS-grad, 9, Divorced, Other-service, Unmarried, Other, Female, 0, 0, 28, United-States, <=50K +50, Self-emp-inc, 127315, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, 
>50K +25, Private, 233777, HS-grad, 9, Never-married, Transport-moving, Other-relative, White, Male, 0, 0, 40, ?, <=50K +26, Local-gov, 197530, Masters, 14, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 340940, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 88432, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, Private, 183810, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +90, Private, 51744, Masters, 14, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 50, United-States, >50K +35, Private, 175614, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, >50K +31, Self-emp-not-inc, 235237, Some-college, 10, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 60, United-States, >50K +60, Private, 227266, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 33, United-States, <=50K +21, Private, 146499, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Other-relative, White, Female, 0, 1579, 40, United-States, <=50K +71, Local-gov, 337064, Masters, 14, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 141003, Assoc-voc, 11, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +50, Local-gov, 117791, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 172846, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +23, Private, 73514, HS-grad, 9, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, <=50K +74, Private, 211075, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 30, United-States, <=50K +67, 
Private, 197816, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 1844, 70, United-States, <=50K +59, Private, 43221, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 43, United-States, >50K +28, Private, 183780, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1740, 40, United-States, <=50K +45, Private, 26781, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +63, Self-emp-not-inc, 271550, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 20, United-States, <=50K +39, Private, 250157, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 63, United-States, <=50K +33, State-gov, 913447, Some-college, 10, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +32, Private, 153078, Bachelors, 13, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +34, Private, 181091, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 45, United-States, >50K +39, Private, 231491, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +29, State-gov, 95423, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 36, United-States, <=50K +22, Private, 234663, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 283602, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 13550, 0, 43, United-States, >50K +46, Private, 328669, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 42, United-States, <=50K +51, Private, 143741, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, >50K +44, Private, 83508, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Female, 
2354, 0, 99, United-States, <=50K +56, State-gov, 81954, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 261375, Bachelors, 13, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 310045, 9th, 5, Married-spouse-absent, Machine-op-inspct, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 30, China, <=50K +39, Private, 316211, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +45, Federal-gov, 88564, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +37, Private, 61299, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +33, Private, 113364, HS-grad, 9, Divorced, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, ?, 476573, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 0, 0, 4, United-States, <=50K +46, Private, 267107, 5th-6th, 3, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 45, Italy, <=50K +35, Private, 48123, 12th, 8, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 50, United-States, <=50K +33, Private, 214635, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 50, United-States, <=50K +48, Private, 115585, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 194141, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, White, Male, 0, 0, 50, United-States, <=50K +18, ?, 23233, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 89991, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 32, United-States, <=50K +35, Private, 101709, HS-grad, 9, Never-married, Transport-moving, Own-child, Asian-Pac-Islander, Male, 0, 0, 60, United-States, <=50K +19, Private, 237455, HS-grad, 9, 
Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 25, United-States, <=50K +21, Private, 206492, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, ?, <=50K +56, Private, 28729, 11th, 7, Separated, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 153475, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 16, El-Salvador, <=50K +45, Private, 275517, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 128002, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 45, United-States, <=50K +44, Private, 175485, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 12, United-States, <=50K +55, Private, 189664, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 209808, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 176992, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 154669, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 55, United-States, <=50K +25, Private, 191271, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 375482, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 102953, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 55, United-States, >50K +53, Private, 169182, 10th, 6, Married-spouse-absent, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, Columbia, <=50K +47, Private, 184005, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, Amer-Indian-Eskimo, Female, 3325, 0, 45, United-States, <=50K +49, Self-emp-inc, 30751, Assoc-voc, 11, Divorced, Exec-managerial, Not-in-family, 
White, Male, 0, 0, 60, United-States, <=50K +22, Private, 145477, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +31, Private, 91964, Some-college, 10, Never-married, Adm-clerical, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +44, Self-emp-inc, 49249, Some-college, 10, Divorced, Other-service, Unmarried, White, Male, 0, 0, 80, United-States, <=50K +19, Private, 218956, HS-grad, 9, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 241306, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +60, ?, 251572, HS-grad, 9, Widowed, ?, Not-in-family, White, Male, 0, 0, 35, Poland, <=50K +23, Private, 319842, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 25, United-States, <=50K +44, Private, 332401, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 65, United-States, >50K +54, Local-gov, 182388, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 35, United-States, <=50K +23, Private, 205939, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 38, United-States, <=50K +21, Private, 203914, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 10, United-States, <=50K +19, State-gov, 156294, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 25, United-States, <=50K +51, Private, 254211, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, >50K +41, Private, 151504, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 50, United-States, >50K +61, Private, 85548, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 15024, 0, 18, United-States, >50K +19, Self-emp-not-inc, 30800, 10th, 6, Married-spouse-absent, Adm-clerical, Unmarried, 
Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +22, Private, 131230, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 61850, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 227800, 7th-8th, 4, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 32, United-States, <=50K +35, Private, 133454, 10th, 6, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 104094, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 105422, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +56, Private, 142182, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +41, Private, 336643, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 80, United-States, <=50K +62, Self-emp-inc, 200577, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +27, Private, 208703, HS-grad, 9, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, Japan, <=50K +55, ?, 193895, HS-grad, 9, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, England, <=50K +25, Private, 272428, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 4416, 0, 42, United-States, <=50K +33, Private, 56701, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 75, United-States, >50K +26, Private, 288592, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 266439, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +53, Federal-gov, 276868, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 131435, 
Bachelors, 13, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +56, Private, 175127, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 35, United-States, <=50K +25, Private, 277444, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +60, Private, 63296, Masters, 14, Divorced, Prof-specialty, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +28, Private, 96337, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 221955, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, Mexico, <=50K +40, Private, 197923, Bachelors, 13, Never-married, Adm-clerical, Unmarried, Black, Female, 2977, 0, 40, United-States, <=50K +29, Private, 632593, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +20, Private, 205970, Some-college, 10, Never-married, Craft-repair, Own-child, White, Female, 0, 0, 25, United-States, <=50K +25, Private, 139730, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 80, United-States, >50K +18, Private, 201901, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 10, United-States, <=50K +32, State-gov, 230224, Assoc-acdm, 12, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 35, United-States, <=50K +27, Private, 113464, 1st-4th, 2, Never-married, Other-service, Own-child, Other, Male, 0, 0, 35, Dominican-Republic, <=50K +48, Private, 94461, HS-grad, 9, Widowed, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 16, United-States, <=50K +20, Private, 271379, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 25, United-States, <=50K +55, Private, 231738, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, England, <=50K +33, Local-gov, 198183, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 
0, 0, 50, United-States, >50K +21, State-gov, 140764, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 12, United-States, <=50K +43, Self-emp-not-inc, 183479, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 30, United-States, <=50K +35, Private, 165767, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Local-gov, 139364, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +19, Private, 227491, HS-grad, 9, Never-married, Sales, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +25, Private, 222254, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 193494, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 72, United-States, >50K +27, Private, 29261, Assoc-acdm, 12, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 174368, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +69, Private, 108196, 10th, 6, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 110622, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +20, ?, 201680, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 35, United-States, <=50K +37, Private, 130277, 5th-6th, 3, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Local-gov, 98130, Bachelors, 13, Divorced, Prof-specialty, Own-child, White, Female, 0, 0, 39, United-States, <=50K +62, ?, 235521, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 48, United-States, <=50K +34, State-gov, 595000, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, 
United-States, >50K +31, Self-emp-not-inc, 349148, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +42, State-gov, 117583, Doctorate, 16, Divorced, Prof-specialty, Not-in-family, White, Female, 8614, 0, 60, United-States, >50K +26, Private, 164583, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +39, Private, 340091, Some-college, 10, Separated, Other-service, Unmarried, White, Female, 0, 0, 75, United-States, <=50K +25, Private, 49092, Bachelors, 13, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +54, Local-gov, 186884, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 30, United-States, <=50K +44, State-gov, 167265, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +34, State-gov, 34104, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 38, United-States, >50K +21, Self-emp-inc, 265116, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 128378, 5th-6th, 3, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 55, ?, <=50K +33, Private, 158416, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Self-emp-inc, 169878, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +44, Private, 296728, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Local-gov, 342458, Assoc-acdm, 12, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +21, Local-gov, 38771, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 269300, Bachelors, 13, Never-married, Other-service, Not-in-family, 
Black, Female, 0, 0, 60, United-States, <=50K +43, Private, 111483, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K +57, ?, 199114, 10th, 6, Separated, ?, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +51, Local-gov, 33863, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +29, Private, 132874, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Local-gov, 277024, HS-grad, 9, Separated, Protective-serv, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 112160, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 703067, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +58, Private, 127264, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, Self-emp-inc, 257200, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +19, Private, 57206, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +37, Private, 201319, Some-college, 10, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 114079, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 44, United-States, <=50K +45, Private, 230979, Some-college, 10, Married-spouse-absent, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 292472, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Cambodia, >50K +64, ?, 286732, 7th-8th, 4, Widowed, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Local-gov, 134444, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 72, United-States, <=50K +30, Private, 172403, HS-grad, 9, 
Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +46, Private, 191357, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, ?, 279288, 10th, 6, Never-married, ?, Other-relative, White, Female, 0, 0, 30, United-States, <=50K +60, Private, 389254, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +30, Private, 303867, HS-grad, 9, Separated, Transport-moving, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +47, Private, 164113, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 7688, 0, 40, United-States, >50K +39, Private, 111499, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 266084, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 45, United-States, >50K +27, Private, 61580, Some-college, 10, Divorced, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 231348, Some-college, 10, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 164748, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 205337, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +58, Self-emp-not-inc, 54566, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 35, United-States, <=50K +45, Private, 34419, Bachelors, 13, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +59, Private, 116442, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +29, Private, 290740, Assoc-acdm, 12, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +27, Private, 255582, HS-grad, 9, 
Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 112517, Masters, 14, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 20, United-States, >50K +44, Private, 169397, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 172664, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, >50K +27, Private, 329005, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 123253, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +55, Private, 81865, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-not-inc, 173314, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, Other, Male, 0, 0, 60, United-States, <=50K +31, Private, 34572, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 45, United-States, <=50K +57, Self-emp-inc, 159028, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 60, United-States, >50K +30, Private, 149184, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +78, ?, 363134, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 1, United-States, <=50K +28, Private, 308709, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 48, United-States, <=50K +30, Self-emp-not-inc, 257295, Some-college, 10, Never-married, Sales, Other-relative, Asian-Pac-Islander, Male, 0, 2258, 40, South, <=50K +29, Private, 168479, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +66, Private, 142501, HS-grad, 9, Never-married, Other-service, Other-relative, Black, Female, 0, 0, 3, United-States, <=50K +60, Private, 338345, HS-grad, 9, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +31, Private, 177675, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 262617, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 2597, 0, 40, United-States, <=50K +24, Private, 200997, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 45, United-States, <=50K +29, Private, 176683, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +44, Private, 376072, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, >50K +34, Local-gov, 177675, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +59, Private, 348430, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 43, United-States, >50K +23, Private, 320451, Bachelors, 13, Never-married, Exec-managerial, Own-child, Asian-Pac-Islander, Male, 0, 0, 24, United-States, <=50K +23, Private, 38151, 11th, 7, Never-married, Other-service, Other-relative, White, Male, 0, 0, 40, Philippines, <=50K +55, Local-gov, 123382, Assoc-voc, 11, Separated, Prof-specialty, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +39, Self-emp-inc, 151029, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 484475, 11th, 7, Never-married, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +57, Private, 329792, 7th-8th, 4, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 75, United-States, <=50K +35, Private, 148903, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +39, Local-gov, 301614, Assoc-voc, 11, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 48, United-States, >50K +47, Private, 176319, HS-grad, 9, Married-civ-spouse, Sales, 
Own-child, White, Female, 0, 0, 38, United-States, >50K +53, State-gov, 53197, Doctorate, 16, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, >50K +23, Private, 291407, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 25, United-States, <=50K +35, Private, 204527, Masters, 14, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +44, Private, 476391, Some-college, 10, Divorced, Farming-fishing, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 224964, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +26, Private, 306225, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Poland, <=50K +23, Private, 292023, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +32, Private, 94041, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 25, Ireland, <=50K +49, Self-emp-inc, 187563, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +36, Private, 176101, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 2174, 0, 60, United-States, <=50K +36, Private, 749105, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +41, ?, 230020, 5th-6th, 3, Married-civ-spouse, ?, Husband, Other, Male, 0, 0, 40, United-States, <=50K +21, Private, 216070, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, Amer-Indian-Eskimo, Female, 0, 0, 46, United-States, >50K +54, Self-emp-not-inc, 105010, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 198203, Some-college, 10, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +35, Local-gov, 215419, Some-college, 10, Divorced, 
Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 120460, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, >50K +46, Private, 199316, Some-college, 10, Married-civ-spouse, Craft-repair, Other-relative, Asian-Pac-Islander, Male, 0, 0, 40, India, <=50K +46, Private, 146919, HS-grad, 9, Separated, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +56, Private, 174744, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, ?, 189564, Masters, 14, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 1, United-States, <=50K +21, Private, 249957, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 146574, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +47, State-gov, 156417, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +42, Private, 236110, 5th-6th, 3, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Puerto-Rico, <=50K +19, Private, 63363, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +25, Private, 190107, Bachelors, 13, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 126569, Masters, 14, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 60, United-States, >50K +35, Private, 176756, 12th, 8, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, United-States, <=50K +40, Private, 115161, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 30, United-States, <=50K +57, Self-emp-not-inc, 138892, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 15, United-States, <=50K +38, Private, 256864, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, 
United-States, >50K +48, Private, 265083, 10th, 6, Divorced, Sales, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +34, Private, 249948, Some-college, 10, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 34, United-States, <=50K +46, Federal-gov, 31141, Some-college, 10, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 164190, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 38, ?, <=50K +45, State-gov, 67544, Masters, 14, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +32, Self-emp-not-inc, 174789, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +35, Private, 199753, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 48, United-States, <=50K +62, Private, 122246, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Female, 8614, 0, 39, United-States, >50K +56, ?, 188166, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 96586, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 189590, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 140590, Some-college, 10, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 33, United-States, <=50K +35, Private, 255702, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 27, United-States, <=50K +33, Private, 260782, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 41, United-States, >50K +38, Private, 169926, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 1902, 40, United-States, >50K +37, State-gov, 151322, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 
192869, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +19, Private, 93604, 7th-8th, 4, Never-married, Craft-repair, Own-child, White, Male, 0, 1602, 32, United-States, <=50K +31, Private, 86958, 9th, 5, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K +53, Local-gov, 228723, HS-grad, 9, Divorced, Craft-repair, Not-in-family, Other, Male, 0, 0, 40, ?, >50K +33, Private, 192644, HS-grad, 9, Separated, Handlers-cleaners, Unmarried, White, Male, 0, 0, 35, Puerto-Rico, <=50K +72, Private, 284080, 1st-4th, 2, Divorced, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +54, Private, 43269, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +30, Private, 190040, Bachelors, 13, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 306108, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +30, Private, 220148, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1848, 50, United-States, >50K +30, Private, 381645, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 216361, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 16, United-States, <=50K +30, Private, 213722, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +35, Private, 112271, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 208277, Some-college, 10, Divorced, Adm-clerical, Own-child, White, Female, 0, 0, 44, United-States, >50K +38, State-gov, 352628, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 129620, 10th, 6, Never-married, Other-service, 
Other-relative, White, Female, 0, 0, 30, United-States, <=50K +32, Private, 249550, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +49, Private, 178749, Masters, 14, Married-spouse-absent, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +76, ?, 173542, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 10, United-States, <=50K +60, Private, 167670, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, <=50K +60, Private, 81578, 9th, 5, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 160662, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 80, United-States, >50K +41, Private, 163322, Bachelors, 13, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 30, ?, <=50K +24, Private, 152189, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 106176, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 60, United-States, >50K +69, State-gov, 159191, Some-college, 10, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 810, 38, United-States, <=50K +71, ?, 250263, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 3432, 0, 30, United-States, <=50K +41, Private, 78410, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +32, Private, 131379, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 166929, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Private, 380357, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 79190, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +40, Private, 342164, 
HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 37, United-States, <=50K +44, Private, 182616, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +63, Private, 339473, Some-college, 10, Divorced, Sales, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +31, Local-gov, 381153, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 15024, 0, 56, United-States, >50K +51, Private, 300816, Bachelors, 13, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 20, United-States, <=50K +51, Private, 240988, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +23, Private, 149224, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 168216, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +56, Private, 286487, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 2885, 0, 45, United-States, <=50K +39, Private, 305597, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Self-emp-not-inc, 109766, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, United-States, >50K +30, Self-emp-not-inc, 188798, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 240170, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, Germany, <=50K +31, Private, 459465, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +44, Local-gov, 162506, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, <=50K +43, Self-emp-not-inc, 145441, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 30, United-States, >50K +37, 
Federal-gov, 129573, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 72, ?, >50K +41, Private, 27444, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 46, United-States, >50K +43, Private, 195258, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +47, State-gov, 55272, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +38, Self-emp-not-inc, 164526, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 2824, 45, United-States, >50K +46, Private, 27802, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, State-gov, 165289, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 274657, 5th-6th, 3, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 50, Guatemala, <=50K +24, Private, 317175, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +39, Self-emp-inc, 163237, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 65, United-States, <=50K +37, Private, 170408, Assoc-voc, 11, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +28, ?, 55950, Bachelors, 13, Never-married, ?, Own-child, Black, Female, 0, 0, 40, Germany, <=50K +40, Private, 76625, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 366066, Assoc-acdm, 12, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 349368, HS-grad, 9, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 286824, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 12, United-States, <=50K +32, Private, 373263, HS-grad, 9, 
Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 161978, HS-grad, 9, Separated, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 543922, Masters, 14, Divorced, Transport-moving, Not-in-family, White, Male, 14344, 0, 48, United-States, >50K +46, Local-gov, 109089, Prof-school, 15, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 110151, Assoc-voc, 11, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +26, Private, 34110, Some-college, 10, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +47, Self-emp-not-inc, 118506, Bachelors, 13, Married-civ-spouse, Exec-managerial, Own-child, White, Male, 0, 0, 60, United-States, <=50K +22, Private, 117789, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 10, United-States, <=50K +34, Self-emp-not-inc, 353881, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +49, Private, 200471, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Portugal, <=50K +20, Private, 258517, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 25, United-States, <=50K +28, Private, 190367, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 174704, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +23, Private, 179413, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 329530, 9th, 5, Never-married, Priv-house-serv, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +31, Private, 273818, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 55, Mexico, <=50K +46, Private, 256522, 1st-4th, 2, 
Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, Puerto-Rico, <=50K +42, Private, 196001, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 282660, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 72630, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 50295, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 48, United-States, <=50K +20, Private, 203240, 9th, 5, Never-married, Sales, Own-child, White, Female, 0, 0, 32, United-States, <=50K +56, Self-emp-not-inc, 172618, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, United-States, >50K +41, Private, 202168, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +61, Private, 176839, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 176140, HS-grad, 9, Divorced, Exec-managerial, Unmarried, Black, Female, 0, 0, 40, United-States, >50K +60, Private, 39952, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 2228, 0, 37, United-States, <=50K +33, Private, 292465, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +40, ?, 161285, Some-college, 10, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 25, United-States, <=50K +48, Private, 355320, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, Canada, >50K +56, Private, 182460, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Private, 69345, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 3103, 0, 55, United-States, >50K +57, Self-emp-not-inc, 102058, Masters, 14, Married-civ-spouse, 
Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 165804, Some-college, 10, Never-married, Adm-clerical, Own-child, Other, Female, 0, 0, 40, United-States, <=50K +46, Private, 318259, Assoc-voc, 11, Divorced, Tech-support, Other-relative, White, Female, 0, 0, 36, United-States, <=50K +21, Private, 117606, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 170718, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 413297, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 190457, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +54, Private, 88278, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +32, Local-gov, 217296, HS-grad, 9, Married-civ-spouse, Transport-moving, Wife, White, Female, 4064, 0, 22, United-States, <=50K +62, ?, 97231, Some-college, 10, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 1, United-States, <=50K +50, Private, 123429, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Federal-gov, 420282, Some-college, 10, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +48, Private, 498325, Assoc-acdm, 12, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +24, Private, 248533, Some-college, 10, Never-married, Sales, Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 137354, Masters, 14, Married-civ-spouse, Tech-support, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, >50K +42, Private, 272910, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +52, Self-emp-inc, 206054, Some-college, 10, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Local-gov, 92141, Assoc-acdm, 12, Widowed, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +37, Private, 163199, Some-college, 10, Divorced, Tech-support, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +34, Private, 195860, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 115717, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 2051, 40, United-States, <=50K +18, Private, 120029, Some-college, 10, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 20, United-States, <=50K +33, Private, 221762, Some-college, 10, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 342164, Some-college, 10, Divorced, Sales, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +21, Private, 176356, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +23, Private, 133239, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Federal-gov, 169101, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 159442, Bachelors, 13, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 174461, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 45, United-States, <=50K +43, Private, 361280, 10th, 6, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 42, China, <=50K +52, State-gov, 447579, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, England, <=50K +27, ?, 308995, Some-college, 10, Divorced, ?, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +61, Private, 248448, 7th-8th, 4, Divorced, 
Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 161141, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 212465, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Self-emp-inc, 170871, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 55, United-States, >50K +43, Local-gov, 233865, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Private, 163052, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 348690, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Federal-gov, 34845, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, Germany, >50K +22, Private, 206861, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +49, Self-emp-inc, 349230, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +20, Private, 130840, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 20, United-States, <=50K +19, Private, 415354, 10th, 6, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 132191, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 202466, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +27, ?, 224421, Some-college, 10, Divorced, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Self-emp-not-inc, 236804, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 35, United-States, <=50K +20, Private, 107658, Some-college, 10, Never-married, Tech-support, 
Not-in-family, White, Female, 0, 0, 10, United-States, <=50K +47, Private, 102771, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +17, Private, 221403, 12th, 8, Never-married, Other-service, Own-child, Black, Male, 0, 0, 18, United-States, <=50K +76, ?, 211574, 10th, 6, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 1, United-States, <=50K +39, Private, 52645, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 276310, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +31, Private, 134613, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 43, United-States, <=50K +44, Private, 215479, HS-grad, 9, Divorced, Transport-moving, Not-in-family, Black, Male, 0, 0, 20, Haiti, <=50K +53, Private, 266529, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 265807, Some-college, 10, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +45, Self-emp-not-inc, 67716, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +34, Private, 178951, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +35, Private, 241126, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 176544, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +45, Private, 169180, Some-college, 10, Widowed, Other-service, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +37, Self-emp-not-inc, 282461, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +53, Private, 157069, Assoc-acdm, 12, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, 
>50K +35, Private, 99357, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 15024, 0, 50, United-States, >50K +38, Self-emp-not-inc, 414991, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 70, ?, <=50K +65, Self-emp-inc, 338316, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Self-emp-not-inc, 59612, 10th, 6, Divorced, Farming-fishing, Unmarried, White, Male, 0, 0, 70, United-States, <=50K +24, Private, 220426, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 115912, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 27032, 10th, 6, Never-married, Sales, Own-child, White, Female, 0, 0, 12, United-States, <=50K +19, Private, 170720, 12th, 8, Never-married, Other-service, Own-child, White, Male, 0, 0, 16, United-States, <=50K +60, Private, 183162, HS-grad, 9, Widowed, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 192360, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +78, ?, 165694, Masters, 14, Widowed, ?, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +26, Private, 128553, Some-college, 10, Never-married, Exec-managerial, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +58, Private, 209423, 1st-4th, 2, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 38, Cuba, <=50K +37, Self-emp-not-inc, 121510, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 55, United-States, <=50K +41, Private, 93793, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, >50K +30, Private, 133602, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 391329, Some-college, 10, Never-married, 
Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +48, Private, 96359, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, Greece, >50K +22, Private, 203894, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +50, Private, 196193, Masters, 14, Married-spouse-absent, Prof-specialty, Other-relative, White, Male, 0, 0, 60, ?, <=50K +25, Private, 195994, 1st-4th, 2, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 40, Guatemala, <=50K +18, Private, 50879, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 6, United-States, <=50K +21, Private, 186849, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 201127, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +33, Private, 110998, HS-grad, 9, Never-married, Other-service, Other-relative, Amer-Indian-Eskimo, Female, 0, 0, 36, United-States, <=50K +39, Private, 190466, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 2174, 0, 40, United-States, <=50K +67, Self-emp-not-inc, 173935, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 8, United-States, >50K +19, Private, 167140, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 1602, 24, United-States, <=50K +18, Private, 110230, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 11, United-States, <=50K +36, Private, 287658, HS-grad, 9, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +23, Private, 224954, Some-college, 10, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 25, United-States, <=50K +25, ?, 394820, Some-college, 10, Separated, ?, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +40, Private, 37618, Some-college, 10, Married-civ-spouse, Craft-repair, 
Husband, White, Male, 0, 0, 40, ?, <=50K +73, Self-emp-not-inc, 29306, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 37314, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 5013, 0, 40, United-States, <=50K +31, Private, 420749, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 482732, 10th, 6, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 206215, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 101364, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +66, Self-emp-inc, 185369, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 30, United-States, <=50K +66, Private, 216856, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +64, Private, 256019, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +48, Private, 348144, Some-college, 10, Divorced, Transport-moving, Not-in-family, White, Male, 3325, 0, 53, United-States, <=50K +24, Private, 190293, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Self-emp-not-inc, 25932, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +25, Private, 176729, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, >50K +33, Private, 166961, 11th, 7, Separated, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +50, Private, 86373, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 320513, 7th-8th, 4, Married-spouse-absent, Craft-repair, Not-in-family, Black, Male, 0, 0, 
50, Dominican-Republic, <=50K +34, State-gov, 190290, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, >50K +41, Local-gov, 111891, 7th-8th, 4, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +30, Self-emp-not-inc, 45796, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Private, 108496, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 2907, 0, 40, United-States, <=50K +41, Self-emp-not-inc, 120539, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3103, 0, 40, United-States, >50K +36, Self-emp-not-inc, 164526, Masters, 14, Never-married, Sales, Not-in-family, White, Male, 10520, 0, 45, United-States, >50K +37, Private, 323155, 1st-4th, 2, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 85, Mexico, <=50K +28, Private, 65389, HS-grad, 9, Never-married, Other-service, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 30, United-States, <=50K +19, Private, 414871, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 161607, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +62, Private, 224953, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +36, Private, 261382, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 45, United-States, >50K +58, Self-emp-not-inc, 231818, 10th, 6, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Greece, <=50K +42, Self-emp-inc, 184018, HS-grad, 9, Divorced, Sales, Unmarried, White, Male, 1151, 0, 50, United-States, <=50K +43, Self-emp-inc, 133060, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 35032, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, 
White, Male, 0, 0, 40, United-States, <=50K +32, State-gov, 304212, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +64, Local-gov, 50442, 9th, 5, Never-married, Adm-clerical, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 146091, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 20, United-States, >50K +26, Private, 267431, Bachelors, 13, Never-married, Sales, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +19, Private, 121240, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +21, Private, 192572, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 45, United-States, <=50K +32, Private, 211028, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Local-gov, 346122, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 5013, 0, 45, United-States, <=50K +26, Private, 202203, Bachelors, 13, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 50, United-States, <=50K +20, Private, 159297, Some-college, 10, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 15, United-States, <=50K +19, Private, 310158, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 30, United-States, <=50K +33, Federal-gov, 193246, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 42, United-States, >50K +23, Private, 200089, Some-college, 10, Married-civ-spouse, Craft-repair, Other-relative, White, Male, 0, 0, 40, El-Salvador, <=50K +29, Private, 38353, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +42, Private, 76280, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K +30, Self-emp-not-inc, 243665, HS-grad, 9, Married-civ-spouse, 
Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +63, Private, 68872, HS-grad, 9, Married-civ-spouse, Transport-moving, Wife, Asian-Pac-Islander, Female, 0, 0, 20, United-States, <=50K +34, Private, 103596, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 88055, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 24, United-States, <=50K +48, Private, 186203, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 50, United-States, <=50K +25, Private, 257910, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 200227, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +55, Self-emp-not-inc, 124975, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 27828, 0, 55, United-States, >50K +32, Private, 227669, Some-college, 10, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +22, Private, 117210, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 25, Greece, <=50K +25, Private, 76144, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +18, Private, 98667, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 16, United-States, <=50K +24, Local-gov, 155818, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 44, United-States, <=50K +29, Private, 283760, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +73, ?, 281907, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 3, United-States, <=50K +39, Private, 186183, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Self-emp-inc, 202153, Masters, 14, Married-civ-spouse, Craft-repair, 
Husband, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 365683, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +22, Private, 187538, 10th, 6, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, ?, 209432, HS-grad, 9, Separated, ?, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +33, Private, 126950, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +42, Private, 110028, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 104660, Bachelors, 13, Separated, Prof-specialty, Unmarried, White, Male, 0, 0, 45, United-States, <=50K +57, Self-emp-not-inc, 437281, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 38, United-States, >50K +42, Private, 259643, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 4650, 0, 40, United-States, <=50K +22, Private, 217961, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 1719, 30, United-States, <=50K +21, ?, 134746, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +42, Self-emp-not-inc, 120539, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +39, Private, 25803, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +41, Private, 63596, Some-college, 10, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 32, United-States, >50K +20, Local-gov, 325493, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +47, Private, 211239, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 206686, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, 
Private, 427965, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 218550, Some-college, 10, Married-spouse-absent, Adm-clerical, Not-in-family, White, Female, 14084, 0, 16, United-States, >50K +71, Private, 163385, Some-college, 10, Widowed, Sales, Not-in-family, White, Male, 0, 0, 35, United-States, >50K +52, Private, 124993, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 55, United-States, <=50K +36, Private, 107410, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 152373, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 48, United-States, >50K +37, Private, 161226, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 30, United-States, >50K +26, Private, 213799, 10th, 6, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 204461, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +35, Private, 377798, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +20, Private, 116375, 9th, 5, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, Local-gov, 210164, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1740, 40, United-States, <=50K +56, Self-emp-not-inc, 258752, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +39, Private, 327435, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 36, United-States, >50K +24, Private, 301199, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 20, United-States, <=50K +24, Private, 186221, 11th, 7, Divorced, Sales, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +23, Private, 203924, Bachelors, 13, 
Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 192236, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +25, Private, 152035, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 201454, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 156580, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 37, United-States, >50K +51, Private, 115851, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 106753, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1740, 40, United-States, <=50K +59, Private, 359292, 1st-4th, 2, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +29, Private, 83003, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 78817, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +24, Private, 200967, HS-grad, 9, Married-civ-spouse, Craft-repair, Wife, White, Female, 0, 0, 36, United-States, <=50K +38, State-gov, 107164, Some-college, 10, Separated, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 189674, HS-grad, 9, Never-married, Priv-house-serv, Unmarried, Black, Female, 0, 0, 28, ?, <=50K +34, Self-emp-not-inc, 90614, HS-grad, 9, Separated, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +42, Self-emp-not-inc, 323790, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 70, United-States, >50K +45, Self-emp-not-inc, 242552, 12th, 8, Divorced, Craft-repair, Other-relative, Black, Male, 0, 0, 35, United-States, <=50K +21, Private, 90935, Assoc-voc, 11, 
Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +64, Self-emp-inc, 165667, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 60, Canada, >50K +32, Private, 162604, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +45, Private, 205424, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +53, Private, 97411, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Laos, <=50K +42, Private, 184857, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 16, United-States, <=50K +32, Private, 165226, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 115784, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +62, Private, 368476, 5th-6th, 3, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 24, Mexico, <=50K +28, Private, 53063, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +29, ?, 134566, Doctorate, 16, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, >50K +32, Private, 153471, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +37, Self-emp-inc, 107164, 10th, 6, Never-married, Transport-moving, Not-in-family, White, Male, 0, 2559, 50, United-States, >50K +38, Private, 180303, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 50, Japan, >50K +44, Local-gov, 236321, HS-grad, 9, Divorced, Transport-moving, Own-child, White, Male, 0, 0, 25, United-States, <=50K +19, Private, 141868, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, ?, 367655, Some-college, 10, Never-married, ?, 
Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 203518, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 119558, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +56, Private, 108276, Bachelors, 13, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 385452, 10th, 6, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 162003, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 349028, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 45114, Bachelors, 13, Never-married, Sales, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 112797, 9th, 5, Divorced, Other-service, Own-child, White, Female, 0, 0, 50, United-States, <=50K +28, Private, 183639, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 177121, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +38, Private, 239755, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 150361, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 293091, 11th, 7, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 60, United-States, <=50K +24, Private, 200089, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, Mexico, >50K +40, Private, 91836, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +23, Private, 324960, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +79, Local-gov, 
84616, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 7, United-States, <=50K +44, Private, 252930, 10th, 6, Divorced, Adm-clerical, Unmarried, Other, Female, 0, 0, 42, United-States, <=50K +51, Private, 44000, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 99999, 0, 50, United-States, >50K +30, Private, 154843, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 99307, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 3103, 0, 48, United-States, >50K +41, Private, 182567, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, ?, >50K +33, Private, 93206, 5th-6th, 3, Married-civ-spouse, Farming-fishing, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +50, Private, 100109, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Male, 0, 0, 45, United-States, >50K +51, Private, 114927, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 7298, 0, 40, United-States, >50K +41, Private, 121287, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 189916, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 30, United-States, >50K +34, Private, 157747, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +28, Private, 39232, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +31, Self-emp-inc, 133861, Assoc-voc, 11, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 505980, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +67, ?, 183374, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 2329, 0, 15, United-States, <=50K +65, Private, 193216, HS-grad, 9, Married-civ-spouse, Sales, Husband, 
White, Male, 9386, 0, 40, United-States, >50K +39, Self-emp-not-inc, 140752, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +23, Private, 549349, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +29, Self-emp-not-inc, 179008, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, Self-emp-not-inc, 190554, 10th, 6, Divorced, Exec-managerial, Own-child, White, Male, 0, 0, 60, United-States, >50K +47, Private, 80924, Some-college, 10, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +51, Local-gov, 319054, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 60, United-States, <=50K +34, Private, 297094, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +52, Private, 170562, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +29, Private, 240738, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 297544, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Local-gov, 169905, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 149637, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 182526, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, Self-emp-not-inc, 158315, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +61, Self-emp-inc, 227232, Bachelors, 13, Separated, Sales, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +34, Private, 96483, Assoc-acdm, 12, Never-married, Adm-clerical, 
Not-in-family, Asian-Pac-Islander, Female, 8614, 0, 60, United-States, >50K +41, Private, 286970, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Local-gov, 223529, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 43, United-States, <=50K +78, Self-emp-not-inc, 316261, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 99999, 0, 20, United-States, >50K +40, Private, 170214, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Self-emp-not-inc, 224361, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 75, United-States, <=50K +43, Private, 124919, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 60, Japan, <=50K +55, ?, 103654, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +25, Private, 306352, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, Mexico, <=50K +26, Self-emp-not-inc, 227858, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 48, United-States, <=50K +43, Self-emp-inc, 150533, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 68, United-States, >50K +25, Private, 144478, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, Poland, <=50K +22, Private, 254547, Some-college, 10, Never-married, Adm-clerical, Other-relative, Black, Female, 0, 0, 30, Jamaica, <=50K +52, Self-emp-not-inc, 313243, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, >50K +61, Private, 149981, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 2414, 0, 5, United-States, <=50K +42, Private, 125461, Bachelors, 13, Never-married, Sales, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 306967, Assoc-voc, 11, Never-married, Adm-clerical, 
Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 192976, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +65, Private, 192133, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 2290, 0, 40, Greece, <=50K +56, ?, 131608, HS-grad, 9, Divorced, ?, Not-in-family, White, Male, 0, 0, 10, United-States, <=50K +33, Federal-gov, 339388, Assoc-acdm, 12, Divorced, Other-service, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 203240, 10th, 6, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 83827, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +45, Self-emp-inc, 160440, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +42, Private, 108502, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 42, United-States, <=50K +37, Private, 410913, HS-grad, 9, Married-spouse-absent, Farming-fishing, Unmarried, Other, Male, 0, 0, 40, Mexico, <=50K +56, Private, 193818, 9th, 5, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, ?, 163582, 10th, 6, Divorced, ?, Unmarried, White, Female, 0, 0, 16, ?, <=50K +40, Private, 103789, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 32, United-States, <=50K +31, Private, 34572, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +26, Private, 43408, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, State-gov, 105787, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, Self-emp-inc, 90693, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, <=50K +45, Self-emp-not-inc, 285575, HS-grad, 9, 
Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 40, China, <=50K +47, Local-gov, 56482, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 7688, 0, 50, United-States, >50K +22, Private, 496025, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +33, Private, 382764, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 259284, HS-grad, 9, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K +48, Self-emp-not-inc, 185385, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 98, United-States, <=50K +57, Self-emp-not-inc, 286836, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 8, United-States, <=50K +47, Private, 139145, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +58, Local-gov, 44246, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 169611, 11th, 7, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 133403, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, <=50K +29, Private, 187327, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 180032, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 46561, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +23, Private, 86065, 12th, 8, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +46, Self-emp-not-inc, 256014, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 188403, 
Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 396758, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1887, 70, United-States, >50K +25, Private, 60485, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K +32, Private, 271276, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 80, United-States, >50K +56, Private, 229525, 9th, 5, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +33, Private, 34574, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 43, United-States, <=50K +19, State-gov, 112432, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 10, United-States, <=50K +20, Private, 105312, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 18, United-States, <=50K +34, Private, 221396, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 304872, 9th, 5, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +55, Self-emp-not-inc, 319733, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 176012, 9th, 5, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 23, United-States, <=50K +31, Private, 213750, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 248384, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 351187, HS-grad, 9, Divorced, Other-service, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 138179, HS-grad, 9, Separated, Machine-op-inspct, Not-in-family, White, Male, 0, 1876, 40, United-States, <=50K +59, Private, 50223, HS-grad, 9, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 117477, HS-grad, 9, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 36, United-States, <=50K +40, Private, 194360, 9th, 5, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 118108, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, <=50K +25, Local-gov, 90730, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 40, United-States, >50K +18, Self-emp-inc, 38307, 11th, 7, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 30, United-States, <=50K +41, Private, 116391, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 210496, 10th, 6, Widowed, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 168475, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 174386, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 24, United-States, <=50K +39, Private, 166744, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 38, United-States, <=50K +19, Private, 375114, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 373469, Assoc-acdm, 12, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 339667, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 41, United-States, <=50K +39, Private, 91711, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Private, 82049, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 236242, Prof-school, 15, Never-married, Prof-specialty, 
Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +57, Self-emp-inc, 140319, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, <=50K +33, Local-gov, 34080, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +56, Private, 204816, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +60, Private, 187124, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 72310, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +58, Private, 175127, 12th, 8, Married-civ-spouse, Transport-moving, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +48, Federal-gov, 205707, Masters, 14, Married-spouse-absent, Exec-managerial, Not-in-family, White, Female, 10520, 0, 50, United-States, >50K +45, Local-gov, 236586, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 55, United-States, >50K +18, Private, 71792, HS-grad, 9, Never-married, Sales, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +56, Private, 87584, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +48, Self-emp-inc, 136878, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, >50K +40, Private, 287983, Bachelors, 13, Never-married, Tech-support, Not-in-family, Asian-Pac-Islander, Female, 0, 2258, 48, Philippines, <=50K +38, Private, 110607, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 32, United-States, <=50K +58, Private, 109015, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 235071, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 50, United-States, <=50K +63, Private, 88653, 
Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, <=50K +51, Private, 332243, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +22, ?, 291547, 5th-6th, 3, Married-civ-spouse, ?, Wife, Other, Female, 0, 0, 40, Mexico, <=50K +44, Private, 45093, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 70, United-States, <=50K +46, Federal-gov, 161337, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +64, State-gov, 211222, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 295117, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, England, >50K +31, Private, 206541, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 238415, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +21, Private, 29810, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +30, Private, 108023, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 114324, Assoc-voc, 11, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +54, Private, 172281, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2051, 50, United-States, <=50K +59, Local-gov, 197290, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 191177, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 20, United-States, >50K +57, Private, 562558, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 79531, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, 
White, Male, 0, 0, 45, United-States, <=50K +53, Self-emp-inc, 157881, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +58, Self-emp-not-inc, 204816, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +19, Private, 185695, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +39, Self-emp-inc, 167482, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +31, Self-emp-inc, 83748, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, Asian-Pac-Islander, Female, 0, 0, 70, South, <=50K +27, Private, 39232, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Local-gov, 236827, 9th, 5, Separated, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 154410, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 135308, Bachelors, 13, Never-married, Sales, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +33, Private, 204042, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +20, Private, 308239, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 16, United-States, <=50K +55, Private, 183884, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +39, Private, 98948, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 141642, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 162623, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Self-emp-inc, 186934, Bachelors, 13, 
Married-spouse-absent, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 179512, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +25, Private, 391192, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 24, United-States, <=50K +31, Private, 87054, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +51, Private, 30008, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +24, Private, 113466, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +70, Private, 642830, HS-grad, 9, Divorced, Protective-serv, Not-in-family, White, Female, 0, 0, 32, United-States, <=50K +23, Private, 182117, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 162432, HS-grad, 9, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, Self-emp-not-inc, 242184, 7th-8th, 4, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +47, Private, 170850, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 4064, 0, 60, United-States, <=50K +56, Private, 435022, Some-college, 10, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +79, Private, 120707, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 20051, 0, 35, El-Salvador, >50K +20, Private, 170800, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +30, Private, 268575, HS-grad, 9, Never-married, Craft-repair, Unmarried, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +27, Private, 269354, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 25, ?, <=50K +40, Private, 224232, Bachelors, 13, Married-civ-spouse, Sales, Husband, 
White, Male, 0, 0, 40, United-States, >50K +60, ?, 153072, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 5, United-States, <=50K +58, Private, 177368, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +71, Self-emp-not-inc, 163293, Prof-school, 15, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 2, United-States, <=50K +50, Private, 178530, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +29, Local-gov, 183523, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, Iran, <=50K +33, Private, 207267, 10th, 6, Separated, Other-service, Unmarried, White, Female, 3418, 0, 35, United-States, <=50K +60, State-gov, 27037, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, >50K +33, Private, 176711, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +43, Private, 163215, Bachelors, 13, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 35, ?, >50K +33, Private, 394727, 10th, 6, Never-married, Handlers-cleaners, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +33, Private, 195488, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 52, United-States, <=50K +32, State-gov, 443546, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 45, United-States, <=50K +21, Private, 121023, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 9, United-States, <=50K +38, Private, 51838, HS-grad, 9, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 258888, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, >50K +39, State-gov, 189385, Some-college, 10, Separated, Exec-managerial, Unmarried, Black, Female, 0, 0, 30, 
United-States, <=50K +17, Private, 198146, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +21, Private, 337766, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 210525, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 20, United-States, >50K +42, Private, 185602, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +36, Private, 173804, 11th, 7, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 251243, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +37, Self-emp-not-inc, 415847, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 119793, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 181705, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 182360, HS-grad, 9, Separated, Prof-specialty, Unmarried, Other, Female, 0, 0, 60, Puerto-Rico, <=50K +49, Private, 61885, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 146520, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 323790, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 146268, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Federal-gov, 287031, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 8614, 0, 40, United-States, >50K +33, Local-gov, 292217, HS-grad, 9, Divorced, Protective-serv, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +24, 
Private, 88126, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 143046, Masters, 14, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 401623, Some-college, 10, Married-civ-spouse, Tech-support, Husband, Black, Male, 0, 0, 40, Jamaica, >50K +36, Self-emp-not-inc, 283122, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 1902, 60, United-States, >50K +84, Self-emp-not-inc, 155057, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 20, United-States, <=50K +23, Private, 260254, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 152292, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +55, Self-emp-inc, 138594, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 45, United-States, >50K +30, Self-emp-not-inc, 523095, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +46, Private, 175262, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, <=50K +55, Private, 323706, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 40, United-States, >50K +34, Private, 316470, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 163815, Masters, 14, Divorced, Sales, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +27, Private, 72208, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +52, Local-gov, 74784, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +36, Private, 383518, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 99999, 0, 
40, United-States, >50K +25, Self-emp-not-inc, 266668, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 347519, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +24, Private, 336088, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 50, United-States, <=50K +36, Private, 190350, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +31, Private, 204052, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, ?, 31362, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, <=50K +90, Self-emp-not-inc, 155981, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 10566, 0, 50, United-States, <=50K +67, Private, 195161, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 20051, 0, 60, United-States, >50K +22, Self-emp-inc, 269583, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 2580, 0, 40, United-States, <=50K +47, Private, 26994, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +32, Private, 116539, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1887, 55, United-States, >50K +55, Self-emp-not-inc, 189933, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 101283, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +48, Private, 113598, Some-college, 10, Separated, Adm-clerical, Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +21, Private, 188793, HS-grad, 9, Married-civ-spouse, Sales, Husband, Other, Male, 0, 0, 35, United-States, <=50K +33, Private, 109996, Assoc-acdm, 12, Married-spouse-absent, 
Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 195681, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 48, ?, <=50K +47, Private, 436770, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 84253, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 24, United-States, <=50K +44, Self-emp-inc, 383493, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, >50K +23, Private, 216867, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 37, Mexico, <=50K +18, Private, 401051, 10th, 6, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +56, Private, 83196, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +24, Private, 325596, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 35, United-States, <=50K +43, Private, 187322, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Private, 193949, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 60, United-States, <=50K +26, Private, 133373, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 42, United-States, <=50K +42, Private, 113324, HS-grad, 9, Widowed, Sales, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 178818, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +53, Self-emp-not-inc, 152810, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 335997, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 4386, 0, 55, United-States, >50K +40, Private, 436493, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 25, United-States, <=50K +27, 
Private, 704108, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +24, Local-gov, 150084, Some-college, 10, Separated, Protective-serv, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +42, Private, 341204, HS-grad, 9, Divorced, Craft-repair, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 187336, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 204209, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 10, United-States, <=50K +42, Self-emp-not-inc, 206066, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 65, United-States, <=50K +38, Private, 63509, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +63, Self-emp-not-inc, 391121, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +31, Private, 56026, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Self-emp-not-inc, 60981, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 4, United-States, <=50K +21, Private, 228255, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +24, Private, 86745, Bachelors, 13, Married-civ-spouse, Prof-specialty, Other-relative, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +55, Private, 234327, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 59948, 9th, 5, Never-married, Adm-clerical, Unmarried, Black, Female, 114, 0, 20, United-States, <=50K +31, Private, 137814, Some-college, 10, Divorced, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +23, Private, 167692, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K 
+35, Private, 245090, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K +51, Self-emp-not-inc, 256963, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +19, Private, 160033, Some-college, 10, Never-married, Protective-serv, Own-child, White, Female, 0, 0, 30, United-States, <=50K +38, Local-gov, 289430, HS-grad, 9, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +52, Local-gov, 305053, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2051, 40, United-States, <=50K +70, Self-emp-not-inc, 172370, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 25, United-States, <=50K +53, Private, 320510, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +59, Private, 171355, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 65027, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 43, United-States, <=50K +18, Private, 215190, 12th, 8, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +41, ?, 149385, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +19, ?, 169324, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 10, United-States, <=50K +24, Private, 138938, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 557082, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +32, Private, 273287, Some-college, 10, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, Jamaica, <=50K +34, Self-emp-not-inc, 77209, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 60, United-States, >50K +35, Private, 317153, HS-grad, 9, Never-married, 
Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, Private, 95469, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7298, 0, 45, United-States, >50K +18, Private, 302859, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +37, Private, 333651, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +30, Private, 177596, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +40, Self-emp-inc, 157240, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 15024, 0, 30, Iran, >50K +22, Private, 184779, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, Local-gov, 138358, Some-college, 10, Separated, Other-service, Unmarried, Black, Female, 0, 0, 28, United-States, <=50K +70, Private, 176285, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 23, United-States, <=50K +43, Private, 102180, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +77, Self-emp-not-inc, 209507, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Self-emp-not-inc, 229741, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 324546, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 39, United-States, <=50K +51, Private, 337195, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 50, United-States, >50K +58, State-gov, 194068, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +22, Private, 250647, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 12, United-States, <=50K +33, Private, 477106, HS-grad, 9, Married-civ-spouse, 
Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +27, Private, 104329, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 224566, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +32, Private, 169841, Some-college, 10, Divorced, Sales, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +41, Private, 42563, Bachelors, 13, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 25, United-States, >50K +37, Private, 31368, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 132755, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 15, United-States, <=50K +50, Private, 279129, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +31, ?, 86143, HS-grad, 9, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +54, State-gov, 44172, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +23, State-gov, 93076, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +40, Private, 146653, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +29, Private, 221366, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5013, 0, 40, Germany, <=50K +38, Private, 189404, HS-grad, 9, Married-spouse-absent, Other-service, Not-in-family, White, Male, 0, 0, 35, ?, <=50K +30, Private, 172304, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +20, Private, 116666, Some-college, 10, Never-married, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 8, India, <=50K +43, Self-emp-not-inc, 64112, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, 
United-States, <=50K +25, Private, 55718, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 0, 25, United-States, <=50K +39, Private, 126675, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +48, Private, 102112, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Self-emp-not-inc, 226505, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 211527, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 175069, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, Yugoslavia, <=50K +25, Private, 25249, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +57, Private, 73411, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 207185, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 35, Puerto-Rico, >50K +66, Private, 127139, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 41809, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 297449, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 14084, 0, 40, United-States, >50K +46, Private, 141483, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +42, Local-gov, 117227, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +46, Private, 377401, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1902, 70, Canada, >50K +34, Local-gov, 167063, HS-grad, 9, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 
253759, Some-college, 10, Married-civ-spouse, Tech-support, Wife, Black, Female, 0, 0, 40, United-States, <=50K +42, Private, 183096, Some-college, 10, Divorced, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 269654, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +70, ?, 293076, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, <=50K +32, Private, 34104, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Federal-gov, 80057, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, Germany, >50K +42, Self-emp-inc, 369781, 7th-8th, 4, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 25, United-States, <=50K +21, Private, 223811, Assoc-voc, 11, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 163053, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 189461, HS-grad, 9, Never-married, Sales, Other-relative, White, Male, 0, 0, 55, United-States, <=50K +50, Local-gov, 145166, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7298, 0, 40, United-States, >50K +37, Private, 86310, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +19, ?, 263224, 11th, 7, Never-married, ?, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +44, Federal-gov, 280362, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 301031, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +30, Private, 74966, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 24, United-States, <=50K +36, Private, 254493, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 
0, 0, 44, United-States, <=50K +49, Self-emp-not-inc, 204241, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +29, Private, 225024, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Local-gov, 148222, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +75, State-gov, 113868, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 20, United-States, >50K +42, Private, 132633, HS-grad, 9, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 40, ?, <=50K +37, Private, 44780, Assoc-acdm, 12, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 86373, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 25, United-States, <=50K +61, Local-gov, 176753, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 48, United-States, <=50K +33, Private, 164707, Assoc-acdm, 12, Never-married, Exec-managerial, Unmarried, White, Female, 2174, 0, 55, ?, <=50K +50, Local-gov, 370733, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +59, Private, 216851, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 137951, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +22, Private, 185279, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 16, United-States, <=50K +56, Private, 159724, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 103233, Bachelors, 13, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +35, Private, 63509, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, 
United-States, <=50K +57, Private, 174353, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 168109, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 15024, 0, 50, United-States, >50K +27, Private, 159724, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, Self-emp-not-inc, 105010, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 2051, 20, United-States, <=50K +30, Private, 179112, Bachelors, 13, Never-married, Prof-specialty, Own-child, Black, Male, 0, 0, 40, ?, <=50K +46, Private, 364913, 11th, 7, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +48, Self-emp-inc, 155664, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +61, Private, 230568, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +33, Private, 86492, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 87, United-States, <=50K +40, Private, 71305, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, United-States, <=50K +58, Self-emp-inc, 189933, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, United-States, >50K +46, Self-emp-inc, 191978, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2392, 50, United-States, >50K +35, Private, 38948, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Self-emp-inc, 139127, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +37, Private, 301568, 12th, 8, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +64, Private, 149044, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, 
Asian-Pac-Islander, Male, 0, 2057, 60, China, <=50K +41, Private, 197344, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 54, United-States, <=50K +18, Private, 32244, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 594, 0, 30, United-States, <=50K +44, Self-emp-not-inc, 315406, 11th, 7, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 88, United-States, <=50K +41, State-gov, 47170, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Amer-Indian-Eskimo, Female, 0, 0, 48, United-States, >50K +33, State-gov, 208785, Some-college, 10, Separated, Prof-specialty, Not-in-family, White, Male, 10520, 0, 40, United-States, >50K +37, Private, 196338, 9th, 5, Separated, Priv-house-serv, Unmarried, White, Female, 0, 0, 16, Mexico, <=50K +34, Private, 269243, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +24, Federal-gov, 215115, Bachelors, 13, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, ?, <=50K +20, Private, 117767, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 176101, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 138283, Assoc-voc, 11, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, Self-emp-not-inc, 132320, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 45, United-States, <=50K +22, Federal-gov, 471452, Bachelors, 13, Never-married, Tech-support, Own-child, White, Male, 0, 0, 8, United-States, <=50K +55, Private, 147653, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 73, United-States, <=50K +20, Private, 49179, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +26, Private, 174921, Bachelors, 13, Never-married, Prof-specialty, 
Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, Self-emp-inc, 95997, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 70, United-States, <=50K +40, Private, 247245, 9th, 5, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 67072, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +54, ?, 95329, Some-college, 10, Divorced, ?, Own-child, White, Male, 0, 0, 50, United-States, <=50K +24, Private, 107882, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 241825, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 46, United-States, <=50K +18, Private, 79443, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 8, United-States, <=50K +49, Self-emp-not-inc, 233059, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +17, Private, 226980, 12th, 8, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 17, United-States, <=50K +34, Self-emp-not-inc, 181087, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +37, Private, 305597, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +49, Federal-gov, 311671, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +74, Private, 129879, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 15831, 0, 40, United-States, >50K +37, Private, 83375, HS-grad, 9, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 115824, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 1573, 40, United-States, <=50K +40, Private, 141657, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 35, United-States, 
>50K +34, Private, 50276, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 27828, 0, 40, United-States, >50K +30, Private, 177216, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 1740, 40, Haiti, <=50K +44, Private, 228057, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, Puerto-Rico, <=50K +40, Private, 222848, 10th, 6, Married-civ-spouse, Other-service, Wife, Black, Female, 0, 0, 32, United-States, <=50K +58, Private, 121111, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, Greece, <=50K +44, Private, 298885, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 149909, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 25, United-States, >50K +39, Private, 387430, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 18, United-States, <=50K +19, Private, 121972, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +41, Private, 280167, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 70, United-States, >50K +29, State-gov, 191355, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Federal-gov, 112115, Some-college, 10, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +38, ?, 104094, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +27, Private, 211032, Preschool, 1, Married-civ-spouse, Farming-fishing, Other-relative, White, Male, 41310, 0, 24, Mexico, <=50K +54, Private, 199307, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +40, Private, 205175, HS-grad, 9, Widowed, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 37, United-States, <=50K +19, Private, 257750, Some-college, 10, Never-married, 
Sales, Other-relative, White, Female, 0, 0, 25, United-States, <=50K +17, Private, 191260, 11th, 7, Never-married, Other-service, Own-child, White, Male, 594, 0, 10, United-States, <=50K +33, Private, 342730, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, <=50K +80, Private, 249983, 7th-8th, 4, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +24, Self-emp-not-inc, 161508, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 50, United-States, <=50K +28, Private, 338376, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 334308, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 30, United-States, >50K +21, Private, 133471, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 129177, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +19, Private, 178811, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +42, Private, 178537, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +60, Self-emp-not-inc, 235535, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, <=50K +20, ?, 298155, Some-college, 10, Never-married, ?, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +51, Private, 145114, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 194096, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +37, State-gov, 191779, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 159732, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, 
White, Male, 0, 0, 52, United-States, <=50K +42, Federal-gov, 170230, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 14084, 0, 60, United-States, >50K +40, Private, 104719, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 163083, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 403552, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 32, United-States, <=50K +62, Private, 218009, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1977, 60, United-States, >50K +47, Private, 179313, 10th, 6, Divorced, Sales, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +26, Private, 51961, 12th, 8, Never-married, Sales, Other-relative, Black, Male, 0, 0, 51, United-States, <=50K +59, Private, 426001, HS-grad, 9, Married-spouse-absent, Adm-clerical, Unmarried, White, Female, 0, 0, 20, Puerto-Rico, <=50K +70, Local-gov, 176493, Some-college, 10, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 17, United-States, <=50K +26, Private, 124068, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +47, Private, 108510, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 65, United-States, <=50K +25, Private, 181528, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +52, Self-emp-inc, 173754, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 60, United-States, >50K +46, Private, 169699, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +67, Private, 126849, 10th, 6, Married-civ-spouse, Transport-moving, Husband, Amer-Indian-Eskimo, Male, 0, 0, 20, United-States, <=50K +34, Private, 204470, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, 
White, Male, 0, 0, 50, United-States, >50K +53, State-gov, 116367, Some-college, 10, Divorced, Adm-clerical, Other-relative, White, Female, 4650, 0, 40, United-States, <=50K +22, Private, 117363, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +39, Local-gov, 106297, HS-grad, 9, Divorced, Adm-clerical, Own-child, White, Male, 0, 0, 42, United-States, <=50K +54, Self-emp-not-inc, 108933, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +24, Private, 190143, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 246677, HS-grad, 9, Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +38, Private, 175360, 10th, 6, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 2559, 90, United-States, >50K +41, Local-gov, 210259, Masters, 14, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +36, Private, 166304, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 33, United-States, <=50K +43, Private, 303051, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 40, United-States, >50K +39, Private, 49308, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 192262, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 45, United-States, <=50K +49, Local-gov, 192349, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 4650, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 48063, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +43, Private, 170214, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +54, Federal-gov, 51048, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, 
White, Male, 0, 0, 40, United-States, >50K +53, Self-emp-inc, 246562, 5th-6th, 3, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, Mexico, >50K +57, Local-gov, 215175, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +28, Private, 114967, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 464536, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 451996, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 138852, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, State-gov, 353012, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Self-emp-inc, 321822, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 75, United-States, >50K +50, Self-emp-not-inc, 324506, HS-grad, 9, Widowed, Exec-managerial, Unmarried, Asian-Pac-Islander, Female, 0, 0, 48, South, <=50K +36, Private, 162256, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +31, Local-gov, 356689, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 260199, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 10, United-States, <=50K +36, Private, 103605, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 316211, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 308691, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +39, Private, 194404, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, 
United-States, >50K +18, Private, 334427, 10th, 6, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 36, United-States, <=50K +33, Private, 213226, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +35, Private, 342824, HS-grad, 9, Never-married, Tech-support, Not-in-family, White, Female, 1151, 0, 40, United-States, <=50K +23, Private, 33105, Some-college, 10, Never-married, Handlers-cleaners, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +37, Private, 147638, Bachelors, 13, Separated, Other-service, Unmarried, Asian-Pac-Islander, Female, 0, 0, 36, Philippines, <=50K +25, Private, 315643, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 30, United-States, <=50K +51, Federal-gov, 106257, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 342768, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 108960, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, ?, 168071, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, <=50K +32, Private, 136935, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 13, United-States, <=50K +37, Self-emp-not-inc, 188774, Bachelors, 13, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +29, Private, 280344, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 202496, Bachelors, 13, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 37, United-States, <=50K +61, Self-emp-inc, 134768, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 175686, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, 
Private, 194748, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Female, 0, 0, 49, United-States, <=50K +49, Private, 61307, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, Other, Male, 0, 0, 38, United-States, <=50K +51, Self-emp-not-inc, 165001, Masters, 14, Divorced, Exec-managerial, Unmarried, White, Male, 25236, 0, 50, United-States, >50K +34, Private, 325658, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, ?, 201844, HS-grad, 9, Separated, ?, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +20, Private, 505980, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 185336, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 37, United-States, <=50K +49, Self-emp-inc, 362795, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Male, 99999, 0, 80, Mexico, >50K +26, Private, 126829, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +63, Private, 264600, 10th, 6, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, United-States, <=50K +36, Private, 82743, Assoc-acdm, 12, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 55, Iran, <=50K +63, Self-emp-not-inc, 125178, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 128487, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 10, United-States, <=50K +40, Private, 321758, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 128220, 7th-8th, 4, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +49, Private, 176814, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, Canada, <=50K +35, Private, 188069, Masters, 14, Never-married, Prof-specialty, Not-in-family, 
White, Male, 13550, 0, 55, ?, >50K +23, State-gov, 156423, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 20, United-States, <=50K +25, Private, 169905, Assoc-voc, 11, Never-married, Sales, Not-in-family, White, Male, 27828, 0, 40, United-States, >50K +34, ?, 157289, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 176972, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +44, Self-emp-not-inc, 171424, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 2205, 35, United-States, <=50K +33, Private, 91811, HS-grad, 9, Separated, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 203924, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 2597, 0, 45, United-States, <=50K +55, Private, 177484, 11th, 7, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 1672, 40, United-States, <=50K +17, ?, 454614, 11th, 7, Never-married, ?, Own-child, White, Female, 0, 0, 8, United-States, <=50K +75, Self-emp-not-inc, 242108, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 2346, 0, 15, United-States, <=50K +61, Private, 132972, 9th, 5, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +53, Private, 157947, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Local-gov, 177482, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 45, United-States, >50K +48, Private, 246891, Some-college, 10, Widowed, Sales, Unmarried, White, Male, 0, 0, 50, United-States, >50K +28, State-gov, 158834, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +30, ?, 203834, Bachelors, 13, Never-married, ?, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 50, Taiwan, <=50K +29, Private, 110442, HS-grad, 9, 
Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 45, United-States, <=50K +25, Private, 240676, Some-college, 10, Divorced, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 192939, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Local-gov, 260696, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 55, United-States, <=50K +40, Local-gov, 55363, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 144949, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +55, Private, 116878, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 30, United-States, >50K +31, Local-gov, 357954, Assoc-acdm, 12, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +21, ?, 170038, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-not-inc, 190290, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, Italy, <=50K +26, State-gov, 203279, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 2463, 0, 50, India, <=50K +26, Private, 167761, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +44, Private, 138845, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 144844, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 52, United-States, >50K +21, ?, 161930, HS-grad, 9, Never-married, ?, Own-child, Black, Female, 0, 1504, 30, United-States, <=50K +26, Private, 55743, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +40, Self-emp-not-inc, 117721, HS-grad, 9, 
Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +19, Self-emp-not-inc, 116385, 11th, 7, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +58, Private, 301867, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 238913, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Self-emp-not-inc, 123983, Some-college, 10, Married-civ-spouse, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 63, South, <=50K +26, Private, 165510, Bachelors, 13, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +64, Private, 183513, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +42, Self-emp-inc, 119281, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +41, Private, 152629, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 110171, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 211440, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Local-gov, 359259, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 125796, 11th, 7, Separated, Other-service, Not-in-family, Black, Female, 0, 0, 40, Jamaica, <=50K +34, Private, 39609, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 111567, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 45, Germany, >50K +23, Private, 44064, Some-college, 10, Separated, Other-service, Not-in-family, White, Male, 0, 2559, 40, United-States, >50K +35, Self-emp-not-inc, 120066, Some-college, 10, 
Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 0, 0, 60, United-States, <=50K +41, Private, 132633, 11th, 7, Divorced, Priv-house-serv, Unmarried, White, Female, 0, 0, 25, Guatemala, <=50K +39, Private, 192702, Masters, 14, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +41, Private, 166813, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +33, Self-emp-inc, 40444, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 290504, HS-grad, 9, Never-married, Other-service, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 178505, Some-college, 10, Never-married, Exec-managerial, Other-relative, White, Female, 0, 1504, 45, United-States, <=50K +25, Private, 175370, Some-college, 10, Divorced, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +77, Self-emp-not-inc, 72931, 7th-8th, 4, Married-spouse-absent, Adm-clerical, Not-in-family, White, Male, 0, 0, 20, Italy, >50K +33, ?, 234542, Assoc-voc, 11, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +66, Private, 284021, HS-grad, 9, Widowed, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 277974, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 35, United-States, <=50K +44, Private, 111275, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 38, United-States, <=50K +45, Self-emp-inc, 191776, Masters, 14, Divorced, Sales, Unmarried, White, Female, 25236, 0, 42, United-States, >50K +28, Private, 125527, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +19, Private, 38294, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 2597, 0, 40, United-States, <=50K +43, Private, 313022, HS-grad, 9, Married-civ-spouse, 
Machine-op-inspct, Husband, Black, Male, 4386, 0, 40, United-States, >50K +39, Private, 179668, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 15024, 0, 40, United-States, >50K +33, Private, 198660, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +44, Private, 216116, HS-grad, 9, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 40, Jamaica, <=50K +62, Private, 200922, 7th-8th, 4, Widowed, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 153372, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +41, Private, 406603, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 6, Iran, <=50K +23, Local-gov, 248344, Some-college, 10, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 30, United-States, <=50K +48, Private, 240629, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Italy, >50K +38, Private, 314310, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +37, Private, 259785, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +45, Private, 127111, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +29, Private, 178272, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +66, Local-gov, 75134, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +19, Private, 195985, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +23, Private, 221955, 9th, 5, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 39, Mexico, <=50K +34, Private, 177675, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, 
United-States, >50K +39, Private, 182828, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +33, Self-emp-not-inc, 270889, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +43, Private, 183096, Some-college, 10, Separated, Sales, Unmarried, White, Female, 0, 0, 10, United-States, <=50K +27, Private, 336951, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 99, United-States, <=50K +33, State-gov, 295589, Some-college, 10, Separated, Adm-clerical, Own-child, Black, Male, 0, 0, 35, United-States, <=50K +26, Private, 289980, 5th-6th, 3, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, Mexico, <=50K +56, Self-emp-inc, 70720, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Male, 27828, 0, 60, United-States, >50K +46, Private, 163352, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 36, United-States, <=50K +38, Private, 190776, Assoc-acdm, 12, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +90, Private, 313986, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +72, Self-emp-inc, 473748, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 25, United-States, >50K +20, Private, 163003, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Asian-Pac-Islander, Female, 0, 0, 15, United-States, <=50K +29, Private, 183061, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 0, 0, 48, United-States, <=50K +49, Private, 123584, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 75, United-States, <=50K +23, Private, 120910, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +20, Private, 227554, Some-college, 10, Married-spouse-absent, Sales, Own-child, Black, Female, 0, 0, 18, 
United-States, <=50K +57, Private, 182677, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 4508, 0, 40, South, <=50K +46, Private, 214955, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 209768, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +24, Private, 258120, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 55, Jamaica, <=50K +49, Private, 110015, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, Greece, <=50K +54, Private, 152652, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 65, United-States, <=50K +46, Federal-gov, 43206, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1564, 50, United-States, >50K +31, Self-emp-not-inc, 114639, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +43, Self-emp-inc, 221172, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 40, United-States, >50K +18, ?, 128538, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 6, United-States, <=50K +19, Private, 131615, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 353824, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 178417, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 178644, HS-grad, 9, Widowed, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 271665, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +37, ?, 223732, Some-college, 10, Separated, ?, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +21, Federal-gov, 
169003, 12th, 8, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 25, United-States, <=50K +52, State-gov, 338816, Masters, 14, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 70, United-States, >50K +34, Private, 506858, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 32, United-States, >50K +28, Private, 265628, Assoc-voc, 11, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 173495, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 177413, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 31670, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 65, United-States, <=50K +49, Private, 154451, 11th, 7, Divorced, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +35, Private, 265535, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 50, Jamaica, >50K +31, Private, 118941, Some-college, 10, Divorced, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +18, Private, 214617, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +47, Local-gov, 265097, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 4386, 0, 40, United-States, >50K +46, Private, 276087, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 5013, 0, 50, United-States, <=50K +43, Private, 124692, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Federal-gov, 306784, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 4386, 0, 40, United-States, >50K +21, Private, 434102, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, ?, 387641, Some-college, 10, Never-married, 
?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, State-gov, 181824, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1902, 35, United-States, >50K +39, Local-gov, 177907, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 1887, 40, United-States, >50K +58, Private, 87329, 11th, 7, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 48, United-States, <=50K +36, Private, 263130, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 262882, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 37546, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 1902, 35, United-States, >50K +19, Private, 27433, 11th, 7, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 393945, Assoc-voc, 11, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +26, Private, 173927, Assoc-voc, 11, Never-married, Prof-specialty, Own-child, Other, Female, 0, 0, 60, Jamaica, <=50K +38, Private, 343403, Assoc-acdm, 12, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 16, United-States, <=50K +36, Private, 111128, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +40, Private, 193882, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +25, Private, 310864, Bachelors, 13, Never-married, Tech-support, Not-in-family, Black, Male, 0, 0, 40, ?, <=50K +41, Private, 128354, Bachelors, 13, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 25, United-States, >50K +33, Private, 113364, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +63, ?, 198559, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 16, United-States, 
<=50K +51, Private, 136913, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 115488, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 154227, Assoc-voc, 11, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 279667, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +30, Self-emp-not-inc, 281030, HS-grad, 9, Never-married, Sales, Unmarried, White, Male, 0, 0, 66, United-States, <=50K +19, Private, 283945, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 25, United-States, <=50K +47, Private, 454989, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +26, Private, 391349, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, State-gov, 166704, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 14, United-States, <=50K +36, Private, 151835, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 60, United-States, >50K +60, Private, 199085, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 61487, HS-grad, 9, Never-married, Prof-specialty, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +19, Private, 120251, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 14, United-States, <=50K +42, Private, 273230, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 90, United-States, <=50K +36, Private, 358373, HS-grad, 9, Separated, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 36, United-States, <=50K +35, Private, 267891, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 38, United-States, <=50K +22, Private, 234880, Some-college, 10, 
Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 25, United-States, <=50K +54, Private, 48358, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 96452, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +55, Private, 204751, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 65, United-States, <=50K +57, Private, 375868, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 413373, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 36, United-States, <=50K +24, Private, 537222, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +35, Local-gov, 33975, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Self-emp-inc, 162327, 11th, 7, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 182691, HS-grad, 9, Divorced, Exec-managerial, Own-child, White, Male, 0, 0, 44, United-States, <=50K +36, Private, 300829, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 42, United-States, <=50K +51, Local-gov, 114508, 9th, 5, Separated, Other-service, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +46, Self-emp-inc, 214627, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +42, Private, 129684, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, Black, Female, 5455, 0, 50, United-States, <=50K +25, State-gov, 120041, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 361138, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +37, Private, 76893, HS-grad, 9, Married-civ-spouse, Craft-repair, 
Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 205424, Bachelors, 13, Divorced, Sales, Unmarried, White, Male, 0, 0, 40, United-States, >50K +61, Private, 176839, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, United-States, >50K +40, Private, 229148, 12th, 8, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, Jamaica, <=50K +58, Self-emp-inc, 154537, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 20, United-States, >50K +52, Private, 181901, HS-grad, 9, Married-spouse-absent, Farming-fishing, Other-relative, White, Male, 0, 0, 20, Mexico, <=50K +18, Private, 152004, 11th, 7, Never-married, Other-service, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +27, Private, 205188, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Self-emp-not-inc, 30840, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5013, 0, 45, United-States, <=50K +63, Private, 66634, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 16, United-States, <=50K +38, Self-emp-not-inc, 180220, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +31, Private, 291052, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 2051, 40, United-States, <=50K +40, Self-emp-not-inc, 99651, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +41, Private, 327723, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 32291, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 2174, 0, 40, United-States, <=50K +31, Private, 345122, Prof-school, 15, Divorced, Prof-specialty, Not-in-family, White, Male, 14084, 0, 50, United-States, >50K +32, Private, 127384, Assoc-voc, 11, Married-civ-spouse, 
Craft-repair, Husband, White, Male, 0, 0, 55, United-States, >50K +30, Private, 363296, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, Black, Male, 0, 0, 72, United-States, <=50K +39, Local-gov, 86551, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 1876, 40, United-States, <=50K +28, Private, 30070, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +31, Private, 595000, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, Black, Female, 0, 0, 35, United-States, <=50K +21, ?, 152328, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 20, United-States, <=50K +33, ?, 177824, HS-grad, 9, Separated, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +44, State-gov, 111483, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 199555, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 25, United-States, <=50K +42, Private, 50018, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, ?, <=50K +36, Private, 218490, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +39, Private, 49020, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 1974, 40, United-States, <=50K +61, Private, 213321, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1672, 40, United-States, <=50K +31, Private, 159187, HS-grad, 9, Divorced, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 83033, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 25, Germany, <=50K +39, Self-emp-not-inc, 31848, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 2829, 0, 90, United-States, <=50K +34, Self-emp-not-inc, 24961, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 
80, United-States, <=50K +21, Private, 182117, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 15, United-States, <=50K +75, Self-emp-not-inc, 146576, Bachelors, 13, Widowed, Prof-specialty, Unmarried, White, Male, 0, 0, 48, United-States, >50K +21, Private, 176690, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 24, United-States, <=50K +81, Private, 122651, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 15, United-States, <=50K +54, Self-emp-inc, 149650, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, Canada, <=50K +34, Private, 454508, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 65, Iran, <=50K +54, Self-emp-not-inc, 269068, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 99999, 0, 50, Philippines, >50K +41, Private, 266530, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Amer-Indian-Eskimo, Male, 0, 0, 45, United-States, <=50K +61, ?, 198542, Bachelors, 13, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +63, Private, 133144, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 2580, 0, 20, United-States, <=50K +24, Private, 217961, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 221661, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, Mexico, <=50K +44, Local-gov, 60735, Bachelors, 13, Divorced, Prof-specialty, Own-child, White, Female, 0, 0, 60, United-States, <=50K +47, Self-emp-not-inc, 121124, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 48588, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 48087, 7th-8th, 4, Divorced, Craft-repair, Not-in-family, White, Male, 0, 1590, 40, United-States, 
<=50K +53, Self-emp-not-inc, 240138, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +63, Private, 273010, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 3471, 0, 40, United-States, <=50K +44, Private, 104196, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +37, Private, 230035, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 46, United-States, >50K +28, Private, 38918, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, Germany, >50K +71, ?, 205011, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 10, United-States, <=50K +57, Private, 176079, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 180052, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 10, United-States, <=50K +33, Local-gov, 173005, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1848, 45, United-States, >50K +30, Private, 378723, Some-college, 10, Divorced, Adm-clerical, Own-child, White, Female, 0, 0, 55, United-States, <=50K +20, Private, 233624, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 192591, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +54, Private, 249860, 11th, 7, Divorced, Priv-house-serv, Unmarried, Black, Female, 0, 0, 10, United-States, <=50K +20, Private, 247564, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 35, United-States, <=50K +34, Private, 238912, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 190227, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +29, State-gov, 293287, Some-college, 10, 
Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 180807, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 250217, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 70, United-States, <=50K +19, Private, 217418, Some-college, 10, Never-married, Adm-clerical, Other-relative, Black, Female, 0, 0, 38, United-States, <=50K +22, Local-gov, 137510, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +59, State-gov, 163047, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +18, Private, 577521, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 13, United-States, <=50K +22, Private, 221533, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +42, Local-gov, 255675, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 114079, Assoc-acdm, 12, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 155781, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 243762, 11th, 7, Separated, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 113062, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 7, United-States, <=50K +67, Private, 217028, Masters, 14, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 110723, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 25, United-States, <=50K +47, Federal-gov, 191858, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Private, 179423, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, 
White, Female, 0, 0, 5, United-States, <=50K +20, Private, 339588, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, Peru, <=50K +22, Private, 206815, HS-grad, 9, Never-married, Sales, Unmarried, White, Female, 0, 0, 40, Peru, <=50K +47, State-gov, 103743, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 235683, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +64, ?, 207321, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, <=50K +35, State-gov, 197495, Some-college, 10, Divorced, Exec-managerial, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +52, Federal-gov, 424012, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 178469, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +73, Self-emp-inc, 92886, 10th, 6, Widowed, Sales, Unmarried, White, Female, 0, 0, 40, Canada, <=50K +38, Self-emp-not-inc, 214008, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +59, Self-emp-not-inc, 325732, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 52, United-States, >50K +35, Private, 28572, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 4064, 0, 35, United-States, <=50K +18, Private, 118376, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +24, Private, 51799, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +33, Local-gov, 115488, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 190621, Some-college, 10, Divorced, Exec-managerial, Other-relative, Black, Female, 0, 0, 55, United-States, <=50K +55, Private, 
193568, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 192878, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 264663, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 60, United-States, <=50K +22, Private, 234731, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 308373, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +45, Private, 205644, HS-grad, 9, Separated, Tech-support, Not-in-family, White, Female, 0, 0, 26, United-States, <=50K +47, Local-gov, 321851, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +56, Private, 206399, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 124563, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +32, State-gov, 198211, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +17, Private, 130795, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +44, Private, 71269, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +32, Self-emp-not-inc, 319280, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +35, Private, 125933, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +27, Private, 107236, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 32732, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +68, Private, 284763, 11th, 7, Divorced, 
Transport-moving, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +20, Private, 112668, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +33, Private, 376483, Some-college, 10, Divorced, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +24, Private, 402778, 9th, 5, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 12, United-States, <=50K +48, Private, 36177, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +45, Private, 125489, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +48, Private, 304791, Some-college, 10, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 209205, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +60, ?, 112821, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 35, United-States, >50K +39, Local-gov, 178100, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 70261, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +23, State-gov, 186634, 12th, 8, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 32958, Some-college, 10, Separated, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 254746, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 158746, HS-grad, 9, Never-married, Other-service, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 140854, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 51506, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, 
Male, 0, 0, 40, United-States, <=50K +45, Private, 189564, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 42, United-States, >50K +37, Federal-gov, 325538, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, >50K +58, Private, 213975, Assoc-voc, 11, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +67, Self-emp-not-inc, 431426, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 2, United-States, <=50K +48, Private, 199763, HS-grad, 9, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 8, United-States, <=50K +63, Private, 161563, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 48, United-States, <=50K +24, Local-gov, 252024, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 72, United-States, >50K +43, Private, 43945, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 178487, HS-grad, 9, Divorced, Transport-moving, Own-child, White, Male, 0, 0, 60, United-States, <=50K +32, Private, 604506, HS-grad, 9, Married-civ-spouse, Transport-moving, Own-child, White, Male, 0, 0, 72, Mexico, <=50K +36, Private, 228157, Some-college, 10, Never-married, Craft-repair, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, Laos, <=50K +43, Private, 199191, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +27, Private, 189775, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +17, Private, 171080, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 12, United-States, <=50K +45, Private, 117310, Bachelors, 13, Divorced, Sales, Not-in-family, White, Female, 0, 0, 46, United-States, <=50K +41, Self-emp-inc, 82049, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, 
Self-emp-not-inc, 126094, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 38, United-States, <=50K +18, ?, 202516, HS-grad, 9, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +48, Local-gov, 246392, Assoc-acdm, 12, Separated, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +51, ?, 69328, Assoc-voc, 11, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, >50K +26, Private, 292803, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 24, United-States, <=50K +54, Private, 286989, Preschool, 1, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +22, Private, 190483, Some-college, 10, Divorced, Sales, Own-child, White, Female, 0, 0, 48, Iran, <=50K +19, Private, 235849, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 35, United-States, <=50K +47, Private, 359766, 7th-8th, 4, Divorced, Handlers-cleaners, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +32, Private, 128016, HS-grad, 9, Married-spouse-absent, Other-service, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +46, Private, 360096, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +30, Private, 170154, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +35, Private, 337286, Masters, 14, Never-married, Exec-managerial, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +52, Private, 204322, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 5013, 0, 40, United-States, <=50K +73, Self-emp-not-inc, 143833, 12th, 8, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 18, United-States, <=50K +17, Private, 365613, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 10, Canada, <=50K +32, Private, 100135, Bachelors, 13, 
Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +43, Local-gov, 180096, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +19, ?, 371827, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 40, Portugal, <=50K +26, Private, 61270, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, Other, Female, 0, 0, 40, Columbia, <=50K +41, Federal-gov, 564135, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Private, 198759, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 60, United-States, >50K +52, State-gov, 303462, Some-college, 10, Separated, Protective-serv, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 193106, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +57, Private, 250201, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 52, United-States, <=50K +35, Private, 200426, Assoc-voc, 11, Married-spouse-absent, Prof-specialty, Unmarried, White, Female, 0, 0, 44, United-States, <=50K +33, Private, 222654, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 53366, 7th-8th, 4, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 132222, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 60, United-States, <=50K +17, Private, 100828, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 20, United-States, <=50K +49, Private, 31264, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +39, Private, 202027, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +34, Self-emp-not-inc, 168906, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 
50, United-States, <=50K +37, Self-emp-not-inc, 255454, Some-college, 10, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +22, Private, 245524, 12th, 8, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +27, Private, 386040, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K +21, Private, 35424, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 25, United-States, <=50K +59, ?, 93655, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 152629, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 3103, 0, 40, United-States, >50K +53, Self-emp-not-inc, 151159, 10th, 6, Married-spouse-absent, Transport-moving, Not-in-family, White, Male, 0, 0, 99, United-States, <=50K +26, Private, 410240, 11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 138970, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +39, Private, 269722, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, <=50K +34, Private, 223678, HS-grad, 9, Never-married, Other-service, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 32, United-States, <=50K +54, State-gov, 197184, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 38, United-States, <=50K +36, Private, 143486, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 7298, 0, 50, United-States, >50K +60, Private, 160625, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5013, 0, 40, United-States, <=50K +50, Private, 140516, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +48, Local-gov, 85341, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, 
Male, 0, 0, 45, United-States, <=50K +35, Private, 108293, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 2205, 40, United-States, <=50K +40, Self-emp-not-inc, 192507, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +30, Private, 186932, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, <=50K +65, Self-emp-not-inc, 223580, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6514, 0, 40, United-States, >50K +31, Private, 236861, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +46, Local-gov, 327886, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +67, ?, 407618, 9th, 5, Divorced, ?, Not-in-family, White, Female, 2050, 0, 40, United-States, <=50K +62, Self-emp-inc, 197060, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +38, Private, 229180, Bachelors, 13, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 40, Cuba, <=50K +24, Private, 284317, Bachelors, 13, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +24, Private, 73514, Some-college, 10, Never-married, Sales, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 50, Philippines, <=50K +27, Private, 47907, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 48, United-States, <=50K +43, State-gov, 134782, Assoc-acdm, 12, Married-spouse-absent, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +48, Private, 118831, HS-grad, 9, Divorced, Handlers-cleaners, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, South, <=50K +41, Private, 299505, HS-grad, 9, Separated, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +30, Private, 267161, Some-college, 10, Married-civ-spouse, 
Tech-support, Wife, Black, Female, 0, 0, 45, United-States, <=50K +38, Private, 119177, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 327886, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +45, Private, 187730, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 109015, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +46, Self-emp-not-inc, 110015, 7th-8th, 4, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 75, Greece, <=50K +24, Private, 104146, Bachelors, 13, Never-married, Prof-specialty, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +31, Local-gov, 50442, Some-college, 10, Never-married, Adm-clerical, Own-child, Amer-Indian-Eskimo, Female, 0, 0, 25, United-States, <=50K +35, Private, 57640, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +37, Local-gov, 333664, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 224858, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +56, Private, 290641, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +60, ?, 191118, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 1848, 40, United-States, >50K +25, Private, 34402, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 1590, 60, United-States, <=50K +33, Private, 245378, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 179136, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 116788, HS-grad, 9, Never-married, 
Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 129699, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Federal-gov, 39606, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, England, >50K +44, Self-emp-inc, 95150, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +63, Private, 102479, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 199191, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 30, United-States, <=50K +31, Private, 229636, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, Mexico, <=50K +26, Private, 53833, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 42, United-States, <=50K +37, Self-emp-inc, 27997, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +60, ?, 124487, Some-college, 10, Divorced, ?, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +33, Private, 111363, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +38, Private, 107630, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 134287, Assoc-voc, 11, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +46, Self-emp-inc, 283004, Assoc-voc, 11, Divorced, Exec-managerial, Unmarried, Asian-Pac-Islander, Female, 0, 0, 63, Thailand, <=50K +24, Private, 33616, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +47, Local-gov, 121124, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +27, Private, 188189, Some-college, 10, Never-married, Tech-support, 
Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +46, Private, 106255, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Federal-gov, 282830, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, >50K +47, Private, 243904, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Male, 0, 0, 40, Honduras, <=50K +69, Private, 165017, HS-grad, 9, Widowed, Machine-op-inspct, Unmarried, White, Male, 2538, 0, 40, United-States, <=50K +32, Private, 131584, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K +51, Private, 427781, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +36, Private, 334291, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Local-gov, 173224, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 87507, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 60, India, <=50K +32, Private, 187560, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3908, 0, 40, United-States, <=50K +27, Private, 204497, 10th, 6, Divorced, Transport-moving, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 75, United-States, <=50K +60, Private, 230545, 7th-8th, 4, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, Cuba, <=50K +31, Private, 118161, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 150499, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Local-gov, 96554, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 288551, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 52, 
United-States, >50K +69, Self-emp-not-inc, 104003, 7th-8th, 4, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +54, Self-emp-inc, 124963, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +56, Private, 198388, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Federal-gov, 126204, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 91709, Assoc-acdm, 12, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +34, Self-emp-not-inc, 152109, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +24, Self-emp-not-inc, 191954, 7th-8th, 4, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 50, United-States, <=50K +63, Private, 108097, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 10566, 0, 45, United-States, <=50K +29, Local-gov, 289991, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +64, Private, 92115, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 320277, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 33610, HS-grad, 9, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 60, United-States, <=50K +36, Private, 168276, 10th, 6, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +55, State-gov, 175127, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7688, 0, 38, United-States, >50K +37, Private, 254973, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Wife, White, Female, 0, 0, 40, United-States, >50K +37, Private, 95336, 10th, 6, Never-married, Machine-op-inspct, Not-in-family, White, 
Male, 0, 0, 65, United-States, <=50K +63, Private, 346975, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 7688, 0, 36, United-States, >50K +33, Private, 227282, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +19, Private, 138153, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 10, United-States, <=50K +57, Local-gov, 174132, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 1977, 40, United-States, >50K +31, Private, 182237, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 4386, 0, 45, United-States, >50K +20, ?, 111252, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K +58, Local-gov, 217775, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +20, ?, 168863, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 20, United-States, <=50K +25, Private, 394503, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 141657, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 125441, 11th, 7, Never-married, Other-service, Own-child, White, Male, 1055, 0, 20, United-States, <=50K +26, Private, 172230, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 282944, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +45, Local-gov, 55377, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, <=50K +35, State-gov, 49352, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 38, United-States, <=50K +32, Private, 213887, Some-college, 10, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 0, 45, United-States, <=50K +61, 
Self-emp-not-inc, 24046, HS-grad, 9, Widowed, Other-service, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +26, State-gov, 208122, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 15, United-States, <=50K +56, Private, 176118, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3103, 0, 40, United-States, >50K +22, Private, 227994, Some-college, 10, Married-spouse-absent, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 39, United-States, <=50K +49, Private, 215389, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 48, United-States, <=50K +40, Private, 99434, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 12, United-States, <=50K +37, Private, 190964, HS-grad, 9, Separated, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, ?, 113700, Bachelors, 13, Never-married, ?, Own-child, White, Male, 0, 0, 50, United-States, <=50K +28, Private, 259840, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +27, Private, 168827, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Self-emp-inc, 28984, Assoc-voc, 11, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +49, Private, 182211, 9th, 5, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 45, United-States, <=50K +41, Private, 82393, Some-college, 10, Never-married, Craft-repair, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +28, Private, 183639, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 21, United-States, <=50K +38, Private, 342448, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +47, State-gov, 469907, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 1740, 40, United-States, <=50K +28, 
Local-gov, 211920, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, State-gov, 33658, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 50, United-States, >50K +41, Federal-gov, 34178, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 400630, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 36, United-States, >50K +73, Self-emp-not-inc, 161251, HS-grad, 9, Widowed, Craft-repair, Not-in-family, White, Male, 0, 0, 24, United-States, <=50K +21, Private, 255685, Some-college, 10, Never-married, Other-service, Own-child, Black, Male, 0, 0, 40, Outlying-US(Guam-USVI-etc), <=50K +38, Private, 199256, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +64, ?, 143716, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 2, United-States, <=50K +47, Private, 221666, Some-college, 10, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 145409, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 15024, 0, 60, Canada, >50K +24, Private, 39615, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +44, Private, 104440, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 151382, 7th-8th, 4, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 974, 40, United-States, <=50K +61, Self-emp-not-inc, 503675, Some-college, 10, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 60, United-States, >50K +58, Private, 306233, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 40, United-States, >50K +51, Private, 216475, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 1564, 43, United-States, >50K +49, Private, 50748, Some-college, 
10, Divorced, Sales, Unmarried, White, Female, 0, 0, 55, England, <=50K +23, Private, 107190, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +19, Private, 206874, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +21, Private, 83141, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 53, United-States, <=50K +56, Private, 444089, 11th, 7, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 141896, 12th, 8, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Federal-gov, 33487, Some-college, 10, Divorced, Tech-support, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 20, United-States, <=50K +41, Private, 65372, Doctorate, 16, Divorced, Sales, Unmarried, White, Female, 0, 0, 50, United-States, >50K +30, Private, 341346, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 343403, Doctorate, 16, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 20, ?, <=50K +47, Private, 287480, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +59, Self-emp-inc, 107287, 10th, 6, Widowed, Exec-managerial, Unmarried, White, Female, 0, 2559, 50, United-States, >50K +55, Private, 199067, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 32, United-States, <=50K +22, ?, 182771, Assoc-voc, 11, Never-married, ?, Own-child, Asian-Pac-Islander, Male, 0, 0, 20, United-States, <=50K +31, Private, 159737, 10th, 6, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +30, Private, 110643, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 4386, 0, 40, United-States, >50K +24, Private, 117583, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 48, United-States, <=50K +49, 
Self-emp-not-inc, 43479, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 203003, 7th-8th, 4, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 25, Germany, <=50K +50, Private, 133963, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +38, Private, 227794, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +20, Self-emp-not-inc, 112137, Some-college, 10, Never-married, Prof-specialty, Other-relative, Asian-Pac-Islander, Female, 0, 0, 20, South, <=50K +49, Self-emp-not-inc, 110457, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +45, Private, 281565, HS-grad, 9, Widowed, Other-service, Other-relative, Asian-Pac-Islander, Female, 0, 0, 50, South, <=50K +46, Federal-gov, 297906, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 50, United-States, >50K +19, Private, 151506, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +31, Federal-gov, 139455, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, Cuba, <=50K +38, Private, 26987, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +56, Self-emp-not-inc, 233312, Masters, 14, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 161092, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +58, Local-gov, 98361, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +28, Private, 188928, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 164922, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, 
United-States, <=50K +46, Private, 185673, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 193598, Preschool, 1, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +56, Private, 274111, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, United-States, >50K +32, Private, 245482, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, <=50K +56, Private, 160932, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 44, United-States, >50K +50, Private, 44368, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, ?, 291374, HS-grad, 9, Separated, ?, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +30, Private, 280927, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 222993, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, Federal-gov, 25240, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 204052, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +18, Private, 74054, 11th, 7, Never-married, Sales, Own-child, Other, Female, 0, 0, 20, ?, <=50K +46, Private, 169042, 10th, 6, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 25, Ecuador, <=50K +31, Private, 104509, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 65, United-States, >50K +38, Local-gov, 185394, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 7688, 0, 40, United-States, >50K +44, Local-gov, 254146, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +55, Self-emp-inc, 227856, 
Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2415, 50, United-States, >50K +19, Private, 183041, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +45, Private, 107682, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +50, Self-emp-inc, 287598, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 70, United-States, <=50K +53, Private, 182186, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Dominican-Republic, <=50K +41, Self-emp-inc, 194636, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 99999, 0, 65, United-States, >50K +45, Private, 112305, Some-college, 10, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +21, Private, 212661, 10th, 6, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 39, United-States, <=50K +37, Private, 32709, Bachelors, 13, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 40, United-States, >50K +42, Federal-gov, 46366, HS-grad, 9, Separated, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 24106, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 30, United-States, <=50K +46, Private, 170850, Bachelors, 13, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 1590, 40, ?, <=50K +45, Self-emp-not-inc, 40666, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 60, United-States, <=50K +32, Private, 182975, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +30, Private, 345122, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, ?, 208311, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 80, United-States, >50K +37, Private, 120045, HS-grad, 9, Divorced, 
Machine-op-inspct, Not-in-family, White, Female, 0, 0, 56, United-States, <=50K +18, ?, 201299, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 152940, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 243580, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 182128, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 6497, 0, 50, United-States, <=50K +36, ?, 176458, HS-grad, 9, Divorced, ?, Unmarried, White, Female, 0, 0, 28, United-States, <=50K +33, Private, 101562, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 108699, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 175878, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Female, 0, 0, 40, United-States, <=50K +34, Local-gov, 177675, Some-college, 10, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +33, Private, 213887, HS-grad, 9, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 357619, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 60, Germany, <=50K +23, Private, 435835, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 1669, 55, United-States, <=50K +39, Private, 165799, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 71469, Assoc-acdm, 12, Never-married, Sales, Own-child, White, Female, 0, 0, 25, United-States, <=50K +19, Private, 229745, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 20, United-States, <=50K +47, Private, 284916, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 45, United-States, >50K +46, 
Private, 28419, Assoc-voc, 11, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +47, Private, 26950, Masters, 14, Divorced, Sales, Not-in-family, White, Female, 0, 0, 6, United-States, <=50K +47, Self-emp-not-inc, 107231, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +52, Local-gov, 512103, Some-college, 10, Divorced, Transport-moving, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 245090, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K +58, Private, 314153, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Private, 243988, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +54, Self-emp-not-inc, 82551, Assoc-voc, 11, Married-civ-spouse, Tech-support, Other-relative, White, Female, 0, 0, 10, United-States, <=50K +20, Private, 42706, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +25, Private, 235795, Assoc-acdm, 12, Never-married, Sales, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +25, Self-emp-not-inc, 108001, 9th, 5, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 15, United-States, <=50K +36, State-gov, 112497, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 1876, 44, United-States, <=50K +69, Self-emp-not-inc, 128206, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 30, United-States, <=50K +28, Private, 224634, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 45, United-States, >50K +20, Private, 362999, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +21, Private, 346693, 7th-8th, 4, Never-married, Farming-fishing, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +37, 
Private, 175759, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 99199, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 32, United-States, <=50K +25, ?, 219987, Assoc-acdm, 12, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 13, United-States, <=50K +39, Private, 143445, HS-grad, 9, Married-civ-spouse, Other-service, Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +34, Private, 118710, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Local-gov, 224185, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 118972, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +29, Private, 165360, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 38950, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 89, United-States, <=50K +42, Self-emp-inc, 277256, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 60, United-States, >50K +29, Private, 247151, 11th, 7, Divorced, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 213722, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 209955, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 25, United-States, <=50K +41, Private, 174395, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 138626, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 1876, 50, United-States, <=50K +22, ?, 179973, Assoc-voc, 11, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 200207, HS-grad, 9, Divorced, 
Handlers-cleaners, Own-child, White, Male, 0, 0, 44, United-States, <=50K +19, Private, 156587, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 38, United-States, <=50K +24, Private, 33016, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 197496, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 30, ?, <=50K +32, Private, 153588, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 99736, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Male, 15020, 0, 50, United-States, >50K +36, Private, 284166, HS-grad, 9, Never-married, Sales, Unmarried, White, Male, 0, 0, 60, United-States, >50K +18, Private, 716066, 10th, 6, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 30, United-States, <=50K +27, Private, 188519, HS-grad, 9, Divorced, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 109080, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +52, Private, 174421, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +24, Private, 259351, Some-college, 10, Never-married, Craft-repair, Unmarried, Amer-Indian-Eskimo, Male, 0, 0, 40, Mexico, <=50K +42, Federal-gov, 284403, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +39, Private, 85319, Prof-school, 15, Married-civ-spouse, Prof-specialty, Wife, White, Female, 7688, 0, 60, United-States, >50K +20, ?, 201766, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +20, State-gov, 340475, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 10, United-States, <=50K +39, Private, 487486, HS-grad, 9, Widowed, Handlers-cleaners, Unmarried, White, Male, 0, 0, 40, ?, <=50K +68, ?, 484298, 11th, 7, 
Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, <=50K +35, Private, 170617, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 48, United-States, <=50K +54, Private, 94055, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 117779, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 209770, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 8, United-States, <=50K +20, Private, 317443, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 15, United-States, <=50K +64, ?, 140237, Preschool, 1, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 107411, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +36, Self-emp-not-inc, 122493, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 47, United-States, <=50K +44, Self-emp-inc, 195124, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, ?, <=50K +38, Private, 101978, Some-college, 10, Separated, Machine-op-inspct, Not-in-family, White, Male, 0, 2258, 55, United-States, >50K +22, Private, 335453, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +56, Private, 318329, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 100321, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +24, Self-emp-not-inc, 81145, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 75, United-States, <=50K +22, Private, 62865, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 176262, Some-college, 10, 
Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 30, United-States, <=50K +42, Private, 168103, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Local-gov, 208174, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 55, United-States, <=50K +19, Private, 188815, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 34095, 0, 20, United-States, <=50K +67, Self-emp-not-inc, 226092, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 44, United-States, <=50K +20, Private, 212668, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +32, Private, 381583, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Other, Male, 0, 0, 40, United-States, <=50K +46, Private, 239439, HS-grad, 9, Separated, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 172493, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +44, Private, 239876, Bachelors, 13, Divorced, Prof-specialty, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +65, ?, 221881, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, Mexico, <=50K +37, Local-gov, 218184, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1887, 40, United-States, >50K +27, Self-emp-not-inc, 206889, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +35, Private, 110668, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 35, United-States, <=50K +30, Private, 211028, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +64, Local-gov, 202984, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3137, 0, 40, United-States, <=50K +48, Private, 20296, HS-grad, 9, Divorced, Adm-clerical, 
Not-in-family, White, Female, 0, 0, 37, United-States, >50K +35, Private, 194690, 7th-8th, 4, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Self-emp-not-inc, 204984, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +63, Self-emp-not-inc, 35021, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 1977, 32, China, >50K +40, Self-emp-not-inc, 238574, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +33, Private, 345360, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 192381, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +25, Private, 479765, 7th-8th, 4, Never-married, Sales, Other-relative, White, Male, 0, 0, 45, Guatemala, <=50K +45, Self-emp-inc, 34091, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 38, United-States, >50K +30, Private, 151773, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +53, Private, 299080, Some-college, 10, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +63, Private, 135339, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 2105, 0, 40, Vietnam, <=50K +27, Local-gov, 52156, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 60, United-States, <=50K +31, Private, 318647, 11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 80145, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +39, State-gov, 343646, Bachelors, 13, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 40, Mexico, >50K +42, Self-emp-not-inc, 198692, Some-college, 
10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +19, Private, 266635, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 30, United-States, <=50K +31, Private, 197672, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 185846, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 315110, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 20, United-States, <=50K +27, Private, 220754, Doctorate, 16, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 64292, HS-grad, 9, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 126060, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +52, Private, 78012, HS-grad, 9, Widowed, Sales, Unmarried, White, Female, 0, 1762, 40, United-States, <=50K +32, Private, 210562, Assoc-voc, 11, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 46, United-States, <=50K +23, Private, 350181, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 233421, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 20, United-States, <=50K +53, Private, 167170, HS-grad, 9, Widowed, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +18, Private, 260801, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +41, Private, 173370, Bachelors, 13, Separated, Sales, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +27, Private, 135520, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, Dominican-Republic, <=50K +30, Private, 121308, Some-college, 10, Divorced, Sales, Own-child, White, Male, 0, 0, 40, 
United-States, <=50K +41, Private, 444743, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 65225, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +58, State-gov, 136982, HS-grad, 9, Married-spouse-absent, Other-service, Unmarried, Black, Female, 0, 0, 40, Honduras, <=50K +45, State-gov, 271962, Bachelors, 13, Divorced, Protective-serv, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 204046, 10th, 6, Divorced, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 225823, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 183009, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Other, Female, 0, 1590, 40, United-States, <=50K +50, Private, 121038, Assoc-voc, 11, Married-civ-spouse, Other-service, Wife, Black, Female, 0, 0, 40, United-States, <=50K +26, Private, 49092, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 148709, HS-grad, 9, Separated, Handlers-cleaners, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 209205, Bachelors, 13, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +36, Local-gov, 285865, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +22, Federal-gov, 216129, HS-grad, 9, Never-married, Adm-clerical, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +37, Federal-gov, 40955, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, Japan, <=50K +54, Private, 197189, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 33001, HS-grad, 9, Divorced, Farming-fishing, Unmarried, White, Male, 0, 0, 50, United-States, 
<=50K +44, Private, 227399, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, United-States, <=50K +38, Private, 164050, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 40, United-States, >50K +49, Private, 259087, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 236262, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 12, United-States, <=50K +26, Private, 177929, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +48, Private, 166929, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, ?, >50K +32, Private, 199963, 11th, 7, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +35, State-gov, 98776, HS-grad, 9, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 135056, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +40, Self-emp-not-inc, 55363, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3411, 0, 40, United-States, <=50K +42, State-gov, 102343, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 72, India, >50K +30, Private, 231263, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 226913, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +36, Private, 129573, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +31, Private, 191001, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, Federal-gov, 69345, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +38, Private, 204556, HS-grad, 9, 
Divorced, Craft-repair, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 192626, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +45, Private, 202812, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Private, 405177, 10th, 6, Separated, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 227890, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 46, United-States, >50K +33, Private, 101352, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +49, Private, 82572, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +28, Private, 132686, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Local-gov, 149210, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, Black, Male, 15024, 0, 40, United-States, >50K +27, Private, 245661, HS-grad, 9, Separated, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +47, Self-emp-inc, 483596, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 2885, 0, 32, United-States, <=50K +42, State-gov, 104663, Doctorate, 16, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, Italy, >50K +30, Private, 347166, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +37, Local-gov, 108540, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 333305, Doctorate, 16, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 35, United-States, <=50K +51, Private, 155408, HS-grad, 9, Married-spouse-absent, Sales, Not-in-family, Black, Female, 0, 0, 38, United-States, <=50K +27, Federal-gov, 246372, HS-grad, 9, 
Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Private, 30290, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 347321, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-inc, 205852, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +40, Federal-gov, 163215, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, ?, <=50K +54, State-gov, 93449, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K +47, Self-emp-inc, 116927, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 42, United-States, >50K +35, Private, 164526, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Yugoslavia, >50K +33, Private, 31573, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 125159, Some-college, 10, Never-married, Adm-clerical, Other-relative, Black, Male, 0, 0, 40, Haiti, <=50K +39, State-gov, 201105, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 55, United-States, >50K +25, ?, 122745, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 1602, 40, United-States, <=50K +33, Private, 150570, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 118941, 11th, 7, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, Ireland, <=50K +53, Private, 141388, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 174714, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 35, United-States, <=50K +31, State-gov, 75755, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, 
White, Male, 7298, 0, 55, United-States, >50K +63, Private, 133144, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +21, Self-emp-not-inc, 318865, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +59, Private, 109638, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Private, 92969, 1st-4th, 2, Separated, Prof-specialty, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +66, ?, 376028, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 20, United-States, <=50K +19, Private, 144161, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 30, United-States, <=50K +31, Private, 183778, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +23, Private, 398904, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 170846, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +35, Local-gov, 204277, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 205152, Bachelors, 13, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 225395, 7th-8th, 4, Never-married, Machine-op-inspct, Other-relative, White, Female, 0, 0, 60, Mexico, <=50K +38, Private, 33975, HS-grad, 9, Married-civ-spouse, Exec-managerial, Other-relative, White, Male, 0, 0, 40, United-States, >50K +49, Private, 147032, HS-grad, 9, Married-civ-spouse, Other-service, Wife, Asian-Pac-Islander, Female, 0, 0, 8, Philippines, <=50K +64, Private, 174826, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +60, Local-gov, 232769, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 20, 
United-States, <=50K +25, Private, 36984, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +21, Private, 292264, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +26, Private, 303973, HS-grad, 9, Never-married, Priv-house-serv, Other-relative, White, Female, 0, 1602, 15, Mexico, <=50K +23, Private, 287988, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +67, Self-emp-inc, 330144, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, >50K +24, Private, 191948, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +46, Private, 324601, 1st-4th, 2, Separated, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, Guatemala, <=50K +38, State-gov, 200289, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Taiwan, <=50K +20, Private, 113307, 7th-8th, 4, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +28, ?, 194087, Some-college, 10, Never-married, ?, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 155213, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +58, Private, 175127, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, State-gov, 358461, Some-college, 10, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +37, State-gov, 354929, Assoc-acdm, 12, Divorced, Protective-serv, Not-in-family, Black, Male, 0, 0, 38, United-States, <=50K +53, State-gov, 104501, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, >50K +45, Private, 112929, 7th-8th, 4, Divorced, Machine-op-inspct, Not-in-family, Black, Female, 0, 0, 35, United-States, 
<=50K +33, Private, 132832, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +33, State-gov, 357691, Masters, 14, Married-civ-spouse, Protective-serv, Husband, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 114605, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +60, Self-emp-not-inc, 525878, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 40, United-States, <=50K +21, Private, 68358, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +38, Private, 174571, 10th, 6, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 45, United-States, <=50K +40, Private, 42703, Assoc-voc, 11, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 220589, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, >50K +44, Self-emp-not-inc, 197558, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 70, United-States, >50K +27, Private, 423250, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +34, Self-emp-not-inc, 29254, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, >50K +20, ?, 308924, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 25, United-States, <=50K +49, Local-gov, 276247, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 213841, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +52, Private, 181677, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 30, United-States, >50K +46, Private, 160061, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, 
<=50K +20, Private, 285295, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, ?, <=50K +43, Private, 265266, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Local-gov, 222115, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Female, 99999, 0, 40, United-States, >50K +25, State-gov, 194954, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 15, United-States, <=50K +48, Private, 156926, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Local-gov, 217414, Some-college, 10, Divorced, Protective-serv, Unmarried, White, Male, 0, 0, 55, United-States, <=50K +37, Private, 538443, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 14344, 0, 40, United-States, >50K +18, ?, 192399, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 60, United-States, <=50K +42, Private, 383493, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +60, Private, 193235, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 24, United-States, <=50K +37, Self-emp-inc, 99452, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, >50K +44, Local-gov, 254134, Assoc-acdm, 12, Divorced, Tech-support, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +32, Private, 90446, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 116613, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, Portugal, <=50K +42, Local-gov, 238188, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 35, United-States, >50K +17, Private, 95909, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 12, United-States, <=50K +41, Private, 82319, 
12th, 8, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 10, United-States, <=50K +34, Private, 182274, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 1887, 40, United-States, >50K +56, Private, 179625, 10th, 6, Separated, Other-service, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +28, Private, 119793, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Self-emp-not-inc, 254989, 11th, 7, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 104830, 7th-8th, 4, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 25, Guatemala, <=50K +49, Federal-gov, 110373, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Self-emp-not-inc, 135416, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 50, United-States, <=50K +25, Private, 298225, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 35, United-States, <=50K +42, Private, 166740, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 213668, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +26, Private, 276624, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 226789, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 58, United-States, <=50K +37, Private, 31023, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +39, Local-gov, 116666, HS-grad, 9, Never-married, Protective-serv, Own-child, Amer-Indian-Eskimo, Male, 4650, 0, 48, United-States, <=50K +42, Private, 136986, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K 
+41, Private, 179580, Some-college, 10, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 36, United-States, >50K +23, Private, 103277, Some-college, 10, Divorced, Other-service, Own-child, White, Female, 0, 0, 24, United-States, <=50K +31, Federal-gov, 351141, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Local-gov, 191161, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 57, United-States, >50K +20, Private, 148709, Some-college, 10, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +36, Private, 128382, Some-college, 10, Never-married, Machine-op-inspct, Unmarried, White, Male, 0, 0, 45, United-States, <=50K +50, Private, 144361, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, >50K +37, Private, 172538, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 102476, Bachelors, 13, Never-married, Farming-fishing, Own-child, White, Male, 27828, 0, 50, United-States, >50K +39, Private, 46028, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 60, United-States, <=50K +32, Private, 198452, HS-grad, 9, Married-civ-spouse, Farming-fishing, Wife, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 193895, Some-college, 10, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +29, Private, 233421, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 3411, 0, 45, United-States, <=50K +50, Private, 378747, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +42, Private, 31251, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 37, United-States, <=50K +32, Private, 71540, Some-college, 10, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +48, 
Private, 194772, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 40, United-States, >50K +20, Private, 34568, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3781, 0, 35, United-States, <=50K +50, Self-emp-not-inc, 36480, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +18, Private, 116528, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +60, Private, 52152, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +60, Private, 216690, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, <=50K +42, Local-gov, 227065, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 22, United-States, <=50K +49, Private, 84013, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +35, Self-emp-inc, 82051, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +30, Self-emp-not-inc, 176185, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, Iran, <=50K +59, Private, 115414, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +24, Self-emp-inc, 493034, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 13550, 0, 50, United-States, >50K +55, Private, 354923, Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 393712, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +39, Private, 98941, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 141483, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 172479, Bachelors, 13, 
Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 40, United-States, >50K +21, Private, 226145, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 30, United-States, <=50K +23, Private, 394612, Bachelors, 13, Never-married, Tech-support, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +22, Private, 231085, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +55, Self-emp-not-inc, 183810, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, >50K +19, Private, 186159, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 162282, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +46, Private, 219021, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 15024, 0, 44, United-States, >50K +23, Private, 273206, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 23, United-States, <=50K +47, Self-emp-inc, 332355, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 50, United-States, >50K +23, Private, 102729, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +42, Private, 198096, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +22, State-gov, 292933, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 10, United-States, <=50K +18, Private, 135924, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 20, United-States, <=50K +37, Private, 99146, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 50, United-States, >50K +34, Private, 27409, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 299507, 
Assoc-acdm, 12, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +62, Self-emp-not-inc, 102631, Some-college, 10, Widowed, Farming-fishing, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +51, Private, 153486, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 434292, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 30, United-States, <=50K +28, Self-emp-not-inc, 240172, Masters, 14, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 219426, 10th, 6, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 295791, HS-grad, 9, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +46, Private, 114032, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1887, 45, United-States, >50K +23, Local-gov, 496382, Some-college, 10, Married-spouse-absent, Adm-clerical, Own-child, White, Female, 0, 0, 40, Guatemala, <=50K +33, Private, 376483, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 30, United-States, <=50K +27, Private, 107218, HS-grad, 9, Never-married, Other-service, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +21, Private, 246207, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +18, ?, 80564, HS-grad, 9, Never-married, ?, Own-child, White, Male, 0, 0, 60, United-States, <=50K +36, Private, 83089, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 7298, 0, 40, Mexico, >50K +37, Local-gov, 328301, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +39, Local-gov, 301614, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 199739, HS-grad, 
9, Married-civ-spouse, Transport-moving, Husband, White, Male, 7298, 0, 60, United-States, >50K +24, Private, 180060, Assoc-acdm, 12, Never-married, Craft-repair, Not-in-family, White, Male, 2354, 0, 40, United-States, <=50K +26, Private, 121040, Assoc-acdm, 12, Never-married, Exec-managerial, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +37, Private, 125550, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +34, Private, 170772, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +33, Private, 180551, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Self-emp-not-inc, 48189, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 432154, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 8, Mexico, <=50K +26, Private, 263200, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +47, Private, 123207, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 20, United-States, >50K +17, Private, 110798, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +53, Private, 238481, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1485, 40, United-States, <=50K +31, Private, 185528, Some-college, 10, Divorced, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +34, Private, 181311, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 528616, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +39, Private, 272950, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, >50K +22, ?, 195532, HS-grad, 9, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K 
+21, Private, 197583, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +40, Private, 48612, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +54, Local-gov, 31533, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7298, 0, 40, United-States, >50K +32, Federal-gov, 148138, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 2002, 40, Iran, <=50K +29, Local-gov, 30069, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 2635, 0, 40, United-States, <=50K +68, ?, 170182, Some-college, 10, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +27, Local-gov, 230885, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 45, United-States, >50K +54, Private, 174102, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +23, Private, 352606, HS-grad, 9, Divorced, Priv-house-serv, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 241153, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, >50K +54, Private, 155433, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 35, United-States, <=50K +21, Private, 109414, Some-college, 10, Never-married, Prof-specialty, Own-child, Asian-Pac-Islander, Male, 0, 1974, 40, United-States, <=50K +40, Private, 125461, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +19, Private, 331556, 10th, 6, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, ?, 138575, HS-grad, 9, Never-married, ?, Other-relative, White, Male, 0, 0, 60, United-States, <=50K +35, Private, 223514, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +39, Private, 
115418, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 2174, 0, 45, United-States, <=50K +38, Private, 193026, HS-grad, 9, Never-married, Other-service, Unmarried, White, Male, 0, 1408, 40, ?, <=50K +41, Private, 147206, 12th, 8, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 174592, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 268620, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +70, Self-emp-not-inc, 150886, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 25, United-States, <=50K +45, Private, 112362, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +83, Private, 195507, HS-grad, 9, Widowed, Protective-serv, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +59, Private, 192983, HS-grad, 9, Separated, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 120544, 9th, 5, Never-married, Other-service, Own-child, Black, Male, 0, 0, 15, United-States, <=50K +31, Private, 59083, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 208277, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 45, United-States, >50K +24, Local-gov, 184678, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 278736, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +48, Local-gov, 39464, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 52, United-States, <=50K +27, Private, 162343, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, Dominican-Republic, <=50K +41, Private, 204046, HS-grad, 9, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 255647, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 25, Mexico, <=50K +53, Private, 123011, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +66, Self-emp-not-inc, 291362, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +31, Private, 159187, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +30, State-gov, 126414, Bachelors, 13, Married-spouse-absent, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 227626, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-inc, 173783, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 60, United-States, >50K +74, Private, 211075, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +37, Private, 176756, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 1485, 70, United-States, >50K +35, Self-emp-not-inc, 31095, Some-college, 10, Separated, Farming-fishing, Not-in-family, White, Male, 4101, 0, 60, United-States, <=50K +51, Self-emp-not-inc, 32372, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1672, 70, United-States, <=50K +40, Private, 331651, Some-college, 10, Separated, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Local-gov, 146325, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1887, 40, United-States, >50K +26, Private, 515025, 10th, 6, Married-civ-spouse, Handlers-cleaners, Wife, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 394474, Assoc-acdm, 12, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K 
+32, Private, 400535, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3781, 0, 40, United-States, <=50K +29, Self-emp-not-inc, 337505, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +42, Private, 211860, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 102684, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 32, United-States, <=50K +62, ?, 225657, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 24, United-States, <=50K +33, Private, 121966, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 396790, HS-grad, 9, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +46, Local-gov, 149949, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +25, Private, 252187, 11th, 7, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 209934, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +29, Federal-gov, 229300, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +33, Private, 170769, Doctorate, 16, Divorced, Sales, Not-in-family, White, Male, 99999, 0, 60, United-States, >50K +50, Private, 200618, Assoc-acdm, 12, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 216984, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +40, Private, 212760, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 150309, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Amer-Indian-Eskimo, Male, 0, 0, 45, United-States, <=50K +54, 
Private, 174655, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 109621, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 225124, HS-grad, 9, Widowed, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +46, Private, 172695, 11th, 7, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 27, El-Salvador, <=50K +71, Self-emp-not-inc, 238479, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 8, United-States, <=50K +27, Private, 37754, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 80, United-States, <=50K +56, Private, 85018, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +64, Private, 256466, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, Asian-Pac-Islander, Male, 0, 0, 60, Philippines, >50K +23, Private, 169188, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 25, United-States, <=50K +36, Private, 210945, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Local-gov, 287031, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +26, Private, 224361, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Federal-gov, 108464, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 75826, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 120277, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 104439, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +27, Private, 56870, 
HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 200819, 12th, 8, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +50, Self-emp-not-inc, 170562, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 20, United-States, <=50K +30, Private, 80933, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 33088, 11th, 7, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Local-gov, 112763, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 7430, 0, 36, United-States, >50K +29, Private, 177651, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +31, Private, 261943, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Private, 169785, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, Italy, <=50K +20, Private, 141481, 11th, 7, Married-civ-spouse, Sales, Other-relative, White, Female, 0, 0, 50, United-States, <=50K +37, Private, 433491, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 86615, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 30, United-States, <=50K +39, Private, 125550, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +46, State-gov, 421223, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 26999, Bachelors, 13, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 42, United-States, <=50K +36, Self-emp-not-inc, 241998, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 99999, 0, 20, United-States, >50K +34, ?, 133861, HS-grad, 9, 
Married-civ-spouse, ?, Husband, White, Male, 0, 0, 25, United-States, <=50K +44, Private, 115323, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Self-emp-inc, 23778, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +28, Self-emp-not-inc, 190836, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +38, Self-emp-inc, 159179, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +64, ?, 205479, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, >50K +19, ?, 47713, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 163237, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, >50K +61, Private, 202202, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 168837, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 112271, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 52537, HS-grad, 9, Never-married, Transport-moving, Unmarried, Black, Male, 0, 0, 30, United-States, <=50K +27, Private, 38353, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +22, Private, 141698, 10th, 6, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 28856, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 175652, 11th, 7, Never-married, Other-service, Other-relative, White, Female, 0, 0, 15, United-States, <=50K +36, Private, 213008, HS-grad, 9, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +51, 
Private, 196501, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 14084, 0, 50, United-States, >50K +63, Private, 118798, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 99999, 0, 40, United-States, >50K +51, Private, 92463, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, State-gov, 125165, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 25, United-States, <=50K +42, Self-emp-not-inc, 103980, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +40, ?, 180362, Bachelors, 13, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 53903, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 179735, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +41, ?, 277390, Bachelors, 13, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 30, United-States, >50K +49, Private, 122177, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 80, United-States, <=50K +46, Private, 188161, HS-grad, 9, Separated, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-not-inc, 170108, HS-grad, 9, Separated, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +28, Private, 175262, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, Mexico, <=50K +19, ?, 204441, HS-grad, 9, Never-married, ?, Other-relative, Black, Male, 0, 0, 20, United-States, <=50K +19, Private, 164395, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 25, United-States, <=50K +18, Private, 115630, 11th, 7, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +39, Private, 178815, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 
0, 55, United-States, <=50K +60, Self-emp-not-inc, 168223, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +46, Local-gov, 202560, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 1408, 40, United-States, <=50K +38, Private, 100295, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 50, Canada, >50K +36, Private, 172256, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 44, United-States, >50K +45, Private, 51664, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, State-gov, 358893, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 2339, 40, United-States, <=50K +30, Private, 115963, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 333910, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +23, Private, 148948, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +48, State-gov, 130561, Some-college, 10, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 24, United-States, <=50K +46, Private, 428350, HS-grad, 9, Married-civ-spouse, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +43, Private, 188808, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +25, Private, 112847, HS-grad, 9, Married-civ-spouse, Transport-moving, Own-child, Other, Male, 0, 0, 40, United-States, <=50K +50, Private, 110748, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +61, Self-emp-inc, 156653, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +35, Private, 196491, Bachelors, 13, Never-married, Sales, Not-in-family, White, 
Male, 0, 0, 50, United-States, <=50K +65, Local-gov, 254413, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +56, Private, 91262, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Asian-Pac-Islander, Male, 0, 0, 45, United-States, <=50K +43, Self-emp-not-inc, 154785, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Wife, Asian-Pac-Islander, Female, 0, 0, 80, Thailand, <=50K +55, Private, 84231, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +22, Private, 226327, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +40, Private, 248406, Some-college, 10, Divorced, Machine-op-inspct, Own-child, White, Male, 0, 0, 32, United-States, <=50K +35, Private, 54317, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 1672, 50, United-States, <=50K +22, ?, 32732, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 95918, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Local-gov, 375675, 12th, 8, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 35, United-States, >50K +43, Private, 244172, HS-grad, 9, Separated, Transport-moving, Unmarried, White, Male, 0, 0, 40, Mexico, <=50K +46, Federal-gov, 233555, HS-grad, 9, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, ?, <=50K +39, Private, 326342, 11th, 7, Married-civ-spouse, Other-service, Husband, Black, Male, 2635, 0, 37, United-States, <=50K +34, Private, 77271, HS-grad, 9, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 20, England, <=50K +35, Private, 33397, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +30, Private, 446358, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 41, United-States, <=50K +25, 
Private, 151810, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 28, United-States, <=50K +44, Private, 125461, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +35, Private, 133906, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 155106, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Federal-gov, 203637, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7298, 0, 40, United-States, >50K +32, Private, 232766, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +50, Private, 305319, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 121023, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +29, Private, 198997, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 20, United-States, <=50K +38, Private, 167140, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 70, United-States, >50K +20, Private, 38772, 10th, 6, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +41, Private, 253759, HS-grad, 9, Never-married, Sales, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +27, Private, 130067, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 65, United-States, <=50K +37, Private, 203828, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +62, State-gov, 221558, Masters, 14, Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 24, ?, <=50K +31, Private, 156464, 10th, 6, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 72333, Some-college, 10, Divorced, 
Adm-clerical, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +33, Local-gov, 83671, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 50, United-States, <=50K +31, Private, 339482, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1848, 40, United-States, >50K +19, Private, 91928, Some-college, 10, Never-married, Other-service, Other-relative, White, Female, 0, 0, 35, United-States, <=50K +44, Private, 99203, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Self-emp-inc, 455995, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 65, United-States, >50K +62, Private, 192515, HS-grad, 9, Widowed, Farming-fishing, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +65, Self-emp-not-inc, 111483, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2174, 10, United-States, >50K +17, Private, 221129, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, Private, 85413, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 44, United-States, >50K +31, Private, 196125, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 265638, Some-college, 10, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +53, Private, 177727, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 205822, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 112607, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +40, Federal-gov, 177595, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1579, 40, United-States, <=50K +18, Private, 183315, 11th, 7, 
Never-married, Sales, Own-child, Black, Female, 0, 0, 10, United-States, <=50K +47, Private, 116279, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 43, United-States, <=50K +38, Federal-gov, 122493, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 4064, 0, 40, United-States, <=50K +37, Private, 215419, Assoc-acdm, 12, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 25, United-States, <=50K +40, Private, 310101, Some-college, 10, Separated, Sales, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +57, Self-emp-inc, 61885, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 60, United-States, >50K +43, Private, 59107, HS-grad, 9, Separated, Other-service, Not-in-family, White, Female, 4101, 0, 40, United-States, <=50K +32, Private, 227214, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Other, Male, 0, 0, 40, Ecuador, <=50K +64, Private, 239450, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 118847, 12th, 8, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +40, Self-emp-not-inc, 95226, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +17, ?, 659273, 11th, 7, Never-married, ?, Own-child, Black, Female, 0, 0, 40, Trinadad&Tobago, <=50K +23, Private, 215395, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 170600, Assoc-voc, 11, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +45, Self-emp-not-inc, 91044, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +27, Private, 318639, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 60, Mexico, <=50K +23, Private, 160398, Some-college, 10, Married-spouse-absent, Handlers-cleaners, 
Unmarried, White, Male, 0, 0, 40, United-States, <=50K +58, Self-emp-not-inc, 216824, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, Asian-Pac-Islander, Male, 0, 0, 30, United-States, <=50K +35, Private, 308945, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 75, United-States, <=50K +47, Private, 30840, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +33, Private, 99309, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +27, Private, 188576, Bachelors, 13, Separated, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, >50K +46, Private, 83064, Assoc-acdm, 12, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +24, Private, 403865, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 56, United-States, <=50K +40, Private, 235786, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 44, United-States, >50K +44, Private, 191893, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 24, United-States, <=50K +31, Local-gov, 149184, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 97, United-States, >50K +37, Private, 152909, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7688, 0, 40, United-States, >50K +23, Private, 435604, Assoc-voc, 11, Separated, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +30, Self-emp-inc, 109282, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, >50K +31, Private, 248178, Some-college, 10, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 35, United-States, <=50K +24, ?, 112683, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 0, 0, 32, United-States, <=50K +32, Private, 209103, Some-college, 10, Married-civ-spouse, 
Exec-managerial, Husband, White, Male, 3464, 0, 40, United-States, <=50K +27, Private, 183639, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, Local-gov, 107233, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Amer-Indian-Eskimo, Male, 0, 0, 55, United-States, <=50K +27, Private, 175387, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 1876, 40, United-States, <=50K +30, Self-emp-not-inc, 178255, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, ?, <=50K +33, Self-emp-not-inc, 38223, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 70, United-States, <=50K +34, Private, 228873, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +29, Private, 202182, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +26, Local-gov, 425092, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 2174, 0, 40, United-States, <=50K +39, Self-emp-not-inc, 152587, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-inc, 39089, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 3103, 0, 50, United-States, >50K +51, Private, 204304, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +40, Private, 116103, Some-college, 10, Separated, Craft-repair, Unmarried, White, Male, 4934, 0, 47, United-States, >50K +53, Private, 290640, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +58, Federal-gov, 81973, Some-college, 10, Married-civ-spouse, Tech-support, Husband, Asian-Pac-Islander, Male, 0, 1485, 40, United-States, >50K +29, Private, 134890, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +33, 
Private, 452924, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Other, Male, 0, 0, 40, Mexico, <=50K +57, Private, 245193, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +69, State-gov, 34339, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 184756, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 12, United-States, <=50K +56, Private, 392160, HS-grad, 9, Widowed, Sales, Unmarried, White, Female, 0, 0, 25, Mexico, <=50K +49, Private, 168337, 5th-6th, 3, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 309513, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +70, Private, 77219, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 37, United-States, <=50K +44, Private, 212888, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +37, Private, 361888, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 10520, 0, 40, United-States, >50K +58, Local-gov, 237879, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 58, United-States, <=50K +42, Self-emp-not-inc, 93099, Some-college, 10, Married-civ-spouse, Prof-specialty, Own-child, White, Female, 0, 0, 25, United-States, <=50K +41, Private, 225193, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 50814, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Local-gov, 123681, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 60, United-States, >50K +24, Private, 249351, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +58, Self-emp-not-inc, 222311, 
HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 7688, 0, 55, United-States, >50K +18, Private, 301762, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 25, United-States, <=50K +50, Private, 195298, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +69, Private, 541737, Some-college, 10, Widowed, Adm-clerical, Not-in-family, White, Female, 2050, 0, 24, United-States, <=50K +84, Private, 241065, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 66, United-States, <=50K +47, Private, 129513, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +19, Private, 374262, 12th, 8, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +24, Private, 382146, Some-college, 10, Never-married, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +48, ?, 185291, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 6, United-States, <=50K +53, Private, 30447, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +58, Private, 49893, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +22, Private, 197387, Some-college, 10, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 24, Mexico, <=50K +36, Self-emp-not-inc, 111957, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 52, United-States, <=50K +34, Private, 340458, 12th, 8, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 185670, 1st-4th, 2, Widowed, Prof-specialty, Unmarried, White, Female, 0, 0, 21, Mexico, <=50K +37, Private, 210945, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 24, United-States, <=50K +43, Private, 350661, Prof-school, 15, Separated, 
Tech-support, Not-in-family, White, Male, 0, 0, 50, Columbia, >50K +42, Private, 190543, Some-college, 10, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 40, United-States, >50K +21, Private, 70261, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +49, Self-emp-not-inc, 179048, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, Greece, <=50K +35, Private, 242094, HS-grad, 9, Married-civ-spouse, Other-service, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +49, Self-emp-not-inc, 117634, Some-college, 10, Widowed, Craft-repair, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +28, Private, 82531, Some-college, 10, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +51, Private, 193374, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K +30, ?, 186420, Bachelors, 13, Never-married, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 323605, 7th-8th, 4, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 60, United-States, >50K +56, Private, 371064, 9th, 5, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 39927, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 8, United-States, <=50K +22, Private, 64292, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 37, United-States, <=50K +33, Private, 198660, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 99999, 0, 56, United-States, >50K +54, ?, 196975, HS-grad, 9, Divorced, ?, Other-relative, White, Male, 0, 0, 45, United-States, <=50K +22, Private, 210165, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +68, Private, 144137, Some-college, 10, Divorced, Priv-house-serv, Other-relative, White, Female, 0, 0, 30, 
United-States, <=50K +56, Local-gov, 155657, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +23, ?, 72953, HS-grad, 9, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +69, Self-emp-not-inc, 107548, 9th, 5, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 163258, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 221324, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +18, Private, 444822, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 8, Mexico, <=50K +17, Private, 154398, 11th, 7, Never-married, Other-service, Own-child, Black, Male, 0, 0, 16, Haiti, <=50K +31, Private, 120672, 11th, 7, Divorced, Handlers-cleaners, Other-relative, Black, Male, 0, 1721, 40, United-States, <=50K +50, Private, 159650, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 60, United-States, >50K +62, Private, 290754, 10th, 6, Widowed, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Private, 49654, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 52, United-States, <=50K +20, Federal-gov, 147352, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 227943, Assoc-acdm, 12, Never-married, Sales, Own-child, White, Male, 0, 0, 30, United-States, <=50K +18, Private, 423024, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +53, ?, 64322, 7th-8th, 4, Separated, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 445940, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K +23, Private, 230824, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, 
United-States, <=50K +43, Private, 48882, HS-grad, 9, Divorced, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 168195, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +53, Local-gov, 188644, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, Mexico, <=50K +28, Private, 136077, 10th, 6, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, State-gov, 119793, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 336513, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +58, Private, 186991, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, ?, 218948, 7th-8th, 4, Never-married, ?, Not-in-family, White, Female, 0, 0, 32, Mexico, <=50K +26, Private, 211435, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 280169, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 3456, 0, 8, United-States, <=50K +27, Private, 109997, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 286789, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +25, Private, 102460, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 287160, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +39, Private, 198097, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +52, Private, 119111, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, 
Private, 174461, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +26, Self-emp-not-inc, 281678, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 55, United-States, <=50K +24, ?, 377725, Bachelors, 13, Never-married, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 151053, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Local-gov, 186539, Masters, 14, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +20, ?, 149478, Some-college, 10, Never-married, ?, Other-relative, White, Female, 0, 0, 25, United-States, <=50K +40, Private, 198452, 11th, 7, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 198863, Prof-school, 15, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 2559, 60, United-States, >50K +33, Private, 176711, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 165310, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Other-relative, White, Male, 0, 0, 20, United-States, <=50K +37, Private, 213008, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, Japan, <=50K +21, State-gov, 38251, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 20, United-States, <=50K +33, Private, 125761, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 36, United-States, <=50K +28, Private, 148645, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 273435, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1579, 40, United-States, <=50K +43, Private, 208613, Bachelors, 13, Separated, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 192565, 
HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 183885, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +47, Self-emp-not-inc, 243631, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +37, Private, 191754, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, >50K +26, Private, 261278, Some-college, 10, Separated, Sales, Other-relative, Black, Male, 0, 0, 30, United-States, <=50K +55, Private, 127014, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 60, United-States, <=50K +40, Private, 197919, Assoc-acdm, 12, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 217460, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 86551, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +54, Self-emp-inc, 98051, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 54, United-States, >50K +38, Private, 215917, Some-college, 10, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +53, Self-emp-not-inc, 192982, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 85, United-States, <=50K +27, Self-emp-not-inc, 334132, Assoc-acdm, 12, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 78, United-States, <=50K +42, Private, 136986, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +62, Private, 116812, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 189123, 11th, 7, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1485, 58, United-States, <=50K +26, Private, 
89648, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +33, ?, 190027, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K +59, Private, 99248, Some-college, 10, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 57600, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +25, Private, 199224, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +58, Private, 140363, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 36, United-States, <=50K +30, Private, 308812, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +21, Private, 275421, Some-college, 10, Never-married, Craft-repair, Own-child, White, Female, 0, 0, 40, United-States, <=50K +61, Private, 213321, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 157747, Assoc-acdm, 12, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 182314, Masters, 14, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +70, Private, 220589, Some-college, 10, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +55, ?, 208640, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, >50K +29, Self-emp-not-inc, 189346, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 2202, 0, 50, United-States, <=50K +46, Private, 124071, Masters, 14, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 44, United-States, <=50K +35, Federal-gov, 20469, Some-college, 10, Divorced, Exec-managerial, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +31, Private, 154227, HS-grad, 9, Married-civ-spouse, 
Craft-repair, Husband, White, Male, 0, 0, 43, United-States, >50K +37, Private, 105044, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 42, United-States, >50K +43, Private, 35910, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 43, United-States, >50K +23, Private, 189203, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 116493, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 13550, 0, 44, United-States, >50K +42, Local-gov, 19700, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +26, Private, 48718, 10th, 6, Never-married, Adm-clerical, Not-in-family, White, Female, 2907, 0, 40, United-States, <=50K +45, Private, 106113, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 256263, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, ?, 202498, 7th-8th, 4, Separated, ?, Not-in-family, White, Male, 0, 0, 40, Guatemala, <=50K +38, Private, 120074, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, >50K +28, Private, 122922, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +68, Self-emp-not-inc, 116903, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2149, 40, United-States, <=50K +42, Local-gov, 222596, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 107302, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, India, <=50K +36, Private, 156400, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 53373, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, 
<=50K +22, Private, 58916, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +45, Local-gov, 167159, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 283806, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +57, Private, 140426, 1st-4th, 2, Married-spouse-absent, Other-service, Not-in-family, White, Male, 0, 0, 35, ?, <=50K +36, Local-gov, 61778, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 40, United-States, >50K +41, Private, 33310, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Self-emp-not-inc, 202560, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 35, United-States, <=50K +25, Self-emp-not-inc, 60828, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Female, 0, 0, 50, United-States, <=50K +53, State-gov, 153486, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +28, Local-gov, 167536, Assoc-acdm, 12, Widowed, Prof-specialty, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +30, Local-gov, 370990, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 198867, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Local-gov, 174924, Some-college, 10, Divorced, Protective-serv, Unmarried, White, Male, 0, 0, 48, Germany, <=50K +30, Private, 175856, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, <=50K +41, Private, 169628, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, Black, Female, 0, 0, 40, ?, <=50K +29, ?, 125159, Some-college, 10, Never-married, ?, Not-in-family, Black, Male, 0, 0, 36, ?, <=50K 
+31, Private, 220690, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +36, Private, 160035, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3908, 0, 55, United-States, <=50K +59, Self-emp-not-inc, 116878, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, Greece, <=50K +33, Self-emp-not-inc, 134737, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +29, Private, 81648, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 1887, 55, United-States, >50K +49, State-gov, 122177, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Federal-gov, 69614, 10th, 6, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +33, Private, 112115, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 45, United-States, >50K +28, Private, 299422, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +81, ?, 162882, HS-grad, 9, Divorced, ?, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +24, Private, 112854, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 16, United-States, <=50K +32, Self-emp-not-inc, 33417, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +47, Federal-gov, 224559, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, >50K +44, ?, 468706, HS-grad, 9, Married-civ-spouse, ?, Husband, Black, Male, 0, 0, 40, United-States, <=50K +24, Private, 357028, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Local-gov, 51158, Some-college, 10, Married-civ-spouse, Tech-support, Wife, White, Female, 7298, 0, 36, United-States, >50K +51, Private, 186303, 
Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +52, Private, 127749, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +22, Private, 291386, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 138054, Assoc-acdm, 12, Never-married, Other-service, Not-in-family, Other, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 174533, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 200835, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 108658, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +43, Private, 180985, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, >50K +25, Private, 34803, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 20, United-States, <=50K +59, Private, 75867, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +29, Private, 156819, Assoc-acdm, 12, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +30, Private, 61272, 9th, 5, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, Portugal, <=50K +24, Private, 39827, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Wife, Other, Female, 0, 0, 40, Puerto-Rico, <=50K +38, Private, 130007, Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 80324, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +26, Private, 322614, Preschool, 1, Married-spouse-absent, Machine-op-inspct, Not-in-family, White, Male, 0, 1719, 40, Mexico, <=50K +30, 
Private, 140869, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +73, Local-gov, 181902, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 10, Poland, >50K +30, Private, 287908, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +33, Private, 309630, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 28225, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 58, United-States, <=50K +40, ?, 428584, HS-grad, 9, Married-civ-spouse, ?, Wife, Black, Female, 3464, 0, 20, United-States, <=50K +18, Private, 39222, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 359131, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 7298, 0, 8, ?, >50K +22, Private, 122272, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +50, Self-emp-inc, 198400, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 60, United-States, <=50K +62, ?, 73091, 7th-8th, 4, Widowed, ?, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +39, Self-emp-inc, 283338, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +22, Private, 208946, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 348416, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +31, Private, 379046, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, <=50K +29, Private, 183887, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 127961, Some-college, 10, Married-civ-spouse, Machine-op-inspct, 
Husband, White, Male, 0, 0, 45, United-States, >50K +24, Private, 211129, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +29, Local-gov, 187649, HS-grad, 9, Separated, Protective-serv, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +49, Federal-gov, 94754, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 231826, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +28, Private, 142764, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +22, Private, 126822, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 60, United-States, <=50K +37, Private, 188069, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 284395, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +49, Private, 31267, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 161444, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, Columbia, <=50K +25, Private, 144483, HS-grad, 9, Separated, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 133655, HS-grad, 9, Married-spouse-absent, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, State-gov, 112074, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +21, Private, 249727, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 22, United-States, <=50K +18, Private, 165754, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +30, Local-gov, 172822, Assoc-voc, 11, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 
40, United-States, <=50K +44, Private, 288433, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +40, Private, 33331, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +43, Private, 168071, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 44, United-States, <=50K +45, Private, 207277, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +29, Private, 130620, Some-college, 10, Married-spouse-absent, Sales, Own-child, Asian-Pac-Islander, Female, 0, 0, 26, India, <=50K +40, Private, 136244, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 972354, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 48, United-States, <=50K +20, Private, 245297, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +32, State-gov, 71151, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 20, United-States, <=50K +19, Private, 118352, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 16, United-States, <=50K +21, Private, 117210, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 120068, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 48343, 11th, 7, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 84451, Assoc-voc, 11, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 32, United-States, <=50K +51, ?, 76437, Some-college, 10, Divorced, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 281704, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 
35, United-States, <=50K +54, Private, 123011, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +50, Private, 104729, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 48, United-States, <=50K +29, Private, 110134, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 186067, 10th, 6, Never-married, Tech-support, Own-child, White, Male, 0, 0, 10, United-States, <=50K +47, Private, 214702, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 37, Puerto-Rico, <=50K +46, Private, 384795, Bachelors, 13, Divorced, Prof-specialty, Unmarried, Black, Female, 0, 0, 32, United-States, <=50K +30, Private, 175931, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 44, United-States, <=50K +58, Private, 366324, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 30, United-States, <=50K +48, Private, 118717, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +23, Private, 219835, Some-college, 10, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +23, Private, 176486, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 36, United-States, <=50K +45, Private, 273435, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 182661, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +26, Private, 212304, 7th-8th, 4, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 48, United-States, <=50K +50, Local-gov, 133963, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, >50K +49, Private, 165152, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, >50K +26, Private, 274724, 
Some-college, 10, Never-married, Other-service, Other-relative, White, Male, 0, 0, 40, Nicaragua, <=50K +47, Private, 196707, Prof-school, 15, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 213002, 12th, 8, Never-married, Sales, Not-in-family, White, Male, 4650, 0, 50, United-States, <=50K +19, ?, 26620, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 361481, 10th, 6, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, ?, <=50K +35, Private, 148581, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1740, 40, United-States, <=50K +46, Private, 459189, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1902, 50, United-States, >50K +28, Self-emp-not-inc, 214689, 11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 289364, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 45, United-States, >50K +21, Private, 174907, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +50, Self-emp-not-inc, 348099, 10th, 6, Divorced, Sales, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +30, ?, 104965, 9th, 5, Never-married, ?, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +31, Private, 31600, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 286282, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 20, United-States, <=50K +35, Self-emp-not-inc, 181705, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 3103, 0, 40, United-States, >50K +33, Private, 238912, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +34, Private, 134737, Masters, 14, Married-civ-spouse, Prof-specialty, 
Husband, White, Male, 7688, 0, 55, United-States, >50K +67, ?, 157403, Prof-school, 15, Married-civ-spouse, ?, Husband, White, Male, 6418, 0, 10, United-States, >50K +37, Private, 197429, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 45, United-States, >50K +48, Private, 47343, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Federal-gov, 67083, Bachelors, 13, Never-married, Exec-managerial, Unmarried, Asian-Pac-Islander, Male, 1471, 0, 40, Cambodia, <=50K +24, Private, 249957, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 175942, HS-grad, 9, Divorced, Priv-house-serv, Not-in-family, White, Female, 0, 0, 40, France, <=50K +50, Private, 192982, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 1848, 40, United-States, >50K +40, Private, 209547, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1977, 60, United-States, >50K +33, Private, 142675, Bachelors, 13, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +51, Federal-gov, 190333, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 196396, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 166740, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Local-gov, 174533, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +27, Private, 210867, 7th-8th, 4, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 50, ?, <=50K +37, Private, 118486, Bachelors, 13, Separated, Prof-specialty, Unmarried, White, Female, 4934, 0, 32, United-States, >50K +40, Private, 144067, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, 
White, Male, 0, 0, 40, United-States, <=50K +36, Private, 106964, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 178136, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +38, Private, 196554, Prof-school, 15, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 35, United-States, >50K +40, Self-emp-not-inc, 403550, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +35, Private, 498216, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 192755, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 20, United-States, >50K +20, ?, 53738, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 60, United-States, <=50K +33, Private, 156192, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +45, Private, 189802, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +66, ?, 213149, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 1825, 40, United-States, >50K +35, Self-emp-not-inc, 179171, HS-grad, 9, Never-married, Sales, Unmarried, Black, Female, 0, 0, 38, Germany, <=50K +32, Private, 77634, 11th, 7, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, United-States, <=50K +23, Private, 189830, Some-college, 10, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 50, United-States, <=50K +19, Private, 127190, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +44, ?, 174147, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 138107, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Male, 0, 0, 35, United-States, <=50K +44, 
Self-emp-inc, 269733, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 30, United-States, <=50K +41, State-gov, 227734, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 3464, 0, 40, United-States, <=50K +19, Private, 318822, Some-college, 10, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +48, Private, 48885, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +45, Private, 205424, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 65, United-States, >50K +40, Private, 173858, 7th-8th, 4, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 42, Cambodia, <=50K +34, Private, 202450, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, >50K +20, Private, 154779, Some-college, 10, Never-married, Sales, Other-relative, Other, Female, 0, 0, 40, United-States, <=50K +33, Private, 180551, Some-college, 10, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 177522, HS-grad, 9, Married-civ-spouse, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +23, Private, 277328, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 32, Cuba, <=50K +34, Private, 112584, 10th, 6, Divorced, Other-service, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +48, State-gov, 85384, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +32, ?, 123971, 11th, 7, Divorced, ?, Not-in-family, White, Female, 0, 0, 49, United-States, <=50K +42, Private, 69019, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, United-States, >50K +22, Private, 112847, HS-grad, 9, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 52900, HS-grad, 9, 
Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 42, United-States, >50K +42, Private, 37937, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +45, Private, 59380, Bachelors, 13, Separated, Exec-managerial, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +47, Private, 114770, HS-grad, 9, Divorced, Other-service, Own-child, White, Female, 0, 0, 32, United-States, <=50K +29, Private, 216481, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +34, Private, 176469, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +34, Private, 176831, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, >50K +39, Federal-gov, 410034, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 93662, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 24, United-States, <=50K +42, Self-emp-inc, 144236, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +48, Private, 240917, 11th, 7, Separated, Other-service, Not-in-family, Black, Female, 0, 0, 35, United-States, <=50K +53, Private, 608184, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 1902, 40, United-States, >50K +51, Private, 243361, Some-college, 10, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +44, Self-emp-not-inc, 35166, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 90, United-States, <=50K +46, Self-emp-inc, 182655, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +51, Private, 142717, Doctorate, 16, Divorced, Craft-repair, Not-in-family, White, Female, 4787, 0, 60, United-States, >50K +32, Private, 272944, Some-college, 10, Never-married, 
Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, ?, 219233, HS-grad, 9, Never-married, ?, Own-child, Black, Male, 0, 1602, 30, United-States, <=50K +24, Private, 228686, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +33, Private, 236818, Assoc-voc, 11, Never-married, Prof-specialty, Unmarried, Black, Female, 0, 0, 26, United-States, <=50K +47, Self-emp-not-inc, 117865, HS-grad, 9, Married-AF-spouse, Craft-repair, Husband, White, Male, 0, 0, 90, United-States, <=50K +64, Self-emp-not-inc, 106538, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +62, Private, 153891, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +52, Private, 190909, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 191002, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, Poland, <=50K +42, Private, 89073, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 48, United-States, <=50K +38, Federal-gov, 238342, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 42, United-States, >50K +55, Private, 259532, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +29, ?, 189282, HS-grad, 9, Married-civ-spouse, ?, Not-in-family, White, Female, 0, 0, 27, United-States, <=50K +42, Private, 132481, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +30, Private, 205659, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 50, Thailand, >50K +32, Private, 182323, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, ?, 216256, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 3464, 0, 30, United-States, 
<=50K +50, Federal-gov, 166419, 11th, 7, Never-married, Sales, Not-in-family, Black, Female, 3674, 0, 40, United-States, <=50K +27, Private, 152246, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +47, Private, 155659, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +33, Private, 155198, 9th, 5, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +48, Self-emp-not-inc, 100931, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 162945, 7th-8th, 4, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, <=50K +31, Federal-gov, 334346, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 181597, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, United-States, <=50K +61, Self-emp-not-inc, 133969, HS-grad, 9, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 63, South, <=50K +50, Private, 210217, Bachelors, 13, Divorced, Sales, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +49, Private, 169711, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, Germany, >50K +57, ?, 300104, 5th-6th, 3, Married-civ-spouse, ?, Husband, White, Male, 7298, 0, 84, United-States, >50K +19, Private, 271521, HS-grad, 9, Never-married, Other-service, Other-relative, Asian-Pac-Islander, Male, 0, 0, 24, United-States, <=50K +18, Private, 51255, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 15, United-States, <=50K +44, Self-emp-not-inc, 26669, Assoc-acdm, 12, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 99, United-States, <=50K +54, Private, 194580, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +35, 
State-gov, 177974, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K +27, State-gov, 315640, Masters, 14, Never-married, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 20, China, <=50K +50, Self-emp-inc, 136913, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +43, State-gov, 230961, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 167062, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 120131, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 243368, Preschool, 1, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 50, Mexico, <=50K +30, Private, 171876, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +19, Private, 136866, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 45, United-States, <=50K +40, Private, 316820, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 1485, 40, United-States, <=50K +55, Private, 185459, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +67, ?, 81761, HS-grad, 9, Divorced, ?, Own-child, White, Male, 0, 0, 20, United-States, <=50K +31, Private, 43716, Assoc-voc, 11, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 0, 43, United-States, <=50K +30, Private, 220939, Assoc-voc, 11, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +54, ?, 148657, Preschool, 1, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 40, Mexico, <=50K +51, Federal-gov, 40808, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 43, United-States, <=50K +34, Private, 183473, HS-grad, 9, Divorced, 
Transport-moving, Own-child, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 108496, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +50, Private, 204838, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +29, Private, 132686, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +17, State-gov, 117906, 10th, 6, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 304386, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +52, ?, 248113, Preschool, 1, Married-spouse-absent, ?, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +39, Private, 165215, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1902, 18, United-States, >50K +18, ?, 215463, 12th, 8, Never-married, ?, Own-child, White, Female, 0, 0, 25, United-States, <=50K +32, Private, 259719, Some-college, 10, Divorced, Handlers-cleaners, Unmarried, Black, Male, 0, 0, 40, Nicaragua, <=50K +25, ?, 35829, Some-college, 10, Divorced, ?, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +34, Private, 248795, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +44, Self-emp-not-inc, 124692, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 40, United-States, >50K +37, Local-gov, 128054, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 179731, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 2415, 65, United-States, >50K +32, Self-emp-inc, 113543, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Private, 252153, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 28, 
United-States, <=50K +45, Federal-gov, 45891, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Male, 0, 0, 42, United-States, <=50K +30, Private, 112263, 11th, 7, Divorced, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 47791, 12th, 8, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 10, United-States, <=50K +41, Private, 202980, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 4, Peru, <=50K +21, Private, 34918, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +48, Private, 91251, 7th-8th, 4, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 30, China, <=50K +31, Private, 132996, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 5178, 0, 45, United-States, >50K +34, Private, 306215, Assoc-voc, 11, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +25, Private, 203570, HS-grad, 9, Separated, Other-service, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +41, Self-emp-not-inc, 355918, Bachelors, 13, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +35, Self-emp-not-inc, 198841, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +42, Private, 282964, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +34, Self-emp-not-inc, 312197, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 75, Mexico, >50K +44, Private, 98779, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 4386, 0, 60, United-States, <=50K +32, Private, 200246, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 182771, Some-college, 10, Never-married, Sales, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +23, 
Private, 199908, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +36, Private, 172104, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, Other, Male, 0, 0, 40, India, >50K +53, Self-emp-not-inc, 35295, Bachelors, 13, Never-married, Sales, Unmarried, White, Male, 0, 0, 60, United-States, >50K +27, Private, 216858, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K +27, Private, 332187, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 65, United-States, <=50K +57, Private, 255109, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +17, Private, 111332, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +59, Local-gov, 238431, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +34, Private, 131552, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 110239, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, <=50K +31, State-gov, 255830, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 45, United-States, <=50K +18, ?, 175648, 11th, 7, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 82998, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +19, Private, 164320, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Self-emp-not-inc, 263498, Assoc-voc, 11, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +52, Self-emp-not-inc, 162381, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Local-gov, 229651, Bachelors, 13, Married-civ-spouse, 
Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +27, Private, 357348, HS-grad, 9, Separated, Other-service, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +19, Private, 269657, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +38, Local-gov, 82880, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 15, United-States, <=50K +19, Private, 389755, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 16, United-States, <=50K +34, Private, 195136, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 1887, 40, United-States, >50K +41, Private, 207685, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, ?, <=50K +23, Private, 222925, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Own-child, White, Female, 2105, 0, 40, United-States, <=50K +24, ?, 196388, Assoc-acdm, 12, Never-married, ?, Not-in-family, White, Male, 0, 0, 12, United-States, <=50K +24, Private, 50341, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 214134, 10th, 6, Never-married, Transport-moving, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 84, United-States, <=50K +45, Private, 114032, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +45, Private, 192053, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +48, Private, 240231, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Japan, >50K +42, Private, 44402, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +35, Self-emp-not-inc, 191503, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 163530, HS-grad, 9, Divorced, Other-service, Unmarried, 
Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +51, Local-gov, 136823, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 32, United-States, <=50K +59, Private, 121912, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Local-gov, 58624, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Local-gov, 74056, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +29, Private, 144259, Bachelors, 13, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 4386, 0, 80, ?, >50K +57, Private, 182028, Assoc-acdm, 12, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +40, Private, 209040, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 206046, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 182494, 7th-8th, 4, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 185057, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 35, Scotland, <=50K +60, Private, 147473, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +53, Local-gov, 221722, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 14344, 0, 50, United-States, >50K +20, ?, 388811, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 221912, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 48189, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +29, State-gov, 382272, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, 
United-States, <=50K +22, Private, 48347, Bachelors, 13, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 143046, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1564, 38, United-States, >50K +63, Self-emp-inc, 137940, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, United-States, >50K +28, Private, 249571, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +79, Private, 121318, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 20, United-States, <=50K +39, Private, 224531, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +29, Private, 185019, 12th, 8, Never-married, Other-service, Not-in-family, Other, Male, 0, 0, 40, United-States, <=50K +60, Private, 27886, 7th-8th, 4, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 94741, 12th, 8, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 24, United-States, <=50K +20, Private, 107801, Assoc-acdm, 12, Never-married, Other-service, Own-child, White, Female, 0, 2205, 18, United-States, <=50K +44, Private, 191256, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 65, United-States, >50K +47, Private, 256866, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 48, United-States, <=50K +59, Private, 197148, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 24, United-States, >50K +37, Private, 312271, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +21, Private, 118657, HS-grad, 9, Separated, Machine-op-inspct, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +68, Private, 224338, Assoc-voc, 11, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K 
+43, Private, 242488, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 5013, 0, 40, United-States, <=50K +23, ?, 234970, Some-college, 10, Never-married, ?, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +23, Private, 227915, HS-grad, 9, Never-married, Craft-repair, Unmarried, White, Female, 0, 0, 33, United-States, <=50K +40, Local-gov, 105717, Masters, 14, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 1876, 35, United-States, <=50K +45, Self-emp-not-inc, 160962, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 35, United-States, <=50K +34, ?, 353881, Assoc-voc, 11, Married-civ-spouse, ?, Husband, White, Male, 3103, 0, 60, United-States, >50K +22, Private, 188950, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 201328, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 218678, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 49, United-States, <=50K +23, Private, 184255, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 40, United-States, <=50K +39, Federal-gov, 200968, Some-college, 10, Married-civ-spouse, Adm-clerical, Other-relative, White, Male, 0, 0, 45, United-States, >50K +26, Private, 102264, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 300584, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 208946, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 25, United-States, <=50K +36, Private, 105021, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 55, United-States, <=50K +20, Private, 124751, Some-college, 10, Never-married, Priv-house-serv, Own-child, White, Female, 0, 0, 20, United-States, <=50K +18, Private, 274057, 
11th, 7, Never-married, Other-service, Own-child, Black, Male, 0, 0, 8, United-States, <=50K +38, Private, 132879, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Self-emp-inc, 260960, Bachelors, 13, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +56, Private, 208415, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, Black, Male, 0, 0, 40, ?, <=50K +42, Private, 356934, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 154410, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +31, Private, 35378, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +73, Private, 301210, 1st-4th, 2, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1735, 20, United-States, <=50K +32, Private, 73621, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 42, United-States, <=50K +37, Private, 108140, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 45, United-States, >50K +66, Private, 217198, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 10, United-States, <=50K +22, Private, 157332, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +51, Private, 202956, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 173495, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, <=50K +65, Private, 149811, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 2206, 59, Canada, <=50K +39, Private, 444219, HS-grad, 9, Married-civ-spouse, Craft-repair, Wife, Black, Female, 0, 0, 45, United-States, <=50K +48, Private, 125120, Some-college, 10, Married-civ-spouse, Adm-clerical, 
Wife, White, Female, 0, 0, 37, United-States, <=50K +20, Private, 190429, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, ?, 190303, Assoc-acdm, 12, Never-married, ?, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 248164, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 4386, 0, 50, United-States, >50K +29, Federal-gov, 208534, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 80, United-States, <=50K +36, Self-emp-not-inc, 343721, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 30, ?, >50K +35, Self-emp-inc, 196373, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +31, Private, 433788, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +48, State-gov, 122086, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 137314, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +40, Self-emp-not-inc, 33068, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, Private, 210688, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 15, United-States, <=50K +26, Local-gov, 117833, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 4865, 0, 35, United-States, <=50K +37, State-gov, 103474, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +65, Private, 115880, Doctorate, 16, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Self-emp-not-inc, 233933, 10th, 6, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 32, United-States, <=50K +42, Private, 52781, Some-college, 10, 
Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 586657, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, Japan, >50K +62, Private, 113080, 7th-8th, 4, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 251905, Assoc-voc, 11, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 50, United-States, <=50K +76, Self-emp-not-inc, 225964, Some-college, 10, Widowed, Sales, Not-in-family, White, Male, 0, 0, 8, United-States, <=50K +20, ?, 194096, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 263831, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 133136, 12th, 8, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Private, 121634, 10th, 6, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, Mexico, <=50K +22, Self-emp-inc, 40767, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Federal-gov, 355789, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 50, United-States, <=50K +43, Local-gov, 311914, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 91189, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 20, United-States, <=50K +44, Federal-gov, 344060, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Private, 113823, Bachelors, 13, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, State-gov, 185800, Masters, 14, Divorced, Prof-specialty, Unmarried, Black, Female, 7430, 0, 40, United-States, >50K +30, Private, 76107, HS-grad, 9, Never-married, Transport-moving, Not-in-family, 
White, Male, 0, 0, 45, United-States, >50K +23, Private, 117618, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 35, United-States, <=50K +39, Private, 238008, HS-grad, 9, Widowed, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +32, Private, 136480, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +50, Private, 285200, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1887, 35, United-States, >50K +19, Private, 351040, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, Puerto-Rico, <=50K +35, Private, 1226583, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 52, United-States, >50K +23, Private, 195767, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 187540, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 79372, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 226665, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 42, United-States, >50K +52, Private, 213209, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +49, Private, 211005, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 60, United-States, <=50K +24, Private, 96178, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 328216, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 40, United-States, >50K +39, Private, 110713, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +45, Self-emp-not-inc, 225456, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, 
United-States, >50K +62, Local-gov, 159908, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 1258, 38, United-States, <=50K +43, Private, 118308, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 50, United-States, >50K +45, Private, 180309, Some-college, 10, Divorced, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +62, Self-emp-not-inc, 39630, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 273828, 5th-6th, 3, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +56, Private, 172071, HS-grad, 9, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, Jamaica, <=50K +28, Private, 218887, HS-grad, 9, Never-married, Farming-fishing, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +23, Private, 664670, HS-grad, 9, Never-married, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +43, Private, 209149, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +26, Private, 84619, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 447346, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +55, Local-gov, 37869, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +48, State-gov, 99086, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +43, Private, 143582, 5th-6th, 3, Married-civ-spouse, Machine-op-inspct, Wife, Asian-Pac-Islander, Female, 0, 2129, 72, ?, <=50K +38, Private, 326886, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +18, Private, 181755, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +56, 
Self-emp-not-inc, 249368, HS-grad, 9, Married-spouse-absent, Exec-managerial, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +39, Self-emp-not-inc, 326400, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 504725, 5th-6th, 3, Separated, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 50, Mexico, <=50K +36, Private, 88967, 11th, 7, Never-married, Transport-moving, Unmarried, Amer-Indian-Eskimo, Male, 0, 0, 65, United-States, <=50K +42, Self-emp-not-inc, 170721, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2002, 40, United-States, <=50K +50, Private, 148953, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +17, Private, 342752, 11th, 7, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +57, Private, 220871, 7th-8th, 4, Widowed, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +73, Private, 29675, HS-grad, 9, Widowed, Other-service, Other-relative, White, Female, 0, 0, 12, United-States, <=50K +50, Federal-gov, 183611, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 115215, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 45, United-States, <=50K +27, Private, 152231, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +24, ?, 41356, Some-college, 10, Never-married, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 225142, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +23, Self-emp-not-inc, 121313, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 134821, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, 
United-States, <=50K +51, Private, 311350, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 102106, 10th, 6, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 427055, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 45, Mexico, <=50K +40, Private, 117860, HS-grad, 9, Divorced, Other-service, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +58, Private, 285885, 9th, 5, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 212800, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 194864, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 18, United-States, <=50K +36, Private, 31438, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 43, United-States, <=50K +46, Private, 148254, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +69, Private, 113035, 1st-4th, 2, Widowed, Priv-house-serv, Not-in-family, Black, Female, 0, 0, 4, United-States, <=50K +69, Private, 106595, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 1848, 0, 40, United-States, <=50K +28, Private, 144521, HS-grad, 9, Never-married, Other-service, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +20, Private, 172232, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 48, United-States, <=50K +54, State-gov, 123592, HS-grad, 9, Separated, Adm-clerical, Unmarried, Black, Female, 3887, 0, 35, United-States, <=50K +25, Private, 191921, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +64, Local-gov, 237379, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 3471, 0, 40, United-States, <=50K +17, Private, 208463, HS-grad, 9, Never-married, 
Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +53, Federal-gov, 68985, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 22418, 9th, 5, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +57, Private, 163047, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 38, United-States, <=50K +51, Private, 153870, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2603, 40, United-States, <=50K +20, ?, 124954, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 30, United-States, <=50K +47, Private, 197702, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 166415, HS-grad, 9, Never-married, Transport-moving, Unmarried, White, Male, 0, 0, 52, United-States, <=50K +50, State-gov, 116211, Prof-school, 15, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 52, United-States, >50K +20, Private, 33644, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K +43, State-gov, 33331, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 70, United-States, >50K +46, Private, 73019, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +54, Private, 169182, HS-grad, 9, Separated, Adm-clerical, Unmarried, White, Female, 0, 0, 38, Puerto-Rico, <=50K +53, Private, 20438, Some-college, 10, Separated, Exec-managerial, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 15, United-States, <=50K +21, Private, 109869, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 30, United-States, <=50K +58, Private, 316849, Some-college, 10, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 208043, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, 
Male, 0, 0, 55, United-States, >50K +61, Private, 153790, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 40, United-States, <=50K +56, State-gov, 153451, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +59, Private, 96840, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +72, Private, 192732, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 20, United-States, <=50K +33, Private, 209101, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 146919, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +46, Local-gov, 192323, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 38, United-States, >50K +48, Private, 217019, HS-grad, 9, Never-married, Prof-specialty, Unmarried, Black, Female, 0, 0, 28, United-States, <=50K +33, Private, 198211, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 222490, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 106758, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +31, Private, 561334, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 203710, Bachelors, 13, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Local-gov, 203322, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +51, Private, 123703, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 4386, 0, 40, United-States, >50K +46, State-gov, 312015, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 40, 
United-States, <=50K +25, Private, 209428, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 25, El-Salvador, <=50K +61, Private, 230292, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7688, 0, 40, United-States, >50K +17, Private, 114420, 11th, 7, Never-married, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +26, Private, 120238, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +35, Private, 100375, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +33, Self-emp-not-inc, 42485, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 70, United-States, <=50K +37, Private, 130620, 12th, 8, Married-civ-spouse, Sales, Wife, Asian-Pac-Islander, Female, 0, 0, 33, ?, <=50K +39, Local-gov, 134367, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 35, United-States, <=50K +42, Private, 147099, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +35, Private, 36214, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 4386, 0, 47, United-States, >50K +45, Private, 119904, HS-grad, 9, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 50, United-States, >50K +47, Self-emp-inc, 105779, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, >50K +64, Private, 165020, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 48, United-States, <=50K +39, Private, 187098, Prof-school, 15, Married-civ-spouse, Exec-managerial, Wife, White, Female, 15024, 0, 47, United-States, >50K +43, ?, 142030, HS-grad, 9, Divorced, ?, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 241360, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, <=50K +62, Private, 121319, Bachelors, 13, Married-civ-spouse, Prof-specialty, 
Husband, White, Male, 3103, 0, 40, United-States, >50K +53, Private, 151580, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 40, United-States, >50K +31, Private, 162572, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 35917, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, Self-emp-inc, 35723, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +43, Private, 194773, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 62155, Some-college, 10, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 35, United-States, <=50K +45, Self-emp-not-inc, 192203, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1485, 40, United-States, >50K +46, Private, 174370, Some-college, 10, Separated, Sales, Not-in-family, White, Male, 0, 0, 55, United-States, <=50K +26, Private, 161007, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 80, United-States, <=50K +24, Private, 270517, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, Mexico, <=50K +43, Private, 163847, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, >50K +40, Private, 193882, Assoc-voc, 11, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 160037, 7th-8th, 4, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +34, Federal-gov, 189944, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 72, United-States, <=50K +85, Private, 115364, HS-grad, 9, Widowed, Sales, Unmarried, White, Male, 0, 0, 35, United-States, <=50K +41, Private, 163174, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, 
Female, 0, 0, 40, United-States, <=50K +31, State-gov, 188900, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 3325, 0, 35, United-States, <=50K +22, Private, 214399, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 30, United-States, <=50K +60, Private, 156616, HS-grad, 9, Widowed, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +29, Private, 204862, Assoc-acdm, 12, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +34, ?, 55921, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, State-gov, 172475, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Asian-Pac-Islander, Female, 2977, 0, 45, United-States, <=50K +24, Private, 153082, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +45, Local-gov, 195418, Masters, 14, Divorced, Prof-specialty, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +21, Local-gov, 276840, 12th, 8, Never-married, Other-service, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +30, Private, 97933, Assoc-acdm, 12, Married-civ-spouse, Transport-moving, Wife, White, Female, 0, 1485, 37, United-States, >50K +50, Self-emp-inc, 119099, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 99, United-States, >50K +41, Self-emp-not-inc, 83411, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 198992, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 33, United-States, <=50K +45, Private, 337825, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +34, Private, 192002, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 189346, HS-grad, 9, Divorced, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, 
United-States, <=50K +19, Private, 231962, HS-grad, 9, Never-married, Other-service, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 164488, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 13550, 0, 50, United-States, >50K +48, Private, 200471, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, >50K +69, Private, 228921, Bachelors, 13, Widowed, Prof-specialty, Not-in-family, White, Male, 0, 2282, 40, United-States, >50K +41, Private, 184846, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 233851, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 499001, HS-grad, 9, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, Mexico, <=50K +65, Local-gov, 125768, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 28, United-States, <=50K +31, Private, 255004, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 1741, 38, United-States, <=50K +28, Private, 157624, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 146767, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +45, Private, 118291, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Female, 0, 0, 80, United-States, <=50K +43, Private, 313181, HS-grad, 9, Divorced, Adm-clerical, Other-relative, Black, Male, 0, 0, 38, United-States, <=50K +31, Private, 87891, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +31, Private, 226443, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 55, United-States, >50K +45, Private, 81132, Some-college, 10, Married-civ-spouse, Craft-repair, Other-relative, Asian-Pac-Islander, Male, 0, 0, 40, 
Philippines, <=50K +20, Private, 216436, Bachelors, 13, Never-married, Sales, Other-relative, Black, Female, 0, 0, 30, United-States, <=50K +25, Private, 213412, Bachelors, 13, Never-married, Tech-support, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 179358, HS-grad, 9, Widowed, Handlers-cleaners, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +31, Private, 369825, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 4101, 0, 50, United-States, <=50K +56, Private, 199763, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +26, Private, 239390, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 18, United-States, <=50K +47, Self-emp-not-inc, 173613, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 65, United-States, <=50K +40, Self-emp-inc, 37869, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 302845, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 48, United-States, <=50K +34, State-gov, 85218, Masters, 14, Never-married, Prof-specialty, Unmarried, Black, Female, 0, 0, 24, United-States, <=50K +37, Private, 48268, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +38, Private, 173968, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +19, Private, 70982, Assoc-voc, 11, Never-married, Other-service, Own-child, Asian-Pac-Islander, Male, 0, 0, 16, United-States, <=50K +49, Private, 166857, 9th, 5, Divorced, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, ?, 256191, HS-grad, 9, Never-married, ?, Own-child, Black, Female, 0, 0, 25, United-States, <=50K +26, Private, 162872, Assoc-acdm, 12, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +82, Private, 
152148, 7th-8th, 4, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 2, United-States, <=50K +40, Private, 139193, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 791084, Some-college, 10, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 50, United-States, <=50K +23, Private, 137214, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 37, United-States, <=50K +19, Private, 183258, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +67, Private, 154035, HS-grad, 9, Widowed, Handlers-cleaners, Other-relative, Black, Male, 0, 0, 32, United-States, <=50K +43, Private, 115323, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 3103, 0, 40, United-States, >50K +41, Private, 213055, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, Other, Female, 0, 0, 50, United-States, <=50K +37, Private, 155064, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 33551, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 169995, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 168262, Masters, 14, Separated, Exec-managerial, Not-in-family, White, Male, 99999, 0, 50, United-States, >50K +40, Private, 104196, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, State-gov, 114055, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 274398, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 20, United-States, <=50K +78, ?, 27979, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 2228, 0, 32, United-States, <=50K +67, ?, 244122, Assoc-voc, 11, Widowed, ?, Not-in-family, White, 
Female, 0, 0, 1, United-States, <=50K +49, Private, 196571, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, >50K +66, Private, 101607, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 10, United-States, <=50K +52, Private, 122109, HS-grad, 9, Never-married, Prof-specialty, Unmarried, White, Female, 0, 323, 40, United-States, <=50K +59, Self-emp-inc, 255822, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +72, Private, 195184, HS-grad, 9, Widowed, Priv-house-serv, Unmarried, White, Female, 0, 0, 12, Cuba, <=50K +35, Federal-gov, 245372, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 169583, Bachelors, 13, Married-AF-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, >50K +36, Private, 224531, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 186151, HS-grad, 9, Separated, Tech-support, Own-child, White, Female, 0, 0, 40, United-States, <=50K +23, Private, 118693, Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +39, Private, 297449, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Self-emp-not-inc, 125206, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 393264, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 108140, Bachelors, 13, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +63, Private, 264968, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, Self-emp-not-inc, 318106, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 
0, 40, United-States, >50K +30, Private, 156025, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +38, State-gov, 149455, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +25, Private, 359985, 5th-6th, 3, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 33, Mexico, <=50K +44, State-gov, 165108, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +43, Private, 115178, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +21, Private, 149224, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 30, United-States, <=50K +41, Local-gov, 352056, Assoc-acdm, 12, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 174717, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +75, ?, 173064, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 6, United-States, <=50K +29, Private, 147755, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1672, 40, United-States, <=50K +52, Self-emp-not-inc, 135716, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 70, United-States, <=50K +47, Private, 44216, HS-grad, 9, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +28, Private, 37359, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 50, United-States, >50K +24, Private, 178255, Some-college, 10, Married-civ-spouse, Priv-house-serv, Wife, White, Female, 0, 0, 40, ?, <=50K +30, State-gov, 70617, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 10, China, <=50K +30, Private, 154950, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, 
>50K +40, Private, 356934, Assoc-acdm, 12, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +27, Private, 271714, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 30, United-States, <=50K +26, Private, 247025, HS-grad, 9, Never-married, Protective-serv, Unmarried, White, Male, 0, 0, 44, United-States, <=50K +32, Private, 107417, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 37, United-States, <=50K +36, State-gov, 116554, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 917220, 12th, 8, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +25, Private, 430084, Some-college, 10, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +39, Private, 202937, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 40, Poland, <=50K +27, Private, 62737, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 508548, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +38, Self-emp-inc, 275223, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 40, United-States, >50K +35, Self-emp-not-inc, 381931, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 55, United-States, <=50K +29, Private, 246974, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 105431, HS-grad, 9, Divorced, Other-service, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +36, Private, 146311, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Self-emp-not-inc, 159869, Doctorate, 16, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, 
<=50K +21, Private, 204641, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 66297, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, >50K +38, Private, 227615, 1st-4th, 2, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +66, ?, 107744, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 360393, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 50, United-States, >50K +19, Private, 263340, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +18, Private, 141918, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 22, United-States, <=50K +37, Private, 294292, Prof-school, 15, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 128736, Bachelors, 13, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +33, Local-gov, 511289, Assoc-voc, 11, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 48, United-States, >50K +27, Private, 302406, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Local-gov, 101517, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +54, State-gov, 161334, Masters, 14, Married-spouse-absent, Prof-specialty, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, China, <=50K +24, Self-emp-inc, 189148, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +44, Self-emp-not-inc, 103111, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +48, Self-emp-not-inc, 51620, Bachelors, 13, Separated, Craft-repair, Not-in-family, 
White, Male, 0, 0, 50, United-States, <=50K +23, Private, 31606, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 34292, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 38, United-States, <=50K +21, Private, 107882, Assoc-acdm, 12, Never-married, Other-service, Own-child, White, Female, 0, 0, 9, United-States, <=50K +18, Private, 39529, 12th, 8, Never-married, Other-service, Own-child, White, Female, 0, 0, 32, United-States, <=50K +18, Private, 135315, 9th, 5, Never-married, Sales, Own-child, Other, Female, 0, 0, 32, United-States, <=50K +29, Private, 107812, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 229729, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 111891, HS-grad, 9, Separated, Machine-op-inspct, Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +32, Private, 340917, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +61, Private, 202952, 10th, 6, Divorced, Other-service, Not-in-family, Black, Female, 0, 0, 24, United-States, <=50K +79, Private, 333230, HS-grad, 9, Married-spouse-absent, Prof-specialty, Not-in-family, White, Male, 0, 0, 6, United-States, <=50K +34, Private, 114955, Assoc-acdm, 12, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 159869, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Self-emp-not-inc, 57758, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +29, Private, 207064, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 193090, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 3674, 0, 40, 
United-States, <=50K +64, Private, 151364, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +70, Local-gov, 88638, Masters, 14, Never-married, Prof-specialty, Unmarried, White, Female, 7896, 0, 50, United-States, >50K +28, Local-gov, 304960, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 1980, 40, United-States, <=50K +51, Private, 102828, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Greece, <=50K +20, ?, 210029, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, State-gov, 154246, Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 4865, 0, 55, United-States, <=50K +29, Private, 142519, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 104455, Bachelors, 13, Married-spouse-absent, Other-service, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +77, Self-emp-inc, 192230, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 292592, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +27, Private, 330132, Bachelors, 13, Married-civ-spouse, Tech-support, Wife, White, Female, 0, 0, 40, United-States, >50K +22, Private, 51111, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +35, Local-gov, 258037, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Cuba, >50K +42, Local-gov, 188291, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 1902, 40, United-States, >50K +35, State-gov, 349066, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +62, ?, 191188, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, 
<=50K +33, Private, 133503, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 2635, 0, 16, United-States, <=50K +45, Private, 146497, Some-college, 10, Widowed, Exec-managerial, Unmarried, White, Female, 0, 0, 55, United-States, <=50K +19, Private, 240468, Some-college, 10, Married-spouse-absent, Sales, Own-child, White, Female, 0, 1602, 40, United-States, <=50K +38, Private, 175120, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 416577, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 45, United-States, <=50K +29, Private, 253814, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +33, Private, 159247, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Self-emp-not-inc, 102471, HS-grad, 9, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 80, Puerto-Rico, <=50K +42, Private, 213464, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 211968, Assoc-voc, 11, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +43, Federal-gov, 32016, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +69, Private, 512992, 11th, 7, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 45, United-States, <=50K +39, Private, 135020, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 109133, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Portugal, <=50K +28, Private, 142712, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Federal-gov, 76900, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, 
<=50K +32, Private, 112176, Some-college, 10, Divorced, Sales, Own-child, White, Male, 0, 0, 30, United-States, <=50K +43, Federal-gov, 262233, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +49, Private, 122066, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 30, Hungary, <=50K +28, Private, 194690, 7th-8th, 4, Separated, Other-service, Own-child, White, Male, 0, 0, 60, Mexico, <=50K +35, Private, 306678, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 2885, 0, 40, United-States, <=50K +19, ?, 217769, Some-college, 10, Never-married, ?, Own-child, White, Female, 594, 0, 10, United-States, <=50K +35, Local-gov, 308945, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +57, Private, 46699, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +45, Private, 377757, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 220993, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 1590, 48, United-States, <=50K +45, Private, 102147, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 113770, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 139012, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +45, Private, 148900, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +28, Federal-gov, 329426, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +64, Self-emp-inc, 181408, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K +44, Local-gov, 101950, Prof-school, 15, 
Separated, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +59, Self-emp-not-inc, 32537, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Private, 209547, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 202373, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 25, United-States, <=50K +29, Self-emp-not-inc, 151476, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 2174, 0, 40, United-States, <=50K +51, Self-emp-not-inc, 174824, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 8614, 0, 40, United-States, >50K +22, Private, 138768, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 143482, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 200190, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 80, United-States, >50K +38, Private, 168407, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 5721, 0, 44, United-States, <=50K +23, Private, 148315, Some-college, 10, Separated, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 270517, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, Mexico, <=50K +40, Private, 53506, Bachelors, 13, Divorced, Craft-repair, Own-child, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 105693, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 189589, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +20, Private, 164574, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +37, Private, 185744, HS-grad, 9, Married-civ-spouse, 
Other-service, Wife, White, Female, 0, 0, 20, United-States, <=50K +40, Local-gov, 33155, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 215955, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 3103, 0, 40, United-States, >50K +38, Self-emp-not-inc, 233571, HS-grad, 9, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 211253, Bachelors, 13, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +32, Federal-gov, 191385, Assoc-acdm, 12, Divorced, Protective-serv, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K +20, Private, 137895, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +62, State-gov, 159699, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 38, United-States, <=50K +31, Private, 295922, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +31, Self-emp-not-inc, 175856, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +24, Private, 216129, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +62, Local-gov, 407669, 7th-8th, 4, Widowed, Other-service, Not-in-family, Black, Female, 0, 0, 35, United-States, <=50K +43, Local-gov, 214242, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 285457, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 50, United-States, <=50K +30, Self-emp-inc, 124420, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 4650, 0, 40, United-States, <=50K +22, ?, 246386, HS-grad, 9, Never-married, ?, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +18, Private, 142751, 10th, 6, Never-married, Handlers-cleaners, Not-in-family, 
White, Male, 0, 0, 40, United-States, <=50K +59, Local-gov, 283635, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Self-emp-not-inc, 322931, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1902, 40, United-States, >50K +49, Private, 76482, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, State-gov, 431745, 11th, 7, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +48, Private, 141944, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 4386, 0, 40, United-States, >50K +32, Private, 193042, Prof-school, 15, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 60, United-States, >50K +33, Private, 67006, 10th, 6, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +23, Private, 240398, Bachelors, 13, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 15, United-States, <=50K +33, Federal-gov, 182714, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 65, United-States, >50K +50, Federal-gov, 172046, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +30, Private, 185177, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 43, United-States, <=50K +32, Private, 102858, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2002, 42, United-States, <=50K +39, Private, 84954, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 2829, 0, 65, United-States, <=50K +21, Private, 115895, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +23, Private, 184589, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 21, United-States, <=50K +32, Private, 282611, Bachelors, 13, Never-married, Exec-managerial, 
Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +57, Private, 218649, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, State-gov, 157541, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 10, United-States, <=50K +70, Private, 145419, HS-grad, 9, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 5, United-States, <=50K +34, Private, 122616, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 84, United-States, >50K +53, Private, 204584, Masters, 14, Divorced, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 117210, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 45, United-States, <=50K +37, Private, 69481, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 148492, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1485, 50, United-States, >50K +23, Private, 106957, 11th, 7, Never-married, Craft-repair, Own-child, Asian-Pac-Islander, Male, 14344, 0, 40, Vietnam, >50K +32, Private, 29312, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 80, United-States, >50K +57, Private, 120302, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +65, ?, 111916, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 182227, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +30, Private, 219110, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 46, United-States, <=50K +31, Private, 200192, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, Germany, <=50K +19, Private, 427862, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, 
<=50K +23, State-gov, 33551, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 38, United-States, <=50K +44, Private, 164043, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, ?, 116632, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 45, United-States, >50K +42, Private, 175133, Some-college, 10, Never-married, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +34, Self-emp-not-inc, 289731, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 256362, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 282612, Assoc-voc, 11, Never-married, Tech-support, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +21, Private, 73679, Some-college, 10, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 237824, HS-grad, 9, Married-spouse-absent, Priv-house-serv, Other-relative, Black, Female, 0, 0, 60, Jamaica, <=50K +36, Local-gov, 357720, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +49, Self-emp-not-inc, 155489, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 65, Poland, <=50K +44, Private, 138077, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +42, Private, 183479, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +30, Private, 103596, HS-grad, 9, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 99, United-States, <=50K +33, Private, 172304, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 313853, Bachelors, 13, Divorced, Other-service, Unmarried, Black, Male, 0, 0, 45, United-States, >50K +17, Private, 
294485, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 637080, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +32, Private, 385959, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 50, United-States, <=50K +33, Self-emp-not-inc, 116539, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 129263, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 60, United-States, <=50K +60, Private, 141253, 10th, 6, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +35, State-gov, 35626, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 15, United-States, <=50K +43, Federal-gov, 94937, Bachelors, 13, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 220269, Some-college, 10, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Self-emp-not-inc, 169544, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 5178, 0, 40, United-States, >50K +36, Private, 214604, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 42, United-States, >50K +27, Private, 81540, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +50, Private, 24013, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 84, United-States, >50K +22, Private, 124940, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, Amer-Indian-Eskimo, Female, 0, 0, 44, United-States, <=50K +33, State-gov, 313729, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +61, Private, 192237, 10th, 6, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, ?, 168524, Assoc-voc, 11, 
Married-civ-spouse, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +41, Self-emp-not-inc, 113324, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 70, United-States, >50K +22, Private, 215477, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 199903, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 431861, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 105938, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Black, Female, 0, 1602, 20, United-States, <=50K +28, Private, 274679, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +25, Private, 177499, Bachelors, 13, Never-married, Craft-repair, Own-child, White, Male, 0, 1590, 35, United-States, <=50K +28, Private, 206125, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Local-gov, 221740, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 30, United-States, >50K +58, Private, 202652, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 37, United-States, <=50K +39, Private, 348960, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 171876, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 157932, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +58, Private, 201344, Bachelors, 13, Divorced, Craft-repair, Own-child, White, Female, 0, 0, 20, United-States, <=50K +38, Private, 354739, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 36, Philippines, >50K +34, Private, 40067, Some-college, 
10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 326862, Some-college, 10, Divorced, Craft-repair, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +48, Local-gov, 189762, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +65, ?, 149049, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 226246, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +80, ?, 29020, Prof-school, 15, Married-civ-spouse, ?, Husband, White, Male, 10605, 0, 10, United-States, >50K +23, Private, 38251, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 196385, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 37, United-States, >50K +38, Self-emp-not-inc, 217054, Some-college, 10, Divorced, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Self-emp-not-inc, 104973, Masters, 14, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, >50K +48, Local-gov, 238959, Masters, 14, Divorced, Exec-managerial, Unmarried, Black, Female, 9562, 0, 40, United-States, >50K +40, State-gov, 34218, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Local-gov, 292962, HS-grad, 9, Never-married, Craft-repair, Other-relative, Black, Female, 0, 0, 40, United-States, <=50K +45, Private, 235924, Bachelors, 13, Divorced, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 98656, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +70, Private, 102610, Some-college, 10, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 80, United-States, <=50K +32, Local-gov, 296466, Masters, 14, Never-married, Prof-specialty, 
Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +33, Private, 323069, Assoc-voc, 11, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 184756, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Local-gov, 233993, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 15, United-States, <=50K +22, Private, 130724, Some-college, 10, Never-married, Sales, Own-child, Black, Male, 0, 0, 25, United-States, <=50K +52, Self-emp-inc, 181855, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Other, Male, 99999, 0, 65, United-States, >50K +67, Self-emp-not-inc, 127543, 10th, 6, Married-civ-spouse, Farming-fishing, Husband, White, Male, 2414, 0, 80, United-States, <=50K +40, Private, 187164, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1672, 45, United-States, <=50K +55, Private, 113912, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 20, United-States, <=50K +29, Private, 216479, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +62, Private, 135480, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 16, United-States, <=50K +22, Private, 204160, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +64, State-gov, 114650, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Self-emp-not-inc, 240172, Bachelors, 13, Never-married, Exec-managerial, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +28, Private, 184831, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 124590, HS-grad, 9, Never-married, Exec-managerial, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +47, State-gov, 120429, Doctorate, 16, 
Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, United-States, >50K +26, Private, 202033, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +18, Private, 156874, 12th, 8, Never-married, Other-service, Own-child, White, Male, 0, 0, 27, United-States, <=50K +52, Self-emp-inc, 177727, 10th, 6, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 45, United-States, <=50K +48, Local-gov, 334409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 50, United-States, >50K +36, Private, 311255, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 0, 0, 40, Haiti, <=50K +23, Private, 214227, Assoc-voc, 11, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +41, Private, 115849, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, State-gov, 671292, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 38, United-States, >50K +53, Private, 31460, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 141824, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 310152, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +25, Private, 179953, Masters, 14, Never-married, Prof-specialty, Own-child, White, Female, 2597, 0, 31, United-States, <=50K +31, Private, 137952, Some-college, 10, Married-civ-spouse, Other-service, Husband, Other, Male, 0, 0, 40, Puerto-Rico, <=50K +36, Private, 103323, Assoc-acdm, 12, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 2829, 0, 40, United-States, <=50K +46, Private, 174426, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, State-gov, 192779, 
Assoc-acdm, 12, Divorced, Adm-clerical, Unmarried, White, Male, 0, 2258, 38, United-States, >50K +32, Private, 169955, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 36, Puerto-Rico, <=50K +43, Self-emp-not-inc, 48087, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +30, Private, 132601, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 99999, 0, 50, United-States, >50K +41, Self-emp-inc, 253060, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 45, United-States, >50K +50, Private, 108435, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 99999, 0, 60, United-States, >50K +37, State-gov, 210452, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 38, United-States, <=50K +22, Local-gov, 134181, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +51, Federal-gov, 45487, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 80, United-States, <=50K +47, Private, 183522, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, Black, Female, 0, 0, 40, United-States, >50K +40, Private, 199303, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 83064, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +23, ?, 134997, Some-college, 10, Separated, ?, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +30, Private, 44419, Some-college, 10, Never-married, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +27, Self-emp-not-inc, 442612, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 65, United-States, >50K +31, Local-gov, 158092, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 35, United-States, <=50K +31, Private, 
374833, 1st-4th, 2, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, Mexico, <=50K +30, Private, 112650, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +50, Local-gov, 183390, Bachelors, 13, Separated, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +27, Private, 207418, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +22, ?, 335453, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 16, United-States, <=50K +29, Private, 243660, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, >50K +28, Private, 54243, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +54, Private, 50385, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, Black, Female, 0, 0, 45, United-States, >50K +47, State-gov, 187581, Assoc-voc, 11, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 48, United-States, >50K +34, Private, 37380, HS-grad, 9, Married-spouse-absent, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 247025, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +53, ?, 29231, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 35, United-States, <=50K +23, State-gov, 101094, Some-college, 10, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 60, United-States, <=50K +42, Local-gov, 176716, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 118429, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +52, Federal-gov, 221532, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 45, United-States, >50K +22, ?, 120572, Some-college, 10, 
Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +27, Local-gov, 124680, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +42, Private, 153160, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +39, Private, 114678, HS-grad, 9, Divorced, Other-service, Unmarried, Black, Female, 5455, 0, 40, United-States, <=50K +49, State-gov, 142856, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 29702, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 7688, 0, 40, United-States, >50K +20, Private, 277700, Preschool, 1, Never-married, Other-service, Own-child, White, Male, 0, 0, 32, United-States, <=50K +55, Self-emp-inc, 67433, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, <=50K +47, Private, 121124, 9th, 5, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 394447, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 33, United-States, >50K +36, Private, 79649, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 203763, Doctorate, 16, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 80, United-States, <=50K +55, Private, 229029, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 48, United-States, >50K +21, ?, 494638, Assoc-acdm, 12, Never-married, ?, Own-child, White, Male, 0, 0, 15, United-States, <=50K +48, Private, 162816, Assoc-acdm, 12, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 45, United-States, <=50K +30, Private, 109117, Assoc-voc, 11, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 45, United-States, <=50K +24, Private, 32732, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, 
United-States, <=50K +57, Self-emp-not-inc, 217692, HS-grad, 9, Widowed, Craft-repair, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +20, Private, 34590, Some-college, 10, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 60, United-States, <=50K +18, ?, 276864, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 1602, 20, United-States, <=50K +56, Private, 168625, HS-grad, 9, Divorced, Prof-specialty, Not-in-family, White, Female, 4101, 0, 40, United-States, <=50K +36, Private, 91037, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +44, Private, 171484, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +57, Private, 200453, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 15024, 0, 40, United-States, >50K +57, Private, 36990, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 52, United-States, <=50K +33, Private, 198211, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, <=50K +61, ?, 30475, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 70995, Bachelors, 13, Married-civ-spouse, Transport-moving, Husband, White, Male, 15024, 0, 99, United-States, >50K +28, Private, 245790, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +31, Private, 273324, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 1721, 16, United-States, <=50K +60, Private, 182687, Assoc-acdm, 12, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +36, Local-gov, 247807, Assoc-voc, 11, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +58, Private, 163113, HS-grad, 9, Widowed, Sales, Not-in-family, White, Female, 0, 0, 35, United-States, >50K +50, Private, 180522, Some-college, 10, 
Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +23, Local-gov, 203353, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 12, United-States, <=50K +30, Private, 87469, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +20, ?, 216563, 11th, 7, Never-married, ?, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +90, Private, 87372, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 20051, 0, 72, United-States, >50K +49, Local-gov, 173584, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +47, Local-gov, 80282, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3137, 0, 40, United-States, <=50K +34, Private, 319854, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, Taiwan, >50K +37, Federal-gov, 408229, HS-grad, 9, Divorced, Tech-support, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 431307, 10th, 6, Married-civ-spouse, Protective-serv, Wife, Black, Female, 0, 0, 50, United-States, <=50K +37, Private, 134088, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 246396, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, Mexico, <=50K +34, Private, 159255, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +34, Private, 106014, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 186934, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 50, United-States, >50K +39, Private, 120130, Some-college, 10, Separated, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, State-gov, 203849, 7th-8th, 4, Married-civ-spouse, 
Handlers-cleaners, Husband, White, Male, 0, 0, 19, United-States, <=50K +24, Private, 207940, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 30, United-States, <=50K +28, Private, 302406, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +41, Self-emp-not-inc, 144594, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2179, 40, United-States, <=50K +69, ?, 171050, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 9, United-States, <=50K +32, Private, 459007, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 90, United-States, <=50K +58, Private, 372181, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, >50K +47, Self-emp-not-inc, 172034, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 75, United-States, >50K +41, Private, 156566, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 4386, 0, 50, United-States, >50K +35, Self-emp-inc, 338320, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 353696, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, Canada, <=50K +46, Self-emp-not-inc, 342907, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 60, United-States, >50K +69, Self-emp-inc, 169717, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 6418, 0, 45, United-States, >50K +22, Private, 103762, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, State-gov, 47570, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 119432, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +23, Local-gov, 144165, Bachelors, 13, Never-married, Prof-specialty, Own-child, 
Amer-Indian-Eskimo, Male, 0, 0, 30, United-States, <=50K +35, Private, 180647, Some-college, 10, Never-married, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +37, Local-gov, 312232, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 5178, 0, 40, United-States, >50K +35, State-gov, 150488, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +18, Private, 200876, 11th, 7, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 16, United-States, <=50K +43, Private, 188199, 9th, 5, Divorced, Handlers-cleaners, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +53, State-gov, 118793, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Local-gov, 204325, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 52, United-States, <=50K +29, Private, 256671, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 231515, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 47, Cuba, <=50K +24, Private, 100669, Some-college, 10, Never-married, Handlers-cleaners, Own-child, Asian-Pac-Islander, Male, 0, 0, 30, United-States, <=50K +30, Private, 88913, Some-college, 10, Separated, Other-service, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +23, Private, 363219, Some-college, 10, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 6, United-States, <=50K +27, ?, 291547, Bachelors, 13, Married-civ-spouse, ?, Not-in-family, Other, Female, 0, 0, 6, Mexico, <=50K +36, Private, 308945, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Self-emp-not-inc, 100316, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 35, United-States, <=50K +33, Private, 296453, Masters, 14, 
Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 15, United-States, <=50K +66, Private, 298834, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, Canada, <=50K +45, Self-emp-not-inc, 188694, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +68, ?, 29240, HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +37, Private, 186934, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 60, United-States, >50K +17, Private, 154908, 10th, 6, Never-married, Other-service, Own-child, White, Female, 0, 0, 10, United-States, <=50K +31, Private, 22201, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, >50K +46, Private, 216999, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 55, United-States, >50K +40, Private, 186916, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 116677, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 95763, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +42, Private, 266710, Some-college, 10, Separated, Adm-clerical, Unmarried, Black, Female, 0, 0, 41, United-States, <=50K +46, Private, 117849, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +30, Private, 242460, HS-grad, 9, Separated, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +33, Self-emp-not-inc, 202729, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +47, Private, 181652, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +57, Self-emp-not-inc, 174760, Assoc-acdm, 12, Married-spouse-absent, 
Farming-fishing, Unmarried, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +34, Private, 56121, 11th, 7, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 390369, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 149726, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 51262, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 190350, 12th, 8, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 35, ?, <=50K +53, Federal-gov, 205288, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 7688, 0, 35, United-States, >50K +36, Private, 154835, HS-grad, 9, Separated, Adm-clerical, Own-child, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +45, Private, 89028, HS-grad, 9, Divorced, Craft-repair, Not-in-family, Asian-Pac-Islander, Male, 10520, 0, 40, United-States, >50K +36, Private, 194630, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +18, Self-emp-not-inc, 212207, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 11, United-States, <=50K +27, Private, 204788, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 158688, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 97723, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 193026, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Self-emp-not-inc, 257250, 7th-8th, 4, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 75, United-States, <=50K +48, Private, 355978, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 
40, United-States, <=50K +41, Self-emp-not-inc, 200574, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 60, United-States, >50K +21, Private, 376929, 5th-6th, 3, Never-married, Priv-house-serv, Not-in-family, White, Female, 0, 0, 40, Mexico, <=50K +47, State-gov, 123219, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 38, United-States, >50K +41, Private, 82778, 1st-4th, 2, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, Mexico, <=50K +61, Self-emp-not-inc, 115882, 1st-4th, 2, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +64, Private, 103021, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Private, 297767, Some-college, 10, Separated, Adm-clerical, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +44, Private, 259479, HS-grad, 9, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +20, Private, 167787, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +23, Local-gov, 40021, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 70, United-States, <=50K +52, Private, 245275, 10th, 6, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 35, United-States, <=50K +43, Private, 37402, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 25, United-States, <=50K +32, Private, 103608, Bachelors, 13, Married-civ-spouse, Handlers-cleaners, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +63, Private, 137192, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +29, Private, 137618, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 41, United-States, >50K +42, Self-emp-inc, 96509, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, 
Asian-Pac-Islander, Male, 0, 0, 60, Taiwan, <=50K +65, Private, 196174, 10th, 6, Divorced, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 28, United-States, <=50K +24, Private, 172612, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 141186, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 228190, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +40, Self-emp-inc, 190290, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, ?, >50K +38, Federal-gov, 307404, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +26, Private, 152436, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +46, Self-emp-not-inc, 182541, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 1672, 50, United-States, <=50K +39, Private, 282153, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +29, ?, 41281, Bachelors, 13, Married-spouse-absent, ?, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +42, Private, 162003, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 36, United-States, >50K +36, Private, 190759, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +26, Private, 208122, Some-college, 10, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +57, Private, 173832, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 40, United-States, >50K +55, Private, 129173, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +35, Private, 287548, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, 
United-States, >50K +41, Private, 216116, HS-grad, 9, Separated, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, ?, <=50K +24, Private, 146706, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +47, Private, 285200, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Self-emp-inc, 314375, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +44, Private, 203943, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 35, United-States, >50K +18, ?, 274746, HS-grad, 9, Never-married, ?, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +27, Private, 517000, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +36, Private, 66173, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +21, Private, 182823, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 0, 30, United-States, <=50K +29, Private, 159479, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Other, Male, 0, 0, 55, United-States, <=50K +25, Private, 135568, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +73, Private, 333676, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +45, Private, 201699, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 96020, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 50, United-States, >50K +43, Private, 176138, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +47, Private, 47496, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 42, United-States, <=50K +20, Private, 187158, Some-college, 
10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +22, Private, 249727, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 20, United-States, <=50K +76, Self-emp-not-inc, 237624, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K +24, Private, 175254, Some-college, 10, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 42924, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +30, Private, 205950, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +33, Private, 111985, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 58, United-States, <=50K +30, Private, 167476, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +40, Private, 221172, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +27, ?, 188711, Some-college, 10, Divorced, ?, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +49, Private, 199448, Assoc-voc, 11, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 313038, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 148431, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Other, Female, 0, 0, 40, United-States, <=50K +19, Private, 112432, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 58, United-States, <=50K +46, Private, 57914, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 145166, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 50, United-States, <=50K +56, Private, 247119, 7th-8th, 4, Widowed, 
Machine-op-inspct, Unmarried, Other, Female, 0, 0, 40, Dominican-Republic, <=50K +53, Private, 196278, Some-college, 10, Widowed, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +60, ?, 366531, Assoc-voc, 11, Widowed, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 216481, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 188027, Some-college, 10, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +37, Private, 66686, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +41, Private, 74775, Bachelors, 13, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 30, Vietnam, <=50K +65, ?, 325537, Assoc-voc, 11, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 50, United-States, >50K +30, Self-emp-not-inc, 250499, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 55, United-States, >50K +57, Self-emp-not-inc, 192869, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 72, United-States, <=50K +44, Self-emp-inc, 121352, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Self-emp-not-inc, 70985, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 4064, 0, 40, United-States, <=50K +27, Self-emp-not-inc, 123116, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +57, Local-gov, 339163, Some-college, 10, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +59, Self-emp-not-inc, 124771, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 30, United-States, <=50K +32, Private, 167531, Prof-school, 15, Married-civ-spouse, Prof-specialty, Wife, Asian-Pac-Islander, Female, 15024, 0, 50, United-States, >50K +90, ?, 77053, 
HS-grad, 9, Widowed, ?, Not-in-family, White, Female, 0, 4356, 40, United-States, <=50K +22, Private, 199266, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +39, Private, 190728, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 99212, Assoc-voc, 11, Married-civ-spouse, Exec-managerial, Husband, White, Male, 3103, 0, 48, United-States, >50K +38, Local-gov, 421446, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 50, United-States, >50K +61, Private, 215944, 9th, 5, Divorced, Sales, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +24, Private, 72310, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 43, United-States, <=50K +25, Private, 57512, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +44, Private, 89413, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Local-gov, 28151, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +90, Private, 46786, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 9386, 0, 15, United-States, >50K +30, Private, 226943, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +44, Private, 182402, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 305352, 10th, 6, Divorced, Craft-repair, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +63, Self-emp-inc, 189253, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +60, Private, 296485, 5th-6th, 3, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Self-emp-not-inc, 204375, Prof-school, 15, Never-married, Prof-specialty, 
Not-in-family, White, Female, 0, 0, 60, United-States, >50K +49, Self-emp-not-inc, 249585, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, United-States, <=50K +47, Private, 148995, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, United-States, >50K +42, Self-emp-inc, 168071, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 43, United-States, >50K +53, Private, 194995, 5th-6th, 3, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, Italy, <=50K +23, Private, 211049, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 4101, 0, 40, United-States, <=50K +28, ?, 196630, Assoc-voc, 11, Separated, ?, Unmarried, White, Female, 0, 0, 40, Mexico, <=50K +20, Private, 50397, Some-college, 10, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 35, United-States, <=50K +43, Private, 177905, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 3908, 0, 40, United-States, <=50K +32, Private, 204374, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1977, 60, United-States, >50K +43, Private, 60001, Bachelors, 13, Divorced, Sales, Unmarried, White, Male, 0, 0, 44, United-States, >50K +31, Private, 223046, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +29, ?, 44921, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 154571, Some-college, 10, Never-married, Adm-clerical, Own-child, Asian-Pac-Islander, Male, 0, 0, 20, United-States, <=50K +39, Private, 67136, Assoc-voc, 11, Separated, Adm-clerical, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, United-States, <=50K +29, Private, 188675, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Black, Male, 0, 0, 40, Jamaica, >50K +20, Private, 390817, 5th-6th, 3, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 25, Mexico, <=50K +23, 
?, 145964, Some-college, 10, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 30424, 11th, 7, Separated, Other-service, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +53, Private, 548361, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +20, Private, 189148, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 48, United-States, <=50K +58, Self-emp-not-inc, 266707, 1st-4th, 2, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 2179, 18, United-States, <=50K +51, Self-emp-not-inc, 311569, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 187653, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +38, Private, 235379, Assoc-acdm, 12, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +41, Private, 188615, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +58, Private, 322691, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 184698, 10th, 6, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, Dominican-Republic, <=50K +50, Private, 144361, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 130057, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +31, Self-emp-inc, 117963, Doctorate, 16, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 123876, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +37, Private, 248445, HS-grad, 9, Divorced, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, El-Salvador, <=50K +32, Private, 
207172, Some-college, 10, Never-married, Sales, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +46, State-gov, 119904, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 1564, 55, United-States, >50K +62, Private, 134768, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Local-gov, 269168, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Other, Male, 0, 0, 40, ?, <=50K +56, Private, 132026, Bachelors, 13, Married-civ-spouse, Sales, Husband, Black, Male, 7688, 0, 45, United-States, >50K +37, Private, 60722, Some-college, 10, Divorced, Exec-managerial, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Japan, >50K +41, Private, 648223, 1st-4th, 2, Married-spouse-absent, Farming-fishing, Unmarried, White, Male, 0, 0, 40, Mexico, <=50K +56, Private, 298695, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +20, Private, 219835, Some-college, 10, Never-married, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +34, Self-emp-not-inc, 313729, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +45, Private, 140644, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +30, Private, 203488, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +51, Self-emp-not-inc, 132341, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +27, Private, 161683, 11th, 7, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 42, United-States, <=50K +38, Private, 312771, Assoc-voc, 11, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +39, Private, 258102, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, United-States, <=50K +57, ?, 24127, 
Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, United-States, <=50K +47, Private, 254367, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +77, ?, 185426, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +43, Private, 152629, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Local-gov, 141058, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, United-States, <=50K +41, Private, 233130, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 406641, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +30, State-gov, 119422, 10th, 6, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 255486, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 25, United-States, <=50K +22, Private, 161532, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 35, United-States, <=50K +25, Private, 75759, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 44, United-States, >50K +18, Private, 163332, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 22, United-States, <=50K +28, Private, 103802, Bachelors, 13, Never-married, Exec-managerial, Own-child, White, Female, 0, 1408, 40, ?, <=50K +50, Private, 34832, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 15024, 0, 40, United-States, >50K +28, Private, 37933, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Female, 0, 0, 48, United-States, <=50K +21, Private, 165107, Some-college, 10, Never-married, Priv-house-serv, Own-child, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 126011, Assoc-voc, 11, Divorced, Tech-support, Own-child, 
White, Male, 0, 0, 40, United-States, <=50K +28, Federal-gov, 56651, Bachelors, 13, Never-married, Prof-specialty, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +23, Private, 522881, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 35, Mexico, <=50K +32, Private, 191777, Assoc-voc, 11, Never-married, Exec-managerial, Not-in-family, Black, Female, 0, 0, 35, England, <=50K +27, Private, 132686, 12th, 8, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 50, United-States, <=50K +55, Private, 201112, HS-grad, 9, Divorced, Prof-specialty, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +44, Private, 174283, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +25, Private, 208591, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 126399, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +50, Private, 142073, HS-grad, 9, Married-spouse-absent, Exec-managerial, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +18, Private, 395567, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +74, Private, 180455, Bachelors, 13, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 8, United-States, <=50K +22, Private, 235853, 9th, 5, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 160731, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, State-gov, 31935, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 80, United-States, <=50K +41, Private, 35166, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 15024, 0, 50, United-States, >50K +24, Private, 161092, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 7298, 0, 40, United-States, 
>50K +23, Private, 223019, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +34, Self-emp-not-inc, 179673, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 7688, 0, 60, United-States, >50K +46, State-gov, 248895, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +32, Private, 200323, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +41, Private, 230020, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Other, Male, 0, 0, 40, United-States, <=50K +29, Private, 134890, Assoc-voc, 11, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, >50K +48, Private, 162096, 9th, 5, Married-civ-spouse, Machine-op-inspct, Other-relative, Asian-Pac-Islander, Female, 0, 0, 45, China, <=50K +51, Private, 103824, HS-grad, 9, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 40, Haiti, <=50K +34, State-gov, 61431, 12th, 8, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +58, Private, 197319, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +52, Private, 183618, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 268598, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Other, Male, 7298, 0, 50, Puerto-Rico, >50K +53, Private, 263729, Some-college, 10, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +54, Private, 39493, Assoc-voc, 11, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 20, United-States, <=50K +36, Private, 185360, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +25, Private, 132661, Assoc-voc, 11, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 60, 
United-States, <=50K +20, Private, 266400, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 48, United-States, <=50K +23, Private, 433669, Assoc-acdm, 12, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Self-emp-inc, 216473, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +20, Self-emp-not-inc, 217404, 10th, 6, Never-married, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 227778, Assoc-voc, 11, Never-married, Other-service, Other-relative, Black, Male, 0, 0, 40, United-States, <=50K +73, State-gov, 96262, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +67, Private, 247566, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 24, United-States, <=50K +56, Private, 139616, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +32, Private, 73585, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 37869, Some-college, 10, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 1590, 40, United-States, <=50K +33, Private, 165814, HS-grad, 9, Married-civ-spouse, Other-service, Husband, Black, Male, 0, 0, 40, United-States, <=50K +37, Private, 108913, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 34975, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +31, Private, 157078, 10th, 6, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +59, Private, 232672, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +21, Private, 294295, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 40, 
United-States, <=50K +58, Self-emp-inc, 130454, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Local-gov, 461678, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, State-gov, 252284, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 256737, 11th, 7, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Local-gov, 96480, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, Germany, <=50K +25, Private, 234263, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 109952, 10th, 6, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 50, United-States, <=50K +24, Private, 262570, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +39, Self-emp-not-inc, 65716, Assoc-acdm, 12, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, >50K +68, Private, 201732, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +66, Self-emp-not-inc, 174788, Some-college, 10, Never-married, Sales, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +38, Private, 278924, Bachelors, 13, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 101593, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +71, ?, 193863, 7th-8th, 4, Widowed, ?, Other-relative, White, Female, 0, 0, 16, Poland, <=50K +37, Private, 342768, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-not-inc, 242606, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 4386, 0, 45, United-States, >50K 
+27, State-gov, 176727, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 99179, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +19, State-gov, 354104, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 10, United-States, <=50K +25, Private, 61956, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +47, Federal-gov, 137917, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, >50K +40, Private, 224658, Some-college, 10, Married-civ-spouse, Sales, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 51100, Bachelors, 13, Divorced, Exec-managerial, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +25, Private, 224361, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 362912, Some-college, 10, Never-married, Craft-repair, Own-child, White, Female, 0, 0, 50, United-States, <=50K +23, Private, 218782, 10th, 6, Never-married, Handlers-cleaners, Other-relative, Other, Male, 0, 0, 40, United-States, <=50K +28, Private, 103389, Masters, 14, Divorced, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 308944, HS-grad, 9, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 140092, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 202210, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +52, Private, 416059, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 40, United-States, >50K +33, Self-emp-not-inc, 281030, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 94, 
United-States, <=50K +19, Private, 169758, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 35, United-States, <=50K +68, Private, 193666, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 20051, 0, 55, United-States, >50K +41, Private, 139907, 10th, 6, Never-married, Handlers-cleaners, Unmarried, White, Male, 0, 0, 50, United-States, <=50K +18, Self-emp-inc, 119422, HS-grad, 9, Never-married, Other-service, Unmarried, Asian-Pac-Islander, Female, 0, 0, 30, India, <=50K +29, Private, 149324, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 1485, 40, United-States, >50K +40, Private, 259307, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +51, Self-emp-not-inc, 74160, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Male, 0, 0, 60, United-States, >50K +49, Private, 134797, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +20, State-gov, 41103, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +38, Local-gov, 193026, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +57, Private, 303986, 5th-6th, 3, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, Cuba, <=50K +35, Private, 126569, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 40, United-States, <=50K +66, Private, 166461, 11th, 7, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 26, United-States, <=50K +27, ?, 61387, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +25, Private, 254746, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +77, ?, 28678, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 9386, 0, 6, United-States, >50K +19, ?, 180976, 10th, 6, Never-married, ?, 
Unmarried, White, Female, 0, 0, 35, United-States, <=50K +70, Private, 282642, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 2174, 40, United-States, >50K +59, Self-emp-not-inc, 136413, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 48, United-States, <=50K +25, Private, 131463, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +44, Local-gov, 177240, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 10520, 0, 40, United-States, >50K +37, Private, 218490, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, El-Salvador, >50K +75, ?, 260543, 10th, 6, Widowed, ?, Other-relative, Asian-Pac-Islander, Female, 0, 0, 1, China, <=50K +21, ?, 80680, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 20728, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 4101, 0, 40, United-States, <=50K +47, Federal-gov, 117628, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 91939, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 1721, 30, United-States, <=50K +32, State-gov, 175931, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 309566, HS-grad, 9, Never-married, Adm-clerical, Unmarried, White, Female, 0, 0, 20, United-States, <=50K +53, Private, 123703, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, ?, 369678, HS-grad, 9, Never-married, ?, Not-in-family, Other, Male, 0, 0, 30, United-States, <=50K +58, Private, 29928, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 36, United-States, <=50K +22, Private, 167868, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +23, Private, 
235894, 11th, 7, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +21, Private, 189888, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Male, 3325, 0, 60, United-States, <=50K +36, Private, 111545, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 70, United-States, <=50K +39, Private, 175972, Some-college, 10, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 15, United-States, <=50K +34, Local-gov, 254270, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +41, Local-gov, 185057, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +72, Private, 157593, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 1455, 0, 6, United-States, <=50K +34, Private, 101345, HS-grad, 9, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 40, United-States, >50K +51, Local-gov, 176751, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 1902, 40, United-States, >50K +32, Self-emp-not-inc, 97723, Assoc-voc, 11, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 127601, Some-college, 10, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +37, Private, 227597, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, ?, 143995, Some-college, 10, Never-married, ?, Own-child, Black, Male, 0, 0, 20, United-States, <=50K +21, Private, 250051, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 10, United-States, <=50K +26, Private, 284078, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 207668, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 1887, 40, United-States, >50K 
+18, Private, 163787, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +27, Private, 119170, 11th, 7, Never-married, Exec-managerial, Unmarried, White, Female, 0, 0, 45, United-States, <=50K +20, Private, 188612, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 38, Nicaragua, <=50K +36, Private, 114605, Assoc-voc, 11, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +31, ?, 317761, Bachelors, 13, Never-married, ?, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 164197, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 60, United-States, >50K +54, Private, 329266, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 44, United-States, >50K +34, Local-gov, 207383, Masters, 14, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 123598, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, >50K +33, Private, 259931, 11th, 7, Separated, Machine-op-inspct, Other-relative, White, Male, 0, 0, 30, United-States, <=50K +32, Private, 134737, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 50, United-States, >50K +42, Private, 106900, Assoc-voc, 11, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 87054, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +37, Private, 82622, Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +28, Private, 181659, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 321205, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 4101, 0, 35, United-States, <=50K +44, Self-emp-not-inc, 231348, 
Some-college, 10, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 276096, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 290560, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +21, Private, 307315, Some-college, 10, Never-married, Adm-clerical, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +39, State-gov, 99156, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +24, Private, 237928, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 39, United-States, <=50K +46, Private, 153501, HS-grad, 9, Never-married, Transport-moving, Not-in-family, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +47, ?, 149700, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 36, United-States, >50K +47, Private, 189680, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1977, 40, United-States, >50K +35, Private, 374524, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 75, United-States, >50K +60, Self-emp-not-inc, 127805, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, >50K +35, Private, 150217, Bachelors, 13, Married-civ-spouse, Other-service, Wife, White, Female, 0, 0, 24, Poland, <=50K +33, Private, 295649, HS-grad, 9, Separated, Other-service, Unmarried, White, Female, 0, 0, 40, China, <=50K +21, Private, 197182, Assoc-acdm, 12, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Private, 241998, Bachelors, 13, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 50, United-States, >50K +48, Federal-gov, 156410, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 50, United-States, >50K +58, Private, 473836, 7th-8th, 4, 
Widowed, Farming-fishing, Other-relative, White, Female, 0, 0, 45, Guatemala, <=50K +21, Private, 198431, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 113936, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 318915, HS-grad, 9, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +28, Self-emp-not-inc, 175406, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 30, United-States, >50K +33, ?, 193172, Assoc-voc, 11, Married-civ-spouse, ?, Own-child, White, Female, 7688, 0, 50, United-States, >50K +23, Federal-gov, 320294, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, State-gov, 400285, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, >50K +24, ?, 283731, Bachelors, 13, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Local-gov, 227154, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +49, Private, 298659, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 15, Mexico, <=50K +47, Private, 212120, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +50, Private, 320510, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 40, United-States, >50K +21, Private, 175800, HS-grad, 9, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 55, United-States, <=50K +55, Private, 170169, HS-grad, 9, Separated, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 344157, 11th, 7, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 199441, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, 
White, Female, 0, 0, 40, United-States, <=50K +47, Private, 225456, HS-grad, 9, Never-married, Tech-support, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +36, Private, 61178, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +28, Local-gov, 175262, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 2002, 40, England, <=50K +42, Private, 152568, HS-grad, 9, Widowed, Sales, Unmarried, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +41, Private, 182108, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 27828, 0, 35, United-States, >50K +46, Private, 273771, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 99999, 0, 40, United-States, >50K +32, Private, 208291, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +34, Private, 224358, 10th, 6, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +33, Private, 55176, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, State-gov, 152711, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +53, Private, 68684, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 185452, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +39, Federal-gov, 175232, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 1977, 60, United-States, >50K +23, Private, 173851, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 162327, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 1902, 50, ?, >50K +36, Local-gov, 51424, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, 
United-States, <=50K +19, Private, 123416, 12th, 8, Separated, Prof-specialty, Own-child, White, Female, 1055, 0, 40, United-States, <=50K +26, Private, 262656, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +38, Private, 233194, HS-grad, 9, Married-civ-spouse, Sales, Husband, Black, Male, 0, 0, 40, United-States, >50K +41, Private, 290660, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 7688, 0, 55, United-States, >50K +22, Private, 151105, Some-college, 10, Never-married, Sales, Other-relative, White, Female, 0, 0, 18, United-States, <=50K +38, Private, 179117, Assoc-acdm, 12, Never-married, Machine-op-inspct, Not-in-family, Black, Female, 10520, 0, 50, United-States, >50K +72, ?, 33608, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 9386, 0, 30, United-States, >50K +39, Private, 317434, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, State-gov, 126569, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 7298, 0, 40, United-States, >50K +38, Local-gov, 745768, Some-college, 10, Divorced, Exec-managerial, Unmarried, Black, Female, 0, 0, 45, United-States, <=50K +19, Private, 69927, HS-grad, 9, Married-civ-spouse, Other-service, Wife, Black, Female, 0, 0, 16, United-States, <=50K +26, Private, 302603, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 45, United-States, <=50K +52, Private, 46788, Bachelors, 13, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 25, United-States, <=50K +41, Private, 289886, 5th-6th, 3, Married-civ-spouse, Other-service, Husband, Other, Male, 0, 1579, 40, Nicaragua, <=50K +45, Private, 179135, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +58, Federal-gov, 175873, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +34, Private, 57426, Bachelors, 
13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +36, Private, 312206, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Without-pay, 344858, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 20, United-States, <=50K +26, State-gov, 177035, 11th, 7, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +60, Private, 88055, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +35, Self-emp-not-inc, 111095, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, <=50K +39, Private, 192251, 10th, 6, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 60, United-States, <=50K +27, Private, 29807, HS-grad, 9, Separated, Handlers-cleaners, Unmarried, White, Female, 0, 0, 40, Japan, <=50K +26, Federal-gov, 211596, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 268276, 12th, 8, Never-married, Other-service, Own-child, White, Male, 0, 0, 12, United-States, <=50K +59, Self-emp-not-inc, 181070, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, England, >50K +53, Local-gov, 20676, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, Amer-Indian-Eskimo, Male, 0, 0, 48, United-States, <=50K +35, Private, 115803, 11th, 7, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +34, Local-gov, 124827, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +36, Private, 95336, HS-grad, 9, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, >50K +36, Private, 257942, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 72593, HS-grad, 9, Never-married, Other-service, Own-child, White, 
Male, 0, 0, 40, United-States, <=50K +29, Private, 147340, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +35, Private, 185325, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, >50K +59, Self-emp-not-inc, 357943, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 215395, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 1602, 10, United-States, <=50K +50, Local-gov, 30682, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +24, Federal-gov, 29591, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, Other, Female, 0, 0, 40, United-States, <=50K +36, Private, 215392, Bachelors, 13, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 110554, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 4386, 0, 40, United-States, >50K +42, Self-emp-not-inc, 133584, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, El-Salvador, <=50K +38, Private, 210438, 7th-8th, 4, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 256916, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 73541, 10th, 6, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 109952, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +54, Private, 197975, 5th-6th, 3, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 51, United-States, <=50K +27, Private, 401723, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +42, Private, 179524, Bachelors, 13, Separated, Other-service, Not-in-family, White, Female, 0, 0, 
50, United-States, <=50K +33, State-gov, 296282, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 145844, Assoc-acdm, 12, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +59, Private, 191965, 11th, 7, Married-civ-spouse, Other-service, Wife, White, Female, 3908, 0, 28, United-States, <=50K +54, Private, 96792, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +48, Private, 185041, 7th-8th, 4, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1672, 55, United-States, <=50K +19, ?, 233779, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 60, United-States, <=50K +45, Private, 347834, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 215373, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 70, United-States, <=50K +35, Self-emp-not-inc, 169426, Assoc-acdm, 12, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 202856, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 36, United-States, <=50K +33, Private, 50276, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +48, Self-emp-not-inc, 187454, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 126098, HS-grad, 9, Separated, Craft-repair, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +19, Private, 250639, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 24, United-States, <=50K +64, Self-emp-inc, 195366, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +51, Self-emp-not-inc, 186845, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 
8, United-States, <=50K +20, Federal-gov, 119156, HS-grad, 9, Never-married, Adm-clerical, Other-relative, White, Male, 0, 0, 20, United-States, <=50K +28, Private, 162343, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, Puerto-Rico, <=50K +52, Private, 108435, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1902, 50, Greece, >50K +29, Self-emp-not-inc, 394927, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +51, Private, 172281, Bachelors, 13, Separated, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 370767, HS-grad, 9, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 2377, 60, United-States, <=50K +43, Private, 352005, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 45, United-States, >50K +52, Private, 165681, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 258819, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +25, Private, 130793, Some-college, 10, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +36, Private, 118909, Assoc-acdm, 12, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, Jamaica, <=50K +44, Private, 202466, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 60, United-States, <=50K +47, Private, 161558, 10th, 6, Married-spouse-absent, Transport-moving, Not-in-family, Black, Male, 0, 0, 45, United-States, <=50K +32, Private, 188246, HS-grad, 9, Divorced, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +37, Private, 160120, Masters, 14, Never-married, Prof-specialty, Unmarried, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +40, Private, 144594, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 2829, 0, 40, United-States, <=50K 
+34, Self-emp-not-inc, 123429, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K +35, Self-emp-inc, 340110, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, >50K +26, Private, 523067, Some-college, 10, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 3, El-Salvador, <=50K +49, Self-emp-not-inc, 113513, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +63, ?, 186809, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 30, United-States, >50K +46, Self-emp-not-inc, 320421, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +31, Local-gov, 295589, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +22, Private, 370548, Assoc-voc, 11, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +20, Private, 120572, Some-college, 10, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 12, United-States, <=50K +52, Private, 110977, Doctorate, 16, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +26, Private, 55860, Some-college, 10, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +34, Private, 158800, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +31, Private, 131568, 9th, 5, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 173613, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +22, Private, 216867, 7th-8th, 4, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, Mexico, <=50K +38, Private, 104089, Assoc-voc, 11, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 40, United-States, <=50K +35, 
Private, 208106, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, Ecuador, <=50K +27, State-gov, 340269, Some-college, 10, Never-married, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 236246, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 213408, Some-college, 10, Divorced, Sales, Unmarried, White, Female, 0, 0, 40, Cuba, <=50K +40, ?, 84232, HS-grad, 9, Never-married, ?, Not-in-family, White, Female, 0, 0, 4, United-States, <=50K +19, Private, 302945, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 10, Thailand, <=50K +69, ?, 28197, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 60, United-States, >50K +20, Private, 262749, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 35, United-States, <=50K +34, Federal-gov, 198265, Bachelors, 13, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 60, United-States, <=50K +49, Private, 170871, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +27, Private, 177761, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, Other, Male, 0, 0, 50, United-States, <=50K +59, Private, 175689, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 14, Cuba, >50K +45, Private, 205100, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 15024, 0, 60, United-States, >50K +21, Private, 77759, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +51, State-gov, 77905, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +64, ?, 193575, 11th, 7, Never-married, ?, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +41, State-gov, 116520, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, 
United-States, >50K +18, ?, 85154, 12th, 8, Never-married, ?, Own-child, Asian-Pac-Islander, Female, 0, 0, 24, Germany, <=50K +49, Private, 180532, Masters, 14, Married-spouse-absent, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +51, Private, 508891, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +20, Private, 211345, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 20, United-States, <=50K +69, Self-emp-not-inc, 170877, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 48, United-States, <=50K +18, ?, 97318, HS-grad, 9, Never-married, ?, Own-child, White, Female, 0, 0, 35, United-States, <=50K +43, Private, 184105, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 50, United-States, <=50K +50, Private, 150941, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 44, United-States, <=50K +32, Private, 303942, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Local-gov, 273929, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 197077, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +62, Private, 162825, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +46, Private, 159869, HS-grad, 9, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 44, United-States, <=50K +19, Private, 158343, Some-college, 10, Never-married, Tech-support, Own-child, White, Female, 0, 0, 40, ?, <=50K +17, ?, 406920, 10th, 6, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 227986, HS-grad, 9, Never-married, Machine-op-inspct, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +36, Private, 137527, Bachelors, 13, Never-married, Adm-clerical, 
Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +36, Private, 180150, 12th, 8, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 239539, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, Asian-Pac-Islander, Male, 0, 0, 40, Philippines, <=50K +58, Private, 281792, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +40, Private, 224799, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K +64, Private, 292639, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 10566, 0, 35, United-States, <=50K +66, Private, 22313, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 20, United-States, <=50K +42, Private, 194636, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Private, 156089, Some-college, 10, Widowed, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +53, Private, 193720, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 40, United-States, >50K +25, Private, 218667, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 358837, Some-college, 10, Never-married, Tech-support, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +20, Private, 174685, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 168854, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 54, United-States, <=50K +28, Private, 133696, Bachelors, 13, Never-married, Sales, Unmarried, White, Male, 0, 0, 65, United-States, <=50K +23, Federal-gov, 350680, Assoc-acdm, 12, Never-married, Exec-managerial, Own-child, White, Female, 0, 0, 40, Poland, <=50K +18, Private, 115215, HS-grad, 9, Never-married, Other-service, Own-child, White, Male, 
0, 0, 25, United-States, <=50K +43, Self-emp-not-inc, 152958, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +29, Private, 217200, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 235124, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 46, Dominican-Republic, <=50K +31, Local-gov, 144949, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +60, Private, 135470, 5th-6th, 3, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, Mexico, <=50K +42, Private, 281209, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +46, Private, 155489, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +38, Private, 290306, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +18, Private, 182042, 11th, 7, Never-married, Other-service, Own-child, White, Female, 0, 0, 19, United-States, <=50K +31, Private, 210008, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +54, Private, 234938, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 55, United-States, <=50K +46, Private, 315423, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 2042, 50, United-States, <=50K +27, Self-emp-not-inc, 30244, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 80, United-States, <=50K +50, Local-gov, 30008, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +38, Self-emp-not-inc, 201328, 7th-8th, 4, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 56, United-States, <=50K +36, State-gov, 96468, Masters, 14, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, 
<=50K +25, Private, 486332, HS-grad, 9, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +19, Private, 46162, HS-grad, 9, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 25, United-States, <=50K +60, Local-gov, 98350, Some-college, 10, Married-civ-spouse, Other-service, Husband, Asian-Pac-Islander, Male, 0, 0, 60, Philippines, <=50K +45, Local-gov, 175958, 9th, 5, Never-married, Other-service, Own-child, White, Male, 0, 0, 40, United-States, <=50K +21, Private, 119309, Some-college, 10, Never-married, Tech-support, Own-child, White, Male, 0, 1602, 16, United-States, <=50K +42, Private, 175935, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 1980, 46, United-States, <=50K +38, Private, 204527, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +22, ?, 57827, Some-college, 10, Never-married, ?, Own-child, White, Male, 0, 0, 40, United-States, <=50K +19, Private, 418176, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 32, United-States, <=50K +23, Private, 262744, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 177287, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Female, 0, 0, 30, United-States, <=50K +30, Private, 255004, Assoc-acdm, 12, Divorced, Sales, Not-in-family, White, Male, 0, 0, 52, United-States, <=50K +62, Private, 183735, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Self-emp-not-inc, 318644, Prof-school, 15, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 20, United-States, <=50K +42, Federal-gov, 132125, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 52, United-States, >50K +33, Private, 206051, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +54, Self-emp-inc, 
99185, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, ?, >50K +35, Private, 225750, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +33, Private, 245777, HS-grad, 9, Divorced, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 169092, HS-grad, 9, Divorced, Sales, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +62, Private, 211035, Bachelors, 13, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 30, United-States, >50K +24, Private, 285432, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 40, United-States, <=50K +50, Local-gov, 154779, Some-college, 10, Divorced, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +54, Private, 37237, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +58, Private, 417419, 7th-8th, 4, Divorced, Machine-op-inspct, Own-child, White, Male, 0, 0, 40, United-States, <=50K +39, Self-emp-inc, 33975, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +32, Private, 42485, HS-grad, 9, Divorced, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 170017, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 152683, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 3908, 0, 35, United-States, <=50K +20, Private, 41721, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 60, United-States, <=50K +64, Private, 66634, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +55, Self-emp-inc, 257216, Masters, 14, Widowed, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 167882, HS-grad, 9, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 
43, United-States, <=50K +45, Private, 179428, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 55, United-States, <=50K +26, Private, 57512, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 301614, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 193820, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 1876, 40, United-States, <=50K +58, Private, 222247, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 1887, 40, United-States, >50K +39, Self-emp-inc, 189092, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, <=50K +47, Private, 217509, HS-grad, 9, Widowed, Priv-house-serv, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 45, Thailand, <=50K +35, Private, 308691, Masters, 14, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 48, United-States, <=50K +38, Private, 169672, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 120914, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 370156, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +28, Private, 398220, 5th-6th, 3, Never-married, Craft-repair, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +44, Self-emp-not-inc, 208277, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 45, United-States, <=50K +40, Private, 337456, HS-grad, 9, Divorced, Protective-serv, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +55, Private, 172666, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +29, Self-emp-not-inc, 32280, HS-grad, 9, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 45, 
United-States, <=50K +33, Private, 194901, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 45, United-States, <=50K +19, ?, 57329, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, Japan, <=50K +32, Private, 173730, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +45, Local-gov, 153312, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Black, Male, 0, 0, 10, United-States, >50K +23, Private, 274797, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 25, United-States, <=50K +31, Private, 359249, Assoc-voc, 11, Never-married, Protective-serv, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +22, Private, 152744, Some-college, 10, Never-married, Other-service, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +59, Private, 188041, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +32, Private, 97723, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 38, United-States, <=50K +49, State-gov, 354529, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 249727, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 30, United-States, <=50K +26, Private, 189590, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +23, State-gov, 298871, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +55, Self-emp-not-inc, 205296, Some-college, 10, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 50, United-States, <=50K +47, Private, 303637, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 49, United-States, >50K +44, Private, 242861, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 
0, 0, 40, United-States, <=50K +27, Private, 37599, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 24, United-States, <=50K +40, State-gov, 199381, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 15024, 0, 37, United-States, >50K +32, Self-emp-not-inc, 56328, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 8, United-States, >50K +20, Private, 256211, Some-college, 10, Never-married, Machine-op-inspct, Other-relative, Asian-Pac-Islander, Male, 0, 0, 40, Vietnam, <=50K +84, Local-gov, 163685, HS-grad, 9, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 33, United-States, <=50K +40, Private, 266084, Some-college, 10, Divorced, Craft-repair, Other-relative, White, Male, 0, 0, 50, United-States, <=50K +37, Private, 161111, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 199031, Some-college, 10, Divorced, Transport-moving, Own-child, White, Male, 0, 1380, 40, United-States, <=50K +47, Private, 166634, Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 0, 0, 40, Germany, <=50K +62, Self-emp-not-inc, 204085, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 25, United-States, <=50K +19, ?, 369527, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +47, Private, 464945, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +44, Local-gov, 174684, HS-grad, 9, Divorced, Craft-repair, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +26, Local-gov, 166295, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 41, United-States, <=50K +36, Private, 220511, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 246936, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, 
United-States, <=50K +33, Private, 104509, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +48, ?, 266337, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 252168, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, <=50K +25, Private, 92093, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 20, United-States, <=50K +62, Private, 88055, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 129591, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 142719, HS-grad, 9, Married-spouse-absent, Farming-fishing, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +18, ?, 264924, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 40, United-States, <=50K +46, Private, 128796, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 44, United-States, >50K +38, Private, 115336, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 70, United-States, <=50K +52, Private, 190333, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, >50K +63, Self-emp-not-inc, 179444, 7th-8th, 4, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 15, United-States, <=50K +49, Private, 218676, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 43, United-States, <=50K +17, Local-gov, 148194, 11th, 7, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 12, United-States, <=50K +33, Private, 184833, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 35, United-States, <=50K +70, Self-emp-not-inc, 280639, HS-grad, 9, Widowed, Other-service, Other-relative, White, Female, 2329, 0, 20, United-States, 
<=50K +19, Private, 217769, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 15, United-States, <=50K +27, ?, 180553, HS-grad, 9, Married-civ-spouse, ?, Wife, White, Female, 0, 0, 40, United-States, >50K +61, Private, 56009, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 255334, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 25, United-States, >50K +46, Self-emp-inc, 328216, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 1902, 42, ?, >50K +29, Private, 349154, 10th, 6, Separated, Farming-fishing, Unmarried, White, Female, 0, 0, 40, Guatemala, <=50K +40, Local-gov, 24763, Some-college, 10, Divorced, Transport-moving, Unmarried, White, Male, 6849, 0, 40, United-States, <=50K +43, State-gov, 41834, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 38, United-States, >50K +24, Private, 113466, HS-grad, 9, Never-married, Sales, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 130856, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 60, United-States, <=50K +61, Self-emp-not-inc, 268797, HS-grad, 9, Married-civ-spouse, Other-service, Wife, Black, Female, 0, 0, 17, United-States, <=50K +48, Private, 202117, 11th, 7, Divorced, Other-service, Not-in-family, White, Female, 0, 0, 34, United-States, <=50K +19, Private, 280146, Some-college, 10, Never-married, Handlers-cleaners, Own-child, White, Male, 0, 0, 20, United-States, <=50K +30, Private, 70377, Some-college, 10, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +24, Private, 236696, Some-college, 10, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +39, Local-gov, 222572, Masters, 14, Never-married, Prof-specialty, Unmarried, White, Female, 0, 0, 43, United-States, <=50K +46, Self-emp-inc, 110702, 
Some-college, 10, Divorced, Exec-managerial, Unmarried, White, Female, 2036, 0, 60, United-States, <=50K +40, Private, 96129, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 72, United-States, >50K +27, Local-gov, 200492, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 193820, Masters, 14, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 40, United-States, <=50K +31, Private, 454508, 11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 2001, 40, United-States, <=50K +58, Private, 220789, Bachelors, 13, Divorced, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +33, Private, 101345, HS-grad, 9, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 42, Canada, >50K +40, Private, 140559, HS-grad, 9, Widowed, Other-service, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +40, Self-emp-inc, 64885, Masters, 14, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, >50K +31, Private, 402361, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 143582, HS-grad, 9, Separated, Other-service, Unmarried, Asian-Pac-Islander, Female, 0, 0, 48, China, <=50K +49, Private, 185385, Assoc-acdm, 12, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 45, United-States, <=50K +24, Private, 112706, Assoc-voc, 11, Never-married, Tech-support, Own-child, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 130364, HS-grad, 9, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +58, Local-gov, 147428, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +20, Private, 205895, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 20, United-States, <=50K +65, ?, 273569, HS-grad, 9, Widowed, ?, Unmarried, White, Male, 0, 0, 40, United-States, <=50K 
+43, Private, 153160, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 40, United-States, <=50K +48, Self-emp-not-inc, 167918, Masters, 14, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 50, India, <=50K +41, Private, 195661, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 54, United-States, <=50K +27, State-gov, 146243, Some-college, 10, Separated, Other-service, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +52, ?, 105428, Some-college, 10, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 12, United-States, <=50K +26, Private, 149943, HS-grad, 9, Never-married, Other-service, Other-relative, Asian-Pac-Islander, Male, 0, 0, 60, ?, <=50K +52, Local-gov, 246197, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +52, Local-gov, 192563, Some-college, 10, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +19, Private, 244115, HS-grad, 9, Never-married, Other-service, Own-child, Black, Male, 0, 0, 30, United-States, <=50K +39, Local-gov, 98587, Some-college, 10, Divorced, Prof-specialty, Own-child, White, Female, 0, 0, 45, United-States, <=50K +47, Private, 145886, Some-college, 10, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 244315, HS-grad, 9, Divorced, Craft-repair, Other-relative, Other, Male, 0, 0, 40, United-States, <=50K +48, Private, 192779, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +50, Private, 209464, HS-grad, 9, Separated, Other-service, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +60, Private, 25141, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +28, Private, 405793, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 45, United-States, >50K +47, Federal-gov, 
53498, HS-grad, 9, Divorced, Other-service, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +69, ?, 476653, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 20, United-States, <=50K +40, Self-emp-not-inc, 162312, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 66, South, <=50K +37, Private, 277022, HS-grad, 9, Never-married, Handlers-cleaners, Unmarried, White, Female, 3887, 0, 40, Nicaragua, <=50K +41, State-gov, 109762, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 123031, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Asian-Pac-Islander, Male, 0, 0, 48, Trinadad&Tobago, <=50K +46, Federal-gov, 119890, Assoc-voc, 11, Separated, Tech-support, Not-in-family, Other, Female, 0, 0, 30, United-States, <=50K +21, Self-emp-not-inc, 409230, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 30, United-States, <=50K +44, Private, 223308, Masters, 14, Separated, Sales, Unmarried, White, Female, 0, 0, 48, United-States, <=50K +38, ?, 129150, 10th, 6, Separated, ?, Own-child, White, Male, 0, 0, 35, United-States, <=50K +47, Self-emp-not-inc, 119199, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, >50K +42, Private, 46221, Doctorate, 16, Married-spouse-absent, Other-service, Not-in-family, White, Male, 27828, 0, 60, ?, >50K +42, Local-gov, 351161, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, Private, 174533, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Female, 0, 0, 45, United-States, >50K +32, Private, 324386, Bachelors, 13, Never-married, Prof-specialty, Own-child, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 126568, Bachelors, 13, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 33, United-States, <=50K +26, Private, 275703, Some-college, 10, 
Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 219611, Bachelors, 13, Never-married, Sales, Not-in-family, Black, Female, 2174, 0, 50, United-States, <=50K +49, Private, 200471, 11th, 7, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +65, Private, 155261, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +73, State-gov, 74040, 7th-8th, 4, Divorced, Other-service, Not-in-family, Asian-Pac-Islander, Female, 0, 0, 40, United-States, <=50K +34, Private, 226296, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 211968, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +49, Local-gov, 126446, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 30, United-States, <=50K +25, Private, 262885, 11th, 7, Never-married, Other-service, Unmarried, Black, Female, 0, 0, 32, United-States, <=50K +39, Private, 188069, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 25, United-States, <=50K +19, Private, 113546, 11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +24, Private, 227070, 10th, 6, Married-civ-spouse, Machine-op-inspct, Wife, White, Female, 0, 0, 40, United-States, <=50K +34, Private, 136997, HS-grad, 9, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 35, United-States, <=50K +35, ?, 119006, HS-grad, 9, Widowed, ?, Own-child, White, Female, 0, 0, 38, United-States, <=50K +21, Private, 212407, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 197810, Masters, 14, Divorced, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +35, Federal-gov, 35309, Bachelors, 13, Never-married, Tech-support, Not-in-family, 
Asian-Pac-Islander, Male, 0, 0, 28, ?, <=50K +39, Private, 141802, Some-college, 10, Divorced, Adm-clerical, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +48, ?, 184513, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 80, United-States, >50K +33, Self-emp-not-inc, 124187, Assoc-acdm, 12, Never-married, Other-service, Not-in-family, Black, Male, 0, 0, 32, United-States, <=50K +19, Private, 201743, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 26, United-States, <=50K +17, Private, 156736, 10th, 6, Never-married, Sales, Unmarried, White, Female, 0, 0, 12, United-States, <=50K +43, Self-emp-not-inc, 47261, Some-college, 10, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +62, Private, 150693, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 42, United-States, <=50K +53, Local-gov, 233734, Masters, 14, Divorced, Prof-specialty, Not-in-family, Black, Female, 0, 0, 40, United-States, >50K +45, State-gov, 35969, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +47, Private, 159550, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, Black, Female, 0, 0, 40, United-States, <=50K +30, Private, 190823, Some-college, 10, Never-married, Other-service, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +53, Private, 213378, HS-grad, 9, Separated, Sales, Not-in-family, White, Female, 0, 0, 33, United-States, <=50K +24, Private, 257500, HS-grad, 9, Separated, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +41, Local-gov, 488706, Some-college, 10, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, <=50K +58, Local-gov, 239405, 5th-6th, 3, Divorced, Other-service, Other-relative, Black, Female, 0, 0, 40, Haiti, <=50K +27, Federal-gov, 105189, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Male, 4865, 0, 50, United-States, 
<=50K +63, State-gov, 109735, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 38, United-States, <=50K +50, Private, 172942, Some-college, 10, Divorced, Other-service, Own-child, White, Male, 0, 0, 28, United-States, <=50K +43, Local-gov, 209899, Masters, 14, Never-married, Tech-support, Not-in-family, Black, Female, 8614, 0, 47, United-States, >50K +29, Self-emp-inc, 87745, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +41, Private, 187881, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 3942, 0, 40, United-States, <=50K +55, Private, 234125, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 272944, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +23, Local-gov, 129232, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 100345, Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 13550, 0, 55, United-States, >50K +58, Self-emp-not-inc, 195835, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +25, Private, 251854, 11th, 7, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +40, Private, 103474, HS-grad, 9, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 30, United-States, <=50K +38, Private, 22042, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 39, United-States, <=50K +37, Private, 343721, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +19, Private, 232368, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 35, United-States, <=50K +55, Private, 174478, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 29, United-States, <=50K +55, 
Private, 282023, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 50, United-States, >50K +28, Private, 274690, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +53, Private, 251675, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, El-Salvador, <=50K +32, ?, 647882, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, ?, <=50K +60, Private, 128367, Some-college, 10, Divorced, Prof-specialty, Unmarried, White, Male, 3325, 0, 42, United-States, <=50K +32, Private, 37380, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +34, Private, 173730, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +49, Private, 353824, Some-college, 10, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, United-States, >50K +21, Private, 225890, Some-college, 10, Never-married, Other-service, Other-relative, White, Female, 0, 0, 30, United-States, <=50K +24, State-gov, 147147, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, Black, Female, 0, 0, 20, United-States, <=50K +53, Private, 233780, Assoc-voc, 11, Divorced, Adm-clerical, Not-in-family, Black, Female, 2202, 0, 40, United-States, <=50K +29, Private, 394927, Assoc-voc, 11, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, ?, <=50K +34, Local-gov, 188682, Bachelors, 13, Married-spouse-absent, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +52, ?, 115209, Prof-school, 15, Married-spouse-absent, ?, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, <=50K +41, Private, 277192, 5th-6th, 3, Married-civ-spouse, Farming-fishing, Wife, White, Female, 0, 0, 40, Mexico, <=50K +21, Private, 314182, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Private, 220776, Masters, 14, Married-civ-spouse, 
Exec-managerial, Husband, White, Male, 0, 0, 35, United-States, >50K +31, Local-gov, 189269, HS-grad, 9, Never-married, Protective-serv, Own-child, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 35429, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 2042, 40, United-States, <=50K +42, Private, 154374, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 2415, 60, United-States, >50K +62, Private, 161460, Bachelors, 13, Never-married, Farming-fishing, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +51, Private, 251487, 7th-8th, 4, Widowed, Machine-op-inspct, Not-in-family, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +30, Private, 177531, HS-grad, 9, Never-married, Sales, Unmarried, Black, Female, 0, 0, 25, United-States, <=50K +24, Private, 53942, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 113481, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +57, Private, 361324, 10th, 6, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 330087, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, >50K +33, Private, 276221, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +42, Private, 121055, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +62, Private, 118696, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +64, Self-emp-not-inc, 289741, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 20, United-States, <=50K +18, Private, 238401, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 25, United-States, <=50K +43, Private, 262038, 5th-6th, 3, Married-spouse-absent, 
Farming-fishing, Unmarried, White, Male, 0, 0, 35, Mexico, <=50K +62, Self-emp-not-inc, 26911, 7th-8th, 4, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 66, United-States, <=50K +29, Private, 161155, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, United-States, <=50K +43, Private, 252519, Bachelors, 13, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, Haiti, >50K +39, Private, 43712, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 40, United-States, >50K +69, ?, 167826, Bachelors, 13, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 188900, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K +44, Private, 120057, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 4386, 0, 45, United-States, >50K +25, Private, 134113, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K +47, Local-gov, 165822, Some-college, 10, Divorced, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +17, Private, 99161, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 8, United-States, <=50K +41, Local-gov, 74581, Bachelors, 13, Divorced, Prof-specialty, Unmarried, White, Male, 0, 0, 65, United-States, <=50K +19, Private, 304643, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 20, United-States, <=50K +57, Private, 121821, 1st-4th, 2, Married-civ-spouse, Other-service, Husband, Other, Male, 0, 0, 40, Dominican-Republic, <=50K +25, Private, 154863, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Male, 0, 0, 35, United-States, <=50K +37, Local-gov, 365430, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, Canada, >50K +29, Private, 183111, Assoc-voc, 11, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 
40, United-States, <=50K +32, Private, 50178, Some-college, 10, Never-married, Sales, Own-child, White, Male, 0, 0, 35, United-States, <=50K +35, Private, 186845, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +52, Private, 159908, 12th, 8, Never-married, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +41, Private, 162189, HS-grad, 9, Never-married, Machine-op-inspct, Own-child, White, Female, 1831, 0, 40, Peru, <=50K +29, Private, 128509, HS-grad, 9, Married-spouse-absent, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 38, El-Salvador, <=50K +23, Private, 143032, Masters, 14, Never-married, Prof-specialty, Own-child, White, Female, 0, 0, 36, United-States, <=50K +31, Private, 382368, 11th, 7, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +40, Private, 210013, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +19, Private, 293928, HS-grad, 9, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 20, United-States, <=50K +21, Private, 208503, Some-college, 10, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 10, United-States, <=50K +37, State-gov, 191841, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 8614, 0, 40, United-States, >50K +49, Self-emp-not-inc, 355978, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 99999, 0, 35, United-States, >50K +64, Local-gov, 202738, HS-grad, 9, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, <=50K +37, Local-gov, 144322, Assoc-voc, 11, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +70, Self-emp-not-inc, 155141, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 2377, 12, United-States, >50K +22, Private, 160120, 10th, 6, Never-married, Transport-moving, Own-child, 
Asian-Pac-Islander, Male, 0, 0, 30, United-States, <=50K +29, Self-emp-inc, 190450, HS-grad, 9, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 0, 40, Germany, <=50K +37, Private, 212900, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 115677, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +38, Private, 252250, 11th, 7, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +27, Private, 212041, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +58, State-gov, 198145, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 35, United-States, >50K +60, Local-gov, 113658, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 35, United-States, <=50K +20, Private, 32426, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 25, United-States, <=50K +51, Private, 98791, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +37, Private, 203828, 9th, 5, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 65, United-States, <=50K +22, State-gov, 186634, 12th, 8, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +56, Self-emp-not-inc, 125147, 10th, 6, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +26, Private, 247455, Bachelors, 13, Married-civ-spouse, Sales, Wife, White, Female, 5178, 0, 42, United-States, >50K +19, Private, 97215, Some-college, 10, Separated, Sales, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +37, Private, 330826, Assoc-voc, 11, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 30, United-States, <=50K +27, Private, 200802, Some-college, 10, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 
40, United-States, <=50K +27, Private, 156266, HS-grad, 9, Never-married, Sales, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 20, United-States, <=50K +52, Self-emp-not-inc, 72257, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +45, Private, 363087, HS-grad, 9, Separated, Machine-op-inspct, Unmarried, Black, Female, 0, 0, 40, United-States, <=50K +28, Private, 25955, Some-college, 10, Never-married, Craft-repair, Own-child, Amer-Indian-Eskimo, Male, 0, 0, 40, United-States, <=50K +20, Private, 334633, HS-grad, 9, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +28, Private, 109162, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +44, Private, 569761, Assoc-voc, 11, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K +30, Private, 209900, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, State-gov, 272986, Assoc-acdm, 12, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 8, United-States, <=50K +55, ?, 52267, Masters, 14, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 18, United-States, <=50K +46, Private, 82946, HS-grad, 9, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Private, 104651, Bachelors, 13, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, <=50K +25, Local-gov, 58441, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +43, Local-gov, 269733, HS-grad, 9, Separated, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +19, ?, 128453, HS-grad, 9, Never-married, ?, Own-child, White, Female, 0, 0, 28, United-States, <=50K +36, Private, 179468, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +36, Private, 183081, 
Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +48, Private, 102938, Bachelors, 13, Never-married, Other-service, Unmarried, Asian-Pac-Islander, Female, 0, 0, 40, Vietnam, <=50K +30, ?, 157289, 11th, 7, Never-married, ?, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 359828, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 44, United-States, >50K +30, Private, 155659, HS-grad, 9, Divorced, Other-service, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +24, Private, 585203, Bachelors, 13, Married-civ-spouse, Exec-managerial, Wife, White, Female, 7688, 0, 45, United-States, >50K +62, Private, 173601, Bachelors, 13, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +41, Self-emp-not-inc, 214541, HS-grad, 9, Divorced, Transport-moving, Not-in-family, White, Male, 0, 1590, 40, United-States, <=50K +49, Self-emp-not-inc, 163352, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 85, United-States, >50K +36, Self-emp-not-inc, 153976, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, <=50K +47, Local-gov, 247676, Masters, 14, Divorced, Prof-specialty, Unmarried, White, Female, 5455, 0, 45, United-States, <=50K +49, State-gov, 155372, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +52, Private, 329733, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +52, Private, 162576, Assoc-acdm, 12, Married-civ-spouse, Tech-support, Husband, White, Male, 0, 0, 40, United-States, >50K +26, Private, 176520, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Male, 0, 0, 53, United-States, <=50K +51, State-gov, 226885, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +18, 
Private, 120781, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +30, Private, 375827, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +46, Private, 205504, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 20, United-States, <=50K +28, Private, 198813, HS-grad, 9, Never-married, Handlers-cleaners, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +48, Self-emp-inc, 254291, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7298, 0, 50, United-States, >50K +62, Private, 159908, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, White, Female, 0, 0, 38, United-States, >50K +49, Private, 40000, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 4064, 0, 44, United-States, <=50K +69, Private, 102874, HS-grad, 9, Widowed, Adm-clerical, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +35, Private, 117381, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 8614, 0, 45, United-States, >50K +78, Private, 180239, Masters, 14, Widowed, Craft-repair, Unmarried, Asian-Pac-Islander, Male, 0, 0, 40, South, <=50K +61, Private, 539563, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +24, Private, 261561, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 81057, Some-college, 10, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 40, United-States, <=50K +36, Self-emp-not-inc, 160120, Bachelors, 13, Married-civ-spouse, Sales, Husband, Other, Male, 0, 0, 45, ?, <=50K +17, Private, 41979, 10th, 6, Never-married, Farming-fishing, Own-child, White, Male, 0, 0, 40, United-States, <=50K +27, Private, 275110, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 80, United-States, >50K +64, Private, 265661, HS-grad, 
9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 40, United-States, <=50K +33, Self-emp-not-inc, 193246, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 55, France, <=50K +32, Private, 236543, 12th, 8, Married-civ-spouse, Craft-repair, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +19, Private, 29510, Some-college, 10, Never-married, Other-service, Own-child, White, Female, 0, 0, 20, United-States, <=50K +42, State-gov, 105804, 11th, 7, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 194604, Bachelors, 13, Never-married, Sales, Own-child, White, Female, 0, 0, 15, United-States, <=50K +23, Private, 1038553, HS-grad, 9, Never-married, Exec-managerial, Not-in-family, Black, Male, 0, 0, 45, United-States, <=50K +51, Local-gov, 209320, Prof-school, 15, Never-married, Prof-specialty, Not-in-family, White, Male, 3325, 0, 40, United-States, <=50K +31, Private, 193231, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Male, 3325, 0, 60, United-States, <=50K +44, Private, 307468, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 29, United-States, >50K +38, Private, 255941, Masters, 14, Never-married, Exec-managerial, Not-in-family, White, Female, 10520, 0, 50, United-States, >50K +44, Local-gov, 107845, Assoc-acdm, 12, Divorced, Protective-serv, Not-in-family, White, Female, 0, 0, 56, United-States, >50K +44, Self-emp-not-inc, 567788, 5th-6th, 3, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 40, Mexico, <=50K +38, Private, 91857, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 44, United-States, <=50K +36, Private, 732569, 9th, 5, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +29, Private, 86613, 1st-4th, 2, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 20, El-Salvador, <=50K +46, Private, 35961, Assoc-acdm, 12, 
Divorced, Sales, Not-in-family, White, Female, 0, 0, 25, Germany, <=50K +47, Private, 114754, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +30, Private, 235124, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 99999, 0, 40, United-States, >50K +37, Local-gov, 218490, HS-grad, 9, Married-civ-spouse, Protective-serv, Husband, White, Male, 7688, 0, 35, United-States, >50K +27, Private, 329426, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +43, Private, 181015, HS-grad, 9, Divorced, Machine-op-inspct, Unmarried, White, Female, 0, 0, 50, United-States, <=50K +44, Self-emp-not-inc, 264740, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, >50K +31, Private, 381153, Some-college, 10, Never-married, Exec-managerial, Unmarried, White, Male, 0, 0, 60, United-States, <=50K +34, Private, 189759, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 45, United-States, >50K +39, Private, 230467, Bachelors, 13, Never-married, Sales, Own-child, White, Male, 0, 1092, 40, Germany, <=50K +36, Private, 218542, Some-college, 10, Divorced, Craft-repair, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +57, Private, 298507, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 40, United-States, >50K +78, Private, 111189, 7th-8th, 4, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 35, Dominican-Republic, <=50K +24, Private, 168997, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 168894, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +21, Private, 149809, Assoc-acdm, 12, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 344073, HS-grad, 9, Separated, 
Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +22, Private, 416165, Some-college, 10, Never-married, Sales, Unmarried, White, Female, 0, 0, 32, United-States, <=50K +36, Private, 41490, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +61, Private, 40269, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 40, United-States, >50K +67, ?, 243256, 9th, 5, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 15, United-States, <=50K +42, Private, 250536, Some-college, 10, Separated, Other-service, Unmarried, Black, Female, 0, 0, 21, Haiti, <=50K +49, Federal-gov, 105586, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, United-States, >50K +58, Private, 51499, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +37, Local-gov, 189878, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 60, United-States, <=50K +39, Private, 179481, HS-grad, 9, Never-married, Tech-support, Not-in-family, White, Male, 4650, 0, 44, United-States, <=50K +25, Private, 299765, Some-college, 10, Separated, Adm-clerical, Other-relative, Black, Female, 0, 0, 40, Jamaica, <=50K +45, Self-emp-inc, 155664, 5th-6th, 3, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, ?, >50K +30, Private, 54608, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, ?, 174702, Some-college, 10, Never-married, ?, Not-in-family, White, Male, 0, 0, 35, United-States, <=50K +36, Self-emp-not-inc, 285020, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 2885, 0, 40, United-States, <=50K +23, Private, 201145, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 65, United-States, <=50K +51, Private, 125796, Some-college, 10, Married-civ-spouse, Adm-clerical, Wife, Black, Female, 0, 0, 35, 
Jamaica, <=50K +55, Private, 249072, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K +35, Private, 99156, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +45, State-gov, 94754, Masters, 14, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, India, <=50K +36, Private, 111128, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 48, United-States, >50K +32, Local-gov, 157887, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +39, Private, 74194, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 55, United-States, <=50K +47, Self-emp-inc, 168191, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +49, Private, 28334, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +52, Private, 84278, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 55, ?, >50K +44, Private, 721161, Some-college, 10, Separated, Machine-op-inspct, Not-in-family, Black, Male, 0, 0, 40, United-States, <=50K +36, Private, 188069, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +40, Private, 145178, Some-college, 10, Divorced, Craft-repair, Unmarried, Black, Female, 0, 0, 30, United-States, <=50K +17, Private, 52967, 10th, 6, Never-married, Other-service, Own-child, White, Female, 0, 0, 6, United-States, <=50K +18, Private, 177578, HS-grad, 9, Never-married, Sales, Own-child, White, Female, 0, 0, 38, United-States, <=50K +30, Self-emp-inc, 185384, Some-college, 10, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 25, United-States, <=50K +66, Private, 66008, HS-grad, 9, Widowed, Priv-house-serv, Not-in-family, White, Female, 0, 0, 50, England, <=50K +59, Private, 329059, 
Bachelors, 13, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 50, United-States, <=50K +30, Local-gov, 348802, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 40, United-States, <=50K +50, Private, 34233, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +24, Private, 509629, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +28, Private, 27956, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 99, Philippines, <=50K +44, Local-gov, 83286, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +25, Private, 309098, Assoc-voc, 11, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +45, Private, 188950, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +20, Private, 224217, HS-grad, 9, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +67, Private, 222899, Assoc-voc, 11, Divorced, Prof-specialty, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +40, Self-emp-not-inc, 123306, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K +52, Federal-gov, 279337, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 347166, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 50, United-States, >50K +37, Local-gov, 251396, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, Canada, >50K +17, Self-emp-inc, 143034, 10th, 6, Never-married, Other-service, Own-child, White, Male, 0, 0, 4, United-States, <=50K +25, Private, 57635, Assoc-voc, 11, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 42, United-States, >50K +35, Local-gov, 162651, 
Bachelors, 13, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, Puerto-Rico, <=50K +63, Private, 28334, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, >50K +38, Local-gov, 84570, Some-college, 10, Never-married, Adm-clerical, Own-child, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +33, Private, 181091, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 60, Iran, >50K +51, Local-gov, 117496, Masters, 14, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K +64, State-gov, 216160, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, Columbia, >50K +50, Self-emp-inc, 204447, Some-college, 10, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +18, Private, 374969, 10th, 6, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 56, United-States, <=50K +67, Private, 35015, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 99, United-States, <=50K +46, Private, 179869, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +60, Self-emp-not-inc, 137733, Bachelors, 13, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K +29, Private, 193125, HS-grad, 9, Never-married, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +30, Private, 103649, Some-college, 10, Never-married, Other-service, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +56, State-gov, 54260, Doctorate, 16, Married-civ-spouse, Prof-specialty, Not-in-family, Asian-Pac-Islander, Male, 2885, 0, 40, China, <=50K +29, Private, 197932, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Wife, White, Female, 0, 0, 40, Mexico, >50K +37, Private, 249720, Bachelors, 13, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 27, United-States, <=50K +55, 
Private, 223613, 1st-4th, 2, Divorced, Priv-house-serv, Unmarried, White, Female, 0, 0, 30, Cuba, <=50K +24, Private, 259865, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 30, United-States, <=50K +21, Private, 301694, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 35, Mexico, <=50K +46, Self-emp-inc, 276934, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 70, United-States, >50K +25, Private, 395512, 12th, 8, Married-civ-spouse, Machine-op-inspct, Other-relative, Other, Male, 0, 0, 40, Mexico, <=50K +40, Private, 168071, HS-grad, 9, Never-married, Craft-repair, Own-child, White, Male, 0, 0, 28, United-States, <=50K +23, Private, 45317, Some-college, 10, Separated, Sales, Own-child, White, Female, 0, 0, 40, United-States, <=50K +43, Self-emp-not-inc, 311177, Some-college, 10, Never-married, Transport-moving, Not-in-family, Black, Male, 0, 0, 30, United-States, <=50K +29, Self-emp-not-inc, 190636, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, Amer-Indian-Eskimo, Male, 0, 1485, 60, United-States, >50K +59, Private, 221336, 10th, 6, Widowed, Other-service, Other-relative, Asian-Pac-Islander, Female, 0, 0, 40, Philippines, <=50K +18, Private, 120691, Some-college, 10, Never-married, Other-service, Own-child, Black, Male, 0, 0, 35, ?, <=50K +28, Private, 107389, HS-grad, 9, Never-married, Exec-managerial, Own-child, White, Male, 0, 0, 32, United-States, <=50K +17, Private, 293440, 11th, 7, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 15, United-States, <=50K +53, Private, 145409, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K +22, Private, 213902, 5th-6th, 3, Never-married, Priv-house-serv, Other-relative, White, Female, 0, 0, 40, El-Salvador, <=50K +63, Private, 100099, HS-grad, 9, Married-spouse-absent, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +32, Private, 191856, 
Masters, 14, Married-civ-spouse, Sales, Wife, White, Female, 0, 0, 45, United-States, >50K +40, Local-gov, 233891, HS-grad, 9, Never-married, Adm-clerical, Unmarried, Black, Female, 0, 0, 35, United-States, <=50K +61, Self-emp-not-inc, 96073, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 50, England, >50K +35, Private, 474136, HS-grad, 9, Never-married, Craft-repair, Not-in-family, White, Male, 0, 1408, 40, United-States, <=50K +43, Self-emp-not-inc, 355856, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, Asian-Pac-Islander, Male, 0, 0, 50, Philippines, <=50K +20, ?, 144685, Some-college, 10, Never-married, ?, Own-child, Asian-Pac-Islander, Female, 0, 1602, 40, Taiwan, <=50K +48, Self-emp-not-inc, 139212, Some-college, 10, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +56, State-gov, 143931, Bachelors, 13, Widowed, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +51, Federal-gov, 160703, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +34, Private, 191291, Some-college, 10, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 68729, Some-college, 10, Married-civ-spouse, Sales, Husband, Asian-Pac-Islander, Male, 0, 1902, 40, United-States, >50K +61, Private, 119986, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, ?, >50K +37, Private, 227545, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 44, United-States, >50K +36, Private, 32776, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 40, United-States, >50K +34, Private, 228881, Some-college, 10, Separated, Machine-op-inspct, Not-in-family, Other, Male, 0, 0, 40, United-States, <=50K +23, Private, 84648, Some-college, 10, Never-married, Sales, Own-child, White, Female, 0, 0, 25, United-States, <=50K +63, Federal-gov, 
101996, Masters, 14, Divorced, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +63, ?, 68954, HS-grad, 9, Widowed, ?, Not-in-family, Black, Female, 0, 0, 11, United-States, <=50K +47, Local-gov, 285060, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 1977, 41, United-States, >50K +55, Self-emp-inc, 209569, HS-grad, 9, Divorced, Sales, Unmarried, White, Female, 0, 0, 50, United-States, >50K +31, Local-gov, 331126, Bachelors, 13, Never-married, Protective-serv, Own-child, Black, Male, 0, 0, 48, United-States, <=50K +27, Private, 279872, Some-college, 10, Divorced, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +58, Private, 150560, Bachelors, 13, Divorced, Adm-clerical, Not-in-family, White, Female, 14084, 0, 40, United-States, >50K +28, Local-gov, 185647, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 48, United-States, <=50K +52, Private, 128871, 7th-8th, 4, Divorced, Machine-op-inspct, Not-in-family, White, Female, 0, 0, 64, United-States, <=50K +31, Federal-gov, 386331, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, Black, Male, 0, 0, 50, United-States, <=50K +53, Private, 117814, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 38, United-States, <=50K +43, Private, 220609, Some-college, 10, Divorced, Tech-support, Not-in-family, White, Female, 0, 0, 50, United-States, <=50K +43, Local-gov, 117022, HS-grad, 9, Married-spouse-absent, Farming-fishing, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K +50, Self-emp-inc, 176751, Masters, 14, Divorced, Exec-managerial, Not-in-family, White, Male, 0, 0, 80, United-States, >50K +68, ?, 76371, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 35, United-States, <=50K +37, Private, 80410, 11th, 7, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 127202, HS-grad, 9, Divorced, Machine-op-inspct, 
Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 121471, 11th, 7, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +22, Private, 219086, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, <=50K +59, Private, 271571, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 15024, 0, 50, United-States, >50K +30, Private, 241583, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +24, Private, 374253, HS-grad, 9, Separated, Other-service, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +30, Private, 214993, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +50, Local-gov, 199995, Bachelors, 13, Divorced, Protective-serv, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +38, Private, 450924, 12th, 8, Married-civ-spouse, Other-service, Husband, White, Male, 3942, 0, 40, United-States, <=50K +29, Private, 120359, HS-grad, 9, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +76, Private, 93125, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 1424, 0, 24, United-States, <=50K +21, Private, 187513, Assoc-voc, 11, Never-married, Craft-repair, Unmarried, White, Male, 0, 0, 40, United-States, <=50K +65, Private, 243569, Some-college, 10, Widowed, Other-service, Unmarried, White, Female, 0, 0, 24, United-States, <=50K +43, Private, 295510, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K +29, Private, 29732, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 24, United-States, <=50K +32, Private, 211743, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +37, Private, 251396, Assoc-acdm, 12, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 65, 
United-States, >50K +64, Private, 477697, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 16, United-States, <=50K +49, Private, 151584, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, >50K +44, Private, 193882, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, >50K +68, ?, 117542, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 1409, 0, 15, United-States, <=50K +34, Private, 242460, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 7688, 0, 40, United-States, >50K +35, Private, 411395, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 36, United-States, <=50K +53, Private, 191025, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Female, 0, 0, 43, United-States, <=50K +24, Private, 154571, Assoc-voc, 11, Never-married, Sales, Unmarried, Asian-Pac-Islander, Male, 0, 0, 50, South, <=50K +31, Private, 208657, Some-college, 10, Divorced, Adm-clerical, Unmarried, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 29599, Some-college, 10, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 38, United-States, <=50K +36, Private, 423711, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K +29, Private, 122000, Some-college, 10, Never-married, Sales, Not-in-family, White, Male, 0, 0, 60, United-States, <=50K +37, Private, 148581, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 50, United-States, >50K +42, Self-emp-not-inc, 222978, Doctorate, 16, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +30, Private, 149118, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +32, Self-emp-inc, 218407, Some-college, 10, Never-married, Exec-managerial, Not-in-family, White, Male, 0, 0, 70, 
Cuba, <=50K +47, Self-emp-not-inc, 112200, Bachelors, 13, Never-married, Exec-managerial, Not-in-family, Black, Male, 10520, 0, 45, United-States, >50K +44, Private, 85604, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 40, United-States, >50K +19, Private, 111232, HS-grad, 9, Never-married, Sales, Own-child, White, Male, 0, 0, 40, United-States, <=50K +22, Private, 99199, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 48, United-States, <=50K +51, Private, 199995, 9th, 5, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 50, United-States, >50K +69, Private, 122850, HS-grad, 9, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 16, United-States, <=50K +73, ?, 90557, 11th, 7, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 8, United-States, <=50K +18, ?, 271935, 11th, 7, Never-married, ?, Other-relative, White, Female, 0, 0, 20, United-States, <=50K +33, Self-emp-not-inc, 361497, HS-grad, 9, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +22, Local-gov, 399020, Some-college, 10, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 55, United-States, <=50K +33, Private, 345277, Bachelors, 13, Never-married, Tech-support, Not-in-family, White, Female, 0, 0, 45, United-States, >50K +20, Federal-gov, 55233, Some-college, 10, Never-married, Adm-clerical, Other-relative, White, Female, 0, 0, 40, United-States, <=50K +28, Self-emp-not-inc, 200515, Some-college, 10, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 40, United-States, <=50K +25, Private, 188119, HS-grad, 9, Never-married, Handlers-cleaners, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +27, Private, 176683, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 48, United-States, <=50K +22, Private, 309178, HS-grad, 9, Never-married, Transport-moving, Own-child, White, Male, 0, 0, 40, United-States, <=50K +67, Self-emp-not-inc, 
40021, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 35, United-States, <=50K +31, Self-emp-inc, 49923, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 40, United-States, <=50K +36, ?, 36635, Some-college, 10, Never-married, ?, Unmarried, White, Female, 0, 0, 25, United-States, <=50K +43, Federal-gov, 325706, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 50, India, >50K +33, Private, 124407, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +37, Self-emp-not-inc, 301568, HS-grad, 9, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 35, United-States, >50K +27, Private, 339956, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 60, United-States, <=50K +36, Private, 176335, Some-college, 10, Never-married, Transport-moving, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +33, Private, 198452, Assoc-acdm, 12, Divorced, Sales, Not-in-family, White, Female, 0, 0, 45, United-States, <=50K +63, Private, 213945, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 15024, 0, 40, Iran, >50K +48, Private, 171807, Bachelors, 13, Divorced, Other-service, Unmarried, White, Female, 0, 0, 56, United-States, >50K +25, Private, 362826, HS-grad, 9, Never-married, Handlers-cleaners, Other-relative, White, Male, 0, 0, 45, United-States, <=50K +41, Self-emp-not-inc, 344329, HS-grad, 9, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 10, United-States, <=50K +26, Private, 137678, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +23, Private, 175424, Some-college, 10, Never-married, Adm-clerical, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +33, State-gov, 73296, HS-grad, 9, Never-married, Other-service, Unmarried, Black, Female, 1831, 0, 40, United-States, <=50K +30, State-gov, 137613, 
Masters, 14, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 17, Taiwan, <=50K +67, Self-emp-not-inc, 354405, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +32, Private, 130057, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 55, United-States, >50K +48, Self-emp-not-inc, 362883, HS-grad, 9, Divorced, Craft-repair, Not-in-family, White, Male, 0, 0, 60, United-States, >50K +51, Private, 49017, HS-grad, 9, Widowed, Other-service, Not-in-family, White, Female, 0, 0, 24, United-States, <=50K +39, Private, 149943, Masters, 14, Never-married, Sales, Not-in-family, Asian-Pac-Islander, Male, 0, 0, 40, China, <=50K +40, Self-emp-inc, 99185, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, <=50K +40, Private, 294708, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, Black, Male, 0, 0, 40, United-States, >50K +19, Private, 228238, HS-grad, 9, Never-married, Machine-op-inspct, Other-relative, White, Male, 0, 0, 40, Mexico, <=50K +28, Private, 156819, HS-grad, 9, Divorced, Handlers-cleaners, Unmarried, White, Female, 0, 0, 36, United-States, <=50K +47, Private, 332727, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K +20, Private, 289944, Some-college, 10, Never-married, Prof-specialty, Not-in-family, White, Female, 0, 0, 20, United-States, <=50K +41, Private, 116103, HS-grad, 9, Widowed, Exec-managerial, Other-relative, White, Male, 914, 0, 40, United-States, <=50K +29, Private, 24153, Some-college, 10, Married-civ-spouse, Other-service, Wife, Amer-Indian-Eskimo, Female, 0, 0, 40, United-States, <=50K +40, Private, 273425, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 50, United-States, >50K +61, Private, 231183, HS-grad, 9, Widowed, Exec-managerial, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +25, Private, 313930, 
11th, 7, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 40, Mexico, <=50K +26, Private, 114483, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 162108, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +17, Private, 168807, 7th-8th, 4, Never-married, Craft-repair, Not-in-family, White, Male, 0, 0, 45, United-States, <=50K +43, Local-gov, 143828, Masters, 14, Divorced, Prof-specialty, Unmarried, Black, Female, 9562, 0, 40, United-States, >50K +73, Private, 242769, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 3471, 0, 40, England, <=50K +46, Local-gov, 111558, Some-college, 10, Divorced, Machine-op-inspct, Own-child, White, Female, 0, 0, 40, United-States, <=50K +19, Private, 69770, Some-college, 10, Never-married, Other-service, Own-child, White, Male, 0, 0, 20, United-States, <=50K +37, Private, 291981, HS-grad, 9, Divorced, Machine-op-inspct, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +26, Private, 102460, Some-college, 10, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K +47, Private, 151584, HS-grad, 9, Divorced, Sales, Own-child, White, Male, 0, 1876, 40, United-States, <=50K +47, Local-gov, 287320, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 115677, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +29, Private, 239632, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K +33, Private, 409172, Bachelors, 13, Married-civ-spouse, Exec-managerial, Own-child, White, Male, 0, 0, 55, United-States, <=50K +20, Private, 186849, HS-grad, 9, Never-married, Transport-moving, Other-relative, White, Male, 0, 0, 40, United-States, <=50K +28, Private, 118861, 10th, 6, 
Married-civ-spouse, Craft-repair, Wife, Other, Female, 0, 0, 48, Guatemala, <=50K +26, Private, 142689, Bachelors, 13, Never-married, Prof-specialty, Not-in-family, White, Male, 0, 0, 45, ?, <=50K +41, State-gov, 170924, Masters, 14, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 40, United-States, <=50K +67, ?, 274451, 7th-8th, 4, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K +43, Private, 153489, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 50, United-States, <=50K +35, Private, 186489, 9th, 5, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 46, United-States, <=50K +18, Private, 192409, 12th, 8, Never-married, Other-service, Own-child, White, Female, 0, 0, 25, United-States, <=50K +55, State-gov, 337599, Some-college, 10, Divorced, Adm-clerical, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K +44, Private, 195545, HS-grad, 9, Divorced, Machine-op-inspct, Own-child, Black, Female, 0, 0, 40, United-States, <=50K +64, Private, 61892, HS-grad, 9, Widowed, Priv-house-serv, Not-in-family, White, Female, 0, 0, 15, United-States, <=50K +34, Self-emp-not-inc, 175697, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 75, United-States, <=50K +38, Private, 80303, Some-college, 10, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, >50K +25, Private, 419658, HS-grad, 9, Never-married, Other-service, Own-child, White, Female, 0, 0, 8, United-States, <=50K +21, Private, 319163, Some-college, 10, Never-married, Transport-moving, Own-child, Black, Male, 0, 0, 40, United-States, <=50K +37, Private, 126743, 1st-4th, 2, Married-civ-spouse, Other-service, Husband, White, Male, 0, 0, 53, Mexico, <=50K + diff --git a/src/test/resources/docker/delta/Dockerfile b/src/test/resources/docker/delta/Dockerfile new file mode 100644 index 00000000..e69de29b diff --git a/src/test/resources/docker/docker-compose.yml 
b/src/test/resources/docker/docker-compose.yml new file mode 100644 index 00000000..e69de29b diff --git a/src/test/resources/docker/mlflow/Dockerfile b/src/test/resources/docker/mlflow/Dockerfile new file mode 100644 index 00000000..e69de29b diff --git a/src/test/resources/docker/spark/Dockerfile b/src/test/resources/docker/spark/Dockerfile new file mode 100644 index 00000000..88e2e1a5 --- /dev/null +++ b/src/test/resources/docker/spark/Dockerfile @@ -0,0 +1,2 @@ +FROM gettyimages/spark +MAINTAINER Jas Bali "jas.bali@databricks.com" diff --git a/src/test/resources/fire_data.csv b/src/test/resources/fire_data.csv new file mode 100644 index 00000000..f58d0ac2 --- /dev/null +++ b/src/test/resources/fire_data.csv @@ -0,0 +1,518 @@ +xGrid,YGrid,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,burnArea +7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0,0 +7,4,oct,tue,90.6,35.4,669.1,6.7,18,33,0.9,0,0 +7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0,0 +8,6,mar,fri,91.7,33.3,77.5,9,8.3,97,4,0.2,0 +8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0,0 +8,6,aug,sun,92.3,85.3,488,14.7,22.2,29,5.4,0,0 +8,6,aug,mon,92.3,88.9,495.6,8.5,24.1,27,3.1,0,0 +8,6,aug,mon,91.5,145.4,608.2,10.7,8,86,2.2,0,0 +8,6,sep,tue,91,129.5,692.6,7,13.1,63,5.4,0,0 +7,5,sep,sat,92.5,88,698.6,7.1,22.8,40,4,0,0 +7,5,sep,sat,92.5,88,698.6,7.1,17.8,51,7.2,0,0 +7,5,sep,sat,92.8,73.2,713,22.6,19.3,38,4,0,0 +6,5,aug,fri,63.5,70.8,665.3,0.8,17,72,6.7,0,0 +6,5,sep,mon,90.9,126.5,686.5,7,21.3,42,2.2,0,0 +6,5,sep,wed,92.9,133.3,699.6,9.2,26.4,21,4.5,0,0 +6,5,sep,fri,93.3,141.2,713.9,13.9,22.9,44,5.4,0,0 +5,5,mar,sat,91.7,35.8,80.8,7.8,15.1,27,5.4,0,0 +8,5,oct,mon,84.9,32.8,664.2,3,16.7,47,4.9,0,0 +6,4,mar,wed,89.2,27.9,70.8,6.3,15.9,35,4,0,0 +6,4,apr,sat,86.3,27.4,97.1,5.1,9.3,44,4.5,0,0 +6,4,sep,tue,91,129.5,692.6,7,18.3,40,2.7,0,0 +5,4,sep,mon,91.8,78.5,724.3,9.2,19.1,38,2.7,0,0 +7,4,jun,sun,94.3,96.3,200,56.1,21,44,4.5,0,0 +7,4,aug,sat,90.2,110.9,537.4,6.2,19.5,43,5.8,0,0 +7,4,aug,sat,93.5,139.4,594.2,20.3,23.7,32,5.8,0,0 
+7,4,aug,sun,91.4,142.4,601.4,10.6,16.3,60,5.4,0,0 +7,4,sep,fri,92.4,117.9,668,12.2,19,34,5.8,0,0 +7,4,sep,mon,90.9,126.5,686.5,7,19.4,48,1.3,0,0 +6,3,sep,sat,93.4,145.4,721.4,8.1,30.2,24,2.7,0,0 +6,3,sep,sun,93.5,149.3,728.6,8.1,22.8,39,3.6,0,0 +6,3,sep,fri,94.3,85.1,692.3,15.9,25.4,24,3.6,0,0 +6,3,sep,mon,88.6,91.8,709.9,7.1,11.2,78,7.6,0,0 +6,3,sep,fri,88.6,69.7,706.8,5.8,20.6,37,1.8,0,0 +6,3,sep,sun,91.7,75.6,718.3,7.8,17.7,39,3.6,0,0 +6,3,sep,mon,91.8,78.5,724.3,9.2,21.2,32,2.7,0,0 +6,3,sep,tue,90.3,80.7,730.2,6.3,18.2,62,4.5,0,0 +6,3,oct,tue,90.6,35.4,669.1,6.7,21.7,24,4.5,0,0 +7,4,oct,fri,90,41.5,682.6,8.7,11.3,60,5.4,0,0 +7,3,oct,sat,90.6,43.7,686.9,6.7,17.8,27,4,0,0 +4,4,mar,tue,88.1,25.7,67.6,3.8,14.1,43,2.7,0,0 +4,4,jul,tue,79.5,60.6,366.7,1.5,23.3,37,3.1,0,0 +4,4,aug,sat,90.2,96.9,624.2,8.9,18.4,42,6.7,0,0 +4,4,aug,tue,94.8,108.3,647.1,17,16.6,54,5.4,0,0 +4,4,sep,sat,92.5,88,698.6,7.1,19.6,48,2.7,0,0 +4,4,sep,wed,90.1,82.9,735.7,6.2,12.9,74,4.9,0,0 +5,6,sep,wed,94.3,85.1,692.3,15.9,25.9,24,4,0,0 +5,6,sep,mon,90.9,126.5,686.5,7,14.7,70,3.6,0,0 +6,6,jul,mon,94.2,62.3,442.9,11,23,36,3.1,0,0 +4,4,mar,mon,87.2,23.9,64.7,4.1,11.8,35,1.8,0,0 +4,4,mar,mon,87.6,52.2,103.8,5,11,46,5.8,0,0 +4,4,sep,thu,92.9,137,706.4,9.2,20.8,17,1.3,0,0 +4,3,aug,sun,90.2,99.6,631.2,6.3,21.5,34,2.2,0,0 +4,3,aug,wed,92.1,111.2,654.1,9.6,20.4,42,4.9,0,0 +4,3,aug,wed,92.1,111.2,654.1,9.6,20.4,42,4.9,0,0 +4,3,aug,thu,91.7,114.3,661.3,6.3,17.6,45,3.6,0,0 +4,3,sep,thu,92.9,137,706.4,9.2,27.7,24,2.2,0,0 +4,3,sep,tue,90.3,80.7,730.2,6.3,17.8,63,4.9,0,0 +4,3,oct,sun,92.6,46.5,691.8,8.8,13.8,50,2.7,0,0 +2,2,feb,mon,84,9.3,34,2.1,13.9,40,5.4,0,0 +2,2,feb,fri,86.6,13.2,43,5.3,12.3,51,0.9,0,0 +2,2,mar,sun,89.3,51.3,102.2,9.6,11.5,39,5.8,0,0 +2,2,mar,sun,89.3,51.3,102.2,9.6,5.5,59,6.3,0,0 +2,2,aug,thu,93,75.3,466.6,7.7,18.8,35,4.9,0,0 +2,2,aug,sun,90.2,99.6,631.2,6.3,20.8,33,2.7,0,0 +2,2,aug,mon,91.1,103.2,638.8,5.8,23.1,31,3.1,0,0 +2,2,aug,thu,91.7,114.3,661.3,6.3,18.6,44,4.5,0,0 
+2,2,sep,fri,92.4,117.9,668,12.2,23,37,4.5,0,0 +2,2,sep,fri,92.4,117.9,668,12.2,19.6,33,5.4,0,0 +2,2,sep,fri,92.4,117.9,668,12.2,19.6,33,6.3,0,0 +4,5,mar,fri,91.7,33.3,77.5,9,17.2,26,4.5,0,0 +4,5,mar,fri,91.2,48.3,97.8,12.5,15.8,27,7.6,0,0 +4,5,sep,fri,94.3,85.1,692.3,15.9,17.7,37,3.6,0,0 +5,4,mar,fri,91.7,33.3,77.5,9,15.6,25,6.3,0,0 +5,4,aug,tue,88.8,147.3,614.5,9,17.3,43,4.5,0,0 +5,4,sep,fri,93.3,141.2,713.9,13.9,27.6,30,1.3,0,0 +9,9,feb,thu,84.2,6.8,26.6,7.7,6.7,79,3.1,0,0 +9,9,feb,fri,86.6,13.2,43,5.3,15.7,43,3.1,0,0 +1,3,mar,mon,87.6,52.2,103.8,5,8.3,72,3.1,0,0 +1,2,aug,fri,90.1,108,529.8,12.5,14.7,66,2.7,0,0 +1,2,aug,tue,91,121.2,561.6,7,21.6,19,6.7,0,0 +1,2,aug,sun,91.4,142.4,601.4,10.6,19.5,39,6.3,0,0 +1,2,aug,sun,90.2,99.6,631.2,6.3,17.9,44,2.2,0,0 +1,2,aug,tue,94.8,108.3,647.1,17,18.6,51,4.5,0,0 +1,2,aug,wed,92.1,111.2,654.1,9.6,16.6,47,0.9,0,0 +1,2,aug,thu,91.7,114.3,661.3,6.3,20.2,45,3.6,0,0 +1,2,sep,thu,92.9,137,706.4,9.2,21.5,15,0.9,0,0 +1,2,sep,thu,92.9,137,706.4,9.2,25.4,27,2.2,0,0 +1,2,sep,thu,92.9,137,706.4,9.2,22.4,34,2.2,0,0 +1,2,sep,sun,93.5,149.3,728.6,8.1,25.3,36,3.6,0,0 +6,5,mar,sat,91.7,35.8,80.8,7.8,17.4,25,4.9,0,0 +6,5,aug,sat,90.2,96.9,624.2,8.9,14.7,59,5.8,0,0 +8,6,mar,fri,91.7,35.8,80.8,7.8,17.4,24,5.4,0,0 +8,6,aug,sun,92.3,85.3,488,14.7,20.8,32,6.3,0,0 +8,6,aug,sun,91.4,142.4,601.4,10.6,18.2,43,4.9,0,0 +8,6,aug,mon,91.1,103.2,638.8,5.8,23.4,22,2.7,0,0 +4,4,sep,sun,89.7,90,704.4,4.8,17.8,64,1.3,0,0 +3,4,feb,sat,83.9,8,30.2,2.6,12.7,48,1.8,0,0 +3,4,mar,sat,69,2.4,15.5,0.7,17.4,24,5.4,0,0 +3,4,aug,sun,91.4,142.4,601.4,10.6,11.6,87,4.5,0,0 +3,4,aug,sun,91.4,142.4,601.4,10.6,19.8,39,5.4,0,0 +3,4,aug,sun,91.4,142.4,601.4,10.6,19.8,39,5.4,0,0 +3,4,aug,tue,88.8,147.3,614.5,9,14.4,66,5.4,0,0 +2,4,aug,tue,94.8,108.3,647.1,17,20.1,40,4,0,0 +2,4,sep,sat,92.5,121.1,674.4,8.6,24.1,29,4.5,0,0 +2,4,jan,sat,82.1,3.7,9.3,2.9,5.3,78,3.1,0,0 +4,5,mar,fri,85.9,19.5,57.3,2.8,12.7,52,6.3,0,0 +4,5,mar,thu,91.4,30.7,74.3,7.5,18.2,29,3.1,0,0 
+4,5,aug,sun,90.2,99.6,631.2,6.3,21.4,33,3.1,0,0 +4,5,sep,sat,92.5,88,698.6,7.1,20.3,45,3.1,0,0 +4,5,sep,mon,88.6,91.8,709.9,7.1,17.4,56,5.4,0,0 +4,4,mar,fri,85.9,19.5,57.3,2.8,13.7,43,5.8,0,0 +3,4,mar,fri,91.7,33.3,77.5,9,18.8,18,4.5,0,0 +3,4,sep,sun,89.7,90,704.4,4.8,22.8,39,3.6,0,0 +3,4,sep,mon,91.8,78.5,724.3,9.2,18.9,35,2.7,0,0 +3,4,mar,tue,88.1,25.7,67.6,3.8,15.8,27,7.6,0,0 +3,5,mar,tue,88.1,25.7,67.6,3.8,15.5,27,6.3,0,0 +3,4,mar,sat,91.7,35.8,80.8,7.8,11.6,30,6.3,0,0 +3,4,mar,sat,91.7,35.8,80.8,7.8,15.2,27,4.9,0,0 +3,4,mar,mon,90.1,39.7,86.6,6.2,10.6,30,4,0,0 +3,4,aug,thu,93,75.3,466.6,7.7,19.6,36,3.1,0,0 +3,4,aug,mon,91.5,145.4,608.2,10.7,10.3,74,2.2,0,0 +3,4,aug,mon,91.5,145.4,608.2,10.7,17.1,43,5.4,0,0 +3,4,sep,sun,92.4,124.1,680.7,8.5,22.5,42,5.4,0,0 +3,4,sep,tue,84.4,73.4,671.9,3.2,17.9,45,3.1,0,0 +3,4,sep,fri,94.3,85.1,692.3,15.9,19.8,50,5.4,0,0 +3,4,oct,sun,92.6,46.5,691.8,8.8,20.6,24,5.4,0,0 +3,5,mar,mon,87.6,52.2,103.8,5,9,49,2.2,0,0 +3,5,sep,fri,93.5,149.3,728.6,8.1,17.2,43,3.1,0,0 +3,5,oct,wed,91.4,37.9,673.8,5.2,15.9,46,3.6,0,0 +2,5,oct,sun,92.6,46.5,691.8,8.8,15.4,35,0.9,0,0 +4,6,feb,sat,68.2,21.5,87.2,0.8,15.4,40,2.7,0,0 +4,6,mar,mon,87.2,23.9,64.7,4.1,14,39,3.1,0,0 +4,6,mar,sun,89.3,51.3,102.2,9.6,10.6,46,4.9,0,0 +4,6,sep,thu,93.7,80.9,685.2,17.9,17.6,42,3.1,0,0 +3,5,mar,tue,88.1,25.7,67.6,3.8,14.9,38,2.7,0,0 +3,5,aug,sat,93.5,139.4,594.2,20.3,17.6,52,5.8,0,0 +3,6,sep,sun,92.4,124.1,680.7,8.5,17.2,58,1.3,0,0 +3,6,sep,mon,90.9,126.5,686.5,7,15.6,66,3.1,0,0 +9,9,jul,tue,85.8,48.3,313.4,3.9,18,42,2.7,0,0.36 +1,4,sep,tue,91,129.5,692.6,7,21.7,38,2.2,0,0.43 +2,5,sep,mon,90.9,126.5,686.5,7,21.9,39,1.8,0,0.47 +1,2,aug,wed,95.5,99.9,513.3,13.2,23.3,31,4.5,0,0.55 +8,6,aug,fri,90.1,108,529.8,12.5,21.2,51,8.9,0,0.61 +1,2,jul,sat,90,51.3,296.3,8.7,16.6,53,5.4,0,0.71 +2,5,aug,wed,95.5,99.9,513.3,13.2,23.8,32,5.4,0,0.77 +6,5,aug,thu,95.2,131.7,578.8,10.4,27.4,22,4,0,0.9 +5,4,mar,mon,90.1,39.7,86.6,6.2,13.2,40,5.4,0,0.95 
+8,3,sep,tue,84.4,73.4,671.9,3.2,24.2,28,3.6,0,0.96 +2,2,aug,tue,94.8,108.3,647.1,17,17.4,43,6.7,0,1.07 +8,6,sep,thu,93.7,80.9,685.2,17.9,23.7,25,4.5,0,1.12 +6,5,jun,fri,92.5,56.4,433.3,7.1,23.2,39,5.4,0,1.19 +9,9,jul,sun,90.1,68.6,355.2,7.2,24.8,29,2.2,0,1.36 +3,4,jul,sat,90.1,51.2,424.1,6.2,24.6,43,1.8,0,1.43 +5,4,sep,fri,94.3,85.1,692.3,15.9,20.1,47,4.9,0,1.46 +1,5,sep,sat,93.4,145.4,721.4,8.1,29.6,27,2.7,0,1.46 +7,4,aug,sun,94.8,108.3,647.1,17,16.4,47,1.3,0,1.56 +2,4,sep,sat,93.4,145.4,721.4,8.1,28.6,27,2.2,0,1.61 +2,2,aug,wed,92.1,111.2,654.1,9.6,18.4,45,3.6,0,1.63 +2,4,aug,wed,92.1,111.2,654.1,9.6,20.5,35,4,0,1.64 +7,4,sep,fri,92.4,117.9,668,12.2,19,34,5.8,0,1.69 +7,4,mar,mon,90.1,39.7,86.6,6.2,16.1,29,3.1,0,1.75 +6,4,aug,thu,95.2,131.7,578.8,10.4,20.3,41,4,0,1.9 +6,3,mar,sat,90.6,50.1,100.4,7.8,15.2,31,8.5,0,1.94 +8,6,sep,sat,92.5,121.1,674.4,8.6,17.8,56,1.8,0,1.95 +8,5,sep,sun,89.7,90,704.4,4.8,17.8,67,2.2,0,2.01 +6,5,mar,thu,84.9,18.2,55,3,5.3,70,4.5,0,2.14 +6,5,aug,wed,92.1,111.2,654.1,9.6,16.6,47,0.9,0,2.29 +6,5,aug,wed,96,127.1,570.5,16.5,23.4,33,4.5,0,2.51 +6,5,mar,fri,91.2,48.3,97.8,12.5,14.6,26,9.4,0,2.53 +8,6,aug,thu,95.2,131.7,578.8,10.4,20.7,45,2.2,0,2.55 +5,4,sep,wed,92.9,133.3,699.6,9.2,21.9,35,1.8,0,2.57 +8,6,aug,wed,85.6,90.4,609.6,6.6,17.4,50,4,0,2.69 +7,4,aug,sun,91.4,142.4,601.4,10.6,20.1,39,5.4,0,2.74 +4,4,sep,mon,90.9,126.5,686.5,7,17.7,39,2.2,0,3.07 +1,4,aug,sat,90.2,96.9,624.2,8.9,14.2,53,1.8,0,3.5 +1,4,aug,sat,90.2,96.9,624.2,8.9,20.3,39,4.9,0,4.53 +6,5,apr,thu,81.5,9.1,55.2,2.7,5.8,54,5.8,0,4.61 +2,5,aug,sun,90.2,99.6,631.2,6.3,19.2,44,2.7,0,4.69 +2,5,sep,wed,90.1,82.9,735.7,6.2,18.3,45,2.2,0,4.88 +8,6,aug,tue,88.8,147.3,614.5,9,14.4,66,5.4,0,5.23 +1,3,sep,sun,92.4,124.1,680.7,8.5,23.9,32,6.7,0,5.33 +8,6,oct,mon,84.9,32.8,664.2,3,19.1,32,4,0,5.44 +5,4,feb,sun,86.8,15.6,48.3,3.9,12.4,53,2.2,0,6.38 +7,4,oct,mon,91.7,48.5,696.1,11.1,16.8,45,4.5,0,6.83 +8,6,aug,fri,93.9,135.7,586.7,15.1,20.8,34,4.9,0,6.96 
+2,5,sep,tue,91,129.5,692.6,7,17.6,46,3.1,0,7.04 +8,6,mar,sun,89.3,51.3,102.2,9.6,11.5,39,5.8,0,7.19 +1,5,sep,mon,90.9,126.5,686.5,7,21,42,2.2,0,7.3 +6,4,mar,sat,90.8,41.9,89.4,7.9,13.3,42,0.9,0,7.4 +7,4,mar,sun,90.7,44,92.4,5.5,11.5,60,4,0,8.24 +6,5,mar,fri,91.2,48.3,97.8,12.5,11.7,33,4,0,8.31 +2,5,aug,thu,95.2,131.7,578.8,10.4,24.2,28,2.7,0,8.68 +2,2,aug,tue,94.8,108.3,647.1,17,24.6,22,4.5,0,8.71 +4,5,sep,wed,92.9,133.3,699.6,9.2,24.3,25,4,0,9.41 +2,2,aug,tue,94.8,108.3,647.1,17,24.6,22,4.5,0,10.01 +2,5,aug,fri,93.9,135.7,586.7,15.1,23.5,36,5.4,0,10.02 +6,5,apr,thu,81.5,9.1,55.2,2.7,5.8,54,5.8,0,10.93 +4,5,sep,thu,92.9,137,706.4,9.2,21.5,15,0.9,0,11.06 +3,4,sep,tue,91,129.5,692.6,7,13.9,59,6.3,0,11.24 +2,4,sep,mon,63.5,70.8,665.3,0.8,22.6,38,3.6,0,11.32 +1,5,sep,tue,91,129.5,692.6,7,21.6,33,2.2,0,11.53 +6,5,mar,sun,90.1,37.6,83.7,7.2,12.4,54,3.6,0,12.1 +7,4,feb,sun,83.9,8.7,32.1,2.1,8.8,68,2.2,0,13.05 +8,6,oct,wed,91.4,37.9,673.8,5.2,20.2,37,2.7,0,13.7 +5,6,mar,sat,90.6,50.1,100.4,7.8,15.1,64,4,0,13.99 +4,5,sep,thu,92.9,137,706.4,9.2,22.1,34,1.8,0,14.57 +2,2,aug,sat,93.5,139.4,594.2,20.3,22.9,31,7.2,0,15.45 +7,5,sep,tue,91,129.5,692.6,7,20.7,37,2.2,0,17.2 +6,5,sep,fri,92.4,117.9,668,12.2,19.6,33,6.3,0,19.23 +8,3,sep,thu,93.7,80.9,685.2,17.9,23.2,26,4.9,0,23.41 +4,4,oct,sat,90.6,43.7,686.9,6.7,18.4,25,3.1,0,24.23 +7,4,aug,sat,93.5,139.4,594.2,20.3,5.1,96,5.8,0,26 +7,4,sep,fri,94.3,85.1,692.3,15.9,20.1,47,4.9,0,26.13 +7,3,mar,mon,87.6,52.2,103.8,5,11,46,5.8,0,27.35 +4,4,mar,sat,91.7,35.8,80.8,7.8,17,27,4.9,0,28.66 +4,4,mar,sat,91.7,35.8,80.8,7.8,17,27,4.9,0,28.66 +4,4,sep,sun,92.4,124.1,680.7,8.5,16.9,60,1.3,0,29.48 +1,3,sep,mon,88.6,91.8,709.9,7.1,12.4,73,6.3,0,30.32 +4,5,sep,wed,92.9,133.3,699.6,9.2,19.4,19,1.3,0,31.72 +6,5,mar,mon,90.1,39.7,86.6,6.2,15.2,27,3.1,0,31.86 +8,6,aug,sun,90.2,99.6,631.2,6.3,16.2,59,3.1,0,32.07 +3,4,sep,fri,93.3,141.2,713.9,13.9,18.6,49,3.6,0,35.88 +4,3,mar,mon,87.6,52.2,103.8,5,11,46,5.8,0,36.85 
+2,2,jul,fri,88.3,150.3,309.9,6.8,13.4,79,3.6,0,37.02 +7,4,sep,wed,90.1,82.9,735.7,6.2,15.4,57,4.5,0,37.71 +4,4,sep,sun,93.5,149.3,728.6,8.1,22.9,39,4.9,0,48.55 +7,5,oct,mon,91.7,48.5,696.1,11.1,16.1,44,4,0,49.37 +8,6,aug,sat,92.2,81.8,480.8,11.9,20.1,34,4.5,0,58.3 +4,6,sep,sun,93.5,149.3,728.6,8.1,28.3,26,3.1,0,64.1 +8,6,aug,sat,92.2,81.8,480.8,11.9,16.4,43,4,0,71.3 +4,4,sep,wed,92.9,133.3,699.6,9.2,26.4,21,4.5,0,88.49 +1,5,sep,sun,93.5,149.3,728.6,8.1,27.8,27,3.1,0,95.18 +6,4,sep,tue,91,129.5,692.6,7,18.7,43,2.7,0,103.39 +9,4,sep,tue,84.4,73.4,671.9,3.2,24.3,36,3.1,0,105.66 +4,5,sep,sat,92.5,121.1,674.4,8.6,17.7,25,3.1,0,154.88 +8,6,aug,sun,91.4,142.4,601.4,10.6,19.6,41,5.8,0,196.48 +2,2,sep,sat,92.5,121.1,674.4,8.6,18.2,46,1.8,0,200.94 +1,2,sep,tue,91,129.5,692.6,7,18.8,40,2.2,0,212.88 +6,5,sep,sat,92.5,121.1,674.4,8.6,25.1,27,4,0,1090.84 +7,5,apr,sun,81.9,3,7.9,3.5,13.4,75,1.8,0,0 +6,3,apr,wed,88,17.2,43.5,3.8,15.2,51,2.7,0,0 +4,4,apr,fri,83,23.3,85.3,2.3,16.7,20,3.1,0,0 +2,4,aug,sun,94.2,122.3,589.9,12.9,15.4,66,4,0,10.13 +7,4,aug,sun,91.8,175.1,700.7,13.8,21.9,73,7.6,1,0 +2,4,aug,sun,91.8,175.1,700.7,13.8,22.4,54,7.6,0,2.87 +3,4,aug,sun,91.8,175.1,700.7,13.8,26.8,38,6.3,0,0.76 +5,4,aug,sun,91.8,175.1,700.7,13.8,25.7,39,5.4,0,0.09 +2,4,aug,wed,92.2,91.6,503.6,9.6,20.7,70,2.2,0,0.75 +8,6,aug,wed,93.1,157.3,666.7,13.5,28.7,28,2.7,0,0 +3,4,aug,wed,93.1,157.3,666.7,13.5,21.7,40,0.4,0,2.47 +8,5,aug,wed,93.1,157.3,666.7,13.5,26.8,25,3.1,0,0.68 +8,5,aug,wed,93.1,157.3,666.7,13.5,24,36,3.1,0,0.24 +6,5,aug,wed,93.1,157.3,666.7,13.5,22.1,37,3.6,0,0.21 +7,4,aug,thu,91.9,109.2,565.5,8,21.4,38,2.7,0,1.52 +6,3,aug,thu,91.6,138.1,621.7,6.3,18.9,41,3.1,0,10.34 +2,5,aug,thu,87.5,77,694.8,5,22.3,46,4,0,0 +8,6,aug,sat,94.2,117.2,581.1,11,23.9,41,2.2,0,8.02 +4,3,aug,sat,94.2,117.2,581.1,11,21.4,44,2.7,0,0.68 +3,4,aug,sat,91.8,170.9,692.3,13.7,20.6,59,0.9,0,0 +7,4,aug,sat,91.8,170.9,692.3,13.7,23.7,40,1.8,0,1.38 +2,4,aug,mon,93.6,97.9,542,14.4,28.3,32,4,0,8.85 
+3,4,aug,fri,91.6,112.4,573,8.9,11.2,84,7.6,0,3.3 +2,4,aug,fri,91.6,112.4,573,8.9,21.4,42,3.1,0,4.25 +6,3,aug,fri,91.1,141.1,629.1,7.1,19.3,39,3.6,0,1.56 +4,4,aug,fri,94.3,167.6,684.4,13,21.8,53,3.1,0,6.54 +4,4,aug,tue,93.7,102.2,550.3,14.6,22.1,54,7.6,0,0.79 +6,5,aug,tue,94.3,131.7,607.1,22.7,19.4,55,4,0,0.17 +2,2,aug,tue,92.1,152.6,658.2,14.3,23.7,24,3.1,0,0 +3,4,aug,tue,92.1,152.6,658.2,14.3,21,32,3.1,0,0 +4,4,aug,tue,92.1,152.6,658.2,14.3,19.1,53,2.7,0,4.4 +2,2,aug,tue,92.1,152.6,658.2,14.3,21.8,56,3.1,0,0.52 +8,6,aug,tue,92.1,152.6,658.2,14.3,20.1,58,4.5,0,9.27 +2,5,aug,tue,92.1,152.6,658.2,14.3,20.2,47,4,0,3.09 +4,6,dec,sun,84.4,27.2,353.5,6.8,4.8,57,8.5,0,8.98 +8,6,dec,wed,84,27.8,354.6,5.3,5.1,61,8,0,11.19 +4,6,dec,thu,84.6,26.4,352,2,5.1,61,4.9,0,5.38 +4,4,dec,mon,85.4,25.4,349.7,2.6,4.6,21,8.5,0,17.85 +3,4,dec,mon,85.4,25.4,349.7,2.6,4.6,21,8.5,0,10.73 +4,4,dec,mon,85.4,25.4,349.7,2.6,4.6,21,8.5,0,22.03 +4,4,dec,mon,85.4,25.4,349.7,2.6,4.6,21,8.5,0,9.77 +4,6,dec,fri,84.7,26.7,352.6,4.1,2.2,59,4.9,0,9.27 +6,5,dec,tue,85.4,25.4,349.7,2.6,5.1,24,8.5,0,24.77 +6,3,feb,sun,84.9,27.5,353.5,3.4,4.2,51,4,0,0 +3,4,feb,wed,86.9,6.6,18.7,3.2,8.8,35,3.1,0,1.1 +5,4,feb,fri,85.2,4.9,15.8,6.3,7.5,46,8,0,24.24 +2,5,jul,sun,93.9,169.7,411.8,12.3,23.4,40,6.3,0,0 +7,6,jul,wed,91.2,183.1,437.7,12.5,12.6,90,7.6,0.2,0 +7,4,jul,sat,91.6,104.2,474.9,9,22.1,49,2.7,0,0 +7,4,jul,sat,91.6,104.2,474.9,9,24.2,32,1.8,0,0 +7,4,jul,sat,91.6,104.2,474.9,9,24.3,30,1.8,0,0 +2,5,jul,sat,91.6,104.2,474.9,9,18.7,53,1.8,0,0 +9,4,jul,sat,91.6,104.2,474.9,9,25.3,39,0.9,0,8 +4,5,jul,fri,91.6,100.2,466.3,6.3,22.9,40,1.3,0,2.64 +7,6,jul,tue,93.1,180.4,430.8,11,26.9,28,5.4,0,86.45 +8,6,jul,tue,92.3,88.8,440.9,8.5,17.1,67,3.6,0,6.57 +7,5,jun,sun,93.1,180.4,430.8,11,22.2,48,1.3,0,0 +6,4,jun,sun,90.4,89.5,290.8,6.4,14.3,46,1.8,0,0.9 +8,6,jun,sun,90.4,89.5,290.8,6.4,15.4,45,2.2,0,0 +8,6,jun,wed,91.2,147.8,377.2,12.7,19.6,43,4.9,0,0 +6,5,jun,sat,53.4,71,233.8,0.4,10.6,90,2.7,0,0 
+6,5,jun,mon,90.4,93.3,298.1,7.5,20.7,25,4.9,0,0 +6,5,jun,mon,90.4,93.3,298.1,7.5,19.1,39,5.4,0,3.52 +3,6,jun,fri,91.1,94.1,232.1,7.1,19.2,38,4.5,0,0 +3,6,jun,fri,91.1,94.1,232.1,7.1,19.2,38,4.5,0,0 +6,5,may,sat,85.1,28,113.8,3.5,11.3,94,4.9,0,0 +1,4,sep,sun,89.6,84.1,714.3,5.7,19,52,2.2,0,0 +7,4,sep,sun,89.6,84.1,714.3,5.7,17.1,53,5.4,0,0.41 +3,4,sep,sun,89.6,84.1,714.3,5.7,23.8,35,3.6,0,5.18 +2,4,sep,sun,92.4,105.8,758.1,9.9,16,45,1.8,0,0 +2,4,sep,sun,92.4,105.8,758.1,9.9,24.9,27,2.2,0,0 +7,4,sep,sun,92.4,105.8,758.1,9.9,25.3,27,2.7,0,0 +6,3,sep,sun,92.4,105.8,758.1,9.9,24.8,28,1.8,0,14.29 +2,4,sep,sun,50.4,46.2,706.6,0.4,12.2,78,6.3,0,0 +6,5,sep,wed,92.6,115.4,777.1,8.8,24.3,27,4.9,0,0 +4,4,sep,wed,92.6,115.4,777.1,8.8,19.7,41,1.8,0,1.58 +3,4,sep,wed,91.2,134.7,817.5,7.2,18.5,30,2.7,0,0 +4,5,sep,thu,92.4,96.2,739.4,8.6,18.6,24,5.8,0,0 +4,4,sep,thu,92.4,96.2,739.4,8.6,19.2,24,4.9,0,3.78 +6,5,sep,thu,92.8,119,783.5,7.5,21.6,27,2.2,0,0 +5,4,sep,thu,92.8,119,783.5,7.5,21.6,28,6.3,0,4.41 +6,3,sep,thu,92.8,119,783.5,7.5,18.9,34,7.2,0,34.36 +1,4,sep,thu,92.8,119,783.5,7.5,16.8,28,4,0,7.21 +6,5,sep,thu,92.8,119,783.5,7.5,16.8,28,4,0,1.01 +3,5,sep,thu,90.7,136.9,822.8,6.8,12.9,39,2.7,0,2.18 +6,5,sep,thu,88.1,53.3,726.9,5.4,13.7,56,1.8,0,4.42 +1,4,sep,sat,92.2,102.3,751.5,8.4,24.2,27,3.1,0,0 +5,4,sep,sat,92.2,102.3,751.5,8.4,24.1,27,3.1,0,0 +6,5,sep,sat,92.2,102.3,751.5,8.4,21.2,32,2.2,0,0 +6,5,sep,sat,92.2,102.3,751.5,8.4,19.7,35,1.8,0,0 +4,3,sep,sat,92.2,102.3,751.5,8.4,23.5,27,4,0,3.33 +3,3,sep,sat,92.2,102.3,751.5,8.4,24.2,27,3.1,0,6.58 +7,4,sep,sat,91.2,124.4,795.3,8.5,21.5,28,4.5,0,15.64 +4,4,sep,sat,91.2,124.4,795.3,8.5,17.1,41,2.2,0,11.22 +1,4,sep,mon,92.1,87.7,721.1,9.5,18.1,54,3.1,0,2.13 +2,3,sep,mon,91.6,108.4,764,6.2,18,51,5.4,0,0 +4,3,sep,mon,91.6,108.4,764,6.2,9.8,86,1.8,0,0 +7,4,sep,mon,91.6,108.4,764,6.2,19.3,44,2.2,0,0 +6,3,sep,mon,91.6,108.4,764,6.2,23,34,2.2,0,56.04 +8,6,sep,mon,91.6,108.4,764,6.2,22.7,35,2.2,0,7.48 
+2,4,sep,mon,91.6,108.4,764,6.2,20.4,41,1.8,0,1.47 +2,5,sep,mon,91.6,108.4,764,6.2,19.3,44,2.2,0,3.93 +8,6,sep,mon,91.9,111.7,770.3,6.5,15.7,51,2.2,0,0 +6,3,sep,mon,91.5,130.1,807.1,7.5,20.6,37,1.8,0,0 +8,6,sep,mon,91.5,130.1,807.1,7.5,15.9,51,4.5,0,2.18 +6,3,sep,mon,91.5,130.1,807.1,7.5,12.2,66,4.9,0,6.1 +2,2,sep,mon,91.5,130.1,807.1,7.5,16.8,43,3.1,0,5.83 +1,4,sep,mon,91.5,130.1,807.1,7.5,21.3,35,2.2,0,28.19 +5,4,sep,fri,92.1,99,745.3,9.6,10.1,75,3.6,0,0 +3,4,sep,fri,92.1,99,745.3,9.6,17.4,57,4.5,0,0 +5,4,sep,fri,92.1,99,745.3,9.6,12.8,64,3.6,0,1.64 +5,4,sep,fri,92.1,99,745.3,9.6,10.1,75,3.6,0,3.71 +4,4,sep,fri,92.1,99,745.3,9.6,15.4,53,6.3,0,7.31 +7,4,sep,fri,92.1,99,745.3,9.6,20.6,43,3.6,0,2.03 +7,4,sep,fri,92.1,99,745.3,9.6,19.8,47,2.7,0,1.72 +7,4,sep,fri,92.1,99,745.3,9.6,18.7,50,2.2,0,5.97 +4,4,sep,fri,92.1,99,745.3,9.6,20.8,35,4.9,0,13.06 +4,4,sep,fri,92.1,99,745.3,9.6,20.8,35,4.9,0,1.26 +6,3,sep,fri,92.5,122,789.7,10.2,15.9,55,3.6,0,0 +6,3,sep,fri,92.5,122,789.7,10.2,19.7,39,2.7,0,0 +1,4,sep,fri,92.5,122,789.7,10.2,21.1,39,2.2,0,8.12 +6,5,sep,fri,92.5,122,789.7,10.2,18.4,42,2.2,0,1.09 +4,3,sep,fri,92.5,122,789.7,10.2,17.3,45,4,0,3.94 +7,4,sep,fri,88.2,55.2,732.3,11.6,15.2,64,3.1,0,0.52 +4,3,sep,tue,91.9,111.7,770.3,6.5,15.9,53,2.2,0,2.93 +6,5,sep,tue,91.9,111.7,770.3,6.5,21.1,35,2.7,0,5.65 +6,5,sep,tue,91.9,111.7,770.3,6.5,19.6,45,3.1,0,20.03 +4,5,sep,tue,91.1,132.3,812.1,12.5,15.9,38,5.4,0,1.75 +4,5,sep,tue,91.1,132.3,812.1,12.5,16.4,27,3.6,0,0 +6,5,sep,sat,91.2,94.3,744.4,8.4,16.8,47,4.9,0,12.64 +4,5,sep,sun,91,276.3,825.1,7.1,13.8,77,7.6,0,0 +7,4,sep,sun,91,276.3,825.1,7.1,13.8,77,7.6,0,11.06 +3,4,jul,wed,91.9,133.6,520.5,8,14.2,58,4,0,0 +4,5,aug,sun,92,203.2,664.5,8.1,10.4,75,0.9,0,0 +5,4,aug,thu,94.8,222.4,698.6,13.9,20.3,42,2.7,0,0 +6,5,sep,fri,90.3,290,855.3,7.4,10.3,78,4,0,18.3 +6,5,sep,sat,91.2,94.3,744.4,8.4,15.4,57,4.9,0,39.35 +8,6,aug,mon,92.1,207,672.6,8.2,21.1,54,2.2,0,0 +2,2,aug,sat,93.7,231.1,715.1,8.4,21.9,42,2.2,0,174.63 
+6,5,mar,thu,90.9,18.9,30.6,8,8.7,51,5.8,0,0 +4,5,jan,sun,18.7,1.1,171.4,0,5.2,100,0.9,0,0 +5,4,jul,wed,93.7,101.3,458.8,11.9,19.3,39,7.2,0,7.73 +8,6,aug,thu,90.7,194.1,643,6.8,16.2,63,2.7,0,16.33 +8,6,aug,wed,95.2,217.7,690,18,28.2,29,1.8,0,5.86 +9,6,aug,thu,91.6,248.4,753.8,6.3,20.5,58,2.7,0,42.87 +8,4,aug,sat,91.6,273.8,819.1,7.7,21.3,44,4.5,0,12.18 +2,4,aug,sun,91.6,181.3,613,7.6,20.9,50,2.2,0,16 +3,4,sep,sun,90.5,96.7,750.5,11.4,20.6,55,5.4,0,24.59 +5,5,mar,thu,90.9,18.9,30.6,8,11.6,48,5.4,0,0 +6,4,aug,fri,94.8,227,706.7,12,23.3,34,3.1,0,28.74 +7,4,aug,fri,94.8,227,706.7,12,23.3,34,3.1,0,0 +7,4,feb,mon,84.7,9.5,58.3,4.1,7.5,71,6.3,0,9.96 +8,6,sep,fri,91.1,91.3,738.1,7.2,20.7,46,2.7,0,30.18 +1,3,sep,sun,91,276.3,825.1,7.1,21.9,43,4,0,70.76 +2,4,mar,tue,93.4,15,25.6,11.4,15.2,19,7.6,0,0 +6,5,feb,mon,84.1,4.6,46.7,2.2,5.3,68,1.8,0,0 +4,5,feb,sun,85,9,56.9,3.5,10.1,62,1.8,0,51.78 +4,3,sep,sun,90.5,96.7,750.5,11.4,20.4,55,4.9,0,3.64 +5,6,aug,sun,91.6,181.3,613,7.6,24.3,33,3.6,0,3.63 +1,2,aug,sat,93.7,231.1,715.1,8.4,25.9,32,3.1,0,0 +9,5,jun,wed,93.3,49.5,297.7,14,28,34,4.5,0,0 +9,5,jun,wed,93.3,49.5,297.7,14,28,34,4.5,0,8.16 +3,4,sep,thu,91.1,88.2,731.7,8.3,22.8,46,4,0,4.95 +9,9,aug,fri,94.8,227,706.7,12,25,36,4,0,0 +8,6,aug,thu,90.7,194.1,643,6.8,21.3,41,3.6,0,0 +2,4,sep,wed,87.9,84.8,725.1,3.7,21.8,34,2.2,0,6.04 +2,2,aug,tue,94.6,212.1,680.9,9.5,27.9,27,2.2,0,0 +6,5,sep,sat,87.1,291.3,860.6,4,17,67,4.9,0,3.95 +4,5,feb,sat,84.7,8.2,55,2.9,14.2,46,4,0,0 +4,3,sep,fri,90.3,290,855.3,7.4,19.9,44,3.1,0,7.8 +1,4,jul,tue,92.3,96.2,450.2,12.1,23.4,31,5.4,0,0 +6,3,feb,fri,84.1,7.3,52.8,2.7,14.7,42,2.7,0,0 +7,4,feb,fri,84.6,3.2,43.6,3.3,8.2,53,9.4,0,4.62 +9,4,jul,mon,92.3,92.1,442.1,9.8,22.8,27,4.5,0,1.63 +7,5,aug,sat,93.7,231.1,715.1,8.4,26.4,33,3.6,0,0 +5,4,aug,sun,93.6,235.1,723.1,10.1,24.1,50,4,0,0 +8,6,aug,thu,94.8,222.4,698.6,13.9,27.5,27,4.9,0,746.28 +6,3,jul,tue,92.7,164.1,575.8,8.9,26.3,39,3.1,0,7.02 +6,5,mar,wed,93.4,17.3,28.3,9.9,13.8,24,5.8,0,0 
+2,4,aug,sun,92,203.2,664.5,8.1,24.9,42,5.4,0,2.44 +2,5,aug,sun,91.6,181.3,613,7.6,24.8,36,4,0,3.05 +8,8,aug,wed,91.7,191.4,635.9,7.8,26.2,36,4.5,0,185.76 +2,4,aug,wed,95.2,217.7,690,18,30.8,19,4.5,0,0 +8,6,jul,sun,88.9,263.1,795.9,5.2,29.3,27,3.6,0,6.3 +1,3,sep,sat,91.2,94.3,744.4,8.4,22.3,48,4,0,0.72 +8,6,aug,sat,93.7,231.1,715.1,8.4,26.9,31,3.6,0,4.96 +2,2,aug,thu,91.6,248.4,753.8,6.3,20.4,56,2.2,0,0 +8,6,aug,thu,91.6,248.4,753.8,6.3,20.4,56,2.2,0,0 +2,4,aug,mon,92.1,207,672.6,8.2,27.9,33,2.2,0,2.35 +1,3,aug,thu,94.8,222.4,698.6,13.9,26.2,34,5.8,0,0 +3,4,aug,sun,91.6,181.3,613,7.6,24.6,44,4,0,3.2 +7,4,sep,thu,89.7,287.2,849.3,6.8,19.4,45,3.6,0,0 +1,3,aug,sat,92.1,178,605.3,9.6,23.3,40,4,0,6.36 +8,6,aug,thu,94.8,222.4,698.6,13.9,23.9,38,6.7,0,0 +2,4,aug,sun,93.6,235.1,723.1,10.1,20.9,66,4.9,0,15.34 +1,4,aug,fri,90.6,269.8,811.2,5.5,22.2,45,3.6,0,0 +2,5,jul,sat,90.8,84.7,376.6,5.6,23.8,51,1.8,0,0 +8,6,aug,mon,92.1,207,672.6,8.2,26.8,35,1.3,0,0.54 +8,6,aug,sat,89.4,253.6,768.4,9.7,14.2,73,2.7,0,0 +2,5,aug,sat,93.7,231.1,715.1,8.4,23.6,53,4,0,6.43 +1,3,sep,fri,91.1,91.3,738.1,7.2,19.1,46,2.2,0,0.33 +5,4,sep,fri,90.3,290,855.3,7.4,16.2,58,3.6,0,0 +8,6,aug,mon,92.1,207,672.6,8.2,25.5,29,1.8,0,1.23 +6,5,apr,mon,87.9,24.9,41.6,3.7,10.9,64,3.1,0,3.35 +1,2,jul,fri,90.7,80.9,368.3,16.8,14.8,78,8,0,0 +2,5,sep,fri,90.3,290,855.3,7.4,16.2,58,3.6,0,9.96 +5,5,aug,sun,94,47.9,100.7,10.7,17.3,80,4.5,0,0 +6,5,aug,sun,92,203.2,664.5,8.1,19.1,70,2.2,0,0 +3,4,mar,wed,93.4,17.3,28.3,9.9,8.9,35,8,0,0 +7,4,sep,wed,89.7,284.9,844,10.1,10.5,77,4,0,0 +7,4,aug,sun,91.6,181.3,613,7.6,19.3,61,4.9,0,0 +4,5,aug,wed,95.2,217.7,690,18,23.4,49,5.4,0,6.43 +1,4,aug,fri,90.5,196.8,649.9,16.3,11.8,88,4.9,0,9.71 +7,4,aug,mon,91.5,238.2,730.6,7.5,17.7,65,4,0,0 +4,5,aug,thu,89.4,266.2,803.3,5.6,17.4,54,3.1,0,0 +3,4,aug,thu,91.6,248.4,753.8,6.3,16.8,56,3.1,0,0 +3,4,jul,mon,94.6,160,567.2,16.7,17.9,48,2.7,0,0 +2,4,aug,thu,91.6,248.4,753.8,6.3,16.6,59,2.7,0,0 
+1,4,aug,wed,91.7,191.4,635.9,7.8,19.9,50,4,0,82.75 +8,6,aug,sat,93.7,231.1,715.1,8.4,18.9,64,4.9,0,3.32 +7,4,aug,sat,91.6,273.8,819.1,7.7,15.5,72,8,0,1.94 +2,5,aug,sat,93.7,231.1,715.1,8.4,18.9,64,4.9,0,0 +8,6,aug,sat,93.7,231.1,715.1,8.4,18.9,64,4.9,0,0 +1,4,sep,sun,91,276.3,825.1,7.1,14.5,76,7.6,0,3.71 +6,5,feb,tue,75.1,4.4,16.2,1.9,4.6,82,6.3,0,5.39 +6,4,feb,tue,75.1,4.4,16.2,1.9,5.1,77,5.4,0,2.14 +2,2,feb,sat,79.5,3.6,15.3,1.8,4.6,59,0.9,0,6.84 +6,5,mar,mon,87.2,15.1,36.9,7.1,10.2,45,5.8,0,3.18 +3,4,mar,wed,90.2,18.5,41.1,7.3,11.2,41,5.4,0,5.55 +6,5,mar,thu,91.3,20.6,43.5,8.5,13.3,27,3.6,0,6.61 +6,3,apr,sun,91,14.6,25.6,12.3,13.7,33,9.4,0,61.13 +5,4,apr,sun,91,14.6,25.6,12.3,17.6,27,5.8,0,0 +4,3,may,fri,89.6,25.4,73.7,5.7,18,40,4,0,38.48 +8,3,jun,mon,88.2,96.2,229,4.7,14.3,79,4,0,1.94 +9,4,jun,sat,90.5,61.1,252.6,9.4,24.5,50,3.1,0,70.32 +4,3,jun,thu,93,103.8,316.7,10.8,26.4,35,2.7,0,10.08 +2,5,jun,thu,93.7,121.7,350.2,18,22.7,40,9.4,0,3.19 +4,3,jul,thu,93.5,85.3,395,9.9,27.2,28,1.3,0,1.76 +4,3,jul,sun,93.7,101.3,423.4,14.7,26.1,45,4,0,7.36 +7,4,jul,sun,93.7,101.3,423.4,14.7,18.2,82,4.5,0,2.21 +7,4,jul,mon,89.2,103.9,431.6,6.4,22.6,57,4.9,0,278.53 +9,9,jul,thu,93.2,114.4,560,9.5,30.2,25,4.5,0,2.75 +4,3,jul,thu,93.2,114.4,560,9.5,30.2,22,4.9,0,0 +3,4,aug,sun,94.9,130.3,587.1,14.1,23.4,40,5.8,0,1.29 +8,6,aug,sun,94.9,130.3,587.1,14.1,31,27,5.4,0,0 +2,5,aug,sun,94.9,130.3,587.1,14.1,33.1,25,4,0,26.43 +2,4,aug,mon,95,135.5,596.3,21.3,30.6,28,3.6,0,2.07 +5,4,aug,tue,95.1,141.3,605.8,17.7,24.1,43,6.3,0,2 +5,4,aug,tue,95.1,141.3,605.8,17.7,26.4,34,3.6,0,16.4 +4,4,aug,tue,95.1,141.3,605.8,17.7,19.4,71,7.6,0,46.7 +4,4,aug,wed,95.1,141.3,605.8,17.7,20.6,58,1.3,0,0 +4,4,aug,wed,95.1,141.3,605.8,17.7,28.7,33,4,0,0 +4,4,aug,thu,95.8,152,624.1,13.8,32.4,21,4.5,0,0 +1,3,aug,fri,95.9,158,633.6,11.3,32.4,27,2.2,0,0 +1,3,aug,fri,95.9,158,633.6,11.3,27.5,29,4.5,0,43.32 +6,6,aug,sat,96,164,643,14,30.8,30,4.9,0,8.59 +6,6,aug,mon,96.2,175.5,661.8,16.8,23.9,42,2.2,0,0 
+4,5,aug,mon,96.2,175.5,661.8,16.8,32.6,26,3.1,0,2.77 +3,4,aug,tue,96.1,181.1,671.2,14.3,32.3,27,2.2,0,14.68 +6,5,aug,tue,96.1,181.1,671.2,14.3,33.3,26,2.7,0,40.54 +7,5,aug,tue,96.1,181.1,671.2,14.3,27.3,63,4.9,6.4,10.82 +8,6,aug,tue,96.1,181.1,671.2,14.3,21.6,65,4.9,0.8,0 +7,5,aug,tue,96.1,181.1,671.2,14.3,21.6,65,4.9,0.8,0 +4,4,aug,tue,96.1,181.1,671.2,14.3,20.7,69,4.9,0.4,0 +2,4,aug,wed,94.5,139.4,689.1,20,29.2,30,4.9,0,1.95 +4,3,aug,wed,94.5,139.4,689.1,20,28.9,29,4.9,0,49.59 +1,2,aug,thu,91,163.2,744.4,10.1,26.7,35,1.8,0,5.8 +1,2,aug,fri,91,166.9,752.6,7.1,18.5,73,8.5,0,0 +2,4,aug,fri,91,166.9,752.6,7.1,25.9,41,3.6,0,0 +1,2,aug,fri,91,166.9,752.6,7.1,25.9,41,3.6,0,0 +5,4,aug,fri,91,166.9,752.6,7.1,21.1,71,7.6,1.4,2.17 +6,5,aug,fri,91,166.9,752.6,7.1,18.2,62,5.4,0,0.43 +8,6,aug,sun,81.6,56.7,665.6,1.9,27.8,35,2.7,0,0 +4,3,aug,sun,81.6,56.7,665.6,1.9,27.8,32,2.7,0,6.44 +2,4,aug,sun,81.6,56.7,665.6,1.9,21.9,71,5.8,0,54.29 +7,4,aug,sun,81.6,56.7,665.6,1.9,21.2,70,6.7,0,11.16 +1,4,aug,sat,94.4,146,614.7,11.3,25.6,42,4,0,0 +6,3,nov,tue,79.5,3,106.7,1.1,11.8,31,4.5,0,0 diff --git a/src/test/resources/loan_risk.csv b/src/test/resources/loan_risk.csv new file mode 100644 index 00000000..e8acb981 --- /dev/null +++ b/src/test/resources/loan_risk.csv @@ -0,0 +1,1001 @@ +term,home_ownership,purpose,addr_state,verification_status,application_type,loan_amnt,emp_length,annual_inc,dti,delinq_2yrs,revol_util,total_acc,credit_length_in_years,label,int_rate,net,issue_year + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,10000,3,60000,16.18,0,88.7,12,9,tru,18.75,-4580.83,2012 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,8000,3,40950,23.92,0,60.4,13,15,fals,7.62,813.72,2012 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,3200,null,10728,19.91,0,91,5,9,tru,24.99,-2441.12,2016 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,27000,8,100000,12.53,0,85.9,37,20,fals,13.65,6045.97,2014 + 36 months,RENT,credit_card,NC,Not 
Verified,INDIVIDUAL,6000,0,40000,33.57,0,28.6,33,6,fals,9.16,400.09,2016 + 36 months,RENT,debt_consolidation,MO,Verified,INDIVIDUAL,12000,10,80000,22.38,1,65.8,25,11,tru,11.99,-5170.3,2014 + 36 months,MORTGAGE,credit_card,MA,Verified,INDIVIDUAL,18000,10,65000,14.57,1,55.3,20,13,fals,11.47,1923.34,2016 + 36 months,RENT,debt_consolidation,NC,Verified,INDIVIDUAL,7900,7,22000,30.66,0,94.8,12,14,fals,14.46,1087.14,2016 + 60 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,22975,10,125000,15.26,0,42,42,19,tru,14.65,-13542.43,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,20000,0,121656,13.11,13,42,47,21,fals,9.99,2191.74,2015 + 60 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,35000,5,181000,9.87,1,76.2,28,24,fals,15.88,12964.24,2013 + 36 months,MORTGAGE,major_purchase,TX,Not Verified,INDIVIDUAL,12000,10,110000,12.21,0,47.6,32,16,tru,7.62,-828.7,2012 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,3500,8,32000,20.97,0,71.9,15,17,tru,12.99,-2430.21,2014 + 60 months,MORTGAGE,home_improvement,NY,Verified,INDIVIDUAL,15800,4,59000,32.81,0,19.1,35,13,fals,15.61,2669.96,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,35000,6,180000,11.87,0,42.1,20,17,fals,9.76,3577.28,2012 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,15000,9,55000,25.18,0,80.6,13,13,fals,21.49,2533.81,2016 + 36 months,RENT,debt_consolidation,HI,Verified,INDIVIDUAL,11200,9,54600,5.8,0,36.9,13,11,fals,12.99,696.86,2015 + 36 months,MORTGAGE,major_purchase,CA,Verified,INDIVIDUAL,6800,9,70000,19.48,0,28.7,33,29,fals,18.24,1952.11,2014 + 36 months,MORTGAGE,debt_consolidation,MN,Verified,INDIVIDUAL,35000,10,110000,15,0,44.3,36,25,fals,14.65,6975.62,2015 + 36 months,MORTGAGE,home_improvement,MO,Verified,INDIVIDUAL,3600,10,110000,21.31,0,86.5,28,16,fals,13.35,516.37,2014 + 36 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,15000,6,113000,5.21,0,35,31,30,tru,10.99,-13526.97,2014 + 36 
months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,21000,6,85000,21.59,0,46.1,26,16,fals,8.18,897.43,2015 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,7500,9,55000,27.65,0,46.3,36,11,tru,16.59,-2332.48,2014 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,15600,6,70000,6.79,1,59.8,5,9,fals,16.2,2769.44,2013 + 36 months,RENT,credit_card,NY,Verified,INDIVIDUAL,20000,3,65000,10.51,1,48.1,24,9,fals,10.99,2215.59,2015 + 60 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,22000,null,49000,19.05,0,69.5,28,14,tru,17.77,2870.05,2013 + 60 months,MORTGAGE,credit_card,UT,Not Verified,INDIVIDUAL,14000,1,80000,25.64,0,82.2,24,11,fals,10.49,2921.16,2014 + 60 months,RENT,major_purchase,GA,Verified,INDIVIDUAL,5900,3,144000,1.24,0,26.8,29,15,fals,15.61,1180.95,2014 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,7025,2,30000,0,0,0,7,4,fals,18.99,1678.46,2014 + 60 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,21000,0,215000,13.04,0,94.2,50,17,fals,18.99,4634.81,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,25000,2,60000,9.76,0,72,7,7,fals,7.39,107.08,2016 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,10050,8,30000,14.96,0,45.1,8,8,fals,9.71,1575.06,2013 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,35000,10,80000,21.18,0,27.4,29,12,tru,15.61,-15492.3,2015 + 36 months,MORTGAGE,debt_consolidation,NV,Not Verified,INDIVIDUAL,2000,0,40000,15.03,0,51.8,21,16,fals,14.31,284.97,2015 + 60 months,RENT,credit_card,OR,Verified,INDIVIDUAL,16675,9,65000,1.88,0,15.4,27,22,fals,15.61,216.92,2014 + 36 months,MORTGAGE,credit_card,RI,Not Verified,INDIVIDUAL,15000,3,75000,27.15,0,84.6,39,13,tru,13.98,-8204.47,2014 + 36 months,MORTGAGE,credit_card,CO,Verified,INDIVIDUAL,24000,2,80000,14.45,0,67.4,18,13,fals,6.99,2450.02,2014 + 60 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,25000,2,170000,9.37,0,33.7,48,13,fals,14.47,3491.63,2014 + 36 months,MORTGAGE,credit_card,GA,Not 
Verified,INDIVIDUAL,6700,8,33000,13.27,0,52.7,20,10,fals,9.67,1045.49,2013 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,27575,10,80880,29.35,1,40,45,33,tru,14.99,-12952.57,2015 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,18000,10,100000,7.91,0,44,20,14,tru,7.26,-13543.74,2015 + 60 months,MORTGAGE,debt_consolidation,NM,Verified,INDIVIDUAL,35000,10,110357.14,18.95,0,86.8,34,21,tru,13.99,-19106.16,2015 + 36 months,RENT,debt_consolidation,PA,Not Verified,INDIVIDUAL,5000,2,52000,14.81,2,55.2,12,28,fals,12.49,1021.26,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,15000,4,71000,22.01,0,64.8,15,12,tru,6.39,-8125.8,2015 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,10000,10,78000,20.69,0,99.6,18,20,fals,12.39,812.24,2014 + 36 months,RENT,credit_card,AL,Verified,INDIVIDUAL,16500,null,42445,20.07,0,96.6,6,20,fals,11.67,3094.75,2014 + 36 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,6000,10,78000,26.15,0,9.9,43,25,fals,12.29,958.73,2015 + 60 months,MORTGAGE,debt_consolidation,ME,Verified,INDIVIDUAL,35000,0,93000,22.04,0,63.2,18,13,fals,18.55,2923.93,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,20000,10,55000,26.78,1,61.8,39,17,fals,13.11,2997.82,2013 + 36 months,RENT,credit_card,NV,Verified,INDIVIDUAL,12000,10,53000,14.1,0,81.6,36,16,fals,9.76,1864.82,2012 + 60 months,RENT,credit_card,MD,Verified,INDIVIDUAL,24000,0,60000,18.28,1,67.7,28,18,fals,9.17,3550.64,2015 + 36 months,OWN,credit_card,WY,Verified,INDIVIDUAL,5400,3,55000,24,0,22,30,20,fals,8.9,775.36,2014 + 36 months,MORTGAGE,major_purchase,TX,Verified,INDIVIDUAL,1500,10,63000,21.18,0,35.3,48,25,fals,14.98,369.77,2014 + 36 months,MORTGAGE,home_improvement,IN,Not Verified,INDIVIDUAL,4500,4,61847,19.41,0,21.6,39,18,fals,7.12,517.1,2014 + 36 months,OWN,credit_card,NY,Verified,INDIVIDUAL,8000,10,73000,8.3,0,60.5,29,27,fals,11.14,1447.79,2013 + 36 
months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,22000,7,55000,6.95,0,24.3,23,12,fals,9.17,1316.6,2015 + 36 months,RENT,debt_consolidation,OH,Not Verified,INDIVIDUAL,5000,2,38500,21.79,0,62.5,20,19,tru,14.64,551.61,2014 + 36 months,RENT,medical,TN,Verified,INDIVIDUAL,2000,9,40000,11.97,2,64.9,16,14,fals,9.49,86.22,2016 + 36 months,MORTGAGE,debt_consolidation,OK,Not Verified,INDIVIDUAL,8000,1,41000,6.5,6,79.6,17,13,fals,8.19,912.34,2015 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,12000,2,111000,9.37,1,39.6,17,10,fals,9.17,924.43,2015 + 36 months,RENT,debt_consolidation,OH,Not Verified,INDIVIDUAL,10000,10,40000,30.48,1,24.7,40,16,tru,8.18,-6234.14,2015 + 36 months,MORTGAGE,small_business,MD,Verified,INDIVIDUAL,10000,4,60000,28.04,0,42.1,13,7,fals,17.57,2937.24,2014 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,1600,3,35928,19.86,0,79.5,11,8,fals,17.1,456.4,2013 + 36 months,MORTGAGE,debt_consolidation,IN,Not Verified,INDIVIDUAL,16000,4,60000,24.78,2,46.6,45,12,fals,18.49,4295.77,2013 + 36 months,MORTGAGE,other,WI,Verified,INDIVIDUAL,24500,7,49000,14.6,1,54.3,31,24,tru,14.65,-8462.85,2015 + 36 months,RENT,major_purchase,PA,Verified,INDIVIDUAL,9000,10,89000,10.53,0,62.5,51,14,tru,18.99,-954,2014 + 36 months,RENT,medical,NY,Verified,INDIVIDUAL,4000,1,58000,26.06,0,1.3,30,16,tru,10.78,-2305.38,2016 + 36 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,18400,2,50000,6.81,0,22.9,18,11,fals,13.99,1777.07,2015 + 36 months,RENT,other,NY,Not Verified,INDIVIDUAL,10400,8,60000,16.44,0,37.7,9,8,fals,7.62,1113.12,2012 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,10000,2,78000,13.25,0,29.9,42,25,fals,7.69,767.58,2014 + 36 months,OWN,home_improvement,NC,Not Verified,INDIVIDUAL,4800,10,29070,24.69,0,33.6,12,10,fals,9.8,442.35,2016 + 36 months,RENT,credit_card,AL,Verified,INDIVIDUAL,14000,7,52000,22.13,0,68.6,14,17,fals,12.79,1097.23,2016 + 60 months,OWN,debt_consolidation,KY,Not 
Verified,INDIVIDUAL,19200,6,75000,6.94,2,5,19,13,fals,7.89,1811.14,2016 + 36 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,11200,8,89000,10.5,1,59.2,22,16,fals,14.33,2434.28,2013 + 36 months,RENT,debt_consolidation,MO,Not Verified,INDIVIDUAL,2500,5,32000,18.38,0,37.7,27,15,fals,13.35,547.63,2014 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,23000,10,55500,12.15,1,59.6,18,37,fals,14.64,1634.83,2014 + 36 months,RENT,small_business,CA,Verified,INDIVIDUAL,19075,9,77000,20.2,1,68.2,22,9,fals,18.49,5919.91,2013 + 36 months,RENT,credit_card,FL,Verified,INDIVIDUAL,8225,1,21835,7.94,0,60.7,37,32,fals,17.57,2022.12,2014 + 36 months,RENT,other,NY,Verified,INDIVIDUAL,7000,10,65000,15.01,0,47.7,18,8,fals,10.74,1219.11,2012 + 60 months,MORTGAGE,home_improvement,WV,Verified,INDIVIDUAL,12000,5,55000,17.85,2,40.8,20,19,tru,19.52,-5385.66,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,5500,3,65000,17.3,0,26.2,22,17,fals,8.18,140.76,2015 + 36 months,OWN,other,NY,Not Verified,INDIVIDUAL,10100,7,33738,10.14,0,59.3,16,4,fals,24.5,4264.12,2014 + 60 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,24000,10,108000,17.35,0,81.4,25,9,tru,22.47,6118.05,2012 + 36 months,RENT,moving,CA,Verified,INDIVIDUAL,8100,2,75000,7.74,0,26.5,23,16,fals,18.25,2377.61,2014 + 36 months,MORTGAGE,car,AZ,Not Verified,INDIVIDUAL,4000,10,52000,23.7,0,52.8,30,16,fals,19.52,1264.16,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,25000,10,110000,21.58,0,70.2,21,17,fals,8.39,2729.4,2014 + 60 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,20000,1,48000,32.82,0,30.8,47,10,fals,12.29,2914.98,2015 + 36 months,RENT,debt_consolidation,LA,Verified,INDIVIDUAL,15000,3,77000,25.45,0,73.2,26,12,fals,15.8,3935.35,2013 + 36 months,MORTGAGE,home_improvement,MN,Not Verified,INDIVIDUAL,5000,null,60000,25.76,0,8.1,36,27,tru,7.49,-4694.18,2015 + 36 months,OWN,debt_consolidation,OH,Not 
Verified,INDIVIDUAL,3625,10,30000,34.48,0,93.7,11,18,tru,18.99,-2002.63,2016 + 36 months,OWN,credit_card,FL,Not Verified,INDIVIDUAL,10000,10,75858,9.84,0,63,28,11,tru,12.99,-8322.72,2014 + 36 months,MORTGAGE,credit_card,GA,Verified,INDIVIDUAL,12000,10,37000,22.09,0,49.3,20,19,fals,12.69,2089.45,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,10000,9,51000,11.11,0,73,9,9,tru,19.53,-6870.56,2016 + 60 months,MORTGAGE,credit_card,CT,Verified,INDIVIDUAL,28800,9,85000,27.49,0,62,30,12,tru,12.49,-19102.98,2014 + 36 months,OWN,home_improvement,NC,Not Verified,INDIVIDUAL,12000,6,150000,3.51,0,4.6,39,17,fals,6.03,493.06,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,25000,10,225000,13.63,3,61,32,18,fals,7.89,2300.94,2015 + 36 months,OWN,home_improvement,TX,Verified,INDIVIDUAL,8400,5,39000,18.06,0,90,6,16,fals,18.99,2684.83,2014 + 36 months,OWN,home_improvement,MI,Verified,INDIVIDUAL,7425,2,60000,9.98,0,80.5,42,32,fals,15.31,698.12,2013 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,19200,10,74767,29.13,0,47.5,21,21,tru,18.99,-10256.98,2014 + 60 months,MORTGAGE,debt_consolidation,LA,Verified,INDIVIDUAL,21000,10,92000,33.59,2,79.7,24,14,fals,17.57,5819.49,2015 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,18000,4,48000,31.05,0,66.6,18,29,fals,15.61,3379.48,2014 + 36 months,MORTGAGE,debt_consolidation,UT,Verified,INDIVIDUAL,15500,10,79000,28.42,3,71.2,39,18,fals,12.12,1397.84,2012 + 60 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,18000,10,90000,24.72,2,80,35,15,tru,22.2,-46.5,2013 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,7000,null,30000,18.88,0,21.8,36,20,fals,6.24,134.26,2015 + 36 months,RENT,credit_card,TX,Not Verified,INDIVIDUAL,5000,5,70000,10.89,0,37.3,25,15,fals,12.35,102.02,2013 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,17375,9,58000,20.79,0,40.8,27,14,fals,10.99,3100.06,2014 + 36 
months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,27850,4,75000,28.88,0,50.9,48,14,fals,13.98,5014.99,2014 + 36 months,RENT,debt_consolidation,WI,Verified,INDIVIDUAL,10000,null,39000,21.91,0,22.4,17,30,tru,15.61,-5804.2,2013 + 60 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,14900,3,58000,31.51,0,56.8,48,10,tru,19.52,-5932.21,2014 + 60 months,MORTGAGE,debt_consolidation,DC,Verified,INDIVIDUAL,18375,5,73000,22.13,0,95.1,30,12,tru,18.54,-8907.08,2015 + 36 months,MORTGAGE,moving,MI,Not Verified,INDIVIDUAL,5000,10,90000,36.27,0,82.6,46,28,tru,16.29,-4715.87,2014 + 36 months,RENT,debt_consolidation,NV,Verified,INDIVIDUAL,9600,10,39000,23.72,0,58.1,14,7,tru,10.99,-3739.86,2014 + 36 months,MORTGAGE,credit_card,MN,Verified,INDIVIDUAL,12200,10,30000,27.8,0,39.2,15,10,fals,7.89,1053.99,2015 + 36 months,MORTGAGE,debt_consolidation,MD,Not Verified,INDIVIDUAL,4100,1,50000,10.67,0,96.8,15,9,fals,16.99,1161.56,2014 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,12000,0,70000,27.33,0,45.5,27,10,tru,12.59,-6389.21,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,10000,10,56000,31.22,0,57.2,16,9,fals,7.26,407.34,2015 + 36 months,RENT,credit_card,TN,Verified,INDIVIDUAL,5000,1,32000,25.24,0,85.5,40,18,fals,8.19,634.96,2015 + 36 months,MORTGAGE,car,CO,Verified,INDIVIDUAL,12000,0,75000,18.36,0,28.6,49,18,fals,6.24,950,2015 + 60 months,MORTGAGE,debt_consolidation,WI,Verified,INDIVIDUAL,27275,10,62000,14.2,1,64.1,45,21,tru,22.99,-23500.97,2015 + 60 months,RENT,major_purchase,MO,Verified,INDIVIDUAL,12000,10,62000,3.77,0,8.6,13,7,tru,17.57,-5639,2015 + 36 months,RENT,major_purchase,CA,Not Verified,INDIVIDUAL,9000,10,101000,28.8,1,69.7,19,22,fals,12.99,1924.51,2014 + 36 months,MORTGAGE,credit_card,KY,Verified,INDIVIDUAL,24000,10,319374,10.91,0,59.8,39,12,fals,7.89,1336.13,2015 + 60 months,MORTGAGE,credit_card,MD,Verified,INDIVIDUAL,18000,10,71000,7.81,0,75.5,21,26,fals,13.98,5385.81,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Not 
Verified,INDIVIDUAL,13500,10,68500,21.67,0,71.1,19,9,fals,15.61,3313.52,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,3000,10,37000,24.91,1,47.1,24,14,fals,18.99,2.94,2016 + 60 months,MORTGAGE,major_purchase,WA,Verified,INDIVIDUAL,16000,7,45000,7.55,0,16.9,11,8,fals,17.99,944.36,2012 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,12000,3,50000,20.15,0,52.8,9,7,fals,9.99,1396.6,2015 + 36 months,MORTGAGE,credit_card,OR,Not Verified,INDIVIDUAL,25000,6,58000,11.45,0,51,14,12,fals,7.26,2266.28,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,8,59000,26.51,0,38,27,17,fals,9.99,1212.56,2015 + 60 months,MORTGAGE,debt_consolidation,OR,Verified,INDIVIDUAL,24000,4,60000,14.14,0,57.1,11,11,fals,14.49,5681.66,2014 + 36 months,MORTGAGE,credit_card,WI,Verified,INDIVIDUAL,10000,10,35000,20.57,0,68.6,26,21,fals,11.55,1879.92,2013 + 60 months,RENT,debt_consolidation,MD,Verified,INDIVIDUAL,22000,10,55000,14.2,0,50.4,33,15,tru,18.25,-6966.6,2014 + 36 months,RENT,debt_consolidation,TN,Verified,INDIVIDUAL,19000,1,50000,20.12,0,52.8,11,10,fals,11.44,2698.87,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,10000,2,60000,26.78,0,20.2,37,12,fals,11.53,1210.38,2015 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,28000,10,100000,22.42,0,89.1,30,13,fals,18.85,8719.27,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,17700,9,59200,32.37,0,75.4,29,37,fals,21.99,4426.97,2014 + 60 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,30000,3,80000,10.79,2,0,20,13,fals,21.99,4752.12,2015 + 60 months,RENT,debt_consolidation,AL,Verified,INDIVIDUAL,13750,0,41000,19.08,0,93.8,13,18,fals,18.92,4764.01,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,18900,10,45000,21.42,0,58.6,25,28,fals,12.69,1538.04,2015 + 36 months,RENT,other,PA,Not Verified,INDIVIDUAL,1000,2,23556,16.35,0,54.3,12,5,fals,18.75,315.07,2013 + 36 
months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,4750,1,26000,8.08,0,31.7,12,4,fals,18.24,387.94,2014 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,1,120000,7.39,0,27.4,17,15,fals,11.14,3971.44,2013 + 60 months,OWN,home_improvement,CA,Not Verified,INDIVIDUAL,25000,2,130000,7.07,1,11.8,42,23,fals,6.49,5.19,2016 + 36 months,RENT,debt_consolidation,NC,Verified,INDIVIDUAL,15000,2,92000,14.57,1,75.4,23,14,fals,18.85,4438.06,2013 + 60 months,MORTGAGE,major_purchase,NY,Not Verified,INDIVIDUAL,15000,2,70000,21.16,1,59.2,32,25,fals,14.65,3700.41,2015 + 36 months,MORTGAGE,credit_card,KS,Not Verified,INDIVIDUAL,16700,10,89000,9.4,0,37.4,25,15,fals,7.62,2042.75,2013 + 60 months,MORTGAGE,medical,CO,Verified,INDIVIDUAL,11200,0,45000,16,0,43.9,18,7,tru,18.54,-6500.7,2015 + 36 months,RENT,debt_consolidation,HI,Not Verified,INDIVIDUAL,8400,2,78000,19.34,1,63.5,33,21,fals,12.99,1618.41,2013 + 60 months,OWN,credit_card,NY,Verified,INDIVIDUAL,35000,7,120000,10.61,0,53.1,48,30,fals,15.31,9627.29,2014 + 36 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,6000,10,50000,13.44,0,50.3,18,17,fals,18.75,1890.41,2013 + 60 months,MORTGAGE,home_improvement,VA,Not Verified,INDIVIDUAL,14800,10,156800,9.4,0,20.6,28,15,fals,12.29,270.13,2015 + 60 months,OWN,debt_consolidation,OH,Verified,INDIVIDUAL,29175,9,65000,14.29,0,79.3,17,14,tru,19.72,-5610.6,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10800,6,80000,5.04,0,39.2,27,35,tru,18.84,-7192.88,2015 + 60 months,MORTGAGE,home_improvement,NC,Verified,INDIVIDUAL,16150,3,80000,6.44,0,61.3,15,10,fals,19.99,6324.65,2014 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,22500,9,51012,28.3,1,75.6,24,14,fals,22.95,11174.17,2013 + 36 months,MORTGAGE,credit_card,TX,Not Verified,INDIVIDUAL,6000,10,54000,33.82,0,15.4,28,22,fals,6.89,106.07,2015 + 36 months,RENT,other,NJ,Not Verified,INDIVIDUAL,8400,2,155000,10.06,0,20.4,49,13,fals,15.31,2128.72,2013 + 36 
months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,6000,2,70000,8.21,4,27.3,20,12,fals,9.49,472.91,2016 + 60 months,RENT,credit_card,TX,Verified,INDIVIDUAL,16000,6,47000,37.31,0,54.9,21,16,fals,13.33,1287.55,2015 + 36 months,RENT,credit_card,MN,Not Verified,INDIVIDUAL,8000,0,29000,21.52,0,57,24,13,fals,13.65,1794.21,2014 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,11200,0,47500,16.43,0,12.4,43,10,fals,6.49,778.01,2014 + 36 months,OWN,debt_consolidation,OR,Verified,INDIVIDUAL,7000,8,50000,19.47,0,65.6,7,6,fals,7.12,757.25,2014 + 36 months,MORTGAGE,medical,IN,Not Verified,INDIVIDUAL,14300,0,95000,9.11,1,21.6,46,12,fals,15.61,3699.82,2013 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,16000,1,78500,16.15,0,87.9,19,10,fals,23.43,1500.17,2014 + 36 months,MORTGAGE,credit_card,TX,Verified,INDIVIDUAL,16000,null,61576,19.72,0,90.9,19,16,tru,12.99,-5188.08,2014 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,35000,10,96012,27.81,0,7.8,21,20,fals,8.9,4543.3,2014 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,16000,6,50000,9.19,0,15.9,57,24,fals,6.49,256.8,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,29400,1,220000,18.59,0,96.9,22,16,fals,12.85,9526.63,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,15000,2,120000,18.87,0,14.2,33,20,fals,8.67,797.59,2015 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,16000,0,52000,29.29,1,77.2,15,7,fals,24.74,2142.55,2016 + 36 months,RENT,other,NM,Verified,INDIVIDUAL,11325,5,31475,14.8,0,45.9,10,13,tru,16.55,-9033.89,2015 + 60 months,RENT,debt_consolidation,IN,Verified,INDIVIDUAL,13500,4,31000,31.2,0,46.4,17,5,tru,27.34,-9102.3,2016 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,21425,10,89740,8.08,0,35,22,31,tru,17.14,-16124.9,2015 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,5300,2,85000,9.28,2,25.9,28,24,tru,13.67,-1841.96,2013 + 60 
months,OWN,debt_consolidation,OH,Verified,INDIVIDUAL,19200,10,58000,20.11,0,59.7,8,8,fals,20.5,3096.33,2016 + 36 months,OWN,car,NC,Not Verified,INDIVIDUAL,6000,0,45000,11.41,0,2.5,16,11,fals,7.89,565.32,2015 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,20000,0,68000,20.77,0,68.5,29,16,tru,21.98,-10073.74,2012 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,21000,4,50000,23.36,0,54.2,30,10,fals,18.55,5491.18,2015 + 36 months,MORTGAGE,credit_card,NC,Not Verified,INDIVIDUAL,14000,10,72000,20.02,0,83.9,12,41,fals,7.69,1729.74,2014 + 36 months,RENT,credit_card,NC,Not Verified,INDIVIDUAL,15000,2,80000,9.07,0,24.8,9,24,tru,10.15,-9664.23,2014 + 60 months,OWN,home_improvement,CA,Verified,INDIVIDUAL,17400,10,66000,26.97,1,19.8,36,23,fals,14.33,2681.52,2012 + 36 months,RENT,credit_card,TN,Verified,INDIVIDUAL,10750,3,32000,33.56,0,50,27,14,fals,16.29,2628.7,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,17000,7,125000,9.83,0,39,45,12,fals,5.32,361.12,2016 + 60 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,2,120000,18.12,0,48,23,14,fals,13.35,9889.36,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,24150,10,72000,35.6,0,68,22,17,fals,18.25,4837.18,2015 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,12000,7,60000,18.28,0,34.4,29,12,tru,24.99,-8847.31,2015 + 36 months,RENT,credit_card,CO,Verified,INDIVIDUAL,9000,2,30000,24.08,0,71.4,16,7,fals,11.44,1463.55,2015 + 36 months,MORTGAGE,vacation,TX,Verified,INDIVIDUAL,4000,0,48000,16.73,0,60,18,14,fals,16.59,1091.36,2014 + 36 months,OWN,debt_consolidation,NY,Not Verified,INDIVIDUAL,24000,7,180000,12,0,25.4,45,29,fals,7.59,1354.15,2016 + 60 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,35000,8,130000,22.49,0,74.3,36,11,tru,23.99,-26947.38,2015 + 36 months,RENT,vacation,NY,Not Verified,INDIVIDUAL,10000,3,74000,8.09,0,20.8,5,10,fals,9.76,1310.07,2012 + 36 
months,RENT,home_improvement,CA,Verified,INDIVIDUAL,8000,4,54000,28.39,0,33.4,37,11,fals,10.15,1282.44,2014 + 36 months,MORTGAGE,credit_card,OH,Verified,INDIVIDUAL,14000,10,55000,15.3,0,60,16,24,fals,6.49,974.16,2014 + 36 months,MORTGAGE,debt_consolidation,KY,Not Verified,INDIVIDUAL,7200,10,98000,16.04,0,93,20,21,fals,12.69,115.7,2015 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,21000,10,42000,23.57,0,74.2,23,18,tru,18.49,-14414.8,2015 + 36 months,RENT,debt_consolidation,MI,Not Verified,INDIVIDUAL,10000,0,48000,22.09,0,48.1,22,25,tru,12.99,-2803.1,2014 + 60 months,RENT,debt_consolidation,CT,Verified,INDIVIDUAL,25000,10,57987,31.89,1,47.3,31,18,tru,18.25,-16044.77,2015 + 60 months,MORTGAGE,debt_consolidation,KY,Verified,INDIVIDUAL,9925,10,28874,15.71,1,35.4,22,23,fals,16.29,3667.39,2013 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,6000,3,22000,35.84,0,62.8,8,4,fals,14.46,512.31,2016 + 36 months,OWN,small_business,MN,Verified,INDIVIDUAL,24000,4,71000,13.02,0,43.7,10,15,tru,17.27,-22328.25,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,13700,2,79000,20.22,0,39.3,51,11,fals,12.12,4147.44,2012 + 36 months,OWN,vacation,TX,Verified,INDIVIDUAL,12000,10,65000,13.9,0,19.7,50,17,fals,9.67,419.71,2014 + 36 months,RENT,medical,NV,Verified,INDIVIDUAL,17000,0,122000,13.89,0,37.9,52,24,fals,15.31,1968.29,2014 + 36 months,OWN,debt_consolidation,MN,Not Verified,INDIVIDUAL,17100,1,96000,34.23,0,63.4,20,20,tru,25.44,-12679.04,2016 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,8125,7,36000,21.51,0,39.5,13,12,fals,14.49,1945.25,2014 + 60 months,RENT,credit_card,TX,Verified,INDIVIDUAL,23500,2,55000,30.92,0,68,18,11,tru,14.65,-10085.03,2015 + 36 months,OWN,debt_consolidation,MA,Verified,INDIVIDUAL,6250,null,32000,16.17,0,36.5,21,19,fals,10.99,843.34,2014 + 36 months,RENT,debt_consolidation,VA,Not Verified,INDIVIDUAL,10000,4,45000,22.29,0,62.3,12,15,tru,13.68,-6257.47,2013 + 36 
months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,35000,2,107000,1.17,0,15.3,33,17,fals,9.49,1888.3,2016 + 36 months,OWN,debt_consolidation,AL,Not Verified,INDIVIDUAL,7800,10,48000,20.03,0,28.6,24,13,fals,6.39,438.45,2015 + 36 months,MORTGAGE,debt_consolidation,UT,Verified,INDIVIDUAL,11500,7,119000,15.6,1,31.3,29,21,fals,10.99,2051.81,2014 + 36 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,10000,4,68000,10.71,1,86.2,21,33,fals,14.09,2319.63,2013 + 36 months,RENT,debt_consolidation,AL,Verified,INDIVIDUAL,24000,3,150000,12.05,1,54.1,52,21,fals,6.39,2089.21,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,1850,8,28000,10.33,0,70.8,10,13,fals,11.14,82.08,2012 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30000,2,200000,23.24,0,99.2,28,45,fals,19.72,8833.98,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,8450,10,28200,38.17,2,38.7,20,11,fals,18.25,1451.05,2016 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,12000,5,100000,17.84,0,35.5,29,14,fals,14.64,291.13,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,14675,9,66000,28.08,1,48.8,41,23,tru,18.25,-12500.85,2015 + 36 months,RENT,moving,NY,Not Verified,INDIVIDUAL,6000,5,62000,20.72,0,75.2,25,13,fals,18.25,1836.01,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,11300,10,40000,26.79,0,92.5,17,16,tru,17.57,-8441.87,2015 + 60 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,35000,10,150000,18.04,0,46.3,54,27,fals,17.57,1310.28,2014 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,9000,7,65200,17.14,0,42.2,12,5,fals,10.99,880.6,2015 + 36 months,RENT,credit_card,WI,Verified,INDIVIDUAL,4000,null,24220,9.34,0,43.6,18,15,fals,10.64,639.15,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,5875,7,25300,24.52,0,35.3,10,10,fals,10.99,1048.19,2013 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,10500,10,48000,13.55,0,49.8,27,17,fals,10.99,1874.89,2014 
+ 36 months,MORTGAGE,debt_consolidation,MA,Not Verified,INDIVIDUAL,10000,1,82000,3.81,0,64.2,21,25,fals,11.67,1603.49,2014 + 60 months,OWN,debt_consolidation,WV,Verified,INDIVIDUAL,30000,6,70000,27.07,0,29.9,15,19,tru,21.6,-22501.38,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,9000,8,42500,31.37,0,28.7,40,15,fals,16.99,1110.73,2016 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,10,58000,16.38,2,78,26,14,fals,17.57,3708.56,2014 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,20000,5,57000,15.79,0,60.3,15,10,tru,12.99,-11173.33,2016 + 36 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,6000,1,73000,19.6,0,19.1,45,14,fals,14.46,503,2016 + 36 months,MORTGAGE,home_improvement,VA,Verified,INDIVIDUAL,14300,4,38000,12.28,0,52.7,26,12,fals,13.11,1148.25,2013 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,25000,10,98000,14.39,1,76.1,16,20,fals,11.53,911.43,2015 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,5000,10,85000,6.58,0,59.2,17,9,fals,10.49,352.31,2016 + 36 months,MORTGAGE,small_business,GA,Verified,INDIVIDUAL,5000,2,70000,19.85,0,80.2,17,12,fals,13.49,271.02,2016 + 36 months,RENT,debt_consolidation,IL,Not Verified,INDIVIDUAL,6000,8,70000,17.95,0,47.5,28,12,fals,7.12,280.36,2014 + 36 months,RENT,other,NY,Verified,INDIVIDUAL,4800,0,95000,0.54,0,51,5,11,fals,20.2,1487.23,2013 + 36 months,MORTGAGE,medical,OH,Verified,INDIVIDUAL,4200,10,74839.91,19.91,0,89.2,34,37,fals,21.6,856.81,2013 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,1450,10,52400,15.3,0,74.3,16,14,fals,18.25,61.88,2015 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,21250,10,74000,9.2,0,49.3,26,10,fals,23.4,7363.56,2013 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,20000,2,96000,15.82,0,68.3,22,15,fals,12.35,3726.11,2013 + 36 months,OWN,other,FL,Verified,INDIVIDUAL,5000,9,70000,14.74,1,34.8,15,11,tru,11.49,-4485.07,2016 + 60 
months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,23550,10,65000,13.79,0,40.3,27,12,tru,16.99,-21231.59,2015 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,24000,0,110000,24.06,0,73.1,53,26,fals,10.49,3514.29,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,33425,5,75000,16.27,0,72.1,11,11,fals,15.8,8760.7,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,24000,10,115000,20.23,1,81.2,48,10,fals,9.49,374.42,2014 + 60 months,MORTGAGE,major_purchase,CA,Verified,INDIVIDUAL,21000,2,144000,18.35,0,90,33,16,fals,15.8,550.18,2013 + 36 months,MORTGAGE,debt_consolidation,NJ,Not Verified,INDIVIDUAL,18000,5,87500,11.34,0,27.9,31,12,fals,6.62,1895.95,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,8400,2,65000,10.43,0,31.1,49,14,fals,8.39,605.35,2016 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,11300,2,47500,27.01,0,66.2,22,12,fals,22.99,1280.89,2015 + 36 months,RENT,credit_card,NV,Not Verified,INDIVIDUAL,3000,null,24000,25.15,0,45.8,17,19,fals,9.49,154.28,2016 + 36 months,RENT,debt_consolidation,OH,Verified,INDIVIDUAL,20000,10,72000,13.47,1,23.9,29,21,fals,13.49,845.32,2016 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,10000,7,102700,14.43,0,22,27,17,fals,11.99,1627.7,2014 + 36 months,RENT,moving,TX,Not Verified,INDIVIDUAL,1450,2,24000,22.5,0,3.8,13,7,fals,14.65,311.96,2015 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,28000,10,89207,17.01,1,62.3,31,14,fals,26.3,1499.1,2017 + 60 months,MORTGAGE,credit_card,KY,Verified,INDIVIDUAL,30000,10,300000,14.6,0,83.8,51,24,fals,18.84,8207.86,2015 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,20700,10,45000,16.45,2,0.4,17,18,fals,19.99,4127.64,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,4000,10,92000,22.67,0,77.7,19,15,fals,10.99,449.89,2015 + 36 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,14400,3,60000,13.36,0,35.3,10,10,fals,9.17,1565.84,2014 + 60 
months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,14400,2,55000,14.82,0,62.9,13,13,tru,15.61,-3213.99,2014 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,10000,10,90000,15.45,0,40.1,33,18,fals,5.32,126.62,2016 + 60 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,20000,10,79000,20.27,0,82.3,35,16,fals,13.99,3336.13,2016 + 36 months,RENT,other,NJ,Verified,INDIVIDUAL,7000,7,190000,4.5,0,75.1,14,13,fals,12.39,1150.58,2015 + 36 months,RENT,credit_card,FL,Verified,INDIVIDUAL,2100,0,34200,24.11,0,21.6,19,8,fals,7.62,255.79,2013 + 36 months,RENT,credit_card,AZ,Verified,INDIVIDUAL,14400,3,150000,6.49,1,96.6,16,15,fals,8.18,1501.27,2015 + 36 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,7475,6,57000,33.45,0,30,24,8,fals,21.18,1879.91,2014 + 60 months,RENT,debt_consolidation,KS,Verified,INDIVIDUAL,16300,2,45300,31.26,0,75.6,28,19,fals,18.75,7898.23,2012 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,20000,9,58000,20.09,0,65,20,20,tru,13.67,-13483.46,2016 + 60 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,24500,10,80000,22.71,0,45.6,28,13,fals,15.61,5225.29,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,7200,null,62000,32.64,0,46.4,40,34,fals,13.49,336.65,2016 + 36 months,RENT,debt_consolidation,LA,Verified,INDIVIDUAL,5300,10,39187,25.21,0,68.4,13,20,tru,16.29,-1138.98,2014 + 60 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,15000,10,59600,17.18,0,76.6,41,26,fals,19.97,5961.35,2014 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,10000,0,48200,22.31,0,42.1,15,8,fals,11.49,480.67,2016 + 36 months,MORTGAGE,home_improvement,OH,Not Verified,INDIVIDUAL,15000,10,107000,28.81,0,14.6,37,31,fals,8.99,282.48,2016 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,19300,4,46000,26.4,0,42.4,11,6,fals,17.57,1917.22,2015 + 60 months,MORTGAGE,debt_consolidation,MS,Verified,INDIVIDUAL,19250,10,45838,26.68,0,96.5,26,15,fals,22.99,6632.72,2015 + 60 months,MORTGAGE,credit_card,MI,Not 
Verified,INDIVIDUAL,10200,10,44000,22.34,0,36,49,17,tru,16.29,-5580.88,2014 + 36 months,RENT,car,PA,Not Verified,INDIVIDUAL,3500,10,42000,12.43,0,36.4,29,10,fals,19.52,1151.83,2014 + 36 months,MORTGAGE,credit_card,TX,Verified,INDIVIDUAL,11000,6,34300,16.67,1,26,18,12,fals,10.16,1807.55,2013 + 36 months,MORTGAGE,other,NY,Not Verified,INDIVIDUAL,4000,6,65488,20.94,0,80.7,35,13,fals,18.85,1267.44,2013 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,7500,10,42000,18.91,0,85,18,23,fals,14.09,1365.57,2013 + 60 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,21600,9,54000,36.22,0,65,44,11,tru,22.99,-11145,2015 + 60 months,OWN,major_purchase,WA,Not Verified,INDIVIDUAL,22500,10,45000,22.61,0,8,16,20,fals,24.99,2120.11,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,18000,6,75000,22.96,0,91.2,20,18,fals,10.99,637.99,2016 + 36 months,OWN,credit_card,TX,Verified,INDIVIDUAL,10000,1,25000,29.67,0,42.3,26,12,fals,8.9,1431.12,2014 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,3500,0,32000,7.58,0,28.1,12,7,fals,6.62,368.64,2012 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,4500,8,31000,25.63,0,76.4,14,5,fals,12.12,889.95,2012 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,7550,10,26520,18.33,0,78.8,13,12,fals,14.33,1069.4,2013 + 60 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,17000,7,51150,22.64,0,62.9,14,8,fals,17.57,7599.8,2014 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,28000,0,130000,5.2,0,48.6,16,12,fals,6.89,2474.69,2015 + 36 months,OWN,other,PA,Verified,INDIVIDUAL,1500,4,65000,12.72,0,61.4,28,26,fals,8.18,120.69,2015 + 36 months,RENT,debt_consolidation,TN,Verified,INDIVIDUAL,16000,2,82000,18.48,0,38.8,33,21,tru,12.59,-6186.66,2015 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,3500,10,86000,3.03,0,4.2,36,13,fals,6.62,360.64,2013 + 60 months,RENT,credit_card,IL,Verified,INDIVIDUAL,10000,3,42000,18.17,0,57.4,21,13,fals,17.76,1668.57,2013 + 36 
months,RENT,other,MA,Not Verified,INDIVIDUAL,8000,4,46000,17.4,0,54.9,24,11,fals,9.76,885.7,2012 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,15000,10,42000,22.2,1,29.2,29,15,tru,10.49,-4988.93,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,11800,9,104288,24.39,0,64.3,23,11,fals,18.49,3662.09,2012 + 36 months,MORTGAGE,debt_consolidation,NY,Not Verified,INDIVIDUAL,16000,4,70000,17.61,0,16.7,25,20,fals,11.22,1509.32,2015 + 60 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,18250,9,41600,14.8,0,81.7,16,13,fals,15.61,6560.81,2014 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,20000,4,194000,13.6,0,71.4,10,28,fals,5.32,827.36,2016 + 60 months,RENT,house,CA,Verified,INDIVIDUAL,26050,8,102000,8.4,1,49.1,28,15,fals,25.89,542.83,2015 + 36 months,RENT,debt_consolidation,MT,Not Verified,INDIVIDUAL,24000,1,60000,32.13,1,86.4,31,38,tru,18.25,-21528.26,2016 + 36 months,OWN,medical,FL,Verified,INDIVIDUAL,3550,null,29604,25.01,0,74.4,47,64,tru,21.99,-1298.05,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,10400,9,90000,24.21,0,49.1,21,10,tru,16.29,-6250.78,2016 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,16000,0,130000,16.42,1,83.4,13,10,tru,18.2,-10248.79,2015 + 60 months,OWN,credit_card,NY,Verified,INDIVIDUAL,12000,2,45000,8.5,0,32.8,32,11,fals,10.64,610.81,2015 + 36 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,2500,8,36000,16.03,0,85.3,24,8,tru,9.49,-419.24,2015 + 36 months,OWN,credit_card,FL,Verified,INDIVIDUAL,20000,3,126000,7.05,0,72.4,14,8,fals,13.99,4554.38,2012 + 60 months,OWN,credit_card,NJ,Verified,INDIVIDUAL,19200,10,71400,12.05,0,21,52,16,tru,8.67,-13276.75,2014 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,9425,8,30000,27.48,0,51.3,18,10,tru,15.77,-6556.85,2015 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,10000,10,50000,0.74,0,4.4,14,8,fals,7.26,717.51,2015 + 36 
months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,10000,9,38000,6.73,2,97.8,8,9,fals,10.49,979.02,2015 + 36 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,25375,4,67000,18.95,1,96.9,15,19,fals,18.25,7282.77,2015 + 36 months,MORTGAGE,debt_consolidation,OR,Verified,INDIVIDUAL,9800,7,42000,31,0,59.4,29,18,fals,18.25,976.32,2013 + 60 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,18000,null,68000,15.04,0,37.1,16,12,fals,15.99,1081.98,2017 + 60 months,OWN,debt_consolidation,GA,Verified,INDIVIDUAL,15125,2,50000,10.87,0,60.3,10,7,tru,24.99,-8327.56,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,20000,10,50000,12.82,0,90.9,11,15,fals,8.39,2691.91,2014 + 60 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,33000,0,92000,21.99,0,66.1,40,8,tru,15.61,-8333.92,2014 + 36 months,OWN,credit_card,PA,Verified,INDIVIDUAL,6700,0,29760,21.9,0,16.2,21,6,fals,12.74,500.6,2016 + 60 months,RENT,other,UT,Verified,INDIVIDUAL,18000,10,180000,35.18,0,76,44,15,tru,22.45,-16539.62,2016 + 36 months,OWN,credit_card,NV,Verified,INDIVIDUAL,9000,null,33156,10.39,0,85.2,11,15,fals,13.11,1933.96,2012 + 36 months,MORTGAGE,credit_card,OH,Verified,INDIVIDUAL,22750,10,85000,15.47,0,65.7,22,16,fals,8.9,3021.19,2012 + 36 months,OWN,vacation,WA,Not Verified,INDIVIDUAL,6000,10,52000,11.69,4,37.7,21,23,fals,19.72,289.7,2013 + 36 months,MORTGAGE,credit_card,NY,Not Verified,INDIVIDUAL,8000,10,60000,12.56,0,66.6,21,19,fals,15.31,2027.38,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,6000,5,55000,8.18,0,58.3,9,6,fals,12.12,1095.39,2012 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,25000,0,110000,22.8,0,45.7,21,34,fals,10.78,2797.02,2016 + 60 months,MORTGAGE,credit_card,UT,Verified,INDIVIDUAL,30000,1,120000,24.54,0,87.5,29,27,fals,14.65,3313.26,2015 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,20000,1,51250,25.69,0,47.4,25,8,tru,19.52,-4961.92,2014 + 60 months,RENT,debt_consolidation,CO,Not 
Verified,INDIVIDUAL,10500,9,127000,15.05,0,39.1,39,13,fals,17.57,4188.24,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,15000,3,112000,18.96,0,94.1,19,12,fals,11.49,388.37,2017 + 36 months,MORTGAGE,debt_consolidation,LA,Not Verified,INDIVIDUAL,10000,2,68000,11.6,1,45.8,14,14,fals,12.99,2070.48,2014 + 36 months,MORTGAGE,home_improvement,NY,Verified,INDIVIDUAL,17600,10,149000,13.24,2,67.4,36,26,fals,12.49,3425.6,2014 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,1,137000,11.89,0,14,23,15,fals,12.99,7089.1,2015 + 36 months,RENT,debt_consolidation,KY,Verified,INDIVIDUAL,6825,8,19000,33.89,0,91,35,16,fals,16.99,1139.29,2015 + 60 months,RENT,other,TX,Verified,INDIVIDUAL,12800,10,85000,17.13,0,33.3,27,11,fals,16.55,3258.26,2015 + 60 months,RENT,debt_consolidation,MD,Verified,INDIVIDUAL,32350,6,73000,16.69,0,90.2,19,15,fals,21.15,18237.24,2013 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,12000,5,80000,9.8,0,55.6,19,23,fals,7.9,1517.36,2013 + 36 months,MORTGAGE,home_improvement,NC,Verified,INDIVIDUAL,5000,8,15600,6.46,0,29.1,11,10,fals,12.29,746.84,2015 + 36 months,RENT,debt_consolidation,TN,Not Verified,INDIVIDUAL,4000,0,40000,12.3,0,12.2,8,5,fals,6.89,359.01,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,4000,10,50000,14.57,0,28.8,11,13,fals,18.99,179.68,2016 + 60 months,MORTGAGE,credit_card,FL,Verified,INDIVIDUAL,35000,8,270000,7.1,0,73,20,19,tru,12.99,-29174.83,2016 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,7750,1,50000,14.64,3,54,14,25,tru,11.53,-4180.41,2015 + 36 months,MORTGAGE,credit_card,TX,Verified,INDIVIDUAL,7000,10,39000,17.57,2,34.6,20,19,fals,13.33,730.92,2015 + 60 months,RENT,credit_card,IL,Verified,INDIVIDUAL,35000,10,104999,32.04,1,21.6,26,16,tru,18.49,-22498.09,2016 + 36 months,RENT,other,IL,Not Verified,INDIVIDUAL,7125,3,26000,24.52,0,1.4,63,11,fals,16.55,546.85,2015 + 60 months,RENT,debt_consolidation,NY,Not 
Verified,INDIVIDUAL,25000,9,50000,5.09,1,31.4,16,11,fals,13.33,4652.41,2015 + 60 months,RENT,debt_consolidation,NC,Verified,INDIVIDUAL,10225,2,38900,16.01,0,25.7,46,12,fals,14.99,3400.57,2014 + 36 months,RENT,credit_card,MI,Not Verified,INDIVIDUAL,4500,2,35000,23.35,0,32.5,8,4,fals,12.99,957.57,2014 + 36 months,MORTGAGE,credit_card,OK,Verified,INDIVIDUAL,14675,5,70000,32.61,0,80.4,58,19,fals,15.22,3695.62,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,9000,10,180000,7.13,0,65.5,24,10,tru,13.65,88.43,2014 + 60 months,MORTGAGE,home_improvement,OH,Not Verified,INDIVIDUAL,35000,10,185000,17.7,0,69.3,29,33,fals,22.35,5157.3,2016 + 60 months,MORTGAGE,credit_card,GA,Verified,INDIVIDUAL,10000,5,67000,16.64,0,55.1,30,13,tru,13.33,-5193.58,2015 + 36 months,RENT,debt_consolidation,MN,Not Verified,INDIVIDUAL,1700,7,24000,37.6,3,19.3,37,22,fals,13.33,203.96,2015 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,12000,10,30000,19.44,0,25.6,39,20,fals,18.99,557.77,2014 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,30000,10,128000,21.7,0,58.1,32,14,fals,8.18,3089.87,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,15000,1,95000,14.59,0,74.7,25,10,fals,15.99,9.47,2016 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,5000,9,58000,11.05,0,33.9,13,11,fals,9.67,763.7,2014 + 36 months,RENT,major_purchase,OH,Verified,INDIVIDUAL,17000,5,50000,10.15,0,22.1,11,10,fals,7.69,1894.97,2014 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,8000,10,57000,7.28,0,71,28,24,fals,11.67,685.75,2014 + 36 months,OWN,home_improvement,NY,Verified,INDIVIDUAL,8000,10,156000,15.46,0,99.1,28,19,fals,11.47,314.7,2016 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,4800,7,42000,21.03,0,34.1,13,10,fals,12.99,507.72,2014 + 36 months,RENT,major_purchase,NY,Not Verified,INDIVIDUAL,8500,2,45500,12.19,0,37.9,30,10,tru,9.75,-6318.36,2016 + 36 
months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,18500,3,127000,14.77,0,48.4,45,14,fals,9.71,2342.93,2013 + 36 months,RENT,debt_consolidation,GA,Verified,INDIVIDUAL,20500,9,62000,11.46,0,54.5,37,14,fals,12.39,2935.32,2015 + 60 months,OWN,debt_consolidation,MO,Verified,INDIVIDUAL,14400,10,37000,5.38,0,25.2,32,11,fals,14.31,1888.71,2015 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,1000,10,34000,2.29,0,49,3,21,fals,17.77,297.3,2012 + 60 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,19425,5,44000,5.35,0,46.4,12,9,fals,16.29,8694.79,2013 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,12000,5,120000,25.06,0,37.9,23,14,fals,10.15,1854.06,2014 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,7975,0,24960,12.55,0,81.5,14,18,fals,18.99,780.64,2016 + 36 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,16000,10,50000,30.36,1,95.4,34,18,tru,13.33,-5220.65,2015 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,12000,6,73000,15.96,0,83.6,26,20,fals,10.75,1372.16,2016 + 60 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,21500,5,62000,29.97,1,64.4,55,18,fals,17.14,5690.5,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,17850,10,44695,23.52,0,71.1,9,13,tru,15.61,-5888.35,2015 + 60 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,35000,2,175000,23.44,0,8.2,21,14,fals,13.98,11315.24,2014 + 60 months,RENT,other,NY,Not Verified,INDIVIDUAL,15775,2,157000,28.88,0,32.8,44,19,fals,12.29,2931.99,2015 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,18000,1,58000,14.71,1,68.6,18,10,fals,15.61,6901.25,2013 + 36 months,MORTGAGE,home_improvement,PA,Verified,INDIVIDUAL,4700,10,48000,23.68,0,41.6,15,28,fals,16.55,620.64,2015 + 36 months,RENT,credit_card,MO,Not Verified,INDIVIDUAL,5000,7,70902,14.4,0,47,46,17,fals,5.32,268.63,2016 + 60 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12800,0,72500,22.73,0,36,20,12,tru,12.59,-12240.79,2015 + 36 
months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,15000,2,58000,20.57,0,74.1,15,13,fals,15.61,3953.92,2014 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,25000,2,90000,10.72,1,58.3,20,17,fals,17.57,728.24,2014 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,6150,5,35000,16.5,1,27.8,19,21,fals,12.99,257.1,2014 + 36 months,MORTGAGE,home_improvement,NM,Verified,INDIVIDUAL,5000,4,98000,8.76,0,98.2,14,19,fals,17.76,1268.4,2013 + 36 months,MORTGAGE,debt_consolidation,WY,Verified,INDIVIDUAL,27375,6,78000,29.4,4,62.5,56,15,fals,19.52,4268.02,2015 + 36 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,15000,10,62500,24.48,1,66.9,15,18,fals,16.2,3897.55,2013 + 36 months,MORTGAGE,house,FL,Verified,INDIVIDUAL,24250,0,55000,12.33,0,37.7,22,19,fals,11.14,2006.92,2013 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,12000,2,29000,39.9,0,41.6,25,12,fals,21.97,1285.11,2016 + 36 months,RENT,credit_card,NV,Verified,INDIVIDUAL,11800,7,46225,6.67,0,36.2,12,13,fals,15.31,854.25,2013 + 60 months,MORTGAGE,debt_consolidation,CT,Verified,INDIVIDUAL,30000,10,75000,11.84,0,35.7,13,10,tru,14.47,-8838.6,2014 + 36 months,RENT,debt_consolidation,MD,Not Verified,INDIVIDUAL,15000,0,67000,3.87,3,87,51,20,tru,14.77,-8253.77,2016 + 36 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,6250,10,24000,26.2,0,14.6,35,18,fals,7.91,563.45,2016 + 36 months,RENT,other,MA,Verified,INDIVIDUAL,35000,10,400000,6.66,0,65.6,25,15,fals,19.72,11575.77,2012 + 36 months,MORTGAGE,home_improvement,AZ,Verified,INDIVIDUAL,28000,0,110000,8.07,1,13,33,11,fals,7.62,3070.47,2013 + 36 months,RENT,debt_consolidation,MD,Verified,INDIVIDUAL,9600,10,56862,11.65,0,52.9,19,13,fals,7.89,917.03,2015 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,21000,3,50028,8.88,0,46.5,7,4,tru,12.29,-12536.72,2015 + 36 months,MORTGAGE,credit_card,MA,Verified,INDIVIDUAL,24000,3,62000,22.09,0,71.6,21,24,fals,13.53,5332.59,2014 + 36 
months,RENT,credit_card,TX,Verified,INDIVIDUAL,29975,3,75000,29.9,3,58.3,32,29,fals,14.64,4413.16,2014 + 36 months,RENT,other,FL,Not Verified,INDIVIDUAL,2500,2,120000,6.91,1,54.7,34,12,fals,10.49,86.78,2015 + 36 months,RENT,credit_card,MD,Verified,INDIVIDUAL,10000,4,72000,14.53,0,39.5,14,6,fals,9.49,533.57,2015 + 36 months,RENT,debt_consolidation,MN,Verified,INDIVIDUAL,20000,10,75000,31.34,0,62.3,42,22,fals,13.33,2043.51,2015 + 60 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,22125,10,75000,21.76,0,55,32,20,tru,12.69,-11413.4,2015 + 60 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,16175,4,100000,15,0,64.5,14,10,tru,19.99,-14925.58,2016 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,10450,null,38168,38.65,5,59.4,48,17,tru,17.86,-6824.59,2015 + 60 months,RENT,credit_card,CO,Verified,INDIVIDUAL,19200,4,70000,18.67,0,64.7,23,13,fals,15.8,3891.66,2013 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,25325,9,137500,16.19,4,3.2,24,14,tru,19.99,-2399.92,2014 + 36 months,MORTGAGE,credit_card,MD,Verified,INDIVIDUAL,15000,0,56270,26.85,1,65.7,30,14,tru,10.99,1203.33,2014 + 60 months,RENT,other,PA,Not Verified,INDIVIDUAL,23800,6,70000,5.78,1,49,27,8,tru,25.83,-14125.42,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,4500,10,53000,22.62,0,78.6,22,9,fals,15.31,516.39,2012 + 36 months,RENT,debt_consolidation,WA,Verified,INDIVIDUAL,20000,1,95000,22.64,0,48.1,32,14,fals,8.39,1749.15,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,25000,0,75000,34.62,0,30.1,40,16,fals,14.99,6036.62,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,7500,7,25000,25.01,0,82.8,31,18,fals,14.47,1186.44,2014 + 36 months,MORTGAGE,credit_card,AL,Not Verified,INDIVIDUAL,28000,10,125000,28.98,0,47.2,23,14,fals,8.18,1562.8,2015 + 36 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,12800,4,54000,15.09,0,66.4,16,13,fals,12.12,1610.62,2012 + 60 
months,RENT,debt_consolidation,NV,Verified,INDIVIDUAL,10850,3,37000,25.11,0,9.6,17,7,tru,19.52,-5129.73,2014 + 60 months,MORTGAGE,credit_card,TN,Verified,INDIVIDUAL,35000,7,96600,8.86,0,54,11,10,fals,11.53,5486.08,2015 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,5800,10,23000,29.28,0,42.5,13,11,fals,13.65,1187.06,2014 + 36 months,MORTGAGE,home_improvement,VA,Not Verified,INDIVIDUAL,4000,1,81000,31.18,2,87.1,39,20,fals,12.99,837.69,2014 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,10000,1,90000,9.89,0,31.8,11,7,fals,11.53,265.93,2015 + 36 months,MORTGAGE,debt_consolidation,AR,Verified,INDIVIDUAL,12500,8,40000,7.98,0,38.9,10,14,fals,9.75,836.17,2016 + 60 months,MORTGAGE,debt_consolidation,WA,Not Verified,INDIVIDUAL,18000,10,50000,16.99,0,63.1,14,11,fals,13.66,3816.36,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,8400,1,67000,17.48,1,35.4,17,16,fals,7.62,1023.18,2012 + 36 months,RENT,debt_consolidation,CT,Verified,INDIVIDUAL,4800,8,150000,11.01,3,75.3,20,10,fals,9.49,256.37,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,14400,2,265000,17.4,1,46.2,26,15,fals,8.24,418.81,2017 + 36 months,RENT,major_purchase,CO,Verified,INDIVIDUAL,2600,10,87000,17.81,0,62.9,24,24,fals,12.79,219.61,2016 + 36 months,OWN,other,CA,Verified,INDIVIDUAL,10000,10,48000,10.9,0,96.8,10,10,tru,14.65,-8429.07,2012 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,4150,1,36500,18.04,2,32.7,27,12,fals,12.99,488.09,2015 + 36 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,26850,5,515000,8.94,0,76.5,33,19,fals,11.67,4324.1,2014 + 60 months,MORTGAGE,credit_card,UT,Verified,INDIVIDUAL,12000,8,110000,11.65,0,72.8,39,13,fals,13.67,137.92,2013 + 36 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,21000,null,120000,3.47,0,35.4,32,35,fals,15.1,5184.13,2013 + 60 months,RENT,debt_consolidation,MN,Verified,INDIVIDUAL,23275,10,74200,14.17,0,67,15,15,fals,14.09,8125.62,2012 + 60 months,MORTGAGE,debt_consolidation,RI,Not 
Verified,INDIVIDUAL,25000,7,72000,33.95,0,51.5,44,14,tru,15.31,-18455.4,2016 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,8000,10,73000,9.3,0,75.2,48,39,fals,10.99,1245.07,2014 + 60 months,OWN,debt_consolidation,AZ,Verified,INDIVIDUAL,35000,10,80000,20.79,0,11.3,34,14,tru,12.99,-20403.53,2014 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,15000,7,75000,17.42,0,56.8,32,16,fals,13.11,2248.36,2013 + 60 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,28000,6,83000,26.27,0,57.5,26,22,fals,15.31,11494.89,2013 + 36 months,MORTGAGE,debt_consolidation,NJ,Not Verified,INDIVIDUAL,15000,4,80000,13.56,0,44.4,16,12,fals,13.11,2598.86,2012 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,25000,1,80000,30.41,0,32.8,61,14,fals,17.27,4140.22,2016 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,25000,8,90000,11.97,0,90.4,18,12,fals,10.99,3218.92,2014 + 60 months,RENT,credit_card,CA,Verified,INDIVIDUAL,12000,null,38000,16.99,1,88,13,20,fals,16.55,527.72,2015 + 36 months,RENT,debt_consolidation,MA,Verified,INDIVIDUAL,19200,2,200000,5.33,0,51.7,34,13,fals,10.15,2124.95,2014 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,3000,null,11000,14.18,0,49.5,9,19,fals,13.05,641.5,2013 + 60 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,15875,10,65904,35.54,0,53.9,14,6,fals,19.99,3453.99,2015 + 60 months,MORTGAGE,credit_card,NJ,Verified,INDIVIDUAL,29000,null,58000,22.62,0,83.5,26,28,fals,20.99,7827.1,2015 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,16800,10,97000,19.37,0,95,18,12,tru,9.99,-11692.27,2015 + 36 months,RENT,debt_consolidation,OK,Not Verified,INDIVIDUAL,16000,3,82680,21.36,0,87.1,30,17,fals,12.12,3151.84,2013 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,17975,0,60000,11.67,0,51.4,20,11,fals,20.2,6565.63,2014 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,10000,null,31776,11.03,1,27.2,44,20,fals,17.99,764.19,2016 + 36 
months,MORTGAGE,debt_consolidation,AL,Not Verified,INDIVIDUAL,3750,5,53000,24.08,0,92.9,28,42,fals,7.9,473.42,2012 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10000,7,58000,24.25,3,55.1,16,16,tru,14.46,-8335.53,2016 + 36 months,RENT,credit_card,NY,Verified,INDIVIDUAL,27350,10,60000,13.02,0,85.1,14,7,fals,15.41,1159.88,2015 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,10000,10,70000,12.24,1,38.8,46,15,fals,12.79,849.06,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,15000,10,70000,16.66,0,17.5,54,20,fals,12.29,2531.18,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,8000,10,45000,13.6,4,65.4,26,13,fals,12.99,965.55,2015 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,13150,10,68052,16.45,3,24.8,31,17,tru,12.49,-7616.91,2014 + 36 months,MORTGAGE,other,IL,Verified,INDIVIDUAL,1000,10,69000,30.44,0,17.4,50,22,fals,13.99,37.39,2016 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,10000,3,75000,15.23,0,39.8,11,9,tru,20.99,-6507.08,2015 + 36 months,RENT,debt_consolidation,UT,Not Verified,INDIVIDUAL,5075,0,60000,2.6,0,47.2,8,4,tru,13.65,-4384.6,2014 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,24000,1,80000,17.54,0,54.2,36,14,fals,8.9,3434.76,2013 + 36 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,16000,4,100000,13.69,0,89,10,7,fals,11.99,3140.95,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,11975,3,72000,10.83,0,61.1,12,5,fals,16.99,1258.63,2015 + 60 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,22250,6,50000,25.4,0,81.6,27,14,tru,15.31,-5291.88,2012 + 36 months,RENT,credit_card,FL,Not Verified,INDIVIDUAL,14700,10,42000,27.6,0,63,30,11,tru,11.99,-2882.97,2014 + 36 months,RENT,other,PA,Verified,INDIVIDUAL,5000,10,52000,16.8,0,0,18,14,fals,16.29,526.47,2014 + 36 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,12400,0,48000,12.5,0,17.9,34,16,fals,6.03,1186.39,2013 + 60 
months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,35000,10,90000,34.37,0,95.5,43,20,fals,18.25,2177.85,2015 + 36 months,RENT,credit_card,PA,Verified,INDIVIDUAL,20000,10,59000,37.49,0,47.9,39,14,fals,13.66,2927.38,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,35000,10,116000,12.45,0,66.5,35,17,fals,14.46,1000.01,2016 + 60 months,RENT,other,MA,Verified,INDIVIDUAL,10000,10,60000,20.9,0,71.9,20,9,tru,18.99,-4631.61,2014 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,6000,6,39000,16.74,0,80.3,9,28,fals,8.9,858.7,2012 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,7000,10,67000,24.04,0,74,28,22,fals,9.99,742.13,2015 + 36 months,RENT,debt_consolidation,AL,Verified,INDIVIDUAL,6500,3,38000,16.93,0,46.9,7,11,fals,10.99,762.77,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10500,10,75000,8.4,0,97.2,15,23,fals,21,3741.15,2013 + 36 months,RENT,debt_consolidation,NV,Verified,INDIVIDUAL,6500,10,36000,18.3,1,50.4,14,15,tru,17.27,-2209.33,2013 + 36 months,OWN,debt_consolidation,MI,Not Verified,INDIVIDUAL,10000,10,52000,14.63,0,54.3,10,12,fals,10.64,1724.63,2013 + 36 months,RENT,other,CA,Verified,INDIVIDUAL,35000,10,120000,15.48,0,85.7,27,15,fals,18.99,10498.86,2014 + 60 months,MORTGAGE,debt_consolidation,KS,Verified,INDIVIDUAL,22250,1,50000,21.65,0,59.7,14,6,tru,20.2,-12329.19,2013 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,7000,5,32000,22.01,0,27.6,20,26,fals,18.55,2179.99,2013 + 36 months,RENT,moving,RI,Verified,INDIVIDUAL,5000,7,24000,6.45,0,56.3,11,8,fals,13.99,806.73,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,0,40000,9.45,0,47.4,19,11,tru,14.09,88.15,2013 + 36 months,OWN,debt_consolidation,MI,Not Verified,INDIVIDUAL,10000,1,68000,16.25,0,31.3,24,23,tru,8.19,-1212.38,2014 + 60 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,26400,8,102913,15.94,0,31.4,37,13,fals,9.17,1167.92,2015 + 60 
months,MORTGAGE,other,NY,Verified,INDIVIDUAL,24000,8,180000,7.39,0,38.1,38,19,fals,25.8,16484.11,2013 + 60 months,MORTGAGE,home_improvement,IL,Verified,INDIVIDUAL,16000,7,41600,10.99,0,55,15,7,fals,21.98,10294.58,2012 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,20000,6,115000,17.39,0,73.5,26,10,fals,12.39,764.24,2014 + 36 months,RENT,credit_card,UT,Not Verified,INDIVIDUAL,9600,3,41364,14.27,0,92.3,9,9,tru,13.98,-705.78,2013 + 60 months,RENT,debt_consolidation,OH,Verified,INDIVIDUAL,10000,10,52000,19.92,1,47.7,34,15,tru,15.31,-7905.91,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,5500,10,20000,21.25,0,87.2,12,30,fals,14.47,432.49,2013 + 36 months,MORTGAGE,debt_consolidation,IN,Not Verified,INDIVIDUAL,4000,8,42000,23.89,0,50.7,28,13,fals,18.24,410,2014 + 36 months,MORTGAGE,credit_card,MD,Verified,INDIVIDUAL,13000,3,75000,13.01,0,81,25,15,fals,7.49,419.88,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,13850,6,45000,34.21,0,22.5,40,29,fals,6.03,70.48,2012 + 36 months,OWN,other,OH,Verified,INDIVIDUAL,5000,2,70695,15.04,0,39.3,11,5,fals,13.49,2.22,2016 + 36 months,OWN,credit_card,CA,Not Verified,INDIVIDUAL,12000,0,36000,11.67,0,29.1,20,17,tru,7.26,-12000,2015 + 36 months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,10000,9,45000,28.59,0,41.2,47,18,fals,6.89,759.46,2015 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,35000,6,88000,21.7,0,54.9,37,11,fals,19.19,8643.97,2015 + 36 months,RENT,debt_consolidation,DE,Verified,INDIVIDUAL,24000,10,115000,24.64,0,73.7,19,27,tru,12.69,-7046.66,2015 + 60 months,MORTGAGE,home_improvement,MI,Verified,INDIVIDUAL,15000,8,35000,15.71,0,59.6,28,14,fals,17.14,3531.9,2014 + 36 months,OWN,debt_consolidation,NJ,Verified,INDIVIDUAL,35000,null,150000,20.58,0,42.2,53,17,fals,14.46,2204.48,2016 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,3000,10,108000,7.12,3,49.7,24,14,fals,17.1,694.11,2013 + 36 months,RENT,debt_consolidation,CA,Not 
Verified,INDIVIDUAL,15950,7,137000,10.63,0,32.7,44,13,fals,6.03,929.33,2014 + 60 months,MORTGAGE,credit_card,CT,Verified,INDIVIDUAL,18600,4,40000,16.71,0,97.6,11,7,fals,20.99,11585.18,2012 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,15200,2,50000,16.39,0,72.4,26,15,fals,11.99,2972.22,2013 + 36 months,MORTGAGE,debt_consolidation,NJ,Not Verified,INDIVIDUAL,12000,10,108000,15.36,0,58.8,30,28,fals,6.49,1237.85,2014 + 60 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,10500,3,45000,11.04,0,19.2,16,27,tru,18.84,-5081.79,2015 + 36 months,OWN,credit_card,TX,Verified,INDIVIDUAL,15000,6,65000,18.49,0,65,21,15,fals,11.48,1417.25,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,19200,10,250000,2.25,1,53.3,31,39,fals,8.9,2728.72,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,6000,5,65000,8.25,0,21.1,12,12,fals,13.33,1101.78,2015 + 36 months,MORTGAGE,car,OK,Not Verified,INDIVIDUAL,6000,0,48000,9.74,0,2.1,8,9,fals,6.03,404.24,2014 + 36 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,20000,9,52000,18.07,0,49.1,25,11,fals,10.75,1377.43,2016 + 36 months,RENT,credit_card,OR,Not Verified,INDIVIDUAL,7500,1,32000,27.61,0,91.8,25,11,fals,10.99,1295.99,2013 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,31000,10,115000,25.21,0,70.5,26,17,fals,14.16,6578.98,2014 + 36 months,OWN,credit_card,CA,Verified,INDIVIDUAL,15000,3,45000,14.09,0,39.5,13,8,fals,10.49,579.91,2017 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,10800,0,55000,38.14,0,62.1,32,21,tru,10.99,-5156.71,2015 + 36 months,RENT,credit_card,NJ,Verified,INDIVIDUAL,15000,2,46675,23.89,0,60.3,24,9,fals,13.33,2239.52,2015 + 60 months,MORTGAGE,credit_card,MO,Not Verified,INDIVIDUAL,18000,1,70054,26.01,0,62.8,40,16,fals,17.57,3614.96,2015 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,15000,9,34000,23.9,0,77.6,11,10,fals,16.29,767.8,2014 + 36 
months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,25000,4,65200,24.94,0,57.5,28,19,fals,12.85,4900.65,2014 + 36 months,MORTGAGE,debt_consolidation,WV,Verified,INDIVIDUAL,12000,10,120000,22.34,0,59.8,40,9,fals,11.67,1855.92,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,6250,9,35655,23.58,2,55.4,15,6,tru,14.3,-1583.26,2013 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,10000,10,51000,20.05,0,37.1,20,12,fals,14.99,1974.25,2014 + 36 months,MORTGAGE,credit_card,KS,Verified,INDIVIDUAL,28000,10,120000,20.98,0,99,18,16,fals,7.26,2868.86,2015 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,11000,10,40000,23.52,0,63.1,20,14,fals,9.17,1624.04,2014 + 36 months,MORTGAGE,debt_consolidation,MD,Not Verified,INDIVIDUAL,11000,3,42000,21.17,0,82,49,25,fals,15.59,2160.27,2015 + 36 months,MORTGAGE,debt_consolidation,MT,Not Verified,INDIVIDUAL,25000,10,94000,19.65,0,47.4,31,29,fals,7.62,2053.04,2013 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,2500,null,70000,23.97,0,78.8,18,25,fals,15.61,284.88,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12000,6,250000,8.49,0,62.5,26,20,fals,6.99,949.52,2014 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,3000,null,21000,17.71,0,54,16,15,fals,11.14,542.91,2012 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,31200,10,99000,21.88,0,62.1,23,11,tru,18.55,-16171.6,2015 + 60 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,35000,7,150000,14.76,0,83.4,38,11,fals,24.5,18017.77,2014 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,28000,7,115000,3.11,0,22,22,12,fals,7.12,143.99,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,7700,10,124000,13.42,0,33.1,29,12,fals,14.99,791.54,2016 + 36 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,10200,2,60000,19.78,0,56,31,12,fals,13.49,528.97,2017 + 36 months,OWN,other,CA,Verified,INDIVIDUAL,5000,0,90000,9.98,0,52,29,19,fals,18.25,1522.18,2013 + 36 
months,RENT,debt_consolidation,MA,Not Verified,INDIVIDUAL,7175,4,30000,20.8,0,49.7,12,5,fals,20.5,122.58,2014 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,20000,10,79364,33.63,0,94.2,16,19,tru,14.46,-12184.47,2016 + 36 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,11500,null,35000,32.81,0,78,16,29,fals,15.22,2895.35,2013 + 36 months,RENT,debt_consolidation,SC,Verified,INDIVIDUAL,20000,10,64000,15.19,0,80.4,30,23,fals,13.11,3978.86,2012 + 36 months,MORTGAGE,credit_card,MT,Not Verified,INDIVIDUAL,10000,10,108000,8.16,1,49,23,17,fals,15.61,130.08,2014 + 36 months,RENT,house,WA,Verified,INDIVIDUAL,15000,10,108000,4.4,1,31.9,22,13,fals,17.27,3255.3,2012 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,18000,10,100000,5.85,0,6.3,25,16,fals,7.26,761.53,2015 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,14000,2,98750,9.32,1,44.3,35,10,tru,10.99,-5462.53,2014 + 36 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,20675,4,74620,9.01,1,54.3,20,19,fals,15.99,3105.14,2014 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,10000,0,50000,31.5,0,36.8,32,11,fals,7.89,548.27,2015 + 36 months,OWN,debt_consolidation,TX,Not Verified,INDIVIDUAL,13000,1,58000,17.28,5,71.7,26,8,fals,17.77,3668.96,2013 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,7800,10,72000,7.9,2,48.5,28,11,tru,20.2,-4242.58,2014 + 36 months,MORTGAGE,debt_consolidation,NE,Verified,INDIVIDUAL,8000,6,108000,25.11,2,75.9,16,21,fals,11.49,235.55,2017 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,8000,10,35000,13.51,1,34,49,21,fals,10.99,357.67,2015 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,32500,4,90012,24.6,1,74.1,18,11,fals,12.69,5162.03,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,14000,8,83000,21.3,1,53.6,26,33,fals,10.99,2307.24,2014 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,12000,10,93384,29.27,0,65.9,35,18,fals,8.67,1491.81,2014 + 36 
months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,7525,10,82000,18.47,0,73.2,21,15,fals,13.33,1011.23,2015 + 36 months,MORTGAGE,credit_card,PA,Verified,INDIVIDUAL,30000,8,73165,13.01,0,52.2,18,16,fals,12.49,4297.3,2014 + 36 months,RENT,credit_card,LA,Not Verified,INDIVIDUAL,5775,10,49400,18.68,0,46.2,47,20,fals,17.1,1660.4,2013 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,20000,0,75000,17.39,0,54.8,29,13,fals,19.05,6314.07,2013 + 60 months,MORTGAGE,debt_consolidation,MN,Verified,INDIVIDUAL,15250,10,55000,21.62,0,66.5,21,17,fals,10.64,4112.71,2013 + 36 months,MORTGAGE,credit_card,NY,Verified,INDIVIDUAL,7200,10,40000,15.72,0,88.5,18,15,fals,12.49,828.96,2014 + 36 months,MORTGAGE,other,FL,Verified,INDIVIDUAL,40000,10,102000,9.87,0,90.8,28,28,fals,8.24,1046.09,2017 + 36 months,RENT,other,NJ,Verified,INDIVIDUAL,12000,10,100000,10.29,4,20.5,36,17,fals,11.47,103.15,2016 + 36 months,MORTGAGE,other,MI,Verified,INDIVIDUAL,3500,null,42000,24.11,0,67,29,42,fals,17.57,1028.07,2014 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,7000,5,750000,2.87,1,59,15,20,fals,9.49,254.1,2016 + 36 months,OWN,credit_card,CO,Verified,INDIVIDUAL,9600,10,34000,33.18,1,30.6,17,21,fals,12.99,2042.93,2014 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,17000,4,60000,16.21,0,49.4,16,8,tru,14.49,-9401.9,2014 + 36 months,RENT,credit_card,NV,Verified,INDIVIDUAL,35000,0,100000,11.31,0,65,22,16,fals,14.46,2372.94,2016 + 36 months,RENT,debt_consolidation,SC,Not Verified,INDIVIDUAL,6000,8,64000,4.5,1,66.4,22,13,fals,15.61,582.69,2015 + 36 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,8400,2,55000,31.38,0,46,22,11,fals,8.18,1054.12,2015 + 36 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,35000,1,116000,16.55,0,72.1,28,16,fals,12.69,1255.24,2015 + 60 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,18825,4,51000,24.8,0,36.3,23,15,tru,22.99,-17258.93,2015 + 36 
months,RENT,debt_consolidation,HI,Verified,INDIVIDUAL,35000,10,110000,15.36,0,46.8,17,20,tru,11.99,-10353.47,2013 + 60 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,20100,5,130000,33.23,0,59,37,20,fals,18.99,3811.08,2016 + 36 months,MORTGAGE,major_purchase,CA,Verified,INDIVIDUAL,3600,null,47000,26.38,5,10.6,41,22,fals,8.18,258.61,2015 + 36 months,MORTGAGE,credit_card,FL,Not Verified,INDIVIDUAL,5000,4,32000,5.96,0,47.8,15,11,fals,10.99,892.12,2013 + 60 months,MORTGAGE,debt_consolidation,ND,Not Verified,INDIVIDUAL,10000,7,50000,16.92,0,18.9,30,20,fals,12.74,172.42,2017 + 60 months,MORTGAGE,credit_card,MN,Verified,INDIVIDUAL,35000,5,267000,16.96,1,67.1,59,19,tru,23.5,-24035.75,2013 + 60 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,25000,2,76000,28.61,0,81.4,18,16,fals,14.09,8092.92,2012 + 36 months,MORTGAGE,credit_card,NC,Verified,INDIVIDUAL,33100,0,72000,26.45,0,76.1,29,16,tru,16.99,-29522.64,2016 + 36 months,RENT,other,OH,Not Verified,INDIVIDUAL,16000,5,65000,29.71,0,52.1,28,10,fals,16.2,4346.94,2013 + 60 months,MORTGAGE,home_improvement,MI,Not Verified,INDIVIDUAL,18000,10,66000,16.15,0,62.8,28,32,tru,19.99,-1312.35,2014 + 36 months,MORTGAGE,credit_card,WA,Verified,INDIVIDUAL,18000,8,48000,12.88,0,83,16,14,fals,10.99,930.09,2014 + 36 months,RENT,debt_consolidation,UT,Verified,INDIVIDUAL,5000,1,30000,25.08,0,91,10,6,fals,10.99,662.78,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,10000,8,40000,1.5,0,2,16,18,fals,11.99,1955.4,2014 + 36 months,MORTGAGE,debt_consolidation,CT,Verified,INDIVIDUAL,35000,10,85000,22.47,0,30.5,39,25,fals,10.64,6036.29,2013 + 36 months,RENT,other,SC,Verified,INDIVIDUAL,6075,5,89000,9.76,0,78.2,9,11,fals,17.77,1802.03,2012 + 36 months,RENT,debt_consolidation,KY,Verified,INDIVIDUAL,16000,10,65000,11.74,0,50.3,43,20,fals,15.1,3690.56,2013 + 60 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,29775,8,84000,29.7,0,24,33,14,fals,13.18,4070.8,2015 + 36 
months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,30000,10,100000,18.66,0,92.7,35,26,fals,12.49,1205.53,2014 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,19475,8,100000,20.76,0,59.7,32,9,tru,12.12,-12347.33,2012 + 60 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,15000,5,65000,29.01,2,21.4,52,11,tru,11.49,-12801.35,2016 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,16200,2,45000,15.65,1,51.9,21,20,tru,13.99,-7290.75,2015 + 36 months,RENT,major_purchase,WA,Not Verified,INDIVIDUAL,15000,3,80000,0.75,0,12.9,6,7,fals,12.69,1539.04,2015 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,12950,2,70000,13.1,2,20.8,31,11,fals,14.99,1975.15,2014 + 60 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,12600,4,35000,33.92,0,22.7,14,9,fals,18.99,1399.55,2015 + 36 months,RENT,other,CA,Not Verified,INDIVIDUAL,1000,0,60000,25.18,0,101.9,9,10,fals,18.24,274.54,2014 + 60 months,MORTGAGE,debt_consolidation,SC,Verified,INDIVIDUAL,19050,10,65000,16.73,0,86.9,11,16,tru,24.99,-966.09,2014 + 36 months,MORTGAGE,debt_consolidation,NV,Not Verified,INDIVIDUAL,17500,10,48000,37.67,1,50.8,42,20,fals,16.29,4670.53,2014 + 36 months,RENT,other,NC,Verified,INDIVIDUAL,2400,null,62000,7.16,0,85.4,15,15,fals,17.1,688.77,2013 + 60 months,MORTGAGE,debt_consolidation,AR,Verified,INDIVIDUAL,30800,9,80000,21.19,1,41.5,51,30,fals,12.29,5190.06,2015 + 60 months,MORTGAGE,medical,VT,Verified,INDIVIDUAL,20000,10,75000,12.37,0,88.8,26,11,fals,20.99,5095.29,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,1000,null,11856,24.39,0,57.9,39,20,fals,18.25,305.98,2013 + 60 months,OWN,home_improvement,ME,Verified,INDIVIDUAL,32000,4,169000,16.74,0,51.1,32,15,fals,19.99,390.92,2016 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,16000,10,65000,20.77,0,66.8,21,14,tru,6.03,-4626.34,2014 + 60 months,RENT,debt_consolidation,GA,Verified,INDIVIDUAL,35000,1,133000,16.77,2,49.2,50,19,tru,21.99,-26344.53,2015 + 36 
months,RENT,other,CA,Verified,INDIVIDUAL,5000,3,90688,13.75,0,95.9,26,14,fals,14.49,1043.69,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,16000,8,53000,13.72,0,51.8,29,20,fals,6.49,1019.06,2016 + 36 months,MORTGAGE,debt_consolidation,VA,Verified,INDIVIDUAL,28100,10,74000,23.94,0,28.8,44,15,fals,19.24,6071.89,2015 + 36 months,MORTGAGE,other,FL,Verified,INDIVIDUAL,6000,10,40594,6.45,0,62.1,7,10,fals,11.99,182.47,2016 + 36 months,RENT,debt_consolidation,CT,Not Verified,INDIVIDUAL,3000,2,92000,13.66,1,16.2,20,8,fals,6.99,140.76,2016 + 36 months,MORTGAGE,credit_card,KY,Verified,INDIVIDUAL,7000,10,80000,29.57,2,77,48,18,fals,11.53,748.59,2015 + 36 months,RENT,credit_card,NJ,Not Verified,INDIVIDUAL,24000,0,120000,27.24,0,21.8,13,5,fals,11.49,1195.95,2016 + 36 months,MORTGAGE,debt_consolidation,AL,Verified,INDIVIDUAL,8000,10,65000,17.02,0,32.5,45,15,fals,13.68,1461.76,2013 + 60 months,OWN,debt_consolidation,KY,Verified,INDIVIDUAL,19650,2,65000,13.96,0,38,36,14,fals,10.16,3853.1,2013 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,4650,3,71000,28.28,0,72.3,42,26,fals,14.64,166.38,2014 + 60 months,OWN,debt_consolidation,FL,Verified,INDIVIDUAL,30000,10,72000,18.13,0,49.7,21,19,fals,14.99,2668.05,2014 + 36 months,MORTGAGE,debt_consolidation,MS,Verified,INDIVIDUAL,12000,1,55500,23.83,0,53.9,25,15,tru,9.75,-8521.51,2016 + 60 months,MORTGAGE,debt_consolidation,UT,Verified,INDIVIDUAL,35000,4,105000,16.94,0,84.2,28,14,fals,25.8,13001.09,2015 + 36 months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,6000,6,60000,20.38,2,48.8,25,9,fals,12.99,477.11,2013 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,30400,6,186000,14.72,0,72,36,24,fals,20.99,10189.44,2014 + 36 months,OWN,debt_consolidation,MS,Not Verified,INDIVIDUAL,12000,4,33560,29.72,0,51,17,7,fals,7.69,1341.9,2014 + 60 months,MORTGAGE,debt_consolidation,RI,Verified,INDIVIDUAL,16000,10,79500,22.22,0,74.4,37,15,tru,15.8,-2807.96,2013 + 36 
months,MORTGAGE,debt_consolidation,WY,Verified,INDIVIDUAL,10000,null,72000,27.83,2,68.7,37,22,fals,13.53,1220.48,2013 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,5000,8,38000,19.33,0,69.6,13,12,fals,9.67,721.11,2013 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,16000,5,95000,18.85,3,65.5,27,26,fals,11.99,1455.35,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,12800,10,85000,5.56,0,95.2,7,16,fals,13.53,2844.05,2014 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12000,0,30000,18,0,67.1,11,7,fals,14.85,2198.81,2015 + 60 months,MORTGAGE,credit_card,WA,Verified,INDIVIDUAL,18950,10,53000,28.66,0,43.3,21,12,fals,21,331.95,2013 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,25000,1,60000,11.94,0,11.4,35,14,fals,13.33,4206.64,2015 + 60 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,32000,5,175000,27.45,0,84.1,32,15,tru,18.99,-21278.43,2015 + 36 months,RENT,credit_card,TX,Not Verified,INDIVIDUAL,10000,8,45000,23.58,0,62.8,19,8,fals,14.99,1004.94,2015 + 60 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,17625,10,110000,5.88,0,41.6,27,14,fals,18.25,3038.01,2013 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,10000,10,32000,13.35,2,40.7,13,12,fals,15.31,2377.11,2012 + 60 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,17325,9,76000,15.11,0,97.1,22,18,tru,25.57,-12188.24,2015 + 36 months,RENT,credit_card,MO,Not Verified,INDIVIDUAL,20000,10,40000,30.63,0,56.7,44,8,tru,10.99,-10082.88,2015 + 60 months,MORTGAGE,debt_consolidation,KY,Verified,INDIVIDUAL,25000,5,135456,31.45,0,35.7,50,25,tru,13.66,-13195.91,2015 + 36 months,OWN,credit_card,CA,Verified,INDIVIDUAL,27500,2,55000,23.14,0,62.8,19,15,fals,7.89,2575.88,2015 + 36 months,MORTGAGE,debt_consolidation,MD,Not Verified,INDIVIDUAL,10000,null,49000,24.2,0,30.8,29,11,fals,9.17,1024.71,2015 + 36 months,MORTGAGE,debt_consolidation,KY,Verified,INDIVIDUAL,9950,2,30000,12.4,0,32.8,16,8,fals,8.9,957.96,2013 + 36 
months,MORTGAGE,credit_card,PA,Not Verified,INDIVIDUAL,4500,5,35000,12.62,4,30.7,25,12,fals,11.55,845.96,2013 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,12275,4,60000,18.98,0,1.4,33,20,fals,12.99,1508.4,2014 + 36 months,MORTGAGE,debt_consolidation,MI,Not Verified,INDIVIDUAL,12000,10,75000,18.72,1,75.2,23,18,fals,13.67,2695.52,2012 + 60 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,21000,10,80000,20.09,0,60,9,24,fals,14.99,2130.68,2016 + 36 months,OWN,debt_consolidation,NJ,Verified,INDIVIDUAL,6800,3,20000,26.53,0,58.1,14,5,tru,17.99,-5086.13,2016 + 36 months,OWN,debt_consolidation,PA,Verified,INDIVIDUAL,9000,10,39000,18.15,0,43.9,17,11,fals,10.99,471.72,2015 + 60 months,MORTGAGE,credit_card,GA,Verified,INDIVIDUAL,32000,10,325000,10.44,0,59.9,38,20,fals,17.56,3171.84,2013 + 36 months,OWN,debt_consolidation,AZ,Verified,INDIVIDUAL,13250,1,60000,14.76,1,57.6,31,14,fals,19.52,3413.43,2013 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,20000,3,75000,29.33,0,34,43,13,fals,11.44,2242.45,2015 + 36 months,RENT,debt_consolidation,MI,Not Verified,INDIVIDUAL,3825,6,38000,13.48,0,75,7,15,fals,18.85,842.7,2013 + 36 months,RENT,credit_card,GA,Verified,INDIVIDUAL,10075,6,35000,28.37,0,87.1,19,12,fals,17.77,2998.78,2012 + 60 months,MORTGAGE,other,IN,Verified,INDIVIDUAL,14275,7,42000,21.23,0,54.5,25,13,tru,28.99,-6495.88,2015 + 60 months,MORTGAGE,debt_consolidation,CT,Verified,INDIVIDUAL,18000,10,72000,13.63,0,45.8,30,14,fals,16.99,3319.61,2014 + 36 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,20000,9,96000,20.88,0,44.8,29,17,tru,15.31,-13738.27,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,15000,6,185000,10.1,0,28.7,33,15,fals,10.64,1516.94,2015 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,27325,7,72000,19.35,0,81.1,21,13,fals,19.47,6823.38,2014 + 60 months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,21600,8,71000,12.2,0,81.9,15,8,tru,17.57,-1479.61,2014 + 36 
months,RENT,vacation,GA,Verified,INDIVIDUAL,4225,10,82000,19.05,0,55.1,28,30,fals,14.99,361.49,2016 + 36 months,RENT,debt_consolidation,MI,Not Verified,INDIVIDUAL,3000,2,30000,9.24,0,37,38,24,fals,17.57,872.18,2014 + 60 months,RENT,debt_consolidation,UT,Verified,INDIVIDUAL,22750,7,58000,30.63,0,92,24,22,tru,21.49,5902.54,2012 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,16000,2,37558,15.08,1,28.4,37,16,fals,11.55,4796.89,2013 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,35000,10,200000,8.23,1,35.8,17,27,fals,17.86,5099.06,2015 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,6000,0,40000,15.06,0,38.3,19,9,fals,9.67,521.6,2014 + 36 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,3000,3,18000,13.87,0,18.4,16,8,fals,10.99,516.93,2014 + 36 months,RENT,major_purchase,MO,Not Verified,INDIVIDUAL,30000,5,400000,16.78,0,38.5,48,15,fals,19.99,714.06,2016 + 60 months,MORTGAGE,credit_card,MI,Not Verified,INDIVIDUAL,15000,2,40000,6.54,0,72.1,17,16,tru,12.99,-7814.27,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,14825,9,92000,23.79,0,45.7,24,13,fals,16.29,1648.55,2014 + 60 months,MORTGAGE,credit_card,MI,Verified,INDIVIDUAL,12000,10,96200,19,0,111.1,29,20,fals,13.99,1161.51,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,15750,7,35000,12.77,1,26.6,23,12,fals,20.49,2345.41,2015 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,3000,10,95000,17.96,0,51,22,11,fals,11.49,101.14,2017 + 36 months,OWN,credit_card,NY,Not Verified,INDIVIDUAL,8925,1,40000,21.33,6,65.8,22,11,fals,7.69,323.09,2014 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,20000,6,100000,29.26,0,81.4,43,23,fals,14.33,2064.01,2013 + 36 months,OWN,debt_consolidation,SD,Verified,INDIVIDUAL,8000,2,30000,36.4,1,80.6,33,24,fals,12.99,430.16,2016 + 36 months,OWN,credit_card,CA,Verified,INDIVIDUAL,9500,3,60000,14.62,0,84.5,9,13,fals,10.78,1251.11,2016 + 36 months,MORTGAGE,credit_card,IN,Not 
Verified,INDIVIDUAL,9500,2,66000,17.02,0,4.2,33,12,tru,6.03,-9210.86,2014 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,31350,2,145000,9.57,1,26.5,35,17,fals,17.57,6955.16,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,20000,10,85000,14.47,0,68.1,29,14,fals,12.49,4380.99,2014 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,15000,10,50000,13.08,0,54.9,25,15,fals,11.99,2794.91,2014 + 36 months,RENT,debt_consolidation,SC,Not Verified,INDIVIDUAL,7000,10,30000,10.96,0,76.2,10,12,fals,13.35,1460.69,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,5000,3,45000,15.18,3,30.2,33,13,fals,7.69,614.83,2014 + 60 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,30000,10,114200,7.85,0,84.4,20,30,fals,14.65,6868.08,2015 + 36 months,RENT,debt_consolidation,OR,Verified,INDIVIDUAL,12800,10,65000,29.6,0,42.5,49,11,tru,13.99,-3132,2015 + 36 months,MORTGAGE,home_improvement,VA,Verified,INDIVIDUAL,7450,2,99599,19.8,0,44.9,39,12,fals,12.74,168.46,2017 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,7625,1,58000,18.17,0,57,31,6,fals,16.99,2034.99,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,5600,6,60000,19.2,0,88,15,14,fals,13.99,726.17,2015 + 36 months,OWN,debt_consolidation,VA,Verified,INDIVIDUAL,4000,null,20000,6.84,0,30,9,23,fals,7.9,505.76,2013 + 36 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,5000,0,24000,28.9,0,54.3,20,11,fals,14.16,1041.88,2014 + 36 months,MORTGAGE,home_improvement,FL,Not Verified,INDIVIDUAL,5000,6,64000,8.63,0,25.1,18,11,fals,9.49,129.76,2016 + 36 months,OWN,debt_consolidation,TX,Not Verified,INDIVIDUAL,6350,10,40000,17.22,0,3.8,17,10,fals,12.12,709.44,2012 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,9000,0,60000,27.26,0,94.8,25,17,tru,12.29,-4722.9,2015 + 60 months,RENT,major_purchase,TX,Verified,INDIVIDUAL,35000,5,75000,11.35,0,4,14,15,tru,21.97,-26686.83,2016 + 60 
months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,15000,7,51500,26.59,0,56,43,13,tru,22.7,2651.76,2013 + 36 months,MORTGAGE,debt_consolidation,NY,Not Verified,INDIVIDUAL,10125,10,50000,15.41,0,67.1,15,14,tru,15.59,-5895.5,2015 + 36 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,25000,10,73000,24.12,0,57.1,42,16,fals,15.1,6242.87,2013 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,30000,9,80000,33.25,1,51.2,27,10,tru,15.61,-12412.38,2015 + 36 months,MORTGAGE,credit_card,WI,Not Verified,INDIVIDUAL,9000,10,112801,11.05,0,29.8,27,32,fals,6.92,610.74,2015 + 36 months,OWN,debt_consolidation,MI,Verified,INDIVIDUAL,6400,null,16000,33.61,0,25.6,26,15,fals,11.99,1115.22,2015 + 36 months,RENT,debt_consolidation,NY,Not Verified,INDIVIDUAL,16000,2,43000,13.76,0,53.5,34,12,fals,12.12,3164.45,2013 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,7000,3,47000,15.6,0,17.3,18,8,fals,6.03,254.2,2014 + 36 months,MORTGAGE,debt_consolidation,LA,Verified,INDIVIDUAL,35000,4,170000,21.26,0,62,42,20,fals,11.53,5272.58,2015 + 36 months,MORTGAGE,home_improvement,MI,Verified,INDIVIDUAL,9500,10,66000,7.92,0,22,30,17,fals,9.17,1181.95,2015 + 60 months,MORTGAGE,debt_consolidation,WV,Verified,INDIVIDUAL,20000,6,98800,24.43,0,81.6,44,19,fals,10.99,4799.45,2014 + 60 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,13000,0,36000,27.57,0,78.6,24,13,fals,12.39,3139.57,2015 + 36 months,MORTGAGE,debt_consolidation,SC,Not Verified,INDIVIDUAL,10000,6,40000,9.39,0,45.8,25,16,fals,10.15,541.85,2014 + 36 months,MORTGAGE,other,WA,Verified,INDIVIDUAL,6800,3,20000,24.07,0,28.2,23,10,fals,19.19,791.28,2015 + 60 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,11200,10,115000,15.8,0,61.4,70,19,fals,20.49,2877.52,2015 + 36 months,MORTGAGE,credit_card,RI,Verified,INDIVIDUAL,28000,8,94000,18.6,0,75.3,36,23,fals,12.99,5958.62,2013 + 36 months,MORTGAGE,debt_consolidation,IL,Not Verified,INDIVIDUAL,10700,3,32160,16.12,0,56.7,24,11,fals,11.47,710.73,2016 + 36 
months,MORTGAGE,credit_card,NY,Not Verified,INDIVIDUAL,8000,10,50000,37.06,0,21.9,25,12,fals,7.26,533.96,2015 + 60 months,MORTGAGE,credit_card,CA,Not Verified,INDIVIDUAL,16800,9,42000,19.37,0,67.1,23,21,tru,12.59,-12939.01,2015 + 60 months,MORTGAGE,other,NV,Verified,INDIVIDUAL,35000,2,100000,23.48,0,75.1,21,13,fals,21.99,4623.84,2014 + 36 months,MORTGAGE,other,IN,Verified,INDIVIDUAL,4200,9,92000,18.16,0,97.6,12,12,fals,16.55,776.66,2015 + 36 months,MORTGAGE,debt_consolidation,UT,Not Verified,INDIVIDUAL,12150,8,69192,11.9,0,65,40,18,fals,13.98,2794.96,2014 + 60 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,35000,0,163000,22.88,3,67,49,26,fals,18.25,597.45,2015 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,22800,10,60000,25.49,0,64.8,12,20,fals,14.85,1893.98,2016 + 36 months,RENT,debt_consolidation,HI,Verified,INDIVIDUAL,8000,6,67000,25.78,0,56.4,17,11,fals,11.99,698.9,2016 + 36 months,MORTGAGE,credit_card,NY,Not Verified,INDIVIDUAL,28000,7,122000,15.13,0,17.2,58,20,fals,5.32,1529.92,2015 + 36 months,MORTGAGE,credit_card,MN,Not Verified,INDIVIDUAL,15200,10,68000,18.04,0,53.4,27,17,fals,7.12,1651.14,2014 + 36 months,RENT,other,LA,Verified,INDIVIDUAL,7500,10,45000,5.11,0,38.2,6,23,fals,18.24,2323.03,2014 + 60 months,MORTGAGE,debt_consolidation,NH,Verified,INDIVIDUAL,27300,10,70000,18.05,0,82.4,29,12,tru,21.98,-11279.36,2012 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,10400,2,55000,33.84,6,89,31,11,fals,6.24,874.12,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,16000,10,98000,9.53,0,70.8,14,27,fals,10.99,429.2,2014 + 60 months,RENT,credit_card,CA,Verified,INDIVIDUAL,20000,10,70000,24.16,0,51,16,12,fals,10.99,814.39,2015 + 36 months,MORTGAGE,credit_card,NM,Verified,INDIVIDUAL,8000,10,36000,22.5,0,30.2,19,16,fals,8.18,641.32,2015 + 36 months,MORTGAGE,credit_card,WI,Verified,INDIVIDUAL,10000,10,59101,31.47,1,3.2,30,13,fals,9.17,1001.29,2014 + 36 months,MORTGAGE,debt_consolidation,SC,Not 
Verified,INDIVIDUAL,18000,4,85000,13.65,0,56.7,27,14,fals,13.99,607.75,2015 + 36 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,12000,5,38000,11.56,0,41.4,11,10,fals,13.99,804.88,2016 + 60 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,19950,10,50000,17.52,0,60.2,14,14,tru,17.57,-10331.06,2015 + 60 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,18000,7,50000,21.29,1,40.7,36,16,tru,18.84,-13870.35,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,16700,10,67392,22.49,0,84.6,12,12,tru,19.99,-12700.14,2012 + 36 months,OWN,credit_card,CA,Not Verified,INDIVIDUAL,13625,5,36000,27.34,0,46.5,9,4,tru,13.98,-12399.8,2014 + 36 months,MORTGAGE,other,PA,Not Verified,INDIVIDUAL,7600,10,73580,17.37,0,37.7,29,15,fals,11.99,905.79,2016 + 36 months,RENT,debt_consolidation,WI,Verified,INDIVIDUAL,25000,6,160000,15.61,0,40.4,31,11,fals,6.49,1726.14,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,35000,10,170000,8.32,0,51,38,45,fals,9.17,979.87,2015 + 60 months,MORTGAGE,debt_consolidation,MO,Verified,INDIVIDUAL,20000,3,73000,18.63,1,72.3,26,16,tru,15.61,-10007.39,2015 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,6000,null,24000,21.95,0,36.6,21,31,fals,8.18,647.98,2015 + 36 months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,28000,10,189895,15.04,0,8.5,29,16,fals,5.93,2141.09,2015 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,24000,1,110000,10.82,0,58.7,31,12,fals,8.9,3434.76,2014 + 36 months,MORTGAGE,debt_consolidation,TN,Not Verified,INDIVIDUAL,11000,6,64000,17.96,0,5.3,39,15,fals,9.49,523.04,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,5000,10,37000,26.34,0,38.4,24,19,fals,13.99,199.62,2017 + 36 months,RENT,debt_consolidation,MO,Verified,INDIVIDUAL,6250,4,25000,28.08,0,75.7,17,6,fals,15.8,1630.34,2012 + 60 months,OWN,credit_card,PA,Verified,INDIVIDUAL,24000,9,66900,31.2,0,82.9,43,15,fals,12.49,6347.57,2014 + 36 months,MORTGAGE,credit_card,IL,Not 
Verified,INDIVIDUAL,9600,1,50000,13.9,2,48,34,12,fals,13.11,1757.69,2013 + 36 months,MORTGAGE,debt_consolidation,WA,Verified,INDIVIDUAL,6400,2,40000,18.21,0,68.5,22,11,tru,16.77,-4807.85,2012 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,12000,4,105000,19.41,0,71,28,18,fals,12.12,2373.35,2012 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,14000,10,123000,13.61,0,12.7,35,15,fals,10.99,3570.49,2014 + 36 months,MORTGAGE,debt_consolidation,MN,Verified,INDIVIDUAL,21750,0,110000,27.7,0,79.2,27,21,fals,18.25,209.5,2016 + 36 months,RENT,major_purchase,MN,Verified,INDIVIDUAL,15000,10,45240,25.17,0,83,15,11,fals,13.67,1938.36,2016 + 60 months,MORTGAGE,credit_card,MT,Verified,INDIVIDUAL,25000,10,75000,21.68,0,56.8,34,17,fals,14.49,2999.66,2014 + 36 months,RENT,debt_consolidation,NM,Not Verified,INDIVIDUAL,18050,3,85000,12.93,0,32,21,12,tru,13.49,-14503.81,2016 + 36 months,MORTGAGE,credit_card,MN,Not Verified,INDIVIDUAL,20000,3,115000,14.08,0,63.7,40,18,fals,7.62,127.46,2013 + 36 months,MORTGAGE,credit_card,IL,Verified,INDIVIDUAL,19950,10,41600,33.73,0,51.3,22,11,fals,15.99,1047.77,2016 + 60 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,25200,5,60000,7.78,1,58.4,5,5,tru,21.18,-16983.83,2016 + 60 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,18325,10,115000,22.6,0,33.5,32,10,fals,30.84,1600.83,2016 + 36 months,OWN,home_improvement,NY,Verified,INDIVIDUAL,28000,10,205000,21.47,0,21,32,14,fals,15.99,410.41,2016 + 36 months,RENT,credit_card,UT,Verified,INDIVIDUAL,6000,7,40000,20.4,0,50,26,30,fals,7.69,330.41,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,15000,null,64000,9.43,0,63.1,9,11,fals,12.85,161.41,2014 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,8725,null,23000,17.95,0,74.4,28,47,tru,16.99,-2193.37,2014 + 36 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,5000,8,58000,29.3,0,33.3,21,10,fals,10.49,317.77,2014 + 36 
months,MORTGAGE,debt_consolidation,MD,Verified,INDIVIDUAL,7000,2,132000,8.77,0,84.8,33,22,tru,12.74,-6752.61,2016 + 36 months,MORTGAGE,other,OR,Verified,INDIVIDUAL,3000,6,35000,32.47,1,29.2,26,11,tru,20.31,-1505.16,2013 + 60 months,OWN,debt_consolidation,KY,Verified,INDIVIDUAL,30000,8,225000,7.42,0,29.6,36,31,fals,7.39,1798.2,2016 + 36 months,OWN,medical,NJ,Verified,INDIVIDUAL,2000,10,65000,8.49,0,70.3,36,27,fals,11.44,0.88,2016 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,5500,10,42000,21.63,2,47.3,24,26,fals,10.64,29.27,2015 + 60 months,MORTGAGE,medical,FL,Verified,INDIVIDUAL,17000,9,65712,26.13,0,54.6,22,12,tru,23.63,962.34,2013 + 36 months,MORTGAGE,debt_consolidation,MD,Verified,INDIVIDUAL,5000,10,80000,9.17,3,97.4,30,24,fals,18.25,1470.13,2013 + 36 months,MORTGAGE,credit_card,AR,Not Verified,INDIVIDUAL,17000,5,90000,14.69,0,57.7,39,17,fals,14.49,1042.04,2016 + 36 months,RENT,debt_consolidation,MA,Verified,INDIVIDUAL,8000,4,75000,15.1,1,11.9,18,8,fals,13.99,313.17,2016 + 60 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,11000,8,47000,15.5,0,43.7,23,8,fals,20.99,1337.37,2015 + 36 months,MORTGAGE,credit_card,TN,Verified,INDIVIDUAL,13200,10,65000,26.2,0,93.9,18,11,fals,9.17,1379.05,2015 + 36 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,10000,3,64000,23.18,0,48.6,25,10,fals,11.39,431.12,2016 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,18600,6,78000,29.02,0,49,17,5,fals,19.48,311.71,2016 + 36 months,MORTGAGE,credit_card,MO,Verified,INDIVIDUAL,24000,10,84200,18.8,0,40.9,13,24,fals,7.26,2075.59,2015 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,7000,6,60000,9.42,0,23.7,14,20,fals,8.9,1001.77,2013 + 36 months,MORTGAGE,other,OK,Not Verified,INDIVIDUAL,2750,1,90260,9.09,0,88.8,26,27,fals,11.53,451.3,2015 + 36 months,RENT,debt_consolidation,UT,Verified,INDIVIDUAL,7800,0,23000,18.94,0,77.9,28,11,fals,15.99,168.63,2017 + 36 months,MORTGAGE,home_improvement,IL,Not 
Verified,INDIVIDUAL,9000,0,99400,17.07,0,38.8,34,36,fals,14.64,1623.09,2014 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,30000,8,150000,5.35,0,62.5,29,12,tru,11.49,-26003.66,2016 + 36 months,RENT,moving,FL,Not Verified,INDIVIDUAL,7175,10,40000,18.72,0,54,10,10,tru,24.5,-5474.66,2014 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,16000,5,48800,13.92,0,51.3,30,11,fals,7.89,1092.43,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30000,0,86000,24.28,2,69.8,46,21,fals,16.99,3787.08,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,25000,3,140000,10.26,0,16.8,22,4,fals,9.17,2860.5,2015 + 60 months,RENT,debt_consolidation,MA,Verified,INDIVIDUAL,14000,2,60000,12.34,0,67.6,8,8,fals,13.33,953.31,2015 + 36 months,OWN,debt_consolidation,IL,Not Verified,INDIVIDUAL,3000,10,30000,26.12,0,63.6,25,8,tru,16.29,-1329.63,2012 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,15000,1,48000,14.53,0,41.9,16,9,fals,12.69,2390.7,2015 + 60 months,RENT,debt_consolidation,TN,Verified,INDIVIDUAL,18700,1,52000,12.69,0,28.9,14,17,fals,21.99,4853.19,2014 + 36 months,OWN,credit_card,MI,Verified,INDIVIDUAL,17625,2,39187,31.38,0,71.6,9,9,fals,13.98,4054.54,2014 + 36 months,MORTGAGE,debt_consolidation,IL,Not Verified,INDIVIDUAL,6000,8,52000,13.85,0,63.2,26,13,fals,11.55,1127.95,2013 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,10600,10,103586,4.12,0,37.4,42,21,fals,16.99,3498.01,2013 + 36 months,MORTGAGE,credit_card,MA,Not Verified,INDIVIDUAL,14000,8,71000,13.56,0,66,16,16,fals,11.44,2507.46,2015 + 36 months,RENT,debt_consolidation,OR,Not Verified,INDIVIDUAL,16750,10,70000,17.88,1,85.6,40,13,fals,18.25,4643.65,2013 + 36 months,MORTGAGE,debt_consolidation,NC,Not Verified,INDIVIDUAL,25000,null,100000,5.35,0,23.8,29,24,fals,5.32,14.93,2016 + 36 months,RENT,credit_card,VA,Verified,INDIVIDUAL,5000,4,180000,11.39,1,60.8,12,18,fals,11.49,250.18,2017 + 36 
months,MORTGAGE,home_improvement,MD,Verified,INDIVIDUAL,2500,10,58700,7.99,0,1.8,60,21,fals,14.46,20.09,2016 + 36 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,19000,4,38000,21.38,0,28,39,10,fals,16.29,1910.95,2016 + 36 months,RENT,debt_consolidation,AZ,Verified,INDIVIDUAL,16000,1,50000,31.04,0,24.1,22,10,fals,8.18,1094.68,2015 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,13500,6,65300,12.44,3,45.2,36,37,tru,14.16,-8509.39,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,12000,10,127000,16.69,0,63.7,30,16,fals,7.89,717.97,2015 + 36 months,OWN,debt_consolidation,SC,Verified,INDIVIDUAL,35000,7,100000,13.61,1,75.6,31,16,tru,12.99,-27587.58,2015 + 60 months,RENT,debt_consolidation,OK,Verified,INDIVIDUAL,20000,10,58000,18.36,1,52.1,17,14,tru,18.54,-13997.53,2015 + 36 months,OWN,home_improvement,CA,Verified,INDIVIDUAL,30000,10,211000,7.3,0,42.8,25,13,fals,14.99,7556.22,2014 + 36 months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,30000,10,125000,7.9,0,43.8,31,19,fals,14.3,3739.73,2013 + 36 months,MORTGAGE,credit_card,PA,Verified,INDIVIDUAL,4800,6,65000,22.88,0,63.8,15,7,fals,13.33,733.48,2015 + 36 months,MORTGAGE,credit_card,AL,Verified,INDIVIDUAL,25000,10,75000,25.38,0,59.8,36,14,fals,13.05,5074.38,2013 + 36 months,RENT,credit_card,SC,Not Verified,INDIVIDUAL,7000,2,45000,27.73,1,49,17,6,fals,10.15,756.11,2014 + 36 months,RENT,credit_card,IN,Not Verified,INDIVIDUAL,15000,1,52000,24.16,0,74.8,15,21,fals,18.75,4726.02,2013 + 36 months,RENT,small_business,NY,Verified,INDIVIDUAL,3075,2,30000,2,0,17.5,5,5,fals,15.1,155.19,2013 + 60 months,RENT,small_business,AZ,Verified,INDIVIDUAL,12000,2,110000,8.79,0,41.3,20,14,tru,21.67,-953.67,2015 + 36 months,OWN,debt_consolidation,NC,Verified,INDIVIDUAL,30000,4,105000,8.16,0,9.9,23,20,fals,7.99,1551.09,2016 + 36 months,RENT,other,FL,Verified,INDIVIDUAL,2000,4,40000,15.6,0,56.3,12,8,tru,21.97,-1426.1,2016 + 60 
months,MORTGAGE,debt_consolidation,MI,Verified,INDIVIDUAL,15000,9,33000,2.98,0,35.3,13,8,fals,22.74,826,2017 + 60 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,28000,10,148000,13.06,0,83,25,9,tru,24.08,-332.24,2014 + 36 months,MORTGAGE,credit_card,CO,Verified,INDIVIDUAL,20000,0,105000,7.67,0,65,24,18,fals,7.9,2528.97,2014 + 36 months,OWN,debt_consolidation,AL,Verified,INDIVIDUAL,5600,10,27342,38.85,0,38.5,26,10,fals,11.22,440.88,2015 + 36 months,MORTGAGE,debt_consolidation,MA,Not Verified,INDIVIDUAL,10000,10,50000,23.07,1,40.5,30,21,fals,9.75,1034.47,2016 + 36 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,13725,10,97800,17.14,0,64,34,17,fals,8.9,1876.83,2012 + 36 months,RENT,debt_consolidation,MO,Verified,INDIVIDUAL,11850,10,35000,18.45,0,46.7,17,13,fals,19.52,3901.19,2014 + 36 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,6000,10,96000,11.46,0,44.4,23,21,fals,5.32,328.9,2016 + 60 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,10375,4,45600,31.89,0,67.3,24,12,tru,17.86,-7927.74,2015 + 36 months,RENT,debt_consolidation,NC,Not Verified,INDIVIDUAL,10000,0,61000,27.86,0,85.4,28,22,tru,22.2,-7137.41,2013 + 36 months,RENT,debt_consolidation,OH,Verified,INDIVIDUAL,10000,0,105000,12.71,3,77.8,50,14,fals,16.99,1047.18,2016 + 36 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,18225,10,52000,28.45,0,77.5,18,15,tru,12.99,808.69,2013 + 60 months,MORTGAGE,major_purchase,NH,Verified,INDIVIDUAL,36000,10,92000,20.3,0,19.8,27,16,fals,13.99,2477.79,2016 + 36 months,MORTGAGE,debt_consolidation,TX,Not Verified,INDIVIDUAL,15000,8,95000,15.3,0,38.7,44,17,fals,7.9,1677.16,2014 + 60 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,21200,2,75000,24.21,2,68.2,28,16,fals,15.41,4823.97,2015 + 36 months,RENT,credit_card,TX,Verified,INDIVIDUAL,7400,1,56000,10.82,0,87.5,12,7,fals,9.17,828.96,2015 + 36 months,RENT,house,NJ,Verified,INDIVIDUAL,4575,10,65000,3.19,0,24.5,13,7,fals,12.99,962.89,2013 + 36 
months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30225,8,128000,18.09,0,26,24,11,tru,16.29,-20588.96,2016 + 36 months,MORTGAGE,debt_consolidation,OR,Not Verified,INDIVIDUAL,13925,10,46000,27.92,0,58.2,27,21,fals,12.29,2675.65,2015 + 60 months,OWN,credit_card,FL,Verified,INDIVIDUAL,15000,10,65000,21.82,0,36.9,28,10,fals,14.49,1509.65,2014 + 36 months,OWN,debt_consolidation,CO,Verified,INDIVIDUAL,18000,10,126000,14.43,0,21.8,31,29,fals,7.12,1562.26,2014 + 60 months,MORTGAGE,debt_consolidation,CO,Verified,INDIVIDUAL,12950,10,36000,18,0,71.4,17,13,tru,20.99,-10810.63,2015 + 36 months,OWN,credit_card,NC,Verified,INDIVIDUAL,10000,10,48000,15.38,0,53.6,17,24,fals,10.64,1724.63,2013 + 36 months,MORTGAGE,debt_consolidation,NC,Verified,INDIVIDUAL,20000,4,75000,14.38,0,48.4,25,16,fals,10.74,2244.4,2012 + 36 months,MORTGAGE,debt_consolidation,NV,Verified,INDIVIDUAL,24000,2,110000,15.87,3,50.8,38,18,tru,14.33,-13183.01,2013 + 36 months,RENT,medical,CA,Not Verified,INDIVIDUAL,2100,1,36000,4.4,0,4.8,4,6,tru,23.5,-1100.28,2013 + 36 months,OWN,credit_card,MN,Not Verified,INDIVIDUAL,8100,7,45000,28.75,0,77,46,19,fals,11.44,849.23,2015 + 36 months,RENT,small_business,VA,Verified,INDIVIDUAL,9500,0,55000,14.84,2,33,14,7,fals,22.95,3565.05,2013 + 60 months,MORTGAGE,other,NJ,Not Verified,INDIVIDUAL,10000,10,130000,17.08,0,35.1,30,13,fals,13.35,1826.01,2014 + 36 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,5000,0,24000,25.15,0,55.3,10,6,fals,6.49,481.36,2014 + 60 months,RENT,debt_consolidation,TX,Verified,INDIVIDUAL,35000,5,75000,30.01,0,14,42,20,fals,20.75,8360.95,2016 + 36 months,RENT,credit_card,OH,Verified,INDIVIDUAL,11875,10,34000,17.23,0,69.8,19,13,fals,14.64,2043.48,2014 + 60 months,MORTGAGE,home_improvement,CA,Verified,INDIVIDUAL,15125,10,45000,35.95,0,80.9,14,10,fals,21.97,631.64,2016 + 36 months,MORTGAGE,home_improvement,TX,Not Verified,INDIVIDUAL,8000,2,128000,2.38,0,5.9,30,14,fals,15.31,1803.95,2013 + 36 
months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,35000,null,90000,24.76,0,75.6,33,32,tru,25.29,-25270.39,2016 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,24000,null,73000,18.79,0,63.5,32,39,fals,8.39,2240.08,2014 + 36 months,RENT,debt_consolidation,WA,Not Verified,INDIVIDUAL,15000,8,90000,8.4,1,34.1,11,13,fals,8.59,875.29,2016 + 36 months,OWN,other,VA,Verified,INDIVIDUAL,20000,null,52384,10.29,1,73.6,17,47,tru,17.56,-15688.1,2013 + 36 months,MORTGAGE,debt_consolidation,PA,Not Verified,INDIVIDUAL,16000,4,92000,14.92,0,69.3,42,18,tru,17.27,-13709.6,2012 + 36 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,12000,5,90000,12.85,0,23.6,9,6,fals,7.9,1422.66,2013 + 60 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,23400,4,93000,19.81,0,25,26,17,fals,9.75,762.94,2016 + 36 months,OWN,other,CA,Not Verified,INDIVIDUAL,10000,4,60000,34.34,0,19.2,37,14,fals,6.89,829.57,2015 + 60 months,RENT,debt_consolidation,MI,Verified,INDIVIDUAL,12600,2,35000,31.93,0,67,14,7,tru,19.99,-7444.31,2015 + 36 months,MORTGAGE,credit_card,UT,Verified,INDIVIDUAL,10000,0,58000,9.41,0,82.2,9,8,fals,12.12,1977.77,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,7900,3,22000,37.32,0,80,13,11,fals,17.57,2097.61,2015 + 36 months,RENT,credit_card,GA,Verified,INDIVIDUAL,8400,5,66000,12.3,0,48,9,8,fals,8.9,1040.03,2014 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,10000,10,85000,28.8,0,27.4,29,14,tru,6.39,-4170.82,2015 + 36 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,10625,2,31157.85,16.87,0,83.7,10,11,fals,13.05,2272.11,2013 + 36 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,2950,10,30000,19.76,0,10.6,9,7,fals,17.86,542.23,2015 + 60 months,RENT,credit_card,DE,Verified,INDIVIDUAL,17600,3,65000,5.39,0,43.3,4,3,fals,18.55,4542.08,2015 + 60 months,MORTGAGE,credit_card,NC,Verified,INDIVIDUAL,28000,10,210000,15.78,0,36.6,52,23,fals,13.11,7906.01,2013 + 36 
months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,20000,5,89000,17.5,3,68.4,39,17,fals,15.88,5285.42,2013 + 36 months,MORTGAGE,debt_consolidation,KS,Not Verified,INDIVIDUAL,8000,10,120000,30.98,0,85.2,21,22,tru,12.49,-512.75,2014 + 36 months,RENT,house,TX,Not Verified,INDIVIDUAL,14400,10,54000,18.02,9,43.6,33,20,fals,17.57,3082.6,2015 + 60 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,14000,10,43000,11.86,0,78.7,10,21,fals,15.61,4593.15,2014 + 60 months,MORTGAGE,debt_consolidation,OR,Verified,INDIVIDUAL,27650,1,63000,20.5,0,48.8,17,13,fals,16.99,10952.56,2014 + 36 months,MORTGAGE,major_purchase,TN,Verified,INDIVIDUAL,8000,null,52476,27.19,0,29.5,33,18,fals,19.99,670.85,2016 + 36 months,RENT,major_purchase,NY,Verified,INDIVIDUAL,6000,6,200000,3.24,0,77,10,11,tru,11.14,912.3,2013 + 36 months,MORTGAGE,debt_consolidation,TN,Not Verified,INDIVIDUAL,12500,0,74000,22.77,0,44.9,30,13,fals,7.49,1366.57,2014 + 36 months,OWN,debt_consolidation,NM,Not Verified,INDIVIDUAL,7450,9,22000,19.47,0,65.1,12,7,fals,7.9,843.57,2012 + 60 months,RENT,debt_consolidation,FL,Verified,INDIVIDUAL,10625,10,35000,16.91,0,82.7,15,38,tru,20.5,-2090.9,2013 + 36 months,MORTGAGE,credit_card,IL,Not Verified,INDIVIDUAL,14000,5,62000,9.63,0,50,19,30,fals,15.88,366.79,2013 + 36 months,OWN,credit_card,MI,Verified,INDIVIDUAL,15000,1,56000,12.69,0,16.5,51,15,fals,7.89,127.32,2015 + 36 months,MORTGAGE,car,MI,Verified,INDIVIDUAL,24500,null,64000,3.99,0,20.3,19,37,fals,6.03,243.1,2013 + 36 months,MORTGAGE,other,CO,Verified,INDIVIDUAL,2000,2,40000,24.72,0,80.4,27,10,fals,12.79,135.41,2016 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,9600,9,27000,26.58,0,82.3,11,8,tru,13.99,-5992.93,2015 + 36 months,RENT,credit_card,NY,Verified,INDIVIDUAL,20050,3,45000,30.35,0,64.4,32,4,fals,19.05,6100.05,2013 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,12000,6,135000,11.34,0,39.1,28,18,fals,10.49,806.18,2014 + 36 
months,OWN,debt_consolidation,CT,Verified,INDIVIDUAL,4000,6,60000,14.06,1,56.4,9,21,fals,13.49,85.91,2017 + 36 months,RENT,credit_card,WA,Verified,INDIVIDUAL,19500,8,70000,20.45,0,70.7,13,8,fals,10.16,3204.31,2013 + 60 months,MORTGAGE,small_business,CA,Verified,INDIVIDUAL,26000,7,190000,12.96,0,46.4,32,23,fals,11.14,6458.1,2012 + 36 months,OWN,credit_card,SC,Verified,INDIVIDUAL,2800,9,32000,14.1,0,23.3,18,10,fals,9.16,156.34,2016 + 36 months,MORTGAGE,car,NY,Not Verified,INDIVIDUAL,7000,8,87000,7.74,0,7.7,24,12,fals,13.99,3.22,2016 + 60 months,OWN,major_purchase,PA,Verified,INDIVIDUAL,18000,10,97000,8.41,0,48.8,20,15,fals,17.27,1019.56,2013 + 36 months,MORTGAGE,credit_card,CT,Verified,INDIVIDUAL,10200,null,28751,11.98,1,81.5,33,16,tru,8.39,-5994.26,2014 + 36 months,RENT,credit_card,NY,Not Verified,INDIVIDUAL,17000,10,95000,6.42,2,19.5,41,19,fals,10.16,358.39,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,13600,4,38000,11.92,1,94.4,10,7,fals,19.47,4463.34,2014 + 36 months,MORTGAGE,debt_consolidation,PA,Verified,INDIVIDUAL,5000,10,78000,9.05,0,15.3,38,22,fals,9.16,427.92,2016 + 36 months,RENT,debt_consolidation,WA,Verified,INDIVIDUAL,7200,2,60000,25.74,0,51.2,37,10,fals,9.99,1162.44,2013 + 36 months,MORTGAGE,debt_consolidation,AR,Not Verified,INDIVIDUAL,18000,6,61000,18.67,0,60.5,41,19,fals,8.18,2079.43,2015 + 36 months,OWN,credit_card,NC,Verified,INDIVIDUAL,35000,10,220000,11.92,0,53.6,38,20,fals,9.17,3863,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,7200,10,87000,4.41,0,18.8,9,15,fals,14.33,1700.88,2012 + 36 months,RENT,debt_consolidation,IL,Not Verified,INDIVIDUAL,6000,1,28000,13.16,0,59.9,19,7,fals,12.85,758.28,2014 + 36 months,OWN,credit_card,NC,Not Verified,INDIVIDUAL,7000,10,50000,29.91,0,89.6,37,29,fals,10.15,606.47,2014 + 36 months,MORTGAGE,debt_consolidation,OH,Not Verified,INDIVIDUAL,8000,10,62000,21.85,0,17.9,39,16,fals,5.32,178.18,2016 + 36 months,MORTGAGE,debt_consolidation,IN,Not 
Verified,INDIVIDUAL,12575,9,35000,27.78,0,59.9,30,11,tru,16.99,-3621.27,2014 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,30000,3,85000,22.29,1,37.2,41,12,tru,16.99,-27889.44,2016 + 60 months,OWN,debt_consolidation,TX,Verified,INDIVIDUAL,12000,2,29000,37.87,0,48.3,24,20,tru,14.33,-9460.59,2015 + 36 months,RENT,credit_card,FL,Verified,INDIVIDUAL,5600,4,47000,16.37,0,29.3,6,4,tru,18.75,-4075.08,2012 + 60 months,MORTGAGE,debt_consolidation,LA,Verified,INDIVIDUAL,14000,10,82000,20.04,0,55.2,31,24,tru,20.5,-8217.84,2016 + 36 months,OWN,home_improvement,IL,Verified,INDIVIDUAL,30000,10,100000,2.45,0,1.1,32,33,fals,6.92,69.21,2015 + 60 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,12000,10,90000,7.35,0,57.9,28,17,fals,16.99,2478.64,2015 + 36 months,MORTGAGE,debt_consolidation,WA,Not Verified,INDIVIDUAL,4000,5,80000,12.2,0,96.7,14,11,fals,11.14,37.5,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,30000,9,96000,17.65,0,21,26,14,fals,18.55,4672.03,2015 + 60 months,RENT,debt_consolidation,NJ,Verified,INDIVIDUAL,21925,10,56000,15.56,0,17.6,29,16,tru,18.92,-16081.63,2014 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,6600,1,40000,20.13,0,41.7,18,14,fals,8.39,888.34,2014 + 36 months,MORTGAGE,home_improvement,OH,Not Verified,INDIVIDUAL,5600,3,78000,13.03,0,37.4,16,15,fals,5.32,407.55,2015 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,8000,10,67000,22.41,0,85.8,32,12,fals,16.55,1519.25,2015 + 36 months,MORTGAGE,credit_card,NJ,Verified,INDIVIDUAL,9000,5,109000,4.8,1,46,24,29,fals,8.39,460.82,2016 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,12000,10,71000,16.35,0,91.5,9,17,tru,13.53,-3857.46,2014 + 36 months,RENT,debt_consolidation,KS,Not Verified,INDIVIDUAL,6000,0,40000,26.01,0,68.2,19,8,tru,9.99,-2181.33,2015 + 60 months,RENT,credit_card,GA,Verified,INDIVIDUAL,10000,0,51000,4.52,0,63.2,8,6,tru,13.66,-5771.76,2014 + 36 
months,RENT,debt_consolidation,LA,Verified,INDIVIDUAL,5000,3,36000,17.83,0,53.8,24,13,fals,11.99,668.75,2015 + 36 months,RENT,credit_card,CA,Verified,INDIVIDUAL,25000,3,100000,12.7,0,58.6,9,5,fals,18.55,7785.92,2013 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,8000,2,49000,12.22,0,63.3,16,28,fals,15.61,1123.26,2014 + 36 months,RENT,debt_consolidation,TX,Not Verified,INDIVIDUAL,8000,10,56805,6.78,2,24,25,13,fals,13.53,1759.5,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,8000,3,30000,22.88,1,75.2,18,11,fals,14.64,1931.09,2014 + 36 months,MORTGAGE,credit_card,TN,Verified,INDIVIDUAL,21000,10,95000,23.82,0,77.5,32,13,fals,14.09,3391.8,2013 + 36 months,RENT,debt_consolidation,VA,Not Verified,INDIVIDUAL,10000,1,60000,18.74,0,53.2,35,8,fals,9.8,1153.17,2016 + 36 months,MORTGAGE,debt_consolidation,FL,Not Verified,INDIVIDUAL,19350,7,55000,35.48,0,32.4,40,23,fals,12.99,1390.68,2016 + 36 months,MORTGAGE,credit_card,CT,Not Verified,INDIVIDUAL,15725,10,54000,18.91,0,18.4,38,37,fals,7.89,1724.58,2015 + 36 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,10000,2,52215,7.47,0,79.5,18,9,fals,13.67,325.26,2016 + 36 months,MORTGAGE,home_improvement,MO,Verified,INDIVIDUAL,3000,3,105000,33.48,0,53.7,32,19,fals,8.9,429.35,2014 + 36 months,RENT,debt_consolidation,PA,Not Verified,INDIVIDUAL,15000,10,45000,19.18,0,58,14,14,fals,9.71,2350.84,2013 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,6000,2,90000,8.31,0,79.1,27,15,fals,9.17,399.08,2015 + 36 months,RENT,home_improvement,RI,Verified,INDIVIDUAL,4000,7,40000,25.8,0,22.5,27,13,fals,15.8,695.42,2013 + 36 months,RENT,medical,NJ,Verified,INDIVIDUAL,9000,10,41600,28.3,0,31.6,17,12,fals,6.62,805.99,2012 + 60 months,MORTGAGE,debt_consolidation,CA,Not Verified,INDIVIDUAL,35000,3,400000,18.85,1,85.3,41,33,tru,14.99,-24922.37,2014 + 36 months,RENT,credit_card,NJ,Not Verified,INDIVIDUAL,9000,7,51000,5.51,0,41.3,21,16,fals,6.24,633.32,2015 + 36 
months,MORTGAGE,debt_consolidation,TX,Verified,INDIVIDUAL,12000,10,75000,18.37,1,49.1,63,14,fals,9.99,1596.8,2015 + 60 months,MORTGAGE,home_improvement,AZ,Verified,INDIVIDUAL,10900,10,67000,11.52,1,35.7,15,10,tru,21.48,-4990.8,2016 + 36 months,RENT,credit_card,IL,Verified,INDIVIDUAL,8000,5,43000,6.81,0,64.8,10,10,fals,6.62,831.97,2012 + 60 months,MORTGAGE,debt_consolidation,MA,Verified,INDIVIDUAL,21000,3,146086,10.68,2,71.7,27,13,fals,13.99,1642.74,2015 + 60 months,RENT,debt_consolidation,WA,Verified,INDIVIDUAL,12000,3,48000,8.35,0,81.6,8,6,fals,15.61,305.15,2014 + 60 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,30700,1,80800,26.41,0,69.8,31,16,tru,25.99,-18591,2015 + 60 months,RENT,medical,AZ,Verified,INDIVIDUAL,12000,0,96000,26.56,1,80.2,45,21,tru,18.25,-8285.87,2015 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,4675,10,40000,19.11,0,31.7,11,11,fals,22.35,556.28,2016 + 36 months,MORTGAGE,debt_consolidation,GA,Verified,INDIVIDUAL,30000,5,115000,10.16,0,41.8,35,11,fals,11.99,41.47,2017 + 36 months,MORTGAGE,medical,NC,Not Verified,INDIVIDUAL,8000,10,125000,10.12,0,29.8,39,28,fals,7.89,713.33,2015 + 36 months,OWN,debt_consolidation,NY,Verified,INDIVIDUAL,7925,10,40000,22.29,0,34.5,11,14,fals,11.44,22.66,2017 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,3000,3,72000,12.08,1,32.8,34,25,fals,12.49,600.62,2014 + 36 months,MORTGAGE,debt_consolidation,AL,Not Verified,INDIVIDUAL,4000,6,74000,23.89,0,39.5,26,14,fals,14.3,942.54,2013 + 36 months,MORTGAGE,other,NY,Verified,INDIVIDUAL,5000,null,55000,19.03,0,54.2,34,13,tru,12.49,-1891.46,2014 + 36 months,MORTGAGE,credit_card,MD,Not Verified,INDIVIDUAL,7000,10,80000,18.9,0,83.5,23,17,fals,6.68,701.49,2015 + 36 months,OWN,debt_consolidation,HI,Verified,INDIVIDUAL,21000,1,50000,12.55,0,76.4,15,11,tru,15.31,-11647.39,2016 + 60 months,RENT,credit_card,AZ,Verified,INDIVIDUAL,22000,10,50000,24.48,0,78,31,7,tru,22.4,-15342.13,2014 + 36 
months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,12200,10,88000,16.46,0,72.6,17,17,fals,7.89,1106.47,2015 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,2400,9,115000,10.75,0,17.6,26,24,fals,8.9,343.47,2014 + 36 months,RENT,other,CO,Verified,INDIVIDUAL,5000,0,25000,34.85,0,65.4,26,7,tru,17.86,-4102.86,2015 + 36 months,MORTGAGE,debt_consolidation,IL,Verified,INDIVIDUAL,10000,4,40000,22.02,0,43.6,28,29,fals,12.74,4.6,2017 + 36 months,MORTGAGE,vacation,WV,Not Verified,INDIVIDUAL,7500,10,95000,9.16,1,57,43,23,fals,16.29,1799.28,2014 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,18250,null,42100,25.1,0,41.2,28,20,fals,9.99,2946.4,2013 + 36 months,MORTGAGE,debt_consolidation,CA,Verified,INDIVIDUAL,7000,10,130720,10.49,0,17.6,30,15,fals,6.92,705.41,2015 + 60 months,MORTGAGE,credit_card,CA,Verified,INDIVIDUAL,18000,1,64541,17.85,1,20,25,12,fals,22.74,159.18,2017 + 36 months,MORTGAGE,credit_card,NC,Verified,INDIVIDUAL,6000,5,32000,16.84,0,48.9,10,11,fals,12.69,906.42,2015 + 36 months,RENT,debt_consolidation,NY,Verified,INDIVIDUAL,30000,9,142660,13.42,0,92.5,25,24,tru,14.33,-10946.94,2013 + 36 months,RENT,other,CO,Verified,INDIVIDUAL,5000,0,38000,28.58,0,36.8,80,24,tru,7.49,-1425.13,2015 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,6000,0,20000,17.89,0,6,32,14,fals,8.18,453.04,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,6325,6,43000,7.76,0,74.7,13,5,fals,12.12,468.94,2012 + 36 months,MORTGAGE,home_improvement,TX,Verified,INDIVIDUAL,20700,10,46000,14.64,0,46.1,40,28,fals,13.35,4534.54,2014 + 36 months,MORTGAGE,other,TX,Verified,INDIVIDUAL,4000,2,58000,15.33,0,4.4,25,13,fals,12.99,232.93,2016 + 60 months,MORTGAGE,credit_card,IN,Verified,INDIVIDUAL,14900,null,35500,11.12,0,62.8,14,12,fals,18.25,4704.72,2015 + 36 months,MORTGAGE,debt_consolidation,NJ,Verified,INDIVIDUAL,24500,10,179200,12.16,0,52.9,25,17,fals,7.69,2989.41,2014 + 60 
months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,19800,2,91000,27.55,0,86.2,25,15,tru,21.49,-12062.04,2013 + 36 months,MORTGAGE,debt_consolidation,AZ,Verified,INDIVIDUAL,9000,0,50000,38.07,0,46.8,31,14,fals,9.16,512.35,2016 + 36 months,RENT,credit_card,KS,Verified,INDIVIDUAL,35000,8,200000,6.48,0,16.5,34,22,fals,7.39,2449.99,2016 + 60 months,MORTGAGE,debt_consolidation,GA,Not Verified,INDIVIDUAL,13325,10,104000,26.03,0,14.3,35,22,tru,7.89,-8782.65,2015 + 60 months,MORTGAGE,debt_consolidation,IN,Verified,INDIVIDUAL,18000,1,110000,10.81,0,82.1,49,16,fals,19.52,5820.02,2014 + 36 months,MORTGAGE,vacation,TX,Not Verified,INDIVIDUAL,10000,10,55000,2.23,0,26.6,10,15,tru,13.44,-8233.93,2015 + 36 months,RENT,debt_consolidation,CA,Not Verified,INDIVIDUAL,9600,1,51000,27.84,1,73.8,14,8,tru,15.31,-241,2013 + 36 months,RENT,debt_consolidation,VA,Verified,INDIVIDUAL,6450,10,43576,25.27,0,41.6,29,13,fals,16.55,1437.95,2015 + 36 months,MORTGAGE,debt_consolidation,TN,Verified,INDIVIDUAL,10000,10,74400,24.18,0,61,56,27,fals,9.99,863.75,2013 + 36 months,RENT,other,OH,Verified,INDIVIDUAL,4500,9,44000,33.14,0,49.7,23,11,fals,17.57,63.65,2015 + 60 months,MORTGAGE,major_purchase,MI,Verified,INDIVIDUAL,15000,10,63000,21.3,2,36.5,42,18,tru,16.29,-9797.57,2013 + 36 months,MORTGAGE,credit_card,NY,Verified,INDIVIDUAL,13000,1,75000,23.36,4,38.8,65,11,tru,14.65,-8619.75,2015 + 36 months,MORTGAGE,debt_consolidation,NY,Verified,INDIVIDUAL,7500,10,79000,2.26,0,5,37,22,fals,13.65,1480.3,2014 + 36 months,RENT,credit_card,CA,Not Verified,INDIVIDUAL,7500,6,42720,16.52,0,45.5,18,12,fals,12.12,1450.39,2013 + 36 months,MORTGAGE,home_improvement,AL,Verified,INDIVIDUAL,15000,2,48000,12.03,0,35,47,28,fals,9.67,2340.72,2014 + 60 months,MORTGAGE,credit_card,RI,Not Verified,INDIVIDUAL,21600,10,140000,23.48,0,69,33,25,fals,11.99,2915.34,2016 + 36 months,MORTGAGE,home_improvement,MA,Not Verified,INDIVIDUAL,10000,3,127000,9.13,0,66.1,13,14,fals,10.99,1492.1,2014 + 36 
months,OWN,credit_card,CA,Verified,INDIVIDUAL,20000,8,100000,13.59,0,69.2,26,17,fals,12.69,3416.27,2015 + 36 months,RENT,other,FL,Verified,INDIVIDUAL,23400,1,65000,13.11,0,39,35,14,fals,20.99,8010,2015 + 60 months,MORTGAGE,major_purchase,NC,Verified,INDIVIDUAL,20000,8,62000,9.74,0,19,13,12,tru,22.99,-15406.72,2015 + 36 months,MORTGAGE,home_improvement,FL,Verified,INDIVIDUAL,3950,10,52000,13.8,0,96.2,28,11,fals,21.18,341.59,2016 + 60 months,MORTGAGE,debt_consolidation,FL,Verified,INDIVIDUAL,16000,7,60000,6.63,0,86.1,20,21,fals,25.57,3715.25,2014 + 60 months,RENT,credit_card,FL,Verified,INDIVIDUAL,12000,3,48000,33.9,0,70.2,16,15,fals,12.69,1510.36,2015 + 36 months,MORTGAGE,debt_consolidation,OH,Verified,INDIVIDUAL,25000,4,105000,3.84,0,47.9,30,12,fals,7.89,1056.01,2015 + 60 months,RENT,debt_consolidation,IL,Verified,INDIVIDUAL,13250,4,55000,19.66,0,74,18,18,fals,18.25,5145.01,2014 + 36 months,RENT,debt_consolidation,CO,Verified,INDIVIDUAL,20000,8,75000,14.11,0,60.3,32,13,fals,15.8,5015.9,2012 + 36 months,MORTGAGE,credit_card,MO,Verified,INDIVIDUAL,10000,null,50000,29.94,0,16.7,37,33,fals,7.69,1229.66,2014 + 36 months,MORTGAGE,other,MO,Verified,INDIVIDUAL,3500,null,72000,7.97,0,8,42,19,fals,6.03,334.86,2012 + 36 months,RENT,debt_consolidation,FL,Not Verified,INDIVIDUAL,8500,7,45000,16.43,0,36.4,30,8,fals,10.99,725.14,2014 + 60 months,OWN,debt_consolidation,CA,Verified,INDIVIDUAL,10000,6,65000,21.95,0,49.3,27,13,fals,13.99,3073.97,2012 + 36 months,MORTGAGE,home_improvement,MD,Not Verified,INDIVIDUAL,4800,4,65000,12.39,0,17.1,9,18,fals,9.99,600,2015 + 36 months,RENT,debt_consolidation,WI,Verified,INDIVIDUAL,7825,1,30000,18.24,0,45.8,11,5,tru,18.84,-5098.89,2015 + 60 months,RENT,debt_consolidation,PA,Verified,INDIVIDUAL,35000,10,85000,29.61,0,49,37,16,fals,17.57,3327.54,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,3450,4,36000,4.97,0,43.6,7,11,fals,15.31,629.34,2016 + 36 months,OWN,debt_consolidation,CA,Not 
Verified,INDIVIDUAL,21000,2,64000,29.35,0,22.5,19,16,fals,7.89,2178.98,2015 + 36 months,MORTGAGE,credit_card,MO,Verified,INDIVIDUAL,15000,10,118450,11.22,1,91.9,21,44,fals,9.67,2305.07,2014 + 36 months,RENT,other,AZ,Verified,INDIVIDUAL,12700,5,56000,4.11,0,58.4,15,22,tru,17.14,-5918.99,2015 + 60 months,MORTGAGE,credit_card,NM,Verified,INDIVIDUAL,11050,10,68000,3.92,1,48.5,8,32,tru,14.65,-7916.47,2015 + 36 months,MORTGAGE,debt_consolidation,WI,Not Verified,INDIVIDUAL,4000,3,56000,11.44,0,11.5,20,13,fals,6.03,234.82,2012 + 60 months,MORTGAGE,debt_consolidation,OR,Not Verified,INDIVIDUAL,10000,9,58000,36.23,2,56.1,29,11,fals,17.57,2676.35,2015 + 36 months,OWN,debt_consolidation,OR,Verified,INDIVIDUAL,24450,10,49000,11.38,2,7.8,27,19,fals,7.12,1814.75,2014 + 36 months,MORTGAGE,credit_card,GA,Not Verified,INDIVIDUAL,15000,5,140000,8.4,0,35.8,26,24,fals,6.99,1501.16,2014 + 36 months,RENT,debt_consolidation,CA,Verified,INDIVIDUAL,17475,9,53778,16.74,1,89.8,25,11,tru,18.75,-7477.57,2013 diff --git a/src/test/resources/xgboost-tracker.properties b/src/test/resources/xgboost-tracker.properties new file mode 100644 index 00000000..a44eec83 --- /dev/null +++ b/src/test/resources/xgboost-tracker.properties @@ -0,0 +1,2 @@ +#This is needed for Xgboost rabbit tracker to bind local ip address when running in local mode. 
+host-ip=0.0.0.0 \ No newline at end of file diff --git a/src/test/scala/com/databricks/labs/automl/AbstractUnitSpec.scala b/src/test/scala/com/databricks/labs/automl/AbstractUnitSpec.scala new file mode 100644 index 00000000..8002092c --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/AbstractUnitSpec.scala @@ -0,0 +1,394 @@ +package com.databricks.labs.automl + +import java.util.UUID + +import com.databricks.labs.automl.inference.InferencePayload +import com.databricks.labs.automl.params.ConfusionOutput +import com.databricks.labs.automl.pipeline.{ + DropColumnsTransformer, + ZipRegisterTempTransformer +} +import com.databricks.labs.automl.utils.SchemaUtils +import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.junit.runner.RunWith +import org.scalatest._ + +import scala.collection.mutable.ArrayBuffer + +@RunWith(classOf[org.scalatestplus.junit.JUnitRunner]) +abstract class AbstractUnitSpec + extends FlatSpec + with Matchers + with OptionValues + with Inside + with Inspectors + +object AutomationUnitTestsUtil { + + lazy val sparkSession: SparkSession = SparkSession + .builder() + .master("local[*]") + .appName("providentiaml-unit-tests") + .getOrCreate() + + sparkSession.sparkContext.setLogLevel("ERROR") + + def convertCsvToDf(csvPath: String): DataFrame = { + sparkSession.read + .format("csv") + .option("header", true) + .option("inferSchema", true) + .load(getClass.getResource(csvPath).getPath) + } + + def getAdultDf(): DataFrame = { + import sparkSession.implicits._ + val adultDf = convertCsvToDf("/adult_data.csv") + + var adultDfCleaned = adultDf + for (colName <- adultDf.columns) { + adultDfCleaned = adultDfCleaned + .withColumn( + colName.split("\\s+").mkString + "_trimmed", + trim(col(colName)) + ) + 
 .drop(colName) + } + adultDfCleaned + .withColumn("label", when($"class_trimmed" === "<=50K", 0).otherwise(1)) + .drop("class_trimmed") + } + + def assertConfusionOutput(confusionOutput: ConfusionOutput): Unit = { + assert( + confusionOutput != null, + "should not have returned null confusion output" + ) + assert( + confusionOutput.confusionData != null, + "should not have returned null confusion output data" + ) + assert( + confusionOutput.predictionData != null, + "should not have returned null prediction data" + ) + assert( + confusionOutput.confusionData.count() > 0, + "should have more than 0 rows for confusion data" + ) + assert( + confusionOutput.predictionData.count() > 0, + "should have more than 0 rows for prediction data" + ) + } + + def assertPredOutput(inputCount: Long, predictCount: Long): Unit = { + assert(inputCount == predictCount, "count should have matched") + } + + def getSerializablesToTmpLocation(): String = { + System.getProperty("java.io.tmpdir") + "/" + UUID + .randomUUID() + .toString + "/automl" + } + + def getRandomForestConfig(inputDataset: DataFrame, + evolutionStrategy: String): AutomationRunner = { + val rfBoundaries = Map( + "numTrees" -> Tuple2(50.0, 1000.0), + "maxBins" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 0.075), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ) + new AutomationRunner(inputDataset) + .setModelingFamily("RandomForest") + .setLabelCol("label") + .setFeaturesCol("features") + .naFillOn() + .varianceFilterOn() + .outlierFilterOff() + .pearsonFilterOff() + .covarianceFilterOff() + .oneHotEncodingOn() + .scalingOff() + .setStandardScalerMeanFlagOff() + .setStandardScalerStdDevFlagOff() + .mlFlowLoggingOff() + .mlFlowLogArtifactsOff() + .autoStoppingOff() + .setFilterPrecision(0.9) + .setParallelism(20) + .setKFold(1) + .setTrainPortion(0.70) + .setTrainSplitMethod("stratified") + .setFirstGenerationGenePool(5) + .setNumberOfGenerations(2) + .setNumberOfParentsToRetain(2) +
.setNumberOfMutationsPerGeneration(2) + .setGeneticMixing(0.8) + .setGenerationalMutationStrategy("fixed") + .setScoringMetric("f1") + .setFeatureImportanceCutoffType("count") + .setFeatureImportanceCutoffValue(12.0) + .setEvolutionStrategy(evolutionStrategy) + .setInferenceConfigSaveLocation( + AutomationUnitTestsUtil.getSerializablesToTmpLocation() + ) + .setNumericBoundaries(rfBoundaries) + } + + def getLogisticRegressionConfig( + inputDataset: DataFrame, + evolutionStrategy: String + ): AutomationRunner = { + new AutomationRunner(inputDataset) + .setModelingFamily("LogisticRegression") + .setLabelCol("label") + .setFeaturesCol("features") + .naFillOn() + .varianceFilterOn() + .outlierFilterOff() + .pearsonFilterOff() + .covarianceFilterOff() + .oneHotEncodingOff() + .scalingOn() + .setStandardScalerMeanFlagOn() + .setStandardScalerStdDevFlagOff() + .mlFlowLoggingOff() + .mlFlowLogArtifactsOff() + .autoStoppingOff() + .setFilterPrecision(0.9) + .setParallelism(8) + .setKFold(2) + .setTrainPortion(0.70) + .setFirstGenerationGenePool(5) + .setNumberOfGenerations(2) + .setNumberOfParentsToRetain(2) + .setNumberOfMutationsPerGeneration(2) + .setGeneticMixing(0.8) + .setGenerationalMutationStrategy("fixed") + .setFeatureImportanceCutoffType("count") + .setFeatureImportanceCutoffValue(12.0) + .setEvolutionStrategy(evolutionStrategy) + .setInferenceConfigSaveLocation( + AutomationUnitTestsUtil.getSerializablesToTmpLocation() + ) + .setTrainSplitMethod("kSample") + .setLabelBalanceMode("match") + + } + + def getXgBoostConfig(inputDataset: DataFrame, + evolutionStrategy: String): AutomationRunner = { + new AutomationRunner(inputDataset) + .setModelingFamily("XGBoost") + .setLabelCol("label") + .setFeaturesCol("features") + .naFillOn() + .varianceFilterOn() + .outlierFilterOff() + .pearsonFilterOff() + .covarianceFilterOn() + .oneHotEncodingOn() + .scalingOff() + .setStandardScalerMeanFlagOff() + .setStandardScalerStdDevFlagOff() + .mlFlowLoggingOff() + 
.mlFlowLogArtifactsOff() + .autoStoppingOff() + .setFilterPrecision(0.9) + .setParallelism(20) + .setKFold(4) + .setTrainPortion(0.70) + .setTrainSplitMethod("stratified") + .setScoringMetric("f1") + .setFirstGenerationGenePool(5) + .setNumberOfGenerations(2) + .setNumberOfParentsToRetain(2) + .setNumberOfMutationsPerGeneration(2) + .setGeneticMixing(0.8) + .setGenerationalMutationStrategy("fixed") + .setFeatureImportanceCutoffType("count") + .setFeatureImportanceCutoffValue(10.0) + .setEvolutionStrategy(evolutionStrategy) + .setInferenceConfigSaveLocation( + AutomationUnitTestsUtil.getSerializablesToTmpLocation() + ) + } + + def getMlpcConfig(inputDataset: DataFrame, + evolutionStrategy: String): AutomationRunner = { + new AutomationRunner(inputDataset) + .setModelingFamily("MLPC") + .setLabelCol("label") + .setFeaturesCol("features") + .naFillOn() + .varianceFilterOn() + .outlierFilterOff() + .pearsonFilterOff() + .covarianceFilterOn() + .oneHotEncodingOff() + .scalingOn() + .setStandardScalerMeanFlagOff() + .setStandardScalerStdDevFlagOff() + .mlFlowLoggingOff() + .mlFlowLogArtifactsOff() + .autoStoppingOff() + .setFilterPrecision(0.9) + .setParallelism(20) + .setKFold(5) + .setTrainPortion(0.70) + .setTrainSplitMethod("random") + .setScoringMetric("f1") + .setFirstGenerationGenePool(5) + .setNumberOfGenerations(2) + .setNumberOfParentsToRetain(2) + .setNumberOfMutationsPerGeneration(2) + .setGeneticMixing(0.8) + .setGenerationalMutationStrategy("fixed") + .setFeatureImportanceCutoffType("count") + .setFeatureImportanceCutoffValue(10.0) + .setEvolutionStrategy(evolutionStrategy) + .setInferenceConfigSaveLocation( + AutomationUnitTestsUtil.getSerializablesToTmpLocation() + ) + } + + def getProjectDir(): String = { + System.getProperty("user.dir") + } +} + +case class TestVars(df: DataFrame, + features: Array[String], + tempTableName: String, + labelCol: String, + featuresCol: String = "features") + +object PipelineTestUtils { + def getTestVars(): TestVars = { + 
TestVars( + AutomationUnitTestsUtil.getAdultDf(), + Array("age_trimmed", "workclass_trimmed", "fnlwgt_trimmed"), + "zipRegisterTempTransformer_1", + "label" + ) + } + + def addZipRegisterTmpTransformerStage( + labelCol: String, + featuresCol: Array[String] + ): PipelineStage = { + new ZipRegisterTempTransformer() + .setTempViewOriginalDatasetName(Identifiable.randomUID("zipWithId")) + .setLabelColumn(labelCol) + .setFeatureColumns(featuresCol) + } + + def buildFeaturesPipelineStages( + df: DataFrame, + labelCol: String, + dropColumns: Boolean = true, + ignoreCols: Array[String] = Array.empty + ): Array[_ <: PipelineStage] = { + + val fields = SchemaUtils.extractTypes( + df.select( + df.columns.filterNot(item => ignoreCols.contains(item)).map(col): _* + ), + labelCol + ) + val stringFields = fields.categoricalFields + val vectorizableFields = fields.numericFields.toArray + val dateFields = fields.dateFields.toArray + val timeFields = fields.timeFields.toArray + val booleanFields = fields.booleanFields.toArray + + val stages = new ArrayBuffer[PipelineStage] + + stringFields.foreach(columnName => { + stages += new StringIndexer() + .setInputCol(columnName) + .setOutputCol(SchemaUtils.generateStringIndexedColumn(columnName)) + .setHandleInvalid("keep") + }) + + stages += new DropColumnsTransformer().setInputCols(stringFields.toArray) + + val featureAssemblerInputCols: Array[String] = stringFields + .map(item => SchemaUtils.generateStringIndexedColumn(item)) + .toArray[String] ++ vectorizableFields + + stages += new VectorAssembler() + .setInputCols(featureAssemblerInputCols) + .setOutputCol("features") + + if (dropColumns) { + stages += new DropColumnsTransformer() + .setInputCols(featureAssemblerInputCols) + } + + stages.toArray + } + + def saveAndLoadPipeline(stages: Array[_ <: PipelineStage], + dataFrame: DataFrame, + pipelineName: String): PipelineModel = { + val pipelineSavePath = AutomationUnitTestsUtil + .getProjectDir() + "/target/pipeline-tests/" + 
pipelineName + val pipelineModel = new Pipeline().setStages(stages).fit(dataFrame) + pipelineModel.transform(dataFrame) + pipelineModel.write.overwrite().save(pipelineSavePath) + PipelineModel.load(pipelineSavePath) + } + + def saveAndLoadPipelineModel(pipelineModel: PipelineModel, + dataFrame: DataFrame, + pipelineName: String): PipelineModel = { + val pipelineSavePath = AutomationUnitTestsUtil + .getProjectDir() + "/target/pipeline-tests/" + pipelineName + pipelineModel.transform(dataFrame) + pipelineModel.write.overwrite().save(pipelineSavePath) + PipelineModel.load(pipelineSavePath) + } + + def getVectorizedFeatures(df: DataFrame, + labelCol: String, + ignoreCols: Array[String]): Array[String] = { + val fields = SchemaUtils.extractTypes( + df.select( + df.columns.filterNot(item => ignoreCols.contains(item)) map col: _* + ), + labelCol + ) + val stringFields = fields.categoricalFields + val vectorizableFields = fields.numericFields.toArray + val dateFields = fields.dateFields.toArray + val timeFields = fields.timeFields.toArray + val booleanFields = fields.booleanFields.toArray + + stringFields + .map(item => SchemaUtils.generateStringIndexedColumn(item)) + .toArray[String] ++ vectorizableFields + } + +} + +object InferenceUnitTestUtil { + def generateInferencePayload(): InferencePayload = { + val adultDataset = AutomationUnitTestsUtil.getAdultDf() + val adultDsColumns = adultDataset.columns; + { + new InferencePayload { + override def data: DataFrame = adultDataset + override def modelingColumns: Array[String] = Array("label") + override def allColumns: Array[String] = adultDsColumns + } + } + } +} diff --git a/src/test/scala/com/databricks/labs/automl/AutomationRunnerIT.scala b/src/test/scala/com/databricks/labs/automl/AutomationRunnerIT.scala new file mode 100644 index 00000000..07573a71 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/AutomationRunnerIT.scala @@ -0,0 +1,115 @@ +package com.databricks.labs.automl + +class AutomationRunnerIT 
extends AbstractUnitSpec { + + it should "return confusion report for Logistic Regression in batch evolution strategy" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = AutomationUnitTestsUtil.getLogisticRegressionConfig( + adultDfwithLabel, + "batch" + ) + val confusionOutput = fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + + it should "return confusion report for Random Forest in batch evolution strategy" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = + AutomationUnitTestsUtil.getRandomForestConfig(adultDfwithLabel, "batch") + val confusionOutput = fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + + it should "return confusion report for XgBoost in batch evolution strategy" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + adultDfwithLabel.show(1000) + val fullConfig = + AutomationUnitTestsUtil.getXgBoostConfig(adultDfwithLabel, "batch") + val confusionOutput = fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + + it should "return confusion report for Logistic Regression in continuous evolution strategy" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = AutomationUnitTestsUtil.getLogisticRegressionConfig( + adultDfwithLabel, + "continuous" + ) + val confusionOutput = fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + 
adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + + it should "return confusion report for Random Forest in continuous evolution strategy" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = AutomationUnitTestsUtil.getRandomForestConfig( + adultDfwithLabel, + "continuous" + ) + val confusionOutput = fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + + it should "return confusion report for XgBoost in continuous evolution strategy" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = + AutomationUnitTestsUtil.getXgBoostConfig(adultDfwithLabel, "continuous") + val confusionOutput = fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + + it should "return predictions with XgBoost" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = + AutomationUnitTestsUtil.getXgBoostConfig(adultDfwithLabel, "continuous") + val predictionRowsCount = + fullConfig.runWithPrediction().dataWithPredictions.count() + assert( + predictionRowsCount > 0, + "should have returned more than 0 rows for prediction output" + ) + assert( + predictionRowsCount == adultDfwithLabel.count(), + "should have same row count for input dataset and prediction dataset" + ) + } + + it should "return predictions with XgBoost when feature interaction mode is on" in { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = AutomationUnitTestsUtil + .getXgBoostConfig(adultDfwithLabel, "continuous") + .featureInteractionOn() + .setFeatureInteractionRetentionMode("all") + .setFeatureInteractionParallelism(2) + val confusionOutput = 
 fullConfig.runWithConfusionReport() + AutomationUnitTestsUtil.assertConfusionOutput(confusionOutput) + AutomationUnitTestsUtil.assertPredOutput( + adultDfwithLabel.count(), + confusionOutput.predictionData.count() + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/AutomationRunnerTest.scala b/src/test/scala/com/databricks/labs/automl/AutomationRunnerTest.scala new file mode 100644 index 00000000..604d008e --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/AutomationRunnerTest.scala @@ -0,0 +1,149 @@ +package com.databricks.labs.automl + +import org.apache.spark.sql.{AnalysisException, Row} + +class AutomationRunnerTest extends AbstractUnitSpec { + + "AutomationRunner" should "throw NullPointerException if it is instantiated with null constructor" in { + a[NullPointerException] should be thrownBy { + new AutomationRunner(null).runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with wrong evolution strategy" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + val fullConfig = + AutomationUnitTestsUtil.getXgBoostConfig(adultDfwithLabel, "err") + fullConfig.runWithConfusionReport() + } + } + + it should "throw AnalysisException with wrong label column" in { + a[org.apache.spark.sql.AnalysisException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setLabelCol("label_test") + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with wrong modeling family" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setModelingFamily("test") + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with wrong firstGenerationGenePool config" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = 
AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setFirstGenerationGenePool(1) + .runWithConfusionReport() + } + } + + it should "throw NullPointerException with empty input dataset with schema" in { + a[NullPointerException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner( + AutomationUnitTestsUtil.sparkSession.createDataFrame( + AutomationUnitTestsUtil.sparkSession.sparkContext.emptyRDD[Row], + adultDfwithLabel.schema + ) + ).runWithConfusionReport() + } + } + + it should "throw AnalysisException with empty input dataset with no schema" in { + a[AnalysisException] should be thrownBy { + new AutomationRunner(AutomationUnitTestsUtil.sparkSession.emptyDataFrame) + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with zero number of generations" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setNumberOfGenerations(0) + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with zero number of parents To retain" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setNumberOfParentsToRetain(0) + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with zero number of mutations per generation" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setNumberOfMutationsPerGeneration(0) + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with zero genetic mixing" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setGeneticMixing(0) + 
.runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with wrong generational mutation strategy" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setGenerationalMutationStrategy("err") + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with wrong feature importance cutoff type" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setFeatureImportanceCutoffType("err") + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with invalid continuousDiscretizerBucketCount" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setFeatureInteractionContinuousDiscretizerBucketCount(0) + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with invalid parallelism setting" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setFeatureInteractionParallelism(0) + .runWithConfusionReport() + } + } + + it should "throw IllegalArgumentException with invalid retention mode setting" in { + a[IllegalArgumentException] should be thrownBy { + val adultDfwithLabel = AutomationUnitTestsUtil.getAdultDf() + new AutomationRunner(adultDfwithLabel) + .setFeatureInteractionRetentionMode("err") + .runWithConfusionReport() + } + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/DiscreteTestDataGenerator.scala b/src/test/scala/com/databricks/labs/automl/DiscreteTestDataGenerator.scala new file mode 100644 index 00000000..498595eb --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/DiscreteTestDataGenerator.scala @@ -0,0 +1,1039 @@ 
+package com.databricks.labs.automl + +import com.databricks.labs.automl.utilities.{ + CardinalityFilteringTestSchema, + ClassifierSchema, + DataGeneratorUtilities, + FeatureCorrelationTestSchema, + FeatureInteractionSchema, + KSampleSchema, + ModelDetectionSchema, + NaFillTestSchema, + OutlierTestSchema, + PearsonRegressionTestSchema, + PearsonTestSchema, + RegressorSchema, + SanitizerSchema, + SanitizerSchemaRegressor, + VarianceTestSchema +} +import org.apache.spark.ml.feature.VectorAssembler +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ + +object DiscreteTestDataGenerator extends DataGeneratorUtilities { + + final private val MLFLOWID_START = 1 + final private val MLFLOWID_STEP = 1 + final private val MLFLOW_ID_MODE = "ascending" + + private def generateMlFlowID(rows: Int): Array[Int] = { + generateRepeatingIntData( + rows, + MLFLOWID_START, + MLFLOWID_STEP, + MLFLOW_ID_MODE, + rows + ) + } + + def generateNAFillData(rows: Int, naRate: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + val DOUBLES_START = 1.0 + val DOUBLES_STEP = 2.0 + val DOUBLES_MODE = "ascending" + val FLOAT_START = 0.0f + val FLOAT_STEP = 2.0f + val FLOAT_MODE = "descending" + val INT_START = 1 + val INT_STEP = 2 + val INT_MODE = "ascending" + val ORD_START = 5 + val ORD_STEP = 4 + val ORD_MODE = "descending" + val ORD_DISTINCT_COUNT = 6 + val STRING_DISTINCT_COUNT = 5 + val DATE_YEAR = 2019 + val DATE_MONTH = 7 + val DATE_DAY = 25 + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "random" + val LABEL_DISTINCT_COUNT = 4 + + import spark.implicits._ + + val targetNaModulus = rows / naRate + + val doublesSpace = generateDoublesDataWithNulls( + rows, + DOUBLES_START, + DOUBLES_STEP, + DOUBLES_MODE, + targetNaModulus, + 0 + ) + + val floatSpace = generateFloatsDataWithNulls( + rows, + FLOAT_START, + FLOAT_STEP, + FLOAT_MODE, + targetNaModulus, + 1 + ) + + val intSpace = generateIntDataWithNulls( + rows, + INT_START, + 
INT_STEP, + INT_MODE, + targetNaModulus, + 2 + ) + val ordinalIntSpace = generateRepeatingIntDataWithNulls( + rows, + ORD_START, + ORD_STEP, + ORD_MODE, + ORD_DISTINCT_COUNT, + targetNaModulus, + 3 + ) + val stringSpace = + generateStringDataWithNulls( + rows, + STRING_DISTINCT_COUNT, + targetNaModulus, + 4 + ) + val booleanSpace = generateBooleanDataWithNulls(rows, targetNaModulus, 5) + val daysSpace = generateDatesWithNulls( + rows, + DATE_YEAR, + DATE_MONTH, + DATE_DAY, + targetNaModulus, + 6 + ) + val labelData = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield + NaFillTestSchema( + doublesSpace(i), + floatSpace(i), + intSpace(i), + ordinalIntSpace(i), + stringSpace(i), + booleanSpace(i), + daysSpace(i), + labelData(i), + mlFlowIdData(i) + ) + val dfConversion = seqData + .toDF() + .withColumn("dateData", to_date(col("dateData"), "yyyy-MM-dd")) + + reassignToNulls(dfConversion) + + } + + def generateModelDetectionData(rows: Int, uniqueLabels: Int): DataFrame = { + + val FEATURE_START = 2.0 + val FEATURE_STEP = 1.0 + val FEATURE_MODE = "ascending" + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "random" + + val spark = AutomationUnitTestsUtil.sparkSession + + val featureData = + generateDoublesData(rows, FEATURE_START, FEATURE_STEP, FEATURE_MODE) + val labelData = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + uniqueLabels + ).map(_.toDouble) + + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield ModelDetectionSchema(featureData(i), labelData(i), mlFlowIdData(i)) + + import spark.implicits._ + + seqData.toDF() + + } + + def generateOutlierData(rows: Int, uniqueLabels: Int): DataFrame = { + + val EXPONENTIAL_START = 0.0 + val EXPONENTIAL_STEP = 2.0 + val EXPONENTIAL_MODE = "ascending" + val EXPONENTIAL_POWER = 3 + val LINEAR_START 
= 0.0 + val LINEAR_STEP = 1.0 + val LINEAR_MODE = "descending" + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "random" + val LABEL_DISTINCT = 5 + val EXPONENTIAL_TAIL_START = 1.0 + val EXPONENTIAL_TAIL_STEP = 5.0 + val EXPONENTIAL_TAIL_MODE = "ascending" + val EXPONENTIAL_TAIL_POWER = 2 + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + val a = generateExponentialData( + rows, + EXPONENTIAL_START, + EXPONENTIAL_STEP, + EXPONENTIAL_MODE, + EXPONENTIAL_POWER + ) + val b = generateDoublesData(rows, LINEAR_START, LINEAR_STEP, LINEAR_MODE) + val c = generateTailedExponentialData( + rows, + EXPONENTIAL_TAIL_START, + EXPONENTIAL_TAIL_STEP, + EXPONENTIAL_TAIL_MODE, + EXPONENTIAL_TAIL_POWER + ) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT + ) + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield OutlierTestSchema(a(i), b(i), c(i), label(i), mlFlowIdData(i)) + + seqData.toDF() + + } + + def generateVarianceFilteringData(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val DOUBLE_SERIES_START = 1.0 + val DOUBLE_SERIES_STEP = 1.0 + val DOUBLE_SERIES_MODE = "ascending" + val REPEATING_DOUBLE_VALUE = 42.42 + val REPEATING_INT_VALUE = 9 + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "random" + val LABEL_DISTINCT_COUNT = 7 + + val a = generateDoublesData( + rows, + DOUBLE_SERIES_START, + DOUBLE_SERIES_STEP, + DOUBLE_SERIES_MODE + ) + val b = generateFibonacciData(rows) + val c = generateStaticDoubleSeries(rows, REPEATING_DOUBLE_VALUE) + val d = generateStaticIntSeries(rows, REPEATING_INT_VALUE) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield + VarianceTestSchema(a(i), b(i), c(i), d(i), 
label(i), mlFlowIdData(i)) + + seqData.toDF() + } + + def generatePearsonFilteringData(rows: Int): (DataFrame, Array[String]) = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val POSITIVE_CORR_1_START = 1 + val POSITIVE_CORR_1_STEP = 1 + val POSITIVE_CORR_1_MODE = "ascending" + val POSITIVE_CORR_1_DISTINCT_COUNT = 4 + + val POSITIVE_CORR_2_START = 1 + val POSITIVE_CORR_2_STEP = 1 + val POSITIVE_CORR_2_MODE = "descending" + val POSITIVE_CORR_2_DISTINCT_COUNT = 4 + + val NOFILTER_1_START = 1.0 + val NOFILTER_1_STEP = 1.0 + val NOFILTER_1_MODE = "random" + + val NOFILTER_2_START = 1 + val NOFILTER_2_STEP = 1 + val NOFILTER_2_MODE = "random" + val NOFILTER_2_DISTINCT_COUNT = 7 + + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 4 + + val positiveCorr1 = generateRepeatingIntData( + rows, + POSITIVE_CORR_1_START, + POSITIVE_CORR_1_STEP, + POSITIVE_CORR_1_MODE, + POSITIVE_CORR_1_DISTINCT_COUNT + ) + val positiveCorr2 = generateRepeatingIntData( + rows, + POSITIVE_CORR_2_START, + POSITIVE_CORR_2_STEP, + POSITIVE_CORR_2_MODE, + POSITIVE_CORR_2_DISTINCT_COUNT + ) + val noFilter1 = generateDoublesData( + rows, + NOFILTER_1_START, + NOFILTER_1_STEP, + NOFILTER_1_MODE + ) + val noFilter2 = generateRepeatingIntData( + rows, + NOFILTER_2_START, + NOFILTER_2_STEP, + NOFILTER_2_MODE, + NOFILTER_2_DISTINCT_COUNT + ) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield + PearsonTestSchema( + positiveCorr1(i), + positiveCorr2(i), + noFilter1(i), + noFilter2(i), + label(i), + mlFlowIdData(i) + ) + + val rawData = seqData.toDF() + + val featureCols = + rawData.schema.names + .filterNot(x => x.contains("label") || x.contains("automl_internal_id")) + + val assembler = + new 
VectorAssembler().setInputCols(featureCols).setOutputCol("features") + + (assembler.transform(rawData), featureCols) + + } + + def generatePearsonRegressionFilteringData( + rows: Int + ): (DataFrame, Array[String]) = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val POSITIVE_CORR_1_START = 1 + val POSITIVE_CORR_1_STEP = 1 + val POSITIVE_CORR_1_MODE = "ascending" + + val POSITIVE_CORR_2_START = 1 + val POSITIVE_CORR_2_STEP = 1 + val POSITIVE_CORR_2_MODE = "descending" + + val POSITIVE_CORR_3_START = 1 + val POSITIVE_CORR_3_STEP = 2 + val POSITIVE_CORR_3_MODE = "ascending" + + val NOFILTER_1_START = 1.0 + val NOFILTER_1_STEP = 1.0 + val NOFILTER_1_MODE = "random" + + val NOFILTER_2_START = 1 + val NOFILTER_2_STEP = 1 + val NOFILTER_2_MODE = "random" + val NOFILTER_2_DISTINCT_COUNT = 7 + + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + + val positiveCorr1 = generateDoublesData( + rows, + POSITIVE_CORR_1_START, + POSITIVE_CORR_1_STEP, + POSITIVE_CORR_1_MODE + ) + val positiveCorr2 = generateDoublesData( + rows, + POSITIVE_CORR_2_START, + POSITIVE_CORR_2_STEP, + POSITIVE_CORR_2_MODE + ).map(x => x * -1.0) + val positiveCorr3 = generateIntData( + rows, + POSITIVE_CORR_3_START, + POSITIVE_CORR_3_STEP, + POSITIVE_CORR_3_MODE + ) + val noFilter1 = generateDoublesData( + rows, + NOFILTER_1_START, + NOFILTER_1_STEP, + NOFILTER_1_MODE + ) + val noFilter2 = generateRepeatingIntData( + rows, + NOFILTER_2_START, + NOFILTER_2_STEP, + NOFILTER_2_MODE, + NOFILTER_2_DISTINCT_COUNT + ) + val label = generateDoublesData(rows, LABEL_START, LABEL_STEP, LABEL_MODE) + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield + PearsonRegressionTestSchema( + positiveCorr1(i), + positiveCorr2(i), + positiveCorr3(i), + noFilter1(i), + noFilter2(i), + label(i), + mlFlowIdData(i) + ) + + val rawData = seqData.toDF() + + val featureCols = + rawData.schema.names + .filterNot(x => 
x.contains("label") || x.contains("automl_internal_id")) + + val assembler = + new VectorAssembler().setInputCols(featureCols).setOutputCol("features") + + (assembler.transform(rawData), featureCols) + + } + + def generateFeatureCorrelationData(rows: Int): (DataFrame, Array[String]) = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + /** + * A - 100% correlation in linear series + * B - 100% correlation in reverse ordering + * C - 2 of 3 correlated + * D - repeated categorical no correlation + */ + val A1_START = 1.0 + val A1_STEP = 1.0 + val A1_MODE = "ascending" + val A2_START = 1.0 + val A2_STEP = 1.0 + val A2_MODE = "ascending" + val B1_START = 0 + val B1_STEP = 5 + val B1_MODE = "ascending" + val B2_START = 0 + val B2_STEP = 5 + val B2_MODE = "ascending" + val C1_START = 1.0 + val C1_STEP = 3.0 + val C1_MODE = "descending" + val C2_START = 1.0 + val C2_STEP = 3.0 + val C2_MODE = "ascending" + val C2_DISTINCT_COUNT = 5 + val C3_START = 1.0 + val C3_STEP = 3.0 + val C3_MODE = "descending" + val C3_DISTINCT_COUNT = 5 + val D1_START = 100L + val D1_STEP = 50L + val D1_MODE = "random" + val D2_START = 10L + val D2_STEP = 1L + val D2_MODE = "random" + val D2_DISTINCT_COUNT = 10 + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 4 + + val a1 = generateDoublesData(rows, A1_START, A1_STEP, A1_MODE) + val a2 = generateDoublesData(rows, A2_START, A2_STEP, A2_MODE) + val b1 = generateIntData(rows, B1_START, B1_STEP, B1_MODE) + val b2 = generateIntData(rows, B2_START, B2_STEP, B2_MODE).map(x => x * -1) + val c1 = generateDoublesData(rows, C1_START, C1_STEP, C1_MODE) + val c2 = generateRepeatingDoublesData( + rows, + C2_START, + C2_STEP, + C2_MODE, + C2_DISTINCT_COUNT + ) + val c3 = generateRepeatingDoublesData( + rows, + C3_START, + C3_STEP, + C3_MODE, + C3_DISTINCT_COUNT + ) + val d1 = generateLongData(rows, D1_START, D1_STEP, D1_MODE) + val d2 = generateRepeatingLongData( + rows, + 
D2_START, + D2_STEP, + D2_MODE, + D2_DISTINCT_COUNT + ) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + val mlFlowIdData = generateMlFlowID(rows) + + val seqData = for (i <- 0 until rows) + yield + FeatureCorrelationTestSchema( + a1(i), + a2(i), + b1(i), + b2(i), + c1(i), + c2(i), + c3(i), + d1(i), + d2(i), + label(i), + mlFlowIdData(i) + ) + + val rawData = seqData.toDF() + + val featureCols = + rawData.schema.names + .filterNot(x => x.contains("label") || x.contains("automl_internal_id")) + + (rawData, featureCols) + + } + + def generateCardinalityFilteringData( + rows: Int + ): (DataFrame, Array[String]) = { + + val spark = AutomationUnitTestsUtil.sparkSession + + val CATEGORICAL_FIELDS = Array("b", "c", "d") + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 5.0 + val A_MODE = "ascending" + val B_START = 1 + val B_STEP = 1 + val B_MODE = "ascending" + val B_DISTINCT_COUNT = 3 + val C_START = 10L + val C_STEP = 10L + val C_MODE = "descending" + val C_DISTINCT_COUNT = 10 + val D_DISTINCT_COUNT = 55 + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = + generateRepeatingIntData(rows, B_START, B_STEP, B_MODE, B_DISTINCT_COUNT) + val c = + generateRepeatingLongData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + + val seqData = for (i <- 0 until rows) + yield CardinalityFilteringTestSchema(a(i), b(i), c(i), d(i)) + val data = seqData.toDF() + + (data, CATEGORICAL_FIELDS) + + } + + def generateSanitizerData(rows: Int, modelType: String): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 2.0 + val A_MODE = "descending" + val B_START = 1 + val B_STEP = 1 + val B_MODE = "ascending" + val B_DISTINCT_COUNT = 3 + val C_START = 100L + val C_STEP = 100L + val C_MODE = "random" + val C_DISTINCT_COUNT = 500 + val D_DISTINCT_COUNT = 12 
+ val E_START = 1000 + val E_STEP = 10 + val E_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 4 + val LABEL_REGRESSION_START = 1.0 + val LABEL_REGRESSION_STEP = 3.0 + val LABEL_REGRESSION_MODE = "ascending" + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = + generateRepeatingIntData(rows, B_START, B_STEP, B_MODE, B_DISTINCT_COUNT) + val c = + generateRepeatingLongData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + val e = generateIntData(rows, E_START, E_STEP, E_MODE) + val f = generateBooleanData(rows) + val mlflow = generateMlFlowID(rows) + + val label = generateStringData(rows, LABEL_DISTINCT_COUNT) + val labelRegression = generateDoublesData( + rows, + LABEL_REGRESSION_START, + LABEL_REGRESSION_STEP, + LABEL_REGRESSION_MODE + ) + + val output = modelType match { + case "classifier" => + val seqData = for (i <- 0 until rows) + yield + SanitizerSchema( + a(i), + b(i), + c(i), + d(i), + e(i), + f(i), + label(i), + mlflow(i) + ) + seqData.toDF() + case "regressor" => + val seqData = for (i <- 0 until rows) + yield + SanitizerSchemaRegressor( + a(i), + b(i), + c(i), + d(i), + e(i), + f(i), + labelRegression(i), + mlflow(i) + ) + seqData.toDF() + } + + output + } + + def generateKSampleData(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 1.0 + val A_MODE = "ascending" + val B_START = 1 + val B_STEP = 1 + val B_MODE = "descending" + val B_DISTINCT_COUNT = 4 + val C_START = 10.0 + val C_STEP = 10.0 + val C_MODE = "ascending" + val C_DISTINCT_COUNT = 6 + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 3 + val mlflow = generateMlFlowID(rows) + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = + generateRepeatingIntData(rows, B_START, B_STEP, B_MODE, B_DISTINCT_COUNT) + val c = + generateDoublesBlocks(rows, C_START, C_STEP, 
C_MODE, C_DISTINCT_COUNT) + val label = generateIntegerBlocksSkewed( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + + val seqData = for (i <- 0 until rows) + yield KSampleSchema(a(i), b(i), c(i), label(i), mlflow(i)) + + seqData.toDF() + + } + + def generateFeatureInteractionData(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 0.0 + val A_STEP = 0.1 + val A_MODE = "ascending" + val B_START = 1.0 + val B_STEP = 1.0 + val B_MODE = "ascending" + val C_START = 1 + val C_STEP = 2 + val C_MODE = "ascending" + val C_DISTINCT_COUNT = 3 + val D_DISTINCT_COUNT = 9 + val E_START = 1 + val E_STEP = 1 + val E_MODE = "descending" + val E_DISTINCT_COUNT = 5 + val F_DISTINCT_COUNT = 7 + val LABEL_START = 1 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 4 + val mlflow = generateMlFlowID(rows) + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = generateDoublesData(rows, B_START, B_STEP, B_MODE) + val c = + generateRepeatingIntData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + val e = + generateRepeatingIntData(rows, E_START, E_STEP, E_MODE, E_DISTINCT_COUNT) + val f = generateStringData(rows, F_DISTINCT_COUNT) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + + val seqData = for (i <- 0 until rows) + yield + FeatureInteractionSchema( + a(i), + b(i), + c(i), + d(i), + e(i), + f(i), + label(i), + mlflow(i) + ) + + seqData.toDF() + + } + + def generateBinaryClassificationData(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 1.0 + val A_MODE = "ascending" + val B_START = 0.0 + val B_STEP = 5.0 + val B_MODE = "descending" + val B_NULL_RATE = 13 + val B_NULL_OFFSET = 3 + val C_START = 1 + val C_STEP = 1 + 
val C_MODE = "ascending" + val C_DISTINCT_COUNT = 4 + val D_DISTINCT_COUNT = 51 + val E_START = 10.0 + val E_STEP = 10.0 + val E_MODE = "ascending" + val E_DISTINCT_COUNT = 7 + val LABEL_START = 0 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 2 + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = generateDoublesDataWithNulls( + rows, + B_START, + B_STEP, + B_MODE, + B_NULL_RATE, + B_NULL_OFFSET + ) + val c = + generateRepeatingIntData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + val e = generateRepeatingDoublesData( + rows, + E_START, + E_STEP, + E_MODE, + E_DISTINCT_COUNT + ) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + + val seqData = for (i <- 0 until rows) + yield ClassifierSchema(a(i), b(i), c(i), d(i), e(i), label(i)) + + seqData.toDF() + + } + + def generateMultiClassClassificationData(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 1.0 + val A_MODE = "ascending" + val B_START = 0.0 + val B_STEP = 5.0 + val B_MODE = "descending" + val B_NULL_RATE = 13 + val B_NULL_OFFSET = 3 + val C_START = 1 + val C_STEP = 1 + val C_MODE = "ascending" + val C_DISTINCT_COUNT = 5 + val D_DISTINCT_COUNT = 42 + val E_START = 10.0 + val E_STEP = 10.0 + val E_MODE = "ascending" + val E_DISTINCT_COUNT = 7 + val LABEL_START = 0 + val LABEL_STEP = 1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = 3 + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = generateDoublesDataWithNulls( + rows, + B_START, + B_STEP, + B_MODE, + B_NULL_RATE, + B_NULL_OFFSET + ) + val c = + generateRepeatingIntData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + val e = generateRepeatingDoublesData( + rows, + E_START, + E_STEP, + E_MODE, + 
E_DISTINCT_COUNT + ) + val label = generateRepeatingIntData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + + val seqData = for (i <- 0 until rows) + yield ClassifierSchema(a(i), b(i), c(i), d(i), e(i), label(i)) + + seqData.toDF() + + } + + def generateRegressionData(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 1.0 + val A_MODE = "ascending" + val B_START = 0.0 + val B_STEP = 5.0 + val B_MODE = "descending" + val B_NULL_RATE = 13 + val B_NULL_OFFSET = 3 + val C_START = 1 + val C_STEP = 1 + val C_MODE = "ascending" + val C_DISTINCT_COUNT = 5 + val D_DISTINCT_COUNT = 4 + val E_START = 10.0 + val E_STEP = 10.0 + val E_MODE = "ascending" + val E_DISTINCT_COUNT = 7 + val LABEL_START = 0.0 + val LABEL_STEP = 0.1 + val LABEL_MODE = "random" + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = generateDoublesDataWithNulls( + rows, + B_START, + B_STEP, + B_MODE, + B_NULL_RATE, + B_NULL_OFFSET + ) + val c = + generateRepeatingIntData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + val e = generateRepeatingDoublesData( + rows, + E_START, + E_STEP, + E_MODE, + E_DISTINCT_COUNT + ) + val label = generateDoublesData(rows, LABEL_START, LABEL_STEP, LABEL_MODE) + + val seqData = for (i <- 0 until rows) + yield RegressorSchema(a(i), b(i), c(i), d(i), e(i), label(i)) + + seqData.toDF() + + } + + def generateRegressionDataRepeating(rows: Int): DataFrame = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + val A_START = 1.0 + val A_STEP = 1.0 + val A_MODE = "ascending" + val B_START = 0.0 + val B_STEP = 5.0 + val B_MODE = "descending" + val B_NULL_RATE = 13 + val B_NULL_OFFSET = 3 + val C_START = 1 + val C_STEP = 1 + val C_MODE = "ascending" + val C_DISTINCT_COUNT = 5 + val D_DISTINCT_COUNT = 4 + val E_START = 10.0 + val E_STEP = 10.0 + val 
E_MODE = "ascending" + val E_DISTINCT_COUNT = 7 + val LABEL_START = 0.0 + val LABEL_STEP = 0.1 + val LABEL_MODE = "ascending" + val LABEL_DISTINCT_COUNT = rows / 50 + + val a = generateDoublesData(rows, A_START, A_STEP, A_MODE) + val b = generateDoublesDataWithNulls( + rows, + B_START, + B_STEP, + B_MODE, + B_NULL_RATE, + B_NULL_OFFSET + ) + val c = + generateRepeatingIntData(rows, C_START, C_STEP, C_MODE, C_DISTINCT_COUNT) + val d = generateStringData(rows, D_DISTINCT_COUNT) + val e = generateRepeatingDoublesData( + rows, + E_START, + E_STEP, + E_MODE, + E_DISTINCT_COUNT + ) + val label = generateRepeatingDoublesData( + rows, + LABEL_START, + LABEL_STEP, + LABEL_MODE, + LABEL_DISTINCT_COUNT + ) + + val seqData = for (i <- 0 until rows) + yield RegressorSchema(a(i), b(i), c(i), d(i), e(i), label(i)) + + seqData.toDF() + + } + + def generateDecayArray(targetCount: Int): Array[Double] = { + generatePeriodicData(targetCount, 0.5, "decay") + } + def generateBiModalArray(targetCount: Int): Array[Double] = { + generatePeriodicData(targetCount, 0.5, "bimodal") + } + def generateChaoticArray(targetCount: Int): Array[Double] = { + generatePeriodicData(targetCount, 0.5, "chaotic") + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/ManualRunnerTest.scala b/src/test/scala/com/databricks/labs/automl/ManualRunnerTest.scala new file mode 100644 index 00000000..0f58fbb2 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/ManualRunnerTest.scala @@ -0,0 +1,40 @@ +package com.databricks.labs.automl + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.params.{ConfusionOutput, DataGeneration} + +class ManualRunnerTest extends AbstractUnitSpec { + + + "ManualRunner " should "throw NullPointerException if it is instantiated with null constructor" in { + a [NullPointerException] should be thrownBy { + new ManualRunner(null) + } + } + + it should "throw NullPointerException if it is instantiated with wrong object state " in { + 
a[NullPointerException] should be thrownBy { + new ManualRunner(DataGeneration(null, null, null)).run() + } + } + + it should "execute ManualRunner without any exceptions" in { + val adultDataset = AutomationUnitTestsUtil.getAdultDf() + val confusionOutput: ConfusionOutput = new ManualRunner(new DataPrep(adultDataset).prepData()) + .setScoringMetric("areaUnderROC") + .setNumberOfGenerations(2) + .setFirstGenerationGenePool(5) + .mlFlowLoggingOff() + .mlFlowLogArtifactsOff() + .setInferenceConfigSaveLocation(AutomationUnitTestsUtil.getSerializablesToTmpLocation()) + .runWithConfusionReport() + + assert(confusionOutput != null, "confusionOutput should not have been null") + assert(confusionOutput.confusionData != null, "confusion data should not have been null") + assert(confusionOutput.predictionData != null, "prediction data should not have been null") + assert(confusionOutput.predictionData.count() == adultDataset.count(), "prediction dataset count should have matched the original dataset's count") + } + + + +} diff --git a/src/test/scala/com/databricks/labs/automl/executor/DataPrepTest.scala b/src/test/scala/com/databricks/labs/automl/executor/DataPrepTest.scala new file mode 100644 index 00000000..84b66276 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/executor/DataPrepTest.scala @@ -0,0 +1,111 @@ +package com.databricks.labs.automl.executor + +import com.databricks.labs.automl.params.DataGeneration +import com.databricks.labs.automl.utilities.ValidationUtilities +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil} +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.types.{DataTypes, StructType} + +class DataPrepTest extends AbstractUnitSpec { + + it should "throw NullPointerException for passing null dataset" in { + a[NullPointerException] should be thrownBy { + new DataPrep(null).prepData() + } + } + + it should "throw AnalysisException for passing empty dataset" in { + a[AnalysisException] should be
thrownBy { + new DataPrep(AutomationUnitTestsUtil.sparkSession.emptyDataFrame) + .prepData() + } + } + + it should "return valid DataGeneration for preparing data" in { + val adultDataset = AutomationUnitTestsUtil.getAdultDf() + val dataGeneration: DataGeneration = new DataPrep(adultDataset).prepData() + + val EXPECTED_MODEL_TYPE = "classifier" + + val EXPECTED_FIELDS = Array( + "age_trimmed_si", + "workclass_trimmed_si", + "education_trimmed_si", + "education-num_trimmed_si", + "marital-status_trimmed_si", + "occupation_trimmed_si", + "relationship_trimmed_si", + "race_trimmed_si", + "sex_trimmed_si", + "capital-gain_trimmed_si", + "capital-loss_trimmed_si", + "hours-per-week_trimmed_si", + "native-country_trimmed_si", + "features", + "label" + ) + + println( + s"Data Generation fields: ${dataGeneration.data.schema.names.mkString(", ")}" + ) + + assert( + dataGeneration != null, + "DataPrep should not have returned null for a valid input dataset" + ) + assert( + dataGeneration.data != null, + "DataPrep should not have returned null dataset" + ) + assert( + dataGeneration.fields != null, + "DataPrep should not have returned null fields" + ) + assert( + dataGeneration.modelType != null, + "DataPrep should not have returned null model type" + ) + assert( + dataGeneration.data.count() == adultDataset.count(), + "DataPrep should not have returned different rows for input Dataset" + ) + assert( + dataGeneration.fields.length == adultDataset.columns.length, + "DataPrep should not have changed number of columns" + ) + + assert( + dataGeneration.modelType == EXPECTED_MODEL_TYPE, + "Should have detected correct model type" + ) + + ValidationUtilities.fieldCreationAssertion( + EXPECTED_FIELDS, + dataGeneration.data.schema.names + ) + + ValidationUtilities.fieldCreationAssertion( + EXPECTED_FIELDS, + dataGeneration.fields + ) + + } + + it should "return valid schema for preparing data" in { + val adultDataset = AutomationUnitTestsUtil.getAdultDf() + val newSchema: StructType 
= + new DataPrep(adultDataset).prepData().data.schema + val originalSchema: StructType = adultDataset.schema + + for (field <- newSchema) { + val fieldName = field.name + if (originalSchema.fieldNames.contains(fieldName)) { + assert( + field.dataType == DataTypes.DoubleType || field.dataType == DataTypes.IntegerType, + s"column $fieldName should have been indexed properly" + ) + } + } + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/exploration/tools/OneDimStatsTest.scala b/src/test/scala/com/databricks/labs/automl/exploration/tools/OneDimStatsTest.scala new file mode 100644 index 00000000..63a4df26 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/exploration/tools/OneDimStatsTest.scala @@ -0,0 +1,147 @@ +package com.databricks.labs.automl.exploration.tools + +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} + +class OneDimStatsTest extends AbstractUnitSpec { + + final val ALPHA = 0.05 + + final val PERFECT_NORMAL_DATA = Array(2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, + 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, + 8.0, 8.0, 8.0, 9.0, 9.0, 9.0, 9.0, 10.0, 10.0, 10.0, 11.0, 11.0, 12.0) + + it should "calculate stats correctly for a decaying right skewed distribution" in { + + val EXPECTED_MEAN = -0.6548573547866678 + val EXPECTED_GEOM = 0.0 + val EXPECTED_VARIANCE = 9.950225427633128E-5 + val EXPECTED_SEMI_VARIANCE = 2.4970686878096203E-5 + val EXPECTED_STDDEV = 0.009975081667652215 + val EXPECTED_SKEW = 7.75779394110764 + val EXPECTED_KURTOSIS = 132.06335325374056 + val EXPECTED_KURTOSIS_TYPE = "Leptokurtic" + val EXPECTED_SKEW_TYPE = "Asymmetrical Right Tailed" + + val data = DiscreteTestDataGenerator.generateDecayArray(500).map(_ * -1) + + val result = OneDimStats.evaluate(data) + println(result) + assert(result.mean == EXPECTED_MEAN, "mean is not correct") + assert(result.geomMean == EXPECTED_GEOM, "geometric mean is not correct") + assert(result.variance == 
EXPECTED_VARIANCE, "variance is not correct") + assert( + result.semiVariance == EXPECTED_SEMI_VARIANCE, + "semi variance is not correct" + ) + assert(result.stddev == EXPECTED_STDDEV, "stddev is not correct") + assert(result.skew == EXPECTED_SKEW, "skew is not correct") + assert(result.kurtosis == EXPECTED_KURTOSIS, "kurtosis is not correct") + assert( + result.kurtosisType == EXPECTED_KURTOSIS_TYPE, + "kurtosis type is not correct" + ) + assert(result.skewType == EXPECTED_SKEW_TYPE, "skew type is not correct") + + } + + it should "calculate stats correctly for a chaotic distribution" in { + + val EXPECTED_MEAN = 0.6475713954442238 + val EXPECTED_GEOM = 0.6088821996450652 + val EXPECTED_VARIANCE = 0.04587715724784365 + val EXPECTED_SEMI_VARIANCE = 0.024330243806613878 + val EXPECTED_STDDEV = 0.21418953580379144 + val EXPECTED_SKEW = -0.17336511627873177 + val EXPECTED_KURTOSIS = -1.698405967752837 + val EXPECTED_KURTOSIS_TYPE = "Platykurtic" + val EXPECTED_SKEW_TYPE = "Symmetric Normal" + + val data = DiscreteTestDataGenerator.generateChaoticArray(500) + + val result = OneDimStats.evaluate(data) + + assert(result.mean == EXPECTED_MEAN, "mean is not correct") + assert(result.geomMean == EXPECTED_GEOM, "geometric mean is not correct") + assert(result.variance == EXPECTED_VARIANCE, "variance is not correct") + assert( + result.semiVariance == EXPECTED_SEMI_VARIANCE, + "semi variance is not correct" + ) + assert(result.stddev == EXPECTED_STDDEV, "stddev is not correct") + assert(result.skew == EXPECTED_SKEW, "skew is not correct") + assert(result.kurtosis == EXPECTED_KURTOSIS, "kurtosis is not correct") + assert( + result.kurtosisType == EXPECTED_KURTOSIS_TYPE, + "kurtosis type is not correct" + ) + assert(result.skewType == EXPECTED_SKEW_TYPE, "skew type is not correct") + + } + + it should "calculate stats correctly for a decaying left tailed distribution" in { + + val EXPECTED_MEAN = 0.6535971281474307 + val EXPECTED_GEOM = 0.6531836300290403 + val 
EXPECTED_VARIANCE = 4.995262976704292E-4 + val EXPECTED_SEMI_VARIANCE = 3.6385129297841144E-4 + val EXPECTED_STDDEV = 0.022350084958908528 + val EXPECTED_SKEW = -3.360258460470165 + val EXPECTED_KURTOSIS = 24.457852041324198 + val EXPECTED_KURTOSIS_TYPE = "Leptokurtic" + val EXPECTED_SKEW_TYPE = "Asymmetrical Left Tailed" + + val data = DiscreteTestDataGenerator.generateDecayArray(100) + + val result = OneDimStats.evaluate(data) + + assert(result.mean == EXPECTED_MEAN, "mean is not correct") + assert(result.geomMean == EXPECTED_GEOM, "geometric mean is not correct") + assert(result.variance == EXPECTED_VARIANCE, "variance is not correct") + assert( + result.semiVariance == EXPECTED_SEMI_VARIANCE, + "semi variance is not correct" + ) + assert(result.stddev == EXPECTED_STDDEV, "stddev is not correct") + assert(result.skew == EXPECTED_SKEW, "skew is not correct") + assert(result.kurtosis == EXPECTED_KURTOSIS, "kurtosis is not correct") + assert( + result.kurtosisType == EXPECTED_KURTOSIS_TYPE, + "kurtosis type is not correct" + ) + assert(result.skewType == EXPECTED_SKEW_TYPE, "skew type is not correct") + + } + + it should "calculate stats correctly for Normal Distributions" in { + + val EXPECTED_MEAN = 7.0 + val EXPECTED_GEOM = 6.52006260856204 + val EXPECTED_VARIANCE = 6.0 + val EXPECTED_SEMI_VARIANCE = 3.0 + val EXPECTED_STDDEV = 2.449489742783178 + val EXPECTED_SKEW = 0.0 + val EXPECTED_KURTOSIS = -0.5449197860962554 + val EXPECTED_KURTOSIS_TYPE = "Platykurtic" + val EXPECTED_SKEW_TYPE = "Symmetric Normal" + + val result = OneDimStats.evaluate(PERFECT_NORMAL_DATA) + + assert(result.mean == EXPECTED_MEAN, "mean is not correct") + assert(result.geomMean == EXPECTED_GEOM, "geometric mean is not correct") + assert(result.variance == EXPECTED_VARIANCE, "variance is not correct") + assert( + result.semiVariance == EXPECTED_SEMI_VARIANCE, + "semi variance is not correct" + ) + assert(result.stddev == EXPECTED_STDDEV, "stddev is not correct") + assert(result.skew == 
EXPECTED_SKEW, "skew is not correct") + assert(result.kurtosis == EXPECTED_KURTOSIS, "kurtosis is not correct") + assert( + result.kurtosisType == EXPECTED_KURTOSIS_TYPE, + "kurtosis type is not correct" + ) + assert(result.skewType == EXPECTED_SKEW_TYPE, "skew type is not correct") + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/exploration/tools/PairedTTestTest.scala b/src/test/scala/com/databricks/labs/automl/exploration/tools/PairedTTestTest.scala new file mode 100644 index 00000000..f7ac6d55 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/exploration/tools/PairedTTestTest.scala @@ -0,0 +1,78 @@ +package com.databricks.labs.automl.exploration.tools + +import com.databricks.labs.automl.AbstractUnitSpec + +class PairedTTestTest extends AbstractUnitSpec { + + it should "correctly classify ANOVA non-equivalency with identical distribution types" in { + + val expectedCorrelation = 0.9867499999999999 + val expectedPearsons = 0.5353263726770642 + val expectedSpearmans = 0.49848254581765294 + val expectedKendalls = 0.44946657497549475 + + val expectedTTestPValue = 8.487543144559928E-4 + val expectedTTestStat = -4.763582449331357 + val expectedTTestEquivalencyJudgement = 'N' + val expectedKSDStat = 1.0 + val expectedKSPValue = 0.0 + val expectedKSJudgement = 'Y' + + val left = Seq(0.1, 0.02, 0.399, 0.4, 0.566, 0.6, 0.7, 0.8, 0.9, 1.0) + val right = Seq(2.0, 4.0, 6.0, 8.0, 12.0, 16.0, 20.0, 11.0, 18.0, 4.0) + + val test = PairedTesting.evaluate(left, right, 0.05) + + assert( + (test.tTestData.tTestPValue == expectedTTestPValue) && (test.tTestData.tStat == expectedTTestStat) + && test.tTestData.tTestSignificance && + (test.tTestData.equivalencyJudgement == expectedTTestEquivalencyJudgement) && + (test.kolmogorovSmirnovData.ksTestDStatistic == expectedKSDStat) && + (test.kolmogorovSmirnovData.ksTestPvalue == expectedKSPValue) && + (test.kolmogorovSmirnovData.ksTestEquivalency == expectedKSJudgement) && + (test.correlationTestData.covariance 
== expectedCorrelation) && + (test.correlationTestData.pearsonCoefficient == expectedPearsons) && + (test.correlationTestData.spearmanCoefficient == expectedSpearmans) && + (test.correlationTestData.kendallsTauCoefficient == expectedKendalls), + "values are incorrect" + ) + + } + + it should "correctly classify ANOVA equivalency with similar distribution types" in { + + val expectedCorrelation = -45280.167386206616 + val expectedPearsons = -0.47399155389027436 + val expectedSpearmans = 0.49090909090909085 + val expectedKendalls = 0.6 + val expectedTTestPValue = 0.3403272377813734 + val expectedTTestStat = -1.0012295762272705 + val expectedTTestEquivalencyJudgement = 'Y' + val expectedKSDStat = 0.7272727272727273 + val expectedKSPValue = 6.549178375803762E-4 + val expectedKSJudgement = 'Y' + + val left = + Seq(0.1, 0.02, 0.399, 0.4, 0.566, 0.6, 0.7, 0.8, 0.9, 1.0, 0.00001) + val right = + Seq(0.1, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0, 100.0, 1000.0, 1000000.0) + + val test = PairedTesting.evaluate(left, right, 0.05) + + assert( + (test.tTestData.tTestPValue == expectedTTestPValue) && (test.tTestData.tStat == expectedTTestStat) + && !test.tTestData.tTestSignificance && + (test.tTestData.equivalencyJudgement == expectedTTestEquivalencyJudgement) && + (test.kolmogorovSmirnovData.ksTestDStatistic == expectedKSDStat) && + (test.kolmogorovSmirnovData.ksTestPvalue == expectedKSPValue) && + (test.kolmogorovSmirnovData.ksTestEquivalency == expectedKSJudgement) && + (test.correlationTestData.covariance == expectedCorrelation) && + (test.correlationTestData.pearsonCoefficient == expectedPearsons) && + (test.correlationTestData.spearmanCoefficient == expectedSpearmans) && + (test.correlationTestData.kendallsTauCoefficient == expectedKendalls), + "values are incorrect" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/exploration/tools/PolynomialRegressorTest.scala b/src/test/scala/com/databricks/labs/automl/exploration/tools/PolynomialRegressorTest.scala new 
file mode 100644 index 00000000..794eb4af --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/exploration/tools/PolynomialRegressorTest.scala @@ -0,0 +1,55 @@ +package com.databricks.labs.automl.exploration.tools + +import com.databricks.labs.automl.AbstractUnitSpec +import org.apache.commons.math3.fitting.{ + PolynomialCurveFitter, + WeightedObservedPoints +} + +class PolynomialRegressorTest extends AbstractUnitSpec { + + //TODO: write actual tests here. + + it should "do something" in { + +// val data1 = Seq(0.1, 0.02, 0.399, 0.4, 0.566, 0.6, 0.7, 0.8, 0.9, 1.0, 37.5, +// 0.5, 17.8, 19.1) +// val data2 = Seq(2.0, 45.0, 6.0, 8.0, 17.0, 120.0, 104.0, 19.0, 18.0, 20.0, +// 0.1, 99.9, 14.2, 63.2) + + val data1 = Seq(0.1, 0.02, 0.399, 0.4, 0.566, 0.6, 0.7, 0.8, 0.9, 1.0) + val data2 = Seq(2.0, 4.0, 6.0, 8.0, 12.0, 16.0, 20.0, 11.0, 18.0, 4.0) + + val g = SimpleRegressor.calculate(data1, data2) + println(g) + + // polynomial regressor.... + + val theData = new WeightedObservedPoints() + data1.zip(data2).foreach(x => theData.add(x._1, x._2)) + + val fitter = PolynomialCurveFitter.create(2) + + val trial = fitter.fit(theData.toList) + + println(trial.mkString(",")) + + val fitter2 = PolynomialCurveFitter.create(3) + val trial2 = fitter2.fit(theData.toList) + println(trial2.mkString(",")) + + val fitter1 = PolynomialCurveFitter.create(1) + val trial1 = fitter1.fit(theData.toList) + println(trial1.mkString(",")) + + val doIt = PolynomialRegressor.fit(data1, data2, 4) + + println(doIt) + + val mult = + PolynomialRegressor.fitMultipleOrders(data1, data2, Array(1, 2, 3, 4, 5)) + mult.foreach(println) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/exploration/tools/ShapiroWilkTest.scala b/src/test/scala/com/databricks/labs/automl/exploration/tools/ShapiroWilkTest.scala new file mode 100644 index 00000000..9539cb34 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/exploration/tools/ShapiroWilkTest.scala @@ -0,0 +1,135 @@ +package 
com.databricks.labs.automl.exploration.tools + +import com.databricks.labs.automl.AbstractUnitSpec + +class ShapiroWilkTest extends AbstractUnitSpec { + + final val ALPHA = 0.05 + + final val NORMAL_DATA = + Array(-10.0, -5.0, -1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0, 5.0, 10.0) + final val LEFT_SKEW_DATA = + Array(-1000.0, -500.0, -100.0, -50.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5) + final val RIGHT_SKEW_DATA = + Array(0.0, 0.1, 0.2, 0.3, 0.5, 1.0, 10.0, 100.0, 1000.0, 10000.0) + final val ZERO_RANGE_DATA = Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0) + final val LOW_RANGE_DATA = + Array(1.1111111e-15, 1.11111e-15, 1.11111e-15, 1.11111e-15, 1.11111e-15, + 1.111111e-15) + final val SMALL_DATA = Array(0.0, 1.0) + + it should "throw an exception for extremely low range data" in { + intercept[IllegalArgumentException] { + ShapiroWilk.test(LOW_RANGE_DATA, ALPHA) + } + } + + it should "throw an exception for zero range data" in { + intercept[IllegalArgumentException] { + ShapiroWilk.test(ZERO_RANGE_DATA, ALPHA) + } + } + + it should "throw an exception for small data" in { + intercept[IllegalArgumentException] { + ShapiroWilk.test(SMALL_DATA, ALPHA) + } + } + + it should "return the correct result for normal data for Shapiro-Wilk" in { + + val EXPECTED_W = 0.9059018777993929 + val EXPECTED_Z = 0.7792008630260997 + val EXPECTED_PROB = 0.7820692990441541 + val EXPECTED_NORMALCY = false + val EXPECTED_DECISION = "Y" + + val result = ShapiroWilk.test(NORMAL_DATA, ALPHA) + + assert( + result.w == EXPECTED_W, + "failed to calculate the correct W value for normal distribution" + ) + assert( + result.z == EXPECTED_Z, + "failed to calculate Z score properly for normal distribution" + ) + assert( + result.probability == EXPECTED_PROB, + "failed to calculate probability properly for normal distribution" + ) + assert( + result.normalcyTest == EXPECTED_NORMALCY, + "failed to determine normalcy test for normal distribution" + ) + assert( + result.normalcy == EXPECTED_DECISION, + "failed to define 
normalcy decision correctly for normal distribution" + ) + } + + it should "return the correct result for left skewed data for Shapiro-Wilk" in { + + val EXPECTED_W = 0.5926784836883106 + val EXPECTED_Z = 3.914789120443548 + val EXPECTED_PROB = 0.9999547585887872 + val EXPECTED_NORMALCY = true + val EXPECTED_DECISION = "N" + + val result = ShapiroWilk.test(LEFT_SKEW_DATA, ALPHA) + + assert( + result.w == EXPECTED_W, + "failed to calculate the correct W value for left skewed distribution" + ) + assert( + result.z == EXPECTED_Z, + "failed to calculate Z score properly for left skewed distribution" + ) + assert( + result.probability == EXPECTED_PROB, + "failed to calculate probability properly for left skewed distribution" + ) + assert( + result.normalcyTest == EXPECTED_NORMALCY, + "failed to determine normalcy test for left skewed distribution" + ) + assert( + result.normalcy == EXPECTED_DECISION, + "failed to define normalcy decision correctly for left skewed distribution" + ) + } + + it should "return the correct result for right skewed data for Shapiro-Wilk" in { + + val EXPECTED_W = 0.41812749335736665 + val EXPECTED_Z = 4.933291170618088 + val EXPECTED_PROB = 0.9999995955990474 + val EXPECTED_NORMALCY = true + val EXPECTED_DECISION = "N" + + val result = ShapiroWilk.test(RIGHT_SKEW_DATA, ALPHA) + + assert( + result.w == EXPECTED_W, + "failed to calculate the correct W value for right skewed distribution" + ) + assert( + result.z == EXPECTED_Z, + "failed to calculate Z score properly for right skewed distribution" + ) + assert( + result.probability == EXPECTED_PROB, + "failed to calculate probability properly for right skewed distribution" + ) + assert( + result.normalcyTest == EXPECTED_NORMALCY, + "failed to determine normalcy test for right skewed distribution" + ) + assert( + result.normalcy == EXPECTED_DECISION, + "failed to define normalcy decision correctly for right skewed distribution" + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/exploration/tools/SimpleRegressorTest.scala
b/src/test/scala/com/databricks/labs/automl/exploration/tools/SimpleRegressorTest.scala new file mode 100644 index 00000000..1fc39999 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/exploration/tools/SimpleRegressorTest.scala @@ -0,0 +1,139 @@ +package com.databricks.labs.automl.exploration.tools + +import com.databricks.labs.automl.AbstractUnitSpec + +class SimpleRegressorTest extends AbstractUnitSpec { + + it should "give correct results with excessive outliers" in { + + val data1 = Seq(0.1, 0.02, 0.399, 0.4, 0.566, 0.6, 0.7, 0.8, 0.9, 1.0, 37.5, + 0.5, 17.8, 19.1) + val data2 = Seq(2.0, 45.0, 6.0, 8.0, 17.0, 120.0, 104.0, 19.0, 18.0, 20.0, + 0.1, 99.9, 14.2, 63.2) + + val res = SimpleRegressor.calculate(data1, data2) + + assert(res.slope == -0.8302724558196052, "incorrect slope") + assert(res.slopeStdErr == 1.0413727775143946, "incorrect slope stderr") + assert( + res.slopeConfidenceInterval == 2.2689563681145803, + "incorrect slope CI" + ) + assert(res.intercept == 43.08153224007564, "incorrect intercept") + assert( + res.interceptStdErr == 12.73014698229427, + "incorrect intercept stderr" + ) + assert(res.rSquared == 0.05030726305316163, "incorrect r-squared") + assert(res.significance == 0.4407756266472016, "incorrect significance") + assert(res.mse == 1768.2580059436523, "incorrect mean squared error") + assert(res.rmse == 42.050659994150536, "incorrect rmse") + assert( + res.sumSquares == 1124.0210715333192, + "incorrect sum of squares regression" + ) + assert( + res.totalSumSquares == 22343.117142857147, + "incorrect total sum of squares" + ) + assert( + res.sumSquareError == 21219.096071323827, + "incorrect sum square error" + ) + assert(res.pairLength == 14L, "incorrect pair length") + assert( + res.pearsonR == -0.22429280651229463, + "incorrect pearson R coefficient of correlation" + ) + assert( + res.crossProductSum == -1353.7978571428573, + "incorrect cross product sum" + ) + + } + + it should "give correct results for generally linearly 
correlated series" in { + + val data1 = + Seq(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 0.5, 2.1, 3.1) + val data2 = Seq(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, + 22.1, 24.9, 26.2, 28.2) + + val res = SimpleRegressor.calculate(data1, data2) + + assert(res.slope == 7.935469580626107, "incorrect slope") + assert(res.slopeStdErr == 1.6828113994525227, "incorrect slope stderr") + assert( + res.slopeConfidenceInterval == 3.666531067037454, + "incorrect slope CI" + ) + assert(res.intercept == 7.6179858239810985, "incorrect intercept") + assert( + res.interceptStdErr == 2.1152611294308854, + "incorrect intercept stderr" + ) + assert(res.rSquared == 0.6495010445058065, "incorrect r-squared") + assert(res.significance == 5.007338509579462E-4, "incorrect significance") + assert(res.mse == 27.396166691277813, "incorrect mean squared error") + assert(res.rmse == 5.234134760519432, "incorrect rmse") + assert( + res.sumSquares == 609.2059997046663, + "incorrect sum of squares regression" + ) + assert(res.totalSumSquares == 937.96, "incorrect total sum of squares") + assert( + res.sumSquareError == 328.75400029533375, + "incorrect sum square error" + ) + assert(res.pairLength == 14L, "incorrect pair length") + assert( + res.pearsonR == 0.8059162763623815, + "incorrect pearson R coefficient of correlation" + ) + assert( + res.crossProductSum == 76.77000000000001, + "incorrect cross product sum" + ) + + } + + it should "give correct results for perfectly linearly correlated series" in { + + val data1 = + Seq(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 0.5, 2.1, 3.1) + val data2 = + Seq(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 0.5, 2.1, 3.1) + + val res = SimpleRegressor.calculate(data1, data2) + + assert(res.slope == 1.0, "incorrect slope") + assert(res.slopeStdErr == 0.0, "incorrect slope stderr") + assert(res.slopeConfidenceInterval == 0.0, "incorrect slope CI") + assert(res.intercept == 0.0, "incorrect intercept") + 
assert(res.interceptStdErr == 0.0, "incorrect intercept stderr") + assert(res.rSquared == 1.0, "incorrect r-squared") + assert(res.significance == 0.0, "incorrect significance") + assert(res.mse == 0.0, "incorrect mean squared error") + assert(res.rmse == 0.0, "incorrect rmse") + assert( + res.sumSquares == 9.674285714285716, + "incorrect sum of squares regression" + ) + assert( + res.totalSumSquares == 9.674285714285716, + "incorrect total sum of squares" + ) + assert(res.sumSquareError == 0.0, "incorrect sum square error") + assert(res.pairLength == 14L, "incorrect pair length") + assert( + res.pearsonR == 1.0, + "incorrect pearson R coefficient of correlation" + ) + assert( + res.crossProductSum == 9.674285714285716, + "incorrect cross product sum" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/exploration/visualizations/OneDimVisualizationsTest.scala b/src/test/scala/com/databricks/labs/automl/exploration/visualizations/OneDimVisualizationsTest.scala new file mode 100644 index 00000000..2e2e33a5 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/exploration/visualizations/OneDimVisualizationsTest.scala @@ -0,0 +1,19 @@ +package com.databricks.labs.automl.exploration.visualizations + +import com.databricks.labs.automl.exploration.tools.OneDimStats +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} + +class OneDimVisualizationsTest extends AbstractUnitSpec { + + it should "work" in { + + val data = DiscreteTestDataGenerator.generateDecayArray(100) + + val result = OneDimStats.evaluate(data) + + val render = OneDimVisualizations.generateOneDimPlots(result) + + println(render) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/feature/FeatureInteractionKSampleIntegrationTest.scala b/src/test/scala/com/databricks/labs/automl/feature/FeatureInteractionKSampleIntegrationTest.scala new file mode 100644 index 00000000..e8c38561 --- /dev/null +++ 
b/src/test/scala/com/databricks/labs/automl/feature/FeatureInteractionKSampleIntegrationTest.scala @@ -0,0 +1,150 @@ +package com.databricks.labs.automl.feature + +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.utilities.ValidationUtilities +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationRunner, + AutomationUnitTestsUtil +} + +class FeatureInteractionKSampleIntegrationTest extends AbstractUnitSpec { + + it should "Perform Data Prep Correctly with FeatureInteraction and kSampling both on" in { + + val EXPECTED_FIELDS = Array( + "age_trimmed_si", + "workclass_trimmed_si", + "education_trimmed_si", + "education-num_trimmed_si", + "marital-status_trimmed_si", + "occupation_trimmed_si", + "relationship_trimmed_si", + "race_trimmed_si", + "sex_trimmed_si", + "capital-gain_trimmed_si", + "capital-loss_trimmed_si", + "hours-per-week_trimmed_si", + "native-country_trimmed_si", + "i_age_trimmed_si_workclass_trimmed_si_si", + "i_relationship_trimmed_si_sex_trimmed_si_si", + "i_education-num_trimmed_si_sex_trimmed_si_si", + "i_education-num_trimmed_si_race_trimmed_si_si", + "i_occupation_trimmed_si_capital-gain_trimmed_si_si", + "i_workclass_trimmed_si_capital-gain_trimmed_si_si", + "i_education_trimmed_si_capital-gain_trimmed_si_si", + "i_marital-status_trimmed_si_hours-per-week_trimmed_si_si", + "i_age_trimmed_si_occupation_trimmed_si_si", + "i_education-num_trimmed_si_occupation_trimmed_si_si", + "i_age_trimmed_si_capital-loss_trimmed_si_si", + "i_education-num_trimmed_si_capital-loss_trimmed_si_si", + "i_age_trimmed_si_education_trimmed_si_si", + "i_relationship_trimmed_si_capital-gain_trimmed_si_si", + "i_education-num_trimmed_si_capital-gain_trimmed_si_si", + "i_occupation_trimmed_si_capital-loss_trimmed_si_si", + "i_workclass_trimmed_si_capital-loss_trimmed_si_si", + "i_education_trimmed_si_capital-loss_trimmed_si_si", + "i_marital-status_trimmed_si_native-country_trimmed_si_si", + 
"i_age_trimmed_si_relationship_trimmed_si_si", + "i_education-num_trimmed_si_relationship_trimmed_si_si", + "i_age_trimmed_si_hours-per-week_trimmed_si_si", + "i_education-num_trimmed_si_hours-per-week_trimmed_si_si", + "i_age_trimmed_si_education-num_trimmed_si_si", + "i_relationship_trimmed_si_capital-loss_trimmed_si_si", + "i_marital-status_trimmed_si_relationship_trimmed_si_si", + "i_sex_trimmed_si_capital-gain_trimmed_si_si", + "i_occupation_trimmed_si_hours-per-week_trimmed_si_si", + "i_workclass_trimmed_si_hours-per-week_trimmed_si_si", + "i_education_trimmed_si_hours-per-week_trimmed_si_si", + "i_occupation_trimmed_si_relationship_trimmed_si_si", + "i_age_trimmed_si_race_trimmed_si_si", + "i_marital-status_trimmed_si_sex_trimmed_si_si", + "i_age_trimmed_si_native-country_trimmed_si_si", + "i_education-num_trimmed_si_native-country_trimmed_si_si", + "i_age_trimmed_si_marital-status_trimmed_si_si", + "i_relationship_trimmed_si_hours-per-week_trimmed_si_si", + "i_marital-status_trimmed_si_race_trimmed_si_si", + "i_sex_trimmed_si_capital-loss_trimmed_si_si", + "i_occupation_trimmed_si_native-country_trimmed_si_si", + "i_workclass_trimmed_si_native-country_trimmed_si_si", + "i_education_trimmed_si_native-country_trimmed_si_si", + "i_occupation_trimmed_si_race_trimmed_si_si", + "i_age_trimmed_si_sex_trimmed_si_si", + "i_marital-status_trimmed_si_capital-gain_trimmed_si_si", + "i_workclass_trimmed_si_education_trimmed_si_si", + "i_marital-status_trimmed_si_occupation_trimmed_si_si", + "i_relationship_trimmed_si_native-country_trimmed_si_si", + "i_sex_trimmed_si_hours-per-week_trimmed_si_si", + "i_relationship_trimmed_si_race_trimmed_si_si", + "i_education_trimmed_si_education-num_trimmed_si_si", + "i_education-num_trimmed_si_marital-status_trimmed_si_si", + "i_occupation_trimmed_si_sex_trimmed_si_si", + "i_marital-status_trimmed_si_capital-loss_trimmed_si_si", + "i_age_trimmed_si_capital-gain_trimmed_si_si", + "i_workclass_trimmed_si_education-num_trimmed_si_si", 
+ "i_race_trimmed_si_sex_trimmed_si_si", + "i_sex_trimmed_si_native-country_trimmed_si_si", + "i_workclass_trimmed_si_marital-status_trimmed_si_si", + "i_education_trimmed_si_marital-status_trimmed_si_si", + "i_workclass_trimmed_si_relationship_trimmed_si_si", + "i_workclass_trimmed_si_occupation_trimmed_si_si", + "i_capital-gain_trimmed_si_hours-per-week_trimmed_si_si", + "i_workclass_trimmed_si_race_trimmed_si_si", + "i_capital-loss_trimmed_si_hours-per-week_trimmed_si_si", + "i_race_trimmed_si_capital-gain_trimmed_si_si", + "i_capital-gain_trimmed_si_capital-loss_trimmed_si_si", + "i_education_trimmed_si_occupation_trimmed_si_si", + "i_workclass_trimmed_si_sex_trimmed_si_si", + "i_capital-loss_trimmed_si_native-country_trimmed_si_si", + "i_capital-gain_trimmed_si_native-country_trimmed_si_si", + "i_hours-per-week_trimmed_si_native-country_trimmed_si_si", + "i_race_trimmed_si_capital-loss_trimmed_si_si", + "i_education_trimmed_si_relationship_trimmed_si_si", + "i_race_trimmed_si_hours-per-week_trimmed_si_si", + "i_education_trimmed_si_race_trimmed_si_si", + "i_race_trimmed_si_native-country_trimmed_si_si", + "i_education_trimmed_si_sex_trimmed_si_si", + "features", + "label", + "synthetic_ksample" + ) + + val testData = AutomationUnitTestsUtil.getAdultDf() + + val runConfig = Map( + "labelCol" -> "label", + "tunerKFold" -> 3, + "tunerTrainSplitMethod" -> "kSample", + "featureInteractionFlag" -> true, + "featureInteractionRetentionMode" -> "all", + "tunerNumberOfGenerations" -> 3, + "tunerInitialGenerationMode" -> "permutations", + "mlFlowLoggingFlag" -> false + ) + + val rfConfig = ConfigurationGenerator.generateConfigFromMap( + "RandomForest", + "classifier", + runConfig + ) + + val prep = new AutomationRunner(testData) + .setMainConfig(ConfigurationGenerator.generateMainConfig(rfConfig)) + .prepData() + + println(prep.data.schema.names.mkString(", ")) + + assert( + prep.modelType == "classifier", + s"model detection type incorrect: ${prep.modelType}, should 
have been classifier" + ) + assert(!prep.data.isEmpty, "prepared Dataframe should not be empty.") + ValidationUtilities.fieldCreationAssertion( + EXPECTED_FIELDS, + prep.data.schema.names + ) + ValidationUtilities.fieldCreationAssertion(EXPECTED_FIELDS, prep.fields) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/feature/FeatureInteractionTest.scala b/src/test/scala/com/databricks/labs/automl/feature/FeatureInteractionTest.scala new file mode 100644 index 00000000..7358b3a5 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/feature/FeatureInteractionTest.scala @@ -0,0 +1,191 @@ +package com.databricks.labs.automl.feature + +import com.databricks.labs.automl.pipeline.FeaturePipeline +import com.databricks.labs.automl.sanitize.DataSanitizer +import com.databricks.labs.automl.utilities.ValidationUtilities +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import org.apache.spark.sql.DataFrame + +class FeatureInteractionTest extends AbstractUnitSpec { + + final private val LABEL_COL = "label" + final private val FEATURE_COL = "features" + final private val IGNORE_FIELDS = Array("automl_internal_id") + final private val CONTINUOUS_DISCRETIZER_BUCKET_COUNT = 25 + final private val PARALLELISM = 8 + + def cleanupData: (DataFrame, Array[String], Array[String], String) = { + + val data = DiscreteTestDataGenerator.generateFeatureInteractionData(1000) + + val (sanitized, fillConfig, modelType) = new DataSanitizer(data) + .setLabelCol(LABEL_COL) + .setFieldsToIgnoreInVector(IGNORE_FIELDS) + .generateCleanData() + + val (cleanData, featureFields, totalFields) = new FeaturePipeline(sanitized) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURE_COL) + .makeFeaturePipeline(IGNORE_FIELDS) + + val nominalFields = featureFields + .filter(x => x.takeRight(3) == "_si") + .filterNot(x => x.contains(LABEL_COL)) + + val continuousFields = featureFields + .diff(nominalFields) + .filterNot(_.contains(LABEL_COL)) + 
.filterNot(_.contains(FEATURE_COL)) + + (cleanData, nominalFields, continuousFields, modelType) + } + + it should "create correct interacted columns in 'all' mode" in { + + val RETENTION_MODE = "all" + val TARGET_INTERACTION_PERCENTAGE = 1.0 + val EXPECTED_FIELDS = Array( + "d_si", + "f_si", + "a", + "b", + "c", + "e", + "i_a_b", + "i_a_c", + "i_a_e", + "i_b_c", + "i_b_e", + "i_c_e", + "i_d_si_a", + "i_d_si_b", + "i_d_si_c", + "i_d_si_e", + "i_d_si_f_si_si", + "i_f_si_b", + "i_f_si_c", + "i_f_si_a", + "i_f_si_e", + "automl_internal_id", + "features", + "label" + ) + + val (cleanData, nominalFields, continuousFields, modelType) = cleanupData + + val interacted = FeatureInteraction.interactFeatures( + cleanData, + nominalFields, + continuousFields, + modelType, + RETENTION_MODE, + LABEL_COL, + FEATURE_COL, + CONTINUOUS_DISCRETIZER_BUCKET_COUNT, + PARALLELISM, + TARGET_INTERACTION_PERCENTAGE + ) + + ValidationUtilities.fieldCreationAssertion( + EXPECTED_FIELDS, + interacted.data.schema.names + ) + } + + it should "create correct interacted columns in 'strict' mode" in { + + val RETENTION_MODE = "strict" + val TARGET_INTERACTION_PERCENTAGE = 0.001 + val EXPECTED_FIELDS = Array( + "d_si", + "f_si", + "a", + "b", + "c", + "e", + "i_b_c", + "i_a_b", + "i_a_c", + "i_d_si_a", + "i_d_si_c", + "i_f_si_c", + "i_f_si_e", + "i_c_e", + "i_d_si_b", + "i_d_si_e", + "i_f_si_a", + "i_f_si_b", + "automl_internal_id", + "features", + "label" + ) + + val (cleanData, nominalFields, continuousFields, modelType) = cleanupData + + val interacted = FeatureInteraction.interactFeatures( + cleanData, + nominalFields, + continuousFields, + modelType, + RETENTION_MODE, + LABEL_COL, + FEATURE_COL, + CONTINUOUS_DISCRETIZER_BUCKET_COUNT, + PARALLELISM, + TARGET_INTERACTION_PERCENTAGE + ) + + ValidationUtilities.fieldCreationAssertion( + EXPECTED_FIELDS, + interacted.data.schema.names + ) + + } + + it should "create correct interacted columns in 'optimistic' mode" in { + + val RETENTION_MODE = 
"optimistic" + val TARGET_INTERACTION_PERCENTAGE = -0.1 + val EXPECTED_FIELDS = Array( + "d_si", + "f_si", + "a", + "b", + "c", + "e", + "i_d_si_a", + "i_d_si_b", + "i_d_si_c", + "i_d_si_e", + "i_f_si_b", + "i_f_si_c", + "i_f_si_a", + "i_f_si_e", + "automl_internal_id", + "features", + "label" + ) + + val (cleanData, nominalFields, continuousFields, modelType) = cleanupData + + val interacted = FeatureInteraction.interactFeatures( + cleanData, + nominalFields, + continuousFields, + modelType, + RETENTION_MODE, + LABEL_COL, + FEATURE_COL, + CONTINUOUS_DISCRETIZER_BUCKET_COUNT, + PARALLELISM, + TARGET_INTERACTION_PERCENTAGE + ) + + ValidationUtilities.fieldCreationAssertion( + EXPECTED_FIELDS, + interacted.data.schema.names + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/feature/KSampleTest.scala b/src/test/scala/com/databricks/labs/automl/feature/KSampleTest.scala new file mode 100644 index 00000000..fb6d6a90 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/feature/KSampleTest.scala @@ -0,0 +1,303 @@ +package com.databricks.labs.automl.feature + +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + DiscreteTestDataGenerator +} +import org.apache.spark.ml.feature.VectorAssembler +import org.apache.spark.sql.DataFrame + +class KSampleTest extends AbstractUnitSpec { + + final private val LABEL_COL = "label" + final private val FEATURES_COL = "features" + final private val SYNTHETIC_COL = "synthetic_ksample" + final private val FIELDS_TO_IGNORE_IN_VECTOR = Array("automl_internal_id") + final private val KGROUPS = 25 + final private val KMEANS_MAX_ITER = 100 + final private val KMEANS_TOLERANCE = 1E-6 + final private val KMEANS_DISTANCE_MEASUREMENT = "euclidean" + final private val KMEANS_SEED = 42L + final private val KMEANS_PREDICTION_COL = "kgroups_ksample" + final private val LSH_HASH_TABLES = 10 + final private val LSH_SEED = 42L + final private val LSH_OUTPUT_COL = "hashes_ksample" + final private 
val QUORUM_COUNT = 7 + final private val MINIMUM_VECTOR_COUNT_TO_MUTATE = 1 + final private val VECTOR_MUTATION_METHOD = "random" + final private val MUTATION_MODE = "weighted" + final private val MUTATION_VALUE = 0.5 + final private val CARDINALITY_THRESHOLD = 20 + final private val FEATURE_FIELDS = Array("a", "b", "c") + + def createVector(df: DataFrame): DataFrame = { + + new VectorAssembler() + .setInputCols(FEATURE_FIELDS) + .setOutputCol(FEATURES_COL) + .transform(df) + + } + + def getImbalance(df: DataFrame): Map[Int, Long] = { + + val spark = AutomationUnitTestsUtil.sparkSession + + import spark.implicits._ + + df.groupBy(LABEL_COL) + .count() + .map(x => Map(x.getAs[Int]("label") -> x.getAs[Long]("count"))) + .collect() + .flatten + .toMap + } + + it should "Correctly KSample boost minority classes in match mode" in { + + val LABEL_BALANCE_MODE = "match" + val NUMERIC_RATIO = 0.2 + val NUMERIC_TARGET = 500 + + val data = DiscreteTestDataGenerator.generateKSampleData(300) + + val featurizedData = createVector(data) + + val EXPECTED_CLASS_1_COUNT_PRE = 188 + val EXPECTED_CLASS_2_COUNT_PRE = 75 + val EXPECTED_CLASS_3_COUNT_PRE = 37 + + val EXPECTED_CLASS_1_COUNT_POST = 188 + val EXPECTED_CLASS_2_COUNT_POST = 188 + val EXPECTED_CLASS_3_COUNT_POST = 188 + + val upSampled = SyntheticFeatureGenerator( + featurizedData, + FEATURES_COL, + LABEL_COL, + SYNTHETIC_COL, + FIELDS_TO_IGNORE_IN_VECTOR, + KGROUPS, + KMEANS_MAX_ITER, + KMEANS_TOLERANCE, + KMEANS_DISTANCE_MEASUREMENT, + KMEANS_SEED, + KMEANS_PREDICTION_COL, + LSH_HASH_TABLES, + LSH_SEED, + LSH_OUTPUT_COL, + QUORUM_COUNT, + MINIMUM_VECTOR_COUNT_TO_MUTATE, + VECTOR_MUTATION_METHOD, + MUTATION_MODE, + MUTATION_VALUE, + LABEL_BALANCE_MODE, + CARDINALITY_THRESHOLD, + NUMERIC_RATIO, + NUMERIC_TARGET + ) + + val preRowCount = featurizedData.count() + val postRowCount = upSampled.count() + + val imbalancePre = getImbalance(featurizedData) + val imbalancePost = getImbalance(upSampled) + + assert( + imbalancePre(1) == 
EXPECTED_CLASS_1_COUNT_PRE, + "correct class 1 count" + ) + + assert( + imbalancePre(2) == EXPECTED_CLASS_2_COUNT_PRE, + "correct class 2 count" + ) + + assert( + imbalancePre(3) == EXPECTED_CLASS_3_COUNT_PRE, + "correct class 3 count" + ) + + assert( + imbalancePost(1) == EXPECTED_CLASS_1_COUNT_POST, + "no modification to main class" + ) + assert( + imbalancePost(2) == EXPECTED_CLASS_2_COUNT_POST, + "matched expected synthetic class count" + ) + assert( + imbalancePost(3) == EXPECTED_CLASS_3_COUNT_POST, + "matched expected synthetic class count" + ) + + assert(postRowCount > preRowCount, "created rows") + + } + + it should "Correctly KSample boost minority classes in percentage mode" in { + + val LABEL_BALANCE_MODE = "percentage" + val NUMERIC_RATIO = 0.5 + val NUMERIC_TARGET = 150 + + val data = DiscreteTestDataGenerator.generateKSampleData(300) + + val featurizedData = createVector(data) + + val EXPECTED_CLASS_1_COUNT_PRE = 188 + val EXPECTED_CLASS_2_COUNT_PRE = 75 + val EXPECTED_CLASS_3_COUNT_PRE = 37 + + val EXPECTED_CLASS_1_COUNT_POST = 188 + val EXPECTED_CLASS_2_COUNT_POST = 94 + val EXPECTED_CLASS_3_COUNT_POST = 94 + + val upSampled = SyntheticFeatureGenerator( + featurizedData, + FEATURES_COL, + LABEL_COL, + SYNTHETIC_COL, + FIELDS_TO_IGNORE_IN_VECTOR, + KGROUPS, + KMEANS_MAX_ITER, + KMEANS_TOLERANCE, + KMEANS_DISTANCE_MEASUREMENT, + KMEANS_SEED, + KMEANS_PREDICTION_COL, + LSH_HASH_TABLES, + LSH_SEED, + LSH_OUTPUT_COL, + QUORUM_COUNT, + MINIMUM_VECTOR_COUNT_TO_MUTATE, + VECTOR_MUTATION_METHOD, + MUTATION_MODE, + MUTATION_VALUE, + LABEL_BALANCE_MODE, + CARDINALITY_THRESHOLD, + NUMERIC_RATIO, + NUMERIC_TARGET + ) + + val preRowCount = featurizedData.count() + val postRowCount = upSampled.count() + + val imbalancePre = getImbalance(featurizedData) + val imbalancePost = getImbalance(upSampled) + + assert( + imbalancePre(1) == EXPECTED_CLASS_1_COUNT_PRE, + "correct class 1 count" + ) + + assert( + imbalancePre(2) == EXPECTED_CLASS_2_COUNT_PRE, + "correct class 
2 count" + ) + + assert( + imbalancePre(3) == EXPECTED_CLASS_3_COUNT_PRE, + "correct class 3 count" + ) + + assert( + imbalancePost(1) == EXPECTED_CLASS_1_COUNT_POST, + "no modification to main class" + ) + assert( + imbalancePost(2) == EXPECTED_CLASS_2_COUNT_POST, + "matched expected synthetic class count" + ) + assert( + imbalancePost(3) == EXPECTED_CLASS_3_COUNT_POST, + "matched expected synthetic class count" + ) + + assert(postRowCount > preRowCount, "created rows") + + } + + it should "Correctly KSample boost minority classes in target mode" in { + + val LABEL_BALANCE_MODE = "target" + val NUMERIC_RATIO = 0.5 + val NUMERIC_TARGET = 500 + + val data = DiscreteTestDataGenerator.generateKSampleData(300) + + val featurizedData = createVector(data) + + val EXPECTED_CLASS_1_COUNT_PRE = 188 + val EXPECTED_CLASS_2_COUNT_PRE = 75 + val EXPECTED_CLASS_3_COUNT_PRE = 37 + + val EXPECTED_CLASS_1_COUNT_POST = 188 + val EXPECTED_CLASS_2_COUNT_POST = 500 + val EXPECTED_CLASS_3_COUNT_POST = 500 + + val upSampled = SyntheticFeatureGenerator( + featurizedData, + FEATURES_COL, + LABEL_COL, + SYNTHETIC_COL, + FIELDS_TO_IGNORE_IN_VECTOR, + KGROUPS, + KMEANS_MAX_ITER, + KMEANS_TOLERANCE, + KMEANS_DISTANCE_MEASUREMENT, + KMEANS_SEED, + KMEANS_PREDICTION_COL, + LSH_HASH_TABLES, + LSH_SEED, + LSH_OUTPUT_COL, + QUORUM_COUNT, + MINIMUM_VECTOR_COUNT_TO_MUTATE, + VECTOR_MUTATION_METHOD, + MUTATION_MODE, + MUTATION_VALUE, + LABEL_BALANCE_MODE, + CARDINALITY_THRESHOLD, + NUMERIC_RATIO, + NUMERIC_TARGET + ) + + val preRowCount = featurizedData.count() + val postRowCount = upSampled.count() + + val imbalancePre = getImbalance(featurizedData) + val imbalancePost = getImbalance(upSampled) + + assert( + imbalancePre(1) == EXPECTED_CLASS_1_COUNT_PRE, + "correct class 1 count" + ) + + assert( + imbalancePre(2) == EXPECTED_CLASS_2_COUNT_PRE, + "correct class 2 count" + ) + + assert( + imbalancePre(3) == EXPECTED_CLASS_3_COUNT_PRE, + "correct class 3 count" + ) + + assert( + imbalancePost(1) == 
EXPECTED_CLASS_1_COUNT_POST, + "no modification to main class" + ) + assert( + imbalancePost(2) == EXPECTED_CLASS_2_COUNT_POST, + "matched expected synthetic class count" + ) + assert( + imbalancePost(3) == EXPECTED_CLASS_3_COUNT_POST, + "matched expected synthetic class count" + ) + + assert(postRowCount > preRowCount, "created rows") + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/inference/InferenceToolsTest.scala b/src/test/scala/com/databricks/labs/automl/inference/InferenceToolsTest.scala new file mode 100644 index 00000000..52f81546 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/inference/InferenceToolsTest.scala @@ -0,0 +1,50 @@ +package com.databricks.labs.automl.inference + +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil, InferenceUnitTestUtil} +import com.fasterxml.jackson.core.JsonParseException + +class InferenceToolsTest extends AbstractUnitSpec { + + it should "return null inference payload internal state" in { + val inferencePayload: InferencePayload = new InferenceTools{}.createInferencePayload(null, null, null) + assert(inferencePayload.data == null, "should have returned null dataset") + assert(inferencePayload.modelingColumns == null, "should have returned null modeling columns") + assert(inferencePayload.allColumns == null, "should have returned null columns") + } + + it should "return removal of columns" in { + val columnRemoved = "age_trimmed" + val originalInferencePayload = InferenceUnitTestUtil.generateInferencePayload() + val inferencePayload: InferencePayload = new InferenceTools{}.removeArrayOfColumns( + InferenceUnitTestUtil.generateInferencePayload(), Array(columnRemoved)) + assert(inferencePayload != null, "should not have returned null inference payload") + assert(!inferencePayload.allColumns.contains(columnRemoved), s"should have removed column: $columnRemoved") + val noOfColumns = originalInferencePayload.allColumns.size - 1 + assert(inferencePayload.allColumns.size == 
noOfColumns, s"should have returned $noOfColumns number of columns") + } + + it should "return inference payload in json for null" in { + val inferenceJsonReturn: InferenceJsonReturn = new InferenceTools{}.convertInferenceConfigToJson(null) + assert(inferenceJsonReturn.prettyJson equals "null", "should have returned null prettyJson for invalid input") + assert(inferenceJsonReturn.compactJson equals "null", "should have returned null compactJson for invalid input") + } + + it should "raise JsonParseException due to deserialization of an illegal json" in { + a [JsonParseException] should be thrownBy { + new InferenceTools {}.convertJsonConfigToClass("error") + } + } + + it should "raise ArrayIndexOutOfBoundsException due to extraction of inference String from an empty dataframe" in { + a [ArrayIndexOutOfBoundsException] should be thrownBy { + new InferenceTools {}.extractInferenceJsonFromDataFrame(AutomationUnitTestsUtil.sparkSession.emptyDataFrame) + } + } + + it should "raise ArrayIndexOutOfBoundsException due to extraction of inference config from an empty dataframe" in { + a [ArrayIndexOutOfBoundsException] should be thrownBy { + new InferenceTools {}.extractInferenceConfigFromDataFrame(AutomationUnitTestsUtil.sparkSession.emptyDataFrame) + } + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/model/DecisionTreeTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/DecisionTreeTunerTest.scala new file mode 100644 index 00000000..a0291243 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/DecisionTreeTunerTest.scala @@ -0,0 +1,433 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.TreesModelsWithResults +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + 
DiscreteTestDataGenerator +} + +class DecisionTreeTunerTest extends AbstractUnitSpec { + + "DecisionTreeTuner" should "throw UnsupportedOperationException for passing invalid params" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "Trees" + ) + + new DecisionTreeTuner(null, splitData, null).evolveBest() + } + } + + it should "throw UnsupportedOperationException for passing invalid modelSelection" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "Trees" + ) + new DecisionTreeTuner( + AutomationUnitTestsUtil.getAdultDf(), + splitData, + "err" + ).evolveBest() + } + } + + it should "return valid DecisionTreeTuner for Binary Classification Data" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("trees", "classifier") + ) + + val data = new DataPrep( + DiscreteTestDataGenerator.generateBinaryClassificationData(10000) + ).prepData().data + + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "Trees" + ) + + val treesModelsWithResults: TreesModelsWithResults = new DecisionTreeTuner( + data, + trainSplits, + "classifier" + ).setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setTreesNumericBoundaries(_mainConfig.numericBoundaries) + .setTreesStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("areaUnderROC") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + 
.setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + 
.setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + treesModelsWithResults != null, + "treesModelsWithResults should not have been null" + ) + assert( + treesModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + treesModelsWithResults.model != null, + "model should not have been null" + ) + assert( + treesModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } + + it should "return valid DecisionTreeClassifier model for Multi-class Classification data" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("trees", "classifier") + ) + + val data = new DataPrep( + DiscreteTestDataGenerator.generateMultiClassClassificationData(10000) + ).prepData().data + + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", 
+ "cache", + "Trees" + ) + + val treesModelsWithResults: TreesModelsWithResults = new DecisionTreeTuner( + data, + trainSplits, + "classifier" + ).setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setTreesNumericBoundaries(_mainConfig.numericBoundaries) + .setTreesStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("accuracy") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + 
.setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + treesModelsWithResults != null, + "treesModelsWithResults should not have been null" + ) + assert( + treesModelsWithResults.evalMetrics != null, + "evalMetrics should not have 
been null" + ) + assert( + treesModelsWithResults.model != null, + "model should not have been null" + ) + assert( + treesModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + + } + + it should "return valid DecisionTreeRegressor model for Regression data" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("trees", "regressor") + ) + + val data = new DataPrep( + DiscreteTestDataGenerator.generateMultiClassClassificationData(10000) + ).prepData().data + + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "Trees" + ) + + val treesModelsWithResults: TreesModelsWithResults = new DecisionTreeTuner( + data, + trainSplits, + "classifier" + ).setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setTreesNumericBoundaries(_mainConfig.numericBoundaries) + .setTreesStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("rmse") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + 
.setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("minimize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode(_mainConfig.geneticConfig.mutationMagnitudeMode) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + 
.setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + treesModelsWithResults != null, + "treesModelsWithResults should not have been null" + ) + assert( + treesModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + treesModelsWithResults.model != null, + "model should not have been null" + ) + assert( + treesModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/model/GBTreesTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/GBTreesTunerTest.scala new file mode 100644 index 00000000..439529ba --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/GBTreesTunerTest.scala @@ -0,0 +1,412 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.GBTModelsWithResults +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + DiscreteTestDataGenerator +} + +class GBTreesTunerTest extends AbstractUnitSpec { + + "GBTreesTuner" should "throw UnsupportedOperationException for passing invalid params" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "GBT" + ) + + new 
GBTreesTuner(null, splitData, null).evolveBest() + } + } + + it should "throw UnsupportedOperationException for passing invalid modelSelection" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "GBT" + ) + + new GBTreesTuner(AutomationUnitTestsUtil.getAdultDf(), splitData, "err") + .evolveBest() + } + } + + it should "return valid GBTClassifier for Binary Classification Data" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("gbt", "classifier") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateBinaryClassificationData(10000) + ).prepData().data + val splitData = DataSplitUtility.split( + data, + 1, + "random", + "income", + "dbfs:/test", + "cache", + "GBT" + ) + val gbtModelsWithResults: GBTModelsWithResults = + new GBTreesTuner(data, splitData, "classifier") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setGBTNumericBoundaries(_mainConfig.numericBoundaries) + .setGBTStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("areaUnderROC") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( +
_mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + 
.setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + gbtModelsWithResults != null, + "gbtModelsWithResults should not have been null" + ) + assert( + gbtModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + gbtModelsWithResults.model != null, + "model should not have been null" + ) + assert( + gbtModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } + + it should "throw an exception for attempting to run Multiclass Classification in GBT" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("gbt", "classifier") + ) + val data = + new DataPrep( + DiscreteTestDataGenerator.generateMultiClassClassificationData(10000) + ).prepData().data + + val splitData = DataSplitUtility.split( + data, + 1, + "random", + "income", + "dbfs:/test", + "cache", + "GBT" + ) + + val gbtModelsWithResults: GBTreesTuner = + new GBTreesTuner(data, splitData, "classifier") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setGBTNumericBoundaries(_mainConfig.numericBoundaries) + .setGBTStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("areaUnderROC") +
.setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + 
.setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + + intercept[IllegalArgumentException] { + gbtModelsWithResults.evolveBest() + } + + } + + it should "return valid GBTRegressor for Regression Data" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("gbt", "regressor") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateRegressionData(10000) + ).prepData().data + val splitData = DataSplitUtility.split( + data, + 1, + "random", + "income", + "dbfs:/test", + "cache", + "GBT" + ) + + val gbtModelsWithResults: GBTModelsWithResults = + new GBTreesTuner(data, splitData, "regressor") + .setFirstGenerationGenePool(5) +
.setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setGBTNumericBoundaries(_mainConfig.numericBoundaries) + .setGBTStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("rmse") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + 
_mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("minimize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + gbtModelsWithResults != null, + "gbtModelsWithResults should not have been null" + ) + assert( + gbtModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + gbtModelsWithResults.model != null, + "model should not have been null" + ) + assert( + gbtModelsWithResults.modelHyperParams != null, + "modelHyperParams should 
not have been null" + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/model/LinearRegressionTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/LinearRegressionTunerTest.scala new file mode 100644 index 00000000..a89ae87a --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/LinearRegressionTunerTest.scala @@ -0,0 +1,180 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.LinearRegressionModelsWithResults +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil} + +class LinearRegressionTunerTest extends AbstractUnitSpec { + "LinearRegressionTuner" should "throw NullPointerException for passing invalid params" in { + a[NullPointerException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "LinearRegression" + ) + + new LinearRegressionTuner(null, splitData).evolveBest() + } + } + + it should "throw AssertionError for passing invalid dataset" in { + a[AssertionError] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "LinearRegression" + ) + + new LinearRegressionTuner( + AutomationUnitTestsUtil.sparkSession.emptyDataFrame, + splitData + ).evolveBest() + } + } + + it should "return valid Regression Model" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator + .generateDefaultConfig("linearregression", "regressor") + ) + val data = new DataPrep( + AutomationUnitTestsUtil.convertCsvToDf("/AirQualityUCI.csv") + ).prepData().data + val splitData = DataSplitUtility.split( + data, + 1, +
"random", + "income", + "dbfs:/test", + "cache", + "LinearRegression" + ) + + val linearRegressionModelsWithResults: LinearRegressionModelsWithResults = + new LinearRegressionTuner(data, splitData) + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setLinearRegressionNumericBoundaries( + Map( + "elasticNetParams" -> Tuple2(0.0, 1.0), + "maxIter" -> Tuple2(10.0, 100.0), + "regParam" -> Tuple2(0.2, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-6) + ) + ) + .setLinearRegressionStringBoundaries( + Map("loss" -> List("squaredError")) + ) + .setScoringMetric("rmse") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + 
.setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(1) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("minimize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + 
_mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + + assert( + linearRegressionModelsWithResults != null, + "linearRegressionModelsWithResults should not have been null" + ) + assert( + linearRegressionModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + linearRegressionModelsWithResults.model != null, + "model should not have been null" + ) + assert( + linearRegressionModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/model/LogisticRegressionTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/LogisticRegressionTunerTest.scala new file mode 100644 index 00000000..a7c1e1c9 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/LogisticRegressionTunerTest.scala @@ -0,0 +1,183 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.LogisticRegressionModelsWithResults +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil} + +class LogisticRegressionTunerTest extends AbstractUnitSpec { + + "LogisticRegressionTuner" should "throw IllegalArgumentException for passing invalid params" in { + a[IllegalArgumentException] should be thrownBy { + val data = + new DataPrep(AutomationUnitTestsUtil.getAdultDf()).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + "income", + "dbfs:/test", + "cache", + "LogisticRegression" + ) + + new LogisticRegressionTuner(null, trainSplits).evolveBest() + } + } + + it should "throw IllegalArgumentException for passing invalid dataset" in { + a[IllegalArgumentException] should be thrownBy { + + val data = + new
DataPrep(AutomationUnitTestsUtil.getAdultDf()).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + "income", + "dbfs:/test", + "cache", + "LogisticRegression" + ) + + new LogisticRegressionTuner( + AutomationUnitTestsUtil.sparkSession.emptyDataFrame, + trainSplits + ).evolveBest() + } + } + + it should "return valid Binary Classification Model" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator + .generateDefaultConfig("logisticregression", "classifier") + ) + + val data = + new DataPrep(AutomationUnitTestsUtil.getAdultDf()).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "LogisticRegression" + ) + + val logisticRegressionModelsWithResults + : LogisticRegressionModelsWithResults = + new LogisticRegressionTuner(data, trainSplits) + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setLogisticRegressionNumericBoundaries( + Map( + "elasticNetParams" -> Tuple2(1.0, 1.0), + "maxIter" -> Tuple2(10.0, 20.0), + "regParam" -> Tuple2(0.9, 1.0), + "tolerance" -> Tuple2(1E-9, 1E-5) + ) + ) + .setScoringMetric("areaUnderROC") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( +
_mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + 
.setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + logisticRegressionModelsWithResults != null, + "logisticRegressionModelsWithResults should not have been null" + ) + assert( + logisticRegressionModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + logisticRegressionModelsWithResults.model != null, + "model should not have been null" + ) + assert( + logisticRegressionModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/model/MLPCTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/MLPCTunerTest.scala new file mode 100644 index 00000000..bc249fe2 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/MLPCTunerTest.scala @@ -0,0 +1,188 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.MLPCModelsWithResults +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + DiscreteTestDataGenerator +} +import org.apache.spark.sql.AnalysisException + +class MLPCTunerTest extends AbstractUnitSpec { + + 
"MLPCTuner" should "throw NullPointerException for passing invalid params" in { + a[NullPointerException] should be thrownBy { + + val data = new DataPrep( + DiscreteTestDataGenerator.generateBinaryClassificationData(10000) + ).prepData().data + + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + "label", + "dbfs:/test", + "cache", + "MLPC" + ) + + new MLPCTuner(null, trainSplits).evolveBest() + } + } + + it should "should throw AnalysisException for passing invalid dataset" in { + a[AnalysisException] should be thrownBy { + + val data = new DataPrep( + DiscreteTestDataGenerator.generateBinaryClassificationData(10000) + ).prepData().data + + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + "label", + "dbfs:/test", + "cache", + "MLPC" + ) + + new MLPCTuner( + AutomationUnitTestsUtil.sparkSession.emptyDataFrame, + trainSplits + ).evolveBest() + } + } + + it should "should return valid Binary Classification Model" in { + + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator + .generateDefaultConfig("mlpc", "classifier") + ) + + val data = + new DataPrep(AutomationUnitTestsUtil.getAdultDf()).prepData().data + + val splitData = DataSplitUtility.split( + data, + 1, + "random", + "income", + "dbfs:/test", + "cache", + "MLPC" + ) + + val logisticRegressionModelsWithResults: MLPCModelsWithResults = + new MLPCTuner(data, splitData) + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setMlpcNumericBoundaries(_mainConfig.numericBoundaries) + .setMlpcStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("areaUnderROC") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + 
.setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + 
.setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + logisticRegressionModelsWithResults != null, + "logisticRegressionModelsWithResults should not have been null" + ) + assert( + logisticRegressionModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + logisticRegressionModelsWithResults.model != null, + "model should not have been null" + ) + assert( + logisticRegressionModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/model/RandomForestTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/RandomForestTunerTest.scala new file mode 100644 index 00000000..64474574 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/RandomForestTunerTest.scala @@ -0,0 +1,430 @@ +package com.databricks.labs.automl.model + +import 
com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.RandomForestModelsWithResults +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + DiscreteTestDataGenerator +} + +class RandomForestTunerTest extends AbstractUnitSpec { + + "RandomForestTuner" should "throw UnsupportedOperationException for passing invalid params" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "RandomForest" + ) + + new RandomForestTuner(null, splitData, null).evolveBest() + } + } + + it should "throw UnsupportedOperationException for passing invalid modelSelection" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "RandomForest" + ) + + new RandomForestTuner( + AutomationUnitTestsUtil.getAdultDf(), + splitData, + "err" + ).evolveBest() + } + } + + it should "return a valid model for a Binary Classification task" in { + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("randomforest", "classifier") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateBinaryClassificationData(10000) + ).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "RandomForest" + ) + + val randomForestModelsWithResults: RandomForestModelsWithResults = + new RandomForestTuner(data, trainSplits, "classifier") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + 
.setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setRandomForestNumericBoundaries(_mainConfig.numericBoundaries) + .setRandomForestStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("areaUnderROC") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + 
.setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + randomForestModelsWithResults != null, + "randomForestModelsWithResults should not have been null" + ) + assert( + randomForestModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + randomForestModelsWithResults.model != null, + "model should not have been null" + ) + assert( + randomForestModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been 
null" + ) + } + + it should "return a valid model for a Multi-class Classification task" in { + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("randomforest", "classifier") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateMultiClassClassificationData(10000) + ).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "RandomForest" + ) + + val randomForestModelsWithResults: RandomForestModelsWithResults = + new RandomForestTuner(data, trainSplits, "classifier") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setRandomForestNumericBoundaries(_mainConfig.numericBoundaries) + .setRandomForestStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("accuracy") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + 
_mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + 
.setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + randomForestModelsWithResults != null, + "randomForestModelsWithResults should not have been null" + ) + assert( + randomForestModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + randomForestModelsWithResults.model != null, + "model should not have been null" + ) + assert( + randomForestModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } + + it should "return a valid model for a Regression task" in { + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("randomforest", "regressor") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateRegressionData(10000) + ).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "RandomForest" + ) + + val randomForestModelsWithResults: RandomForestModelsWithResults = + new RandomForestTuner(data, trainSplits, "regressor") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setRandomForestNumericBoundaries( + Map( + "numTrees" -> (50.0, 100.0), + "maxBins" -> (30.0, 100.0), + "maxDepth" -> (2.0, 10.0), + "minInfoGain" -> (0.3, 0.5), + "subSamplingRate" -> (0.5, 0.6) + ) + ) + .setRandomForestStringBoundaries(_mainConfig.stringBoundaries) + .setScoringMetric("rmse") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + 
.setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(2) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("minimize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + 
.setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + randomForestModelsWithResults != null, + "randomForestModelsWithResults should not have been null" + ) + assert( + randomForestModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + randomForestModelsWithResults.model != null, + "model should not have been null" + ) + assert( + randomForestModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/model/SVMTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/SVMTunerTest.scala new file mode 100644 index 00000000..bb6a5a80 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/SVMTunerTest.scala @@ -0,0 +1,43 @@ +package 
com.databricks.labs.automl.model + +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil} + +class SVMTunerTest extends AbstractUnitSpec { + + "SVMTuner" should "throw IllegalArgumentException for passing invalid params" in { + a[IllegalArgumentException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "SVM" + ) + + new SVMTuner(null, splitData).evolveBest() + } + } + + it should "throw IllegalArgumentException for passing invalid dataset" in { + a[IllegalArgumentException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "SVM" + ) + + new SVMTuner( + AutomationUnitTestsUtil.sparkSession.emptyDataFrame, + splitData + ).evolveBest() + } + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/model/XgBoostTunerTest.scala b/src/test/scala/com/databricks/labs/automl/model/XgBoostTunerTest.scala new file mode 100644 index 00000000..06031533 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/model/XgBoostTunerTest.scala @@ -0,0 +1,415 @@ +package com.databricks.labs.automl.model + +import com.databricks.labs.automl.executor.DataPrep +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.model.tools.split.DataSplitUtility +import com.databricks.labs.automl.params.XGBoostModelsWithResults +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + DiscreteTestDataGenerator +} + +class XgBoostTunerTest extends AbstractUnitSpec { + + "XgBoostTuner" should "throw UnsupportedOperationException for passing invalid params" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + 
AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "XGBoost" + ) + + new XGBoostTuner(null, splitData, null).evolveBest() + } + } + + it should "throw UnsupportedOperationException for passing invalid modelSelection" in { + a[UnsupportedOperationException] should be thrownBy { + val splitData = DataSplitUtility.split( + AutomationUnitTestsUtil.getAdultDf(), + 1, + "random", + "income", + "dbfs:/test", + "cache", + "XGBoost" + ) + + new XGBoostTuner(AutomationUnitTestsUtil.getAdultDf(), splitData, "err") + .evolveBest() + } + } + + it should "return valid XGBoost model for Binary Classification" in { + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("xgboost", "classifier") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateBinaryClassificationData(10000) + ).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "XGBoost" + ) + + val xGBoostModelsWithResults: XGBoostModelsWithResults = + new XGBoostTuner(data, trainSplits, "classifier") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setXGBoostNumericBoundaries(_mainConfig.numericBoundaries) + .setScoringMetric("areaUnderROC") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + 
.setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + 
.setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + xGBoostModelsWithResults != null, + "xGBoostModelsWithResults should not have been null" + ) + assert( + xGBoostModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + xGBoostModelsWithResults.model != null, + "model should not have been null" + ) + assert( + xGBoostModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } + + it should "return valid XGBoost model for Multiclass Classification" in { + val _mainConfig = ConfigurationGenerator.generateMainConfig( + ConfigurationGenerator.generateDefaultConfig("xgboost", "classifier") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateMultiClassClassificationData(10000) + ).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "XGBoost" + ) + + val xGBoostModelsWithResults: XGBoostModelsWithResults = + new XGBoostTuner(data, trainSplits, "classifier") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + 
.setXGBoostNumericBoundaries(_mainConfig.numericBoundaries) + .setScoringMetric("accuracy") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + .setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + 
.setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("maximize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + _mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + xGBoostModelsWithResults != null, + "xGBoostModelsWithResults should not have been null" + ) + assert( + xGBoostModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + xGBoostModelsWithResults.model != null, + "model should not have been null" + ) + assert( + xGBoostModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } + + it should "return valid XGBoost model for Regression" in { + val _mainConfig = ConfigurationGenerator.generateMainConfig( + 
ConfigurationGenerator.generateDefaultConfig("xgboost", "regressor") + ) + val data = new DataPrep( + DiscreteTestDataGenerator.generateRegressionData(10000) + ).prepData().data + val trainSplits = DataSplitUtility.split( + data, + 1, + "random", + _mainConfig.labelCol, + "dbfs:/test", + "cache", + "XGBoost" + ) + val xGBoostModelsWithResults: XGBoostModelsWithResults = + new XGBoostTuner(data, trainSplits, "regressor") + .setFirstGenerationGenePool(5) + .setLabelCol(_mainConfig.labelCol) + .setFeaturesCol(_mainConfig.featuresCol) + .setFieldsToIgnore(_mainConfig.fieldsToIgnoreInVector) + .setXGBoostNumericBoundaries(_mainConfig.numericBoundaries) + .setScoringMetric("rmse") + .setTrainPortion(_mainConfig.geneticConfig.trainPortion) + .setTrainSplitMethod(_mainConfig.geneticConfig.trainSplitMethod) + .setSyntheticCol(_mainConfig.geneticConfig.kSampleConfig.syntheticCol) + .setKGroups(_mainConfig.geneticConfig.kSampleConfig.kGroups) + .setKMeansMaxIter(_mainConfig.geneticConfig.kSampleConfig.kMeansMaxIter) + .setKMeansTolerance( + _mainConfig.geneticConfig.kSampleConfig.kMeansTolerance + ) + .setKMeansDistanceMeasurement( + _mainConfig.geneticConfig.kSampleConfig.kMeansDistanceMeasurement + ) + .setKMeansSeed(_mainConfig.geneticConfig.kSampleConfig.kMeansSeed) + .setKMeansPredictionCol( + _mainConfig.geneticConfig.kSampleConfig.kMeansPredictionCol + ) + .setLSHHashTables(_mainConfig.geneticConfig.kSampleConfig.lshHashTables) + .setLSHSeed(_mainConfig.geneticConfig.kSampleConfig.lshSeed) + .setLSHOutputCol(_mainConfig.geneticConfig.kSampleConfig.lshOutputCol) + .setQuorumCount(_mainConfig.geneticConfig.kSampleConfig.quorumCount) + .setMinimumVectorCountToMutate( + _mainConfig.geneticConfig.kSampleConfig.minimumVectorCountToMutate + ) + .setVectorMutationMethod( + _mainConfig.geneticConfig.kSampleConfig.vectorMutationMethod + ) + .setMutationMode(_mainConfig.geneticConfig.kSampleConfig.mutationMode) + 
.setMutationValue(_mainConfig.geneticConfig.kSampleConfig.mutationValue) + .setLabelBalanceMode( + _mainConfig.geneticConfig.kSampleConfig.labelBalanceMode + ) + .setCardinalityThreshold( + _mainConfig.geneticConfig.kSampleConfig.cardinalityThreshold + ) + .setNumericRatio(_mainConfig.geneticConfig.kSampleConfig.numericRatio) + .setNumericTarget(_mainConfig.geneticConfig.kSampleConfig.numericTarget) + .setTrainSplitChronologicalColumn( + _mainConfig.geneticConfig.trainSplitChronologicalColumn + ) + .setTrainSplitChronologicalRandomPercentage( + _mainConfig.geneticConfig.trainSplitChronologicalRandomPercentage + ) + .setParallelism(4) + .setKFold(1) + .setSeed(_mainConfig.geneticConfig.seed) + .setOptimizationStrategy("minimize") + .setFirstGenerationGenePool(5) + .setNumberOfMutationGenerations(2) + .setNumberOfMutationsPerGeneration(5) + .setNumberOfParentsToRetain(2) + .setGeneticMixing(_mainConfig.geneticConfig.geneticMixing) + .setGenerationalMutationStrategy( + _mainConfig.geneticConfig.generationalMutationStrategy + ) + .setMutationMagnitudeMode( + _mainConfig.geneticConfig.mutationMagnitudeMode + ) + .setFixedMutationValue(_mainConfig.geneticConfig.fixedMutationValue) + .setEarlyStoppingFlag(_mainConfig.autoStoppingFlag) + .setEarlyStoppingScore(_mainConfig.autoStoppingScore) + .setEvolutionStrategy(_mainConfig.geneticConfig.evolutionStrategy) + .setContinuousEvolutionImprovementThreshold( + _mainConfig.geneticConfig.continuousEvolutionImprovementThreshold + ) + .setGeneticMBORegressorType( + _mainConfig.geneticConfig.geneticMBORegressorType + ) + .setGeneticMBOCandidateFactor( + _mainConfig.geneticConfig.geneticMBOCandidateFactor + ) + .setDataReductionFactor(_mainConfig.dataReductionFactor) + .setFirstGenMode(_mainConfig.geneticConfig.initialGenerationMode) + .setFirstGenPermutations(4) + .setFirstGenIndexMixingMode( + _mainConfig.geneticConfig.initialGenerationConfig.indexMixingMode + ) + .setFirstGenArraySeed( + 
_mainConfig.geneticConfig.initialGenerationConfig.arraySeed + ) + .setHyperSpaceModelCount(50000) + .evolveBest() + assert( + xGBoostModelsWithResults != null, + "xGBoostModelsWithResults should not have been null" + ) + assert( + xGBoostModelsWithResults.evalMetrics != null, + "evalMetrics should not have been null" + ) + assert( + xGBoostModelsWithResults.model != null, + "model should not have been null" + ) + assert( + xGBoostModelsWithResults.modelHyperParams != null, + "modelHyperParams should not have been null" + ) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/AutoMlOutputDatasetTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/AutoMlOutputDatasetTransformerTest.scala new file mode 100644 index 00000000..fb3ae1c6 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/AutoMlOutputDatasetTransformerTest.scala @@ -0,0 +1,49 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.sql.DataFrame + +class AutoMlOutputDatasetTransformerTest extends AbstractUnitSpec { + + "AutoMlOutputDatasetTransformer" should "drop Id column, retain original columns from the original dataset" in { + val testVars = PipelineTestUtils.getTestVars() + val pipelineAdultDf = new ZipRegisterTempTransformer() + .setFeatureColumns(testVars.features) + .setTempViewOriginalDatasetName(testVars.tempTableName) + .setLabelColumn(testVars.labelCol) + .transform(testVars.df) + val pipelineOutputDf = new AutoMlOutputDatasetTransformer() + .setFeatureColumns(testVars.features) + .setTempViewOriginalDatasetName(testVars.tempTableName) + .setLabelColumn(testVars.labelCol) + .transform(pipelineAdultDf) + assertAutoMlOutputDatasetTransformerTest(pipelineAdultDf, pipelineOutputDf) + } + + "AutoMlOutputDatasetTransformer" should "work with Pipeline save/load" in { + val testVars = 
PipelineTestUtils.getTestVars() + val pipelineAdultDf = new ZipRegisterTempTransformer() + .setFeatureColumns(testVars.features) + .setTempViewOriginalDatasetName(testVars.tempTableName) + .setLabelColumn(testVars.labelCol) + val pipelineOutputDf = new AutoMlOutputDatasetTransformer() + .setFeatureColumns(testVars.features) + .setTempViewOriginalDatasetName(testVars.tempTableName) + .setLabelColumn(testVars.labelCol) + val transformedAdultDfwithLabel = PipelineTestUtils + .saveAndLoadPipeline(Array(pipelineAdultDf, pipelineOutputDf), testVars.df, "automl-output-df-pipe") + .transform(testVars.df) + assert(!transformedAdultDfwithLabel.columns.contains(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + "AutoMlOutputDatasetTransformer should have dropped Id column and retained original columns") + } + + + def assertAutoMlOutputDatasetTransformerTest(pipelineAdultDf: DataFrame, + pipelineOutputDf: DataFrame): Unit = { + assert(pipelineAdultDf.columns.contains(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + "ZipRegisterTempTransformer stage should have added Id column") + assert(!pipelineOutputDf.columns.contains(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + "AutoMlOutputDatasetTransformer should have dropped Id column and retained original columns") + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/CardinalityLimitColumnPrunerTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/CardinalityLimitColumnPrunerTransformerTest.scala new file mode 100644 index 00000000..0eeea436 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/CardinalityLimitColumnPrunerTransformerTest.scala @@ -0,0 +1,46 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.ml.PipelineStage +import org.apache.spark.sql.DataFrame + +import 
scala.collection.mutable.ArrayBuffer + +class CardinalityLimitColumnPrunerTransformerTest extends AbstractUnitSpec { + + "CardinalityLimitColumnPrunerTransformerTest" should "check cardinality" in { + val testVars = PipelineTestUtils.getTestVars() + val stages = new ArrayBuffer[PipelineStage] + val nonFeatureCols = + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, testVars.labelCol) + stages += PipelineTestUtils + .addZipRegisterTmpTransformerStage( + testVars.labelCol, + testVars.df.columns.filterNot(item => nonFeatureCols.contains(item)) + ) + stages += new CardinalityLimitColumnPrunerTransformer() + .setLabelColumn(testVars.labelCol) + .setCardinalityLimit(2) + .setCardinalityCheckMode("silent") + .setCardinalityType("exact") + .setCardinalityPrecision(0.0) + val pipelineModel = PipelineTestUtils.saveAndLoadPipeline( + stages.toArray, + testVars.df, + "card-limit-pipeline" + ) + val adultCadDf = pipelineModel.transform(testVars.df) + assertCardinalityTest(adultCadDf) + adultCadDf.show(10) + } + + private def assertCardinalityTest(adultCadDf: DataFrame): Unit = { + assert( + adultCadDf.columns + .exists(item => Array("sex_trimmed", "label").contains(item)), + "CardinalityLimitColumnPrunerTransformer should have retained columns with a defined cardinality" + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/ColumnNameTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/ColumnNameTransformerTest.scala new file mode 100644 index 00000000..1de957e3 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/ColumnNameTransformerTest.scala @@ -0,0 +1,18 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} + +class ColumnNameTransformerTest extends AbstractUnitSpec { + + "ColumnNameTransformerTest" should "rename columns" in { + val testVars = PipelineTestUtils.getTestVars() + val columnNameTransformer = new 
ColumnNameTransformer() + .setInputColumns(Array("age_trimmed")) + .setOutputColumns(Array("age_trimmed_r")) + val renamedDf = columnNameTransformer.transform(testVars.df) + assert(renamedDf.count() == testVars.df.count(), "ColumnNameTransformerTest should not have changed number of rows") + assert(renamedDf.columns.length == testVars.df.columns.length, "ColumnNameTransformerTest should not have changed number of columns") + assert(renamedDf.columns.contains("age_trimmed_r"), "ColumnNameTransformerTest should contain renamed column") + assert(!renamedDf.columns.contains("age_trimmed"), "ColumnNameTransformerTest should not contain original column") + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/CovarianceFilterTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/CovarianceFilterTransformerTest.scala new file mode 100644 index 00000000..d7cf7c36 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/CovarianceFilterTransformerTest.scala @@ -0,0 +1,50 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.{ + AbstractUnitSpec, + DiscreteTestDataGenerator, + PipelineTestUtils +} +import org.apache.spark.ml.PipelineStage + +import scala.collection.mutable.ArrayBuffer + +case class Sample(a: Double, + b: Double, + c: Double, + label: Int, + automl_internal_id: Long) + +class CovarianceFilterTransformerTest extends AbstractUnitSpec { + + "CovarianceFilterTransformerTest" should "apply the filter with right settings" in { + + val EXPECTED_REMAINING_COLS = + Array("a1", "c2", "d1", "d2", "label", "automl_internal_id") + + val data = DiscreteTestDataGenerator.generateFeatureCorrelationData(1000) + + val stages = new ArrayBuffer[PipelineStage] + stages += new CovarianceFilterTransformer() + .setFeatureColumns(Array("a", "b", "c")) + .setLabelColumn("label") + .setFeatureCol("features") + .setCorrelationCutoffHigh(1.0) + .setCorrelationCutoffLow(-1.0) + + val transformedDf = 
PipelineTestUtils + .saveAndLoadPipeline(stages.toArray, data._1, "covar-filter-pipeline") + .transform(data._1) + + transformedDf.show(10) + + assert( + EXPECTED_REMAINING_COLS.forall(transformedDf.schema.names.contains), + "kept correct columns" + ) + assert( + transformedDf.schema.names.forall(EXPECTED_REMAINING_COLS.contains), + "removed correct columns" + ) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/DataSanitizerTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/DataSanitizerTransformerTest.scala new file mode 100644 index 00000000..f15dc27f --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/DataSanitizerTransformerTest.scala @@ -0,0 +1,33 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.ml.PipelineStage + +import scala.collection.mutable.ArrayBuffer + +class DataSanitizerTransformerTest extends AbstractUnitSpec { + + "DataSanitizerTransformer" should "sanitize based on the settings" in { + val testVars = PipelineTestUtils.getTestVars() + val stages = new ArrayBuffer[PipelineStage] + val nonFeatureCols = Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, testVars.labelCol) + stages += PipelineTestUtils + .addZipRegisterTmpTransformerStage( + testVars.labelCol, + testVars.df.columns.filterNot(item => nonFeatureCols.contains(item)) + ) + stages += new DataSanitizerTransformer() + .setLabelColumn(testVars.labelCol) + .setFeatureCol("features") + .setNaFillFlag(true) + stages ++= PipelineTestUtils.buildFeaturesPipelineStages(testVars.df, testVars.labelCol) + val transformedAdultDf = PipelineTestUtils + .saveAndLoadPipeline(stages.toArray, testVars.df, "data_sanitizer_stage") + .transform(testVars.df) + assert(transformedAdultDf.count() == testVars.df.count(), "Number of rows shouldn't have changed") + 
assert(transformedAdultDf.columns.length == 3, "Should only contain label, ID and feature columns") + transformedAdultDf.show(10) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/DatasetsUnionTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/DatasetsUnionTransformerTest.scala new file mode 100644 index 00000000..e91c9eee --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/DatasetsUnionTransformerTest.scala @@ -0,0 +1,28 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.sql.functions._ + +class DatasetsUnionTransformerTest extends AbstractUnitSpec { + + "DatasetsUnionTransformerTest" should "correctly union DFs" in { + val testVars = PipelineTestUtils.getTestVars() + val df1 = testVars.df.withColumn(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, monotonically_increasing_id()) + val df2 = testVars.df.withColumn(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, monotonically_increasing_id()) + + new RegisterTempTableTransformer() + .setTempTableName("test_1") + .setStatement("Select * from __THIS__") + .transform(df1) + + val unionDf = new DatasetsUnionTransformer() + .setUnionDatasetName("test_1") + .transform(df2) + + assert(unionDf.count() == df1.count() + df2.count(), + "DatasetsUnionTransformer did not correctly union the datasets") + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/DateFieldTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/DateFieldTransformerTest.scala new file mode 100644 index 00000000..d4ab256d --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/DateFieldTransformerTest.scala @@ -0,0 +1,87 @@ +package com.databricks.labs.automl.pipeline + +import java.sql.{Date, Timestamp} + +import com.databricks.labs.automl.{ + AbstractUnitSpec, + 
AutomationUnitTestsUtil, + PipelineTestUtils +} +import org.apache.spark.ml.PipelineStage +import org.apache.spark.sql.Row +import org.apache.spark.sql.types._ + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +class DateFieldTransformerTest extends AbstractUnitSpec { + + "DateFieldTransformerTest" should "convert date/time fields" in { + val spark = AutomationUnitTestsUtil.sparkSession + val dateColToBeTransformed = "download_date" + val tsColToBeTransformed = "event_ts" + val sourceDF = spark.createDataFrame( + List( + Row( + 300L, + Date.valueOf("2016-09-30"), + Timestamp.valueOf("2007-09-23 10:10:10.0"), + 0 + ), + Row( + 400L, + Date.valueOf("2016-10-30"), + Timestamp.valueOf("2007-09-24 12:05:55.0"), + 1 + ) + ).asJava, + StructType( + Array( +// StructField("download_events", ArrayType(StringType, false), nullable = true), + StructField("download_events", LongType, nullable = true), + StructField(dateColToBeTransformed, DateType, nullable = true), + StructField(tsColToBeTransformed, TimestampType, nullable = true), + StructField("label", IntegerType, nullable = false) + ) + ) + ) + + val stages = new ArrayBuffer[PipelineStage] + stages += PipelineTestUtils + .addZipRegisterTmpTransformerStage( + "label", + Array("download_events", "download_date", "event_ts") + ) + stages += new DateFieldTransformer() + .setLabelColumn("label") + .setMode("split") + val transformedDfwithDateTsFeatures = PipelineTestUtils + .saveAndLoadPipeline(stages.toArray, sourceDF, "date-field-tran-pipeline") + .transform(sourceDF) + val expectedColsToBePresent = + Array( + "download_date_year", + "download_date_month", + "download_date_day", + "event_ts_year", + "event_ts_month", + "event_ts_day", + "event_ts_hour", + "event_ts_minute", + "event_ts_second" + ) + assert( + !transformedDfwithDateTsFeatures.columns.exists( + col => Array(dateColToBeTransformed, tsColToBeTransformed).contains(col) + ), + s"""Original columns 
${Array(dateColToBeTransformed, tsColToBeTransformed) + .mkString(", ")} should have been dropped""" + ) + assert( + transformedDfwithDateTsFeatures.columns + .exists(col => expectedColsToBePresent.contains(col)), + s"""These columns ${expectedColsToBePresent + .mkString(", ")} should have been added""" + ) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/DropColumnsTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/DropColumnsTransformerTest.scala new file mode 100644 index 00000000..12f75a37 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/DropColumnsTransformerTest.scala @@ -0,0 +1,34 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.ml.PipelineStage + +import scala.collection.mutable.ArrayBuffer + +class DropColumnsTransformerTest extends AbstractUnitSpec { + + "DropColumnsTransformer" should "drop columns with Broadcast" in { + val testVars = PipelineTestUtils.getTestVars() + val stages = new ArrayBuffer[PipelineStage] + val nonFeatureCols = + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, testVars.labelCol) + val columnsToRemove = + Array("age_trimmed", "workclass_trimmed", "fnlwgt_trimmed") + stages += PipelineTestUtils + .addZipRegisterTmpTransformerStage( + testVars.labelCol, + testVars.df.columns.filterNot(item => nonFeatureCols.contains(item)) + ) + stages += new DropColumnsTransformer() + .setInputCols(columnsToRemove) + val pipelineModel = PipelineTestUtils + .saveAndLoadPipeline(stages.toArray, testVars.df, "drop-columns-pipeline") + val bc = testVars.df.sparkSession.sparkContext.broadcast(pipelineModel) + val dfWithDroppedCols = bc.value.transform(testVars.df) + assert( + !dfWithDroppedCols.columns.exists(item => columnsToRemove.contains(item)), + "DropColumnsTransformer should have removed input columns" + ) + } 
+} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeredDfTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeredDfTest.scala new file mode 100644 index 00000000..27396bba --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeredDfTest.scala @@ -0,0 +1,43 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} + +class FeatureEngineeredDfTest extends AbstractUnitSpec { + "It" should "return feature engineered df" in { + val testVars = PipelineTestUtils.getTestVars() + val overrides = Map( + "labelCol" -> "label", "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, "oneHotEncodeFlag" -> true, + "numericBoundaries" -> Map( + "numTrees" -> Tuple2(50.0, 100.0), + "maxBins" -> Tuple2(10.0, 20.0), + "maxDepth" -> Tuple2(2.0, 5.0), + "minInfoGain" -> Tuple2(0.0, 0.03), + "subSamplingRate" -> Tuple2(0.5, 1.0)), + "tunerParallelism" -> 10, + "outlierFilterFlag" -> true, + "outlierFilterPrecision" -> 0.05, + "outlierLowerFilterNTile" -> 0.05, + "outlierUpperFilterNTile" -> 0.95, + "tunerKFold" -> 1, + "tunerTrainPortion" -> 0.70, + "tunerFirstGenerationGenePool" -> 5, + "tunerNumberOfGenerations" -> 2, + "tunerNumberOfParentsToRetain" -> 1, + "tunerNumberOfMutationsPerGeneration" -> 1, + "tunerGeneticMixing" -> 0.8, + "tunerGenerationalMutationStrategy" -> "fixed", + "tunerEvolutionStrategy" -> "batch", + "pipelineDebugFlag" -> true, + "mlFlowLoggingFlag" -> false + ) + val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) + val featEngPipe = FamilyRunner(testVars.df, Array(randomForestConfig)).generateFeatureEngineeredPipeline() + + featEngPipe("RandomForest").transform(testVars.df).show(100) + } + +} diff --git 
a/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringOutputDfTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringOutputDfTest.scala new file mode 100644 index 00000000..71c6b7eb --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringOutputDfTest.scala @@ -0,0 +1,123 @@ +package com.databricks.labs.automl.pipeline + +import java.sql.{Date, Timestamp} + +import com.databricks.labs.automl.AbstractUnitSpec +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import org.apache.spark.sql.{Row, SparkSession} +import org.apache.spark.sql.functions.rand +import org.apache.spark.sql.types._ + +import scala.collection.JavaConverters._ + +class FeatureEngineeringOutputDfTest extends AbstractUnitSpec { + + "it" should "do this" in { + val random = new scala.util.Random + // Create a start and end value pair + val startDay = 1 + val endDay = 24 + val startMonth = 1 + val endMonth = 12 + val startYear = 2000 + val endYear = 2020 + + val dateColToBeTransformed = "download_date" + val tsColToBeTransformed = "event_ts" + val sourceDFTmp = SparkSession + .builder() + .master("local[*]") + .appName("providentiaml-unit-tests") + .getOrCreate() + .createDataFrame( + (List.fill(50)( + Row( + scala.math.abs(scala.util.Random.nextLong()), + scala.math.abs(scala.util.Random.nextLong()) + "", + Date.valueOf( + s"${startYear + random.nextInt((endYear - startYear) + 1)}-${startMonth + random + .nextInt((endMonth - startMonth) + 1)}-${startDay + random + .nextInt((endDay - startDay) + 1)}" + ), + Timestamp.valueOf( + s"${startYear + random.nextInt((endYear - startYear) + 1)}-${startMonth + random + .nextInt((endMonth - startMonth) + 1)}-${startDay + random + .nextInt((endDay - startDay) + 1)} ${startMonth + random + .nextInt((endMonth - startMonth) + 1)}:${startMonth + random + .nextInt((endMonth - startMonth) + 1)}:${startMonth + 
random + .nextInt((endMonth - startMonth) + 1)}.${startMonth + random + .nextInt((endMonth - startMonth) + 1)}" + ), + "pass" + ) + ) ++ List.fill(50)( + Row( + scala.math.abs(scala.util.Random.nextLong()), + scala.math.abs(scala.util.Random.nextLong()) + "", + Date.valueOf("2016-10-30"), + Timestamp.valueOf("2007-09-24 12:05:55.0"), + "fail" + ) + )).asJava, + StructType( + Array( + StructField("download_events", LongType, nullable = true), + StructField("download_events_descr", StringType, nullable = true), + StructField(dateColToBeTransformed, DateType, nullable = true), + StructField(tsColToBeTransformed, TimestampType, nullable = true), + StructField("label", StringType, nullable = false) + ) + ) + ) + val sourceDF = sourceDFTmp.orderBy(rand()) + + val overrides = Map( + "labelCol" -> "label", + "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, + "oneHotEncodeFlag" -> false, + "numericBoundaries" -> Map( + "numTrees" -> Tuple2(50.0, 100.0), + "maxBins" -> Tuple2(10.0, 20.0), + "maxDepth" -> Tuple2(2.0, 5.0), + "minInfoGain" -> Tuple2(0.0, 0.03), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ), + "tunerParallelism" -> 10, + "outlierFilterFlag" -> true, + "outlierFilterPrecision" -> 0.05, + "outlierLowerFilterNTile" -> 0.05, + "outlierUpperFilterNTile" -> 0.95, + "tunerTrainSplitMethod" -> "random", + "tunerTrainPortion" -> 0.70, + "tunerFirstGenerationGenePool" -> 5, + "tunerNumberOfGenerations" -> 2, + "tunerNumberOfParentsToRetain" -> 1, + "tunerNumberOfMutationsPerGeneration" -> 1, + "tunerGeneticMixing" -> 0.8, + "tunerGenerationalMutationStrategy" -> "fixed", + "tunerEvolutionStrategy" -> "batch", + "pipelineDebugFlag" -> true, + "mlFlowLoggingFlag" -> false, + "fillConfigCardinalityLimit" -> "100" + ) + + val randomForestConfig = ConfigurationGenerator.generateConfigFromMap( + "RandomForest", + "classifier", + overrides + ) + val runner = FamilyRunner(sourceDF, Array(randomForestConfig)) + .generateFeatureEngineeredPipeline(verbose = true) + val outputDf 
= runner("RandomForest").transform(sourceDF) + val noOfCols = outputDf.columns + assert( + noOfCols.length == 17, + s"Feature engineered dataset's columns should have been 17, but ${noOfCols.length} were found" + ) + outputDf.show(100) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringPipelineContextTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringPipelineContextTest.scala new file mode 100644 index 00000000..8682f18f --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureEngineeringPipelineContextTest.scala @@ -0,0 +1,197 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.AutomationUnitTestsUtil.convertCsvToDf +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + PipelineTestUtils +} +import org.apache.spark.ml.PipelineModel +import org.apache.spark.sql.functions.{col, trim} + +class FeatureEngineeringPipelineContextTest extends AbstractUnitSpec { + + ignore should "correctly generate feature engineered dataset" in { + val testVars = PipelineTestUtils.getTestVars() + // Generate config + val overrides = Map( + "labelCol" -> "label", + "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, + "oneHotEncodeFlag" -> true, + "numericBoundaries" -> Map( + "numTrees" -> Tuple2(50.0, 1000.0), + "maxBins" -> Tuple2(10.0, 100.0), + "maxDepth" -> Tuple2(2.0, 20.0), + "minInfoGain" -> Tuple2(0.0, 0.075), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ) + ) + val randomForestConfig = ConfigurationGenerator.generateConfigFromMap( + "RandomForest", + "classifier", + overrides + ) + randomForestConfig.switchConfig.outlierFilterFlag = true + randomForestConfig.featureEngineeringConfig.outlierFilterPrecision = 0.05 + randomForestConfig.featureEngineeringConfig.outlierLowerFilterNTile = 0.05 + 
randomForestConfig.featureEngineeringConfig.outlierUpperFilterNTile = 0.95 + randomForestConfig.tunerConfig.tunerParallelism = 10 + randomForestConfig.tunerConfig.tunerTrainSplitMethod = "kSample" + randomForestConfig.tunerConfig.tunerKFold = 1 + randomForestConfig.tunerConfig.tunerTrainPortion = 0.70 + randomForestConfig.tunerConfig.tunerFirstGenerationGenePool = 5 + randomForestConfig.tunerConfig.tunerNumberOfGenerations = 2 + randomForestConfig.tunerConfig.tunerNumberOfParentsToRetain = 2 + randomForestConfig.tunerConfig.tunerNumberOfMutationsPerGeneration = 2 + randomForestConfig.tunerConfig.tunerGeneticMixing = 0.8 + randomForestConfig.tunerConfig.tunerGenerationalMutationStrategy = "fixed" + randomForestConfig.tunerConfig.tunerEvolutionStrategy = "batch" + val featuresEngPipelineModel = FeatureEngineeringPipelineContext + .generatePipelineModel( + testVars.df, + ConfigurationGenerator.generateMainConfig(randomForestConfig) + ) + .pipelineModel + val pipelineModel = + PipelineTestUtils.saveAndLoadPipelineModel( + featuresEngPipelineModel, + testVars.df, + "full-feature-eng-pipeline" + ) + assert( + pipelineModel + .transform(testVars.df) + .count() == 99, + "Total row count shouldn't have changed" + ) + } + + // + ignore should "run train, save/load pipeline and predict" in { + val testVars = PipelineTestUtils.getTestVars() + val overrides = Map( + "labelCol" -> "label", + "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, + "oneHotEncodeFlag" -> true, + "numericBoundaries" -> Map( + "numTrees" -> Tuple2(50.0, 100.0), + "maxBins" -> Tuple2(10.0, 20.0), + "maxDepth" -> Tuple2(2.0, 5.0), + "minInfoGain" -> Tuple2(0.0, 0.03), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ), + "tunerParallelism" -> 10, + "outlierFilterFlag" -> false, + "outlierFilterPrecision" -> 0.05, + "outlierLowerFilterNTile" -> 0.05, + "outlierUpperFilterNTile" -> 0.95, + "tunerTrainSplitMethod" -> "kSample", + "tunerKFold" -> 1, + "tunerTrainPortion" -> 0.70, + "tunerFirstGenerationGenePool" 
-> 5, + "tunerNumberOfGenerations" -> 2, + "tunerNumberOfParentsToRetain" -> 1, + "tunerNumberOfMutationsPerGeneration" -> 1, + "tunerGeneticMixing" -> 0.8, + "tunerGenerationalMutationStrategy" -> "fixed", + "tunerEvolutionStrategy" -> "batch", + "pipelineDebugFlag" -> true + ) + val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) + val runner = + FamilyRunner(testVars.df, Array(randomForestConfig)).executeWithPipeline() + val pipelineModel = runner.bestPipelineModel("RandomForest") + val bcPipelineModel = + testVars.df.sparkSession.sparkContext.broadcast(pipelineModel) + + val predictDf = bcPipelineModel.value.transform(testVars.df.drop("label")) + assert( + predictDf.count() == testVars.df.count(), + "Inference df count should have matched the input dataset" + ) + assert( + testVars.df.columns + .filterNot("label".equals(_)) + .forall(item => predictDf.columns.contains(item)), + "All original columns must be present in the predict dataset" + ) + // Test write and load of full inference pipeline + val pipelineSavePath = AutomationUnitTestsUtil + .getProjectDir() + "/target/pipeline-tests/infer-final-pipeline" + pipelineModel.write.overwrite().save(pipelineSavePath) + PipelineModel + .load(pipelineSavePath) + .transform(testVars.df.drop("label")) + .show(100) + } + + ignore should "run train pipeline" in { + val overrides = Map( + "labelCol" -> "label", + "mlFlowLoggingFlag" -> false, + "scalingFlag" -> true, + "oneHotEncodeFlag" -> true, + "numericBoundaries" -> Map( + "numTrees" -> Tuple2(50.0, 100.0), + "maxBins" -> Tuple2(10.0, 20.0), + "maxDepth" -> Tuple2(2.0, 5.0), + "minInfoGain" -> Tuple2(0.0, 0.03), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ), + "tunerParallelism" -> 10, + "outlierFilterFlag" -> true, + "outlierFilterPrecision" -> 0.05, + "outlierLowerFilterNTile" -> 0.05, + "outlierUpperFilterNTile" -> 0.95, + "tunerTrainSplitMethod" -> "kSample", + "tunerKFold" -> 1, + "tunerTrainPortion" -> 
0.70, + "tunerFirstGenerationGenePool" -> 5, + "tunerNumberOfGenerations" -> 2, + "tunerNumberOfParentsToRetain" -> 1, + "tunerNumberOfMutationsPerGeneration" -> 1, + "tunerGeneticMixing" -> 0.8, + "tunerGenerationalMutationStrategy" -> "fixed", + "tunerEvolutionStrategy" -> "batch", + "pipelineDebugFlag" -> false + ) + + val adultDf = convertCsvToDf("/adult_data.csv") + var adultDfCleaned = adultDf + for (colName <- adultDf.columns) { + adultDfCleaned = adultDfCleaned + .withColumn( + colName.split("\\s+").mkString + "_trimmed", + trim(col(colName)) + ) + .drop(colName) + } + adultDfCleaned = adultDfCleaned.withColumnRenamed("class_trimmed", "label") + + val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) + val runner = FamilyRunner(adultDfCleaned, Array(randomForestConfig)) + .executeWithPipeline() + val predictDf = runner + .bestPipelineModel("RandomForest") + .transform(adultDfCleaned.drop("label")) + + // Test write and load of full inference pipeline + val pipelineSavePath = AutomationUnitTestsUtil + .getProjectDir() + "/target/pipeline-tests/infer-final-pipeline-lab" + runner + .bestPipelineModel("RandomForest") + .write + .overwrite() + .save(pipelineSavePath) + PipelineModel + .load(pipelineSavePath) + .transform(adultDfCleaned.drop("label")) + .show(100) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/FeatureInteractionPipelineTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureInteractionPipelineTest.scala new file mode 100644 index 00000000..c7773584 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/FeatureInteractionPipelineTest.scala @@ -0,0 +1,54 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} + +class 
FeatureInteractionPipelineTest extends AbstractUnitSpec { + + "It" should "return feature engineered df" in { + val testVars = PipelineTestUtils.getTestVars() + val overrides = Map( + "labelCol" -> "label", + "mlFlowLoggingFlag" -> false, + "featuresCol" -> "features", + "featureInteractionFlag" -> true, + "featureInteractionRetentionMode" -> "optimistic", + "featureInteractionContinuousDiscretizerBucketCount" -> 20, + "featureInteractionParallelism" -> 8, + "featureInteractionTargetInteractionPercentage" -> 25.0, + "scalingFlag" -> true, + "oneHotEncodeFlag" -> true, + "numericBoundaries" -> Map( + "numTrees" -> Tuple2(50.0, 100.0), + "maxBins" -> Tuple2(10.0, 20.0), + "maxDepth" -> Tuple2(2.0, 5.0), + "minInfoGain" -> Tuple2(0.0, 0.03), + "subSamplingRate" -> Tuple2(0.5, 1.0) + ), + "tunerParallelism" -> 10, + "outlierFilterFlag" -> false, + "outlierFilterPrecision" -> 0.05, + "outlierLowerFilterNTile" -> 0.05, + "outlierUpperFilterNTile" -> 0.95, + "tunerTrainSplitMethod" -> "random", + "tunerKFold" -> 1, + "tunerTrainPortion" -> 0.70, + "tunerFirstGenerationGenePool" -> 5, + "tunerNumberOfGenerations" -> 2, + "tunerNumberOfParentsToRetain" -> 1, + "tunerNumberOfMutationsPerGeneration" -> 1, + "tunerGeneticMixing" -> 0.8, + "tunerGenerationalMutationStrategy" -> "fixed", + "tunerEvolutionStrategy" -> "batch", + "pipelineDebugFlag" -> true, + "mlFlowLoggingFlag" -> false + ) + val randomForestConfig = ConfigurationGenerator + .generateConfigFromMap("RandomForest", "classifier", overrides) + val featEngPipe = FamilyRunner(testVars.df, Array(randomForestConfig)) + .generateFeatureEngineeredPipeline() + + featEngPipe("RandomForest").transform(testVars.df).show(100) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/MlFlowLoggingValidationStageTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/MlFlowLoggingValidationStageTransformerTest.scala new file mode 100644 index 00000000..c3d6d083 --- /dev/null +++ 
b/src/test/scala/com/databricks/labs/automl/pipeline/MlFlowLoggingValidationStageTransformerTest.scala @@ -0,0 +1,48 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.exceptions.MlFlowValidationException +import com.databricks.labs.automl.{ + AbstractUnitSpec, + AutomationUnitTestsUtil, + DiscreteTestDataGenerator +} + +class MlFlowLoggingValidationStageTransformerTest extends AbstractUnitSpec { + + "MlFlowLoggingValidationStageTransformerTest" should "not change the input dataset" in { + val spark = AutomationUnitTestsUtil.sparkSession + val mlFlowLoggingValidationStageTransformer = + new MlFlowLoggingValidationStageTransformer() + .setMlFlowLoggingFlag(false) + .setMlFlowExperimentName("test_name") + .setMlFlowTrackingURI("test_Uri") + .setMlFlowAPIToken("test_token") + val adultDf = + DiscreteTestDataGenerator.generateFeatureInteractionData(5000) + val adultDfMlFlowValidation = + mlFlowLoggingValidationStageTransformer.transform(adultDf) + assert( + adultDf.count() == adultDfMlFlowValidation.count(), + "MlFlowLoggingValidationStageTransformerTest should not have changed input dataset rows" + ) + assert( + adultDf.columns.length == adultDfMlFlowValidation.columns.length, + "MlFlowLoggingValidationStageTransformerTest should not have changed number of columns" + ) + assert( + adultDf.columns.sameElements(adultDfMlFlowValidation.columns), + "MlFlowLoggingValidationStageTransformerTest should not have changed columns" + ) + } + + it should "throw MlFlowValidationException" in { + a[MlFlowValidationException] should be thrownBy { + new MlFlowLoggingValidationStageTransformer() + .setMlFlowLoggingFlag(true) + .setMlFlowExperimentName("test_name") + .setMlFlowTrackingURI("test_Uri") + .setMlFlowAPIToken("test_token") + .transform(AutomationUnitTestsUtil.getAdultDf()) + } + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/OutlierFilterTransformerTest.scala 
b/src/test/scala/com/databricks/labs/automl/pipeline/OutlierFilterTransformerTest.scala new file mode 100644 index 00000000..c0d206ca --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/OutlierFilterTransformerTest.scala @@ -0,0 +1,53 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.ml.PipelineStage + +import scala.collection.mutable.ArrayBuffer + +class OutlierFilterTransformerTest extends AbstractUnitSpec { + + it should "correctly apply outlier filtering" in { + val testVars = PipelineTestUtils.getTestVars() + val stages = new ArrayBuffer[PipelineStage] + val nonFeatureCols = + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, testVars.labelCol) + stages += PipelineTestUtils + .addZipRegisterTmpTransformerStage( + testVars.labelCol, + testVars.df.columns.filterNot(item => nonFeatureCols.contains(item)) + ) + stages ++= PipelineTestUtils.buildFeaturesPipelineStages( + testVars.df, + testVars.labelCol, + dropColumns = false + ) + stages += new DropColumnsTransformer() + .setInputCols(Array(testVars.featuresCol)) + stages += new OutlierFilterTransformer() + .setLabelColumn(testVars.labelCol) + .setFilterBounds("both") + .setLowerFilterNTile(0.1) + .setUpperFilterNTile(0.4) + .setFilterPrecision(0.01) + .setContinuousDataThreshold(50) + .setParallelism(10) + .setFieldsToIgnore(Array.empty) + .setDebugEnabled(false) + + val outlierDf = PipelineTestUtils + .saveAndLoadPipeline( + stages.toArray, + testVars.df, + "outlier-filter-pipeline" + ) + .transform(testVars.df) + outlierDf.show() + assert( + outlierDf.count() == 31, + "OutlierFilterTransformer should have filtered rows, check outlier filter settings" + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/PearsonFilterTransformerTest.scala 
b/src/test/scala/com/databricks/labs/automl/pipeline/PearsonFilterTransformerTest.scala new file mode 100644 index 00000000..6d0fc4ef --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/PearsonFilterTransformerTest.scala @@ -0,0 +1,57 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.ml.PipelineStage + +import scala.collection.mutable.ArrayBuffer + +class PearsonFilterTransformerTest extends AbstractUnitSpec { + + "PearsonFilterTransformerTest" should "correctly apply pearson filter" in { + val testVars = PipelineTestUtils.getTestVars() + val stages = new ArrayBuffer[PipelineStage]() + val nonFeatureCols = + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL, testVars.labelCol) + stages += PipelineTestUtils + .addZipRegisterTmpTransformerStage( + testVars.labelCol, + testVars.df.columns.filterNot(item => nonFeatureCols.contains(item)) + ) + val vectFeatures = PipelineTestUtils + .getVectorizedFeatures( + testVars.df, + testVars.labelCol, + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + stages ++= PipelineTestUtils + .buildFeaturesPipelineStages( + testVars.df, + testVars.labelCol, + dropColumns = false, + ignoreCols = Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + stages += new PearsonFilterTransformer() + .setLabelColumn(testVars.labelCol) + .setFeatureCol(testVars.featuresCol) + .setFeatureColumns(vectFeatures) + .setAutoFilterNTile(0.75) + .setFilterDirection("greater") + .setFilterManualValue(0) + .setFilterMode("auto") + .setFilterStatistic("pearsonStat") + .setModelType("classifier") + val pearsonDf = PipelineTestUtils + .saveAndLoadPipeline( + stages.toArray, + testVars.df, + "pearson-filter-pipeline" + ) + .transform(testVars.df) + + assert( + pearsonDf.columns.length == 8, + "PearsonFilterTransformer should have retained only 8 columns" + ) + 
} +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/RepartitionTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/RepartitionTransformerTest.scala new file mode 100644 index 00000000..874b791c --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/RepartitionTransformerTest.scala @@ -0,0 +1,17 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.sql.functions._ + +class RepartitionTransformerTest extends AbstractUnitSpec { + + "RepartitionTransformer" should "return correct repartitioned dataset" in { + val testVars = PipelineTestUtils.getTestVars() + val inputDf = testVars.df.withColumn("automl_internal_id", monotonically_increasing_id()) + val repartitionTransformer = new RepartitionTransformer() + .setPartitionScaleFactor(2) + val transformedDf = repartitionTransformer.transform(inputDf) + assert(transformedDf.rdd.getNumPartitions == inputDf.rdd.getNumPartitions * 2, + "DataFrame wasn't repartitioned as expected") + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/VarianceFilterTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/VarianceFilterTransformerTest.scala new file mode 100644 index 00000000..70b4e06f --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/VarianceFilterTransformerTest.scala @@ -0,0 +1,27 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} +import org.apache.spark.ml.PipelineStage + +import scala.collection.mutable.ArrayBuffer + +class VarianceFilterTransformerTest extends AbstractUnitSpec { + + "VarianceFilterTransformerTest" should "apply the filter correctly" in { + val testVars = PipelineTestUtils.getTestVars() + val stages = new ArrayBuffer[PipelineStage] + stages += PipelineTestUtils + 
.addZipRegisterTmpTransformerStage( + testVars.labelCol, + testVars.df.columns.filterNot(item => AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL.equals(item)) + ) + stages ++= PipelineTestUtils.buildFeaturesPipelineStages(testVars.df, testVars.labelCol, dropColumns = false) + stages += new VarianceFilterTransformer() + .setLabelColumn(testVars.labelCol) + .setFeatureCol(testVars.featuresCol) + PipelineTestUtils + .saveAndLoadPipeline(stages.toArray, testVars.df, "variance-filter-pipeline") + .transform(testVars.df).show(10) + } +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/XgboostLoanRiskTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/XgboostLoanRiskTest.scala new file mode 100644 index 00000000..21a9adf3 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/XgboostLoanRiskTest.scala @@ -0,0 +1,50 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.executor.FamilyRunner +import com.databricks.labs.automl.executor.config.ConfigurationGenerator +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil} + +class XgboostLoanRiskTest extends AbstractUnitSpec { + + ignore should "run successfully" in { + val loanRiskDf = AutomationUnitTestsUtil.convertCsvToDf("/loan_risk.csv") + val genericMapOverrides = Map( + "labelCol" -> "label", + "scoringMetric" -> "areaUnderROC", + "oneHotEncodeFlag" -> true, + "autoStoppingFlag" -> true, + "tunerAutoStoppingScore" -> 0.91, + "tunerParallelism" -> 1 * 2, + "tunerKFold" -> 2, + "tunerTrainPortion" -> 0.7, + "tunerTrainSplitMethod" -> "stratified", + "tunerInitialGenerationMode" -> "permutations", + "tunerInitialGenerationPermutationCount" -> 8, + "tunerInitialGenerationIndexMixingMode" -> "linear", + "tunerInitialGenerationArraySeed" -> 42L, + "tunerFirstGenerationGenePool" -> 16, + "tunerNumberOfGenerations" -> 3, + "tunerNumberOfParentsToRetain" -> 2, + "tunerNumberOfMutationsPerGeneration" -> 4, + "tunerGeneticMixing" -> 
0.8, + "tunerGenerationalMutationStrategy" -> "fixed", + "tunerEvolutionStrategy" -> "batch", + "tunerHyperSpaceInferenceFlag" -> true, + "tunerHyperSpaceInferenceCount" -> 20000, + "tunerHyperSpaceModelType" -> "XGBoost", + "tunerHyperSpaceModelCount" -> 8, + "mlFlowLoggingFlag" -> false, + "mlFlowLogArtifactsFlag" -> false, + "pipelineDebugFlag" -> true + ) + val xgBoostConfig = ConfigurationGenerator.generateConfigFromMap( + "XGBoost", + "classifier", + genericMapOverrides + ) + val familyRunner = + FamilyRunner(loanRiskDf, Array(xgBoostConfig)).executeWithPipeline() + familyRunner.bestPipelineModel(("XGBoost")).transform(loanRiskDf).show(10) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/pipeline/ZipRegisterTempTransformerTest.scala b/src/test/scala/com/databricks/labs/automl/pipeline/ZipRegisterTempTransformerTest.scala new file mode 100644 index 00000000..1e09b79c --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/pipeline/ZipRegisterTempTransformerTest.scala @@ -0,0 +1,31 @@ +package com.databricks.labs.automl.pipeline + +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, PipelineTestUtils} + +class ZipRegisterTempTransformerTest extends AbstractUnitSpec { + + ignore should "add Id column, retain feature columns and create temp view of original dataset" in { + val testVars = PipelineTestUtils.getTestVars() + val featureColumns = + Array("age_trimmed", "workclass_trimmed", "fnlwgt_trimmed", "label") + val zipRegisterTempTransformer = new ZipRegisterTempTransformer() + .setFeatureColumns(featureColumns) + .setLabelColumn(testVars.labelCol) + .setTempViewOriginalDatasetName("zipRegisterTempTransformer") + .setDebugEnabled(true) + +// LogManager.getRootLogger.setLevel(Level.DEBUG) + val transformedAdultDf = zipRegisterTempTransformer.transform(testVars.df) + assert( + transformedAdultDf.count() == 99, + "transformed table rows shouldn't have changed" + ) + assert( + 
transformedAdultDf.columns + .contains(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + "Id column should have been generated" + ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/DataSanitizerTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/DataSanitizerTest.scala new file mode 100644 index 00000000..8fcd883f --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/DataSanitizerTest.scala @@ -0,0 +1,25 @@ +package com.databricks.labs.automl.sanitize + +import com.databricks.labs.automl.{AbstractUnitSpec, AutomationUnitTestsUtil} +import org.apache.spark.sql.DataFrame + +class DataSanitizerTest extends AbstractUnitSpec { + + "DataSanitizerTest" should "throw NullPointerException for null input dataset" in { + a [NullPointerException] should be thrownBy { + new DataSanitizer(null).decideModel() + } + } + + it should "return clean data for valid input dataset" in { + val adultDataset: DataFrame = AutomationUnitTestsUtil.getAdultDf() + .withColumnRenamed("class","label") + val cleanedData = new DataSanitizer(adultDataset).generateCleanData() + assert(cleanedData != null, "clean data object should not be null") + assert(cleanedData._1 != null, "cleaned dataset should not be null") + assert(cleanedData._2 != null, "NaFillConfig should not be null") + assert(cleanedData._3 != null, "Model decided cannot be null") + assert(cleanedData._3 equals "classifier", "Model decided should be classifier") + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/CardinalityCheckTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/CardinalityCheckTest.scala new file mode 100644 index 00000000..701ce074 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/CardinalityCheckTest.scala @@ -0,0 +1,85 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.utils.data.CategoricalHandler +import 
com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import org.apache.spark.sql.DataFrame + +class CardinalityCheckTest extends AbstractUnitSpec { + + def dataGeneration(rows: Int): (DataFrame, Array[String]) = { + DiscreteTestDataGenerator.generateCardinalityFilteringData(rows) + } + + def cardinalityCalculation(rows: Int, + mode: String, + limit: Int): Array[String] = { + + val (data, fields) = dataGeneration(rows) + + new CategoricalHandler(data, mode) + .setCardinalityType("exact") + .setPrecision(0.0) + .validateCategoricalFields(fields.toList, limit) + + } + + it should "silently filter out a single column" in { + + val ROWS = 1000 + val MODE = "silent" + val LIMIT = 50 + val EXPECTED_RETAINED_FIELDS = Array("b", "c") + + val retainFields = cardinalityCalculation(ROWS, MODE, LIMIT) + + assert( + EXPECTED_RETAINED_FIELDS.forall(retainFields.contains), + "Filtered correct elements" + ) + } + + it should "silently filter out all columns" in { + val ROWS = 1000 + val MODE = "silent" + val LIMIT = 2 + val EXPECTED_RETAINED_FIELDS = Array.empty[String] + + val retainFields = cardinalityCalculation(ROWS, MODE, LIMIT) + + assert( + EXPECTED_RETAINED_FIELDS.forall(retainFields.contains), + "Filtered out correct elements" + ) + + } + + it should "throw an assertion error in warn mode" in { + + val ROWS = 1000 + val MODE = "warn" + val LIMIT = 20 + val EXPECTED_RETAINED_FIELDS = Array("c") + + intercept[AssertionError] { + cardinalityCalculation(ROWS, MODE, LIMIT) + } + + } + + it should "filter nothing" in { + + val ROWS = 1000 + val MODE = "silent" + val LIMIT = 500 + val EXPECTED_RETAINED_FIELDS = Array("b", "c", "d") + + val retainFields = cardinalityCalculation(ROWS, MODE, LIMIT) + + assert( + EXPECTED_RETAINED_FIELDS.forall(retainFields.contains), + "Filtered out correct elements" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/FeatureCorrelationTest.scala 
b/src/test/scala/com/databricks/labs/automl/sanitize/components/FeatureCorrelationTest.scala new file mode 100644 index 00000000..c4a6ce58 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/FeatureCorrelationTest.scala @@ -0,0 +1,92 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.exceptions.FeatureCorrelationException +import com.databricks.labs.automl.sanitize.FeatureCorrelationDetection +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} + +class FeatureCorrelationTest extends AbstractUnitSpec { + + private final val LABEL_COL = "label" + private final val PARALLELISM = 4 + + final val (data, features) = + DiscreteTestDataGenerator.generateFeatureCorrelationData(1000) + + private def generateFeatureCorrelation( + high: Double, + low: Double + ): FeatureCorrelationDetection = { + + new FeatureCorrelationDetection(data, features) + .setLabelCol(LABEL_COL) + .setParallelism(PARALLELISM) + .setCorrelationCutoffLow(low) + .setCorrelationCutoffHigh(high) + + } + + it should "execute appropriate filtering with max and min values" in { + + val INTENDED_FIELDS = + Array("a1", "c2", "d1", "d2", "label", "automl_internal_id") + + val filteredData = + generateFeatureCorrelation(1.0, -1.0).filterFeatureCorrelation() + + val schemaNames = filteredData.schema.names + + assert( + INTENDED_FIELDS.forall(schemaNames.contains), + "appropriate fields have been filtered" + ) + + } + + it should "throw an exception if all fields have been filtered out" in { + + intercept[RuntimeException] { + generateFeatureCorrelation(-0.1, 0.1).filterFeatureCorrelation() + } + } + + it should "throw an exception for improper configuration" in { + + intercept[FeatureCorrelationException] { + generateFeatureCorrelation(-10, 0.0).filterFeatureCorrelation() + } + + } + + it should "filter appropriate fields with removing positive correlation" in { + + val INTENDED_FIELDS = + Array("a1", "c2", 
"d1", "d2", "label", "automl_internal_id") + + val filteredData = + generateFeatureCorrelation(0.1, -1.0).filterFeatureCorrelation() + + val schemaNames = filteredData.schema.names + assert( + INTENDED_FIELDS.forall(schemaNames.contains), + "appropriate fields filtered" + ) + + } + + it should "filter appropriate number of fields with removing negative correlation" in { + + val expectedRemainingFieldCount = 6 + + val filteredData = + generateFeatureCorrelation(1.0, -0.1).filterFeatureCorrelation() + + val schemaNames = filteredData.schema.names + + assert( + schemaNames.length == expectedRemainingFieldCount, + "appropriate fields filtered" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/ModelDetectionTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/ModelDetectionTest.scala new file mode 100644 index 00000000..756444a7 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/ModelDetectionTest.scala @@ -0,0 +1,161 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.sanitize.DataSanitizer +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} + +class ModelDetectionTest extends AbstractUnitSpec { + + private final val CLASSIFICATION_DISTINCT_COUNT = 3 + private final val REGRESSION_DISTINCT_COUNT = 11 + private final val ROW_COUNT = 500 + private final val LABEL_COL = "label" + private final val FEATURE_COL = "features" + private final val DISTINCT_THRESHOLD = 10 + private final val PARALLELISM = 4 + private final val FILTER_PRECISION = 0.01 + private final val STRING_FILL_MAP = Map.empty[String, String] + private final val NUMERIC_FILL_MAP = Map.empty[String, AnyVal] + private final val CHAR_FILL_VALUE = "hodor" + private final val NUM_FILL_VALUE = 42.0 + private final val NUM_FILL_STAT = "mean" + private final val CHAR_FILL_STAT = "max" + 
private final val FILL_MODE = "auto" + private final val REGRESSION_NAME = "regressor" + private final val CLASSIFICATION_NAME = "classifier" + + it should "correctly identify a classification problem in auto mode" in { + + val classificationData = + DiscreteTestDataGenerator.generateModelDetectionData( + ROW_COUNT, + CLASSIFICATION_DISTINCT_COUNT + ) + + val sanitizer = new DataSanitizer(classificationData) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURE_COL) + .setNumericFillStat(NUM_FILL_STAT) + .setCharacterFillStat(CHAR_FILL_STAT) + .setParallelism(PARALLELISM) + .setCategoricalNAFillMap(STRING_FILL_MAP) + .setCharacterNABlanketFillValue(CHAR_FILL_VALUE) + .setNumericNABlanketFillValue(NUM_FILL_VALUE) + .setNumericNAFillMap(NUMERIC_FILL_MAP) + .setNAFillMode(FILL_MODE) + .setFilterPrecision(FILTER_PRECISION) + .setFieldsToIgnoreInVector( + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + + val (data, fillMap, modelDecision) = sanitizer.generateCleanData() + + assert( + modelDecision == CLASSIFICATION_NAME, + "detect classification setting correctly" + ) + + } + + it should "correctly identify a regression problem in auto mode" in { + + val classificationData = + DiscreteTestDataGenerator.generateModelDetectionData( + ROW_COUNT, + REGRESSION_DISTINCT_COUNT + ) + + val sanitizer = new DataSanitizer(classificationData) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURE_COL) + .setNumericFillStat(NUM_FILL_STAT) + .setCharacterFillStat(CHAR_FILL_STAT) + .setParallelism(PARALLELISM) + .setCategoricalNAFillMap(STRING_FILL_MAP) + .setCharacterNABlanketFillValue(CHAR_FILL_VALUE) + .setNumericNABlanketFillValue(NUM_FILL_VALUE) + .setNumericNAFillMap(NUMERIC_FILL_MAP) + .setNAFillMode(FILL_MODE) + .setFilterPrecision(FILTER_PRECISION) + .setFieldsToIgnoreInVector( + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + + val (data, fillMap, modelDecision) = sanitizer.generateCleanData() + + assert( + modelDecision == REGRESSION_NAME, + "detect 
regression setting correctly" + ) + + } + + it should "correctly identify a classification problem with overrides" in { + + val classificationData = + DiscreteTestDataGenerator.generateModelDetectionData( + ROW_COUNT, + CLASSIFICATION_DISTINCT_COUNT + ) + + val sanitizer = new DataSanitizer(classificationData) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURE_COL) + .setModelSelectionDistinctThreshold(DISTINCT_THRESHOLD) + .setNumericFillStat(NUM_FILL_STAT) + .setCharacterFillStat(CHAR_FILL_STAT) + .setParallelism(PARALLELISM) + .setCategoricalNAFillMap(STRING_FILL_MAP) + .setCharacterNABlanketFillValue(CHAR_FILL_VALUE) + .setNumericNABlanketFillValue(NUM_FILL_VALUE) + .setNumericNAFillMap(NUMERIC_FILL_MAP) + .setNAFillMode(FILL_MODE) + .setFilterPrecision(FILTER_PRECISION) + .setFieldsToIgnoreInVector( + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + + val (data, fillMap, modelDecision) = sanitizer.generateCleanData() + + assert( + modelDecision == CLASSIFICATION_NAME, + "detect classification setting correctly" + ) + + } + + it should "correctly identify a regression problem with overrides" in { + + val classificationData = + DiscreteTestDataGenerator.generateModelDetectionData( + ROW_COUNT, + REGRESSION_DISTINCT_COUNT + ) + + val sanitizer = new DataSanitizer(classificationData) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURE_COL) + .setModelSelectionDistinctThreshold(DISTINCT_THRESHOLD) + .setNumericFillStat(NUM_FILL_STAT) + .setCharacterFillStat(CHAR_FILL_STAT) + .setParallelism(PARALLELISM) + .setCategoricalNAFillMap(STRING_FILL_MAP) + .setCharacterNABlanketFillValue(CHAR_FILL_VALUE) + .setNumericNABlanketFillValue(NUM_FILL_VALUE) + .setNumericNAFillMap(NUMERIC_FILL_MAP) + .setNAFillMode(FILL_MODE) + .setFilterPrecision(FILTER_PRECISION) + .setFieldsToIgnoreInVector( + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + + val (data, fillMap, modelDecision) = sanitizer.generateCleanData() + + assert( + modelDecision == 
REGRESSION_NAME, + "detect regression setting correctly" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/NAFillTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/NAFillTest.scala new file mode 100644 index 00000000..376c5f7d --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/NAFillTest.scala @@ -0,0 +1,339 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.inference.NaFillConfig +import com.databricks.labs.automl.sanitize.DataSanitizer +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types._ + +class NAFillTest extends AbstractUnitSpec { + + private final val LABEL_COL = "label" + private final val FEATURE_COL = "features" + private final val DISTINCT_THRESHOLD = 10 + private final val PARALLELISM = 4 + private final val FILTER_PRECISION = 0.01 + private final val EXPECTED_MODEL_TYPE = "classifier" + private final val NA_FILL_ROW_COUNT = 100 + private final val NA_RATE = 10 + + private final val STRING_FILL_MAP = Map("strData" -> "zzzz") + private final val NUMERIC_FILL_MAP = Map( + "dblData" -> 99999.99, + "fltData" -> 99999.9f, + "intData" -> 9999, + "ordinalIntData" -> 9999 + ) + private final val CHAR_FILL_VALUE = "hodor" + private final val NUM_FILL_VALUE = -999.0 + + def setupNAFillTest(numFillStat: String, + catFillStat: String, + fillMode: String): DataSanitizer = { + + val data = + DiscreteTestDataGenerator.generateNAFillData(NA_FILL_ROW_COUNT, NA_RATE) + + new DataSanitizer(data) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURE_COL) + .setModelSelectionDistinctThreshold(DISTINCT_THRESHOLD) + .setNumericFillStat(numFillStat) + .setCharacterFillStat(catFillStat) + .setParallelism(PARALLELISM) + 
.setCategoricalNAFillMap(STRING_FILL_MAP) + .setCharacterNABlanketFillValue(CHAR_FILL_VALUE) + .setNumericNABlanketFillValue(NUM_FILL_VALUE) + .setNumericNAFillMap(NUMERIC_FILL_MAP) + .setNAFillMode(fillMode) + .setFilterPrecision(FILTER_PRECISION) + .setFieldsToIgnoreInVector( + Array(AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + + } + + def setupMapFillModes(mode: String): (DataFrame, NaFillConfig, String) = { + + val sanitizer = setupNAFillTest("mean", "max", mode) + + sanitizer.generateCleanData() + + } + + def checkForNulls(df: DataFrame, columnName: String): Unit = { + + assert( + df.na.drop(Seq(columnName)).count() == df.count(), + "na values have been filled" + ) + + } + + def generateNAFillConfigTest( + numFillStat: String, + catFillStat: String, + mode: String, + expectedContinuousFillMap: Map[String, Double], + expectedCategoricalFillMap: Map[String, String], + expectedBooleanFillMap: Map[String, Boolean] + ): Unit = { + + val expectedPreFillSchema = StructType( + Seq( + StructField("dblData", DoubleType, nullable = true), + StructField("fltData", FloatType, nullable = true), + StructField("intData", IntegerType, nullable = true), + StructField("ordinalIntData", IntegerType, nullable = true), + StructField("strData", StringType, nullable = true), + StructField("boolData", BooleanType, nullable = false), + StructField("dateData", DateType, nullable = true), + StructField("label", IntegerType, nullable = true), + StructField("automl_internal_id", LongType, nullable = true) + ) + ) + + val expectedPostFillSchema = StructType( + Seq( + StructField("dblData", DoubleType, nullable = false), + StructField("fltData", FloatType, nullable = false), + StructField("intData", IntegerType, nullable = true), + StructField("ordinalIntData", IntegerType, nullable = true), + StructField("strData", StringType, nullable = false), + StructField("boolData", BooleanType, nullable = false), + StructField("dateData", DateType, nullable = true), + StructField("label", 
IntegerType, nullable = true), + StructField("automl_internal_id", LongType, nullable = true) + ) + ) + + val data = + DiscreteTestDataGenerator.generateNAFillData(NA_FILL_ROW_COUNT, NA_RATE) + + val sanitizer = setupNAFillTest(numFillStat, catFillStat, mode) + + val (naFilledDF, fillMap, modelType) = sanitizer.generateCleanData() + + // Validate the incoming schema is correct for the test + assert( + data.schema == expectedPreFillSchema, + "for pre naFill schema validation" + ) + + // Validate the post-fill schema + assert( + naFilledDF.schema == expectedPostFillSchema, + "for post naFill schema validation" + ) + + // Make sure the fill for continuous data is correct + assert( + fillMap.numericColumns == expectedContinuousFillMap, + "for numeric fill na values" + ) + + // Make sure the fill for categorical data is correct + assert( + fillMap.categoricalColumns == expectedCategoricalFillMap, + "for categorical fill na values" + ) + // Make sure that Boolean types have been filled. + fillMap.booleanColumns.keys.toArray + .map(x => naFilledDF.select(x).na.drop().count()) + .foreach { x => + assert(x == NA_FILL_ROW_COUNT, "for boolean fill characteristics") + } + + // Ensure the model type is correct + assert(modelType == EXPECTED_MODEL_TYPE, "for model type detection") + + val fullColumnCheck = fillMap.categoricalColumns.keys.toSeq ++ fillMap.numericColumns.keys.toSeq ++ fillMap.booleanColumns.keys.toSeq + fullColumnCheck.foreach { x => + checkForNulls(naFilledDF, x) + } + + } + + it should "correctly fill na's in auto mode, numeric mean, categorical max" in { + + val NUM_FILL_STAT = "mean" + val CAT_FILL_STAT = "max" + + val expectedContinuousFillMap = Map( + "dblData" -> 101.0, + "fltData" -> 102.0, + "intData" -> 99.22222222222223, + "ordinalIntData" -> 15.311111111111112 + ) + + val expectedCategoricalFillMap = Map("strData" -> "e") + + val expectedBooleanFillMap = Map("boolData" -> false) + + generateNAFillConfigTest( + "mean", + "max", + "auto", + 
expectedContinuousFillMap, + expectedCategoricalFillMap, + expectedBooleanFillMap + ) + + } + + it should "correctly fill na's in auto mode, numeric max, categorical min" in { + + val expectedContinuousFillMap = Map( + "dblData" -> 199.0, + "fltData" -> 200.0, + "intData" -> 199.0, + "ordinalIntData" -> 25.0 + ) + + val expectedCategoricalFillMap = Map("strData" -> "a") + + val expectedBooleanFillMap = Map("boolData" -> true) + + generateNAFillConfigTest( + "max", + "min", + "auto", + expectedContinuousFillMap, + expectedCategoricalFillMap, + expectedBooleanFillMap + ) + + } + + it should "correctly fill na's in auto mode, numeric median, categorical min" in { + + val expectedContinuousFillMap = Map( + "dblData" -> 99.0, + "fltData" -> 100.0, + "intData" -> 99.0, + "ordinalIntData" -> 17.0 + ) + + val expectedCategoricalFillMap = Map("strData" -> "a") + + val expectedBooleanFillMap = Map("boolData" -> true) + + generateNAFillConfigTest( + "median", + "min", + "auto", + expectedContinuousFillMap, + expectedCategoricalFillMap, + expectedBooleanFillMap + ) + + } + + it should "correctly fill na's in mapFill mode" in { + + val FILL_MODE = "mapFill" + + val (naFilledDF, fillMap, modelType) = setupMapFillModes(FILL_MODE) + + NUMERIC_FILL_MAP.keys.foreach { x => + assert( + naFilledDF.filter(col(x) === NUMERIC_FILL_MAP(x)).count() > 0, + "for numeric fill columns" + ) + + } + + STRING_FILL_MAP.keys.foreach { x => + assert( + naFilledDF.filter(col(x) === STRING_FILL_MAP(x)).count() > 0, + "for categorical fill columns" + ) + } + + val fullColumnCheck = fillMap.categoricalColumns.keys.toSeq ++ fillMap.numericColumns.keys.toSeq ++ fillMap.booleanColumns.keys.toSeq + + fullColumnCheck.foreach { x => + checkForNulls(naFilledDF, x) + } + + } + + it should "correctly fill na's in blanketFillAll mode" in { + val FILL_MODE = "blanketFillAll" + + val (naFilledDF, fillMap, modelType) = setupMapFillModes(FILL_MODE) + + NUMERIC_FILL_MAP.keys.foreach { x => + assert( + 
naFilledDF.filter(col(x) === NUM_FILL_VALUE).count() > 0, + "for numeric map fill" + ) + } + + STRING_FILL_MAP.keys.foreach { x => + assert( + naFilledDF.filter(col(x) === CHAR_FILL_VALUE).count() > 0, + "for categorical map fill" + ) + } + + val fullColumnCheck = fillMap.categoricalColumns.keys.toSeq ++ fillMap.numericColumns.keys.toSeq ++ fillMap.booleanColumns.keys.toSeq + fullColumnCheck.foreach { x => + checkForNulls(naFilledDF, x) + } + } + + it should "correctly fill na's in blanketFillCharOnly mode" in { + val FILL_MODE = "blanketFillCharOnly" + val expectedContinuousFillMap = Map( + "dblData" -> 101.0, + "fltData" -> 102.0, + "intData" -> 99.22222222222223, + "ordinalIntData" -> 15.311111111111112 + ) + + val (naFilledDF, fillMap, modelType) = setupMapFillModes(FILL_MODE) + + assert( + fillMap.numericColumns == expectedContinuousFillMap, + "for numeric fill na values" + ) + STRING_FILL_MAP.keys.foreach { x => + assert( + naFilledDF.filter(col(x) === CHAR_FILL_VALUE).count() > 0, + "for categorical map fill" + ) + } + + val fullColumnCheck = fillMap.categoricalColumns.keys.toSeq ++ fillMap.numericColumns.keys.toSeq ++ fillMap.booleanColumns.keys.toSeq + fullColumnCheck.foreach { x => + checkForNulls(naFilledDF, x) + } + + } + it should "correctly fill na's in blanketFillNumOnly" in { + val FILL_MODE = "blanketFillNumOnly" + val expectedCategoricalFillMap = Map("strData" -> "e") + val (naFilledDF, fillMap, modelType) = setupMapFillModes(FILL_MODE) + + assert( + fillMap.categoricalColumns == expectedCategoricalFillMap, + "for categorical fill na values" + ) + NUMERIC_FILL_MAP.keys.foreach { x => + assert( + naFilledDF.filter(col(x) === NUM_FILL_VALUE).count() > 0, + "for numeric map fill" + ) + } + + val fullColumnCheck = fillMap.categoricalColumns.keys.toSeq ++ fillMap.numericColumns.keys.toSeq ++ fillMap.booleanColumns.keys.toSeq + fullColumnCheck.foreach { x => + checkForNulls(naFilledDF, x) + } + + } + +} diff --git 
a/src/test/scala/com/databricks/labs/automl/sanitize/components/OutlierFilteringTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/OutlierFilteringTest.scala new file mode 100644 index 00000000..5373f4d0 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/OutlierFilteringTest.scala @@ -0,0 +1,245 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import com.databricks.labs.automl.params.ManualFilters +import com.databricks.labs.automl.sanitize.OutlierFiltering +import com.databricks.labs.automl.utils.AutoMlPipelineMlFlowUtils + +class OutlierFilteringTest extends AbstractUnitSpec { + + final private val OUTLIER_ROW_COUNT = 500 + final private val OUTLIER_LABEL_DISTINCT_COUNT = 5 + final private val EXPECTED_FILTER_COUNT_UPPER = 30 + final private val EXPECTED_FILTER_COUNT_LOWER = 20 + final private val EXPECTED_FILTER_COUNT_BOTH = 50 + final private val EXPECTED_FILTER_EXCLUSION_MODE = 49 + final private val EXPECTED_FILTER_COUNT_MANUAL = 244 + final private val EXPECTED_PRESERVE_COUNT_UPPER = 470 + final private val EXPECTED_PRESERVE_COUNT_LOWER = 480 + final private val EXPECTED_PRESERVE_COUNT_BOTH = 450 + final private val EXPECTED_PRESERVE_EXCLUSION_MODE = 451 + final private val EXPECTED_PRESERVE_COUNT_MANUAL = 256 + final private val UPPER_FILTER_COL_A_VALUE = 8.30584E8 + final private val LOWER_FILTER_COL_A_VALUE = 0.0 + final private val BOTH_FILTER_COL_A_VALUE = Array(8.30584E8, 8.35896888E8, + 8.41232384E8, 8.46590536E8, 8.51971392E8, 8.57375E8, 8.62801408E8, + 8.68250664E8, 8.73722816E8, 8.79217912E8, 8.84736E8, 8.90277128E8, + 8.95841344E8, 9.01428696E8, 9.07039232E8, 9.12673E8, 9.18330048E8, + 9.24010424E8, 9.29714176E8, 9.35441352E8, 9.41192E8, 9.46966168E8, + 9.52763904E8, 9.58585256E8, 9.64430272E8, 9.70299E8, 9.76191488E8, + 9.82107784E8, 9.88047936E8, 9.94011992E8, 0.0, 8.0, 64.0, 216.0, 512.0, + 1000.0, 
1728.0, 2744.0, 4096.0, 5832.0, 8000.0, 10648.0, 13824.0, 17576.0, + 21952.0, 27000.0, 32768.0, 39304.0, 46656.0, 54872.0) + final private val BOTH_FILTER_COL_B_VALUE = Array(499.0, 498.0, 497.0, 496.0, + 495.0, 494.0, 493.0, 492.0, 491.0, 490.0, 489.0, 488.0, 487.0, 486.0, 485.0, + 484.0, 483.0, 482.0, 481.0, 480.0, 479.0, 478.0, 477.0, 476.0, 475.0, 474.0, + 473.0, 472.0, 471.0, 19.0, 18.0, 17.0, 16.0, 15.0, 14.0, 13.0, 12.0, 11.0, + 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0) + final private val UPPER_FILTER_COL_C_MANUAL_VALUE = 961.0 + final private val LABEL_COL = "label" + final private val FILTER_COL = "a" + final private val EXCLUSION_FIELD = "b" + final private val MANUAL_FIELD = "c" + final private val EXCLUSION_COLS = Array("a", "c") + final private val MANUAL_FILTERS = List(ManualFilters(MANUAL_FIELD, 900.0)) + + it should "filter appropriate values from an exponential distribution in 'upper' mode at 95p" in { + + val exponentialData = DiscreteTestDataGenerator.generateOutlierData( + OUTLIER_ROW_COUNT, + OUTLIER_LABEL_DISTINCT_COUNT + ) + + val outlierTransformer = new OutlierFiltering(exponentialData) + .setLabelCol(LABEL_COL) + .setFilterBounds("upper") + .setUpperFilterNTile(0.95) + .setLowerFilterNTile(0.4) + .setFilterPrecision(0.01) + .setContinuousDataThreshold(100) + .setParallelism(1) + + val filteredHigh = outlierTransformer.filterContinuousOutliers( + Array(LABEL_COL, AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + Array.empty[String] + ) + + val filterCount = filteredHigh._2.count() + val nonFilterCount = filteredHigh._1.count() + val filteredColA = + filteredHigh._2.collect().map(_.getAs[Double](FILTER_COL)).head + + assert( + nonFilterCount == EXPECTED_PRESERVE_COUNT_UPPER, + s"rows of non-filtered outlier data in the 95p upper mode." + ) + assert( + filterCount == EXPECTED_FILTER_COUNT_UPPER, + s"rows of outlier filtered data in the 95p upper mode." 
+ ) + assert( + filteredColA == UPPER_FILTER_COL_A_VALUE, + s"for the correct value of col $FILTER_COL row to be filtered out in the 95p upper mode." + ) + } + + it should "filter appropriate values from an exponential distribution in 'lower' mode at 5p" in { + + val exponentialData = DiscreteTestDataGenerator.generateOutlierData( + OUTLIER_ROW_COUNT, + OUTLIER_LABEL_DISTINCT_COUNT + ) + + val outlierTransformer = new OutlierFiltering(exponentialData) + .setLabelCol("label") + .setFilterBounds("lower") + .setUpperFilterNTile(0.95) + .setLowerFilterNTile(0.05) + .setFilterPrecision(0.01) + .setContinuousDataThreshold(100) + .setParallelism(1) + + val filteredHigh = outlierTransformer.filterContinuousOutliers( + Array("label", AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + Array.empty[String] + ) + + val filterCount = filteredHigh._2.count() + val nonFilterCount = filteredHigh._1.count() + val filteredColA = + filteredHigh._2.collect().map(_.getAs[Double](FILTER_COL)).head + + assert( + nonFilterCount == EXPECTED_PRESERVE_COUNT_LOWER, + s"rows of non-filtered outlier data in the 5p lower mode." + ) + assert( + filterCount == EXPECTED_FILTER_COUNT_LOWER, + s"rows of outlier filtered data in the 5p lower mode." + ) + assert( + filteredColA == LOWER_FILTER_COL_A_VALUE, + s"for the correct value of col $FILTER_COL row to be filtered out in the 5p lower mode." 
+ ) + } + + it should "filter appropriate values from an exponential distribution in 'both' mode at 5p and 95p" in { + + val exponentialData = DiscreteTestDataGenerator.generateOutlierData( + OUTLIER_ROW_COUNT, + OUTLIER_LABEL_DISTINCT_COUNT + ) + + val outlierTransformer = new OutlierFiltering(exponentialData) + .setLabelCol("label") + .setFilterBounds("both") + .setUpperFilterNTile(0.95) + .setLowerFilterNTile(0.05) + .setFilterPrecision(0.01) + .setContinuousDataThreshold(100) + .setParallelism(1) + + val filteredHigh = outlierTransformer.filterContinuousOutliers( + Array("label", AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + Array.empty[String] + ) + + val filterCount = filteredHigh._2.count() + val nonFilterCount = filteredHigh._1.count() + val filteredColA = + filteredHigh._2.collect().map(_.getAs[Double](FILTER_COL)) + assert( + nonFilterCount == EXPECTED_PRESERVE_COUNT_BOTH, + s"rows of non-filtered outlier data in the 5p/95p both mode." + ) + assert( + filterCount == EXPECTED_FILTER_COUNT_BOTH, + s"rows of outlier filtered data in the 5p/95p both mode." + ) + assert( + filteredColA.sameElements(BOTH_FILTER_COL_A_VALUE), + s"for the correct value of col $FILTER_COL row to be filtered out in the 5p/95p both mode." + ) + } + + it should "filter appropriate values with column exclusion in 'both' mode." 
in { + + val exponentialData = DiscreteTestDataGenerator.generateOutlierData( + OUTLIER_ROW_COUNT, + OUTLIER_LABEL_DISTINCT_COUNT + ) + + val outlierTransformer = new OutlierFiltering(exponentialData) + .setLabelCol("label") + .setFilterBounds("both") + .setUpperFilterNTile(0.95) + .setLowerFilterNTile(0.05) + .setFilterPrecision(0.01) + .setContinuousDataThreshold(100) + .setParallelism(1) + + val filteredHigh = outlierTransformer.filterContinuousOutliers( + Array("label", AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL), + EXCLUSION_COLS + ) + + val filterCount = filteredHigh._2.count() + val nonFilterCount = filteredHigh._1.count() + val filteredColB = + filteredHigh._2.collect().map(_.getAs[Double](EXCLUSION_FIELD)) + + assert( + nonFilterCount == EXPECTED_PRESERVE_EXCLUSION_MODE, + s"rows of non-filtered outlier data in the 5p/95p both mode with column exclusion." + ) + assert( + filterCount == EXPECTED_FILTER_EXCLUSION_MODE, + s"rows of outlier filtered data in the 5p/95p both mode with column exclusion." + ) + assert( + filteredColB.sameElements(BOTH_FILTER_COL_B_VALUE), + s"for the correct value of col $EXCLUSION_FIELD row to be filtered out in the 5p/95p both mode with column exclusion." + ) + } + + it should "filter appropriate values manually in 'upper' mode." 
in { + + val exponentialData = DiscreteTestDataGenerator.generateOutlierData( + OUTLIER_ROW_COUNT, + OUTLIER_LABEL_DISTINCT_COUNT + ) + + val outlierTransformer = new OutlierFiltering(exponentialData) + .setLabelCol("label") + .setFilterBounds("upper") + .setUpperFilterNTile(0.95) + .setLowerFilterNTile(0.05) + .setFilterPrecision(0.01) + .setContinuousDataThreshold(100) + .setParallelism(1) + + val filteredHigh = outlierTransformer.filterContinuousOutliers( + MANUAL_FILTERS, + Array("label", AutoMlPipelineMlFlowUtils.AUTOML_INTERNAL_ID_COL) + ) + + val filterCount = filteredHigh._2.count() + val nonFilterCount = filteredHigh._1.count() + val filteredData = + filteredHigh._2.collect().map(_.getAs[Double](MANUAL_FIELD)).head + + assert( + nonFilterCount == EXPECTED_PRESERVE_COUNT_MANUAL, + s"rows of non-filtered outlier data in the 95p upper mode." + ) + assert( + filterCount == EXPECTED_FILTER_COUNT_MANUAL, + s"rows of outlier filtered data in the 95p upper mode." + ) + assert( + filteredData == UPPER_FILTER_COL_C_MANUAL_VALUE, + s"for the correct value of col $MANUAL_FIELD row to be filtered out in manual mode." 
+ ) + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/PearsonFilteringTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/PearsonFilteringTest.scala new file mode 100644 index 00000000..74c77b4d --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/PearsonFilteringTest.scala @@ -0,0 +1,108 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.sanitize.PearsonFiltering +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import org.apache.spark.sql.DataFrame + +class PearsonFilteringTest extends AbstractUnitSpec { + + private final val TARGET_ROWS = 500 + private final val DEBUG_COUNT = 10 + private final val LABEL_COL = "label" + private final val FEATURES_COL = "features" + + def instantiatePearson(data: DataFrame, + features: Array[String], + modelType: String): PearsonFiltering = { + + new PearsonFiltering(data, features, modelType) + .setLabelCol(LABEL_COL) + .setFeaturesCol(FEATURES_COL) + + } + + it should "filter intended columns by using pvalue filtering with auto thresholding" in { + + val INTENDED_FILTER = Array("positiveCorr1", "positiveCorr2") + + val (data, fields) = + DiscreteTestDataGenerator.generatePearsonFilteringData(TARGET_ROWS) + + val filtered = instantiatePearson(data, fields, "classifier") + .setFilterStatistic("pvalue") + .setFilterDirection("greater") + .filterFields() + + assert( + INTENDED_FILTER.forall(filtered.schema.names.contains) === false, + "appropriate columns have been dropped" + ) + + filtered.show(DEBUG_COUNT) + + } + + it should "filter columns using pvalue except for ignored column" in { + + val INTENDED_FILTER = Array("positiveCorr1") + + val (data, fields) = + DiscreteTestDataGenerator.generatePearsonFilteringData(TARGET_ROWS) + + val filtered = instantiatePearson(data, fields, "classifier") + .filterFields(Array("positiveCorr2")) + + assert( + 
INTENDED_FILTER.forall(filtered.schema.names.contains) === false, + "appropriate columns have been dropped and excluded fields remain" + ) + filtered.show(DEBUG_COUNT) + + } + + it should "filter correctly for a regression problem" in { + + val INTENDED_FILTER = + Array("positiveCorr1", "positiveCorr2", "positiveCorr3") + + val (data, fields) = + DiscreteTestDataGenerator.generatePearsonRegressionFilteringData( + TARGET_ROWS + ) + + val filtered = instantiatePearson(data, fields, "regressor") + .setFilterMode("auto") + .setAutoFilterNTile(0.99) + .filterFields() + + assert( + INTENDED_FILTER.forall(filtered.schema.names.contains) === false, + "appropriate columns have been dropped." + ) + + filtered.show(DEBUG_COUNT) + + } + + it should "filter correctly for regression except for ignored columns" in { + + val INTENDED_FILTER = Array("positiveCorr2") + + val (data, fields) = + DiscreteTestDataGenerator.generatePearsonRegressionFilteringData( + TARGET_ROWS + ) + + val filtered = instantiatePearson(data, fields, "regressor") + .setFilterMode("manual") + .setFilterManualValue(0.9) + .filterFields(Array("positiveCorr1", "positiveCorr3")) + + assert( + INTENDED_FILTER.forall(filtered.schema.names.contains) === false, + "appropriate columns have been dropped and exclusions have been ignored." 
+ ) + filtered.show(DEBUG_COUNT) + + } +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/SanitizerTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/SanitizerTest.scala new file mode 100644 index 00000000..d297a1d4 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/SanitizerTest.scala @@ -0,0 +1,88 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.sanitize.DataSanitizer +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import org.apache.spark.sql.DataFrame + +class SanitizerTest extends AbstractUnitSpec { + + final private val ROW_COUNT = 1000 + final private val CATEGORICAL_COLUMNS = Array("d") + final private val NUMERIC_COLUMNS = Array("a", "b", "c", "e") + final private val BOOLEAN_COLUMNS = Array("f") + final private val LABEL_COL = "label" + + def generateData(modelType: String): DataFrame = { + + DiscreteTestDataGenerator.generateSanitizerData(ROW_COUNT, modelType) + + } + + it should "detect categorical columns" in { + + val result = new DataSanitizer(generateData("classifier")) + .setLabelCol(LABEL_COL) + .generateCleanData() + + assert( + CATEGORICAL_COLUMNS.forall( + result._2.categoricalColumns.keys.toArray.contains + ), + "found categorical columns" + ) + + } + + it should "detect numeric columns" in { + + val result = new DataSanitizer(generateData("classifier")) + .setLabelCol(LABEL_COL) + .generateCleanData() + + assert( + NUMERIC_COLUMNS.forall(result._2.numericColumns.keys.toArray.contains), + "found numeric columns" + ) + + } + + it should "detect boolean columns" in { + + val result = new DataSanitizer(generateData("classifier")) + .setLabelCol(LABEL_COL) + .generateCleanData() + + assert( + BOOLEAN_COLUMNS.forall(result._2.booleanColumns.keys.toArray.contains), + "found boolean columns" + ) + + } + + it should "detect a classifier model type" in { + + val result = new 
DataSanitizer(generateData("classifier")) + .setLabelCol(LABEL_COL) + .generateCleanData() + + assert( + result._3 == "classifier", + "detects classifier type based on label data" + ) + + } + + it should "detect a regressor model type" in { + + val result = new DataSanitizer(generateData("regressor")) + .setLabelCol(LABEL_COL) + .generateCleanData() + + assert( + result._3 == "regressor", + "detects regressor type based on label data" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/sanitize/components/VarianceFilteringTest.scala b/src/test/scala/com/databricks/labs/automl/sanitize/components/VarianceFilteringTest.scala new file mode 100644 index 00000000..00e1a7d8 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/sanitize/components/VarianceFilteringTest.scala @@ -0,0 +1,69 @@ +package com.databricks.labs.automl.sanitize.components + +import com.databricks.labs.automl.sanitize.VarianceFiltering +import com.databricks.labs.automl.{AbstractUnitSpec, DiscreteTestDataGenerator} +import org.apache.spark.sql.DataFrame + +class VarianceFilteringTest extends AbstractUnitSpec { + + private final val LABEL_COL = "label" + private final val FEATURES_COL = "features" + private final val DATETIME_CONVERSION_TYPE = "split" + private final val PARALLELISM = 1 + + def setupVarianceTest( + rowCount: Int, + fieldsToIgnore: Array[String] = Array.empty[String] + ): (DataFrame, Array[String]) = { + + val data = DiscreteTestDataGenerator.generateVarianceFilteringData(rowCount) + + new VarianceFiltering(data) + .setLabelCol(LABEL_COL) + .setFeatureCol(FEATURES_COL) + .setDateTimeConversionType(DATETIME_CONVERSION_TYPE) + .setParallelism(PARALLELISM) + .filterZeroVariance(fieldsToIgnore) + + } + + it should "correctly filter out zero variance columns with no exclusions" in { + + val FILTER_COLUMNS = Array("c", "d") + + val result = setupVarianceTest(200) + + val dfSchemaNames = result._1.schema.names + + assert( + result._2.sameElements(FILTER_COLUMNS), + 
"detect the correct columns to remove" + ) + + assert( + FILTER_COLUMNS.forall(dfSchemaNames.contains) === false, + "appropriate columns have been dropped" + ) + + } + + it should "correctly filter out zero variance with an exclusion applied" in { + + val FILTER_COLUMNS = Array("c") + + val result = setupVarianceTest(200, Array("d")) + + val dfSchemaNames = result._1.schema.names + + assert( + result._2.sameElements(FILTER_COLUMNS), + "detect the correct columns to remove and ignore the correct columns" + ) + assert( + FILTER_COLUMNS.forall(dfSchemaNames.contains) === false, + "appropriate columns have been dropped" + ) + + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/utilities/DataGeneratorTestSchemas.scala b/src/test/scala/com/databricks/labs/automl/utilities/DataGeneratorTestSchemas.scala new file mode 100644 index 00000000..a5d91ee7 --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/utilities/DataGeneratorTestSchemas.scala @@ -0,0 +1,106 @@ +package com.databricks.labs.automl.utilities + +case class SchemaNamesTypes(name: String, dataType: String) + +case class ModelDetectionSchema(a: Double, + label: Double, + automl_internal_id: Long) + +case class OutlierTestSchema(a: Double, + b: Double, + c: Double, + label: Int, + automl_internal_id: Long) + +case class NaFillTestSchema(dblData: Double, + fltData: Float, + intData: Int, + ordinalIntData: Int, + strData: String, + boolData: Boolean, + dateData: String, + label: Int, + automl_internal_id: Long) + +case class VarianceTestSchema(a: Double, + b: Double, + c: Double, + d: Int, + label: Int, + automl_internal_id: Long) + +case class PearsonTestSchema(positiveCorr1: Int, + positiveCorr2: Int, + noFilter1: Double, + noFilter2: Int, + label: Int, + automl_internal_id: Long) + +case class PearsonRegressionTestSchema(positiveCorr1: Double, + positiveCorr2: Double, + positiveCorr3: Int, + noFilter1: Double, + noFilter2: Int, + label: Double, + automl_internal_id: Long) + +case class 
FeatureCorrelationTestSchema(a1: Double, + a2: Double, + b1: Int, + b2: Int, + c1: Double, + c2: Double, + c3: Double, + d1: Long, + d2: Long, + label: Double, + automl_internal_id: Long) + +case class CardinalityFilteringTestSchema(a: Double, b: Int, c: Long, d: String) + +case class SanitizerSchema(a: Double, + b: Int, + c: Long, + d: String, + e: Int, + f: Boolean, + label: String, + automl_internal_id: Long) + +case class SanitizerSchemaRegressor(a: Double, + b: Int, + c: Long, + d: String, + e: Int, + f: Boolean, + label: Double, + automl_internal_id: Long) + +case class KSampleSchema(a: Double, + b: Int, + c: Double, + label: Int, + automl_internal_id: Long) + +case class FeatureInteractionSchema(a: Double, + b: Double, + c: Int, + d: String, + e: Int, + f: String, + label: Int, + automl_internal_id: Long) + +case class ClassifierSchema(a: Double, + b: Double, + c: Int, + d: String, + e: Double, + label: Int) + +case class RegressorSchema(a: Double, + b: Double, + c: Int, + d: String, + e: Double, + label: Double) diff --git a/src/test/scala/com/databricks/labs/automl/utilities/DataGeneratorUtilities.scala b/src/test/scala/com/databricks/labs/automl/utilities/DataGeneratorUtilities.scala new file mode 100644 index 00000000..80a1823b --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/utilities/DataGeneratorUtilities.scala @@ -0,0 +1,978 @@ +package com.databricks.labs.automl.utilities + +import org.joda.time.LocalDate +import com.databricks.labs.automl.utils.structures.ArrayGeneratorMode +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.{col, when} + +import scala.collection.mutable.ArrayBuffer +import scala.util.Random + +trait DataGeneratorUtilities { + + final val DOUBLE_FILL = Double.MinValue + final val FLOAT_FILL = Float.MinValue + final val INT_FILL = Int.MinValue + final val LONG_FILL = Long.MinValue + + import com.databricks.labs.automl.utils.structures.ArrayGeneratorMode._ + + final case class 
ArrayGeneratorException( + private val mode: String, + private val allowableModes: Array[String], + cause: Throwable = None.orNull + ) extends RuntimeException( + s"The array generator mode " + + s"specified: $mode is not in the allowable list of supported modes: ${allowableModes + .mkString(", ")}", + cause + ) + + case class ModuloResult(factor: Int, remain: Int) + + /** + * Enumeration assignment for array sorting mode + * @param mode String - one of 'ascending', 'descending' or 'random' + * @return Enumerated value + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def getArrayMode(mode: String): ArrayGeneratorMode.Value = { + mode match { + case "ascending" => ASC + case "descending" => DESC + case "random" => RAND + case _ => + throw ArrayGeneratorException( + mode, + Array("random", "descending", "ascending") + ) + } + } + + /** + * Helper method for getting the quotient and remainder of an integer division + * @param x Integer value + * @param y Integer divisor value + * @return ModuloResult of the quotient factor and the remainder from the division + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + private def getRemainder(x: Int, y: Int): ModuloResult = { + import scala.math.Integral.Implicits._ + val (q, r) = x /% y + ModuloResult(q, r) + } + + /** + * Method for generating String Arrays of arbitrary size and uniqueness + * @param targetCount number of string elements to generate + * @param uniqueValueCount the number of unique elements within the collection + * @return Array of Strings + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateStringData(targetCount: Int, + uniqueValueCount: Int): Array[String] = { + + val orderedColl = ('a' to 'z').toArray.map(_.toString) + + val repeatingChain = getRemainder(targetCount, orderedColl.length) + + val uniqueBuffer = ArrayBuffer[String](orderedColl: _*) + + if (uniqueValueCount > orderedColl.length) { + for (x <- 0 until repeatingChain.factor) { + orderedColl.foreach(y => uniqueBuffer += y + 
x.toString) + } + } + + val uniqueArray = uniqueBuffer.take(uniqueValueCount).toArray + + val outputArray = if (uniqueValueCount > orderedColl.length) { + Array + .fill(repeatingChain.factor)(uniqueArray) + .flatten ++ uniqueArray + .take(repeatingChain.remain) + } else { + Array.fill(targetCount / (uniqueValueCount - 1))(uniqueArray).flatten + } + + outputArray.take(targetCount) + } + + /** + * Method for generating String Arrays with nullable values based on the supplied rate and offset + * @param targetCount Count of Strings to generate in the Array + * @param uniqueValueCount Number of unique / distinct Strings to have in the collection + * @param targetNullRate The rate with which to generate null values in the Array (deterministic) + * @param nullOffset The offset in which to start iterating through and nulling the values in the Array + * @return Array[String] with nulls inserted. + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateStringDataWithNulls(targetCount: Int, + uniqueValueCount: Int, + targetNullRate: Int, + nullOffset: Int): Array[String] = { + + generateStringData(targetCount, uniqueValueCount).zipWithIndex.map { + case (v, i) => if ((i + nullOffset) % targetNullRate != 0.0) v else null + } + + } + + /** + * Method for generating a Fibonacci sequence to simulate a non-linear distribution for validating data handling in Feature Engineering tests + * @param targetCount Number of values to generate + * @return Array of Doubles in the Fibonacci sequence + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateFibonacciData(targetCount: Int): Array[Double] = { + lazy val fibs + : Stream[BigInt] = BigInt(0) #:: BigInt(1) #:: (fibs zip fibs.tail) + .map(x => x._1 + x._2) + fibs.take(targetCount).toArray.map(_.toDouble) + } + + /** + * Method for generating a Fibonacci sequence with a fill condition in order to nullify later + * @param targetCount number of elements in the Fibonacci sequence to generate + * @param targetNullRate Frequency of null 
values to generate + * @param nullOffset first index position to start filling in the fillable nullable values + * @return Array of Double of the Fibonacci sequence + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateFibonacciDataWithNulls(targetCount: Int, + targetNullRate: Int, + nullOffset: Int): Array[Double] = { + generateFibonacciData(targetCount).zipWithIndex.map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else DOUBLE_FILL + } + } + + /** + * Method for generating linear-distributed Doubles + * @param targetCount Desired number of doubles in the array + * @param start starting point for data generation in the Range function + * @param step the value of distance between the start and the target count value + * @param mode for the DataFrame's by row reference of sorting order of the Array + * @return Array of Doubles + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateDoublesData(targetCount: Int, + start: Double, + step: Double, + mode: String): Array[Double] = { + + val sortMode = getArrayMode(mode) + + val stoppingPoint = (targetCount * step) + start + + val doubleArray = + Range.BigDecimal(start, stoppingPoint, step).toArray.map(_.toDouble) + + sortMode match { + case ASC => doubleArray.sortWith(_ < _) + case DESC => doubleArray.sortWith(_ > _) + case RAND => Random.shuffle(doubleArray.toList).toArray + } + } + + /** + * Method for generating a series of Doubles with a fill condition in order to nullify later + * @param targetCount Desired number of doubles to generate + * @param start Starting position + * @param step Space between each value + * @param mode sorting mode + * @param targetNullRate Frequency of null values to generate + * @param nullOffset first index position to start from to generate nulls + * @return Array of Doubles with fillable values in place where nulls will be in the test DataFrame + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def 
generateDoublesDataWithNulls(targetCount: Int, + start: Double, + step: Double, + mode: String, + targetNullRate: Int, + nullOffset: Int): Array[Double] = { + + generateDoublesData(targetCount, start, step, mode).zipWithIndex.map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else DOUBLE_FILL + } + + } + + /** + * Method for generating ordinal Double Data (repeating values) + * @param targetCount Total number of Doubles to generate in the series + * @param start Starting point for the repeating series + * @param step Distance between the repeating series values + * @param mode sorting mode for the repeating arrays + * @param distinctValues number of elements in the repeating series + * @return Array[Double] of Repeating Ordinal Doubles + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateRepeatingDoublesData(targetCount: Int, + start: Double, + step: Double, + mode: String, + distinctValues: Int): Array[Double] = { + + val sortMode = getArrayMode(mode) + val subStopPoint = (distinctValues * step) + start - 1.0 + val distinctArray = (start to subStopPoint by step).toArray + val sortedArray = sortMode match { + case ASC => distinctArray.sortWith(_ < _) + case DESC => distinctArray.sortWith(_ > _) + case RAND => distinctArray + } + val outputArray = Array + .fill(targetCount / (sortedArray.length - 1))(sortedArray) + .flatten ++ sortedArray + .take(targetCount) + + if (sortMode == RAND) Random.shuffle(outputArray.toList).toArray + else outputArray + + } + + /** + * Method for generating synthetic float series data + * @param targetCount Number of floats to generate + * @param start Starting offset position + * @param step Offset between each element + * @param mode sorting mode (ascending / descending / random shuffle) + * @return Array of Floats + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateFloatsData(targetCount: Int, + start: Float, + step: Float, + mode: String): Array[Float] = { + + val sortMode = 
getArrayMode(mode) + val stoppingPoint = (targetCount * step) + start + + val floatArray = (start to stoppingPoint by step).toArray + + sortMode match { + case ASC => floatArray.sortWith(_ < _) + case DESC => floatArray.sortWith(_ > _) + case RAND => Random.shuffle(floatArray.toList).toArray + } + + } + + /** + * Method for generating an array of floats with a fill condition in order to nullify later + * @param targetCount Number of floats to generate + * @param start Starting offset position + * @param step Offset between each element + * @param mode sorting mode (ascending / descending / random shuffle) + * @param targetNullRate frequency of min val nullable values to generate + * @param nullOffset first index position to start from to generate the null-fillable min values + * @return Array of Float with fillable values in place where nulls will be converted in the test DataFrame + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateFloatsDataWithNulls(targetCount: Int, + start: Float, + step: Float, + mode: String, + targetNullRate: Int, + nullOffset: Int): Array[Float] = { + generateFloatsData(targetCount, start, step, mode).zipWithIndex.map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else FLOAT_FILL + } + } + + /** + * Method for generating a series of Doubles with a logarithmic distribution + * @param targetCount Number of Doubles to generate + * @param start starting min value for the linear series + * @param step distance between values for the linear series that then gets converted into log scale + * @param mode sorting mode + * @return Array of Doubles in a logarithmic distribution + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateLogData(targetCount: Int, + start: Double, + step: Double, + mode: String): Array[Double] = { + val doublesSequence = generateDoublesData(targetCount, start, step, mode) + + val minSeq = + if (doublesSequence.min == 0.0) DOUBLE_FILL else doublesSequence.min + val 
maxSeq = doublesSequence.max + + val b = math.log(maxSeq / minSeq) / (maxSeq - minSeq) + val a = maxSeq / math.exp(b * maxSeq) + + doublesSequence.map { x => + a * math.exp(b * x) + } + + } + + /** + * Generate a log distribution with a fill condition in order to nullify later + * @param targetCount Target number of elements to produce in a log distribution + * @param start Starting min value of the distribution + * @param step spacing on a linear series between values + * @param mode sorting mode + * @param targetNullRate frequency of fillable values for null replacement + * @param nullOffset index of series to start the nullable MinValue elements for null replacement + * @return Array of Doubles of a logarithmic distribution with Double.MinValue for null replacement + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateLogDataWithNulls(targetCount: Int, + start: Double, + step: Double, + mode: String, + targetNullRate: Int, + nullOffset: Int): Array[Double] = { + generateLogData(targetCount, start, step, mode).zipWithIndex.map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else DOUBLE_FILL + } + } + + /** + * Generate a series of Doubles with an exponential distribution + * @param targetCount Desired number of elements in the series + * @param start starting point for the minimum value of the linear series + * @param step numeric distance between each value of the linear series + * @param mode sort mode for the series + * @param power exponent that each element will be raised to + * @return Array of Doubles that have been raised to a particular power + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateExponentialData(targetCount: Int, + start: Double, + step: Double, + mode: String, + power: Int): Array[Double] = { + + generateDoublesData(targetCount, start, step, mode).map(math.pow(_, power)) + + } + + /** + * Generate an exponential distributed Doubles series with a fill condition in order to nullify 
later + * @param targetCount Desired number of elements in the series + * @param start starting point for the minimum value of the linear series + * @param step numeric distance between each value of the linear series + * @param mode sort mode for the series + * @param power exponent that each element will be raised to + * @param targetNullRate frequency of fillable values for null replacement + * @param nullOffset index of series to start the nullable MinValue elements for null replacement + * @return Array of Doubles of a exponential distribution with Double.MinValue for null replacement + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateExponentialDataWithNulls(targetCount: Int, + start: Double, + step: Double, + mode: String, + power: Int, + targetNullRate: Int, + nullOffset: Int): Array[Double] = { + + generateExponentialData(targetCount, start, step, mode, power).zipWithIndex + .map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else DOUBLE_FILL + } + + } + + /** + * Method for generating a series of Integers, linearly distributed + * @param targetCount Number of Integers to generate + * @param start Starting Integer to work from + * @param step numeric distance between each value in the linear series + * @param mode sorting mode for the series (ascending, descending, or random) + * @return Array of Integers with a linear distribution + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateIntData(targetCount: Int, + start: Int, + step: Int, + mode: String): Array[Int] = { + + val sortMode = getArrayMode(mode) + val stoppingPoint = (targetCount * step) + start + + val intArray = (start to stoppingPoint by step).toArray + + sortMode match { + case ASC => intArray.sortWith(_ < _) + case DESC => intArray.sortWith(_ > _) + case RAND => Random.shuffle(intArray.toList).toArray + } + + } + + /** + * Method for generating a linear series of Integers with a fill condition in order to nullify later + * @param 
targetCount Number of Integers to generate + * @param start starting position for the series (min Int value) + * @param step numeric distance between each value in the linear series + * @param mode sorting mode for the series + * @param targetNullRate frequency of fillable values to insert that will later be nulled + * @param nullOffset adjustment to the starting position + * @return Array of Integers in a linear distribution with nulls inserted + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateIntDataWithNulls(targetCount: Int, + start: Int, + step: Int, + mode: String, + targetNullRate: Int, + nullOffset: Int): Array[Int] = { + + generateIntData(targetCount, start, step, mode).zipWithIndex.map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else INT_FILL + } + + } + + /** + * Method for generating series of Longs + * @param targetCount Number of Long values to generate + * @param start starting position for the series (minimum Long value) + * @param step numeric distance between each value in the linear series + * @param mode sorting mode for the series + * @return Array[Long] in a linear distribution + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateLongData(targetCount: Int, + start: Long, + step: Long, + mode: String): Array[Long] = { + val sortMode = getArrayMode(mode) + val stoppingPoint = (targetCount * step) + start + + val longArray = (start to stoppingPoint by step).toArray + + sortMode match { + case ASC => longArray.sortWith(_ < _) + case DESC => longArray.sortWith(_ > _) + case RAND => Random.shuffle(longArray.toList).toArray + } + } + + /** + * Method for generating a series of Long values with a fill condition in order to nullify later + * @param targetCount Number of Long values to generate + * @param start starting position for the series (minimum Long value) + * @param step numeric distance between each value in the linear series + * @param mode sorting mode for the series + * @param 
targetNullRate rate of inserting fillable values to null later + * @param nullOffset offset on starting position of series to place fillable values + * @return Array[Long] with nullable values filled + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateLongDataWithNulls(targetCount: Int, + start: Long, + step: Long, + mode: String, + targetNullRate: Int, + nullOffset: Int): Array[Long] = { + generateLongData(targetCount, start, step, mode).zipWithIndex.map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else LONG_FILL + } + } + + /** + * Method for generating ordinal Long Data (repeating values) + * @param targetCount Total number of Longs to generate in the series + * @param start Starting point for the repeating series + * @param step Distance between the repeating series values + * @param mode sorting mode for the repeating arrays + * @param distinctValues number of elements in the repeating series + * @return Array[Long] of Repeating Ordinal Longs + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateRepeatingLongData(targetCount: Int, + start: Long, + step: Long, + mode: String, + distinctValues: Int): Array[Long] = { + + val sortMode = getArrayMode(mode) + val subStopPoint = (distinctValues * step) + start - 1L + val distinctArray = (start to subStopPoint by step).toArray + val sortedArray = sortMode match { + case ASC => distinctArray.sortWith(_ < _) + case DESC => distinctArray.sortWith(_ > _) + case RAND => distinctArray + } + val outputArray = Array + .fill(targetCount / (sortedArray.length - 1))(sortedArray) + .flatten + .take(targetCount) + + if (sortMode == RAND) Random.shuffle(outputArray.toList).toArray + else outputArray + + } + + /** + * Method for generating ordinal Integer Data (repeating values) + * @param targetCount Total number of Integers to generate in the series + * @param start Starting point for the repeating series + * @param step Distance between the repeating series values + * 
@param mode sorting mode for the repeating arrays + * @param distinctValues number of elements in the repeating series + * @return Array[Int] of Repeating Ordinal Integers + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateRepeatingIntData(targetCount: Int, + start: Int, + step: Int, + mode: String, + distinctValues: Int): Array[Int] = { + + val sortMode = getArrayMode(mode) + val subStopPoint = (distinctValues * step) + start - 1 + val distinctArray = (start to subStopPoint by step).toArray + val sortedArray = sortMode match { + case ASC => distinctArray.sortWith(_ < _) + case DESC => distinctArray.sortWith(_ > _) + case RAND => distinctArray + } + val outputArray = Array + .fill(targetCount / (sortedArray.length - 1))(sortedArray) + .flatten + .take(targetCount) + + if (sortMode == RAND) Random.shuffle(outputArray.toList).toArray + else outputArray + + } + + /** + * Method for generating ordinal Integer Data with Int.MinValue inserted to be replaced with null + * @param targetCount Total number of Integers to generate in the series + * @param start Starting point for the repeating series + * @param step Distance between the repeating series values + * @param mode sorting mode for the repeating arrays + * @param distinctValues number of elements in the repeating series + * @param targetNullRate Desired frequency of Int.MinValue to convert to null values + * @param nullOffset adjustment to the starting position of Int.MinValue frequency + * @return Array[Int] of Repeating Ordinal Integers with Int.MinValue for null replacement + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateRepeatingIntDataWithNulls(targetCount: Int, + start: Int, + step: Int, + mode: String, + distinctValues: Int, + targetNullRate: Int, + nullOffset: Int): Array[Int] = { + generateRepeatingIntData(targetCount, start, step, mode, distinctValues).zipWithIndex + .map { + case (v, i) => + if ((i + nullOffset) % targetNullRate != 0.0) v else INT_FILL + } + } + 
+ /** + * Method for generating a series of Dates + * @param targetCount Number of dates to generate + * @param startingYear Starting year for the date series + * @param startingMonth Starting month for the date series + * @param startingDay Starting day for the date series + * @return Array[String] Series of Dates + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateDates(targetCount: Int, + startingYear: Int, + startingMonth: Int, + startingDay: Int): Array[String] = { + + val start = new LocalDate(startingYear, startingMonth, startingDay) + val dates = for (x <- 0 to targetCount) yield start.plusDays(x) + dates.map(_.toString).toArray + + } + + /** + * Method for generating a series of Dates with nulls inserted in the series + * @param targetCount Number of dates to generate + * @param startingYear Starting year for the date series + * @param startingMonth Starting month for the date series + * @param startingDay Starting day for the date series + * @param targetNullRate Frequency with which to insert null values in the date column + * @param nullOffset Adjustment to the starting position in the series to begin nulling data out + * @return Array[String] series of dates with null values inserted + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateDatesWithNulls(targetCount: Int, + startingYear: Int, + startingMonth: Int, + startingDay: Int, + targetNullRate: Int, + nullOffset: Int): Array[String] = { + generateDates(targetCount, startingYear, startingMonth, startingDay).zipWithIndex + .map { + case (v, i) => if ((i + nullOffset) % targetNullRate != 0.0) v else null + } + } + + /** + * Method to generate an Array of Boolean values + * @param targetCount Number of alternating Boolean values to generate + * @return Array[Boolean] + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateBooleanData(targetCount: Int): Array[Boolean] = { + Array.fill(targetCount)(Array(true, false)).flatten.take(targetCount) + } + + 
/** + * Method for generating Boolean data and forcing values to Null in the Array + * @param targetCount Number of Boolean values to generate + * @param targetNullRate Frequency of null values to insert + * @param nullOffset Adjustment to the start position to insert null values + * @return Array[Boolean] with null values inserted + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateBooleanDataWithNulls(targetCount: Int, + targetNullRate: Int, + nullOffset: Int): Array[Boolean] = { + + generateBooleanData(targetCount).zipWithIndex + .map { + case (v, i) => if ((i + nullOffset) % targetNullRate != 0.0) v else null + } + .map(_.asInstanceOf[Boolean]) + + } + + /** + * Method for generating a two-tailed distribution of data in a series of data + * @param targetCount Number of elements to generate + * @param start Starting point for the median of the series + * @param step numeric distance between elements on a linear scale + * @param mode sorting mode + * @return Array of tailed Linear data + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateTailedData(targetCount: Int, + start: Double, + step: Double, + mode: String): Array[Double] = { + + val splitCount = math.ceil(targetCount / 2).toInt + + val mergedData = generateDoublesData(splitCount, start, step, mode).map( + x => x * -1.0 + ) ++ generateDoublesData(splitCount, start, step, mode) + val limitedData = mergedData.take(targetCount) + getArrayMode(mode) match { + case ASC => limitedData.sortWith(_ < _) + case DESC => limitedData.sortWith(_ > _) + case RAND => Random.shuffle(limitedData.toList).toArray + } + + } + + /** + * Method for generating a two-tailed distribution of exponential data + * @param targetCount Number of elements to generate + * @param start starting point for the median of the series + * @param step linear series distance between data points (which is then raised to the power provided) + * @param mode sorting mode for the final series + * @param power power 
to raise each element of the series by + * @return Array of tailed exponential data + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateTailedExponentialData(targetCount: Int, + start: Double, + step: Double, + mode: String, + power: Int): Array[Double] = { + val splitCount = math.ceil(targetCount / 2).toInt + + val mergedData = generateExponentialData( + splitCount, + start, + step, + mode, + power + ).map(x => x * -1) ++ generateExponentialData( + splitCount, + start, + step, + mode, + power + ) + val limitedData = mergedData.take(targetCount) + getArrayMode(mode) match { + case ASC => limitedData.sortWith(_ < _) + case DESC => limitedData.sortWith(_ > _) + case RAND => Random.shuffle(limitedData.toList).toArray + } + } + + /** + * Method for converting the temporary holders of 'null' values to actual nulls in a DataFrame + * @param df DataFrame with temporary placeholder values + * @return DataFrame with Nulls inserted + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def reassignToNulls(df: DataFrame): DataFrame = { + + val EXCLUSION_TYPES = Seq("boolean", "date", "time", "string") + + val namesAndTypes = df.schema + .map(x => SchemaNamesTypes(x.name, x.dataType.typeName)) + .filterNot(x => EXCLUSION_TYPES.contains(x.dataType)) + + namesAndTypes.foldLeft(df) { (df, x) => + { + + val dtype = x.dataType match { + case "double" => DOUBLE_FILL + case "float" => FLOAT_FILL + case "integer" => INT_FILL + case "long" => LONG_FILL + } + df.withColumn( + x.name, + when(col(x.name) === dtype, null).otherwise(col(x.name)) + ) + } + } + } + + def generateStaticIntSeries(targetCount: Int, value: Int): Array[Int] = { + Array.fill(targetCount)(value) + } + + def generateStaticDoubleSeries(targetCount: Int, + value: Double): Array[Double] = { + Array.fill(targetCount)(value) + } + + /** + * Method for generating a classifier data set that has blocks of labels that will ensure significant separation + * within the feature vectors created. 
(Useful for testing items such as KSampling) + * @param targetCount Number of elements to generate in each Array + * @param start Starting value of the distinct values + * @param step Distance between each of the distinct values + * @param mode How to handle the associations of the blocks (ascending, descending, or random) + * @param distinctValues Number of distinct elements to create within the Array + * @return Array[Int] of data (in blocks or random) + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateIntegerBlocks(targetCount: Int, + start: Int, + step: Int, + mode: String, + distinctValues: Int): Array[Int] = { + + val sortMode = getArrayMode(mode) + val blockGenSize = + math.ceil(targetCount.toDouble / distinctValues.toDouble).toInt + + val calculatedStop = start + (distinctValues * step) + + sortMode match { + case ASC => + (start to calculatedStop by step).toArray + .flatMap(x => Array.fill(blockGenSize)(x)) + case DESC => + (start to calculatedStop by step).toArray.reverse + .flatMap(x => Array.fill(blockGenSize)(x)) + case RAND => + Random + .shuffle( + (start to calculatedStop by step).toArray + .flatMap(x => Array.fill(blockGenSize)(x)) + .toList + ) + .toArray + } + + } + + /** + * Method for generating a skewed classification value for the label field to test out features such as KSampling and + * stratified split + * @param targetCount number of elements to generate in the Array + * @param start Starting value + * @param step distance between each block grouped value in the unique array + * @param mode whether to sort the blocks (ascending or descending) or to randomize them after generation + * @param distinctValues number of unique values in the array + * @return Array[Int] for classification label column values + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateIntegerBlocksSkewed(targetCount: Int, + start: Int, + step: Int, + mode: String, + distinctValues: Int): Array[Int] = { + + val sortMode = 
getArrayMode(mode) + val blockGenSize = + math.ceil(targetCount.toDouble / distinctValues.toDouble).toInt + + val calculatedStop = start + (distinctValues * step) + + var extractCount = targetCount + + val data = (0 until distinctValues).toArray.flatMap(x => { + extractCount = extractCount / 2 + Array.fill(extractCount)(start + (x * step)) + }) + + val remainder = targetCount - data.length + + sortMode match { + case ASC => + Array.fill(remainder)(start) ++ data + case DESC => + (Array.fill(remainder)(start) ++ data).reverse + case RAND => + Random + .shuffle(Array.fill(remainder)(start).toList ++ data.toList) + .toArray + } + + } + + /** + * Method for generating a field of data with repeating blocks of Doubles + * @param targetCount Number of elements to generate in each Array + * @param start Starting value of the distinct values + * @param step Distance between each of the distinct values + * @param mode How to handle the associations of the blocks (ascending, descending, or random) + * @param distinctValues Number of distinct elements to create within the Array + * @return Array[Double] of data (in blocks or random) + * @since 0.6.2 + * @author Ben Wilson, Databricks + */ + def generateDoublesBlocks(targetCount: Int, + start: Double, + step: Double, + mode: String, + distinctValues: Int): Array[Double] = { + + val sortMode = getArrayMode(mode) + val blockGenSize = + math.ceil(targetCount.toDouble / distinctValues.toDouble).toInt + + val calculatedStop = start + (distinctValues * step) + + sortMode match { + case ASC => + (start to calculatedStop by step).toArray + .flatMap(x => Array.fill(blockGenSize)(x)) + case DESC => + (start to calculatedStop by step).toArray.reverse + .flatMap(x => Array.fill(blockGenSize)(x)) + case RAND => + Random + .shuffle( + (start to calculatedStop by step).toArray + .flatMap(x => Array.fill(blockGenSize)(x)) + .toList + ) + .toArray + } + + } + + /** + * Private method for generating an arbitrary logistic map series of data 
(recurrence relation of degree 2) + * @param targetCount Number of elements to produce + * @param start Starting position + * @param lambda r factor (mortality) + * @return Array[Double] as influenced by the r value to determine the degree of distribution within the data + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + private def logisticMap(targetCount: Int, + start: Double, + lambda: Double): Array[Double] = { + + Stream.iterate(start)(x => lambda * x * (1 - x)).take(targetCount).toArray + + } + + /** + * Method for generating different sequences of data based on polynomial mapping + * + * General notes about r values (lambda) + * 0 - 1 (decay to 0) + * 1 - 2 (stabilizes to (r-1)/r) + * 2 - 3 (fluctuates and eventually settles at (r-1)/r) + * 3 - 3.44949 (permanent oscillations between two values) (bi-modal) + * 3.4495 - 3.54409 (permanent oscillations among 4 values) (quad-modal) + * > 3.5441 8, 16, 32, 64... relative chaos of repeating values + * > 3.56995 - pure chaos + * + * @param targetCount Number of elements to produce + * @param start Starting position in the space 0.0 -> 1.0 + * @param dataType String of one of "decay", "bimodal" or "chaotic" to generate 3 different types of numeric series + * + * @note read https://en.wikipedia.org/wiki/Logistic_map for further information. + * @since 0.7.0 + * @author Ben Wilson, Databricks + */ + def generatePeriodicData(targetCount: Int, + start: Double, + dataType: String): Array[Double] = { + + require( + start > 0.0 && start < 1.0, + s"simulating series outside of range 0, 1 will generate useless data" + ) + + dataType match { + case "decay" => logisticMap(targetCount, start, 2.9) + case "bimodal" => logisticMap(targetCount, start, 3.25) + case "chaotic" => logisticMap(targetCount, start, 3.55) + case _ => + throw new IllegalArgumentException( + "Unsupported type. 
Must be either 'decay', 'bimodal', or 'chaotic'" + ) + } + } + +} diff --git a/src/test/scala/com/databricks/labs/automl/utilities/ValidationUtilities.scala b/src/test/scala/com/databricks/labs/automl/utilities/ValidationUtilities.scala new file mode 100644 index 00000000..5356311a --- /dev/null +++ b/src/test/scala/com/databricks/labs/automl/utilities/ValidationUtilities.scala @@ -0,0 +1,19 @@ +package com.databricks.labs.automl.utilities + +object ValidationUtilities { + + def fieldCreationAssertion(expectedFields: Array[String], + generatedFieldNames: Array[String]): Unit = { + + assert( + generatedFieldNames.forall(expectedFields.contains), + "generated columns that were not expected" + ) + assert( + expectedFields.forall(generatedFieldNames.contains), + "did not generate all of the expected columns" + ) + + } + +}
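
The two `forall` checks in `fieldCreationAssertion` together amount to a set-equality test on column names. The following is a minimal sketch, not part of this change set, of how the same comparison could also report which names differ. `FieldCheckSketch` and `fieldDifferences` are hypothetical names introduced only for illustration.

```scala
// Hypothetical, self-contained sketch; FieldCheckSketch and fieldDifferences
// are illustrative names and do not exist in the toolkit.
object FieldCheckSketch {

  // Returns (unexpected, missing):
  //   unexpected = names that were generated but not expected
  //   missing    = names that were expected but not generated
  def fieldDifferences(
    expectedFields: Array[String],
    generatedFieldNames: Array[String]
  ): (Set[String], Set[String]) = {
    val expected = expectedFields.toSet
    val generated = generatedFieldNames.toSet
    (generated.diff(expected), expected.diff(generated))
  }

  def main(args: Array[String]): Unit = {
    val expected = Array("label", "features", "age")
    val generated = Array("features", "label", "age_si")

    val (unexpected, missing) = fieldDifferences(expected, generated)

    // Both sets are empty exactly when the two forall checks would pass.
    println(s"unexpected: $unexpected, missing: $missing")
  }
}
```

Comparing sets ignores column order and duplicates, which matches the behavior of the two `forall` checks. Returning the differences instead of asserting leaves it to the calling test to decide how to report a mismatch.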