Description
By running the grouping command:
dirtyData.groupBy(col("tid")).count().where(col("count") > 11).show()
it seems that two data points got duplicated (or triplicated):
+----+-----+
| tid|count|
+----+-----+
|   0|   55|
|1206|   66|
+----+-----+
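To narrow down where the extra rows come from, it may help to count per (tid, attrName) pair rather than per tid alone. The following is only a minimal sketch in plain Python, using a handful of rows copied from the tid == 0 output reported in this issue. It assumes that each (tid, attrName) cell is supposed to appear once in the dirty dataset. The names `rows`, `per_cell`, and `repeated` are hypothetical. In Spark the same check would presumably be `dirtyData.groupBy("tid", "attrName").count()`.

```python
from collections import Counter

# A few (tid, attrName, dirty-value) rows copied from the tid == 0 output
# in this issue; this is a sample, not the full dataset.
rows = [
    (0, "id", "825"),
    (0, "id", "2222"),
    (0, "id", "2233"),
    (0, "id", "665"),
    (0, "ounces", "16.0"),
    (0, "ounces", "16.0"),
    (0, "ibu", "8.0"),
    (0, "ibu", "28.0"),
]

# Count rows per (tid, attrName) cell.
per_cell = Counter((tid, attr) for tid, attr, _ in rows)

# Assumption: a cell should appear once, so any count above 1 is a repeat.
repeated = {cell: n for cell, n in per_cell.items() if n > 1}

for (tid, attr), n in sorted(repeated.items()):
    print(f"tid={tid} attrName={attr} count={n}")
```

On this sample, `id` shows up four times while `ounces` and `ibu` show up twice each, so the repeat factor is not uniform across attributes.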
In particular, the dirty dataset for rows with "tid"==0 looks like the following:
+---+------------+--------------------+--------------------+-----+
|tid| attrName| dirty-value| clean-value|label|
+---+------------+--------------------+--------------------+-----+
| 0| id| 825| 1436| 1|
| 0| id| 2222| 1436| 1|
| 0| id| 2233| 1436| 1|
| 0| id| 665| 1436| 1|
| 0| beer-name| Bodacious Bock| Pub Beer| 1|
| 0| beer-name| 10 Ton| Pub Beer| 1|
| 0| beer-name| American Lager| Pub Beer| 1|
| 0| beer-name| Toasted Lager| Pub Beer| 1|
| 0| style| Bock| American Pale Lager| 1|
| 0| style| Oatmeal Stout| American Pale Lager| 1|
| 0| style|American Adjunct ...| American Pale Lager| 1|
| 0| style| Vienna Lager| American Pale Lager| 1|
| 0| ounces| 16.0| 12.0| 1|
| 0| ounces| 16.0| 12.0| 1|
| 0| abv| 0.075| 0.05| 1|
| 0| abv| 0.07| 0.05| 1|
| 0| abv|0.040999999999999995| 0.05| 1|
| 0| abv| 0.055| 0.05| 1|
| 0| ibu| 8.0| null| 1|
| 0| ibu| 28.0| null| 1|
| 0| brewery_id| 499| 408| 1|
| 0| brewery_id| 94| 408| 1|
| 0| brewery_id| 129| 408| 1|
| 0| brewery_id| 489| 408| 1|
| 0|brewery-name|Wildwood Brewing ...|10 Barrel Brewing...| 1|
| 0|brewery-name|Warped Wing Brewi...|10 Barrel Brewing...| 1|
| 0|brewery-name| Straub Brewery|10 Barrel Brewing...| 1|
| 0|brewery-name|Blue Point Brewin...|10 Barrel Brewing...| 1|
| 0| city| Stevensville| Bend| 1|
| 0| city| Dayton| Bend| 1|
| 0| city| St Mary's| Bend| 1|
| 0| city| Patchogue| Bend| 1|
| 0| state| MT| OR| 1|
| 0| state| OH| OR| 1|
| 0| state| PA| OR| 1|
| 0| state| NY| OR| 1|