beers_dirty_5_implicitmissingvaluemedianmode: Accidental duplication of 2 rows #5

@visenger

Running the following grouping command
dirtyData.groupBy(col("tid")).count().where(col("count") > 11).show()

suggests that two data points were duplicated (or triplicated):
+----+-----+
| tid|count|
+----+-----+
|   0|   55|
|1206|   66|
+----+-----+
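
The grouping command above relies on a fixed threshold of 11 rows per tid. As a complementary check, one could count rows per (tid, attrName) pair instead, which flags a replicated tuple without knowing how many attributes a tuple has. The following is only a minimal sketch in plain Python, assuming the long (tid, attrName, dirty-value) layout shown in this issue; the function name `replicated_tids` and the sample rows for tid 1 are hypothetical and serve only as illustration.

```python
from collections import Counter

# Illustrative sample in the long format used by the dirty dataset:
# (tid, attrName, dirty-value). The tid 0 values are taken from this issue;
# the tid 1 rows are made up to represent a tuple without replication.
rows = [
    (0, "id", "825"),
    (0, "id", "2222"),
    (0, "city", "Stevensville"),
    (0, "city", "Dayton"),
    (1, "id", "17"),
    (1, "city", "Bend"),
]


def replicated_tids(rows):
    """Return {tid: {attrName: n}} for every (tid, attrName) pair seen more than once."""
    per_pair = Counter((tid, attr) for tid, attr, _ in rows)
    result = {}
    for (tid, attr), n in per_pair.items():
        if n > 1:
            result.setdefault(tid, {})[attr] = n
    return result


suspects = replicated_tids(rows)
print(suspects)
```

In Spark, the same idea would presumably amount to grouping by both col("tid") and col("attrName") and keeping the groups whose count exceeds 1.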

In particular, the dirty dataset for rows with "tid"==0 looks like the following:

+---+------------+--------------------+--------------------+-----+
|tid|    attrName|         dirty-value|         clean-value|label|
+---+------------+--------------------+--------------------+-----+
|  0|          id|                 825|                1436|    1|
|  0|          id|                2222|                1436|    1|
|  0|          id|                2233|                1436|    1|
|  0|          id|                 665|                1436|    1|
|  0|   beer-name|      Bodacious Bock|            Pub Beer|    1|
|  0|   beer-name|              10 Ton|            Pub Beer|    1|
|  0|   beer-name|      American Lager|            Pub Beer|    1|
|  0|   beer-name|       Toasted Lager|            Pub Beer|    1|
|  0|       style|                Bock| American Pale Lager|    1|
|  0|       style|       Oatmeal Stout| American Pale Lager|    1|
|  0|       style|American Adjunct ...| American Pale Lager|    1|
|  0|       style|        Vienna Lager| American Pale Lager|    1|
|  0|      ounces|                16.0|                12.0|    1|
|  0|      ounces|                16.0|                12.0|    1|
|  0|         abv|               0.075|                0.05|    1|
|  0|         abv|                0.07|                0.05|    1|
|  0|         abv|0.040999999999999995|                0.05|    1|
|  0|         abv|               0.055|                0.05|    1|
|  0|         ibu|                 8.0|                null|    1|
|  0|         ibu|                28.0|                null|    1|
|  0|  brewery_id|                 499|                 408|    1|
|  0|  brewery_id|                  94|                 408|    1|
|  0|  brewery_id|                 129|                 408|    1|
|  0|  brewery_id|                 489|                 408|    1|
|  0|brewery-name|Wildwood Brewing ...|10 Barrel Brewing...|    1|
|  0|brewery-name|Warped Wing Brewi...|10 Barrel Brewing...|    1|
|  0|brewery-name|      Straub Brewery|10 Barrel Brewing...|    1|
|  0|brewery-name|Blue Point Brewin...|10 Barrel Brewing...|    1|
|  0|        city|        Stevensville|                Bend|    1|
|  0|        city|              Dayton|                Bend|    1|
|  0|        city|           St Mary's|                Bend|    1|
|  0|        city|           Patchogue|                Bend|    1|
|  0|       state|                  MT|                  OR|    1|
|  0|       state|                  OH|                  OR|    1|
|  0|       state|                  PA|                  OR|    1|
|  0|       state|                  NY|                  OR|    1|
