
Refactor dataset filtering and event deserialization#521

Merged
probberechts merged 2 commits into PySport:master from probberechts:feat/filtered-dataset
Mar 10, 2026

Conversation

@probberechts
Contributor

Context

For #421 (and related PRs) we need to be able to distinguish between complete datasets and filtered datasets. For example, we should be able to raise an exception when a user tries to insert synthetically generated carries into a dataset that contains only shots. I was trying to figure out the cleanest way to support this. Here is a proposal.

Changes

1. Dynamic Class Generation for Filtered Datasets

I implemented a FilteredDataset mixin that uses dynamic class generation. When Dataset.filter() is called, it now returns a dynamically generated subclass (e.g., FilteredEventDataset). This approach preserves the original type hierarchy while allowing us to tag the instance as filtered.

Example

from kloppy import statsbomb
from kloppy.domain import EventDataset, FilteredDataset

dataset = statsbomb.load_open_data()
filtered = dataset.filter("pass")

print(type(filtered))  
# <class 'FilteredEventDataset'>

print(isinstance(filtered, EventDataset)) 
# True (It still works as an EventDataset)

print(isinstance(filtered, FilteredDataset)) 
# True (It is now identified as filtered)
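The mixin approach above can be sketched as follows. This is an illustrative, self-contained reconstruction of the idea, not the actual kloppy implementation; the class names mirror the PR text, but the constructor and the predicate-based filter signature are simplifications.

```python
class FilteredDataset:
    """Marker mixin identifying a dataset as filtered."""


class Dataset:
    def __init__(self, records):
        self.records = records

    def filter(self, predicate):
        base_cls = type(self)
        if issubclass(base_cls, FilteredDataset):
            # Already filtered: re-filtering should not stack class names
            filtered_cls = base_cls
        else:
            # Dynamically generate e.g. FilteredEventDataset, inheriting
            # from both the mixin and the concrete dataset class
            filtered_cls = type(
                f"Filtered{base_cls.__name__}",
                (FilteredDataset, base_cls),
                {},
            )
        result = filtered_cls.__new__(filtered_cls)
        result.__dict__.update(self.__dict__)
        result.records = [r for r in self.records if predicate(r)]
        return result


class EventDataset(Dataset):
    pass


dataset = EventDataset(records=["pass", "shot", "pass"])
filtered = dataset.filter(lambda r: r == "pass")
print(type(filtered).__name__)                # FilteredEventDataset
print(isinstance(filtered, EventDataset))     # True
print(isinstance(filtered, FilteredDataset))  # True
```

Because the subclass is generated from the runtime type, provider-specific subclasses of `EventDataset` keep their full behavior after filtering, and `isinstance` checks against both the original class and the mixin work as shown in the example above.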

2. Standardized Filtering Behavior & Event References

While thinking about this, I also noticed that we have two methods for creating a filtered dataset that return different results.

If you filter while loading the data, the prev/next refs are set taking into account only the specified event types. In this case, the previous shot:

from kloppy import statsbomb
dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# StatsBombShotEvent(qualifiers=[BodyPartQualifier(value=<BodyPart.RIGHT_FOOT: 'RIGHT_FOOT'>)])

If you filter after loading the data, the refs are set taking into account all events:

dataset = statsbomb.load_open_data().filter("shot")
dataset.records[1].prev()
# GenericEvent(...)

I have refactored the EventDataDeserializer to standardize on the second approach. Accessing .prev() or .next() on a filtered dataset should reflect the actual event sequence of the match (e.g., the pass that led to the shot), rather than the previous record in the filtered list.

dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# GenericEvent(...)

In terms of performance, this change makes no significant difference.
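The behavior standardized on above can be illustrated with a minimal sketch. The `Event`, `prev()`, and `link()` names here are hypothetical stand-ins, not the kloppy API; the point is that refs are linked over the full match sequence before filtering, so a filtered record's `prev()` still resolves to its true predecessor.

```python
class Event:
    def __init__(self, event_type):
        self.event_type = event_type
        self._prev = None

    def prev(self):
        return self._prev


def link(events):
    # Link prev refs over the FULL sequence, before any filtering
    for earlier, later in zip(events, events[1:]):
        later._prev = earlier
    return events


match = link([Event("pass"), Event("shot"), Event("pass"), Event("shot")])
shots = [e for e in match if e.event_type == "shot"]

# The second shot's prev() is the pass that led to it,
# not the previous record in the filtered list (the first shot)
print(shots[1].prev().event_type)  # pass
```

Had the refs been linked after filtering, `shots[1].prev()` would have returned the first shot, losing the build-up context the PR wants to preserve.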

3. Enhanced Metadata Merging

As a side effect of the previous changes, I also improved how additional_metadata is handled during deserialization. Currently, only the StatsBomb deserializer uses it, but we should move towards supporting it for all deserializers.

- Implemented dynamic class generation for filtered datasets. `Dataset.filter()` now returns a dynamically generated subclass (e.g., `FilteredEventDataset`) that inherits from both `FilteredDataset` (mixin) and the original dataset class. This avoids strict inheritance coupling while preserving type checks.
- Refactored `EventDataDeserializer` to use the Template Method pattern:
  - `deserialize()` is now concrete and handles standard filtering and metadata merging.
  - `_deserialize()` is the new abstract hook for provider-specific parsing logic.
- Added robust metadata merging in `deserialize`:
  - Valid fields in `additional_metadata` update the Metadata object directly.
  - Unsupported keys are automatically merged into `metadata.attributes` with a warning, preventing data loss or TypeErrors.
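The Template Method structure and the metadata-merging rules in the list above can be sketched like this. The `EventDataDeserializer`, `deserialize()`, and `_deserialize()` names come from the PR; the `Metadata` fields, the `DummyDeserializer`, and the merging helper logic are illustrative assumptions, not the real kloppy classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field, fields
import warnings


@dataclass
class Metadata:
    provider: str = ""
    attributes: dict = field(default_factory=dict)


class EventDataDeserializer(ABC):
    def deserialize(self, inputs, additional_metadata=None):
        # Template Method: concrete driver delegating provider-specific
        # parsing to the _deserialize() hook
        metadata, events = self._deserialize(inputs)
        valid = {f.name for f in fields(Metadata)} - {"attributes"}
        for key, value in (additional_metadata or {}).items():
            if key in valid:
                # Valid fields update the Metadata object directly
                setattr(metadata, key, value)
            else:
                # Unsupported keys fall back to attributes, with a warning,
                # instead of raising a TypeError or being silently dropped
                warnings.warn(
                    f"Unsupported metadata key {key!r}; "
                    "stored in metadata.attributes"
                )
                metadata.attributes[key] = value
        return metadata, events

    @abstractmethod
    def _deserialize(self, inputs):
        """Provider-specific parsing; returns (Metadata, events)."""


class DummyDeserializer(EventDataDeserializer):
    def _deserialize(self, inputs):
        return Metadata(), list(inputs)


meta, events = DummyDeserializer().deserialize(
    ["shot"],
    additional_metadata={"provider": "statsbomb", "season": "2025/26"},
)
print(meta.provider)    # statsbomb
print(meta.attributes)  # {'season': '2025/26'}
```

With this split, every provider gets the same filtering and metadata-merging behavior for free, and only the parsing logic differs per deserializer.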
@probberechts probberechts requested a review from koenvo December 17, 2025 19:08
@probberechts probberechts changed the title from "refactor: Refactor dataset filtering and event deserialization" to "Refactor dataset filtering and event deserialization" Dec 17, 2025
"""
return replace(
self,
records=self.find_all(filter_),
Contributor

What about adding an `is_filtered=True` flag? It might get lost on subsequent actions?

Contributor Author

Simply adding an `is_filtered` flag was indeed my first idea. I didn't really like it for two reasons:

  1. It's somewhat hidden. A user really has to know this flag exists, while changing the class is very obvious.
  2. If we rely on a filtered=True flag, every method that needs to behave differently for filtered data requires an internal conditional check. By using a distinct class, we keep the base EventDataset completely clean.
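The second point can be illustrated with a small contrast. Both classes below are hypothetical sketches, not the kloppy codebase, and `add_synthetic_event` is an invented method name standing in for any operation that must be refused on filtered data.

```python
class EventDatasetFlagged:
    """Flag-based design: the conditional leaks into the base class."""
    is_filtered = False

    def add_synthetic_event(self, event):
        if self.is_filtered:  # every such method needs this check
            raise ValueError("cannot insert events into a filtered dataset")
        return event


class EventDataset:
    """Class-based design: the base class stays completely clean."""
    def add_synthetic_event(self, event):
        return event


class FilteredDataset(EventDataset):
    """The filtered type carries the restriction itself."""
    def add_synthetic_event(self, event):
        raise ValueError("cannot insert events into a filtered dataset")
```

With the class-based design, the restriction travels with the type, so it also cannot silently "get lost" the way a mutable flag could.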

@probberechts probberechts modified the milestone: 3.19.0 Mar 10, 2026
@probberechts probberechts merged commit 851c9f2 into PySport:master Mar 10, 2026
20 checks passed