
Refactor dataset filtering and event deserialization#521

Merged
probberechts merged 2 commits into PySport:master from probberechts:feat/filtered-dataset
Mar 10, 2026

Conversation

@probberechts
Contributor

Context

For #421 (and related PRs) we need to be able to distinguish between complete datasets and filtered datasets. For example, we should be able to raise an exception when a user tries to insert synthetically generated carries into a dataset that contains only shots. I was trying to figure out the cleanest way to support this. Here is a proposal.

Changes

1. Dynamic Class Generation for Filtered Datasets

I implemented a FilteredDataset mixin that uses dynamic class generation. When Dataset.filter() is called, it now returns a dynamically generated subclass (e.g., FilteredEventDataset). This approach preserves the original type hierarchy while allowing us to tag the instance as filtered.

Example

from kloppy import statsbomb
from kloppy.domain import EventDataset, FilteredDataset

dataset = statsbomb.load_open_data()
filtered = dataset.filter("pass")

print(type(filtered))  
# <class 'FilteredEventDataset'>

print(isinstance(filtered, EventDataset)) 
# True (It still works as an EventDataset)

print(isinstance(filtered, FilteredDataset)) 
# True (It is now identified as filtered)
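The mixin approach above can be sketched as follows. This is an illustrative, self-contained reconstruction of the idea, not the actual kloppy implementation; the class names mirror the PR text, but the constructor and the predicate-based filter signature are simplifications.

```python
class FilteredDataset:
    """Marker mixin identifying a dataset as filtered."""


class Dataset:
    def __init__(self, records):
        self.records = records

    def filter(self, predicate):
        base_cls = type(self)
        if issubclass(base_cls, FilteredDataset):
            # Already filtered: re-filtering should not stack class names
            filtered_cls = base_cls
        else:
            # Dynamically generate e.g. FilteredEventDataset, inheriting
            # from both the mixin and the concrete dataset class
            filtered_cls = type(
                f"Filtered{base_cls.__name__}",
                (FilteredDataset, base_cls),
                {},
            )
        result = filtered_cls.__new__(filtered_cls)
        result.__dict__.update(self.__dict__)
        result.records = [r for r in self.records if predicate(r)]
        return result


class EventDataset(Dataset):
    pass


dataset = EventDataset(records=["pass", "shot", "pass"])
filtered = dataset.filter(lambda r: r == "pass")
print(type(filtered).__name__)                # FilteredEventDataset
print(isinstance(filtered, EventDataset))     # True
print(isinstance(filtered, FilteredDataset))  # True
```

Because the subclass is generated from the runtime type, provider-specific subclasses of `EventDataset` keep their full behavior after filtering, and `isinstance` checks against both the original class and the mixin work as shown in the example above.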

2. Standardized Filtering Behavior & Event References

While thinking about this, I also noticed that we have two methods for creating a filtered dataset that return different results.

If you filter while loading the data, the prev/next refs are set taking into account only the specified event types. In this case, the previous shot:

from kloppy import statsbomb
dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# StatsBombShotEvent(qualifiers=[BodyPartQualifier(value=<BodyPart.RIGHT_FOOT: 'RIGHT_FOOT'>)])

If you filter after loading the data, the refs are set taking into account all events:

dataset = statsbomb.load_open_data().filter("shot")
dataset.records[1].prev()
# GenericEvent(...)

I have refactored the EventDataDeserializer to standardize on the second approach. Accessing .prev() or .next() on a filtered dataset should reflect the actual event sequence of the match (e.g., the pass that led to the shot), rather than the previous record in the filtered list.

dataset = statsbomb.load_open_data(event_types=["shot"])
dataset.records[1].prev()
# GenericEvent(...)

In terms of performance, this change makes no significant difference.
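The behavior standardized on above can be illustrated with a minimal sketch. The `Event`, `prev()`, and `link()` names here are hypothetical stand-ins, not the kloppy API; the point is that refs are linked over the full match sequence before filtering, so a filtered record's `prev()` still resolves to its true predecessor.

```python
class Event:
    def __init__(self, event_type):
        self.event_type = event_type
        self._prev = None

    def prev(self):
        return self._prev


def link(events):
    # Link prev refs over the FULL sequence, before any filtering
    for earlier, later in zip(events, events[1:]):
        later._prev = earlier
    return events


match = link([Event("pass"), Event("shot"), Event("pass"), Event("shot")])
shots = [e for e in match if e.event_type == "shot"]

# The second shot's prev() is the pass that led to it,
# not the previous record in the filtered list (the first shot)
print(shots[1].prev().event_type)  # pass
```

Had the refs been linked after filtering, `shots[1].prev()` would have returned the first shot, losing the build-up context the PR wants to preserve.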

3. Enhanced Metadata Merging

As a side effect of the previous changes, I also improved how additional_metadata is handled during deserialization. Currently, only the StatsBomb deserializer uses it, but we should move towards supporting it for all deserializers.

- Implemented dynamic class generation for filtered datasets. `Dataset.filter()` now returns a dynamically generated subclass (e.g., `FilteredEventDataset`) that inherits from both `FilteredDataset` (mixin) and the original dataset class. This avoids strict inheritance coupling while preserving type checks.
- Refactored `EventDataDeserializer` to use the Template Method pattern:
  - `deserialize()` is now concrete and handles standard filtering and metadata merging.
  - `_deserialize()` is the new abstract hook for provider-specific parsing logic.
- Added robust metadata merging in `deserialize`:
  - Valid fields in `additional_metadata` update the Metadata object directly.
  - Unsupported keys are automatically merged into `metadata.attributes` with a warning, preventing data loss or TypeErrors.
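The Template Method structure and the metadata-merging rules in the list above can be sketched like this. The `EventDataDeserializer`, `deserialize()`, and `_deserialize()` names come from the PR; the `Metadata` fields, the `DummyDeserializer`, and the merging helper logic are illustrative assumptions, not the real kloppy classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field, fields
import warnings


@dataclass
class Metadata:
    provider: str = ""
    attributes: dict = field(default_factory=dict)


class EventDataDeserializer(ABC):
    def deserialize(self, inputs, additional_metadata=None):
        # Template Method: concrete driver delegating provider-specific
        # parsing to the _deserialize() hook
        metadata, events = self._deserialize(inputs)
        valid = {f.name for f in fields(Metadata)} - {"attributes"}
        for key, value in (additional_metadata or {}).items():
            if key in valid:
                # Valid fields update the Metadata object directly
                setattr(metadata, key, value)
            else:
                # Unsupported keys fall back to attributes, with a warning,
                # instead of raising a TypeError or being silently dropped
                warnings.warn(
                    f"Unsupported metadata key {key!r}; "
                    "stored in metadata.attributes"
                )
                metadata.attributes[key] = value
        return metadata, events

    @abstractmethod
    def _deserialize(self, inputs):
        """Provider-specific parsing; returns (Metadata, events)."""


class DummyDeserializer(EventDataDeserializer):
    def _deserialize(self, inputs):
        return Metadata(), list(inputs)


meta, events = DummyDeserializer().deserialize(
    ["shot"],
    additional_metadata={"provider": "statsbomb", "season": "2025/26"},
)
print(meta.provider)    # statsbomb
print(meta.attributes)  # {'season': '2025/26'}
```

With this split, every provider gets the same filtering and metadata-merging behavior for free, and only the parsing logic differs per deserializer.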
@probberechts probberechts requested a review from koenvo December 17, 2025 19:08
@probberechts probberechts changed the title from "refactor: Refactor dataset filtering and event deserialization" to "Refactor dataset filtering and event deserialization" Dec 17, 2025
"""
return replace(
self,
records=self.find_all(filter_),
Contributor

What about adding an `is_filtered=True` flag? It might get lost on subsequent actions?

Contributor Author

Simply adding an `is_filtered` flag was indeed my first idea. I didn't really like it for two reasons:

  1. It's somewhat hidden. A user really has to know this flag exists, while changing the class is very obvious.
  2. If we rely on a filtered=True flag, every method that needs to behave differently for filtered data requires an internal conditional check. By using a distinct class, we keep the base EventDataset completely clean.
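The second point can be illustrated with a small contrast. Both classes below are hypothetical sketches, not the kloppy codebase, and `add_synthetic_event` is an invented method name standing in for any operation that must be refused on filtered data.

```python
class EventDatasetFlagged:
    """Flag-based design: the conditional leaks into the base class."""
    is_filtered = False

    def add_synthetic_event(self, event):
        if self.is_filtered:  # every such method needs this check
            raise ValueError("cannot insert events into a filtered dataset")
        return event


class EventDataset:
    """Class-based design: the base class stays completely clean."""
    def add_synthetic_event(self, event):
        return event


class FilteredDataset(EventDataset):
    """The filtered type carries the restriction itself."""
    def add_synthetic_event(self, event):
        raise ValueError("cannot insert events into a filtered dataset")
```

With the class-based design, the restriction travels with the type, so it also cannot silently "get lost" the way a mutable flag could.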

@probberechts probberechts modified the milestone: 3.19.0 Mar 10, 2026
@probberechts probberechts merged commit 851c9f2 into PySport:master Mar 10, 2026
20 checks passed