Description
The current algorithm does not accommodate variation in recordedBy that includes multiple collectors.
For example, recordedBy is not treated as overlapping between a record with recordedBy=Tim Robertson|Nicky Nicolson and another with recordedBy=Tim Robertson.
@nickynicolson has previous work that attempts to parse recordedBy into tokens, accommodating the variety of delimiters used (",", "|", etc.). That work is in Python, so it is not easily portable to Java.
To determine whether this approach is worth exploring, we could create a new table that tokenises the recordedBy string into an array of names, and then add a SQL JOIN to create a new occurrence table containing this field (e.g. tokenizedRecordedBy). The clustering could then be modified to use this field in both the blocking and the compare stages, and a report generated on the impact.
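As a rough illustration of the tokenising step, here is a minimal Python sketch. It is not @nickynicolson's parser and not the existing clustering code: the function names `tokenize_recorded_by` and `recorded_by_overlaps`, the delimiter set, and the lower-casing normalisation are all assumptions made for the example. Commas are deliberately left out of the delimiter set here, because a comma can also separate surname and given name within one collector ("Robertson, Tim"); handling that would need more careful parsing.

```python
import re

# Hypothetical delimiter set (assumption): "|", ";", "&" and the word "and".
# Commas are intentionally excluded because they are ambiguous in names.
_DELIMITERS = re.compile(r"\s*(?:\||;|&|\band\b)\s*", flags=re.IGNORECASE)


def tokenize_recorded_by(recorded_by):
    """Split a recordedBy string into a list of normalised collector names."""
    if not recorded_by:
        return []
    tokens = []
    for part in _DELIMITERS.split(recorded_by):
        # Collapse internal whitespace and lower-case for comparison.
        name = " ".join(part.split()).lower()
        if name and name not in tokens:
            tokens.append(name)
    return tokens


def recorded_by_overlaps(a, b):
    """Return True if two recordedBy strings share at least one collector token."""
    return bool(set(tokenize_recorded_by(a)) & set(tokenize_recorded_by(b)))


if __name__ == "__main__":
    print(tokenize_recorded_by("Tim Robertson|Nicky Nicolson"))
    print(recorded_by_overlaps("Tim Robertson|Nicky Nicolson", "Tim Robertson"))
```

With tokens like these stored in a tokenizedRecordedBy array, the compare stage could test for a shared token rather than requiring the whole string to match, which would let the two records in the example above be treated as overlapping.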
If this identifies useful links, the best approach to incorporate this into the clustering could be explored.