Topic 3: Explore tokenizing the recordedBy  #28

@timrobertson100

The current algorithm does not handle recordedBy values that list multiple collectors.
For example, a record with recordedBy=Tim Robertson|Nicky Nicolson and another with recordedBy=Tim Robertson are not considered to overlap on recordedBy, even though they share a collector.

@nickynicolson has previous work that parses recordedBy into tokens, accommodating the variety of delimiters in use (, | etc.). It is written in Python, so it is not easily portable to Java.
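
To make the idea concrete, here is a minimal sketch of delimiter-based tokenizing. It is not @nickynicolson's implementation: the function name `tokenize_recorded_by`, the delimiter set, and the lowercasing are all assumptions made for illustration. Commas are off by default because a comma can also separate a surname from a given name ("Robertson, Tim").

```python
import re

# Hypothetical delimiter sets, chosen for illustration only.
# Base set: pipe, semicolon, ampersand, and the standalone word "and".
_BASE = r"\s*(?:[|;&]|\band\b)\s*"
# Same set plus comma, which is ambiguous ("Robertson, Tim").
_WITH_COMMA = r"\s*(?:[|;&,]|\band\b)\s*"


def tokenize_recorded_by(recorded_by, split_on_comma=False):
    """Split a recordedBy string into a list of normalised name tokens."""
    if not recorded_by:
        return []
    pattern = _WITH_COMMA if split_on_comma else _BASE
    tokens = []
    for part in re.split(pattern, recorded_by, flags=re.IGNORECASE):
        # Collapse internal whitespace and lowercase for comparison.
        name = " ".join(part.split()).lower()
        if name and name not in tokens:
            tokens.append(name)
    return tokens
```

With this sketch, `tokenize_recorded_by("Tim Robertson|Nicky Nicolson")` and `tokenize_recorded_by("Tim Robertson")` both contain the token `tim robertson`, which is the overlap the current comparison misses.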

To determine whether this approach is worth exploring, we could create a new table that tokenises the recordedBy string into an array of names, then use a SQL JOIN to create a new occurrence table containing this field (e.g. tokenizedRecordedBy). The clustering could then be modified to use this field in both the blocking and the compare stages, and a report generated on the impact.
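
As a rough illustration of how a tokenized field could feed both stages, the sketch below blocks records on each individual name token and then compares candidate pairs by token overlap. It is a simplification under stated assumptions: the real clustering blocks and compares on other fields as well, and the record layout, `candidate_pairs`, and `recorded_by_overlaps` are hypothetical names, not part of the existing code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records that already carry the tokenized array.
records = [
    {"id": 1, "tokenizedRecordedBy": ["tim robertson", "nicky nicolson"]},
    {"id": 2, "tokenizedRecordedBy": ["tim robertson"]},
    {"id": 3, "tokenizedRecordedBy": ["someone else"]},
]


def candidate_pairs(records):
    """Blocking: group record ids by token, then pair ids within each group."""
    blocks = defaultdict(set)
    for record in records:
        for token in record["tokenizedRecordedBy"]:
            blocks[token].add(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def recorded_by_overlaps(tokens_a, tokens_b):
    """Compare: true when the two records share at least one name token."""
    return bool(set(tokens_a) & set(tokens_b))


pairs = candidate_pairs(records)
```

Here records 1 and 2 land in the same block through the shared token, while record 3 is never paired with either. Counting how many additional pairs appear compared with blocking on the raw string would be one way to produce the impact report.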

If this identifies useful links, we could then explore how best to incorporate it into the clustering.
