Problem
The website events table (hub-level click-through, catalog view, exhibition, and primary source views) currently resolves contributor names via live HTTP calls to api.dp.la at display time. The call path is:
EventsController#website_events
  → WebsiteEventsPresenter#item_contributor_lookup
    → DplaApiResponseBuilder#data_providers_for_items
      → HTTParty GET https://api.dp.la/v2/items/{batch}
data_providers_for_items batches item IDs in slices of 50. A full single-page GA4 response (up to 10,000 rows) triggers up to 200 sequential live API calls per page load. The CSV export path uses multi_page_response (up to 100,000 rows), raising that ceiling to 2,000 calls.
These calls happen on the hot request path, adding latency proportional to the number of items shown and creating a hard dependency on api.dp.la availability.
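The call counts above follow directly from the slice size. A minimal sketch of the arithmetic, assuming one HTTP request per slice of 50 IDs and at most one distinct item ID per row (the helper name `api_calls_needed` is illustrative, not part of the codebase):

```python
import math

BATCH_SIZE = 50  # slice size used by data_providers_for_items


def api_calls_needed(row_count: int, batch_size: int = BATCH_SIZE) -> int:
    """Upper bound on sequential API calls: one request per slice of item IDs."""
    return math.ceil(row_count / batch_size)


single_page_calls = api_calls_needed(10_000)   # full single-page GA4 response
csv_export_calls = api_calls_needed(100_000)   # multi_page_response ceiling
print(single_page_calls, csv_export_calls)
```

Because the calls are sequential, total latency scales with this count rather than with the size of any one request.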
Why pre-computation is possible
The set of item IDs that can ever appear in this table is bounded. It consists only of DPLA items that have been clicked through to (or otherwise interacted with) by users in the selected date range. This is a small subset of the full 45M-item catalogue — in practice, tens to hundreds of thousands of distinct IDs across all hubs across all time.
Once the date ceiling introduced in #289 is in place (restricting display to completed months only), the full set of displayable item IDs is knowable at rebuild time: it is exactly the set of item IDs returned by querying GA4 for all click-through events across all hubs through max_date. Any item ID a user can see at display time was already present in GA4 when the rebuild ran. No live API fallback is needed or correct — if an item ID is missing from the cache at display time, that is a rebuild bug, not a runtime condition to handle with an outbound call.
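One way to act on this invariant is to check coverage at rebuild time rather than at display time. A hedged sketch, assuming the rebuild holds the GA4 item IDs as a set and the resolved mapping as a dict (`missing_item_ids` is a hypothetical helper, not existing code):

```python
def missing_item_ids(ga4_item_ids: set[str], items: dict[str, str]) -> set[str]:
    """Return the GA4 item IDs that have no entry in the rebuilt mapping."""
    return {item_id for item_id in ga4_item_ids if item_id not in items}


missing = missing_item_ids(
    {"abc123", "def456"},
    {"abc123": "California Digital Library"},
)
if missing:
    print(f"rebuild incomplete: {len(missing)} item ID(s) unresolved")
```

Whether unresolved IDs should fail the rebuild or only be logged is a choice this issue leaves open.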
Proposed solution
1. Extend the monthly rebuild to produce item_data_providers.json
Extend the rebuild process (either generate_hub_stats.py itself or a new companion step) to:
- Collect all click-through item IDs from GA4 across all hubs: full history on the first run, then only the previous month on subsequent runs, merged into the existing file.
- Resolve dataProvider names from Elasticsearch directly, using the same ES connection the script already has, rather than going through the HTTP API. A terms query on _id handles batches of thousands in a single round-trip.
- Merge with the existing file from S3 to produce a cumulative mapping: new IDs are added, and existing ones are not re-fetched (dataProvider names rarely change).
- Write item_data_providers.json to s3://dashboard-analytics/hub-stats/ alongside hub_stats.json, using the same generated_at timestamp.
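The Elasticsearch step above can be sketched as a function that builds one request body per batch. This is a sketch under assumptions: the batch size of 5,000, the helper names, and the `_source` filter on `dataProvider` are illustrative, and the actual field path in the DPLA index may differ. Two Elasticsearch defaults bound the batch size: a terms query accepts at most `index.max_terms_count` terms (65,536 by default), and `size` cannot exceed `index.max_result_window` (10,000 by default).

```python
from typing import Iterator

ES_BATCH_SIZE = 5_000  # assumed; must stay within the index limits noted above


def batched(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def terms_query_body(batch: list[str]) -> dict:
    """Build a search body that fetches only dataProvider for the given _ids."""
    return {
        "size": len(batch),            # one hit per requested ID at most
        "_source": ["dataProvider"],   # assumed field path
        "query": {"terms": {"_id": batch}},
    }


example_ids = [f"item-{n}" for n in range(12_000)]  # placeholder IDs
bodies = [terms_query_body(b) for b in batched(example_ids, ES_BATCH_SIZE)]
print(len(bodies))
```

Each body would be passed to the search call on the script's existing ES connection; that call is left out here because the connection setup is not shown in this issue.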
Output format:
{
  "generated_at": "2026-05-09T19:11:52.884439+00:00",
  "items": {
    "abc123...": "California Digital Library",
    "def456...": "Mountain West Digital Library"
  }
}
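A minimal sketch of how the cumulative merge could produce this structure, assuming the existing file has already been loaded from S3 as a dict (or is `None` on the first run). The function name `merge_item_data_providers` is hypothetical, and the S3 read and write are left out.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def merge_item_data_providers(
    existing: Optional[dict],
    resolved: dict[str, str],
    generated_at: datetime,
) -> dict:
    """Add newly resolved IDs to the cumulative mapping; cached entries are kept as-is."""
    items = dict(existing.get("items", {})) if existing else {}
    for item_id, name in resolved.items():
        items.setdefault(item_id, name)  # only fills IDs not already cached
    return {"generated_at": generated_at.isoformat(), "items": items}


previous = {"generated_at": "(previous run)", "items": {"abc123": "California Digital Library"}}
merged = merge_item_data_providers(
    previous,
    {"def456": "Mountain West Digital Library"},
    datetime(2026, 5, 9, 19, 11, 52, 884439, tzinfo=timezone.utc),
)
print(json.dumps(merged, indent=2))
```

Passing in the same `generated_at` value used for hub_stats.json keeps the two files' timestamps aligned.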
2. Add ItemDataProviders cache class in Rails
Parallel to HubStats, add a class that reads item_data_providers.json from S3 with a 24-hour Rails.cache TTL and exposes a lookup method:
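class ItemDataProviders
  EMPTY = {}.freeze

  # Returns a hash of item ID => dataProvider name for the IDs found in the cache.
  def self.lookup(item_ids)
    data = Rails.cache.fetch("item_data_providers", expires_in: 24.hours) do
      # fetch from S3, same pattern as HubStats
    rescue Aws::S3::Errors::ServiceError, JSON::ParserError
      EMPTY
    end
    # EMPTY has no "items" key, so default to an empty hash rather than calling slice on nil.
    data.fetch("items", EMPTY).slice(*item_ids)
  end
end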
3. Replace live API call in WebsiteEventsPresenter
def item_contributor_lookup
  @item_contributor_lookup ||= begin
    item_ids = rows.filter_map { |row| id(row) }
    ItemDataProviders.lookup(item_ids)
  end
end
DplaApiResponseBuilder#data_providers_for_items, and with it DplaApiResponseBuilder itself, can then be removed entirely, since neither will have any remaining callers.
Implementation notes
- GA4 credentials for the rebuild script: generate_hub_stats.py currently has no GA4 access (it queries ES directly). Two options: (a) store the GA4 service account credentials in Secrets Manager and fetch them in the Python script; (b) implement the GA4 collection step as a Rails Rake task using the existing GaResponseBuilder infrastructure, writing the item IDs to S3 for the Python script to pick up. Either approach works; (b) avoids duplicating GA4 client code.
- First-run backfill: on first deploy, item_data_providers.json does not exist. ItemDataProviders.lookup returns EMPTY gracefully (same pattern as HubStats before the first rebuild). The full backfill is populated on the next rebuild run, or can be triggered manually.
- dataProvider name changes: rare, but if a contributor is renamed in the index, the cached name becomes stale. The cumulative merge step can optionally re-resolve IDs seen in the previous month to catch renames, or accept eventual consistency (updated on next full rebuild).
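The two options for handling renames differ only in which IDs are sent to Elasticsearch each month. A sketch under that reading, where `ids_to_resolve` and the `refresh_recent` flag are hypothetical names:

```python
def ids_to_resolve(
    month_ids: set[str],
    cached_items: dict[str, str],
    refresh_recent: bool = False,
) -> set[str]:
    """Pick the item IDs to look up in Elasticsearch for this rebuild.

    refresh_recent=False: only IDs missing from the cache (eventual consistency).
    refresh_recent=True: every ID seen in the previous month, to catch renames.
    """
    if refresh_recent:
        return set(month_ids)
    return {item_id for item_id in month_ids if item_id not in cached_items}


cached = {"abc123": "California Digital Library"}
seen_last_month = {"abc123", "def456"}
print(sorted(ids_to_resolve(seen_last_month, cached)))
print(sorted(ids_to_resolve(seen_last_month, cached, refresh_recent=True)))
```

With `refresh_recent=True`, the merge step would also need to let freshly resolved names overwrite cached ones, otherwise the re-resolution has no effect.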
Dependencies
- Depends on the date ceiling introduced in #289. Until that is in place, DplaApiResponseBuilder should remain as a fallback rather than be deleted.
Out of scope
- max_date derivation from generated_at: covered in a comment on #289 (Cache GA4 responses at monthly granularity to eliminate load-time latency).
- min_date: covered in #290 (Date range picker and data availability messaging should reflect actual per-hub data start date, not hardcoded 2018).