Replace live DPLA API calls in website events table with S3-backed item_data_providers cache #312

Description

@DominicBM

Problem

The website events table (hub-level click-through, catalog view, exhibition, and primary source views) currently resolves contributor names via live HTTP calls to api.dp.la at display time. The call path is:

EventsController#website_events
  → WebsiteEventsPresenter#item_contributor_lookup
    → DplaApiResponseBuilder#data_providers_for_items
      → HTTParty GET https://api.dp.la/v2/items/{batch}

data_providers_for_items batches item IDs in slices of 50. A full single-page GA4 response (up to 10,000 rows) triggers up to 200 sequential live API calls per page load. The CSV export path uses multi_page_response (up to 100,000 rows), raising that ceiling to 2,000 calls.
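
The call counts above follow directly from the batch size. A minimal sketch of the arithmetic, assuming one HTTP call per slice of 50 IDs and at most one item ID per row (the constant and method names here are illustrative, not taken from the codebase):

```ruby
# Illustrative only: BATCH_SIZE mirrors the slice size described above.
BATCH_SIZE = 50

# Number of sequential API calls needed to resolve `row_count` item IDs,
# rounding up so a partial final batch still counts as one call.
def api_calls_for(row_count)
  (row_count + BATCH_SIZE - 1) / BATCH_SIZE
end

puts api_calls_for(10_000)   # single-page GA4 response
puts api_calls_for(100_000)  # multi-page CSV export
```

Duplicate item IDs across rows would lower the real count, so these figures are upper bounds.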

These calls happen on the hot request path, adding latency proportional to the number of items shown and creating a hard dependency on api.dp.la availability.

Why pre-computation is possible

The set of item IDs that can ever appear in this table is bounded. It consists only of DPLA items that have been clicked through to (or otherwise interacted with) by users in the selected date range. This is a small subset of the full 45M-item catalogue — in practice, tens to hundreds of thousands of distinct IDs across all hubs across all time.

Once the date ceiling introduced in #289 is in place (restricting display to completed months only), the full set of displayable item IDs is knowable at rebuild time: it is exactly the set of item IDs returned by querying GA4 for all click-through events across all hubs through max_date. Any item ID a user can see at display time was already present in GA4 when the rebuild ran. A live API fallback is therefore neither needed nor appropriate: if an item ID is missing from the cache at display time, that is a rebuild bug, not a runtime condition to handle with an outbound call.

Proposed solution

1. Extend the monthly rebuild to produce item_data_providers.json

Extend the rebuild process (either generate_hub_stats.py itself or a new companion step) to:

  1. Collect all click-through item IDs from GA4 across all hubs, for all time (full history on first run; incremental — previous month only — on subsequent runs, merging into the existing file).
  2. Resolve dataProvider names from Elasticsearch directly using the same ES connection the script already has, rather than going through the HTTP API. A terms query on _id handles batches of thousands in a single round-trip.
  3. Merge with the existing file from S3 to produce a cumulative mapping — new IDs are added; existing ones are not re-fetched (dataProvider names rarely change).
  4. Write item_data_providers.json to s3://dashboard-analytics/hub-stats/ alongside hub_stats.json, using the same generated_at timestamp.
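
The merge logic in steps 2 and 3 can be sketched as follows. This is written in Ruby only to match the other snippets in this issue; the rebuild script itself is Python, and the same logic would be ported there. The helper names, the stubbed resolver, and the "dataProvider" source field name are assumptions for illustration, not existing code.

```ruby
require "json"

# Request body for an Elasticsearch terms query on _id.
# Assumption: the provider name is stored in a "dataProvider" source field.
def terms_query_body(ids)
  { "size" => ids.size, "_source" => ["dataProvider"],
    "query" => { "terms" => { "_id" => ids } } }
end

# Hypothetical helper: merges newly seen item IDs into the existing mapping.
# `resolver` stands in for the Elasticsearch lookup; it receives an array of
# IDs and returns a Hash of id => dataProvider name.
def merge_item_data_providers(existing, seen_ids, generated_at:, &resolver)
  known = existing.fetch("items", {})
  missing = seen_ids.uniq - known.keys   # only resolve IDs not already cached
  resolved = missing.empty? ? {} : resolver.call(missing)
  { "generated_at" => generated_at, "items" => known.merge(resolved) }
end

existing = { "generated_at" => "2026-04-01T00:00:00+00:00",
             "items" => { "abc123" => "California Digital Library" } }

# Stubbed resolver for illustration; a real one would query Elasticsearch.
stub = ->(ids) { ids.to_h { |id| [id, "Provider for #{id}"] } }

merged = merge_item_data_providers(existing, %w[abc123 def456],
                                   generated_at: "2026-05-09T19:11:52+00:00", &stub)
puts JSON.pretty_generate(merged)
```

Because only IDs absent from the existing file reach the resolver, an incremental run does work proportional to the number of new IDs rather than to the size of the cumulative mapping.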

Output format:

{
  "generated_at": "2026-05-09T19:11:52.884439+00:00",
  "items": {
    "abc123...": "California Digital Library",
    "def456...": "Mountain West Digital Library"
  }
}

2. Add ItemDataProviders cache class in Rails

Parallel to HubStats, add a class that reads item_data_providers.json from S3 with a 24-hour Rails.cache TTL and exposes a lookup method:

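class ItemDataProviders
  EMPTY = {}.freeze

  # Returns a Hash of item ID => dataProvider name for the IDs found in the cache.
  def self.lookup(item_ids)
    data = Rails.cache.fetch("item_data_providers", expires_in: 24.hours) do
      # fetch from S3 and parse the JSON, same pattern as HubStats
    rescue Aws::S3::Errors::ServiceError, JSON::ParserError
      # Note: this fallback value is cached for the full TTL as well.
      EMPTY
    end
    # EMPTY has no "items" key, so default to an empty hash instead of
    # calling slice on nil.
    (data || EMPTY).fetch("items", EMPTY).slice(*item_ids)
  end
end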
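
To make the lookup semantics concrete, here is a plain-Ruby sketch that needs neither Rails nor S3. The `data` hash stands in for the parsed item_data_providers.json payload, and the ID "zzz999" is a made-up example of an ID missing from the cache. It relies on Hash#slice silently dropping keys that are not present, and it defaults to an empty hash when the payload has no "items" key.

```ruby
# Stand-in for the parsed item_data_providers.json payload.
data = {
  "generated_at" => "2026-05-09T19:11:52.884439+00:00",
  "items" => {
    "abc123" => "California Digital Library",
    "def456" => "Mountain West Digital Library"
  }
}

requested = %w[abc123 zzz999]

# IDs present in the cache are returned; unknown IDs are simply absent.
found = data.fetch("items", {}).slice(*requested)

# Per the reasoning above, a non-empty `missing` list indicates a rebuild
# bug, so it is something to log rather than to resolve with a live call.
missing = requested - found.keys

# An empty payload (the first-deploy case) yields an empty result, not an error.
empty_result = {}.fetch("items", {}).slice(*requested)

puts found.inspect
puts missing.inspect
puts empty_result.inspect
```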

3. Replace live API call in WebsiteEventsPresenter

def item_contributor_lookup
  @item_contributor_lookup ||= begin
    item_ids = rows.filter_map { |row| id(row) }
    ItemDataProviders.lookup(item_ids)
  end
end

DplaApiResponseBuilder#data_providers_for_items and DplaApiResponseBuilder itself can then be removed entirely — it will have no remaining callers.

Implementation notes

  • GA4 credentials for the rebuild script: generate_hub_stats.py currently has no GA4 access (it queries ES directly). Two options: (a) store the GA4 service account credentials in Secrets Manager and fetch them in the Python script; (b) implement the GA4 collection step as a Rails Rake task using the existing GaResponseBuilder infrastructure, writing the item IDs to S3 for the Python script to pick up. Either approach works; (b) avoids duplicating GA4 client code.
  • First-run backfill: on first deploy, item_data_providers.json does not exist. ItemDataProviders.lookup returns EMPTY gracefully (same pattern as HubStats before the first rebuild). The full backfill is populated on the next rebuild run, or can be triggered manually.
  • dataProvider name changes: rare, but if a contributor is renamed in the index, the cached name becomes stale. The cumulative merge step can optionally re-resolve IDs seen in the previous month to catch renames, or accept eventual consistency (updated on next full rebuild).

Dependencies

Out of scope
