Replace live DPLA API calls in website events table with S3-backed item_data_providers cache #312

## Problem

The website events table (hub-level click-through, catalog view, exhibition, and primary source views) currently resolves contributor names via live HTTP calls to `api.dp.la` at display time. The call path is:

```
EventsController#website_events
  → WebsiteEventsPresenter#item_contributor_lookup
    → DplaApiResponseBuilder#data_providers_for_items
      → HTTParty GET https://api.dp.la/v2/items/{batch}
```

`data_providers_for_items` batches item IDs in slices of 50. A full single-page GA4 response (up to 10,000 rows) triggers up to **200 sequential live API calls** per page load. The CSV export path uses `multi_page_response` (up to 100,000 rows), raising that ceiling to 2,000 calls.
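
The call counts follow directly from the batch size. A quick check in plain Ruby, using a range of placeholder row numbers rather than real GA4 rows (the helper name `api_calls_needed` is made up for illustration):

```ruby
BATCH_SIZE = 50 # slice size used by data_providers_for_items, per the text above

# One live request is issued per slice, so the call count is the number of slices.
def api_calls_needed(row_count, batch_size = BATCH_SIZE)
  (1..row_count).each_slice(batch_size).count
end

single_page_calls = api_calls_needed(10_000)   # single-page GA4 response => 200
csv_export_calls  = api_calls_needed(100_000)  # multi_page_response ceiling => 2000
```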

These calls happen on the hot request path, adding latency proportional to the number of items shown and creating a hard dependency on `api.dp.la` availability.

## Why pre-computation is possible

The set of item IDs that can ever appear in this table is bounded. It consists only of DPLA items that users have clicked through to (or otherwise interacted with); any single view shows the part of that set that falls in the selected date range. This is a small subset of the full 45M-item catalogue: in practice, tens to hundreds of thousands of distinct IDs across all hubs across all time.

Once the date ceiling introduced in #289 is in place (restricting display to completed months only), the full set of displayable item IDs is knowable at rebuild time: it is exactly the set of item IDs returned by querying GA4 for all click-through events across all hubs through `max_date`. Any item ID a user can see at display time was already present in GA4 when the rebuild ran. **No live API fallback is needed or correct** — if an item ID is missing from the cache at display time, that is a rebuild bug, not a runtime condition to handle with an outbound call.

## Proposed solution

### 1. Extend the monthly rebuild to produce `item_data_providers.json`

Extend the rebuild process (either `generate_hub_stats.py` itself or a new companion step) to:

1. **Collect all click-through item IDs from GA4** across all hubs, for all time (full history on first run; incremental — previous month only — on subsequent runs, merging into the existing file).
2. **Resolve `dataProvider` names from Elasticsearch directly** using the same ES connection the script already has, rather than going through the HTTP API. A `terms` query on `_id` handles batches of thousands in a single round-trip.
3. **Merge with the existing file** from S3 to produce a cumulative mapping — new IDs are added; existing ones are not re-fetched (dataProvider names rarely change).
4. **Write `item_data_providers.json`** to `s3://dashboard-analytics/hub-stats/` alongside `hub_stats.json`, using the same `generated_at` timestamp.

Output format:
```json
{
  "generated_at": "2026-05-09T19:11:52.884439+00:00",
  "items": {
    "abc123...": "California Digital Library",
    "def456...": "Mountain West Digital Library"
  }
}
```
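
The resolve-and-merge steps can be sketched as follows. This is only an illustration of the logic, written in Ruby for consistency with the other snippets here (the rebuild script itself is Python). The batch size, the `_source` filter on `dataProvider`, and the helper names `terms_query_bodies`, `merge_item_data_providers`, and `resolver` are all assumptions, not existing code.

```ruby
require "json"

ES_BATCH_SIZE = 5_000 # assumed; the proposal only says "batches of thousands"

# Build one Elasticsearch request body per batch of unresolved IDs.
# The `_source` filter is an assumption about the index mapping.
def terms_query_bodies(ids, batch_size = ES_BATCH_SIZE)
  ids.each_slice(batch_size).map do |batch|
    { size: batch.size, _source: ["dataProvider"], query: { terms: { _id: batch } } }
  end
end

# Merge newly resolved names into the existing mapping without re-fetching known IDs.
# `resolver` is a hypothetical callable: it takes an array of IDs and returns { id => name }.
def merge_item_data_providers(existing_items, seen_ids, generated_at:, resolver:)
  unresolved = seen_ids.uniq - existing_items.keys
  resolved = unresolved.empty? ? {} : resolver.call(unresolved)
  { "generated_at" => generated_at, "items" => existing_items.merge(resolved) }
end

existing = { "abc123" => "California Digital Library" }
# Stand-in for the Elasticsearch round-trip; returns a fixed name for every ID.
fake_resolver = ->(ids) { ids.to_h { |id| [id, "Mountain West Digital Library"] } }

document = merge_item_data_providers(
  existing,
  %w[abc123 def456],
  generated_at: "2026-05-09T19:11:52.884439+00:00",
  resolver: fake_resolver
)
puts JSON.pretty_generate(document)
```

Only `def456` reaches the resolver in this run, because `abc123` is already present in the existing mapping.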

### 2. Add `ItemDataProviders` cache class in Rails

Parallel to `HubStats`, add a class that reads `item_data_providers.json` from S3 with a 24-hour `Rails.cache` TTL and exposes a lookup method:

```ruby
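class ItemDataProviders
  EMPTY = {}.freeze

  # Returns { item_id => dataProvider name } for the requested IDs found in the cached file.
  def self.lookup(item_ids)
    data = Rails.cache.fetch("item_data_providers", expires_in: 24.hours) do
      # fetch from S3, same pattern as HubStats
    rescue Aws::S3::Errors::ServiceError, JSON::ParserError
      # Note: this value is cached too, so a failed fetch is remembered until the entry expires.
      EMPTY
    end
    # EMPTY has no "items" key, so fall back to an empty hash rather than calling #slice on nil.
    (data || EMPTY).fetch("items", EMPTY).slice(*item_ids)
  end
end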
```
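
To illustrate what the lookup returns for present and missing IDs, here is a plain Ruby sketch with no Rails or S3 involved. It parses a small document in the output format above and uses `fetch("items", {})`, so that an empty document yields an empty hash. The ID `not-in-cache` is a made-up example.

```ruby
require "json"

# A tiny stand-in for the contents of item_data_providers.json.
raw = <<~DOC
  {
    "generated_at": "2026-05-09T19:11:52.884439+00:00",
    "items": {
      "abc123": "California Digital Library",
      "def456": "Mountain West Digital Library"
    }
  }
DOC

data = JSON.parse(raw)

# Hash#slice keeps only the requested keys that exist; unknown IDs are simply absent.
found = data.fetch("items", {}).slice("abc123", "not-in-cache")

# With an empty document (for example, before the first rebuild) the result is an empty hash.
empty_result = {}.fetch("items", {}).slice("abc123")
```

Under this design, an ID missing from `found` is a signal that the rebuild missed it, not a cue to make an outbound call.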

### 3. Replace live API call in `WebsiteEventsPresenter`

```ruby
def item_contributor_lookup
  @item_contributor_lookup ||= begin
    item_ids = rows.filter_map { |row| id(row) }
    ItemDataProviders.lookup(item_ids)
  end
end
```

`DplaApiResponseBuilder#data_providers_for_items`, and `DplaApiResponseBuilder` itself, can then be removed entirely, since the class will have no remaining callers. Per the Dependencies section, the removal should wait until #289 has landed.

## Implementation notes

- **GA4 credentials for the rebuild script**: `generate_hub_stats.py` currently has no GA4 access (it queries ES directly). Two options: (a) store the GA4 service account credentials in Secrets Manager and fetch them in the Python script; (b) implement the GA4 collection step as a Rails Rake task using the existing `GaResponseBuilder` infrastructure, writing the item IDs to S3 for the Python script to pick up. Either approach works; (b) avoids duplicating GA4 client code.
- **First-run backfill**: on first deploy, `item_data_providers.json` does not exist. `ItemDataProviders.lookup` should return an empty hash gracefully (same pattern as `HubStats` before the first rebuild). The file is populated by the next rebuild run, which can also be triggered manually.
- **`dataProvider` name changes**: rare, but if a contributor is renamed in the index, the cached name becomes stale. The cumulative merge step can optionally re-resolve IDs seen in the previous month to catch renames, or accept eventual consistency (updated on next full rebuild).
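
The optional rename check could look roughly like this. It is a sketch only: `ids_to_resolve` and `recheck_ids` are hypothetical names, and the contributor names are placeholders.

```ruby
# Decide which IDs to send to Elasticsearch on a monthly run.
# IDs in `recheck_ids` (for example, those seen in the previous month) are
# re-resolved even when they are already cached, so renames get picked up.
def ids_to_resolve(existing_items, seen_ids, recheck_ids: [])
  new_ids = seen_ids - existing_items.keys
  (new_ids + recheck_ids).uniq
end

existing = {
  "abc123" => "Old Contributor Name",
  "def456" => "Mountain West Digital Library"
}
previous_month_ids = %w[abc123 ghi789]

to_resolve = ids_to_resolve(existing, previous_month_ids, recheck_ids: previous_month_ids)

# Names returned for these IDs overwrite the cached entries on merge.
refreshed = existing.merge(
  "abc123" => "New Contributor Name",
  "ghi789" => "California Digital Library"
)
```

IDs not seen in the previous month (`def456` here) are left untouched, which is the eventual-consistency trade-off described above.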

## Dependencies

- **Blocked by #289** for the no-fallback guarantee. Once #289's date ceiling is in place, the set of displayable item IDs is strictly bounded by completed months and the guarantee holds. Prior to #289, a user requesting the current month could surface item IDs not yet in the cache — in that interim period, `DplaApiResponseBuilder` should remain as a fallback rather than be deleted.
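
For the interim period, a cache-first lookup with a live fallback might be shaped like this. It is a minimal sketch: `cache_lookup` and `live_lookup` are hypothetical callables standing in for `ItemDataProviders.lookup` and `DplaApiResponseBuilder#data_providers_for_items`, and it assumes both return a hash of item ID to contributor name.

```ruby
# Cache first; only IDs absent from the cache trigger live calls.
def contributor_lookup_with_fallback(item_ids, cache_lookup:, live_lookup:)
  cached = cache_lookup.call(item_ids)
  missing = item_ids - cached.keys
  return cached if missing.empty?

  cached.merge(live_lookup.call(missing))
end

live_calls = [] # records which IDs reached the live path

cache = ->(ids) { { "abc123" => "California Digital Library" }.slice(*ids) }
live = lambda do |ids|
  live_calls << ids
  ids.to_h { |id| [id, "Resolved via api.dp.la"] }
end

result = contributor_lookup_with_fallback(
  %w[abc123 new999],
  cache_lookup: cache,
  live_lookup: live
)
```

Once #289 lands, the fallback branch should never fire, and it can be deleted along with `DplaApiResponseBuilder`.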

## Out of scope

- `max_date` derivation from `generated_at` — covered in the #289 comment (see [this note](https://github.com/dpla/dashboard-analytics/issues/289#issuecomment-4433215683))
- Per-hub dynamic `min_date` — covered in #290
