Bulk Export #104

smunini · 2026-05-11T15:24:42Z

Introduction

If authentication and authorization decide who may access healthcare data, bulk data export decides how much of it can leave the building at one time - and at what pace, and through what door. It is the API that population health platforms, payer-provider data exchanges, registry submissions, research extracts, and AI training pipelines all converge on. CRUD and search are the bread and butter of FHIR, but the moment a workload needs every Observation for every patient in a cohort, it stops looking like a request/response problem and starts looking like a data engineering problem.

This document shares my thoughts on how to approach Bulk Data Export for the Helios FHIR Server. Like the persistence layer discussion and the authentication and authorization discussion, this is an architectural strategy document rather than a comprehensive specification. It explains the motivating direction, the key building blocks, and the Rust trait designs that will shape the export subsystem.

Who should read this? Anyone with an interest in FHIR bulk data interoperability, healthcare analytics infrastructure, or the operational realities of running long-lived, asynchronous jobs alongside a high-throughput FHIR API. Feedback is very much welcome - this is open source, developed in the open, and your perspective matters.

A note on scope: this document covers export only - the FHIR Bulk Data Access IG, specifically the $export family of operations. The companion problem of bulk submit - the inverse direction, taking large NDJSON payloads back into the server - is a separate concern with separate trade-offs. The Argonaut Project's current draft of $bulk-submit is still being worked through; we will publish a separate discussion document for ingestion once that draft stabilizes.

The Lay of the Land: What the Bulk Data Access IG Says

The Bulk Data Access IG defines an asynchronous, manifest-based, NDJSON-over-HTTPS pattern for exporting large volumes of FHIR data. The shape of every export, regardless of scope, is the same:

The client kicks off an export with a $export operation. The server responds immediately with 202 Accepted and a Content-Location header pointing to a status URL.
The client polls the status URL. While the export is in flight, the server keeps responding 202. When the export is complete, the server returns 200 OK with a JSON manifest describing every output file.
The client downloads the output files from the URLs in the manifest. Each file is application/fhir+ndjson - one resource per line, no Bundle wrapper.
Optionally, the client deletes the export (a DELETE on the status URL) when finished, signaling that the server may reclaim the output files.

Three flavors of export sit on top of this same pattern:

System-level - [base]/$export - everything the caller is permitted to see across the entire server.
Patient-level - [base]/Patient/$export - every resource in every patient's compartment that the caller is permitted to see.
Group-level - [base]/Group/[id]/$export - every resource in the compartment of every patient who is a member of the named Group.

The kick-off request accepts a substantial parameter surface. The headline parameters an implementation should be prepared to handle (the IG requires server support for some and leaves others optional):

Parameter	Meaning
`_type`	Comma-delimited list of resource types to include. Defaults to all types the scope and level permit.
`_since`	FHIR `instant`. Only resources whose state changed after this point (for example, `meta.lastUpdated` later than the supplied time).
`_until`	FHIR `instant`. Only resources whose `meta.lastUpdated` is at or before this point.
`_typeFilter`	A FHIR REST search expression applied to a single resource type (e.g. `MedicationRequest?status=active`). May be repeated.
`_outputFormat`	The output media type. `application/fhir+ndjson` is required; abbreviated `application/ndjson` and `ndjson` must also be accepted.
`_elements`	Comma-delimited element paths to include. The server should mark subsetted resources with the `SUBSETTED` tag.
`includeAssociatedData`	Hints for related resources to include (e.g. `LatestProvenanceResources`).
`organizeOutputBy`	Reorganize output by instances of a particular resource type, using Parameters header blocks per group.
`allowPartialManifests`	Permit the server to publish a manifest with `link[]` pagination before all output is finished.
`patient` (POST only)	Restrict the export to a list of `Patient` references.
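
To make the table concrete, here is a minimal sketch of how three of these parameters might be parsed once the query string has been URL-decoded. The names ParsedTypeFilter, parse_type_list, parse_type_filter, and normalize_output_format are hypothetical and are not part of the HFS codebase; the split of a _typeFilter value at the first `?` is an assumption based on the `MedicationRequest?status=active` example above.

```rust
/// Hypothetical parsed form of one `_typeFilter` value.
#[derive(Debug, PartialEq)]
pub struct ParsedTypeFilter {
    pub resource_type: String,
    pub query: String,
}

/// Splits a comma-delimited `_type` value, trimming whitespace and
/// dropping empty entries.
pub fn parse_type_list(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Splits `MedicationRequest?status=active` at the first `?`.
/// Returns `None` when the resource type or the query is missing.
pub fn parse_type_filter(raw: &str) -> Option<ParsedTypeFilter> {
    let (resource_type, query) = raw.split_once('?')?;
    if resource_type.is_empty() || query.is_empty() {
        return None;
    }
    Some(ParsedTypeFilter {
        resource_type: resource_type.to_string(),
        query: query.to_string(),
    })
}

/// Maps the three accepted `_outputFormat` spellings onto the full
/// media type; anything else is rejected.
pub fn normalize_output_format(raw: &str) -> Option<&'static str> {
    match raw {
        "application/fhir+ndjson" | "application/ndjson" | "ndjson" => {
            Some("application/fhir+ndjson")
        }
        _ => None,
    }
}
```

A handler built this way would reject an unknown _outputFormat or a malformed _typeFilter at kick-off, before any job row is created.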

The manifest returned at completion has a fixed schema:

{
  "transactionTime":      "2026-05-11T00:00:00Z",
  "request":              "https://fhir.example.org/Group/cohort-1/$export?_type=Patient,Observation",
  "requiresAccessToken":  true,
  "output": [
    { "type": "Patient",     "url": "https://files.example.org/exports/abc/Patient-001.ndjson",     "count": 12500 },
    { "type": "Observation", "url": "https://files.example.org/exports/abc/Observation-001.ndjson", "count": 980321 }
  ],
  "deleted": [],
  "error":   [],
  "link":    []
}

Three fields are non-obvious and worth highlighting up front:

transactionTime is the server's frozen wall-clock at the moment the export was started. Every resource in the output must reflect server state as of that instant. This is the anchor that lets clients implement incremental sync correctly using _since on the next run.
requiresAccessToken is a hint to the client about how to fetch the output files. If true, the URLs require the same Authorization: Bearer ... token that authorized the kick-off. If false, the URLs are pre-signed (or otherwise capability-based) and the client SHALL NOT send a token. The decision is the server's; both modes are valid.
error is not a status indicator. It is a list of NDJSON files containing FHIR OperationOutcome resources, one per line, describing per-resource-type failures that did not cause the entire job to fail. An export with a populated error array still finished 200 OK from a workflow perspective.
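
The incremental-sync role of transactionTime can be sketched from the client's side: the transactionTime of one manifest becomes the _since of the next kick-off. The function name next_kickoff_url is hypothetical, and the sketch only covers Group-level exports. It percent-encodes `+` because a timezone offset such as `+02:00` would otherwise be read as a space in a query string.

```rust
/// Builds the kick-off URL for the next run of a Group-level export.
/// `previous_transaction_time` is the `transactionTime` string copied
/// from the last manifest, or `None` for a first, full export.
pub fn next_kickoff_url(
    base: &str,
    group_id: &str,
    types: &[&str],
    previous_transaction_time: Option<&str>,
) -> String {
    let mut params: Vec<String> = Vec::new();
    if !types.is_empty() {
        params.push(format!("_type={}", types.join(",")));
    }
    if let Some(since) = previous_transaction_time {
        // '+' in a timezone offset must be escaped inside a query string.
        params.push(format!("_since={}", since.replace('+', "%2B")));
    }

    let mut url = format!("{}/Group/{}/$export", base.trim_end_matches('/'), group_id);
    if !params.is_empty() {
        url.push('?');
        url.push_str(&params.join("&"));
    }
    url
}
```

A client that stores the last transactionTime per cohort and feeds it back this way does not need to track individual resource versions between runs.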

Authorization for bulk operations sits squarely on SMART Backend Services, with scopes of the form system/Patient.rs, system/*.rs, and similar. That story is told in detail in Discussion #45 and shipped today; we will not re-derive it here. What this document does assume from that work is that the auth layer produces a RequestContext containing a validated Principal with a ScopeSet, and that this context flows into every export handler intact.

The Essential Flow

In words and then in pictures.

Kick-off. The client sends GET /Group/cohort-1/$export?_type=Patient,Observation&_since=2026-01-01T00:00:00Z with Accept: application/fhir+json, Prefer: respond-async, and a Bearer token. The server validates the token, parses parameters, opens an export job in shared state, and returns 202 Accepted with Content-Location: https://fhir.example.org/export-status/abc. The handler does no data extraction inline; it returns within milliseconds.

Polling. The client polls GET /export-status/abc periodically (honoring Retry-After). While the job runs, the server returns 202 Accepted, optionally with an X-Progress header carrying a free-form message. When the job is finished, the server returns 200 OK with the JSON manifest above. The client now has every URL it needs to fetch the data.

Download. The client fetches each output[].url in parallel. The server streams each file as application/fhir+ndjson, optionally Content-Encoding: gzip. The number of files, their sizes, and the order are server-chosen.

Cleanup. When the client is done - or whenever it decides to abandon the export - it sends DELETE /export-status/abc. The server returns 202 Accepted. The files may now be reclaimed. The status URL begins returning 404 Not Found.
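
Seen from the client, the polling step reduces to a small decision on each status response. The sketch below is one possible shape, not a prescribed client: PollAction and next_poll_action are hypothetical names, the 30-second fallback is an assumption, and only the delta-seconds form of Retry-After is parsed (an HTTP-date value falls back to the default).

```rust
use std::time::Duration;

/// Assumed fallback when the server sends no usable `Retry-After`.
const DEFAULT_WAIT: Duration = Duration::from_secs(30);

/// What the client should do after one poll of the status URL.
#[derive(Debug, PartialEq)]
pub enum PollAction {
    /// Export still running (or client throttled): poll again later.
    Wait(Duration),
    /// 200 OK: the response body is the JSON manifest.
    ReadManifest,
    /// 404: the export was deleted or never existed.
    Gone,
    /// Any other status is treated as a failure and surfaced as-is.
    Fail(u16),
}

/// Maps an HTTP status code and optional `Retry-After` header value
/// onto the next client action.
pub fn next_poll_action(status: u16, retry_after: Option<&str>) -> PollAction {
    match status {
        202 | 429 => {
            let wait = retry_after
                .and_then(|v| v.trim().parse::<u64>().ok())
                .map(Duration::from_secs)
                .unwrap_or(DEFAULT_WAIT);
            PollAction::Wait(wait)
        }
        200 => PollAction::ReadManifest,
        404 => PollAction::Gone,
        other => PollAction::Fail(other),
    }
}
```

Keeping this decision as a pure function makes it easy to test without a server, and the HTTP client around it only has to sleep, fetch, or stop.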

Client                          HFS (any instance)              Worker pool             Object store
  |                                    |                              |                         |
  |-- GET /Group/c/$export ----------->|                              |                         |
  |   Authorization: Bearer ...        |                              |                         |
  |   Prefer: respond-async            |  start_export(req)           |                         |
  |                                    |   -> persists ExportJob row  |                         |
  |<-- 202 Accepted -------------------|                              |                         |
  |   Content-Location: /status/abc    |                              |                         |
  |                                    |                              |                         |
  |                                    |             claim_next() --->|                         |
  |                                    |              (SKIP LOCKED)   |                         |
  |                                    |                              |-- fetch_export_batch -->|
  |                                    |                              |   for each type/cursor  |
  |                                    |                              |<------------------------|
  |                                    |                              |-- open_writer ......... |
  |                                    |                              |-- write NDJSON lines -->|
  |                                    |                              |-- finalize_part ....... |
  |                                    |                              |                         |
  |-- GET /status/abc ---------------->|                              |                         |
  |<-- 202 Accepted, X-Progress -------|                              |                         |
  |                                    |                              |                         |
  |              ... time passes ...   |                              |                         |
  |                                    |                              | mark ExportStatus =     |
  |                                    |                              | Complete; publish       |
  |                                    |                              | manifest                |
  |                                    |                              |                         |
  |-- GET /status/abc ---------------->|                              |                         |
  |<-- 200 OK + JSON manifest ---------|                              |                         |
  |                                    |                              |                         |
  |-- GET output[0].url ----------------------------------------------------------------------->|
  |<-- 200 OK + NDJSON -------------------------------------------------------------------------|
  |                                    |                              |                         |
  |-- DELETE /status/abc ------------->|                              |                         |
  |<-- 202 Accepted -------------------|                              |                         |

Notice three things in that diagram. The kick-off handler does no extraction. The polling client may land on a different HFS instance than the one that received the kick-off - the status URL must work regardless. The download path is not necessarily served by HFS at all; if the manifest's requiresAccessToken is false, the URLs may point directly at the object store.

These are not implementation details. They are the architectural premises the rest of this document is responding to.

The Architectural Tensions

Before we get to traits, it is worth naming the tensions that the design has to resolve. Bulk export is not a single piece of code; it is a system, and the system has to hold together under four pressures simultaneously.

Long-running work behind a short-lived HTTP request. The kick-off responds 202 in milliseconds. The job behind it may run for minutes, hours, or - for the largest population-level pulls - long enough to outlive the process that started it. The handler and the worker cannot be the same thing.

State that outlives a process. Job status, cursors, manifests, output files - all of it has to survive process restarts, deploys, and crashes. There is no "in-memory only" version of an export that is also production-grade. Whatever state we keep must be durable from the first call to start_export().

One server versus many. A small clinic might run HFS as a single process on a single VM. A national exchange will run HFS as a fleet of pods behind a load balancer, scaled to traffic. The kick-off, the status polls, and the file downloads will land on whatever instance the load balancer picks at the moment. Job state cannot live in one instance's memory if any other instance might field the next request.

The download endpoint is a fileserver. Once a job is done, every output URL is a sustained GET. Megabytes per file, gigabytes per export, hundreds of files in the manifest. That is a fundamentally different workload from "look up a Patient by ID and return JSON" - it is hot-path bandwidth, not request/response latency. Bolting it onto the same Tokio executor that fields Patient.read queries is workable in a single-instance deployment and a mistake at scale.

The design that follows pulls these four tensions apart cleanly, so each one is solved by a single, replaceable abstraction.

Single-Instance vs Multi-Instance: A Tale of Two Deployments

HFS has always tried to scale down as well as it scales up. The persistence layer ships with a zero-config SQLite default; the same trait surface accepts PostgreSQL, MongoDB, Elasticsearch, and S3. Bulk export follows the same philosophy: the same traits serve both the single-VM clinic install and the multi-pod cloud deployment, and the operator decides at startup which concrete implementations to wire in.

Single-instance: zero-config

The simplest possible export deployment is a single HFS process, running on a single VM, writing job state into the SQLite database it already manages, and writing output files to the local filesystem under ${HFS_DATA_DIR}/exports/{job_id}/. The worker that performs the extraction is a Tokio task spawned from the same process; the polling and download handlers serve the local SQLite row and the local file directly.

This is fine for clinics, single-tenant deployments, demos, conformance testing, and CI. It works without an external job queue, without object storage, without a network filesystem. The trade-off is that you cannot horizontally scale HFS - the moment a second pod appears behind the load balancer, status polls will start landing on the wrong instance, and the design falls apart.

Multi-instance: shared state, work pool

The horizontally scaled deployment splits responsibilities cleanly:

Job state lives in a shared transactional store. Every HFS instance reads and writes the same bulk_export_jobs table. Status polls work from any instance because every instance is looking at the same row.
Output files live in object storage. Every HFS instance can serve any file URL because they all point at the same bucket - or, in the requiresAccessToken: false case, the client downloads pre-signed URLs directly from the object store and HFS is not in the path at all.
Workers are co-located with HFS pods (the default) or run as a separate hfs-exporter binary against the same shared state (an option discussed later). Workers claim jobs out of the shared store using a leasing pattern, so adding or removing workers requires no coordination beyond the shared store itself.

Single-instance topology                 Multi-instance topology
=========================                ============================
                                                                          
  +-------------------+                    +-------+   +-------+   +-------+
  |     HFS pod       |                    | HFS   |   | HFS   |   | HFS   |
  |  +-----------+    |                    | pod 1 |   | pod 2 |   | pod 3 |
  |  | REST API  |    |                    +---+---+   +---+---+   +---+---+
  |  +-----+-----+    |                        |           |           |
  |        |          |                        +-----+-----+-----+-----+
  |  +-----v-----+    |                              |           |
  |  | Worker    |    |                              v           v
  |  +-----+-----+    |                       +-------------+ +---------------+
  |        |          |                       | PostgreSQL  | | S3 (output)   |
  |  +-----v-----+    |                       | (job state) | |               |
  |  | SQLite    |    |                       +-------------+ +---------------+
  |  | local FS  |    |                              ^
  |  +-----------+    |                              |
  +-------------------+                       +------+-------+
                                              | Worker pool  |
                                              | (in-pod or   |
                                              | hfs-exporter)|
                                              +--------------+

The cardinal architectural rule is that the same code path serves both topologies. The BulkExportStorage trait is implemented by an embedded SQLite backend for single-instance and a PostgreSQL backend for multi-instance. The ExportOutputStore trait is implemented by a local-FS backend for single-instance and an S3 backend for multi-instance. The handler does not know which one is wired up. The worker does not know which one is wired up. Only the bootstrap code, reading environment variables, knows.
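
The "only the bootstrap code knows" rule can be illustrated with a trait object chosen from a configuration value. This is a simplified, synchronous sketch: OutputStore, LocalFsStore, S3Store, select_output_store, and describe_export are hypothetical stand-ins rather than the real ExportOutputStore trait, the root path and bucket name are placeholders, and the `sqlite-fs` backend value is an assumed name for the single-instance mode.

```rust
/// Simplified stand-in for the output-store abstraction.
pub trait OutputStore {
    fn kind(&self) -> &'static str;
}

pub struct LocalFsStore {
    pub root: String,
}

pub struct S3Store {
    pub bucket: String,
}

impl OutputStore for LocalFsStore {
    fn kind(&self) -> &'static str {
        "local-fs"
    }
}

impl OutputStore for S3Store {
    fn kind(&self) -> &'static str {
        "s3"
    }
}

/// Bootstrap-only decision: picks an implementation from the value of
/// a backend setting such as `HFS_BULK_EXPORT_BACKEND`.
/// `sqlite-fs` is an assumed name for the single-instance backend.
pub fn select_output_store(backend: &str) -> Result<Box<dyn OutputStore>, String> {
    match backend {
        "sqlite-fs" => Ok(Box::new(LocalFsStore {
            root: "./data/exports".to_string(),
        })),
        "postgres-s3" | "redis-s3" => Ok(Box::new(S3Store {
            bucket: "hfs-exports".to_string(),
        })),
        other => Err(format!("unknown bulk export backend: {other}")),
    }
}

/// Handler or worker code: depends only on the trait, never on the
/// concrete backend.
pub fn describe_export(store: &dyn OutputStore) -> String {
    format!("writing export output via {}", store.kind())
}
```

Everything below select_output_store takes `&dyn OutputStore`, so swapping topologies is a configuration change rather than a code change.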

The Recommendation: PostgreSQL for Job State, S3 for Output, In-Process Workers

There is a tension between offering a menu of options and recommending a default. Discussion #45 leaned toward recommendations - JwksBearerAuthProvider as the default token validator, with IntrospectionAuthProvider as the fallback for opaque tokens. We take the same posture here.

For multi-instance job state, the default is PostgreSQL. PostgreSQL is already a supported HFS primary store, so adopting it for export state adds no new operational dependency for the common case. SELECT ... FOR UPDATE SKIP LOCKED is the canonical pattern for transactional job queuing on PostgreSQL, usable from any Rust client (sqlx, tokio-postgres) and common in Postgres-backed job libraries; combined with a lease-expiry column, it handles worker fail-over, lease expiry, and at-least-once delivery without an external broker. The bulk_export_jobs table is small, write-amplified per heartbeat, and bounded in size by output retention - it does not threaten the resource store's hot path.

For multi-instance output storage, the default is S3-compatible object storage. This is the same S3 backend the persistence layer already ships, with the same AwsS3Client and the same keyspace conventions. Output keys are scoped under /{tenant}/exports/{job_id}/{resource_type}-{part}.ndjson. The manifest publishes pre-signed URLs with a configurable TTL, so the manifest's requiresAccessToken is false and the client downloads directly from S3 without HFS in the bandwidth path. For deployments that want to keep token-based access (audit-heavy environments, environments without a CDN), the same files are streamable through HFS's own download handler with requiresAccessToken: true instead - configuration, not code.

For execution, the default is an embedded worker pool. Workers run in the same process as the HFS REST API by default. This keeps the operational surface small: one binary, one deployment, one set of logs and metrics. A configurable HFS_BULK_EXPORT_WORKER_CONCURRENCY limits how many jobs each pod runs at once, and HFS_BULK_EXPORT_DISABLE_LOCAL_WORKER=true lets operators turn off in-pod workers entirely when they want to dedicate request-serving capacity. The optional hfs-exporter binary, discussed later, addresses the cases where worker isolation needs to be physical, not just configurational.

Now the vendor-style walkthroughs, in the same shape as the IdP integration section of Discussion #45.

PostgreSQL (recommended default for job state)

How it connects. The same HFS_DATABASE_URL that drives the persistence layer. The export subsystem adds two tables - bulk_export_jobs and bulk_export_outputs - alongside the existing resource schema. No new connection pool, no new credentials, no new operational surface.

Trade-offs. PostgreSQL is durable, transactional, and well understood. SELECT ... FOR UPDATE SKIP LOCKED makes multi-worker claiming straightforward and safe. The cost is that every status poll is a query against PostgreSQL, which on a hot system means tuning indexes on (tenant_id, status, lease_expiry) and accepting that very-high-poll-rate workloads (more than a few thousand polls per second per tenant) will eventually want a caching layer in front.

Configuration sketch.

HFS_BULK_EXPORT_BACKEND=postgres-s3
HFS_BULK_EXPORT_DATABASE_URL=postgresql://hfs:***@db.example.org/hfs
HFS_BULK_EXPORT_TABLE_PREFIX=bulk_export_
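
The claim-with-lease pattern behind SKIP LOCKED can be modeled in memory to show the semantics a PostgreSQL backend would need. This is a sketch under stated assumptions: JobTable, Job, JobState, and CLAIM_SQL are hypothetical, the SQL column names (status, lease_expiry, created_at) and status strings are guesses at a schema, and a Mutex stands in for the row locks that PostgreSQL would take.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Illustrative SQL only; table columns and status values are assumptions.
pub const CLAIM_SQL: &str = "\
UPDATE bulk_export_jobs
   SET status = 'inprogress', lease_expiry = now() + interval '60 seconds'
 WHERE id = (SELECT id FROM bulk_export_jobs
              WHERE status = 'accepted'
                 OR (status = 'inprogress' AND lease_expiry < now())
              ORDER BY created_at
              LIMIT 1
              FOR UPDATE SKIP LOCKED)
RETURNING id";

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JobState {
    Accepted,
    InProgress,
    Complete,
}

#[derive(Debug, Clone)]
pub struct Job {
    pub id: u32,
    pub state: JobState,
    pub lease_expiry: Option<Instant>,
}

/// In-memory model of the shared job table.
pub struct JobTable {
    jobs: Mutex<Vec<Job>>,
}

impl JobTable {
    pub fn new(ids: &[u32]) -> Self {
        let jobs = ids
            .iter()
            .map(|&id| Job { id, state: JobState::Accepted, lease_expiry: None })
            .collect();
        JobTable { jobs: Mutex::new(jobs) }
    }

    /// Claims the oldest job that is unclaimed or whose lease has lapsed,
    /// and stamps a fresh lease on it.
    pub fn claim_next(&self, now: Instant, lease: Duration) -> Option<u32> {
        let mut jobs = self.jobs.lock().unwrap();
        let job = jobs.iter_mut().find(|j| match j.state {
            JobState::Accepted => true,
            JobState::InProgress => j.lease_expiry.map_or(true, |t| t <= now),
            JobState::Complete => false,
        })?;
        job.state = JobState::InProgress;
        job.lease_expiry = Some(now + lease);
        Some(job.id)
    }

    /// Marks a job finished so it is never claimed again.
    pub fn complete(&self, id: u32) {
        let mut jobs = self.jobs.lock().unwrap();
        if let Some(job) = jobs.iter_mut().find(|j| j.id == id) {
            job.state = JobState::Complete;
            job.lease_expiry = None;
        }
    }
}
```

Because an expired lease makes a job claimable again, a crashed worker's job is eventually picked up by another worker. That is also why delivery is at-least-once: a slow worker that outlives its lease can overlap with the worker that reclaimed the job.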

Redis (alternative for low-latency status polls)

How it connects. A standard Redis or Redis Cluster endpoint. Job records are hashes keyed by job ID; a pending list holds queued jobs in enqueue order; claim is BLMOVE from the pending list to a per-worker in-flight-{worker_id} list, with a TTL-backed lease.

Trade-offs. Redis makes status polls trivially fast - a single HGETALL is under a millisecond - and the claim semantics are clean. The cost is durability: a Redis crash without AOF persistence can lose in-flight job state. For exports, the worst-case impact is a job that has to be restarted; cursors live in PostgreSQL via the same BulkExportStorage trait, so re-running is mostly idempotent, but operators who run Redis as a cache rather than a primary store should think twice. Best fit: deployments that already operate Redis as a hot path and want polling latency under a millisecond.

Configuration sketch.

HFS_BULK_EXPORT_BACKEND=redis-s3
HFS_BULK_EXPORT_REDIS_URL=rediss://redis.example.org:6380
HFS_BULK_EXPORT_REDIS_KEY_PREFIX=hfs:export:

DynamoDB / Cosmos DB / Spanner (cloud-managed equivalents)

How they connect. Each cloud's identity model. DynamoDB via the AWS SDK; Cosmos DB via the Azure SDK; Spanner via Google Cloud credentials. Each is a BulkExportStorage implementation that mirrors the PostgreSQL pattern but uses conditional writes (DynamoDB: ConditionExpression, Cosmos: ETag preconditions, Spanner: read-modify-write transactions) in place of SKIP LOCKED.

Trade-offs. Managed durability and global replication, at the cost of additional integration code per provider and per-call billing. The same caveats as the IdP discussion in #45 apply: every cloud has its own claim-name and capability quirks; the abstraction has to absorb them.

These are not first-tier targets for the initial implementation, but the trait design must not preclude them. Anyone running HFS purely on a single cloud will eventually want them.

Kafka / NATS JetStream (workers physically separate from request handlers)

How they connect. Kafka topics or JetStream streams act as the work queue; HFS publishes a job-created event on kick-off, and a separate hfs-exporter binary consumes the topic. Job state still lives in PostgreSQL (or wherever the BulkExportStorage impl points), so status polls and downloads do not touch the broker.

Trade-offs. This is the "we run our exporters on a different node pool because they're bandwidth-heavy" case. Adds a broker as an operational dependency, but lets you scale request handlers and exporters independently, and gives you explicit at-least-once delivery semantics with offsets. Best fit: fleets large enough that the export workload visibly distorts request-serving capacity.

S3-compatible object storage (recommended default for output files)

How it connects. The same AwsS3Client the persistence layer's S3 backend uses, configured via the standard AWS credential chain. Output files are uploaded as multipart objects under /{tenant}/exports/{job_id}/{resource_type}-{part}.ndjson. The manifest publishes pre-signed GET URLs with a TTL configured by HFS_BULK_EXPORT_FILE_URL_TTL.

Trade-offs. Object storage is the right tool for the job - massively parallel reads, transparent CDN integration, region-redundant durability, lifecycle policies for automatic expiry. The only meaningful cost is that a pre-signed URL is a bearer capability: anyone who holds it can fetch the file until it expires, without presenting a token to HFS, which some audit regimes treat as out of band. Those deployments set HFS_BULK_EXPORT_REQUIRES_ACCESS_TOKEN=true, the manifest reports requiresAccessToken: true, and downloads flow through HFS's own handler instead.

Cloudflare R2 / Google Cloud Storage / MinIO (S3-compatible drop-ins)

R2, GCS (via interop), and MinIO all speak the S3 API. The same AwsS3Client works against each with no code change, only endpoint_url and force_path_style configuration adjustments. We will document a docker compose example with MinIO as part of the development environment so contributors can exercise the multi-instance path without an AWS account.

Local filesystem (single-instance only)

How it connects. Files are written to ${HFS_DATA_DIR}/exports/{tenant_id}/{job_id}/{resource_type}-{part}.ndjson. The download handler serves them via tokio::fs::File and Axum's streaming body.

Trade-offs. No external dependencies; perfect for development and single-VM deployments. Not safe in a multi-instance topology because the writing instance and the reading instance may differ. A shared NFS mount makes this technically work across instances, but it is brittle (lock semantics, cache coherency, fsync surprises) and we do not recommend it.
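
Since tenant IDs, job IDs, and resource types end up as path segments, a local-filesystem backend would likely want to validate them before joining. The sketch below shows one way to do that. The helper names is_safe_segment and export_file_path, the allowed character set, and the three-digit part padding (borrowed from the `Patient-001.ndjson` manifest example) are all assumptions.

```rust
use std::path::{Path, PathBuf};

/// Accepts only simple names: no separators, no `.` or `..`, no empties.
fn is_safe_segment(s: &str) -> bool {
    !s.is_empty()
        && s != "."
        && s != ".."
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_' || c == '.')
}

/// Builds `{data_dir}/exports/{tenant_id}/{job_id}/{resource_type}-{part}.ndjson`,
/// or returns `None` if any segment could escape the export directory.
pub fn export_file_path(
    data_dir: &Path,
    tenant_id: &str,
    job_id: &str,
    resource_type: &str,
    part: u32,
) -> Option<PathBuf> {
    if ![tenant_id, job_id, resource_type]
        .iter()
        .all(|s| is_safe_segment(s))
    {
        return None;
    }
    Some(
        data_dir
            .join("exports")
            .join(tenant_id)
            .join(job_id)
            .join(format!("{resource_type}-{part:03}.ndjson")),
    )
}
```

The same check applies on the download side: a handler that resolves a requested file name through export_file_path cannot be steered outside the tenant's export directory by a crafted job ID.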

Designing the Rust Traits

The persistence crate already carries most of the building blocks. The export module at crates/persistence/src/core/bulk_export.rs defines the types and traits below; the S3 backend implements them today, and the embedded SQLite and PostgreSQL backends will follow. We present the existing surface first - so readers know what already exists - and then propose the additions that this discussion is centrally about.

The Existing Surface: Types

The vocabulary of an export job. These are stable; nothing in this document proposes changing them.

/// Unique identifier for an export job.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct ExportJobId(String);

/// Status of an export job.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ExportStatus {
    /// Job has been accepted but not yet started processing.
    Accepted,
    /// Job is currently processing.
    InProgress,
    /// Job has completed successfully.
    Complete,
    /// Job failed with an error.
    Error,
    /// Job was cancelled by the user.
    Cancelled,
}

/// Level at which the export is being performed.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum ExportLevel {
    /// System-level export (`[base]/$export`).
    System,
    /// Patient-level export (`[base]/Patient/$export`).
    Patient,
    /// Group-level export (`[base]/Group/[id]/$export`).
    Group { group_id: String },
}

/// A type filter for the export request.
///
/// Type filters allow specifying FHIR search parameters that should be applied
/// when exporting a specific resource type.
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub struct TypeFilter {
    pub resource_type: String,
    pub query: String,
}

/// Request parameters for starting an export job.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExportRequest {
    pub level: ExportLevel,
    pub resource_types: Vec<String>,
    pub since: Option<DateTime<Utc>>,
    pub until: Option<DateTime<Utc>>,
    pub type_filters: Vec<TypeFilter>,
    pub elements: Vec<String>,
    pub include_associated_data: Vec<String>,
    pub output_format: String,
    pub batch_size: u32,
    // ... builder methods (with_types, with_batch_size, with_type_filter, ...)
}

ExportProgress and its per-type companion TypeExportProgress round out the model. TypeExportProgress carries the cursor state that lets a worker resume mid-job after a crash - the same cursor type that ExportDataProvider::fetch_export_batch accepts and returns.

/// Progress for a single resource type within an export.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TypeExportProgress {
    pub resource_type: String,
    pub total_count: Option<u64>,
    pub exported_count: u64,
    pub cursor: Option<String>,
    pub completed: bool,
}

/// Overall progress for an export job.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExportProgress {
    pub job_id: ExportJobId,
    pub status: ExportStatus,
    pub transaction_time: DateTime<Utc>,
    pub per_type: Vec<TypeExportProgress>,
    pub message: Option<String>,
    pub error: Option<String>,
}

ExportManifest and ExportOutputFile model the terminal manifest that the status endpoint serves at 200 OK. NdjsonBatch is what data providers produce - one logical batch of NDJSON lines plus a cursor and an "is this the last batch" flag.

/// A descriptor for a single output file in an export manifest.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExportOutputFile {
    pub resource_type: String,
    pub url: String,
    pub count: Option<u64>,
}

/// The terminal manifest for a completed export.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExportManifest {
    pub transaction_time: DateTime<Utc>,
    pub request: String,
    pub requires_access_token: bool,
    pub output: Vec<ExportOutputFile>,
    pub deleted: Vec<ExportOutputFile>,
    pub error: Vec<ExportOutputFile>,
    pub message: Option<String>,
}

/// A batch of NDJSON lines produced by an `ExportDataProvider`.
#[derive(Debug, Clone)]
pub struct NdjsonBatch {
    pub lines: Vec<String>,
    pub next_cursor: Option<String>,
    pub is_last: bool,
}
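
To show how a worker might consume batches and cursors, here is a synchronous, in-memory sketch. It redefines a local NdjsonBatch with the same fields so the example stands alone; InMemoryProvider and drain are hypothetical, and the cursor encoding (next row index as a decimal string) is just one choice a provider could make.

```rust
/// Local mirror of the batch shape above, so the sketch is self-contained.
pub struct NdjsonBatch {
    pub lines: Vec<String>,
    pub next_cursor: Option<String>,
    pub is_last: bool,
}

/// Hypothetical provider holding pre-serialized resources in memory.
pub struct InMemoryProvider {
    pub rows: Vec<String>,
}

impl InMemoryProvider {
    /// Returns up to `batch_size` rows starting at the cursor position.
    /// The cursor is the next row index encoded as a string.
    pub fn fetch_export_batch(&self, cursor: Option<&str>, batch_size: usize) -> NdjsonBatch {
        let start = cursor
            .and_then(|c| c.parse::<usize>().ok())
            .unwrap_or(0)
            .min(self.rows.len());
        let end = (start + batch_size).min(self.rows.len());
        let is_last = end == self.rows.len();
        NdjsonBatch {
            lines: self.rows[start..end].to_vec(),
            next_cursor: if is_last { None } else { Some(end.to_string()) },
            is_last,
        }
    }
}

/// Drains the provider from `cursor` onward, returning the NDJSON text
/// and the number of resources written.
pub fn drain(
    provider: &InMemoryProvider,
    mut cursor: Option<String>,
    batch_size: usize,
) -> (String, u64) {
    let batch_size = batch_size.max(1); // a zero batch size would never advance
    let mut out = String::new();
    let mut exported = 0u64;
    loop {
        let batch = provider.fetch_export_batch(cursor.as_deref(), batch_size);
        for line in &batch.lines {
            out.push_str(line);
            out.push('\n'); // one resource per line, no Bundle wrapper
        }
        exported += batch.lines.len() as u64;
        // A real worker would persist `next_cursor` here so a restart
        // resumes from this point instead of from the beginning.
        cursor = batch.next_cursor;
        if batch.is_last {
            break;
        }
    }
    (out, exported)
}
```

Calling drain with a saved cursor instead of None models the resume-after-crash path: only the rows after the cursor are written again.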

The Existing Surface: Job-State Trait

BulkExportStorage is the contract that any backend providing job state implements. Single-instance backends (embedded SQLite) and multi-instance backends (PostgreSQL, Redis) both satisfy this trait. The handler does not care which is wired up.

/// Storage trait for bulk export job management.
///
/// This trait handles the lifecycle of export jobs: creating, tracking,
/// completing, and cleaning up exports.
#[async_trait]
pub trait BulkExportStorage: Send + Sync {
    /// Starts a new export job and returns its ID.
    ///
    /// On a multi-instance deployment, this is a single transactional
    /// insert into the shared store. The job is `Accepted` and waiting
    /// for a worker to claim it.
    async fn start_export(
        &self,
        tenant: &TenantContext,
        request: ExportRequest,
    ) -> StorageResult<ExportJobId>;

    /// Returns the current progress of an export job.
    ///
    /// Called by the status-polling handler. Must succeed regardless of
    /// which HFS instance the polling client lands on.
    async fn get_export_status(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
    ) -> StorageResult<ExportProgress>;

    /// Cancels an in-progress export job.
    ///
    /// Cooperative: the worker observes the cancellation on its next
    /// status check and unwinds cleanly, leaving partial output that the
    /// cleanup pass will reclaim.
    async fn cancel_export(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
    ) -> StorageResult<()>;

    /// Deletes a finished export job and reclaims its output files.
    async fn delete_export(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
    ) -> StorageResult<()>;

    /// Returns the terminal manifest for a completed export.
    async fn get_export_manifest(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
    ) -> StorageResult<ExportManifest>;

    /// Lists export jobs visible to the tenant.
    async fn list_exports(
        &self,
        tenant: &TenantContext,
        include_completed: bool,
    ) -> StorageResult<Vec<ExportProgress>>;
}

The Existing Surface: Data-Provider Trait Hierarchy

ExportDataProvider is what the worker calls when it needs more resources to write. It is not BulkExportStorage - one provides job lifecycle, the other provides data. In practice the same backend object often implements both (the persistence layer already does), but conceptually they are separate concerns that could live in separate processes.

/// Data provider for export operations.
#[async_trait]
pub trait ExportDataProvider: Send + Sync {
    /// Lists resource types available for export, intersected with what
    /// the request asked for.
    async fn list_export_types(
        &self,
        tenant: &TenantContext,
        request: &ExportRequest,
    ) -> StorageResult<Vec<String>>;

    /// Counts resources of a type matching the request filters.
    ///
    /// Used to publish a meaningful `total_count` on `TypeExportProgress`
    /// when the underlying store can answer the count cheaply.
    async fn count_export_resources(
        &self,
        tenant: &TenantContext,
        request: &ExportRequest,
        resource_type: &str,
    ) -> StorageResult<u64>;

    /// Fetches the next batch of resources for the given type.
    ///
    /// The cursor is opaque from the caller's perspective. The provider
    /// chooses its encoding (page tokens, last-id, last-modified-tuple)
    /// and the worker passes it back unchanged.
    async fn fetch_export_batch(
        &self,
        tenant: &TenantContext,
        request: &ExportRequest,
        resource_type: &str,
        cursor: Option<&str>,
        batch_size: u32,
    ) -> StorageResult<NdjsonBatch>;
}

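/// Provider for patient compartment exports.
#[async_trait]
pub trait PatientExportProvider: ExportDataProvider {
    /// Lists one page of patient IDs in scope for the export.
    ///
    /// Returns the IDs together with the cursor for the next page
    /// (presumably `None` once the listing is exhausted).
    async fn list_patient_ids(
        &self,
        tenant: &TenantContext,
        request: &ExportRequest,
        cursor: Option<&str>,
        batch_size: u32,
    ) -> StorageResult<(Vec<String>, Option<String>)>;

    /// Fetches the next batch of `resource_type` resources that fall in
    /// the patient compartment of any of `patient_ids`.
    ///
    /// The cursor follows the same opaque contract as
    /// `fetch_export_batch`.
    async fn fetch_patient_compartment_batch(
        &self,
        tenant: &TenantContext,
        request: &ExportRequest,
        resource_type: &str,
        patient_ids: &[String],
        cursor: Option<&str>,
        batch_size: u32,
    ) -> StorageResult<NdjsonBatch>;
}

/// Provider for group-level exports.
#[async_trait]
pub trait GroupExportProvider: PatientExportProvider {
    /// Returns the members listed directly on the Group.
    async fn get_group_members(
        &self,
        tenant: &TenantContext,
        group_id: &str,
    ) -> StorageResult<Vec<String>>;

    /// Resolves the Group to the full set of patient IDs to export,
    /// including members that are not listed directly (for example,
    /// members reached through nested Groups).
    async fn resolve_group_patient_ids(
        &self,
        tenant: &TenantContext,
        group_id: &str,
    ) -> StorageResult<Vec<String>>;
}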

The trait hierarchy reflects the spec: every server that can do a Patient export can also do a System export; every server that can do a Group export can also do a Patient export. The compiler enforces the capability ladder.
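The opaque-cursor contract on fetch_export_batch is easiest to see in a worker-style loop. The following is a minimal synchronous sketch, not the HFS implementation: `BatchSource`, `InMemorySource`, `drain`, and the `lines` / `next_cursor` fields on `NdjsonBatch` are assumptions made for illustration, and the real trait is async and tenant-aware.

```rust
/// One page of NDJSON lines plus the cursor for the next page.
/// (Hypothetical shape; the real `NdjsonBatch` may differ.)
pub struct NdjsonBatch {
    pub lines: Vec<String>,
    pub next_cursor: Option<String>,
}

/// Synchronous, single-type stand-in for `ExportDataProvider`.
pub trait BatchSource {
    fn fetch_export_batch(&self, cursor: Option<&str>, batch_size: u32) -> NdjsonBatch;
}

/// In-memory provider whose cursor happens to be an encoded offset.
/// The caller never inspects it; it only hands it back unchanged.
pub struct InMemorySource {
    pub rows: Vec<String>,
}

impl BatchSource for InMemorySource {
    fn fetch_export_batch(&self, cursor: Option<&str>, batch_size: u32) -> NdjsonBatch {
        let len = self.rows.len();
        // Unknown or missing cursors restart from the beginning.
        let start = cursor
            .and_then(|c| c.parse::<usize>().ok())
            .unwrap_or(0)
            .min(len);
        // Guard against a zero batch size, which would never make progress.
        let size = batch_size.max(1) as usize;
        let end = (start + size).min(len);
        let next_cursor = if end < len { Some(end.to_string()) } else { None };
        NdjsonBatch {
            lines: self.rows[start..end].to_vec(),
            next_cursor,
        }
    }
}

/// Drains a source and returns every line plus the number of batches fetched.
pub fn drain(source: &dyn BatchSource, batch_size: u32) -> (Vec<String>, usize) {
    let mut out = Vec::new();
    let mut batches = 0;
    let mut cursor: Option<String> = None;
    loop {
        let batch = source.fetch_export_batch(cursor.as_deref(), batch_size);
        batches += 1;
        out.extend(batch.lines);
        match batch.next_cursor {
            // A real worker would persist this cursor before continuing,
            // so that another worker can resume from it after a crash.
            Some(next) => cursor = Some(next),
            None => break,
        }
    }
    (out, batches)
}
```

Because the loop treats the cursor as an uninterpreted string, a provider could switch from offsets to last-id or last-modified tuples without any change on the worker side.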

The Proposal: Output Storage as a First-Class Trait

The traits above answer "what data do we export" and "how is the job tracked". They do not answer "where do the bytes go". Today, each backend that implements BulkExportStorage also implicitly decides where output files live - the S3 backend writes them to S3, a future SQLite backend would write them locally. This works, but it conflates two decisions that operators reasonably want to make independently. A site running PostgreSQL for job state and S3 for output is a perfectly normal configuration. A site running PostgreSQL for both is also reasonable. The current shape does not support that cleanly.

We propose a separate ExportOutputStore trait:

/// Pluggable backend for bulk export output files.
///
/// Implementations decide where NDJSON output is physically stored
/// (local filesystem, S3, R2, GCS, Azure Blob, MinIO, etc.) and how
/// download URLs are generated for the manifest. The job-state backend
/// is unaware - it stores the keys and TTL hints; the output store
/// turns keys into URLs and bytes.
#[async_trait]
pub trait ExportOutputStore: Send + Sync {
    /// Opens an async writer for a new output part.
    ///
    /// The returned key uniquely identifies this part. Implementations
    /// SHOULD use `(tenant, job_id, resource_type, part_index)` as the
    /// natural ordering of keys for ease of cleanup.
    async fn open_writer(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
        resource_type: &str,
        part_index: u32,
    ) -> StorageResult<ExportPartWriter>;

    /// Marks a part as finalized and immutable.
    ///
    /// For object stores using multipart upload, this completes the
    /// upload. For local filesystem, this fsyncs and renames from
    /// `.tmp` to the final name.
    async fn finalize_part(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
        key: &ExportPartKey,
        line_count: u64,
    ) -> StorageResult<FinalizedPart>;

    /// Produces a download URL for a finalized part.
    ///
    /// If the store supports pre-signed URLs, returns a URL that the
    /// client can fetch directly. Otherwise returns a stable HFS-served
    /// URL that the download handler will resolve back to a key.
    fn download_url(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
        key: &ExportPartKey,
        ttl: Duration,
    ) -> StorageResult<DownloadUrl>;

    /// Deletes all output parts for a job. Idempotent.
    ///
    /// Called from `BulkExportStorage::delete_export` and from the
    /// cleanup pass when the output TTL elapses.
    async fn delete_job_outputs(
        &self,
        tenant: &TenantContext,
        job_id: &ExportJobId,
    ) -> StorageResult<()>;
}

/// A finalized output file as it will appear in the manifest.
#[derive(Debug, Clone)]
pub struct FinalizedPart {
    pub key: ExportPartKey,
    pub resource_type: String,
    pub line_count: u64,
    pub size_bytes: u64,
}

/// A download URL plus the access posture for the manifest.
#[derive(Debug, Clone)]
pub struct DownloadUrl {
    pub url: String,
    /// `true` if the URL requires the same Bearer token used at kick-off
    /// (HFS-served streaming). `false` if the URL is pre-signed and the
    /// client must NOT send a token.
    pub requires_access_token: bool,
}

Two implementations cover the common cases. LocalFsOutputStore writes under ${HFS_DATA_DIR}/exports/, hands back HFS-served URLs, and sets requires_access_token: true. S3OutputStore writes to a bucket configured via the existing S3BackendConfig, hands back pre-signed URLs, and sets requires_access_token: false (or true if the operator wants to keep downloads on the HFS data path).
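As a rough illustration of the local-filesystem finalize step (write to `.tmp`, fsync, rename), here is a minimal sketch using only the standard library. The function names `part_path` and `write_and_finalize` and the exact file-name layout are assumptions for this example; the actual LocalFsOutputStore would stream through a writer and would need to validate tenant and job identifiers before using them as path components.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Builds the final path for a part. The layout mirrors the suggested
/// `(tenant, job_id, resource_type, part_index)` key ordering.
/// NOTE: a real store must reject identifiers containing path separators.
pub fn part_path(
    root: &Path,
    tenant: &str,
    job_id: &str,
    resource_type: &str,
    part_index: u32,
) -> PathBuf {
    root.join(tenant)
        .join(job_id)
        .join(format!("{resource_type}-{part_index:05}.ndjson"))
}

/// Writes NDJSON lines to `<final>.tmp`, fsyncs, then renames into place.
/// Returns `(line_count, size_bytes)` for the finalized part.
pub fn write_and_finalize(final_path: &Path, lines: &[&str]) -> io::Result<(u64, u64)> {
    if let Some(parent) = final_path.parent() {
        fs::create_dir_all(parent)?;
    }
    // Append ".tmp" to the full file name rather than replacing the extension.
    let mut tmp_name = final_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp_path)?;
    let mut size = 0u64;
    for line in lines {
        file.write_all(line.as_bytes())?;
        file.write_all(b"\n")?;
        size += line.len() as u64 + 1;
    }
    // Flush file contents to disk before the part becomes visible.
    file.sync_all()?;
    drop(file);
    // Readers only ever see the final name once the content is complete.
    fs::rename(&tmp_path, final_path)?;
    Ok((lines.len() as u64, size))
}
```

The rename is what makes a part appear atomically: a download handler that only resolves final names can never serve a half-written file.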

The Proposal: Workers, Leases, and the Claim Strategy

Today, the S3 backend's start_export does the work synchronously inside the kick-off handler. That works for the single-instance case but breaks two design goals: it pins the handler thread for the duration of the job, and it gives no way to scale workers independently of request handlers. The handler must return immediately; the work happens elsewhere.

A worker is the runtime that performs an export. Workers may run in-pod (the default) or in a separate hfs-exporter binary. Either way, they share state through BulkExportStorage and they coordinate through a leasing protocol.

/// A lease over a single export job, held by exactly one worker at a time.
///
/// Leases have an expiry; if the worker holding the lease does not
/// heartbeat before the expiry, the lease is reclaimable by another
/// worker. This is the at-least-once-delivery primitive of the export
/// subsystem.
#[derive(Debug, Clone)]
pub struct ExportJobLease {
    pub job_id: ExportJobId,
    pub tenant: TenantContext,
    pub worker_id: WorkerId,
    pub lease_expiry: DateTime<Utc>,
    pub fencing_token: u64,
}

/// The runtime that actually performs export work.
///
/// Implementations bind together a `BulkExportStorage` (for job state),
/// an `ExportDataProvider` (for resource data), and an `ExportOutputStore`
/// (for NDJSON files). The same trait is satisfied by an in-pod worker
/// and by the standalone `hfs-exporter` binary.
#[async_trait]
pub trait ExportWorker: Send + Sync {
    /// Attempts to claim the next available job for this worker.
    ///
    /// Returns `Ok(None)` if no job is available. The strategy used
    /// (FOR UPDATE SKIP LOCKED on Postgres, BLMOVE on Redis, mutex on
    /// in-memory) is encapsulated by the `ExportClaimStrategy` impl
    /// the worker was constructed with.
    async fn claim_next(
        &self,
        worker_id: &WorkerId,
    ) -> StorageResult<Option<ExportJobLease>>;

    /// Runs the export job for as long as the lease is valid.
    ///
    /// Performs `fetch_export_batch` in a loop, writes NDJSON to the
    /// output store, persists progress (cursors, counts) after each
    /// batch, and heartbeats the lease. On cancellation observed via
    /// `BulkExportStorage::get_export_status`, unwinds cleanly.
    async fn run_job(
        &self,
        lease: ExportJobLease,
    ) -> StorageResult<JobOutcome>;

    /// Renews a lease that the worker still holds.
    ///
    /// Called periodically while a job runs. Returns the new expiry,
    /// or `Err(LeaseLost)` if another worker has already reclaimed
    /// the job - in which case the current worker MUST stop writing
    /// to the output store immediately.
    async fn heartbeat(
        &self,
        lease: &ExportJobLease,
    ) -> StorageResult<DateTime<Utc>>;

    /// Releases a lease early (e.g. on graceful shutdown).
    async fn release(
        &self,
        lease: ExportJobLease,
    ) -> StorageResult<()>;
}

The choice of claim mechanism is itself pluggable, so the same ExportWorker runtime works against any job-state backend:

/// Strategy for atomically claiming the next available export job.
///
/// The trait surface is small on purpose: every backend has its own
/// idiomatic primitive for this, and we want each implementation to
/// reach for its native pattern rather than emulating someone else's.
#[async_trait]
pub trait ExportClaimStrategy: Send + Sync {
    /// Atomically transitions a single eligible job from `Accepted`
    /// (or expired-lease `InProgress`) to held-by-this-worker.
    async fn claim_next(
        &self,
        worker_id: &WorkerId,
        lease_duration: Duration,
    ) -> StorageResult<Option<ExportJobLease>>;
}

Three implementations are in scope for the initial work: PostgresSkipLocked (default for multi-instance), RedisListMove (alternative for low-poll-latency deployments), and InMemoryMutex (used by the embedded single-instance backend and by tests).
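To make the claim semantics concrete, here is a minimal sketch of what an InMemoryMutex-style strategy might look like. It is synchronous and takes `now` as a parameter so the expiry rule can be exercised deterministically; the `Job`, `JobState`, and `Lease` types are simplified stand-ins invented for this example, not the types proposed above.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, PartialEq)]
pub enum JobState {
    Accepted,
    InProgress { lease_expiry: Instant },
    Complete,
}

#[derive(Debug)]
pub struct Job {
    pub id: String,
    pub state: JobState,
    /// Last fencing token issued for this job; grows on every claim.
    pub last_fencing_token: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Lease {
    pub job_id: String,
    pub worker_id: String,
    pub lease_expiry: Instant,
    pub fencing_token: u64,
}

/// Hypothetical in-memory claim strategy guarded by a single mutex.
pub struct InMemoryMutexClaim {
    jobs: Mutex<Vec<Job>>,
}

impl InMemoryMutexClaim {
    pub fn new(job_ids: &[&str]) -> Self {
        let jobs = job_ids
            .iter()
            .map(|id| Job {
                id: id.to_string(),
                state: JobState::Accepted,
                last_fencing_token: 0,
            })
            .collect();
        Self { jobs: Mutex::new(jobs) }
    }

    /// Claims the first job that is `Accepted` or whose lease has expired.
    pub fn claim_next(
        &self,
        worker_id: &str,
        lease_duration: Duration,
        now: Instant,
    ) -> Option<Lease> {
        let mut jobs = self.jobs.lock().expect("job list mutex poisoned");
        let job = jobs.iter_mut().find(|j| match &j.state {
            JobState::Accepted => true,
            JobState::InProgress { lease_expiry } => *lease_expiry <= now,
            JobState::Complete => false,
        })?;
        // Every successful claim issues a strictly larger fencing token.
        job.last_fencing_token += 1;
        let lease_expiry = now + lease_duration;
        job.state = JobState::InProgress { lease_expiry };
        Some(Lease {
            job_id: job.id.clone(),
            worker_id: worker_id.to_string(),
            lease_expiry,
            fencing_token: job.last_fencing_token,
        })
    }
}
```

The selection and the state transition happen under one lock, which is the in-memory analogue of doing both inside a single `FOR UPDATE SKIP LOCKED` transaction.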

Two design notes that come up in review:

Why a lease with expiry rather than an explicit ack/nack queue? Because the work is long-lived and idempotent (cursors live in TypeExportProgress). A worker that dies mid-job leaves its lease to expire; another worker picks the job up from the last persisted cursor. This is simpler to operate than a queue with explicit redelivery, and it matches how existing Postgres-backed job-queue libraries in the Rust ecosystem already work.

Why a fencing token? Because the lease-expiry pattern allows two workers to briefly believe they hold the same job if the original worker hung rather than crashed. The fencing token, written into every output-file key and checked on the output store's finalize_part, prevents the zombie worker from corrupting output the live worker is producing. The approach is taken directly from the fencing-token pattern Martin Kleppmann wrote about; nothing novel.
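The fencing check itself is small. Below is a minimal sketch, assuming the output store remembers the highest token it has seen per job and rejects anything older. `FencedOutputStore`, `FinalizeError`, and the key format are hypothetical names chosen for this example; the proposed finalize_part takes an ExportPartKey, which would carry the token in the real design.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum FinalizeError {
    /// The caller holds a lease that has since been superseded.
    StaleFencingToken { presented: u64, current: u64 },
}

/// Hypothetical output store that enforces fencing on finalize.
#[derive(Default)]
pub struct FencedOutputStore {
    highest_token: HashMap<String, u64>,
    finalized: Vec<String>,
}

impl FencedOutputStore {
    /// Finalizes a part only if the presented token is not older than the
    /// newest token already seen for this job.
    pub fn finalize_part(
        &mut self,
        job_id: &str,
        resource_type: &str,
        part_index: u32,
        fencing_token: u64,
    ) -> Result<String, FinalizeError> {
        let current = self.highest_token.entry(job_id.to_string()).or_insert(0);
        if fencing_token < *current {
            return Err(FinalizeError::StaleFencingToken {
                presented: fencing_token,
                current: *current,
            });
        }
        *current = fencing_token;
        // The token is part of the key, so parts written under different
        // leases can never collide on the same object name.
        let key = format!("{job_id}/{fencing_token}/{resource_type}-{part_index:05}.ndjson");
        self.finalized.push(key.clone());
        Ok(key)
    }

    pub fn finalized(&self) -> &[String] {
        &self.finalized
    }
}
```

In this sketch a zombie worker holding token 1 is rejected as soon as the replacement worker has finalized anything under token 2, which is the property the design note relies on.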

The Proposal: File-Download Authorization

The download endpoint is its own small authorization problem. The manifest can be served two ways - requiresAccessToken: true, meaning download URLs point at HFS and require the kick-off's Bearer token; or requiresAccessToken: false, meaning download URLs point at the object store and are pre-signed. The download handler in HFS needs to handle the first case; the second case bypasses HFS entirely.

/// Authorization decision for a bulk export file download.
///
/// `BearerScopeAuth` validates the incoming Bearer token has the same
/// `system/*.rs` scope that authorized the kick-off, and that the
/// token's subject matches the job's owner.
///
/// `PresignedUrlAuth` is used when the manifest publishes pre-signed
/// URLs directly; the download handler is not in the path.
#[async_trait]
pub trait ExportFileAuth: Send + Sync {
    async fn authorize_download(
        &self,
        token: Option<&str>,
        tenant: &TenantContext,
        manifest_entry: &ExportOutputFile,
    ) -> Result<(), ExportAuthError>;
}

The default implementation is BearerScopeAuth. It revalidates the token against the same AuthProvider discussed in Discussion #45, checks that the system/{ResourceType}.rs scope covers the file's resource type, and lets the handler stream the file. The pre-signed URL case never runs through this trait - by the time a client is downloading a pre-signed URL, the object store is doing the auth check via the URL's signature.
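A minimal sketch of the scope check that BearerScopeAuth is described as performing might look like the following. The function names and the flat `&[&str]` scope list are assumptions for illustration; the real implementation would work against the ScopeSet from Discussion #45, and this sketch deliberately ignores SMART v2 scopes that carry query parameters.

```rust
/// Returns true if a single SMART system scope grants read and search
/// on `resource_type`.
pub fn scope_covers(scope: &str, resource_type: &str) -> bool {
    let Some(rest) = scope.strip_prefix("system/") else {
        return false;
    };
    let Some((ty, perms)) = rest.split_once('.') else {
        return false;
    };
    let type_ok = ty == "*" || ty == resource_type;
    let perms_ok = match perms {
        // Legacy v1-style permissions, accepted for backward compatibility.
        "read" | "*" => true,
        // v2 permissions are drawn from "cruds"; export needs r and s.
        p => p.chars().all(|c| "cruds".contains(c)) && p.contains('r') && p.contains('s'),
    };
    type_ok && perms_ok
}

/// Sketch of the download decision: the token's subject must own the job
/// and at least one scope must cover the file's resource type.
pub fn authorize_download(
    scopes: &[&str],
    token_subject: &str,
    job_owner: &str,
    resource_type: &str,
) -> Result<(), &'static str> {
    if token_subject != job_owner {
        return Err("token subject does not own this export job");
    }
    if !scopes.iter().any(|s| scope_covers(s, resource_type)) {
        return Err("no scope covers the file's resource type");
    }
    Ok(())
}
```

Checking ownership before scopes means a valid token from a different client cannot download another client's files even when its scopes are broad.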

The REST Layer: How the Endpoints Wire Up

Four handlers, all in the established HFS style: generic over the storage trait, taking a TenantExtractor, returning RestResult<Response>. We sketch the kick-off and status handlers here; the remaining two follow the same pattern.

pub async fn export_kickoff_handler<S>(
    State(state): State<AppState<S>>,
    Path(level_path): Path<ExportLevelPath>,
    tenant: TenantExtractor,
    ctx: RequestContextExtractor,
    Query(params): Query<ExportQueryParams>,
    headers: HeaderMap,
) -> RestResult<Response>
where
    S: BulkExportStorage + ExportDataProvider + Send + Sync,
{
    let request = ExportRequest::from_query_and_headers(level_path, params, &headers)?;
    state.policy().authorize_kickoff(&ctx, &request).await?;
    let job_id = state.storage().start_export(tenant.context(), request).await?;

    let status_url = state.base_url().join(&format!("/export-status/{}", job_id))?;
    Ok(Response::builder()
        .status(StatusCode::ACCEPTED)
        .header("Content-Location", status_url.as_str())
        .body(Body::empty())?)
}

Three observations on this handler that are easy to miss:

It does not spawn a Tokio task to run the export. The start_export call writes the job row and returns. A worker picks the job up via claim_next out of band. The handler returns within milliseconds even for jobs that will take hours.

It does not know whether the deployment is single-instance or multi-instance. state.storage() is whichever BulkExportStorage was wired in at startup; the handler is identical either way.

It runs inside the auth middleware described in Discussion #45. By the time this function runs, ctx is a fully validated RequestContext containing a Principal with a ScopeSet. The handler does not re-validate the token; it only calls authorize_kickoff on whatever AuthorizationPolicy is configured, which evaluates SMART system scopes (and any composed deployment-specific policies) against the requested export.

The status handler is symmetric:

pub async fn export_status_handler<S>(
    State(state): State<AppState<S>>,
    Path(job_id): Path<ExportJobId>,
    tenant: TenantExtractor,
    ctx: RequestContextExtractor,
) -> RestResult<Response>
where
    S: BulkExportStorage + Send + Sync,
{
    let progress = state.storage().get_export_status(tenant.context(), &job_id).await?;

    match progress.status {
        ExportStatus::Accepted | ExportStatus::InProgress => {
            Ok(Response::builder()
                .status(StatusCode::ACCEPTED)
                .header("X-Progress", progress.message.unwrap_or_default())
                .header("Retry-After", "1")
                .body(Body::empty())?)
        }
        ExportStatus::Complete => {
            let manifest = state.storage().get_export_manifest(tenant.context(), &job_id).await?;
            Ok(Response::builder()
                .status(StatusCode::OK)
                .header("Content-Type", "application/json")
                .header("Expires", manifest_expires(&manifest))
                .body(Body::from(serde_json::to_vec(&manifest)?))?)
        }
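        // Terminal failure: 500 with an OperationOutcome body (elided in this sketch).
        ExportStatus::Error => todo!("500 + OperationOutcome"),
        // Cancelled job: 404 with an OperationOutcome, per the IG (elided in this sketch).
        ExportStatus::Cancelled => todo!("404 + OperationOutcome per IG"),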
    }
}

Content-Location URLs are constructed from HFS_BASE_URL plus the tenant prefix (when HFS_TENANT_ROUTING_MODE=url_path) plus /export-status/{job_id}. They are absolute. They survive load balancer changes because every HFS instance constructs the same URL for the same job, and every instance can answer the poll against the shared BulkExportStorage.
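The URL construction described above can be sketched as plain string assembly. `status_url` is a hypothetical helper, not an existing HFS function, and it assumes the job ID is already URL-safe. Building the path explicitly also avoids a subtlety of URL-joining APIs, where joining a path that begins with a slash typically replaces the existing path and would drop a tenant prefix.

```rust
/// Builds the absolute status URL from the configured base URL, an optional
/// tenant path prefix, and the job ID. Assumes `job_id` is URL-safe.
pub fn status_url(base_url: &str, tenant_prefix: Option<&str>, job_id: &str) -> String {
    let mut url = base_url.trim_end_matches('/').to_string();
    if let Some(prefix) = tenant_prefix {
        // Tolerate prefixes configured with or without surrounding slashes.
        let prefix = prefix.trim_matches('/');
        if !prefix.is_empty() {
            url.push('/');
            url.push_str(prefix);
        }
    }
    url.push_str("/export-status/");
    url.push_str(job_id);
    url
}
```

Because the function depends only on configuration and the job ID, every instance behind the load balancer produces the same string for the same job.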

Error semantics. Per-resource-type failures during the run are not catastrophic - they accumulate as OperationOutcome resources in error[] NDJSON files attached to the manifest, and the job still terminates Complete. Only conditions that prevent the export from producing any valid output (authorization failure mid-stream, total backend outage, output-store failure on every write) transition the job to ExportStatus::Error. This matches the IG's expectation that bulk jobs prefer partial success over hard failure.

Group Export: The Hard Part

Group/[id]/$export is where the spec's edges become apparent. The export returns "every resource in the patient compartment of every patient who is a member of this group" - which depends on three things HFS has to compute on the fly.

First, who are the members? Groups can list members directly, list nested Groups whose members must be flattened, or (in the forthcoming Bulk Cohort profile) carry member-filter modifier extensions whose values are FHIR search expressions to evaluate against the live data. GroupExportProvider::get_group_members handles the direct case; resolve_group_patient_ids handles the rest.

Second, what is the patient compartment for this FHIR version? Compartments are defined per FHIR version by CompartmentDefinition resources; the mapping from (version, resource_type) to (search_param_names) is generated alongside the FHIR models. HFS already has this lookup at crates/rest/src/handlers/compartment.rs::get_compartment_params_for_version. The bulk export worker reuses it.

Third, how do we enumerate efficiently? For each requested resource type, the worker calls fetch_patient_compartment_batch with the resolved patient ID list and a cursor. The implementation chooses whether to issue per-patient queries, range queries, or a single chunked query depending on what the underlying store is good at; the trait does not prescribe.
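For the first of the three questions, flattening nested Groups is a graph traversal that has to tolerate cycles and duplicate members. Here is a minimal in-memory sketch; the `Member` enum and the `HashMap` standing in for the Group store are assumptions for this example, and the real resolve_group_patient_ids is async and reads from the persistence layer.

```rust
use std::collections::{BTreeSet, HashMap, HashSet};

/// A Group member is either a patient or another (nested) Group.
pub enum Member {
    Patient(String),
    Group(String),
}

/// Flattens a Group into a sorted, de-duplicated list of patient IDs.
/// Unknown group IDs resolve to no members; cycles are visited once.
pub fn resolve_group_patient_ids(
    groups: &HashMap<String, Vec<Member>>,
    group_id: &str,
) -> Vec<String> {
    let mut patients = BTreeSet::new();
    let mut visited = HashSet::new();
    let mut stack = vec![group_id.to_string()];

    while let Some(current) = stack.pop() {
        // `insert` returns false if the group was already visited,
        // which is what breaks membership cycles.
        if !visited.insert(current.clone()) {
            continue;
        }
        let Some(members) = groups.get(&current) else {
            continue;
        };
        for member in members {
            match member {
                Member::Patient(id) => {
                    patients.insert(id.clone());
                }
                Member::Group(id) => stack.push(id.clone()),
            }
        }
    }
    patients.into_iter().collect()
}
```

Returning a sorted, de-duplicated list gives the worker a stable ordering to chunk over, which matters when the patient ID list is later combined with a persisted cursor.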

The IG's behavior on _since plus group membership has a subtle wrinkle that operators should be aware of: if a patient was added to the group after _since, the server MAY return that patient's resources from before _since (because they were not part of the group at that time, but are now). The current draft says "behavior SHOULD be documented" - we will document our choice (we plan to include them by default, matching the prevalent reading) and let operators override via HFS_BULK_EXPORT_SINCE_NEWLY_ADDED=exclude when their use case demands the alternative.
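The include/exclude choice reduces to a small predicate per resource. The sketch below is one possible reading, under two assumptions that are not settled in this document: that timestamps are comparable integers (for example, epoch seconds), and that the server knows when each patient joined the group (`member_added_at`, a hypothetical input that may not always be available).

```rust
/// Mirrors the two values of `HFS_BULK_EXPORT_SINCE_NEWLY_ADDED`.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum SinceNewlyAdded {
    Include,
    Exclude,
}

/// Decides whether a resource belongs in a Group export.
///
/// * `last_updated`    - the resource's last-updated time
/// * `since`           - the `_since` parameter, if supplied
/// * `member_added_at` - when the patient joined the group, if known
pub fn include_resource(
    last_updated: u64,
    since: Option<u64>,
    member_added_at: Option<u64>,
    mode: SinceNewlyAdded,
) -> bool {
    // Without `_since`, everything in the compartment is exported.
    let Some(since) = since else {
        return true;
    };
    // Resources changed after `_since` are always exported.
    if last_updated > since {
        return true;
    }
    // Older resources are exported only for patients who joined the group
    // after `_since`, and only when the operator has not opted out.
    let newly_added = member_added_at.map_or(false, |added| added > since);
    newly_added && mode == SinceNewlyAdded::Include
}
```

When the join time is unknown, this sketch treats the patient as not newly added, which is the conservative choice for incremental exports.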

The Bulk Cohort member-filter profile - where a client can POST a Group whose membership is defined by FHIR search criteria evaluated server-side - is intentionally out of scope for the first cut. It is its own design problem (asynchronous Group construction, dynamic membership, refresh semantics) and deserves its own discussion document once the export plumbing is in production.

Authorization

Bulk export's authorization story is short because Discussion #45 has already done the work.

By the time an export handler runs, the auth middleware has produced a RequestContext with a validated Principal and a ScopeSet. The export kick-off handler asks the same AuthorizationPolicy trait whether the principal's scopes cover the requested export. The scopes that matter are the standard SMART Backend Services system scopes:

Scope	Covers
`system/*.rs`	Read and search every resource type - the broadest bulk scope.
`system/Patient.rs`	Read and search Patient. Required at minimum for any Patient or Group export.
`system/Observation.rs`	Read and search Observation. Required to include Observation in any export.
`system/[Type].read`	The legacy v1-style alias; the policy implementation should accept it for backward compatibility.

The composability described in #45 carries through here. A deployment that wants additional restrictions on bulk operations - say, a BulkRateLimitPolicy that throttles concurrent jobs per client, or a BulkTenantQuotaPolicy that caps total exported volume per tenant per day - implements AuthorizationPolicy and composes it via CompositeAuthorizationPolicy. The export handlers do not know these policies exist; they only know that authorize_kickoff returned Permit.
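To illustrate the composition idea, here is a minimal synchronous sketch in which every policy must permit the kick-off. The trait shape, `Decision`, `KickoffRequest`, and `ScopePolicy` are simplified assumptions made for this example; the actual AuthorizationPolicy and CompositeAuthorizationPolicy are defined in Discussion #45 and may differ.

```rust
#[derive(Debug, PartialEq)]
pub enum Decision {
    Permit,
    Deny(String),
}

/// Simplified stand-in for the request context seen by a policy.
pub struct KickoffRequest {
    pub client_id: String,
    pub has_system_scope: bool,
    pub in_flight_jobs: u32,
}

pub trait AuthorizationPolicy {
    fn authorize_kickoff(&self, request: &KickoffRequest) -> Decision;
}

/// Stand-in for the SMART system-scope check.
pub struct ScopePolicy;

impl AuthorizationPolicy for ScopePolicy {
    fn authorize_kickoff(&self, request: &KickoffRequest) -> Decision {
        if request.has_system_scope {
            Decision::Permit
        } else {
            Decision::Deny("missing system scope".to_string())
        }
    }
}

/// Throttles concurrent jobs per client.
pub struct BulkRateLimitPolicy {
    pub max_concurrent_jobs: u32,
}

impl AuthorizationPolicy for BulkRateLimitPolicy {
    fn authorize_kickoff(&self, request: &KickoffRequest) -> Decision {
        if request.in_flight_jobs >= self.max_concurrent_jobs {
            Decision::Deny(format!(
                "client {} already has {} jobs in flight",
                request.client_id, request.in_flight_jobs
            ))
        } else {
            Decision::Permit
        }
    }
}

/// Permits only if every composed policy permits; the first denial wins.
pub struct CompositeAuthorizationPolicy {
    pub policies: Vec<Box<dyn AuthorizationPolicy>>,
}

impl AuthorizationPolicy for CompositeAuthorizationPolicy {
    fn authorize_kickoff(&self, request: &KickoffRequest) -> Decision {
        for policy in &self.policies {
            if let Decision::Deny(reason) = policy.authorize_kickoff(request) {
                return Decision::Deny(reason);
            }
        }
        Decision::Permit
    }
}
```

The composite implements the same trait as its parts, which is why a handler that calls authorize_kickoff cannot tell one policy from a stack of them.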

What this means in practice: there is no new auth surface for bulk export. The same JwksBearerAuthProvider you configured for the rest of HFS validates the kick-off token. The same scope syntax governs what can be exported. The same audit trail records every job.

Configuration: What Operators Will Touch

Variable	Default	Description
`HFS_BULK_EXPORT_ENABLED`	`true`	Master switch. When `false`, the operation endpoints return `501 Not Implemented`.
`HFS_BULK_EXPORT_BACKEND`	`embedded`	Job-state backend: `embedded` (SQLite + local FS), `postgres-s3`, `redis-s3`.
`HFS_BULK_EXPORT_DATABASE_URL`	(from `HFS_DATABASE_URL`)	Connection string for the job-state store when distinct from the resource store.
`HFS_BULK_EXPORT_OUTPUT_BACKEND`	`local-fs`	Output store: `local-fs` or `s3`.
`HFS_BULK_EXPORT_OUTPUT_DIR`	`${HFS_DATA_DIR}/exports`	Local FS root for output files.
`HFS_BULK_EXPORT_S3_BUCKET`	(none)	Bucket for output files when `OUTPUT_BACKEND=s3`.
`HFS_BULK_EXPORT_REQUIRES_ACCESS_TOKEN`	`auto`	Manifest hint: `auto` (pre-signed when supported), `true` (always token), `false` (always pre-signed).
`HFS_BULK_EXPORT_FILE_URL_TTL`	`3600`	Seconds. Pre-signed URL lifetime in the manifest.
`HFS_BULK_EXPORT_OUTPUT_TTL`	`86400`	Seconds. How long output files are retained after job completion.
`HFS_BULK_EXPORT_WORKER_CONCURRENCY`	`2`	Maximum jobs this pod runs concurrently.
`HFS_BULK_EXPORT_DISABLE_LOCAL_WORKER`	`false`	When `true`, this pod does not run workers (use with separate `hfs-exporter`).
`HFS_BULK_EXPORT_MAX_CONCURRENT_PER_TENANT`	`4`	Cap on simultaneous in-flight jobs per tenant.
`HFS_BULK_EXPORT_BATCH_SIZE`	`1000`	Resources per `fetch_export_batch` call.
`HFS_BULK_EXPORT_LEASE_DURATION`	`60`	Seconds. Initial lease length issued at claim.
`HFS_BULK_EXPORT_HEARTBEAT_INTERVAL`	`20`	Seconds. Worker heartbeat cadence.
`HFS_BULK_EXPORT_SINCE_NEWLY_ADDED`	`include`	For Group exports: `include` or `exclude` resources from before `_since` for patients added after `_since`.

The single-instance default - HFS_BULK_EXPORT_BACKEND=embedded with HFS_BULK_EXPORT_OUTPUT_BACKEND=local-fs - requires zero additional configuration on top of the standard HFS environment. A deployment grows into the multi-instance path by changing two variables and pointing at a PostgreSQL instance and an S3 bucket; no code changes, no different binary.
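As a rough sketch of how a few of these variables might be loaded with the defaults from the table, consider the following. `BulkExportConfig` and `from_lookup` are hypothetical names; the lookup is passed in as a closure so the parsing can be tested without touching the process environment. The check that the heartbeat interval is shorter than the lease duration is an assumption of this sketch, motivated by the 60/20 defaults, not a rule stated above.

```rust
use std::str::FromStr;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub struct BulkExportConfig {
    pub enabled: bool,
    pub batch_size: u32,
    pub lease_duration: Duration,
    pub heartbeat_interval: Duration,
}

/// Parses one variable, falling back to `default` when it is unset.
fn parse_or<T, F>(lookup: &F, key: &str, default: T) -> Result<T, String>
where
    T: FromStr,
    F: Fn(&str) -> Option<String>,
{
    match lookup(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse()
            .map_err(|_| format!("{key}: invalid value {raw:?}")),
    }
}

impl BulkExportConfig {
    /// Builds the config from any key lookup (environment, test map, ...).
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let enabled = parse_or(&lookup, "HFS_BULK_EXPORT_ENABLED", true)?;
        let batch_size = parse_or(&lookup, "HFS_BULK_EXPORT_BATCH_SIZE", 1000u32)?;
        let lease_secs = parse_or(&lookup, "HFS_BULK_EXPORT_LEASE_DURATION", 60u64)?;
        let heartbeat_secs = parse_or(&lookup, "HFS_BULK_EXPORT_HEARTBEAT_INTERVAL", 20u64)?;

        // Assumption of this sketch: a worker must be able to heartbeat
        // at least once before its lease expires.
        if heartbeat_secs >= lease_secs {
            return Err(format!(
                "heartbeat interval ({heartbeat_secs}s) must be shorter than lease duration ({lease_secs}s)"
            ));
        }
        Ok(Self {
            enabled,
            batch_size,
            lease_duration: Duration::from_secs(lease_secs),
            heartbeat_interval: Duration::from_secs(heartbeat_secs),
        })
    }

    /// Convenience wrapper that reads from the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Failing at startup on an inconsistent lease/heartbeat pair is cheaper than discovering it later as a stream of spuriously reclaimed jobs.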

Conformance Testing

The Inferno Bulk Data Test Kit is the canonical conformance harness for FHIR bulk data servers. It exercises every kick-off variant, the polling pattern, the manifest schema, the NDJSON output format, the cancellation flow, and SMART Backend Services authorization end to end. After the initial implementation lands, we will:

Spin up HFS in a Docker Compose configuration (HFS + PostgreSQL + MinIO + Keycloak for SMART) suitable for Inferno to exercise.
Wire a cargo xtask inferno-bulk-data target that runs the test kit headlessly against this configuration.
Add the run to .github/workflows/inferno.yml alongside the existing test kits.
Publish the Inferno conformance badge in crates/hfs/README.md next to the other test-kit badges.

This is intentionally separate from unit and integration tests in the workspace. Inferno tests are slow, network-bound, and authoritative - they belong in CI as a nightly job, not on every PR.

What's Not in Scope (Yet)

A handful of things are deliberately deferred. None are blockers; each is a follow-up.

$bulk-submit (the inverse direction - large NDJSON payloads into the server). The Argonaut Project's current draft is at https://hackmd.io/@argonaut/rJoqHZrPle. It will get its own discussion document once the draft stabilizes; the shared-state architecture proposed here generalizes naturally to ingestion.
The Bulk Cohort member-filter profile for dynamic Group construction. This is its own design problem (asynchronous Group creation, dynamic membership, refresh semantics) and deserves its own discussion.
Legacy $import from earlier IG drafts. Superseded by $bulk-submit; we will not implement the legacy shape.
Prefer: separate-export-status (the variant where status polling returns 200 OK with an X-Export-Status header instead of 202 Accepted). Marked as a follow-up; it is a low-effort addition once the core flow is in place.
organizeOutputBy (reorganized output with Parameters header blocks per group). Wait for broader IG adoption before committing to it.
includeAssociatedData=LatestProvenanceResources and similar Provenance hints. Implement once the audit subsystem's Provenance support lands.

Proposed Next Steps

The traits sketched above are a starting point. To move toward implementation:

Ship the embedded single-instance default first. SQLite job-state backend + local-FS output store + in-process worker. This gets every test, every demo, and every single-VM deployment unblocked, and exercises the trait surface end to end before the multi-instance work begins.
Add ExportOutputStore and ExportClaimStrategy to helios-persistence. Two new traits; refactor the existing S3 bulk-export code to satisfy them rather than implementing everything inside the S3 backend's BulkExportStorage.
Implement PostgresSkipLocked and the PostgreSQL BulkExportStorage. With the S3 ExportOutputStore, this is the multi-instance default. Cover with testcontainers integration tests against real Postgres + MinIO.
Wire the REST handlers in helios-rest. Four handlers: kick-off (with sub-routes for system, patient, group), status, cancel/delete, file-download. Plumb Content-Location, X-Progress, Retry-After, Expires, and the manifest content type correctly.
Add pre-signed URL generation to the existing S3 backend. A short addition; the existing AwsS3Client already supports it via the AWS SDK.
Bundle a docker compose configuration with HFS + PostgreSQL + MinIO + Keycloak, suitable for running the Inferno Bulk Data Test Kit locally and in CI.
Wire the Inferno Bulk Data Test Kit into .github/workflows/inferno.yml as a nightly conformance job, and publish the badge.
Document the HFS_BULK_EXPORT_* environment variables in CLAUDE.md and crates/hfs/README.md, including the single-instance vs multi-instance configuration recipes.
Add audit events for bulk-export lifecycle via the existing record_export_event helper already in crates/persistence/src/core/bulk_export.rs::audit, plumbed through the kick-off, completion, cancellation, and download handlers.
Open the $bulk-submit discussion once the Argonaut draft is stable, building on the shared-state architecture established here.

Closing Thoughts

Bulk export is the API that turns a FHIR server from a transaction processor into a data platform. Population health teams, research data lakes, payer-provider exchanges, AI training pipelines - none of them are reading one Patient at a time. They are reading entire compartments, entire cohorts, entire systems, and they want to do it asynchronously, resumably, and at a rate that does not require an arrangement with the FHIR server's on-call.

The architecture proposed here is built around two convictions. First, that the same trait surface should serve a single VM with SQLite and a horizontally scaled fleet with PostgreSQL and S3 - the operator chooses, the code does not change. Second, that the long-running, bandwidth-heavy parts of an export should be cleanly separable from the request-serving HFS process, so that operators can scale them independently without rewriting handlers.

The Rust trait system makes both convictions enforceable. The compiler guarantees that every export handler receives a validated RequestContext and a TenantContext; that BulkExportStorage, ExportDataProvider, and ExportOutputStore are independently replaceable; and that the worker runtime is identical whether it is co-located with the REST API or running standalone. These guarantees hold regardless of how complex the deployment becomes, and they hold across the inevitable migrations from "we started on SQLite and outgrew it" to "we now run on Postgres + S3 + a separate worker tier".

After the implementation lands, the Inferno Bulk Data Test Kit becomes the daily check on whether HFS is a conformant bulk data server. The kit covers every kick-off variant, every flavor of the polling state machine, every required manifest field, the NDJSON contract, and SMART Backend Services authorization end to end. Treating Inferno conformance as a non-negotiable in CI is what turns "we shipped bulk export" into "we shipped a bulk export implementation interoperable with the rest of the ecosystem".

Thank you for reading. I look forward to the discussion.

Steve
