From 1ee9ee0e45d6d2843d7f9e8793eed0ccafe83b29 Mon Sep 17 00:00:00 2001 From: Haiyuan Cao Date: Wed, 3 Jun 2026 16:49:59 -0700 Subject: [PATCH] docs(integrations/bqaa): document 2.0.0 reliability fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reflects the three reliability/observability fixes that landed in google/adk-python commit a5fa3da0: 1. Cross-region Storage Write API routing — writes to non-US datasets (EU, northamerica-northeast1, …) now route to the region that owns the write stream instead of failing silently. 2. Dropped-event observability — adds `BigQueryAgentAnalyticsPlugin.get_drop_stats()` with the `drop_reason` enumeration (queue_full, arrow_prep_failed, retry_exhausted, non_retryable, unexpected_error) so hosts can poll the counts and ship them to their own monitoring. 3. No duplicate spans in Cloud Trace — the plugin no longer calls `tracer.start_span(...)` on the configured tracer provider, so it does not surface a duplicate span next to every framework span when Agent Engine telemetry (GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true) or any other Cloud Trace exporter is wired into the host. `trace_id` is still inherited from the ambient OTel span so BigQuery rows continue to join cleanly to Cloud Trace traces. Updates ------- * Add a "Version 2.0.0 bundles three reliability and observability fixes" subsection to the page intro. * Rewrite the "Tracing and observability" section so the Automatic-Span-Management / OTel-Integration / Fallback bullets match the new no-export behavior. Add a "Migrating from 1.x" callout for anyone who was relying on the plugin to feed their own OTel exporter. * Update the quickstart's OTel TracerProvider comment block to reflect that the plugin only inherits trace_id and does not itself create exportable spans. * Add `get_drop_stats()` to the Public methods list and a dedicated "Dropped-event observability" subsection with the drop_reason table and a polling/export example. No structural changes to the schema, event-types, query-recipes, Agent Engine deploy walkthrough, or security sections. Refs ---- * google/adk-python commit a5fa3da0 ("feat: BigQuery Agent Analytics reliability fixes") * haiyuan-eng-google/BigQuery-Agent-Analytics-SDK#94 (the customer bug the Cloud Trace fix closes) --- docs/integrations/bigquery-agent-analytics.md | 165 ++++++++++++++---- 1 file changed, 134 insertions(+), 31 deletions(-) diff --git a/docs/integrations/bigquery-agent-analytics.md b/docs/integrations/bigquery-agent-analytics.md index 6e1d289844..f6ab4e8871 100644 --- a/docs/integrations/bigquery-agent-analytics.md +++ b/docs/integrations/bigquery-agent-analytics.md @@ -30,6 +30,25 @@ TRANSFER_AGENT, TRANSFER_A2A), and **HITL Event Tracing** for human-in-the-loop interactions. Version 1.27.0 adds **Automatic View Creation** (generate flat, query-friendly event views). +The plugin includes three reliability and observability fixes: + +- **Cross-region Storage Write API routing.** Writes to BigQuery datasets + outside the `US` multi-region (for example `EU` or `northamerica-northeast1`) + now route to the region that owns the write stream. Previously they could + fail with a "session not found" / stream-not-found error and silently drop + every row. +- **Dropped-event observability.** Dropped rows are tracked per drop reason + (`queue_full`, `arrow_prep_failed`, `retry_exhausted`, `non_retryable`, + `unexpected_error`) and exposed via + `BigQueryAgentAnalyticsPlugin.get_drop_stats()` so a host can poll and export + the counts to its own monitoring. +- **No duplicate spans in Cloud Trace.** When Agent Engine telemetry + (`GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true`) or any other Cloud Trace + exporter is wired to the global tracer provider, the plugin no longer + produces a duplicate span next to each framework span. The plugin still + inherits `trace_id` from the ambient OTel span, so BigQuery rows continue to + join cleanly to Cloud Trace traces. + !!! warning "BigQuery Storage Write API" This feature uses **BigQuery Storage Write API**, which is a paid service. @@ -150,15 +169,21 @@ LIMIT 20; from google.adk.tools.bigquery import BigQueryToolset, BigQueryCredentialsConfig - # --- OpenTelemetry TracerProvider Setup (Optional) --- - # ADK includes OpenTelemetry as a core dependency. - # Configuring a TracerProvider enables full distributed tracing - # (populates trace_id, span_id with standard OTel identifiers). - # If no TracerProvider is configured, the plugin falls back to internal - # UUIDs for span correlation while still preserving the parent-child hierarchy. - from opentelemetry import trace - from opentelemetry.sdk.trace import TracerProvider - trace.set_tracer_provider(TracerProvider()) + # --- OpenTelemetry note (no setup required for BQAA) --- + # The BQAA plugin does NOT export OTel spans of its own; it generates + # span_id values internally as 16-hex strings. The plugin's `trace_id` + # column inherits from whichever OpenTelemetry span is active in the + # surrounding runtime when the agent runs: + # * Agent Engine wires its invocation span automatically, so + # `trace_id` in BigQuery joins to Cloud Trace out of the box. + # * Locally, framework-instrumented runners open an invocation span + # for you. + # * If neither is available, the plugin falls back to a per-invocation + # trace_id and the parent-child hierarchy is still preserved in + # BigQuery — no OTel setup needed. + # Setting a bare `TracerProvider` with no ambient span will NOT cause + # `trace_id` to be populated with a "real" OTel id; only an *active* + # span does. See the "Tracing and observability" section for details. # --- Configuration --- PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "your-gcp-project-id") @@ -378,9 +403,9 @@ provides a comprehensive reference with example values. | **session_id** | `STRING` | `NULLABLE` | A persistent identifier for the entire conversation thread. Stays constant across multiple turns and sub-agent calls. | `04275a01-1649-4a30-b6a7-5b443c69a7bc` | | **invocation_id** | `STRING` | `NULLABLE` | The unique identifier for a single execution turn or request cycle. Corresponds to `trace_id` in many contexts. | `e-b55b2000-68c6-4e8b-b3b3-ffb454a92e40` | | **user_id** | `STRING` | `NULLABLE` | The identifier of the user (human or system) initiating the session. Extracted from the `User` object or metadata. | `test_user` | -| **trace_id** | `STRING` | `NULLABLE` | The **OpenTelemetry** Trace ID (32-char hex). Links all operations within a single distributed request lifecycle. | `e-b55b2000-68c6-4e8b-b3b3-ffb454a92e40` | -| **span_id** | `STRING` | `NULLABLE` | The **OpenTelemetry** Span ID (16-char hex). Uniquely identifies this specific atomic operation. | `69867a836cd94798be2759d8e0d70215` | -| **parent_span_id** | `STRING` | `NULLABLE` | The Span ID of the immediate caller. Used to reconstruct the parent-child execution tree (DAG). | `ef5843fe40764b4b8afec44e78044205` | +| **trace_id** | `STRING` | `NULLABLE` | 32-character hex Trace ID. Inherited from the ambient OpenTelemetry span when one is active (e.g. Agent Engine's invocation span or the ADK Runner span) so BigQuery rows join cleanly to your existing Cloud Trace traces; otherwise generated by the plugin per invocation. Links all operations within a single distributed request lifecycle. | `a2c7f13d3a3f0bbb8793692f76a6012a` | +| **span_id** | `STRING` | `NULLABLE` | 16-character hex Span ID identifying this specific atomic operation. **Generated by the plugin internally** (the plugin does not call `tracer.start_span` against your configured OpenTelemetry provider — see [Tracing and observability](#tracing-and-observability)). | `3916f5762bcd4d42` | +| **parent_span_id** | `STRING` | `NULLABLE` | 16-character hex Span ID of the immediate caller. Used to reconstruct the parent-child execution tree (DAG). | `4c4a42bfdeb84934` | | **content** | `JSON` | `NULLABLE` | The primary event payload. Structure is polymorphic based on `event_type`. | `{"system_prompt": "You are...", "prompt": [{"role": "user", "content": "hello"}], "response": "Hi", "usage": {"total": 15}}` | | **attributes** | `JSON` | `NULLABLE` | Metadata/Enrichment (usage stats, model info, tool provenance, custom tags). | `{"model": "gemini-flash-latest", "usage_metadata": {"total_token_count": 15}, "session_metadata": {"session_id": "...", "app_name": "...", "user_id": "...", "state": {}}, "custom_tags": {"env": "prod"}}` | | **latency_ms** | `JSON` | `NULLABLE` | Performance metrics. Standard keys are `total_ms` (wall-clock duration) and `time_to_first_token_ms` (streaming latency). | `{"total_ms": 1250, "time_to_first_token_ms": 450}` | @@ -403,9 +428,9 @@ you can optionally create the table manually using the DDL below. session_id STRING OPTIONS(description="A unique identifier to group events within a single conversation or user session."), invocation_id STRING OPTIONS(description="A unique identifier for each individual agent execution or turn within a session."), user_id STRING OPTIONS(description="The identifier of the user associated with the current session."), - trace_id STRING OPTIONS(description="OpenTelemetry trace ID for distributed tracing."), - span_id STRING OPTIONS(description="OpenTelemetry span ID for this specific operation."), - parent_span_id STRING OPTIONS(description="OpenTelemetry parent span ID to reconstruct hierarchy."), + trace_id STRING OPTIONS(description="32-char hex trace ID. Inherited from the ambient OpenTelemetry span when one is active; otherwise generated per invocation by the plugin."), + span_id STRING OPTIONS(description="16-char hex span ID for this specific operation. Generated internally by the plugin (no OpenTelemetry span is created or exported)."), + parent_span_id STRING OPTIONS(description="16-char hex span ID of the immediate caller, used to reconstruct the parent-child execution tree."), content JSON OPTIONS(description="The event-specific data (payload) stored as JSON."), content_parts ARRAY Span -> Tool/LLM) - is *always* preserved in the BigQuery logs, even without a configured - `TracerProvider`. +The plugin populates the `trace_id`, `span_id`, and `parent_span_id` columns on +every emitted row so the parent-child execution tree (Agent → LLM call / Tool +call) reconstructs cleanly from BigQuery. + +- **Internal span tracking, no OTel span export.** The plugin generates + `span_id` values directly as 16-hex strings and tracks the parent-child + hierarchy on its own internal stack. It does **not** call + `tracer.start_span(...)` on any configured OpenTelemetry + `TracerProvider`, so its instrumentation never reaches your configured + exporter — this is what prevents duplicate spans in Cloud Trace when Agent + Engine telemetry is enabled (`GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true`) + or when you wire any other Cloud Trace exporter into the host process. +- **`trace_id` inherited from the ambient OTel span when present.** If the + surrounding runtime has already started an OTel span — Agent Engine's + invocation span, the ADK `Runner` invocation span, or any span you opened + before the agent runs — the plugin reads its `trace_id` and stamps it on + every BigQuery row. BigQuery rows therefore join cleanly to your existing + Cloud Trace traces via a shared `trace_id`. +- **Fallback when no ambient span is present.** If no ambient OTel span is + active (e.g. a non-Agent-Engine deployment with no host-side tracer + configured), the plugin generates a per-invocation 32-hex `trace_id` so the + parent-child hierarchy is always preserved in BigQuery, even without any + external tracer setup. +- **No `TracerProvider` is required.** Configuring an OpenTelemetry + `TracerProvider` in your host process is optional. It only matters if you + want the plugin's `trace_id` to be sourced from your own pre-existing + ambient span (e.g. to correlate against telemetry from non-ADK services). + The plugin no longer needs the provider for its own bookkeeping. + +!!! info "If you relied on the plugin to feed your OTel exporter" + + Some older configurations used the BQAA plugin as a side channel for + OpenTelemetry span emission — that path is intentionally gone. Configure + OTel instrumentation in the host application instead (Agent Engine wires + this automatically; for local deployments use ADK's own framework + instrumentation or an explicit `TracerProvider`). The plugin's BigQuery + rows will continue to join to your traces via `trace_id`. ### Public methods @@ -1449,6 +1495,9 @@ The plugin exposes several public methods for lifecycle management: - **`await plugin.create_analytics_views()`**: Manually (re-)create all per-event-type analytics views. Useful after a schema upgrade or when views need to be refreshed. +- **`plugin.get_drop_stats()`**: Return a snapshot of dropped-event counts per + `drop_reason`. See [Dropped-event + observability](#dropped-event-observability) below. - **Async context manager**: The plugin supports `async with` for automatic startup and shutdown: @@ -1461,6 +1510,60 @@ The plugin exposes several public methods for lifecycle management: # plugin.shutdown() is called automatically on exit ``` +### Dropped-event observability {#dropped-event-observability} + +BigQuery logging is best-effort — events can be dropped when the in-memory +queue overflows or when a write ultimately fails. The plugin tracks dropped +rows per `drop_reason` and exposes a polling API so a host can detect, alert +on, and ship the counts to its own monitoring. + +**Drop reasons:** + +| Reason | Cause | +|---|---| +| `queue_full` | The in-memory batch queue overflowed (host produces events faster than the drainer can ship). Increase `queue_max_size` on `BigQueryLoggerConfig`, raise `batch_size` to drain in larger chunks, or scale the consumer side (more concurrent invocations finishing faster). | +| `arrow_prep_failed` | A row could not be converted to its Arrow representation (typically schema/type mismatch). Inspect logs for the offending field. | +| `retry_exhausted` | The Storage Write API call kept returning a retryable error (e.g. transient gRPC failures) until the retry budget was used up. | +| `non_retryable` | Storage Write API returned a non-retryable error (permissions, quota, schema rejection). Usually requires operator intervention. | +| `unexpected_error` | Any other exception caught while preparing or writing the batch. | + +**Reading the counts:** + +```python +# Snapshot of {drop_reason: count} since plugin start. +stats = plugin.get_drop_stats() +# Example: {"queue_full": 12, "retry_exhausted": 0, ...} + +total_dropped = sum(stats.values()) +``` + +**Exporting to your monitoring system** — poll periodically and ship the deltas: + +```python +import asyncio + +async def export_loop(plugin): + last = {k: 0 for k in ( + "queue_full", "arrow_prep_failed", + "retry_exhausted", "non_retryable", "unexpected_error", + )} + while True: + current = plugin.get_drop_stats() + for reason, count in current.items(): + delta = count - last.get(reason, 0) + if delta: + # e.g. metric_client.write_point( + # metric="bqaa_dropped_events", + # labels={"reason": reason}, value=delta) + ... + last = current + await asyncio.sleep(60) +``` + +A non-zero `queue_full` or `retry_exhausted` count on a sustained basis is the +clearest signal that BQAA is at risk of data loss — surface it on a dashboard or +alert. + ### Multiprocessing and fork safety The plugin is fork-aware: it sets `GRPC_ENABLE_FORK_SUPPORT=1` before loading