165 changes: 134 additions & 31 deletions docs/integrations/bigquery-agent-analytics.md
@@ -30,6 +30,25 @@ TRANSFER_AGENT, TRANSFER_A2A), and **HITL Event Tracing** for human-in-the-loop
interactions. Version 1.27.0 adds **Automatic View Creation** (generate flat,
query-friendly event views).

The plugin includes three reliability and observability fixes:

- **Cross-region Storage Write API routing.** Writes to BigQuery datasets
outside the `US` multi-region (for example `EU` or `northamerica-northeast1`)
now route to the region that owns the write stream. Previously they could
fail with a "session not found" / stream-not-found error and silently drop
every row.
- **Dropped-event observability.** Dropped rows are tracked per drop reason
(`queue_full`, `arrow_prep_failed`, `retry_exhausted`, `non_retryable`,
`unexpected_error`) and exposed via
`BigQueryAgentAnalyticsPlugin.get_drop_stats()` so a host can poll and export
the counts to its own monitoring.
- **No duplicate spans in Cloud Trace.** When Agent Engine telemetry
(`GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true`) or any other Cloud Trace
exporter is wired to the global tracer provider, the plugin no longer
produces a duplicate span next to each framework span. The plugin still
inherits `trace_id` from the ambient OTel span, so BigQuery rows continue to
join cleanly to Cloud Trace traces.

!!! warning "BigQuery Storage Write API"

This feature uses **BigQuery Storage Write API**, which is a paid service.
@@ -150,15 +169,21 @@ LIMIT 20;
from google.adk.tools.bigquery import BigQueryToolset, BigQueryCredentialsConfig


# --- OpenTelemetry TracerProvider Setup (Optional) ---
# ADK includes OpenTelemetry as a core dependency.
# Configuring a TracerProvider enables full distributed tracing
# (populates trace_id, span_id with standard OTel identifiers).
# If no TracerProvider is configured, the plugin falls back to internal
# UUIDs for span correlation while still preserving the parent-child hierarchy.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
trace.set_tracer_provider(TracerProvider())
# --- OpenTelemetry note (no setup required for BQAA) ---
# The BQAA plugin does NOT export OTel spans of its own; it generates
# span_id values internally as 16-hex strings. The plugin's `trace_id`
# column inherits from whichever OpenTelemetry span is active in the
# surrounding runtime when the agent runs:
# * Agent Engine wires its invocation span automatically, so
# `trace_id` in BigQuery joins to Cloud Trace out of the box.
# * Locally, framework-instrumented runners open an invocation span
# for you.
# * If neither is available, the plugin falls back to a per-invocation
# trace_id and the parent-child hierarchy is still preserved in
# BigQuery — no OTel setup needed.
# Setting a bare `TracerProvider` with no ambient span will NOT cause
# `trace_id` to be populated with a "real" OTel id; only an *active*
# span does. See the "Tracing and observability" section for details.

# --- Configuration ---
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "your-gcp-project-id")
@@ -378,9 +403,9 @@ provides a comprehensive reference with example values.
| **session_id** | `STRING` | `NULLABLE` | A persistent identifier for the entire conversation thread. Stays constant across multiple turns and sub-agent calls. | `04275a01-1649-4a30-b6a7-5b443c69a7bc` |
| **invocation_id** | `STRING` | `NULLABLE` | The unique identifier for a single execution turn or request cycle. Corresponds to `trace_id` in many contexts. | `e-b55b2000-68c6-4e8b-b3b3-ffb454a92e40` |
| **user_id** | `STRING` | `NULLABLE` | The identifier of the user (human or system) initiating the session. Extracted from the `User` object or metadata. | `test_user` |
| **trace_id** | `STRING` | `NULLABLE` | The **OpenTelemetry** Trace ID (32-char hex). Links all operations within a single distributed request lifecycle. | `e-b55b2000-68c6-4e8b-b3b3-ffb454a92e40` |
| **span_id** | `STRING` | `NULLABLE` | The **OpenTelemetry** Span ID (16-char hex). Uniquely identifies this specific atomic operation. | `69867a836cd94798be2759d8e0d70215` |
| **parent_span_id** | `STRING` | `NULLABLE` | The Span ID of the immediate caller. Used to reconstruct the parent-child execution tree (DAG). | `ef5843fe40764b4b8afec44e78044205` |
| **trace_id** | `STRING` | `NULLABLE` | 32-character hex Trace ID. Inherited from the ambient OpenTelemetry span when one is active (e.g. Agent Engine's invocation span or the ADK Runner span) so BigQuery rows join cleanly to your existing Cloud Trace traces; otherwise generated by the plugin per invocation. Links all operations within a single distributed request lifecycle. | `a2c7f13d3a3f0bbb8793692f76a6012a` |
| **span_id** | `STRING` | `NULLABLE` | 16-character hex Span ID identifying this specific atomic operation. **Generated by the plugin internally** (the plugin does not call `tracer.start_span` against your configured OpenTelemetry provider — see [Tracing and observability](#tracing-and-observability)). | `3916f5762bcd4d42` |
| **parent_span_id** | `STRING` | `NULLABLE` | 16-character hex Span ID of the immediate caller. Used to reconstruct the parent-child execution tree (DAG). | `4c4a42bfdeb84934` |
| **content** | `JSON` | `NULLABLE` | The primary event payload. Structure is polymorphic based on `event_type`. | `{"system_prompt": "You are...", "prompt": [{"role": "user", "content": "hello"}], "response": "Hi", "usage": {"total": 15}}` |
| **attributes** | `JSON` | `NULLABLE` | Metadata/Enrichment (usage stats, model info, tool provenance, custom tags). | `{"model": "gemini-flash-latest", "usage_metadata": {"total_token_count": 15}, "session_metadata": {"session_id": "...", "app_name": "...", "user_id": "...", "state": {}}, "custom_tags": {"env": "prod"}}` |
| **latency_ms** | `JSON` | `NULLABLE` | Performance metrics. Standard keys are `total_ms` (wall-clock duration) and `time_to_first_token_ms` (streaming latency). | `{"total_ms": 1250, "time_to_first_token_ms": 450}` |
@@ -403,9 +428,9 @@ you can optionally create the table manually using the DDL below.
session_id STRING OPTIONS(description="A unique identifier to group events within a single conversation or user session."),
invocation_id STRING OPTIONS(description="A unique identifier for each individual agent execution or turn within a session."),
user_id STRING OPTIONS(description="The identifier of the user associated with the current session."),
trace_id STRING OPTIONS(description="OpenTelemetry trace ID for distributed tracing."),
span_id STRING OPTIONS(description="OpenTelemetry span ID for this specific operation."),
parent_span_id STRING OPTIONS(description="OpenTelemetry parent span ID to reconstruct hierarchy."),
trace_id STRING OPTIONS(description="32-char hex trace ID. Inherited from the ambient OpenTelemetry span when one is active; otherwise generated per invocation by the plugin."),
span_id STRING OPTIONS(description="16-char hex span ID for this specific operation. Generated internally by the plugin (no OpenTelemetry span is created or exported)."),
parent_span_id STRING OPTIONS(description="16-char hex span ID of the immediate caller, used to reconstruct the parent-child execution tree."),
content JSON OPTIONS(description="The event-specific data (payload) stored as JSON."),
content_parts ARRAY<STRUCT<
mime_type STRING,
@@ -1420,22 +1445,43 @@ config = BigQueryLoggerConfig(

### Tracing and observability

The plugin supports **OpenTelemetry** for distributed tracing. OpenTelemetry is
included as a core dependency of ADK and is always available.

- **Automatic Span Management**: The plugin automatically generates spans for
Agent execution, LLM calls, and Tool executions.
- **OpenTelemetry Integration**: If a `TracerProvider` is configured (as shown
in the example above), the plugin will use valid OTel spans, populating
`trace_id`, `span_id`, and `parent_span_id` with standard OTel identifiers.
This allows you to correlate agent logs with other services in your
distributed system.
- **Fallback Mechanism**: If no `TracerProvider` is configured (i.e., only the
default no-op provider is active), the plugin automatically falls back to
generating internal UUIDs for spans and uses the `invocation_id` as the trace
ID. This ensures that the parent-child hierarchy (Agent -> Span -> Tool/LLM)
is *always* preserved in the BigQuery logs, even without a configured
`TracerProvider`.
The plugin populates the `trace_id`, `span_id`, and `parent_span_id` columns on
every emitted row so the parent-child execution tree (Agent → LLM call / Tool
call) reconstructs cleanly from BigQuery.

- **Internal span tracking, no OTel span export.** The plugin generates
`span_id` values directly as 16-hex strings and tracks the parent-child
hierarchy on its own internal stack. It does **not** call
`tracer.start_span(...)` on any configured OpenTelemetry
`TracerProvider`, so its instrumentation never reaches your configured
exporter — this is what prevents duplicate spans in Cloud Trace when Agent
Engine telemetry is enabled (`GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true`)
or when you wire any other Cloud Trace exporter into the host process.
- **`trace_id` inherited from the ambient OTel span when present.** If the
surrounding runtime has already started an OTel span — Agent Engine's
invocation span, the ADK `Runner` invocation span, or any span you opened
before the agent runs — the plugin reads its `trace_id` and stamps it on
every BigQuery row. BigQuery rows therefore join cleanly to your existing
Cloud Trace traces via a shared `trace_id`.
- **Fallback when no ambient span is present.** If no ambient OTel span is
active (e.g. a non-Agent-Engine deployment with no host-side tracer
configured), the plugin generates a per-invocation 32-hex `trace_id` so the
parent-child hierarchy is always preserved in BigQuery, even without any
external tracer setup.
- **No `TracerProvider` is required.** Configuring an OpenTelemetry
`TracerProvider` in your host process is optional. It only matters if you
want the plugin's `trace_id` to be sourced from your own pre-existing
ambient span (e.g. to correlate against telemetry from non-ADK services).
The plugin no longer needs the provider for its own bookkeeping.
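
The ID shapes and the fallback described in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the plugin's implementation: `SpanStack`, `new_trace_id`, and `new_span_id` are hypothetical names, and the sketch assumes the IDs are random hex strings of the documented lengths.

```python
import secrets
from typing import Optional


def new_trace_id() -> str:
    """Return a random 32-character hex string (the documented trace_id shape)."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 16-character hex string (the documented span_id shape)."""
    return secrets.token_hex(8)


class SpanStack:
    """Hypothetical stand-in for an internal parent-child span stack."""

    def __init__(self, ambient_trace_id: Optional[str] = None) -> None:
        # Prefer the trace_id of an already-active span; otherwise generate one.
        self.trace_id = ambient_trace_id or new_trace_id()
        self._stack = []  # span_ids of the currently open spans, innermost last

    def push(self) -> dict:
        """Open a span and return the IDs that would be stamped on a row."""
        parent = self._stack[-1] if self._stack else None
        span_id = new_span_id()
        self._stack.append(span_id)
        return {
            "trace_id": self.trace_id,
            "span_id": span_id,
            "parent_span_id": parent,
        }

    def pop(self) -> None:
        """Close the most recently opened span."""
        self._stack.pop()


stack = SpanStack()       # no ambient span: a trace_id is generated
agent_row = stack.push()  # root span, parent_span_id is None
tool_row = stack.push()   # nested under the agent span
stack.pop()
llm_row = stack.push()    # sibling of the tool call, same parent
```

Passing an existing 32-hex value as `ambient_trace_id` models the inherited case; nothing in the sketch touches an OpenTelemetry `TracerProvider`, which mirrors the "no OTel span export" behavior described above.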

!!! info "If you relied on the plugin to feed your OTel exporter"

Some older configurations used the BQAA plugin as a side channel for
OpenTelemetry span emission — that path is intentionally gone. Configure
OTel instrumentation in the host application instead (Agent Engine wires
this automatically; for local deployments use ADK's own framework
instrumentation or an explicit `TracerProvider`). The plugin's BigQuery
rows will continue to join to your traces via `trace_id`.
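
Because every row carries `span_id` and `parent_span_id`, the execution tree for one trace can be rebuilt client-side after querying BigQuery. A minimal sketch, assuming the rows for a single `trace_id` have already been fetched as dictionaries with one row per span; the `label` key, the sample values, and the `build_tree` / `render` helpers are hypothetical and not part of the plugin.

```python
from collections import defaultdict


def build_tree(rows):
    """Split rows into roots and a parent_span_id -> children mapping."""
    span_ids = {row["span_id"] for row in rows}
    children = defaultdict(list)
    roots = []
    for row in rows:
        parent = row.get("parent_span_id")
        # A row is a root if it has no parent or its parent is not in this result set.
        if parent is None or parent not in span_ids:
            roots.append(row)
        else:
            children[parent].append(row)
    return roots, children


def render(rows):
    """Return one indented line per span, in depth-first order."""
    roots, children = build_tree(rows)
    lines = []

    def walk(row, depth):
        lines.append("  " * depth + f'{row["label"]} ({row["span_id"]})')
        for child in children[row["span_id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


# Made-up rows for illustration; real rows come from the analytics table.
rows = [
    {"span_id": "4c4a42bfdeb84934", "parent_span_id": None, "label": "agent"},
    {"span_id": "3916f5762bcd4d42", "parent_span_id": "4c4a42bfdeb84934", "label": "llm_call"},
    {"span_id": "9f1d2c3b4a5e6f70", "parent_span_id": "4c4a42bfdeb84934", "label": "tool_call"},
]
print("\n".join(render(rows)))
```

If a span produces more than one row in your data, deduplicate on `span_id` before building the tree.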

### Public methods

@@ -1449,6 +1495,9 @@ The plugin exposes several public methods for lifecycle management:
- **`await plugin.create_analytics_views()`**: Manually (re-)create all
per-event-type analytics views. Useful after a schema upgrade or when views
need to be refreshed.
- **`plugin.get_drop_stats()`**: Return a snapshot of dropped-event counts per
`drop_reason`. See [Dropped-event
observability](#dropped-event-observability) below.
- **Async context manager**: The plugin supports `async with` for automatic
startup and shutdown:

@@ -1461,6 +1510,60 @@ The plugin exposes several public methods for lifecycle management:
# plugin.shutdown() is called automatically on exit
```

### Dropped-event observability {#dropped-event-observability}

BigQuery logging is best-effort — events can be dropped when the in-memory
queue overflows or when a write ultimately fails. The plugin tracks dropped
rows per `drop_reason` and exposes a polling API so a host can detect, alert
on, and ship the counts to its own monitoring.

**Drop reasons:**

| Reason | Cause |
|---|---|
| `queue_full` | The in-memory batch queue overflowed (host produces events faster than the drainer can ship). Increase `queue_max_size` on `BigQueryLoggerConfig`, raise `batch_size` to drain in larger chunks, or scale the consumer side (more concurrent invocations finishing faster). |
| `arrow_prep_failed` | A row could not be converted to its Arrow representation (typically schema/type mismatch). Inspect logs for the offending field. |
| `retry_exhausted` | The Storage Write API call kept returning a retryable error (e.g. transient gRPC failures) until the retry budget was used up. |
| `non_retryable` | Storage Write API returned a non-retryable error (permissions, quota, schema rejection). Usually requires operator intervention. |
| `unexpected_error` | Any other exception caught while preparing or writing the batch. |

**Reading the counts:**

```python
# Snapshot of {drop_reason: count} since plugin start.
stats = plugin.get_drop_stats()
# Example: {"queue_full": 12, "retry_exhausted": 0, ...}

total_dropped = sum(stats.values())
```

**Exporting to your monitoring system** — poll periodically and ship the deltas:

```python
import asyncio

async def export_loop(plugin):
    last = {k: 0 for k in (
        "queue_full", "arrow_prep_failed",
        "retry_exhausted", "non_retryable", "unexpected_error",
    )}
    while True:
        current = plugin.get_drop_stats()
        for reason, count in current.items():
            delta = count - last.get(reason, 0)
            if delta:
                # e.g. metric_client.write_point(
                #     metric="bqaa_dropped_events",
                #     labels={"reason": reason}, value=delta)
                ...
        last = current
        await asyncio.sleep(60)
```

A `queue_full` or `retry_exhausted` count that keeps growing from one poll to
the next is the clearest signal that BQAA is losing data. Surface it on a
dashboard or alert on it.
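
One way to turn "sustained" into something a poller can act on is to alert only when a counter has grown in several consecutive snapshots. The sketch below is an assumption-laden illustration: `SustainedDropDetector` and its `polls` threshold are hypothetical and not part of the plugin; it only consumes the `{drop_reason: count}` dictionaries returned by `get_drop_stats()`.

```python
class SustainedDropDetector:
    """Flag a drop reason whose count grew in `polls` consecutive snapshots."""

    def __init__(self, reasons=("queue_full", "retry_exhausted"), polls=3):
        self.reasons = reasons
        self.polls = polls
        self._last = {}  # last seen count per reason
        self._streak = {reason: 0 for reason in reasons}

    def observe(self, stats):
        """Record one snapshot and return the reasons that should alert."""
        alerts = []
        for reason in self.reasons:
            count = stats.get(reason, 0)
            grew = count > self._last.get(reason, 0)
            # Extend the streak on growth; reset it when the counter is flat.
            self._streak[reason] = self._streak[reason] + 1 if grew else 0
            self._last[reason] = count
            if self._streak[reason] >= self.polls:
                alerts.append(reason)
        return alerts
```

Inside a polling loop this would be called as `detector.observe(plugin.get_drop_stats())` once per interval; a single burst of drops does not alert, while continuous growth does.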

### Multiprocessing and fork safety

The plugin is fork-aware: it sets `GRPC_ENABLE_FORK_SUPPORT=1` before loading