FlagdProvider default keep_alive_time=0 causes GOAWAY crash with flagd server #365

@aepfli

Description

The FlagdProvider defaults keep_alive_time to 0 (DEFAULT_KEEP_ALIVE = 0 in config.py), which is passed directly to grpc.keepalive_time_ms as a channel option.

In the Python gRPC C-core implementation, keepalive_time_ms=0 is interpreted as "send pings immediately / every ~1ms" rather than "disabled". This causes flagd's Go gRPC server to reject the connection with:

GOAWAY [ENHANCE_YOUR_CALM] "too_many_pings"

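flagd's Go gRPC server uses the default grpc-go keepalive enforcement policy, which requires at least 5 minutes between client pings (the MinTime field of keepalive.EnforcementPolicy, referred to here as MinPingInterval). A client that pings more often than that receives the GOAWAY.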
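
The enforcement behaviour can be illustrated with a toy model. This is a simplified sketch, not grpc-go's actual code: the function name `pings_before_goaway` is hypothetical, and the strike limit of 2 is an assumption about how many early pings the server tolerates. It counts a strike for every ping that arrives sooner than the 5-minute minimum after the previous one and reports the ping on which the server would give up.

```python
from typing import Optional


def pings_before_goaway(
    ping_interval_s: float,
    min_interval_s: float = 300.0,  # 5 minutes, the minimum described above
    max_strikes: int = 2,           # assumed tolerance, illustrative only
    max_pings: int = 1000,          # stop simulating after this many pings
) -> Optional[int]:
    """Return the 1-based index of the ping that triggers GOAWAY, or None."""
    strikes = 0
    last_ping_at: Optional[float] = None
    for n in range(1, max_pings + 1):
        now = (n - 1) * ping_interval_s
        # A ping that arrives too soon after the previous one is a strike.
        if last_ping_at is not None and now - last_ping_at < min_interval_s:
            strikes += 1
            if strikes > max_strikes:
                return n
        last_ping_at = now
    return None


print(pings_before_goaway(0.001))  # pings roughly every 1 ms
print(pings_before_goaway(600.0))  # pings every 10 minutes
```

In this model, pings sent every millisecond trip the limit on the fourth ping, while a 10-minute interval never does.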

Impact

  • On aarch64/ARM (e.g., Raspberry Pi), the GOAWAY triggers a fatal gRPC C-core assertion failure in ev_epoll1_linux.cc: Check failed: next_worker->state == KICKED, crashing the entire Python process
  • On x86_64, the client typically reconnects but logs repeated GOAWAY warnings, degrading performance and reliability

This affects both the IN_PROCESS and RPC resolver types since both create gRPC channels with the same keepalive options.

Reproduction

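from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider
from openfeature.contrib.provider.flagd.config import ResolverType

# Default keep_alive_time=0 → immediate crash/GOAWAY with flagd server
provider = FlagdProvider(
    host="flagd",
    port=8015,
    resolver_type=ResolverType.IN_PROCESS,
)

# Registering the provider initializes it; this is assumed to be the point
# at which the gRPC channel to flagd is actually opened.
api.set_provider(provider)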

flagd server logs:

Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"

Client crash (ARM only):

F0000 00:00:1773608949.106482  38 ev_epoll1_linux.cc:1125] Check failed: next_worker->state == KICKED

Suggested Fix

Change the default from 0 to a value that respects flagd's server-side MinPingInterval of 5 minutes:

# config.py
DEFAULT_KEEP_ALIVE = 600000  # 10 minutes (ms) — above flagd's 5min MinPingInterval

Alternatively, do not set grpc.keepalive_time_ms at all when keep_alive_time=0, so that the gRPC library falls back to its own default (effectively infinite, i.e. disabled, in the C-core):

# grpc.py - _generate_channel()
options = []
if config.keep_alive_time > 0:
    options.append(("grpc.keepalive_time_ms", config.keep_alive_time))
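
A self-contained variant of this second option, for illustration only. The helper name `keepalive_channel_options` and the `server_min_interval_ms` parameter are hypothetical and not part of the provider; only the `grpc.keepalive_time_ms` option key comes from the snippet above. The sketch omits the option for non-positive values and logs a warning when the value is below the server's 5-minute minimum.

```python
import logging
from typing import List, Tuple

logger = logging.getLogger(__name__)

FIVE_MINUTES_MS = 5 * 60 * 1000


def keepalive_channel_options(
    keep_alive_time_ms: int,
    server_min_interval_ms: int = FIVE_MINUTES_MS,  # assumed server minimum
) -> List[Tuple[str, int]]:
    """Build gRPC channel options, leaving keepalive unset when disabled."""
    options: List[Tuple[str, int]] = []
    if keep_alive_time_ms <= 0:
        # Treat 0 (or negative) as "disabled": do not pass the option at all.
        return options
    if keep_alive_time_ms < server_min_interval_ms:
        logger.warning(
            "keep_alive_time %d ms is below the server minimum of %d ms; "
            "the server may answer with GOAWAY too_many_pings",
            keep_alive_time_ms,
            server_min_interval_ms,
        )
    options.append(("grpc.keepalive_time_ms", keep_alive_time_ms))
    return options


print(keepalive_channel_options(0))
print(keepalive_channel_options(600_000))
```

The returned list could then be passed as the `options` argument when the channel is created, for example to `grpc.insecure_channel(target, options=...)`.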

Workaround

Set keep_alive_time explicitly when creating the provider:

provider = FlagdProvider(
    host="flagd",
    port=8015,
    resolver_type=ResolverType.IN_PROCESS,
    keep_alive_time=600000,  # 10 minutes
)

Or via environment variable:

FLAGD_KEEP_ALIVE_TIME_MS=600000
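
For deployments that rely on the environment variable, a small pre-flight check can catch a missing or unusable value before the provider starts. This is a sketch under stated assumptions: the helper `read_keep_alive_ms` is hypothetical and does not reflect how the provider parses its own configuration; only the variable name `FLAGD_KEEP_ALIVE_TIME_MS` and the 10-minute value are taken from above.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "FLAGD_KEEP_ALIVE_TIME_MS"
SAFE_DEFAULT_MS = 600_000  # 10 minutes, the value used in the workaround


def read_keep_alive_ms(env: Optional[Mapping[str, str]] = None) -> int:
    """Return a positive keepalive time in ms, falling back to a safe default."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        return SAFE_DEFAULT_MS
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric input: fall back rather than fail at startup.
        return SAFE_DEFAULT_MS
    # 0 or a negative number would reintroduce the problem described here.
    return value if value > 0 else SAFE_DEFAULT_MS


print(read_keep_alive_ms({ENV_VAR: "900000"}))
```

The result can be passed explicitly as `keep_alive_time=` when constructing the provider, as in the workaround above.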

Environment

  • openfeature-provider-flagd version: 0.2.7
  • flagd version: v0.13.2
  • Python: 3.11 (aarch64-linux-gnu)
  • grpcio: compiled with C-core (not pure-Python)
  • Platform: Raspberry Pi 5 (ARM64)
