Health and metrics exposure#43

Open
robertkluin wants to merge 15 commits into main from health-and-metrics-exposure

@robertkluin
Contributor

This adds the ability to enable telemetry / diagnostics endpoints within the controller. Eventually these may be used for health checks, but the immediate objective is to improve observability and help with debugging.

The subsystems are (mostly) self-encapsulated and communicate via
queues. In order to expose internal system metrics and state, a queue
will be used to export telemetry data. This will decouple the metrics
capture / exposure from the internal subsystems.

Adding telemetry in order to add health checks and aid in debugging.
This is disabled by default and the structures returned are alpha
quality so that we can better understand the practical usage of them.

For the initial telemetry capture only the reconciliation scheduler and
event watcher subsystems emit data. This is rough and expected to evolve
after further testing and design.
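
The queue-based telemetry export described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the names `emit`, `collect`, and `demo`, the record fields, and the choice to drop records when the queue is full are all assumptions made for illustration.

```python
import asyncio
import time


def emit(telemetry_queue: asyncio.Queue, subsystem: str, data: dict) -> bool:
    """Publish a telemetry record without ever blocking the subsystem."""
    record = {"subsystem": subsystem, "at": time.monotonic(), "data": data}
    try:
        telemetry_queue.put_nowait(record)
        return True
    except asyncio.QueueFull:
        # Assumption: telemetry is best-effort, so drop rather than stall work.
        return False


async def collect(telemetry_queue: asyncio.Queue, snapshot: dict) -> None:
    """Drain the queue, keeping the latest data per subsystem."""
    while True:
        record = await telemetry_queue.get()
        snapshot[record["subsystem"]] = record["data"]
        telemetry_queue.task_done()


async def demo() -> dict:
    """Wire two hypothetical subsystems to one collector and return its view."""
    telemetry_queue = asyncio.Queue(maxsize=100)
    snapshot = {}
    collector = asyncio.create_task(collect(telemetry_queue, snapshot))
    emit(telemetry_queue, "scheduler", {"pending": 3})
    emit(telemetry_queue, "event_watcher", {"watches": 2})
    await telemetry_queue.join()  # wait until every record has been processed
    collector.cancel()
    await asyncio.gather(collector, return_exceptions=True)
    return snapshot
```

An HTTP endpoint would then only need to read `snapshot`, so the subsystems never learn how, or whether, their metrics are exposed.
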
api=api,
namespace=namespace,
api_version=API_VERSION,
plural_kind=f"{kind_title.lower()}s",
Contributor

Do we need to support the odd plural cases?

Contributor Author

Yes, normally. In this case these are only Koreo resources themselves (ResourceFunction, ResourceTemplate, ValueFunction, and Workflow)—all of which use simple plural rules. It is not currently an issue, but certainly could be if we add something with complex plural rules.
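
If a kind with irregular plural rules were added later, one possible approach is an explicit override table that falls back to the simple rule used in the diff. This is only a sketch: `IRREGULAR_PLURALS`, the `plural_kind` helper, and the `Policy` entry are hypothetical and not part of this PR.

```python
# Hypothetical override table; the PR itself only lowercases and appends "s".
IRREGULAR_PLURALS = {
    "Policy": "policies",  # illustrative entry, not a real Koreo kind
}


def plural_kind(kind_title: str) -> str:
    """Return the lowercase plural for a kind, honoring explicit overrides."""
    override = IRREGULAR_PLURALS.get(kind_title)
    if override is not None:
        return override
    # Same simple rule as the expression under review.
    return f"{kind_title.lower()}s"
```
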

Without jitter, reconciliation of all resources a controller is
monitoring can become aligned. This is meant to help scatter that in
order to more effectively spread load.
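
As an illustration of the jitter idea, the sketch below spreads a base reconcile interval by a random fraction. The function name `jittered_delay`, its parameters, and the 10% default are assumptions; the commit does not state how the jitter is actually computed.

```python
import random


def jittered_delay(base_seconds: float, jitter_fraction: float = 0.1) -> float:
    """Return base_seconds shifted by up to +/- jitter_fraction of itself."""
    spread = base_seconds * jitter_fraction
    # Each resource draws its own offset, so schedules drift apart over time.
    return max(0.0, base_seconds + random.uniform(-spread, spread))
```

With a 60 second base and the default fraction, each resource would be rescheduled somewhere between 54 and 66 seconds later instead of all firing on the same tick.
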

Starting uvicorn prior to the controller helps uvicorn start
successfully. The cause is not yet clear.

Removing try/excepts in order to debug an issue with unclean uvicorn
shutdowns.

Python's TaskGroup has several special case errors. If these aren't
handled specially, then very verbose errors are dumped as the system
exits.