diff --git a/modules/log-collector-resources-scheduling.adoc b/modules/log-collector-resources-scheduling.adoc index 65e403cb8a4b..bb0949a7ddde 100644 --- a/modules/log-collector-resources-scheduling.adoc +++ b/modules/log-collector-resources-scheduling.adoc @@ -17,21 +17,60 @@ Administrators can change the resources and scheduling of the collector by confi .Procedure -. Update the `ClusterLogForwarder` CR: +. Update the `ClusterLogForwarder` CR to configure scheduling and resources. + -The following example displays `ClusterLogForwarder` CR YAML: +The following example schedules collectors on infrastructure nodes: + [source,yaml] ---- -apiVersion: observability.openshift.io/v1 +apiVersion: observability.openshift.io/v1 kind: ClusterLogForwarder metadata: - name: - namespace: + name: instance + namespace: openshift-logging spec: collector: nodeSelector: - collector: needed + node-role.kubernetes.io/infra: "" +# ... +---- ++ +The following example schedules collectors on dedicated infrastructure nodes with taints: ++ +[source,yaml] +---- +apiVersion: observability.openshift.io/v1 +kind: ClusterLogForwarder +metadata: + name: instance + namespace: openshift-logging +spec: + collector: + nodeSelector: + node-role.kubernetes.io/infra: "" + tolerations: + - key: node-role.kubernetes.io/infra + operator: Exists + effect: NoSchedule + - key: node-role.kubernetes.io/infra + operator: Exists + effect: NoExecute +# ... 
+---- ++ +The following example shows all available scheduling and resource fields: ++ +[source,yaml] +---- +apiVersion: observability.openshift.io/v1 +kind: ClusterLogForwarder +metadata: + name: instance + namespace: openshift-logging +spec: + collector: + nodeSelector: + node-role.kubernetes.io/infra: "" resources: limits: memory: 1Gi @@ -39,48 +78,27 @@ spec: cpu: 100m memory: 1Gi tolerations: - - key: "logging" - operator: "Exists" - effect: "NoExecute" - tolerationSeconds: 6000 + - key: node-role.kubernetes.io/infra + operator: Exists + effect: NoSchedule affinity: nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: - preference: matchExpressions: - - key: label-1 + - key: node-role.kubernetes.io/infra operator: Exists weight: 1 - podAffinity: + podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: - podAffinityTerm: labelSelector: matchExpressions: - - key: test + - key: app.kubernetes.io/component operator: In values: - - value1 - topologyKey: kubernetes.io/hostname - weight: 50 - requiredDuringSchedulingIgnoredDuringExecution: - - labelSelector: - matchExpressions: - - key: run - operator: In - values: - - test - namespaceSelector: {} - topologyKey: kubernetes.io/hostname - podAntiAffinity: - preferredDuringSchedulingIgnoredDuringExecution: - - podAffinityTerm: - labelSelector: - matchExpressions: - - key: security - operator: In - values: - - S2 - topologyKey: topology.kubernetes.io/zone + - collector + topologyKey: topology.kubernetes.io/zone weight: 100 # ... ---- diff --git a/modules/logging-loki-pod-placement.adoc b/modules/logging-loki-pod-placement.adoc index 751e0632a887..6e57154c844c 100644 --- a/modules/logging-loki-pod-placement.adoc +++ b/modules/logging-loki-pod-placement.adoc @@ -58,7 +58,7 @@ where: In the earlier example configuration, all Loki pods are moved to nodes containing the `node-role.kubernetes.io/infra: ""` label. 
-The following example displays `LokiStack` CR with node selectors and tolerations: +The following example displays `LokiStack` CR with node selectors and tolerations for dedicated infrastructure nodes with taints. The configuration pattern is shown for three components and applies to all components: [source,yaml] ---- apiVersion: loki.grafana.com/v1 @@ -89,56 +89,6 @@ spec: - effect: NoExecute key: node-role.kubernetes.io/infra value: reserved - indexGateway: - nodeSelector: - node-role.kubernetes.io/infra: "" - tolerations: - - effect: NoSchedule - key: node-role.kubernetes.io/infra - value: reserved - - effect: NoExecute - key: node-role.kubernetes.io/infra - value: reserved - ingester: - nodeSelector: - node-role.kubernetes.io/infra: "" - tolerations: - - effect: NoSchedule - key: node-role.kubernetes.io/infra - value: reserved - - effect: NoExecute - key: node-role.kubernetes.io/infra - value: reserved - querier: - nodeSelector: - node-role.kubernetes.io/infra: "" - tolerations: - - effect: NoSchedule - key: node-role.kubernetes.io/infra - value: reserved - - effect: NoExecute - key: node-role.kubernetes.io/infra - value: reserved - queryFrontend: - nodeSelector: - node-role.kubernetes.io/infra: "" - tolerations: - - effect: NoSchedule - key: node-role.kubernetes.io/infra - value: reserved - - effect: NoExecute - key: node-role.kubernetes.io/infra - value: reserved - ruler: - nodeSelector: - node-role.kubernetes.io/infra: "" - tolerations: - - effect: NoSchedule - key: node-role.kubernetes.io/infra - value: reserved - - effect: NoExecute - key: node-role.kubernetes.io/infra - value: reserved gateway: nodeSelector: node-role.kubernetes.io/infra: "" @@ -149,9 +99,15 @@ spec: - effect: NoExecute key: node-role.kubernetes.io/infra value: reserved + # ... repeat for indexGateway, ingester, querier, queryFrontend, ruler # ... 
---- +[NOTE] +==== +Apply the same `nodeSelector` and `tolerations` configuration to all LokiStack components: `compactor`, `distributor`, `gateway`, `indexGateway`, `ingester`, `querier`, `queryFrontend`, and `ruler`. +==== + To configure the `nodeSelector` and `tolerations` fields of the `LokiStack` (CR), you can use the [command]`oc explain` command to view the description and fields for a particular resource: [source,terminal] diff --git a/modules/logging-scheduling-use-cases.adoc b/modules/logging-scheduling-use-cases.adoc new file mode 100644 index 000000000000..98d6420c2ad1 --- /dev/null +++ b/modules/logging-scheduling-use-cases.adoc @@ -0,0 +1,70 @@ +// Module included in the following assemblies: +// +// * scheduling_resources/scheduling-logging-resources.adoc + +:_mod-docs-content-type: CONCEPT +[id="logging-scheduling-use-cases_{context}"] += Scheduling use cases for logging components + +[role="_abstract"] +Different deployment scenarios require different scheduling approaches. Use this guide to determine which scheduling mechanism to use for your logging infrastructure. + +The following table describes common use cases and the scheduling mechanisms to apply: + +.Scheduling mechanisms by use case +[cols="2,1,1,1,1",options="header"] +|=== +|Use case |Node selectors |Taints and tolerations |Affinity rules |Resource limits + +|Schedule logging on infrastructure nodes +|Required +|Optional +|Not required +|Optional + +|Dedicate nodes exclusively to logging +|Required +|Required +|Not required +|Optional + +|Distribute logging across availability zones +|Not required +|Not required +|Required +|Not required + +|Tune logging performance and resource usage +|Not required +|Not required +|Not required +|Required + +|=== + +== Infrastructure nodes + +When you have dedicated infrastructure nodes labeled with `node-role.kubernetes.io/infra`, use node selectors to schedule logging components on those nodes. 
This separates logging workloads from application workloads, which keeps operational boundaries clear and can help control costs. + +To keep other workloads off infrastructure nodes, apply taints to the infrastructure nodes and configure matching tolerations on logging pods. Because a taint repels every pod that lacks a matching toleration, this reserves the infrastructure nodes for logging and for any other workloads that tolerate the taint. + +== High availability across zones + +In multi-zone clusters, use pod anti-affinity rules to distribute LokiStack components across availability zones. This helps maintain logging availability during a zone failure and helps meet business continuity requirements. + +For example, configure anti-affinity to prevent multiple ingester pods from running in the same zone. If one zone fails, the ingesters in the remaining zones continue to process logs. + +== Performance tuning + +When you experience high log volume or performance issues, adjust the CPU and memory resource limits for collector pods. Raising the limits allows collectors to handle higher throughput, while capping them prevents logging from consuming excessive node resources. + +Monitor collector resource usage and adjust the limits based on actual consumption and node capacity. + +== Verification + +After configuring scheduling rules, verify that pods are running on the expected nodes: + +* For collectors, use the `oc get pods` command with the `--selector` and `-o wide` flags to view pod placement. +* For LokiStack components, check the pod status and node assignment for each component type. + +If pods are not scheduled as expected, check the node labels, node taints, and pod tolerations. Verify that the scheduling configuration matches your cluster's node configuration. 
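The label check mentioned in the verification notes can be reasoned about with a small model. The following Python sketch is an illustration only and is not part of the change. It assumes the standard Kubernetes rule that a `nodeSelector` matches a node only when every key-value pair in the selector is present in the node's labels. The function names `selector_matches` and `eligible_nodes` and the sample node data are hypothetical.

```python
def selector_matches(node_selector: dict, node_labels: dict) -> bool:
    """Return True when every nodeSelector pair appears in the node labels.

    An empty-string value such as `node-role.kubernetes.io/infra: ""` still
    requires the key to exist on the node; a missing key does not match.
    """
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(node_selector: dict, nodes: dict) -> list:
    """List the names of nodes whose labels satisfy the selector."""
    return [name for name, labels in nodes.items()
            if selector_matches(node_selector, labels)]


# Hypothetical cluster: one infrastructure node and one worker node.
nodes = {
    "infra-1": {"node-role.kubernetes.io/infra": "", "kubernetes.io/os": "linux"},
    "worker-1": {"node-role.kubernetes.io/worker": "", "kubernetes.io/os": "linux"},
}
selector = {"node-role.kubernetes.io/infra": ""}

print(eligible_nodes(selector, nodes))
```

If the list comes back empty for the selector in a custom resource, the pod stays in `Pending` until a node carries the expected label or the selector is corrected.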
diff --git a/modules/troubleshooting-logging-pod-scheduling.adoc b/modules/troubleshooting-logging-pod-scheduling.adoc new file mode 100644 index 000000000000..4a77fea73938 --- /dev/null +++ b/modules/troubleshooting-logging-pod-scheduling.adoc @@ -0,0 +1,99 @@ +// Module included in the following assemblies: +// +// * scheduling_resources/scheduling-logging-resources.adoc + +:_mod-docs-content-type: PROCEDURE +[id="troubleshooting-logging-pod-scheduling_{context}"] += Troubleshooting logging pod scheduling + +[role="_abstract"] +If logging pods are not scheduled on the expected nodes or remain in a pending state, verify the node labels, taints, and pod scheduling configuration. + +.Prerequisites + +* You have administrator permissions. +* You have installed the {clo} or {loki-op}. + +.Procedure + +. Check the pod status to identify scheduling issues: ++ +[source,terminal] +---- +$ oc get pods -n openshift-logging -o wide +---- ++ +Pods that cannot be scheduled display a `Pending` status. + +. Describe the pod to view scheduling events: ++ +[source,terminal] +---- +$ oc describe pod -n openshift-logging +---- ++ +Review the `Events` section for messages such as: ++ +* `0/X nodes are available: X node(s) didn't match Pod's node affinity/selector` +* `0/X nodes are available: X node(s) had untolerated taint` +* `0/X nodes are available: Insufficient cpu, Insufficient memory` + +. Verify that target nodes have the required labels: ++ +[source,terminal] +---- +$ oc get nodes --show-labels +---- ++ +Confirm that nodes intended for logging have the labels specified in the `nodeSelector` configuration. + +. If using taints and tolerations, verify node taints: ++ +[source,terminal] +---- +$ oc describe node +---- ++ +Review the `Taints` section and confirm that logging pods have matching tolerations configured. + +. 
Verify the pod's scheduling configuration: ++ +For collector pods, check the `ClusterLogForwarder` custom resource: ++ +[source,terminal] +---- +$ oc get clusterlogforwarder -n -o yaml +---- ++ +For LokiStack pods, check the `LokiStack` custom resource: ++ +[source,terminal] +---- +$ oc get lokistack logging-loki -n openshift-logging -o yaml +---- + +. Correct any mismatches between the pod configuration and node labels or taints: ++ +* If node labels are missing, add them: ++ +[source,terminal] +---- +$ oc label node = +---- ++ +* If the pod's `nodeSelector` has a typing error, update the custom resource with the correct label. ++ +* If a taint is missing from the pod's tolerations, add it to the custom resource. + +. After making corrections, verify that the pods are scheduled: ++ +[source,terminal] +---- +$ oc get pods -n openshift-logging -o wide +---- ++ +Pods should move to `Running` status on the expected nodes. + +.Verification + +* Confirm that logging pods are running on the intended nodes by checking the `NODE` column in the pod list. diff --git a/scheduling_resources/scheduling-logging-resources.adoc b/scheduling_resources/scheduling-logging-resources.adoc index 848aed354251..93102a5f3113 100644 --- a/scheduling_resources/scheduling-logging-resources.adoc +++ b/scheduling_resources/scheduling-logging-resources.adoc @@ -11,12 +11,16 @@ You can schedule logging resources by defining node selectors, taints and tolera include::modules/logging-about-pod-scheduling-controls.adoc[leveloffset=+1] +include::modules/logging-scheduling-use-cases.adoc[leveloffset=+1] + include::modules/log-collector-resources-scheduling.adoc[leveloffset=+1] include::modules/cluster-logging-collector-pod-location.adoc[leveloffset=+1] include::modules/logging-loki-pod-placement.adoc[leveloffset=+1] +include::modules/troubleshooting-logging-pod-scheduling.adoc[leveloffset=+1] + [role="_additional-resources"] == Additional resources
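The taint check in the troubleshooting procedure comes down to comparing each node taint with the pod's tolerations. The following Python sketch is a simplified model for illustration, not the scheduler's implementation. It assumes the documented Kubernetes matching rules: an empty toleration `effect` matches any effect, `operator: Exists` ignores the value, and `operator: Equal` (the default) requires equal values. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes default when operator is omitted
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True when the toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    # Equal requires a key and an identical value.
    return bool(toleration.key) and toleration.value == taint.value


def untolerated_taints(tolerations: list, taints: list) -> list:
    """Return the hard taints (NoSchedule, NoExecute) that no toleration matches."""
    return [taint for taint in taints
            if taint.effect in ("NoSchedule", "NoExecute")
            and not any(tolerates(tol, taint) for tol in tolerations)]


# Hypothetical infrastructure node with the two taints used in the examples.
infra_taints = [
    Taint("node-role.kubernetes.io/infra", "NoSchedule", "reserved"),
    Taint("node-role.kubernetes.io/infra", "NoExecute", "reserved"),
]
# Tolerations for both effects, as in the dedicated-node collector example.
both_effects = [
    Toleration("node-role.kubernetes.io/infra", "Exists", effect="NoSchedule"),
    Toleration("node-role.kubernetes.io/infra", "Exists", effect="NoExecute"),
]
# Toleration for NoSchedule only.
schedule_only = both_effects[:1]

print(untolerated_taints(both_effects, infra_taints))
print(untolerated_taints(schedule_only, infra_taints))
```

In this model the second call still reports the `NoExecute` taint, which corresponds to the `node(s) had untolerated taint` event listed in the procedure.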