Bug 2043518
| Summary: | Better message in the CMO degraded/unavailable conditions when pods can't be scheduled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Sunil Thaha <sthaha> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | medium | | |
| Version: | 4.6.z | CC: | anpicker, bburt, deads, dgoodwin, juzhao, sthaha, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | * Before this update, if Prometheus Operator failed to run or schedule Prometheus pods, the system provided no underlying reason for the failure. With this update, if Prometheus pods are not run or scheduled, the Cluster Monitoring Operator updates the `clusterOperator` monitoring status with a reason for the failure, which can be used to troubleshoot the underlying issue. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2043518[*BZ#2043518*]) | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:47:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Simon Pasquier
2022-01-21 12:48:47 UTC
The unschedulable pod condition would be a good starting point.

Unsetting target release because it won't be ready in time for 4.11 code freeze.

Test with pr:
1. Configure Prometheus with an invalid volume claim template (an unknown storage class, for instance):

prometheusK8s:
  volumeClaimTemplate:
    metadata:
      name: prometheus-db
    spec:
      storageClassName: foo
      resources:
        requests:
          storage: 50Gi
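For reference, this snippet lives under the config.yaml key of the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. A minimal sketch of the full object, assuming the same intentionally invalid storage class as above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          storageClassName: foo   # intentionally non-existent storage class
          resources:
            requests:
              storage: 50Gi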
2. Wait for CMO to go degraded
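One way to watch for the transition (a sketch; the timeout value is arbitrary):

# block until the monitoring ClusterOperator reports Degraded=True
oc wait --for=condition=Degraded=True clusteroperator/monitoring --timeout=15m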
Actual results:
$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
{
"lastTransitionTime": "2022-08-25T08:28:23Z",
"message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
"reason": "UpdatingPrometheusK8SFailed",
"status": "False",
"type": "Available"
},
{
"lastTransitionTime": "2022-08-25T08:28:23Z",
"message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
"reason": "UpdatingPrometheusK8SFailed",
"status": "True",
"type": "Degraded"
}
]
The message when Alertmanager has a wrong PVC is also incorrect; that issue may need to be tracked in a new bug.
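The Alertmanager case was presumably reproduced with an analogous volumeClaimTemplate under alertmanagerMain in the same config.yaml; a sketch under that assumption (the exact config used is not shown in this report, and the PVC name and size below are illustrative):

alertmanagerMain:
  volumeClaimTemplate:
    metadata:
      name: alertmanager-db   # illustrative name
    spec:
      storageClassName: foo   # same non-existent storage class
      resources:
        requests:
          storage: 10Gi        # illustrative size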
% oc -n openshift-monitoring describe pod alertmanager-main-0 |tail -n 10
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 13m default-scheduler 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
Warning FailedScheduling 13m default-scheduler 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
hongyli@hongyli-mac Downloads % oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
{
"lastTransitionTime": "2022-08-25T08:51:55Z",
"reason": "AsExpected",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2022-08-25T08:51:55Z",
"message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
"reason": "UpdatingAlertmanagerFailed",
"status": "True",
"type": "Degraded"
}
]
Set the bug to Tested, as the Alertmanager issue will be tracked in https://issues.redhat.com/browse/OCPBUGS-610.

Based on comment 18 and comment 19, setting the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399