Description of problem:

When some pods can't be scheduled (because a PVC can't be created, for instance), CMO should report a detailed message about the underlying reason.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1484080898826571776

Version-Release number of selected component (if applicable):
4.10 and before

How reproducible:
Not always

Steps to Reproduce:
1. Configure Prometheus with an invalid volume claim template (unknown storage class, for instance); a full ConfigMap sketch follows at the end of this comment:

prometheusK8s:
  volumeClaimTemplate:
    metadata:
      name: prometheus-db
    spec:
      storageClassName: foo
      resources:
        requests:
          storage: 50Gi

2. Wait for CMO to go degraded

Actual results:

$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-01-21T12:34:14Z",
    "message": "Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2022-01-21T12:34:14Z",
    "message": "Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "False",
    "type": "Available"
  }
]

Expected results:

CMO should surface a better explanation as to why the pods aren't in the desired state.

Additional info:

$ oc get pods -n openshift-monitoring prometheus-k8s-0 -o jsonpath='{.status}' | jq .
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2022-01-21T12:19:10Z",
      "message": "0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}
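For reference, a minimal sketch of where the step 1 snippet goes, assuming the usual cluster-monitoring-config ConfigMap (the ConfigMap name, namespace and config.yaml key come from the documented CMO configuration, not from this report):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          # intentionally unknown storage class to trigger the failure
          storageClassName: foo
          resources:
            requests:
              storage: 50Gi

Applying it with "oc apply -f <file>" should leave the prometheus-k8s pods Pending, as shown in the additional info above.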
The unschedulable pod condition would be a good starting point.
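As a rough illustration, the text CMO could surface is already present on the pod's PodScheduled condition and can be read directly (a hedged sketch using the pod from this report; the jsonpath filter is standard oc/kubectl syntax):

$ oc -n openshift-monitoring get pod prometheus-k8s-0 \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'

This prints the same "0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims." message shown in the additional info above, which is the kind of detail the ClusterOperator conditions should carry.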
Unsetting target release because it won't be ready in time for 4.11 code freeze.
Test with the PR:

1. Configure Prometheus with an invalid volume claim template (unknown storage class, for instance):

prometheusK8s:
  volumeClaimTemplate:
    metadata:
      name: prometheus-db
    spec:
      storageClassName: foo
      resources:
        requests:
          storage: 50Gi

2. Wait for CMO to go degraded

Actual results:

$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-08-25T08:28:23Z",
    "message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-25T08:28:23Z",
    "message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
  }
]
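For completeness, the underlying PVC can be cross-checked directly; it should be stuck in Pending because the "foo" storage class does not exist (the claim name below follows the usual StatefulSet convention of <volumeClaimTemplate name>-<pod name>, which is an assumption, not taken from this report):

$ oc -n openshift-monitoring get pvc prometheus-db-prometheus-k8s-0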
The message when Alertmanager has a wrong PVC is still the generic one; maybe this issue can be tracked in a new bug.

% oc -n openshift-monitoring describe pod alertmanager-main-0 | tail -n 10
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.

% oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-08-25T08:51:55Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-25T08:51:55Z",
    "message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingAlertmanagerFailed",
    "status": "True",
    "type": "Degraded"
  }
]
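For reference, the Alertmanager case can presumably be reproduced the same way, by adding a volumeClaimTemplate under the alertmanagerMain key of the same cluster-monitoring-config ConfigMap (the key name comes from the documented CMO configuration; the claim name and size below are made up for illustration):

alertmanagerMain:
  volumeClaimTemplate:
    metadata:
      name: alertmanager-db
    spec:
      storageClassName: foo
      resources:
        requests:
          storage: 10Gi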
Setting the bug as Tested, as the Alertmanager issue will be tracked in bug https://issues.redhat.com/browse/OCPBUGS-610
Based on comment 18 and comment 19, setting the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399