Bug 2043518 - Better message in the CMO degraded/unavailable conditions when pods can't be scheduled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Sunil Thaha
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-21 12:48 UTC by Simon Pasquier
Modified: 2023-01-17 19:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Before this update, if the Prometheus Operator failed to run or schedule Prometheus pods, the system provided no underlying reason for the failure. With this update, if Prometheus pods are not run or scheduled, the Cluster Monitoring Operator updates the `clusterOperator` monitoring status with a reason for the failure, which can be used to troubleshoot the underlying issue. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2043518[*BZ#2043518*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:47:08 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-monitoring-operator pull 1558 (open): Bug 2043518: set degraded and available status based on Prometheus pod status (last updated 2022-08-25 05:22:19 UTC)
Red Hat Product Errata RHSA-2022:7399 (last updated 2023-01-17 19:47:32 UTC)

Description Simon Pasquier 2022-01-21 12:48:47 UTC
Description of problem:
When some pods can't be scheduled (for instance because a PVC can't be created), CMO should report a detailed message about the underlying reason.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1484080898826571776

Version-Release number of selected component (if applicable):
4.10 and before

How reproducible:
Not always

Steps to Reproduce:
1. Configure Prometheus with an invalid volume claim template (an unknown storage class, for instance; a full ConfigMap sketch follows these steps):

    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          storageClassName: foo
          resources:
            requests:
              storage: 50Gi


2. Wait for CMO to go degraded
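
For reference, the snippet in step 1 is a fragment of the cluster monitoring configuration; a minimal ConfigMap carrying it would look roughly as follows (a sketch assuming the stock cluster-monitoring-config ConfigMap in the openshift-monitoring namespace):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          volumeClaimTemplate:
            metadata:
              name: prometheus-db
            spec:
              storageClassName: foo   # intentionally unknown storage class
              resources:
                requests:
                  storage: 50Gi

After applying it (for example with oc apply -f), the prometheus-k8s pods should stay Pending and CMO should eventually go degraded, as shown below.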

Actual results:

$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-01-21T12:34:14Z",
    "message": "Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2022-01-21T12:34:14Z",
    "message": "Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "False",
    "type": "Available"
  }
]


Expected results:

CMO should surface a better explanation as to why the pods aren't in the desired state.

Additional info:

$ oc get pods -n openshift-monitoring prometheus-k8s-0 -o jsonpath='{.status}' | jq .
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2022-01-21T12:19:10Z",
      "message": "0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

Comment 1 David Eads 2022-01-21 13:56:31 UTC
The unschedulable pod condition would be a good starting point.
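
For reference, that condition (and the exact message CMO could surface) can be read straight from the pod status, for example:

$ oc -n openshift-monitoring get pod prometheus-k8s-0 \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.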

Comment 8 Simon Pasquier 2022-06-14 08:54:33 UTC
Unsetting target release because it won't be ready in time for 4.11 code freeze.

Comment 14 hongyan li 2022-08-25 08:35:08 UTC
Tested with the PR:
1. Configure Prometheus with an invalid volume claim template (unknown storage class for instance):

    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          storageClassName: foo
          resources:
            requests:
              storage: 50Gi


2. Wait for CMO to go degraded

Actual results:

$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-08-25T08:28:23Z",
    "message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-25T08:28:23Z",
    "message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
  }
]

Comment 15 hongyan li 2022-08-25 08:56:58 UTC
The message when Alertmanager has a wrong PVC is still wrong; the issue may be tracked in a new bug. (A sketch of the likely Alertmanager config follows the output below.)

% oc -n openshift-monitoring describe pod alertmanager-main-0 |tail -n 10
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
hongyli@hongyli-mac Downloads % oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-08-25T08:51:55Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-25T08:51:55Z",
    "message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingAlertmanagerFailed",
    "status": "True",
    "type": "Degraded"
  }
]
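
The Alertmanager failure above was presumably triggered with the equivalent misconfiguration under alertmanagerMain in the same cluster-monitoring-config ConfigMap; the values below are an assumption for illustration, not copied from the test:

    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: alertmanager-db    # hypothetical name, for illustration
        spec:
          storageClassName: foo    # same unknown storage class as for Prometheus
          resources:
            requests:
              storage: 10Gi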

Comment 16 hongyan li 2022-08-26 08:10:11 UTC
Set the bug as Tested, as the Alertmanager issue will be tracked in bug https://issues.redhat.com/browse/OCPBUGS-610.

Comment 20 Junqi Zhao 2022-10-10 01:17:33 UTC
Based on comment 18 and comment 19, set the bug to verified.

Comment 23 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

