Bug 2043518 - Better message in the CMO degraded/unavailable conditions when pods can't be scheduled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Sunil Thaha
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-21 12:48 UTC by Simon Pasquier
Modified: 2023-01-17 19:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Before this update, if the Prometheus Operator failed to run or schedule Prometheus pods, the system provided no underlying reason for the failure. With this update, if Prometheus pods are not run or scheduled, the Cluster Monitoring Operator updates the `clusterOperator` monitoring status with a reason for the failure, which can be used to troubleshoot the underlying issue. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2043518[*BZ#2043518*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:47:08 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-monitoring-operator pull 1558 (open): Bug 2043518: set degraded and available status based on Prometheus pod status (last updated 2022-08-25 05:22:19 UTC)
Red Hat Product Errata RHSA-2022:7399 (last updated 2023-01-17 19:47:32 UTC)

Description Simon Pasquier 2022-01-21 12:48:47 UTC
Description of problem:
When some pods can't be scheduled (for instance because a PVC can't be created), CMO should report a detailed message about the underlying reason.

From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview/1484080898826571776

Version-Release number of selected component (if applicable):
4.10 and before

How reproducible:
Not always

Steps to Reproduce:
1. Configure Prometheus with an invalid volume claim template (an unknown storage class, for instance; a full ConfigMap sketch follows these steps):

    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          storageClassName: foo
          resources:
            requests:
              storage: 50Gi


2. Wait for CMO to go degraded
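
For reference, the snippet in step 1 is a fragment of the cluster monitoring configuration; a minimal ConfigMap carrying it would look roughly as follows (a sketch assuming the stock cluster-monitoring-config ConfigMap in the openshift-monitoring namespace):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |
        prometheusK8s:
          volumeClaimTemplate:
            metadata:
              name: prometheus-db
            spec:
              storageClassName: foo   # intentionally unknown storage class
              resources:
                requests:
                  storage: 50Gi

After applying it (for example with oc apply -f), the prometheus-k8s pods should stay Pending and CMO should eventually go degraded, as shown below.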

Actual results:

$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-01-21T12:34:14Z",
    "message": "Failed to rollout the stack. Error: updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2022-01-21T12:34:14Z",
    "message": "Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "False",
    "type": "Available"
  }
]


Expected results:

CMO should surface a better explanation as to why the pods aren't in the desired state.

Additional info:

$ oc get pods -n openshift-monitoring prometheus-k8s-0 -o jsonpath='{.status}' | jq .
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2022-01-21T12:19:10Z",
      "message": "0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

Comment 1 David Eads 2022-01-21 13:56:31 UTC
The unschedulable pod condition would be a good starting point.
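
For reference, that condition (and the exact message CMO could surface) can be read straight from the pod status, for example:

$ oc -n openshift-monitoring get pod prometheus-k8s-0 \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.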

Comment 8 Simon Pasquier 2022-06-14 08:54:33 UTC
Unsetting target release because it won't be ready in time for 4.11 code freeze.

Comment 14 hongyan li 2022-08-25 08:35:08 UTC
Tested with the PR:
1. Configure Prometheus with an invalid volume claim template (unknown storage class for instance):

    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          storageClassName: foo
          resources:
            requests:
              storage: 50Gi


2. Wait for CMO to go degraded

Actual results:

$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-08-25T08:28:23Z",
    "message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-25T08:28:23Z",
    "message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
    "reason": "UpdatingPrometheusK8SFailed",
    "status": "True",
    "type": "Degraded"
  }
]

Comment 15 hongyan li 2022-08-25 08:56:58 UTC
The message when Alertmanager has a wrong PVC is still wrong; the issue may be tracked in a new bug. (A sketch of the likely Alertmanager config follows the output below.)

% oc -n openshift-monitoring describe pod alertmanager-main-0 |tail -n 10
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
hongyli@hongyli-mac Downloads % oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
  {
    "lastTransitionTime": "2022-08-25T08:51:55Z",
    "reason": "AsExpected",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-25T08:51:55Z",
    "message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
    "reason": "UpdatingAlertmanagerFailed",
    "status": "True",
    "type": "Degraded"
  }
]
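
The Alertmanager failure above was presumably triggered with the equivalent misconfiguration under alertmanagerMain in the same cluster-monitoring-config ConfigMap; the values below are an assumption for illustration, not copied from the test:

    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: alertmanager-db    # hypothetical name, for illustration
        spec:
          storageClassName: foo    # same unknown storage class as for Prometheus
          resources:
            requests:
              storage: 10Gi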

Comment 16 hongyan li 2022-08-26 08:10:11 UTC
Set the bug as Tested, as the Alertmanager issue will be tracked in bug https://issues.redhat.com/browse/OCPBUGS-610.

Comment 20 Junqi Zhao 2022-10-10 01:17:33 UTC
Based on comment 18 and comment 19, set the bug to verified.

Comment 23 errata-xmlrpc 2023-01-17 19:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

