Bug 2043518
| Summary: | Better message in the CMO degraded/unavailable conditions when pods can't be scheduled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Simon Pasquier <spasquie> |
| Component: | Monitoring | Assignee: | Sunil Thaha <sthaha> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | medium | Docs Contact: | Brian Burt <bburt> |
| Priority: | medium | | |
| Version: | 4.6.z | CC: | anpicker, bburt, deads, dgoodwin, juzhao, sthaha, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | * Before this update, if Prometheus Operator failed to run or schedule Prometheus pods, the system provided no underlying reason for the failure. With this update, if Prometheus pods are not run or scheduled, the Cluster Monitoring Operator updates the `clusterOperator` monitoring status with a reason for the failure, which can be used to troubleshoot the underlying issue. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2043518[*BZ#2043518*]) | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:47:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Simon Pasquier
2022-01-21 12:48:47 UTC
The unschedulable pod condition would be a good starting point.

Unsetting target release because it won't be ready in time for 4.11 code freeze.

Test with pr:
1. Configure Prometheus with an invalid volume claim template (an unknown storage class, for instance):

prometheusK8s:
  volumeClaimTemplate:
    metadata:
      name: prometheus-db
    spec:
      storageClassName: foo
      resources:
        requests:
          storage: 50Gi
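For reference, this snippet lives under the config.yaml key of the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. A minimal sketch of the full object, assuming the same intentionally invalid storage class as above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-db
        spec:
          storageClassName: foo   # intentionally non-existent storage class
          resources:
            requests:
              storage: 50Gi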
2. Wait for CMO to go degraded
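One way to watch for the transition (a sketch; the timeout value is arbitrary):

# block until the monitoring ClusterOperator reports Degraded=True
oc wait --for=condition=Degraded=True clusteroperator/monitoring --timeout=15m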
Actual results:
$ oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
{
"lastTransitionTime": "2022-08-25T08:28:23Z",
"message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
"reason": "UpdatingPrometheusK8SFailed",
"status": "False",
"type": "Available"
},
{
"lastTransitionTime": "2022-08-25T08:28:23Z",
"message": "NoPodReady: shard 0: pod prometheus-k8s-0: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.\nshard 0: pod prometheus-k8s-1: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.",
"reason": "UpdatingPrometheusK8SFailed",
"status": "True",
"type": "Degraded"
}
]
The message when Alertmanager has a wrong PVC is also incorrect; that issue may need to be tracked in a new bug.
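The Alertmanager case was presumably reproduced with an analogous volumeClaimTemplate under alertmanagerMain in the same config.yaml; a sketch under that assumption (the exact config used is not shown in this report, and the PVC name and size below are illustrative):

alertmanagerMain:
  volumeClaimTemplate:
    metadata:
      name: alertmanager-db   # illustrative name
    spec:
      storageClassName: foo   # same non-existent storage class
      resources:
        requests:
          storage: 10Gi        # illustrative size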
% oc -n openshift-monitoring describe pod alertmanager-main-0 |tail -n 10
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 13m default-scheduler 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
Warning FailedScheduling 13m default-scheduler 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
hongyli@hongyli-mac Downloads % oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
[
{
"lastTransitionTime": "2022-08-25T08:51:55Z",
"reason": "AsExpected",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2022-08-25T08:51:55Z",
"message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
"reason": "UpdatingAlertmanagerFailed",
"status": "True",
"type": "Degraded"
}
]
Set the bug to Tested, as the Alertmanager issue will be tracked in https://issues.redhat.com/browse/OCPBUGS-610.

Based on comment 18 and comment 19, setting the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399