Bug 1949840 - CMO reports unavailable during upgrades
Summary: CMO reports unavailable during upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jan Fajerski
QA Contact: hongyan li
URL:
Whiteboard:
Duplicates: 1925517 1956308 1977470
Depends On:
Blocks: 1982369 1984730 2005205
 
Reported: 2021-04-15 08:49 UTC by Jan Fajerski
Modified: 2021-10-18 17:30 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2005205 2005206
Environment:
[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
Last Closed: 2021-10-18 17:30:04 UTC
Target Upstream Version:
Embargoed:




Links:
  GitHub: openshift/cluster-monitoring-operator pull 1193 (closed) - "Bug 1949840: Improve update and status reporting" - last updated 2021-07-14 16:44:08 UTC
  Red Hat Product Errata: RHSA-2021:3759 - last updated 2021-10-18 17:30:35 UTC

Internal Links: 1925517

Description Jan Fajerski 2021-04-15 08:49:58 UTC
Description of problem:

From https://bugzilla.redhat.com/show_bug.cgi?id=1948082:

cmo#1112 is going to recycle the current Degraded=True reason when setting Available=False and Progressing=False.  Makes sense to me.  But ideally monitoring won't go Available=False at all during CI updates.  Checking the Degraded information from the job from comment 0:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/monitoring condition/Degraded'
Apr 09 13:16:50.662 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
Apr 09 13:16:50.662 - 329s  E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
Apr 09 13:22:19.749 W clusteroperator/monitoring condition/Degraded status/False changed: 
Apr 09 13:25:31.238 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingControlPlanecomponentsFailed changed: Failed to rollout the stack. Error: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.56:8080: connect: no route to host
Apr 09 13:25:31.238 - 212s  E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.56:8080: connect: no route to host
Apr 09 13:29:03.744 W clusteroperator/monitoring condition/Degraded status/False changed: 

So ~5m with UpdatingPrometheusOperatorFailed on an admission webhook x509.  And then ~3m with UpdatingControlPlanecomponentsFailed on an admission webhook "no route to host".  Should I spin out another bug for sorting out what's going on with those, or did you want to address it as part of this bug as well?  Definitely spin it out if the issues are with a non-monitoring component that you can't directly fix.

Version-Release number of selected component (if applicable):


How reproducible:
Upgrade a cluster and monitor the CMO's Available condition (see the sketch below).
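
For reference, a minimal Go sketch of one way to do that programmatically (plain `oc get clusteroperator monitoring -w` works just as well); it assumes the openshift/api and openshift/client-go packages and a kubeconfig at the default location:

  // Hypothetical watcher: polls clusteroperator/monitoring and prints its
  // Available condition every 10s while the upgrade runs.
  package main

  import (
      "context"
      "fmt"
      "time"

      configv1 "github.com/openshift/api/config/v1"
      configclient "github.com/openshift/client-go/config/clientset/versioned"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
      cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
      if err != nil {
          panic(err)
      }
      client := configclient.NewForConfigOrDie(cfg)

      for {
          co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "monitoring", metav1.GetOptions{})
          if err != nil {
              fmt.Println("get clusteroperator/monitoring:", err)
          } else {
              for _, c := range co.Status.Conditions {
                  if c.Type == configv1.OperatorAvailable {
                      fmt.Printf("%s Available=%s reason=%s message=%q\n",
                          time.Now().Format(time.RFC3339), c.Status, c.Reason, c.Message)
                  }
              }
          }
          time.Sleep(10 * time.Second)
      }
  }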

Steps to Reproduce:
1.
2.
3.

Actual results:
CMO reports Available=False during the upgrade.

Expected results:
CMO should remain Available=True throughout a normal upgrade.


Additional info:

Comment 1 Simon Pasquier 2021-04-15 09:56:43 UTC
One thing that occurred to me is that CMO runs each component's task in parallel [1][2][3]. It means that the prometheus-operator task might be rolling out a new version (which isn't ready yet) while at the same time, another task is trying to update Prometheus rules (and failing because the request to the validating webhook endpoint times out). It would be more logical to serialize things a bit:
1) run the prometheus-operator task and wait for completion
2) run all other tasks
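
A minimal, self-contained Go sketch of that ordering (the Task type, names, and runner here are hypothetical stand-ins for illustration, not CMO's actual task code):

  // Hypothetical sketch: run the prometheus-operator task to completion first,
  // then run the remaining tasks in parallel.
  package main

  import (
      "fmt"
      "sync"
  )

  // Task is a stand-in for CMO's task type; the real interface differs.
  type Task struct {
      Name string
      Run  func() error
  }

  func runSerialThenParallel(first Task, rest []Task) []error {
      // 1) prometheus-operator must be ready before anything that talks to
      //    its admission webhook (e.g. PrometheusRule reconciliation).
      if err := first.Run(); err != nil {
          return []error{fmt.Errorf("%s: %w", first.Name, err)}
      }

      // 2) everything else can still run concurrently.
      var (
          mu   sync.Mutex
          errs []error
          wg   sync.WaitGroup
      )
      for _, t := range rest {
          t := t
          wg.Add(1)
          go func() {
              defer wg.Done()
              if err := t.Run(); err != nil {
                  mu.Lock()
                  errs = append(errs, fmt.Errorf("%s: %w", t.Name, err))
                  mu.Unlock()
              }
          }()
      }
      wg.Wait()
      return errs
  }

  func main() {
      noop := func() error { return nil }
      errs := runSerialThenParallel(
          Task{Name: "prometheus-operator", Run: noop},
          []Task{
              {Name: "control-plane-components", Run: noop},
              {Name: "prometheus-k8s", Run: noop},
          },
      )
      fmt.Println("errors:", errs)
  }

The point is simply that nothing which talks to the admission webhook starts before prometheus-operator reports ready.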

We should also probably set the "failurePolicy" field to "Ignore" instead of the default "Fail" value [4][5] because CVO would face the same issue during upgrades when it synchronizes rules for other components (I think another BZ describing this problem already exists). The validating webhook is a guard rail but even if an invalid rule resource gets accepted, the "PrometheusRuleFailures" alert should fire eventually [6].

[1] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/pkg/operator/operator.go#L444-L464
[2] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/pkg/operator/operator.go#L472-L481
[3] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/pkg/tasks/tasks.go#L38-L52
[4] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#validatingwebhook-v1-admissionregistration-k8s-io
[5] https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#failure-policy
[6] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/assets/prometheus-k8s/prometheus-rule.yaml#L194-L203

Comment 2 Damien Grisonnet 2021-04-15 10:10:06 UTC
This sounds similar to bug 1925517. Simon, is it the BZ you were referring to?

Comment 3 Simon Pasquier 2021-04-15 10:42:55 UTC
@Damien I think it's a different issue.

Comment 4 Damien Grisonnet 2021-04-15 13:23:42 UTC
I'd say let's tackle this BZ and bug 1948082 separately since they are not really related.

My suggestion for this one would be to implement comment 1 in two separate efforts. We could start by fixing the `failurePolicy` of the validating webhook so that CMO's availability is not impacted by the webhook during upgrades, and only then start serializing the tasks.
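
For illustration, a hedged sketch of what the failurePolicy change amounts to, written with the upstream admissionregistration/v1 Go types; the object name, service reference, and rule details below are inferred from the webhook errors quoted in the description and may not match CMO's shipped asset exactly:

  package main

  import (
      "encoding/json"
      "fmt"

      admissionv1 "k8s.io/api/admissionregistration/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  func main() {
      ignore := admissionv1.Ignore // instead of the default "Fail"
      sideEffects := admissionv1.SideEffectClassNone
      path := "/admission-prometheusrules/validate"
      port := int32(8080)

      cfg := admissionv1.ValidatingWebhookConfiguration{
          TypeMeta:   metav1.TypeMeta{APIVersion: "admissionregistration.k8s.io/v1", Kind: "ValidatingWebhookConfiguration"},
          ObjectMeta: metav1.ObjectMeta{Name: "prometheusrules.openshift.io"},
          Webhooks: []admissionv1.ValidatingWebhook{{
              Name: "prometheusrules.openshift.io",
              ClientConfig: admissionv1.WebhookClientConfig{
                  Service: &admissionv1.ServiceReference{
                      Namespace: "openshift-monitoring",
                      Name:      "prometheus-operator",
                      Path:      &path,
                      Port:      &port,
                  },
              },
              Rules: []admissionv1.RuleWithOperations{{
                  Operations: []admissionv1.OperationType{admissionv1.Create, admissionv1.Update},
                  Rule: admissionv1.Rule{
                      APIGroups:   []string{"monitoring.coreos.com"},
                      APIVersions: []string{"*"},
                      Resources:   []string{"prometheusrules"},
                  },
              }},
              // The change proposed here: tolerate an unreachable webhook rather
              // than rejecting PrometheusRule writes and degrading CMO.
              FailurePolicy:           &ignore,
              SideEffects:             &sideEffects,
              AdmissionReviewVersions: []string{"v1"},
          }},
      }

      out, _ := json.MarshalIndent(cfg, "", "  ")
      fmt.Println(string(out))
  }

With failurePolicy set to Ignore, an unreachable webhook no longer blocks PrometheusRule updates during the rollout; the PrometheusRuleFailures alert remains as the backstop for genuinely invalid rules (see comment 1).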

Comment 5 W. Trevor King 2021-04-16 03:52:28 UTC
Linking the e2e test-case so Sippy will find and link this bug.

Comment 6 Jan Fajerski 2021-05-25 15:00:42 UTC
No time for this in the current sprint.

Comment 7 Brad Ison 2021-05-26 14:04:11 UTC
Just a note that this seems related: Bug 1925517

Comment 8 Brad Ison 2021-05-27 10:04:06 UTC
*** Bug 1925517 has been marked as a duplicate of this bug. ***

Comment 10 hongyan li 2021-07-14 15:01:49 UTC
Searched build logs from the last 2 days and didn't find "UpdatingPrometheusOperatorFailed on an admission webhook x509" followed by "UpdatingControlPlanecomponentsFailed".

https://search.ci.openshift.org/?search=reconciling+prometheus-operator+rules+PrometheusRule+failed&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 11 W. Trevor King 2021-07-14 16:56:42 UTC
PR merged 6d ago, but I still see some Available!=True in 4.9:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '4[.]9.*failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 18 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 18 runs, 83% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
release-openshift-ocp-installer-e2e-aws-ovn-4.9 (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&name=periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial&search=clusteroperator/monitoring+should+not+change+condition/Available' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576

That job has:

  : [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
    Run #0: Failed (0s)
    1 unexpected clusteroperator state transitions during e2e test run 

    Jul 14 09:17:10.264 - 918s  E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]	16s
    fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jul 14 09:32:28.656: Unexpected alerts fired or pending after the test run:

    alert ClusterOperatorDown fired for 299 seconds with labels: {endpoint="metrics", instance="192.168.12.71:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-56c5b4888-bdzbx", service="cluster-version-operator", severity="critical", version="4.9.0-0.nightly-2021-07-14-072531"}
    alert KubePodNotReady fired for 1122 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-1", severity="warning"}
    alert KubeStatefulSetReplicasMismatch fired for 1122 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}

Drilling in on that pod:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576/artifacts/e2e-vsphere-serial/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").status'
  ...
    "initContainerStatuses": [
      {
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:602c84d12df80ef741fe171dc3338f681163b6c8fc13d33a70e50314984ef934",
        "imageID": "",
        "lastState": {},
        "name": "init-config-reloader",
        "ready": false,
        "restartCount": 0,
        "state": {
          "waiting": {
            "reason": "PodInitializing"
          }
        }
      }
    ],
  ...
  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576/artifacts/e2e-vsphere-serial/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").spec.nodeName'
  ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7

And on the node:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576/artifacts/e2e-vsphere-serial/gather-extra/artifacts/nodes/ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7/journal | gunzip | grep prometheus-k8s-1 | tail -n3
  Jul 14 09:35:53.516703 ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7 hyperkube[1759]: E0714 09:35:53.516678    1759 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 5137595f-7ce3-e95a-5c03-06d835dea807/ci-op-9d96w9r5-c66f0-k-pvc-9fff39b2-d247-4c4e-96d2-cc0271056593.vmdk podName: nodeName:}" failed. No retries permitted until 2021-07-14 09:37:55.516652207 +0000 UTC m=+4554.200789810 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-9fff39b2-d247-4c4e-96d2-cc0271056593\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 5137595f-7ce3-e95a-5c03-06d835dea807/ci-op-9d96w9r5-c66f0-k-pvc-9fff39b2-d247-4c4e-96d2-cc0271056593.vmdk\") pod \"prometheus-k8s-1\" (UID: \"d5f52060-532f-4e05-96dc-8160afaa735d\") "
  Jul 14 09:37:11.350676 ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7 hyperkube[1759]: E0714 09:37:11.350589    1759 kubelet.go:1713] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[prometheus-data], unattached volumes=[tls-assets prometheus-data secret-grpc-tls configmap-serving-certs-ca-bundle secret-prometheus-k8s-htpasswd secret-prometheus-k8s-thanos-sidecar-tls web-config prometheus-trusted-ca-bundle config secret-kube-etcd-client-certs kube-api-access-4b6h4 secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 configmap-kubelet-serving-ca-bundle config-out secret-prometheus-k8s-tls secret-prometheus-k8s-proxy]: timed out waiting for the condition" pod="openshift-monitoring/prometheus-k8s-1"
  Jul 14 09:37:11.350676 ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7 hyperkube[1759]: E0714 09:37:11.350630    1759 pod_workers.go:190] "Error syncing pod, skipping" err="unmounted volumes=[prometheus-data], unattached volumes=[tls-assets prometheus-data secret-grpc-tls configmap-serving-certs-ca-bundle secret-prometheus-k8s-htpasswd secret-prometheus-k8s-thanos-sidecar-tls web-config prometheus-trusted-ca-bundle config secret-kube-etcd-client-certs kube-api-access-4b6h4 secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 configmap-kubelet-serving-ca-bundle config-out secret-prometheus-k8s-tls secret-prometheus-k8s-proxy]: timed out waiting for the condition" pod="openshift-monitoring/prometheus-k8s-1" podUID=d5f52060-532f-4e05-96dc-8160afaa735d

So that's a kubelet or storage issue, and not monitoring's fault.  VERIFIED here makes sense to me :)

Comment 12 Simon Pasquier 2021-07-26 06:34:51 UTC
*** Bug 1977470 has been marked as a duplicate of this bug. ***

Comment 16 Jayapriya Pai 2021-09-13 14:04:58 UTC
*** Bug 1956308 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2021-10-18 17:30:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

