Description of problem:

From https://bugzilla.redhat.com/show_bug.cgi?id=1948082:

cmo#1112 is going to recycle the current Degraded=True reason when setting Available=False and Progressing=False. Makes sense to me. But ideally monitoring won't go Available=False at all during CI updates. Checking the Degraded information from the job from comment 0:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/monitoring condition/Degraded'
Apr 09 13:16:50.662 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
Apr 09 13:16:50.662 - 329s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
Apr 09 13:22:19.749 W clusteroperator/monitoring condition/Degraded status/False changed:
Apr 09 13:25:31.238 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingControlPlanecomponentsFailed changed: Failed to rollout the stack. Error: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.56:8080: connect: no route to host
Apr 09 13:25:31.238 - 212s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.56:8080: connect: no route to host
Apr 09 13:29:03.744 W clusteroperator/monitoring condition/Degraded status/False changed:

So ~5m with UpdatingPrometheusOperatorFailed on an admission webhook x509 error, and then ~3m with UpdatingControlPlanecomponentsFailed on an admission webhook "no route to host". Should I spin out another bug for sorting out what's going on with those, or did you want to address it as part of this bug as well?

Definitely spin it out if the issues are with a non-monitoring component that you can't directly fix.

Version-Release number of selected component (if applicable):

How reproducible:
Upgrade a cluster and monitor CMO available status.

Steps to Reproduce:
1.
2.
3.

Actual results:
CMO reports unavailable.

Expected results:
CMO should report available during normal upgrade.

Additional info:
One thing that occurred to me is that CMO runs each component's task in parallel [1][2][3]. It means that the prometheus-operator task might be rolling out a new version (which isn't ready yet) while at the same time, another task is trying to update Prometheus rules (and failing because the request to the validating webhook endpoint times out). It would be more logical to serialize things a bit:

1) run the prometheus-operator task and wait for completion
2) run all other tasks

We should also probably set the "failurePolicy" field to "Ignore" instead of the default "Fail" value [4][5] because CVO would face the same issue during upgrades when it synchronizes rules for other components (I think another BZ describing this problem already exists). The validating webhook is a guard rail but even if an invalid rule resource gets accepted, the "PrometheusRuleFailures" alert should fire eventually [6].

[1] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/pkg/operator/operator.go#L444-L464
[2] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/pkg/operator/operator.go#L472-L481
[3] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/pkg/tasks/tasks.go#L38-L52
[4] https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#validatingwebhook-v1-admissionregistration-k8s-io
[5] https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#failure-policy
[6] https://github.com/openshift/cluster-monitoring-operator/blob/926059c55323131efeb5f79cb2706175b593893d/assets/prometheus-k8s/prometheus-rule.yaml#L194-L203
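The ordering above can be sketched as shell pseudologic. This is a minimal illustration only: the real tasks are Go code in pkg/tasks, and `run_task` and the task names here are placeholders, not real CMO functions.

```shell
run_task() {
  # Stand-in for a reconciliation task; the real tasks roll out manifests.
  echo "running $1"
}

# 1) Run the prometheus-operator task and wait for it to complete, so the
#    admission webhook backend is serving before anything touches rules.
run_task "prometheus-operator" || exit 1

# 2) Only then fan out the remaining tasks in parallel, as CMO does today.
run_task "control-plane-components" &
run_task "prometheus" &
wait
```

The point of the design is that step 2 can only start once the webhook endpoint created in step 1 is reachable, which removes the race seen in the CI logs.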
This sounds similar to bug 1925517. Simon, is it the BZ you were referring to?
@Damien I think it's a different issue.
I'd say let's tackle this BZ and bug 1948082 separately since they are not really related. My suggestion for this one would be to implement comment 1 in two separate efforts. We could start by fixing the `failurePolicy` of the validation webhook so that we make sure that CMO's availability is not impacted by the webhook during upgrades. Only then would we start serializing the tasks.
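For reference, a minimal sketch of that first step, assuming the webhook is registered via a ValidatingWebhookConfiguration along these lines (object and webhook names are illustrative; only the `failurePolicy` field is the point here):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: prometheusrules.openshift.io
webhooks:
- name: prometheusrules.openshift.io
  # "Ignore" means an unreachable webhook endpoint no longer blocks
  # PrometheusRule updates during upgrades; the PrometheusRuleFailures
  # alert still catches rules that fail to load afterwards.
  failurePolicy: Ignore
```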
Linking the e2e test-case so Sippy will find and link this bug.
No time for this in the current sprint.
Just a note that this seems related: Bug 1925517
*** Bug 1925517 has been marked as a duplicate of this bug. ***
Searched the build logs of the last 2 days and didn't find "UpdatingPrometheusOperatorFailed on an admission webhook x509" followed by "UpdatingControlPlanecomponentsFailed":

https://search.ci.openshift.org/?search=reconciling+prometheus-operator+rules+PrometheusRule+failed&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
PR merged 6d ago, but I still see some Available!=True in 4.9:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '4[.]9.*failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 18 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 18 runs, 83% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-from-stable-4.7-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
release-openshift-ocp-installer-e2e-aws-ovn-4.9 (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&name=periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial&search=clusteroperator/monitoring+should+not+change+condition/Available' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576

That job has:

: [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
Run #0: Failed 0s
1 unexpected clusteroperator state transitions during e2e test run
Jul 14 09:17:10.264 - 918s E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.

: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 16s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jul 14 09:32:28.656: Unexpected alerts fired or pending after the test run:
alert ClusterOperatorDown fired for 299 seconds with labels: {endpoint="metrics", instance="192.168.12.71:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-56c5b4888-bdzbx", service="cluster-version-operator", severity="critical", version="4.9.0-0.nightly-2021-07-14-072531"}
alert KubePodNotReady fired for 1122 seconds with labels: {namespace="openshift-monitoring", pod="prometheus-k8s-1", severity="warning"}
alert KubeStatefulSetReplicasMismatch fired for 1122 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="prometheus-k8s"}

Drilling in on that pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576/artifacts/e2e-vsphere-serial/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").status'
...
  "initContainerStatuses": [
    {
      "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:602c84d12df80ef741fe171dc3338f681163b6c8fc13d33a70e50314984ef934",
      "imageID": "",
      "lastState": {},
      "name": "init-config-reloader",
      "ready": false,
      "restartCount": 0,
      "state": {
        "waiting": {
          "reason": "PodInitializing"
        }
      }
    }
  ],
...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576/artifacts/e2e-vsphere-serial/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").spec.nodeName'
ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7

And on the node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1415211495435800576/artifacts/e2e-vsphere-serial/gather-extra/artifacts/nodes/ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7/journal | gunzip | grep prometheus-k8s-1 | tail -n3
Jul 14 09:35:53.516703 ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7 hyperkube[1759]: E0714 09:35:53.516678 1759 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/vsphere-volume/[WorkloadDatastore] 5137595f-7ce3-e95a-5c03-06d835dea807/ci-op-9d96w9r5-c66f0-k-pvc-9fff39b2-d247-4c4e-96d2-cc0271056593.vmdk podName: nodeName:}" failed. No retries permitted until 2021-07-14 09:37:55.516652207 +0000 UTC m=+4554.200789810 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume \"pvc-9fff39b2-d247-4c4e-96d2-cc0271056593\" (UniqueName: \"kubernetes.io/vsphere-volume/[WorkloadDatastore] 5137595f-7ce3-e95a-5c03-06d835dea807/ci-op-9d96w9r5-c66f0-k-pvc-9fff39b2-d247-4c4e-96d2-cc0271056593.vmdk\") pod \"prometheus-k8s-1\" (UID: \"d5f52060-532f-4e05-96dc-8160afaa735d\") "
Jul 14 09:37:11.350676 ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7 hyperkube[1759]: E0714 09:37:11.350589 1759 kubelet.go:1713] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[prometheus-data], unattached volumes=[tls-assets prometheus-data secret-grpc-tls configmap-serving-certs-ca-bundle secret-prometheus-k8s-htpasswd secret-prometheus-k8s-thanos-sidecar-tls web-config prometheus-trusted-ca-bundle config secret-kube-etcd-client-certs kube-api-access-4b6h4 secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 configmap-kubelet-serving-ca-bundle config-out secret-prometheus-k8s-tls secret-prometheus-k8s-proxy]: timed out waiting for the condition" pod="openshift-monitoring/prometheus-k8s-1"
Jul 14 09:37:11.350676 ci-op-9d96w9r5-c66f0-k6nzk-worker-zg4q7 hyperkube[1759]: E0714 09:37:11.350630 1759 pod_workers.go:190] "Error syncing pod, skipping" err="unmounted volumes=[prometheus-data], unattached volumes=[tls-assets prometheus-data secret-grpc-tls configmap-serving-certs-ca-bundle secret-prometheus-k8s-htpasswd secret-prometheus-k8s-thanos-sidecar-tls web-config prometheus-trusted-ca-bundle config secret-kube-etcd-client-certs kube-api-access-4b6h4 secret-kube-rbac-proxy prometheus-k8s-rulefiles-0 configmap-kubelet-serving-ca-bundle config-out secret-prometheus-k8s-tls secret-prometheus-k8s-proxy]: timed out waiting for the condition" pod="openshift-monitoring/prometheus-k8s-1" podUID=d5f52060-532f-4e05-96dc-8160afaa735d

So that's a kubelet or storage issue, and not monitoring's fault. VERIFIED here makes sense to me :)
*** Bug 1977470 has been marked as a duplicate of this bug. ***
*** Bug 1956308 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759