Bug 1957775
Summary: | CVO creating cloud-controller-manager too early causing upgrade failures | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | ravig <rgudimet> |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
Severity: | medium | Docs Contact: | |
Priority: | high | | |
Version: | 4.8 | CC: | aos-bugs, jluhrsen, jokerman, lmohanty, scuppett, vrutkovs, wking, yanyang |
Target Milestone: | --- | Keywords: | Upgrades |
Target Release: | 4.8.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause: The cluster-version operator had been pre-creating ClusterOperator resources at the beginning of the update phase.
Consequence: For new second-level operators that were deeper in the update, this could leave their ClusterOperator without status conditions for many minutes. After ten minutes without conditions, the ClusterOperatorDown and/or ClusterOperatorDegraded alerts would begin firing.
Fix: The cluster-version operator now pre-creates ClusterOperator resources much closer to the creation of the new second-level operator's Deployment.
Result: The new second-level operator has time to come up and set status.conditions on its ClusterOperator before the alerts start firing.
|
Story Points: | --- | | |
Clone Of: | | Environment: | |
Last Closed: | 2021-07-27 23:06:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1959238 | | |
Description
ravig
2021-05-06 13:05:03 UTC
CVO extracts manifests from the image and applies them according to the run level:
https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#what-is-the-order-that-resources-get-createdupdated-in

https://github.com/openshift/cluster-cloud-controller-manager-operator/tree/master/manifests sets these to 26, so if the operator must run later it should update the run level in its manifests. Reassigning this to the cloud-controller-manager folks.

Yeah, this is effectively a dup of bug 1929917, but for that older bug we just did a narrow, hacky patch. Ideally we fix this by adjusting ClusterOperator precreation. Currently we precreate at the beginning of the sync cycle, which can be a long time before we create the operator Deployment that will manage that ClusterOperator. We should instead only be precreating at the beginning of the manifest-task block that contains the ClusterOperator, because that's usually the same block that creates the namespace, the operator Deployment, and the other resources we'd want to must-gather if the operator pod failed to come up.

I moved the relatedObjects defaulting off to bug 1957879, because it's orthogonal to this CVO bug about the timing of the precreation.

Cloud-controller-manager has not been introduced in a downstream nightly build yet.

# oc adm release extract --from registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-09-105430 --to manifest
Extracted release payload from digest sha256:97b2a9f92202962201c9c81a1a5f299391e945181ed7a157dd599145e24b7cc2 created at 2021-05-09T10:56:23Z
# cd manifest
# ls | grep cloud-controller-manager

No manifests for cloud-controller-manager in the nightly build.

# oc adm release extract --from registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-05-09-205909 --to manifest_ci
Extracted release payload from digest sha256:471146cbe0ba5992f6b20306de6df5af263627a169a375cdd713bfa3750d6a43 created at 2021-05-09T21:00:08Z
# cd manifest_ci/
# ls | grep cloud-controller-manager
0000_26_cloud-controller-manager-operator_00_namespace.yaml
0000_26_cloud-controller-manager-operator_01_images.configmap.yaml
0000_26_cloud-controller-manager-operator_02_rbac_operator.yaml
0000_26_cloud-controller-manager-operator_03_rbac_provider.yaml
0000_26_cloud-controller-manager-operator_11_deployment.yaml
0000_26_cloud-controller-manager-operator_12_clusteroperator.yaml
0000_26_cloud-controller-manager-operator_13_credentialsrequest-openstack.yaml

Cloud-controller-manager is included in the CI build.
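For reference, a minimal way to see which run level each extracted manifest belongs to, assuming the manifest_ci directory created above (CVO manifest names follow a <group>_<runlevel>_<component>_... pattern, so the awk field split below is only illustrative):

$ ls manifest_ci/ | awk -F_ '{print $1 "_" $2}' | sort | uniq -c

This counts manifests per <group>_<runlevel> prefix; the cloud-controller-manager-operator manifests should all show up under 0000_26.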
So checking the CI test results. The most recent CI failure for the cloud-controller-manager alert was [1], from 4.7.10 to 4.8.0-0.ci-2021-05-11-010006.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&context=0&type=junit&groupBy=job&search=alert+ClusterOperatorDown+fired' | grep 'cloud-controller-manager'
alert ClusterOperatorDown fired for 870 seconds with labels: {endpoint="metrics", instance="10.0.252.69:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-65897987c9-rjg6w", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 630 seconds with labels: {endpoint="metrics", instance="10.0.137.199:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-5b648b684-8h466", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 600 seconds with labels: {endpoint="metrics", instance="10.0.190.224:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-599856699c-hbpqj", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 630 seconds with labels: {endpoint="metrics", instance="10.0.140.3:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-7546c95b48-mb2rg", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 720 seconds with labels: {endpoint="metrics", instance="10.0.188.213:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-68cc44877c-fqcbr", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 630 seconds with labels: {endpoint="metrics", instance="10.0.241.56:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-5cb4c949fd-z5kj8", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8421 seconds with labels: {endpoint="metrics", instance="10.0.157.146:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-c68db747c-rgzbl", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8416 seconds with labels: {endpoint="metrics", instance="10.0.173.231:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-85bcfbc9b6-tdk6b", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8413 seconds with labels: {endpoint="metrics", instance="10.0.133.67:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-5b55669cd8-xzfkg", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8422 seconds with labels: {endpoint="metrics", instance="10.0.199.5:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-984c994f6-487k7", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8439 seconds with labels: {endpoint="metrics", instance="10.0.178.200:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-86c9d868b4-z9rm4", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8428 seconds with labels: {endpoint="metrics", instance="10.0.150.66:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-866595c45b-6ck9x", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8399 seconds with labels: {endpoint="metrics", instance="10.0.177.212:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-dc86c7865-bpgns", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8421 seconds with labels: {endpoint="metrics", instance="10.0.156.232:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-78d48b7556-s9jm8", service="cluster-version-operator", severity="critical"}

So the issue does not seem to be fixed.
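The same CI-search query can also be pulled as JSON and reduced to the matching job URLs, mirroring the curl/jq form used later in this bug (a minimal sketch; the result object is keyed by job URL):

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&search=alert+ClusterOperatorDown+fired+for.*name="cloud-controller-manager"' | jq -r 'keys[]'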
> [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608

Huh. Checking the creation time:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[].metadata | select(.name == "cloud-controller-manager").creationTimestamp'
2021-05-11T01:53:24Z

Comparing with the running CVO:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version" and (.reason == "Pulled" or .reason == "LeaderElection")) | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
2021-05-11T01:19:59Z 1 LeaderElection: ip-10-0-11-203_4e0c01c5-6641-432b-b911-4bb0fa8e46e9 became leader
2021-05-11T01:27:16Z 1 Pulled: Successfully pulled image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:24f0bcf67474e06ceb1091fc63bddd6010e1d13f5fe5604962a4579ee98b8e22" in 17.663489406s
2021-05-11T01:34:12Z 1 LeaderElection: ip-10-0-151-38_3059a40c-947d-42ee-9ab9-edfa9d6a1161 became leader
2021-05-11T01:53:21Z 1 Pulled: Successfully pulled image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:25679909cf78ea99a81bbd278606aed16979dff5528a5a86ca1c8a3d839507e2" in 10.20786225s
2021-05-11T01:53:37Z 1 Pulled: Successfully pulled image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:25679909cf78ea99a81bbd278606aed16979dff5528a5a86ca1c8a3d839507e2" in 9.138006891s
2021-05-11T01:53:38Z 1 LeaderElection: ip-10-0-252-69_e86c056a-4e60-4baa-898c-ff7b58f9b85b became leader
2021-05-11T02:48:20Z 1 Pulled: Container image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:25679909cf78ea99a81bbd278606aed16979dff5528a5a86ca1c8a3d839507e2" already present on machine
...

So the ClusterOperator seems to have been created by the outgoing 4.7 CVO (which doesn't have the patch for this bug). I expect a 4.7 -> 4.8 update where you delete the new ClusterOperator a minute or two after it gets created would delay the recreation until we got to the manifest block for the new second-level operator. Or we can backport to 4.7 and test that way.

Ah, or we can get more details out of hosted Loki [1], with:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608"} | unpack | namespace="openshift-cluster-version" |= "Precreated resource clusteroperator \"cloud-controller-manager\""

showing:

1. 2021-05-11T01:53:24.860Z: cluster-version-operator-7664cbf99b-mxfd4
2. 2021-05-11T02:17:48.413Z: cluster-version-operator-65897987c9-rjg6w
...

Getting the relevant CVO versions:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608"} | unpack | pod_name="cluster-version-operator-7664cbf99b-mxfd4" |= "Graceful shutdown complete"

gives:

2021-05-10 18:53:25 I0511 01:53:25.453886 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.7.0-202104250659.p0-1db8867.

and:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608"} | unpack | pod_name="cluster-version-operator-65897987c9-rjg6w" |= "Graceful shutdown complete"

gives:

2021-05-10 19:48:20 I0511 02:48:20.331180 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v1.0.0-511-g86db02a6-dirty.

So yup, makes sense. I'll open a backport PR for 4.7...

[1]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1391921283058372608%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-cluster-version%5C%22%20%7C%3D%20%5C%22Precreated%20resource%20clusteroperator%20%5C%5C%5C%22cloud-controller-manager%5C%5C%5C%22%5C%22%22%7D%5D
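The same kind of comparison can be made on a live cluster instead of the job artifacts, assuming oc access (a minimal sketch; the grep filter is illustrative):

$ oc get clusteroperator cloud-controller-manager -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
$ oc -n openshift-cluster-version get events --sort-by=.lastTimestamp | grep -E 'Pulled|LeaderElection'

If the ClusterOperator's creationTimestamp predates the CVO pod that pulled the target release image, it was precreated by the outgoing CVO rather than by the incoming one.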
Verifying with a 4.7.10 to 4.8.0-0.ci-2021-05-12-045439 upgrade.

Steps to verify:

1. Install a cluster with 4.7.10.

$ oc get co

shows no cloud-controller-manager ClusterOperator.

2. Upgrade to 4.8.0-0.ci-2021-05-12-045439.

# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:86eaf8b920e7ad906367c90dd41bebcd060b95c3bad636baf81ea8ffdd3def43 --allow-explicit-upgrade --force
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.10    True        True          4m59s   Working towards 4.8.0-0.ci-2021-05-12-045439: 69 of 695 done (9% complete)

Watching the cloud-controller-manager:

$ oc get co -w | grep cloud-controller-manager
cloud-controller-manager
cloud-controller-manager
cloud-controller-manager

So cloud-controller-manager is pre-created without condition status by the 4.7 CVO.

Removing the cloud-controller-manager ClusterOperator:

$ oc delete co/cloud-controller-manager
clusteroperator.config.openshift.io "cloud-controller-manager" deleted

At this point, the CVO has already rolled out to 4.8.

Watching the cloud-controller-manager:

$ oc get co -w | grep cloud-controller-manager
cloud-controller-manager
cloud-controller-manager
cloud-controller-manager   4.8.0-0.ci-2021-05-12-045439   True   False   False   0s

The 4.8 CVO creates cloud-controller-manager and the condition status becomes visible soon after.

Eventually, the upgrade passes and no ClusterOperatorDegraded and/or ClusterOperatorDown alert for cloud-controller-manager is firing.

# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.ci-2021-05-12-045439   True        False         14m     Cluster version is 4.8.0-0.ci-2021-05-12-045439

Moving it to verified state.
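A compact way to confirm that the recreated ClusterOperator picked up status conditions quickly, assuming cluster access (a sketch; the jq projection is illustrative):

$ oc get clusteroperator cloud-controller-manager -o json | jq '{created: .metadata.creationTimestamp, conditions: ((.status.conditions // []) | length)}'

A non-zero conditions count within a few minutes of creation means the operator reported status well inside the ten-minute window after which ClusterOperatorDown would fire.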
@yanyang, @wking,

Was this bz supposed to address the upgrade test case failure we see in CI? That's what I thought by reading it, but the test case is still failing.

Here is a recent upgrade job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1396907474669277184

Snippet of the failure:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h15m55s
May 24 21:10:10.643: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 240 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
alert ClusterOperatorDown fired for 750 seconds with labels: {endpoint="metrics", instance="10.0.139.245:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-f4bffc556-lnzkq", service="cluster-version-operator", severity="critical"}

Moving this back to ASSIGNED to get some attention. If that is not right, I can open a new BZ, or maybe there is another BZ already open that I didn't find yet.

(In reply to jamo luhrsen from comment #9)
> @yanyang, @wking,
>
> Was this bz supposed to address the upgrade test case failure we see in CI?
> That's what I thought by reading it, but the test case is still failing.
>
> Here is a recent upgrade job:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1396907474669277184
>
> snippet of the failure:
> disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h15m55s
> May 24 21:10:10.643: Unexpected alerts fired or pending during the upgrade:
>
> alert AggregatedAPIDown fired for 240 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
> alert ClusterOperatorDown fired for 750 seconds with labels: {endpoint="metrics", instance="10.0.139.245:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-f4bffc556-lnzkq", service="cluster-version-operator", severity="critical"}
>
> Moving this back to ASSIGNED to get some attention. If that is not right, I can open a new BZ or maybe there is another BZ already open that I didn't find yet.

Wait, the fix that went into 4.7 only happened 3 days ago and there has been no new nightly promoted since then. Maybe that is why we are still failing this test case. I'll monitor and put this back to VERIFIED once I see the test case passing.

jamo, the 4.7 stable build has not included the fix yet, so it's expected to hit the issue if you upgrade from a 4.7 stable cluster. 4.7 CI and nightly builds have the fix. From the QE testing results, the failure is gone if you upgrade from a latest 4.7 CI or nightly cluster. Please check out bug 1959238 for the 4.7 bug status.

Bug 1959238 went out with 4.7.13, so hopefully 4.7 -> 4.8 will stop seeing this now. Moving back to MODIFIED.
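Since the fix only helps when the outgoing CVO already carries it, a quick pre-upgrade check of the installed 4.7.z version can save a false alarm (a minimal sketch; per the comments above, 4.7.13 or later includes the backport):

$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'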
As per comment 11 (https://bugzilla.redhat.com/show_bug.cgi?id=1957775#c11), this bug should be in the Verified state, as we have already verified it with 4.7 CI and nightly builds. We do not wait until the bug fix is available in stable channels. Bugs get attached to erratas and are closed when an errata is released, i.e. when the build becomes available in the fast channels; a couple of days later the versions become available in the stable channel.

CI-search is promisingly empty:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+ClusterOperatorDown+fired+for.*name="cloud-controller-manager"' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Checking that one:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&search=alert+ClusterOperatorDown+fired+for.*name="cloud-controller-manager"' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade/1399462453770719232

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade/1399462453770719232/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2021-05-31T22:07:58Z 2021-05-31T23:22:03Z Completed 4.8.0-0.ci-2021-05-31-172317
2021-05-31T21:04:46Z 2021-05-31T22:07:43Z Completed 4.7.12
2021-05-31T20:37:41Z 2021-05-31T21:00:42Z Completed 4.6.30

And 4.7.12 is from before bug 1959238 landed, so this is looking good.

Marking it verified as per comment 13 and comment 11.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438