Bug 1957775
Summary: | CVO creating cloud-controller-manager too early causing upgrade failures | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | ravig <rgudimet> |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> |
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
Severity: | medium | Docs Contact: | |
Priority: | high | | |
Version: | 4.8 | CC: | aos-bugs, jluhrsen, jokerman, lmohanty, scuppett, vrutkovs, wking, yanyang |
Target Milestone: | --- | Keywords: | Upgrades |
Target Release: | 4.8.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: |
Cause: The cluster-version operator had been pre-creating ClusterOperator resources at the beginning of the update phase.
Consequence: For new second-level operators that were deeper in the update, this could leave their ClusterOperator without status conditions for many minutes. After ten minutes without conditions, the ClusterOperatorDown and/or ClusterOperatorDegraded alerts would begin firing.
Fix: The cluster-version operator now pre-creates ClusterOperator resources much closer to the creation of the new second-level operator's Deployment.
Result: The new second-level operator has time to come up and set status.conditions on its ClusterOperator before the alerts start firing.
|
Story Points: | --- | | |
Clone Of: | | Environment: | |
Last Closed: | 2021-07-27 23:06:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1959238 | | |
Description
ravig
2021-05-06 13:05:03 UTC
CVO extracts manifests from the image and applies them according to the run level:
https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/operators.md#what-is-the-order-that-resources-get-createdupdated-in

https://github.com/openshift/cluster-cloud-controller-manager-operator/tree/master/manifests sets these to 26, so if the operator must run later it should update the run level in its manifests. Reassigning this to the cloud-controller-manager folks.

Yeah, this is effectively a dup of bug 1929917, but for that older bug we just did a narrow, hacky patch. Ideally we fix this by adjusting ClusterOperator precreation. Currently we precreate at the beginning of the sync cycle, which can be a long time before we create the operator Deployment that will manage that ClusterOperator. We should instead only be precreating at the beginning of the manifest-task block that contains the ClusterOperator, because that's usually the same block that creates the namespace, the operator Deployment, and the other resources we'd want to must-gather if the operator pod failed to come up.

I moved the relatedObjects defaulting off to bug 1957879, because it's orthogonal to this CVO bug about the timing of the precreation.

Cloud-controller-manager has not been introduced in a downstream nightly build yet.

# oc adm release extract --from registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-05-09-105430 --to manifest
Extracted release payload from digest sha256:97b2a9f92202962201c9c81a1a5f299391e945181ed7a157dd599145e24b7cc2 created at 2021-05-09T10:56:23Z
# cd manifest
# ls | grep cloud-controller-manager

No manifests for cloud-controller-manager in the nightly build.

# oc adm release extract --from registry.ci.openshift.org/ocp/release:4.8.0-0.ci-2021-05-09-205909 --to manifest_ci
Extracted release payload from digest sha256:471146cbe0ba5992f6b20306de6df5af263627a169a375cdd713bfa3750d6a43 created at 2021-05-09T21:00:08Z
# cd manifest_ci/
# ls | grep cloud-controller-manager
0000_26_cloud-controller-manager-operator_00_namespace.yaml
0000_26_cloud-controller-manager-operator_01_images.configmap.yaml
0000_26_cloud-controller-manager-operator_02_rbac_operator.yaml
0000_26_cloud-controller-manager-operator_03_rbac_provider.yaml
0000_26_cloud-controller-manager-operator_11_deployment.yaml
0000_26_cloud-controller-manager-operator_12_clusteroperator.yaml
0000_26_cloud-controller-manager-operator_13_credentialsrequest-openstack.yaml

Cloud-controller-manager is included in the CI build.
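For reference, a minimal way to see which run level each extracted manifest belongs to, assuming the manifest_ci directory created above (CVO manifest names follow a <group>_<runlevel>_<component>_... pattern, so the awk field split below is only illustrative):

$ ls manifest_ci/ | awk -F_ '{print $1 "_" $2}' | sort | uniq -c

This counts manifests per <group>_<runlevel> prefix; the cloud-controller-manager-operator manifests should all show up under 0000_26.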
So checking the CI test results. The most recent CI failure for the cloud-controller-manager alert was [1], from 4.7.10 to 4.8.0-0.ci-2021-05-11-010006.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&context=0&type=junit&groupBy=job&search=alert+ClusterOperatorDown+fired' | grep 'cloud-controller-manager'
alert ClusterOperatorDown fired for 870 seconds with labels: {endpoint="metrics", instance="10.0.252.69:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-65897987c9-rjg6w", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 630 seconds with labels: {endpoint="metrics", instance="10.0.137.199:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-5b648b684-8h466", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 600 seconds with labels: {endpoint="metrics", instance="10.0.190.224:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-599856699c-hbpqj", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 630 seconds with labels: {endpoint="metrics", instance="10.0.140.3:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-7546c95b48-mb2rg", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 720 seconds with labels: {endpoint="metrics", instance="10.0.188.213:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-68cc44877c-fqcbr", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 630 seconds with labels: {endpoint="metrics", instance="10.0.241.56:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-5cb4c949fd-z5kj8", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8421 seconds with labels: {endpoint="metrics", instance="10.0.157.146:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-c68db747c-rgzbl", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8416 seconds with labels: {endpoint="metrics", instance="10.0.173.231:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-85bcfbc9b6-tdk6b", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8413 seconds with labels: {endpoint="metrics", instance="10.0.133.67:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-5b55669cd8-xzfkg", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8422 seconds with labels: {endpoint="metrics", instance="10.0.199.5:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-984c994f6-487k7", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8439 seconds with labels: {endpoint="metrics", instance="10.0.178.200:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-86c9d868b4-z9rm4", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8428 seconds with labels: {endpoint="metrics", instance="10.0.150.66:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-866595c45b-6ck9x", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8399 seconds with labels: {endpoint="metrics", instance="10.0.177.212:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-dc86c7865-bpgns", service="cluster-version-operator", severity="critical"}
alert ClusterOperatorDown fired for 8421 seconds with labels: {endpoint="metrics", instance="10.0.156.232:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-78d48b7556-s9jm8", service="cluster-version-operator", severity="critical"}

So the issue does not seem to be fixed.
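The same CI-search query can also be pulled as JSON and reduced to the matching job URLs, mirroring the curl/jq form used later in this bug (a minimal sketch; the result object is keyed by job URL):

$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&search=alert+ClusterOperatorDown+fired+for.*name="cloud-controller-manager"' | jq -r 'keys[]'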
> [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608

Huh. Checking the creation time:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[].metadata | select(.name == "cloud-controller-manager").creationTimestamp'
2021-05-11T01:53:24Z

Comparing with the running CVO:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version" and (.reason == "Pulled" or .reason == "LeaderElection")) | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | sort
2021-05-11T01:19:59Z 1 LeaderElection: ip-10-0-11-203_4e0c01c5-6641-432b-b911-4bb0fa8e46e9 became leader
2021-05-11T01:27:16Z 1 Pulled: Successfully pulled image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:24f0bcf67474e06ceb1091fc63bddd6010e1d13f5fe5604962a4579ee98b8e22" in 17.663489406s
2021-05-11T01:34:12Z 1 LeaderElection: ip-10-0-151-38_3059a40c-947d-42ee-9ab9-edfa9d6a1161 became leader
2021-05-11T01:53:21Z 1 Pulled: Successfully pulled image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:25679909cf78ea99a81bbd278606aed16979dff5528a5a86ca1c8a3d839507e2" in 10.20786225s
2021-05-11T01:53:37Z 1 Pulled: Successfully pulled image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:25679909cf78ea99a81bbd278606aed16979dff5528a5a86ca1c8a3d839507e2" in 9.138006891s
2021-05-11T01:53:38Z 1 LeaderElection: ip-10-0-252-69_e86c056a-4e60-4baa-898c-ff7b58f9b85b became leader
2021-05-11T02:48:20Z 1 Pulled: Container image "registry.build02.ci.openshift.org/ci-op-rkxjqyvj/release@sha256:25679909cf78ea99a81bbd278606aed16979dff5528a5a86ca1c8a3d839507e2" already present on machine
...

So the ClusterOperator seems to have been created by the outgoing 4.7 CVO (which doesn't have the patch for this bug). I expect a 4.7 -> 4.8 update where you delete the new ClusterOperator a minute or two after it gets created would delay the recreation until we got to the manifest block for the new second-level operator. Or we can backport to 4.7 and test that way.

Ah, or we can get more details out of hosted Loki [1], with:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608"} | unpack | namespace="openshift-cluster-version" |= "Precreated resource clusteroperator \"cloud-controller-manager\""

showing:

1. 2021-05-11T01:53:24.860Z: cluster-version-operator-7664cbf99b-mxfd4
2. 2021-05-11T02:17:48.413Z: cluster-version-operator-65897987c9-rjg6w
...

Getting the relevant CVO versions:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608"} | unpack | pod_name="cluster-version-operator-7664cbf99b-mxfd4" |= "Graceful shutdown complete"

gives:

2021-05-10 18:53:25 I0511 01:53:25.453886 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator 4.7.0-202104250659.p0-1db8867.

and:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1391921283058372608"} | unpack | pod_name="cluster-version-operator-65897987c9-rjg6w" |= "Graceful shutdown complete"

gives:

2021-05-10 19:48:20 I0511 02:48:20.331180 1 start.go:26] Graceful shutdown complete for ClusterVersionOperator v1.0.0-511-g86db02a6-dirty.

So yup, makes sense. I'll open a backport PR for 4.7...

[1]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1391921283058372608%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-cluster-version%5C%22%20%7C%3D%20%5C%22Precreated%20resource%20clusteroperator%20%5C%5C%5C%22cloud-controller-manager%5C%5C%5C%22%5C%22%22%7D%5D
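The same kind of comparison can be made on a live cluster instead of the job artifacts, assuming oc access (a minimal sketch; the grep filter is illustrative):

$ oc get clusteroperator cloud-controller-manager -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
$ oc -n openshift-cluster-version get events --sort-by=.lastTimestamp | grep -E 'Pulled|LeaderElection'

If the ClusterOperator's creationTimestamp predates the CVO pod that pulled the target release image, it was precreated by the outgoing CVO rather than by the incoming one.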
Verifying with a 4.7.10 to 4.8.0-0.ci-2021-05-12-045439 upgrade.

Steps to verify:

1. Install a cluster with 4.7.10.

$ oc get co

shows no cloud-controller-manager ClusterOperator.

2. Upgrade to 4.8.0-0.ci-2021-05-12-045439.

# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:86eaf8b920e7ad906367c90dd41bebcd060b95c3bad636baf81ea8ffdd3def43 --allow-explicit-upgrade --force
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.10    True        True          4m59s   Working towards 4.8.0-0.ci-2021-05-12-045439: 69 of 695 done (9% complete)

Watching the cloud-controller-manager:

$ oc get co -w | grep cloud-controller-manager
cloud-controller-manager
cloud-controller-manager
cloud-controller-manager

So cloud-controller-manager is pre-created without condition status by the 4.7 CVO.

Removing the cloud-controller-manager ClusterOperator:

$ oc delete co/cloud-controller-manager
clusteroperator.config.openshift.io "cloud-controller-manager" deleted

At this point, the CVO has already rolled out to 4.8.

Watching the cloud-controller-manager:

$ oc get co -w | grep cloud-controller-manager
cloud-controller-manager
cloud-controller-manager
cloud-controller-manager   4.8.0-0.ci-2021-05-12-045439   True   False   False   0s

The 4.8 CVO creates cloud-controller-manager and the condition status becomes visible soon after.

Eventually, the upgrade passes and no ClusterOperatorDegraded and/or ClusterOperatorDown alert for cloud-controller-manager is firing.

# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.ci-2021-05-12-045439   True        False         14m     Cluster version is 4.8.0-0.ci-2021-05-12-045439

Moving it to verified state.
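A compact way to confirm that the recreated ClusterOperator picked up status conditions quickly, assuming cluster access (a sketch; the jq projection is illustrative):

$ oc get clusteroperator cloud-controller-manager -o json | jq '{created: .metadata.creationTimestamp, conditions: ((.status.conditions // []) | length)}'

A non-zero conditions count within a few minutes of creation means the operator reported status well inside the ten-minute window after which ClusterOperatorDown would fire.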
@yanyang, @wking,

Was this bz supposed to address the upgrade test case failure we see in CI? That's what I thought by reading it, but the test case is still failing.

Here is a recent upgrade job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1396907474669277184

Snippet of the failure:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h15m55s
May 24 21:10:10.643: Unexpected alerts fired or pending during the upgrade:

alert AggregatedAPIDown fired for 240 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
alert ClusterOperatorDown fired for 750 seconds with labels: {endpoint="metrics", instance="10.0.139.245:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-f4bffc556-lnzkq", service="cluster-version-operator", severity="critical"}

Moving this back to ASSIGNED to get some attention. If that is not right, I can open a new BZ, or maybe there is another BZ already open that I didn't find yet.

(In reply to jamo luhrsen from comment #9)
> @yanyang, @wking,
>
> Was this bz supposed to address the upgrade test case failure we see in CI?
> That's what I thought by reading it, but the test case is still failing.
>
> Here is a recent upgrade job:
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1396907474669277184
>
> snippet of the failure:
> disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h15m55s
> May 24 21:10:10.643: Unexpected alerts fired or pending during the upgrade:
>
> alert AggregatedAPIDown fired for 240 seconds with labels: {name="v1beta1.metrics.k8s.io", namespace="default", severity="warning"}
> alert ClusterOperatorDown fired for 750 seconds with labels: {endpoint="metrics", instance="10.0.139.245:9099", job="cluster-version-operator", name="cloud-controller-manager", namespace="openshift-cluster-version", pod="cluster-version-operator-f4bffc556-lnzkq", service="cluster-version-operator", severity="critical"}
>
> Moving this back to ASSIGNED to get some attention. If that is not right, I can open a new BZ or maybe there is another BZ already open that I didn't find yet.

Wait, the fix that went into 4.7 only happened 3 days ago and there has been no new nightly promoted since then. Maybe that is why we are still failing this test case. I'll monitor and put this back to VERIFIED once I see the test case passing.

jamo, the 4.7 stable build has not included the fix yet, so it's expected to hit the issue if you upgrade from a 4.7 stable cluster. 4.7 CI and nightly builds have the fix. From the QE testing results, the failure is gone if you upgrade from a latest 4.7 CI or nightly cluster. Please check out bug 1959238 for the 4.7 bug status.

Bug 1959238 went out with 4.7.13, so hopefully 4.7 -> 4.8 will stop seeing this now. Moving back to MODIFIED.
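Since the fix only helps when the outgoing CVO already carries it, a quick pre-upgrade check of the installed 4.7.z version can save a false alarm (a minimal sketch; per the comments above, 4.7.13 or later includes the backport):

$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'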
As per comment 11 (https://bugzilla.redhat.com/show_bug.cgi?id=1957775#c11), this bug should be in the Verified state, as we have already verified it with 4.7 CI and nightly builds. We do not wait until the bug fix is available in stable channels. Bugs get attached to erratas and are closed when an errata is released, i.e. when the build becomes available in the fast channels; a couple of days later the versions become available in the stable channel.

CI-search is promisingly empty:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=alert+ClusterOperatorDown+fired+for.*name="cloud-controller-manager"' | grep 'failures match' | grep -v 'pull-ci-\|rehearse' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Checking that one:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&type=junit&search=alert+ClusterOperatorDown+fired+for.*name="cloud-controller-manager"' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade/1399462453770719232

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade/1399462453770719232/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2021-05-31T22:07:58Z 2021-05-31T23:22:03Z Completed 4.8.0-0.ci-2021-05-31-172317
2021-05-31T21:04:46Z 2021-05-31T22:07:43Z Completed 4.7.12
2021-05-31T20:37:41Z 2021-05-31T21:00:42Z Completed 4.6.30

And 4.7.12 is from before bug 1959238 landed, so this is looking good.

Marking it verified as per comment 13 and comment 11.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438