Bug 1941217
Summary: | Bare-metal operator is firing for ClusterOperatorDown for 15m during 4.6 to 4.7 upgrade | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | OpenShift BugZilla Robot <openshift-bugzilla-robot> | |
Component: | Cluster Version Operator | Assignee: | W. Trevor King <wking> | |
Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> | |
Severity: | high | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 4.7 | CC: | aos-bugs, dhellmann, hongkliu, jokerman, lmohanty, travi, wking, yanyang | |
Target Milestone: | --- | |||
Target Release: | 4.7.z | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1943973 (view as bug list) | Environment: | ||
Last Closed: | 2021-04-05 13:56:19 UTC | Type: | --- | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1929917 | |||
Bug Blocks: | 1943973 |
Comment 2
Yang Yang
2021-03-26 12:23:19 UTC
Huh. [1] is the nightly page, and shows two update jobs from 4.6.23 [2,3]. Digging into [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version' 2021-03-26T09:51:42Z 2021-03-26T11:03:27Z Completed 4.7.0-0.nightly-2021-03-26-090502 2021-03-26T09:19:57Z 2021-03-26T09:48:42Z Completed 4.6.23 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal") | {metadata: (.metadata | {creationTimestamp}), status: (.status | {conditions, versions})}' { "metadata": { "creationTimestamp": "2021-03-26T09:51:51Z" }, "status": { "conditions": [ { "lastTransitionTime": "2021-03-26T10:22:07Z", "status": "False", "type": "Progressing" }, { "lastTransitionTime": "2021-03-26T10:22:07Z", "status": "False", "type": "Degraded" }, { "lastTransitionTime": "2021-03-26T10:22:07Z", "message": "Operational", "reason": "AsExpected", "status": "True", "type": "Available" }, { "lastTransitionTime": "2021-03-26T10:22:07Z", "status": "True", "type": "Upgradeable" }, { "lastTransitionTime": "2021-03-26T10:22:07Z", "message": "Nothing to do on this Platform", "reason": "UnsupportedPlatform", "status": "True", "type": "Disabled" } ], "versions": [ { "name": "operator", "version": "4.7.0-0.nightly-2021-03-26-090502" } ] } } So: * 09:51:42, update begins. * 09:51:51, ClusterOperator created. $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep 'T09:5[123]' | sort 2021-03-26T09:51:35Z 1 RetrievePayload: retrieving payload version="" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7" ... 2021-03-26T09:51:50Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7" 2021-03-26T09:51:51Z 1 Killing: Stopping container cluster-version-operator 2021-03-26T09:51:51Z 1 PayloadLoaded: payload loaded version="4.7.0-0.nightly-2021-03-26-090502" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7" 2021-03-26T09:51:51Z 1 PreconditionsPassed: preconditions passed for payload loaded version="4.7.0-0.nightly-2021-03-26-090502" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7" 2021-03-26T09:51:51Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-7c66c99f49 to 0 2021-03-26T09:51:51Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-7c66c99f49-qsnmz 2021-03-26T09:51:58Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-644b596c6c to 1 2021-03-26T09:51:58Z 1 SuccessfulCreate: Created pod: cluster-version-operator-644b596c6c-v82h2 2021-03-26T09:51:59Z 1 Pulling: Pulling image "registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7" 2021-03-26T09:52:08Z 1 Created: Created container cluster-version-operator 2021-03-26T09:52:08Z 1 LeaderElection: ip-10-0-128-175_4815132a-051b-45cf-a45f-2be69ba0ce1d became leader ... So my guess is that the outgoing 4.6.23 CVO created the baremetal ClusterOperator. Can we verify this change the same way we verified 4.8, and take the change back to 4.6? I've opened [4] and bug 1943973 with that backport. [1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2021-03-26-090502 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848 [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade/1375374149228171264 [4]: https://github.com/openshift/cluster-version-operator/pull/540 Turns out it will be the 4.6.z PR that has an impact, so no need for a note on this one. Installed a gcp cluster with 4.7.0-0.nightly-2021-03-27-082615 [root@preserve-yangyangmerrn-1 resources]# cat clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp' 2021-03-29T02:21:30Z [root@preserve-yangyangmerrn-1 resources]# cat events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort 2021-03-29T02:14:23Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-5694c75b94 to 1 2021-03-29T02:14:23Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-5694c75b94-ljrc8 2021-03-29T02:21:28Z 1 Created: Created container cluster-baremetal-operator 2021-03-29T02:21:28Z 1 Started: Started container cluster-baremetal-operator 2021-03-29T02:48:14Z 1 Killing: Stopping container cluster-baremetal-operator 2021-03-29T02:48:14Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-5694c75b94-92dnq 2021-03-29T02:48:48Z 1 Created: Created container cluster-baremetal-operator 2021-03-29T02:48:49Z 1 Started: Started container cluster-baremetal-operator null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-5694c75b94-92dnq to yangyang0329-dsv7k-master-2.c.openshift-qe.internal null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-5694c75b94-ljrc8 to yangyang0329-dsv7k-master-1.c.openshift-qe.internal Looks like the baremetal clusteroperator was created and started at almost the same time. But it was scheduled twice, not sure if the alert was fired during the 2nd time that the clusteroperator was started. Trevor, could you please help confirm if ^^ can verify the bz? Thanks. Looking at a 4.7.0-0.nightly-2021-03-27-082615 CI run [1]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1375726407786696704/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name + "\t" + ([.managedFields[].manager] | join("\t"))' | sort 2021-03-27T08:38:24Z authentication cluster-version-operator authentication-operator ... 2021-03-27T08:38:24Z storage cluster-version-operator cluster-storage-operator 2021-03-27T08:38:25Z operator-lifecycle-manager-packageserver cluster-version-operator olm 2021-03-27T08:45:18Z baremetal cluster-baremetal-operator So yeah, that's "CVO precreates all the ClusterOperators except baremetal, and then when the baremetal operator eventually comes around and starts running, it creates its ClusterOperator directly", which is what we want. I expect the alert to only be firing before the operator comes up, which you can confirm by looking at the ClusterOperator conditions while the operator is down (they should stay Available=True and Degraded=False). I think that's enough to mark this VERIFIED and unblock the 4.6.z PR, unless folks have additional concerns? [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1375726407786696704 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1005 |