Testing with 4.7.0-0.nightly-2021-03-26-090502:

# oc adm release info registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-03-26-090502 --commits | grep cluster-version
  cluster-version-operator   https://github.com/openshift/cluster-version-operator   7df967b0c2ed0763d6d0fe7f9c8623e9f970a760

$ git --no-pager log --first-parent --oneline -3 origin/release-4.7
7df967b (origin/release-4.7) Merge pull request #534 from openshift-cherrypick-robot/cherry-pick-531-to-release-4.7
397e8ba Merge pull request #516 from openshift-cherrypick-robot/cherry-pick-509-to-release-4.7
cf0e3d1 Merge pull request #521 from wking/post-main-when-main-never-launched

The baremetal operator is still firing ClusterOperatorDown during the 4.6.23 to 4.7.0-0.nightly-2021-03-26-090502 upgrade:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.23    True        True          10m     Working towards 4.7.0-0.nightly-2021-03-26-090502: 96 of 668 done (14% complete)

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0326-6.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "ClusterOperatorDown")'
{
  "labels": {
    "alertname": "ClusterOperatorDown",
    "endpoint": "metrics",
    "instance": "10.0.0.3:9099",
    "job": "cluster-version-operator",
    "name": "baremetal",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-dc56ccb94-rf6md",
    "service": "cluster-version-operator",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster operator baremetal has not been available for 10 minutes. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
  },
  "state": "firing",
  "activeAt": "2021-03-26T11:19:29.21303266Z",
  "value": "0e+00"
}

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.23                              True        False         False      73m
baremetal
cloud-credential                           4.6.23                              True        False         False      120m
cluster-autoscaler                         4.6.23                              True        False         False      113m
config-operator                            4.6.23                              True        False         False      114m
console                                    4.6.23                              True        False         False      77m
csi-snapshot-controller                    4.6.23                              True        False         False      81m
dns                                        4.6.23                              True        False         False      113m
etcd                                       4.7.0-0.nightly-2021-03-26-090502   True        False         False      113m
image-registry                             4.6.23                              True        False         False      104m
ingress                                    4.6.23                              True        False         False      104m
insights                                   4.6.23                              True        False         False      114m
kube-apiserver                             4.7.0-0.nightly-2021-03-26-090502   True        False         False      111m
kube-controller-manager                    4.7.0-0.nightly-2021-03-26-090502   True        False         False      111m
kube-scheduler                             4.6.23                              True        True          False      111m
kube-storage-version-migrator              4.6.23                              True        False         False      82m
machine-api                                4.6.23                              True        False         False      104m
machine-approver                           4.6.23                              True        False         False      113m
machine-config                             4.6.23                              True        False         False      112m
marketplace                                4.6.23                              True        False         False      76m
monitoring                                 4.6.23                              True        False         False      75m
network                                    4.6.23                              True        False         False      115m
node-tuning                                4.6.23                              True        False         False      114m
openshift-apiserver                        4.6.23                              True        False         False      5m19s
openshift-controller-manager               4.6.23                              True        False         False      103m
openshift-samples                          4.6.23                              True        False         False      104m
operator-lifecycle-manager                 4.6.23                              True        False         False      113m
operator-lifecycle-manager-catalog         4.6.23                              True        False         False      113m
operator-lifecycle-manager-packageserver   4.6.23                              True        False         False      77m
service-ca                                 4.6.23                              True        False         False      114m
storage                                    4.6.23                              True        False         False      114m
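For reference (a sketch, not from the original test run): if I'm reading the CVO's alerting rules right, ClusterOperatorDown in 4.6/4.7 is based on the CVO-exported cluster_operator_up metric staying at 0 for 10 minutes, though you should check your cluster's copy of the rule. While the alert is active, the ClusterOperator conditions and the raw metric can be inspected like this; <prometheus-route> is a placeholder for the in-cluster Prometheus route used above:

# Sketch only: assumes a logged-in oc session and a $token with
# access to the openshift-monitoring Prometheus, as in the curl above.
$ oc get clusteroperator baremetal -o json | jq '.status.conditions'
$ curl -s -k -H "Authorization: Bearer $token" -G \
    --data-urlencode 'query=cluster_operator_up{name="baremetal"}' \
    https://<prometheus-route>/api/v1/query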
Huh. [1] is the nightly page, and shows two update jobs from 4.6.23 [2,3]. Digging into [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2021-03-26T09:51:42Z 2021-03-26T11:03:27Z Completed 4.7.0-0.nightly-2021-03-26-090502
2021-03-26T09:19:57Z 2021-03-26T09:48:42Z Completed 4.6.23

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal") | {metadata: (.metadata | {creationTimestamp}), status: (.status | {conditions, versions})}'
{
  "metadata": {
    "creationTimestamp": "2021-03-26T09:51:51Z"
  },
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "status": "False",
        "type": "Progressing"
      },
      {
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "status": "False",
        "type": "Degraded"
      },
      {
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "message": "Operational",
        "reason": "AsExpected",
        "status": "True",
        "type": "Available"
      },
      {
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "status": "True",
        "type": "Upgradeable"
      },
      {
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "message": "Nothing to do on this Platform",
        "reason": "UnsupportedPlatform",
        "status": "True",
        "type": "Disabled"
      }
    ],
    "versions": [
      {
        "name": "operator",
        "version": "4.7.0-0.nightly-2021-03-26-090502"
      }
    ]
  }
}

So:

* 09:51:42, update begins.
* 09:51:51, ClusterOperator created.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep 'T09:5[123]' | sort
2021-03-26T09:51:35Z 1 RetrievePayload: retrieving payload version="" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
...
2021-03-26T09:51:50Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:51Z 1 Killing: Stopping container cluster-version-operator
2021-03-26T09:51:51Z 1 PayloadLoaded: payload loaded version="4.7.0-0.nightly-2021-03-26-090502" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:51Z 1 PreconditionsPassed: preconditions passed for payload loaded version="4.7.0-0.nightly-2021-03-26-090502" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:51Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-7c66c99f49 to 0
2021-03-26T09:51:51Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-7c66c99f49-qsnmz
2021-03-26T09:51:58Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-644b596c6c to 1
2021-03-26T09:51:58Z 1 SuccessfulCreate: Created pod: cluster-version-operator-644b596c6c-v82h2
2021-03-26T09:51:59Z 1 Pulling: Pulling image "registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:52:08Z 1 Created: Created container cluster-version-operator
2021-03-26T09:52:08Z 1 LeaderElection: ip-10-0-128-175_4815132a-051b-45cf-a45f-2be69ba0ce1d became leader
...

So my guess is that the outgoing 4.6.23 CVO created the baremetal ClusterOperator. Can we verify this change the same way we verified 4.8, and take the change back to 4.6? I've opened [4] and bug 1943973 with that backport. (For anyone re-running the timestamp comparison, there is a small jq sketch after the links below.)

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2021-03-26-090502
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade/1375374149228171264
[4]: https://github.com/openshift/cluster-version-operator/pull/540
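As a hypothetical helper, not something run in the original investigation: the creation-to-Available gap that ClusterOperatorDown cares about can be pulled out of the same artifact in one step. The jq field paths match the dump above; the URL is a placeholder for the run in question.

$ curl -s <clusteroperators.json-url> | jq -r '
    # placeholder URL; field paths are the ones shown in the dump above
    .items[] | select(.metadata.name == "baremetal")
    | .metadata.creationTimestamp + " created; Available since "
      + (.status.conditions[] | select(.type == "Available") | .lastTransitionTime)'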
It turns out the 4.6.z PR is the one that will have an impact, so no need for a note on this one.
Installed a GCP cluster with 4.7.0-0.nightly-2021-03-27-082615:

[root@preserve-yangyangmerrn-1 resources]# cat clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'
2021-03-29T02:21:30Z

[root@preserve-yangyangmerrn-1 resources]# cat events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-03-29T02:14:23Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-5694c75b94 to 1
2021-03-29T02:14:23Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-5694c75b94-ljrc8
2021-03-29T02:21:28Z 1 Created: Created container cluster-baremetal-operator
2021-03-29T02:21:28Z 1 Started: Started container cluster-baremetal-operator
2021-03-29T02:48:14Z 1 Killing: Stopping container cluster-baremetal-operator
2021-03-29T02:48:14Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-5694c75b94-92dnq
2021-03-29T02:48:48Z 1 Created: Created container cluster-baremetal-operator
2021-03-29T02:48:49Z 1 Started: Started container cluster-baremetal-operator
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-5694c75b94-92dnq to yangyang0329-dsv7k-master-2.c.openshift-qe.internal
null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-5694c75b94-ljrc8 to yangyang0329-dsv7k-master-1.c.openshift-qe.internal

Looks like the baremetal ClusterOperator was created at almost the same time the operator container started. But the operator pod was scheduled twice, and I'm not sure whether the alert fired while the second pod was coming up. Trevor, could you please help confirm whether ^^ is enough to verify the bz? Thanks.
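A hedged suggestion, not something the comment above actually ran: whether the alert fired during either pod start can be checked after the fact from the in-cluster Prometheus, since the ALERTS series records pending/firing samples with the alert's labels (including the name="baremetal" label seen earlier). Route and token below are placeholders.

# Sketch: <prometheus-route> is the openshift-monitoring Prometheus route;
# the [2h] range should cover both cluster-baremetal-operator pod starts.
$ curl -s -k -H "Authorization: Bearer $token" -G \
    --data-urlencode 'query=ALERTS{alertname="ClusterOperatorDown", name="baremetal"}[2h]' \
    https://<prometheus-route>/api/v1/query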
Looking at a 4.7.0-0.nightly-2021-03-27-082615 CI run [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1375726407786696704/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name + "\t" + ([.managedFields[].manager] | join("\t"))' | sort
2021-03-27T08:38:24Z authentication	cluster-version-operator	authentication-operator
...
2021-03-27T08:38:24Z storage	cluster-version-operator	cluster-storage-operator
2021-03-27T08:38:25Z operator-lifecycle-manager-packageserver	cluster-version-operator	olm
2021-03-27T08:45:18Z baremetal	cluster-baremetal-operator

So yeah, that's "CVO precreates all the ClusterOperators except baremetal, and then when the baremetal operator eventually comes around and starts running, it creates its ClusterOperator directly", which is what we want. I expect the alert to only be firing before the operator comes up, which you can confirm by looking at the ClusterOperator conditions while the operator is down (they should stay Available=True and Degraded=False). I think that's enough to mark this VERIFIED and unblock the 4.6.z PR, unless folks have additional concerns?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1375726407786696704
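A live-cluster variant of the same check, as a sketch: the jq program is identical to the CI-artifact query above, just fed from oc instead of the gathered JSON. Assumes a logged-in oc session; the managedFields manager shows who created each ClusterOperator.

# Same jq as above, against a live cluster. Note: newer oc/kubectl strip
# managedFields from get output by default and need --show-managed-fields.
$ oc get clusteroperators -o json | jq -r '.items[].metadata
    | .creationTimestamp + " " + .name + "\t" + ([.managedFields[].manager] | join("\t"))' | sort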
Thanks, Trevor. Based on comment#6 and comment#7, moving this to the VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1005