Bug 1941217

Summary: Bare-metal operator is firing for ClusterOperatorDown for 15m during 4.6 to 4.7 upgrade
Product: OpenShift Container Platform Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Cluster Version OperatorAssignee: W. Trevor King <wking>
Status: CLOSED ERRATA QA Contact: Yang Yang <yanyang>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.7CC: aos-bugs, dhellmann, hongkliu, jokerman, lmohanty, travi, wking, yanyang
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 1943973 (view as bug list) Environment:
Last Closed: 2021-04-05 13:56:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1929917    
Bug Blocks: 1943973    

Comment 2 Yang Yang 2021-03-26 12:23:19 UTC
Testing with 4.7.0-0.nightly-2021-03-26-090502

# oc adm release info registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-03-26-090502 --commits | grep cluster-version
  cluster-version-operator                       https://github.com/openshift/cluster-version-operator                       7df967b0c2ed0763d6d0fe7f9c8623e9f970a760

$ git --no-pager log --first-parent --oneline -3 origin/release-4.7              [20:15:44]
7df967b (origin/release-4.7) Merge pull request #534 from openshift-cherrypick-robot/cherry-pick-531-to-release-4.7
397e8ba Merge pull request #516 from openshift-cherrypick-robot/cherry-pick-509-to-release-4.7
cf0e3d1 Merge pull request #521 from wking/post-main-when-main-never-launched

Baremetal operator is still firing for ClusterOperatorDown during 4.6.23 to 4.7.0-0.nightly-2021-03-26-090502 upgrade

# oc get clusterversion
version   4.6.23    True        True          10m     Working towards 4.7.0-0.nightly-2021-03-26-090502: 96 of 668 done (14% complete)

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.yangyang0326-6.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown")'
  "labels": {
    "alertname": "ClusterOperatorDown",
    "endpoint": "metrics",
    "instance": "",
    "job": "cluster-version-operator",
    "name": "baremetal",
    "namespace": "openshift-cluster-version",
    "pod": "cluster-version-operator-dc56ccb94-rf6md",
    "service": "cluster-version-operator",
    "severity": "critical"
  "annotations": {
    "message": "Cluster operator baremetal has not been available for 10 minutes. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
  "state": "firing",
  "activeAt": "2021-03-26T11:19:29.21303266Z",
  "value": "0e+00"

# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.23                              True        False         False      73m
cloud-credential                           4.6.23                              True        False         False      120m
cluster-autoscaler                         4.6.23                              True        False         False      113m
config-operator                            4.6.23                              True        False         False      114m
console                                    4.6.23                              True        False         False      77m
csi-snapshot-controller                    4.6.23                              True        False         False      81m
dns                                        4.6.23                              True        False         False      113m
etcd                                       4.7.0-0.nightly-2021-03-26-090502   True        False         False      113m
image-registry                             4.6.23                              True        False         False      104m
ingress                                    4.6.23                              True        False         False      104m
insights                                   4.6.23                              True        False         False      114m
kube-apiserver                             4.7.0-0.nightly-2021-03-26-090502   True        False         False      111m
kube-controller-manager                    4.7.0-0.nightly-2021-03-26-090502   True        False         False      111m
kube-scheduler                             4.6.23                              True        True          False      111m
kube-storage-version-migrator              4.6.23                              True        False         False      82m
machine-api                                4.6.23                              True        False         False      104m
machine-approver                           4.6.23                              True        False         False      113m
machine-config                             4.6.23                              True        False         False      112m
marketplace                                4.6.23                              True        False         False      76m
monitoring                                 4.6.23                              True        False         False      75m
network                                    4.6.23                              True        False         False      115m
node-tuning                                4.6.23                              True        False         False      114m
openshift-apiserver                        4.6.23                              True        False         False      5m19s
openshift-controller-manager               4.6.23                              True        False         False      103m
openshift-samples                          4.6.23                              True        False         False      104m
operator-lifecycle-manager                 4.6.23                              True        False         False      113m
operator-lifecycle-manager-catalog         4.6.23                              True        False         False      113m
operator-lifecycle-manager-packageserver   4.6.23                              True        False         False      77m
service-ca                                 4.6.23                              True        False         False      114m
storage                                    4.6.23                              True        False         False      114m

Comment 3 W. Trevor King 2021-03-28 22:09:56 UTC
Huh.  [1] is the nightly page, and shows two update jobs from 4.6.23 [2,3].  Digging into [2]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2021-03-26T09:51:42Z 2021-03-26T11:03:27Z Completed 4.7.0-0.nightly-2021-03-26-090502
2021-03-26T09:19:57Z 2021-03-26T09:48:42Z Completed 4.6.23
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal") | {metadata: (.metadata | {creationTimestamp}), status: (.status | {conditions, versions})}'
  "metadata": {
    "creationTimestamp": "2021-03-26T09:51:51Z"
  "status": {
    "conditions": [
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "status": "False",
        "type": "Progressing"
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "status": "False",
        "type": "Degraded"
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "message": "Operational",
        "reason": "AsExpected",
        "status": "True",
        "type": "Available"
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "status": "True",
        "type": "Upgradeable"
        "lastTransitionTime": "2021-03-26T10:22:07Z",
        "message": "Nothing to do on this Platform",
        "reason": "UnsupportedPlatform",
        "status": "True",
        "type": "Disabled"
    "versions": [
        "name": "operator",
        "version": "4.7.0-0.nightly-2021-03-26-090502"


* 09:51:42, update begins.
* 09:51:51, ClusterOperator created.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep 'T09:5[123]' | sort
2021-03-26T09:51:35Z 1 RetrievePayload: retrieving payload version="" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:50Z 1 VerifyPayload: verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:51Z 1 Killing: Stopping container cluster-version-operator
2021-03-26T09:51:51Z 1 PayloadLoaded: payload loaded version="4.7.0-0.nightly-2021-03-26-090502" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:51Z 1 PreconditionsPassed: preconditions passed for payload loaded version="4.7.0-0.nightly-2021-03-26-090502" image="registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:51:51Z 1 ScalingReplicaSet: Scaled down replica set cluster-version-operator-7c66c99f49 to 0
2021-03-26T09:51:51Z 1 SuccessfulDelete: Deleted pod: cluster-version-operator-7c66c99f49-qsnmz
2021-03-26T09:51:58Z 1 ScalingReplicaSet: Scaled up replica set cluster-version-operator-644b596c6c to 1
2021-03-26T09:51:58Z 1 SuccessfulCreate: Created pod: cluster-version-operator-644b596c6c-v82h2
2021-03-26T09:51:59Z 1 Pulling: Pulling image "registry.build02.ci.openshift.org/ci-op-y7n7hhc0/release@sha256:610af040d6759dc08f48c2afb32697f263a2baa35f0caac369db73c905d9d9b7"
2021-03-26T09:52:08Z 1 Created: Created container cluster-version-operator
2021-03-26T09:52:08Z 1 LeaderElection: ip-10-0-128-175_4815132a-051b-45cf-a45f-2be69ba0ce1d became leader

So my guess is that the outgoing 4.6.23 CVO created the baremetal ClusterOperator.  Can we verify this change the same way we verified 4.8, and take the change back to 4.6?  I've opened [4] and bug 1943973 with that backport.

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.7.0-0.nightly/release/4.7.0-0.nightly-2021-03-26-090502
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1375374153195982848
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-metal-ipi-upgrade/1375374149228171264
[4]: https://github.com/openshift/cluster-version-operator/pull/540

Comment 4 W. Trevor King 2021-03-29 03:24:54 UTC
Turns out it will be the 4.6.z PR that has an impact, so no need for a note on this one.

Comment 6 Yang Yang 2021-03-29 06:21:54 UTC
Installed a gcp cluster with 4.7.0-0.nightly-2021-03-27-082615

[root@preserve-yangyangmerrn-1 resources]# cat clusteroperators.json | jq -r '.items[].metadata | select(.name == "baremetal").creationTimestamp'

[root@preserve-yangyangmerrn-1 resources]# cat events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-machine-api") | .firstTimestamp + " " + (.count | tostring) + " " + .reason + ": " + .message' | grep metal | sort
2021-03-29T02:14:23Z 1 ScalingReplicaSet: Scaled up replica set cluster-baremetal-operator-5694c75b94 to 1
2021-03-29T02:14:23Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-5694c75b94-ljrc8
2021-03-29T02:21:28Z 1 Created: Created container cluster-baremetal-operator
2021-03-29T02:21:28Z 1 Started: Started container cluster-baremetal-operator
2021-03-29T02:48:14Z 1 Killing: Stopping container cluster-baremetal-operator
2021-03-29T02:48:14Z 1 SuccessfulCreate: Created pod: cluster-baremetal-operator-5694c75b94-92dnq
2021-03-29T02:48:48Z 1 Created: Created container cluster-baremetal-operator
2021-03-29T02:48:49Z 1 Started: Started container cluster-baremetal-operator
 null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-5694c75b94-92dnq to yangyang0329-dsv7k-master-2.c.openshift-qe.internal
 null Scheduled: Successfully assigned openshift-machine-api/cluster-baremetal-operator-5694c75b94-ljrc8 to yangyang0329-dsv7k-master-1.c.openshift-qe.internal

Looks like the baremetal clusteroperator was created and started at almost the same time. But it was scheduled twice, not sure if the alert was fired during the 2nd time that the clusteroperator was started.

Trevor, could you please help confirm if ^^ can verify the bz? 


Comment 7 W. Trevor King 2021-03-29 21:51:54 UTC
Looking at a 4.7.0-0.nightly-2021-03-27-082615 CI run [1]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1375726407786696704/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name + "\t" + ([.managedFields[].manager] | join("\t"))' | sort
  2021-03-27T08:38:24Z authentication     cluster-version-operator        authentication-operator
  2021-03-27T08:38:24Z storage    cluster-version-operator        cluster-storage-operator
  2021-03-27T08:38:25Z operator-lifecycle-manager-packageserver   cluster-version-operator        olm
  2021-03-27T08:45:18Z baremetal  cluster-baremetal-operator

So yeah, that's "CVO precreates all the ClusterOperators except baremetal, and then when the baremetal operator eventually comes around and starts running, it creates its ClusterOperator directly", which is what we want.  I expect the alert to only be firing before the operator comes up, which you can confirm by looking at the ClusterOperator conditions while the operator is down (they should stay Available=True and Degraded=False).  I think that's enough to mark this VERIFIED and unblock the 4.6.z PR, unless folks have additional concerns?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.7/1375726407786696704

Comment 8 Yang Yang 2021-03-30 01:36:41 UTC
Thanks Trevor. Based on comment#6 and comment#7, move it to verified state.

Comment 10 errata-xmlrpc 2021-04-05 13:56:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.