Bug 1905221

Summary:

CVO transitions from "Initializing" to "Updating" despite not attempting many manifests

Product:

OpenShift Container Platform

Reporter:

Eran Cohen <ercohen>

Component:

Cluster Version Operator

Assignee:

Jack Ottofaro <jack.ottofaro>

Status:

CLOSED ERRATA

QA Contact:

Johnny Liu <jialiu>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.6.z

CC:

aos-bugs, itsoiref, jack.ottofaro, jokerman, mstaeble, ravbrown, sdodson, vrutkovs, wking, yanyang

Target Milestone:

---

Keywords:

Reopened

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-02-24 15:40:32 UTC

Type:

Bug

Bug Blocks:

1921180

Attachments:

Description	Flags
installer-gather	none
transition portion of the CVO logs from comment 0's log bundle, for convenience	none

Description Eran Cohen 2020-12-07 18:37:24 UTC

Created attachment 1737382 [details]
installer-gather

Version:

$ openshift-install version
[root@hpe-c21124gp3-01 installer]# ./bin/openshift-install version
./bin/openshift-install unreleased-master-3767-gdc5b4bc03edcd9c7289cfe80ee946aeeddb3bc7e-dirty
built from commit dc5b4bc03edcd9c7289cfe80ee946aeeddb3bc7e
release image registry.svc.ci.openshift.org/origin/release:4.5

Platform:

libvirt.

Please specify:
* UPI (semi-manual installation on customized infrastructure)

What happened?
Sometimes (about 30% of runs) bootkube hangs while waiting for cluster-bootstrap to complete. cluster-bootstrap fails to apply these manifests (it never converges):
Dec 07 18:08:45 master1 bootkube.sh[2243]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in >
Dec 07 18:08:45 master1 bootkube.sh[2243]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in >
  
According to this commit, the missing CRD should be created by the CVO:
https://github.com/openshift/machine-config-operator/commit/1655cc4ec6ce03a6a994f3389b03bf573f5601f9
and I do see it in the log in the 70% of runs that succeed:
I1207 18:15:10.423230       1 sync_worker.go:701] Running sync for customresourcedefinition "machineconfigs.machineconfiguration.openshift.io" (514 of 618)   

I expected cluster-bootstrap to finish applying the manifests

Anything else we need to know?

I'm looking into this as part of installing single-node OpenShift.
I updated cluster-bootstrap not to wait for required-pods,
since no pod will start running and we just need the manifests to be applied.

Comment 1 Matthew Staebler 2020-12-07 20:30:01 UTC

You cannot change the behavior of the installer and then claim that it is a bug that the installer is not working the way that you expect. I am happy to work with you on making the changes that you would like to make, but bugzilla is not the right forum for that.

Comment 2 Matthew Staebler 2020-12-07 20:32:51 UTC

As a further hint, note the following in the cluster-version-operator logs. This is preventing the cluster-version-operator from applying any further manifests.

I1207 18:01:17.479693       1 task_graph.go:555] Result of work: [deployment openshift-cluster-version/cluster-version-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-version-operator-65cf74444d" has timed out progressing.)]
I1207 18:01:17.479712       1 sync_worker.go:865] Summarizing 1 errors
I1207 18:01:17.479716       1 sync_worker.go:869] Update error 5 of 618: WorkloadNotAvailable deployment openshift-cluster-version/cluster-version-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-version-operator-65cf74444d" has timed out progressing.) (*errors.errorString: deployment openshift-cluster-version/cluster-version-operator is not available and not progressing; updated replicas=1 of 1, available replicas=0 of 1)
E1207 18:01:17.479746       1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-cluster-version/cluster-version-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-version-operator-65cf74444d" has timed out progressing.)

Comment 3 Eran Cohen 2020-12-08 09:58:09 UTC

These are the cluster-version-operator deployment conditions during bootstrap:
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  cluster-version-operator-566dd969b7 (1/1 replicas created)
NewReplicaSet:   <none>

I guess at some point (trying to reproduce) the Progressing status turns to False as well.
According to this code, it will stop all workers when it finds out that it's not available:
https://github.com/openshift/cluster-version-operator/blob/2c9d4a2200216ad417206b24e7bf114ec22f4422/lib/resourcebuilder/apps.go#L117

Comment 4 Eran Cohen 2020-12-08 10:12:47 UTC

These are the cluster-version-operator deployment conditions during bootstrap (in the case where it fails):

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded
OldReplicaSets:  cluster-version-operator-65cf74444d (1/1 replicas created)
NewReplicaSet:   <none>
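
For illustration, here is a minimal Python sketch of the distinction between the two condition sets above. It assumes, based only on the "is not available and not progressing" wording in the CVO log from comment 2, that the Deployment is reported as an error when both Available and Progressing are False. The function name and data layout are hypothetical, and this is not the CVO's actual Go code.

```python
from typing import Dict, List, Optional


def deployment_blocker(conditions: List[Dict[str, str]]) -> Optional[str]:
    """Return a reason string if the Deployment looks stuck, else None.

    Assumption: "stuck" means Available=False and Progressing=False,
    mirroring the log message quoted in comment 2.
    """
    by_type = {c["type"]: c for c in conditions}
    available = by_type.get("Available", {})
    progressing = by_type.get("Progressing", {})
    if available.get("status") == "False" and progressing.get("status") == "False":
        return f"{available.get('reason')} / {progressing.get('reason')}"
    return None


# Conditions copied from comment 3 (still progressing) and comment 4 (stuck).
comment_3 = [
    {"type": "Available", "status": "False", "reason": "MinimumReplicasUnavailable"},
    {"type": "Progressing", "status": "True", "reason": "ReplicaSetUpdated"},
]
comment_4 = [
    {"type": "Available", "status": "False", "reason": "MinimumReplicasUnavailable"},
    {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
]

print(deployment_blocker(comment_3))
print(deployment_blocker(comment_4))
```

Under that assumption, only the comment 4 conditions would be flagged, which matches the failing runs.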

Comment 5 Vadim Rutkovsky 2020-12-08 10:42:58 UTC

CVO is perfectly fine. Your (single?) master has requested the configuration but never joined the cluster as a node. In fact, the log bundle is incomplete and doesn't have master info.

Comment 6 Eran Cohen 2020-12-08 11:40:45 UTC

Hey, CVO is fine.
I'm working on this https://issues.redhat.com/browse/MGMT-3177 (install single node openshift with bootstrap in-place).
For this to work I need cluster-bootstrap to apply all the manifests during bootstrap without any node joining the cluster.
It currently fails (~30% of the time) to apply the machineconfig CRs because CVO didn't apply the machineconfig CRD (as described above).
Do you have any idea on how I can solve this?

Comment 7 Vadim Rutkovsky 2020-12-08 11:45:40 UTC

> machineconfig CRs because CVO didn't apply the machineconfig CRD (as described above).

That's not how it works. CVO applies manifests, including the machine-config operator. But it doesn't have a node to run on, because your master never joined the cluster.

Comment 8 Eran Cohen 2020-12-08 12:06:23 UTC

I'm referring to the CVO running on the bootstrap node
The manifest for it is rendered here: https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L66

Comment 9 W. Trevor King 2020-12-08 21:58:16 UTC

From your attached log bundle:

$ grep -o 'Running sync for customresourcedefinition "[^"]*"' bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log | sort | uniq
...
Running sync for customresourcedefinition "machineautoscalers.autoscaling.openshift.io"
Running sync for customresourcedefinition "networks.config.openshift.io"
...

So no entry for machineconfigs.machineconfiguration.openshift.io [1] (from the MCO commit you linked in comment 0).  Checking the version of the machine-config operator you actually ran:

$ jq -r '.[].Config.Labels["io.openshift.build.commit.url"]' bootstrap/pods/ed22e15ba54f.inspect
https://github.com/openshift/machine-config-operator/commit/6896f6bc491fde7e8314631652f70841c7f9f31d

That still has the CRD manifest [2].  Not clear to me why the CVO in your failed run didn't pick up this particular CRD manifest.  Point us at an example release image pullspec?

[1]: https://github.com/openshift/machine-config-operator/blob/1655cc4ec6ce03a6a994f3389b03bf573f5601f9/install/0000_80_machine-config-operator_01_machineconfig.crd.yaml#L5
[2]: https://github.com/openshift/machine-config-operator/blob/6896f6bc491fde7e8314631652f70841c7f9f31d/install/0000_80_machine-config-operator_01_machineconfig.crd.yaml#L5
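
The same check can be scripted. The following is a rough Python sketch equivalent to the grep above: it collects every CRD name that appears in a "Running sync for customresourcedefinition" line and reports whether a given CRD was ever attempted. The function name is hypothetical, and the sample lines are CVO log lines quoted in this bug; in practice the lines would be read from the CVO container log in the bundle.

```python
import re

# Same pattern as the grep above, with a capture group for the CRD name.
CRD_SYNC = re.compile(r'Running sync for customresourcedefinition "([^"]*)"')


def synced_crds(log_lines):
    """Return the set of CRD names the CVO logged a sync attempt for."""
    names = set()
    for line in log_lines:
        match = CRD_SYNC.search(line)
        if match:
            names.add(match.group(1))
    return names


sample = [
    'I1207 16:49:42.640451       1 sync_worker.go:701] Running sync for service '
    '"openshift-machine-config-operator/machine-config-daemon" (511 of 618)',
    'I1207 16:49:45.093518       1 sync_worker.go:701] Running sync for customresourcedefinition '
    '"containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618)',
]

wanted = "machineconfigs.machineconfiguration.openshift.io"
crds = synced_crds(sample)
print(sorted(crds))
print(wanted in crds)
```

To run it against a real bundle, pass an open file object for the log instead of `sample`.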

Comment 10 Eran Cohen 2020-12-09 08:06:48 UTC

The release image is: quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64
What do you mean by release image pullspec?

Comment 11 Eran Cohen 2020-12-09 08:08:23 UTC

OK, I think it's this:
oc adm release info
Name:      4.6.1
Digest:    sha256:d78292e9730dd387ff6198197c8b0598da340be7678e8e1e4810b557a926c2b9
Created:   2020-10-22T07:18:38Z
OS/Arch:   linux/amd64
Manifests: 444

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:d78292e9730dd387ff6198197c8b0598da340be7678e8e1e4810b557a926c2b9

Release Metadata:
  Version:  4.6.1
  Upgrades: 4.5.14, 4.5.15, 4.5.16, 4.6.0-rc.4, 4.6.0
  Metadata:
    description: 
  Metadata:
    url: https://access.redhat.com/errata/RHBA-2020:4196

Component Versions:
  kubernetes 1.19.0               
  machine-os 46.82.202010091720-0 Red Hat Enterprise Linux CoreOS

Images:
  NAME                                           DIGEST
  aws-ebs-csi-driver                             sha256:e176509a104b2f96ee4f5c57275c6a712409aa80ac40071345d0a03bdde2b456
  aws-ebs-csi-driver-operator                    sha256:2689d3e66bbfdd7d493d969524b8a7da00142d1b4372e3e880bd825beb3da558
  aws-machine-controllers                        sha256:b7dc5f4101a8cb88c20d853908982258cab77bb0ac391e965b50b15648ddd854
  aws-pod-identity-webhook                       sha256:aef7c7802e877679d62d3d40ca4cac4fa8e84b2974673a8912f14d95eca08a08
  azure-machine-controllers                      sha256:9ef5deb841f1f4a8680f91ebb21952f0eaabf500f4523d891c075b69769ec769
  baremetal-installer                            sha256:f84c2b8f7ae78bd0e086ff14bae6bc2af85606eda692f1d0300f5550271c40f5
  baremetal-machine-controllers                  sha256:6eb0b79a701665269ff5282556fef9dbae69888bcda354c8013479d4d91aa278
  baremetal-operator                             sha256:3c9a9d63e4e6746ced1adf0d47fd49d7afac791b4a19e21001a6d7d5dbac12b7
  baremetal-runtimecfg                           sha256:0a851f6be3d3ab94ad12a684d40c7c718065d7292fcfe5cfeb8453fc18c64afb
  cli                                            sha256:6aa4bb97adf2142b0e74ccae7fd3661ada73cbaac803b86bb8712261e916d66d
  cli-artifacts                                  sha256:79978a34d1ab3b0ed1ad2c93c09bcb2fcdd1806b35e48a53c99d106347e1a59d
  cloud-credential-operator                      sha256:5e591cab19b41c7ea26eab6056cd518f6d64b59e8051978de927b1b984abfb1d
  cluster-authentication-operator                sha256:059f0179c0528c6234dbdca7e70fe779cf37be5121f458dd045d2e9662192f06
  cluster-autoscaler                             sha256:51b08f319750810ef474b210dae72b0faba34b02e635deb1bae84a63bec67db4
  cluster-autoscaler-operator                    sha256:f18151bf70434e1841ed8182c42e819e92e3d1ad3bbd269c667be8b74ff78444
  cluster-bootstrap                              sha256:6da72b403f8db3c810d372ec5baedb95767a627de11ee427b07d62a910532730
  cluster-config-operator                        sha256:25197b2709c0691c28424c9b07e505a71d13bf481e18bc42636cc84ee8fef033
  cluster-csi-snapshot-controller-operator       sha256:6ca671c810426b8c4f13dd0c7ac19639f9f265b952b8feb5a828e59fab785335
  cluster-dns-operator                           sha256:afdf0a3b426ac1c03df52e88a2b884f0714e54a1a03f33091954441a05a7f6b9
  cluster-etcd-operator                          sha256:1ad85195e1a180698fe4b8df82e3d72075efb256b53f593d13e29faaf7f3e15a
  cluster-image-registry-operator                sha256:e2b3f973bc5b9e55d2240a556c4648c921a3c8d3e12381757f1990a864208617
  cluster-ingress-operator                       sha256:9dbb31ac799b2c30270268714dcb3d11bafb329b98639a446657c8c7db41938c
  cluster-kube-apiserver-operator                sha256:4691dc29704c9cb06d2345894f1a8f074b58a0d208318c5218241388b0916e1b
  cluster-kube-controller-manager-operator       sha256:4289297f0b7ee7edf394348fd07e1fa1b3162655f2a2af2245e23af4b179e7f2
  cluster-kube-scheduler-operator                sha256:45586fd7a5cfd43ff546dbfb282a70a91eaf0f069f604230af958dc802832f89
  cluster-kube-storage-version-migrator-operator sha256:ee4abe53e80e561239e510a6f9999b4dc80b7b3fdc9848ab43d0bf8df24e815d
  cluster-machine-approver                       sha256:f7b9278ef2fbe988f50e4bdeeea79d9373b55689d17b8c6d7c214429f5b3f9a0
  cluster-monitoring-operator                    sha256:3cdb4589ee683c85e3d8f3f239187bc089d30cac6c26847a54894f6c328817c3
  cluster-network-operator                       sha256:93b3e1246884e357e1654e6c9578481aff9eef07eed1f9fdd0e9c8cc89a3770c
  cluster-node-tuning-operator                   sha256:72323ce541f8a26fbad17ef65ff21b51498863bb851635a0faa8d5b1ac6ce0e4
  cluster-openshift-apiserver-operator           sha256:aa4f37543b45bc248db8d9bd2dc45b6e159a8869b044c2310f541afba15b2694
  cluster-openshift-controller-manager-operator  sha256:c6b3aaaa38679b1d752ec09bd68c6d80a8911c74ec16d27c49de88ecb97823ee
  cluster-policy-controller                      sha256:9b564f882e31f497f57a0d99d406d5231eb15e9a97f0b450c21bec2bac7ff033
  cluster-samples-operator                       sha256:7f93199dcc01838f017030e0e8dd32d1d23fa268d25472e338e6843c8830d364
  cluster-storage-operator                       sha256:01250de496444bb624ec7b472ac9b0f7023809c88306a71c6ac87bb302f7dbe3
  cluster-update-keys                            sha256:7a282fd4cbbf0996947c52e6d179ebc257be049f607a743dfde3779f585e82e4
  cluster-version-operator                       sha256:acdd9a3992699bd912a3ac91c842a0617405615f7baf55c226641dc384222fed
  configmap-reloader                             sha256:88ddfbded8bc27b227ed7397ece050b756e522a9ffc34cbfba3c94c5ee58b740
  console                                        sha256:f52825e9905c926d399cd0b7afbb2b7d0370ae22da0416feac9131d555db0b98
  console-operator                               sha256:a2a167f59783ca402118fe35ea5fefbf457e01b64836f8be3be6695aefd76d76
  container-networking-plugins                   sha256:4cfd55719faff41e96c2e4be69e3f2381a57b8b3445b80ae4acfe8ee33d7f99b
  coredns                                        sha256:54fbeb744b82662fd38c0d301ebaad6ca8983707bc44db7235ead0fb7b95808f
  csi-driver-manila                              sha256:9d776971c76510e30de7295c8ff20cbb86969b9fb3bebe8e213953034b35ad7d
  csi-driver-manila-operator                     sha256:6e95b360acdec4f041893e893db746dcd784360a83331e5e59e677bc3ecc11c0
  csi-driver-nfs                                 sha256:a6e073b7f4ade77d854620b8120e0a642d9772a30c81778971b78572885f7482
  csi-external-attacher                          sha256:82758fbc97d9da98f20eddcfb4a8bc279726b97da96263d4c165b404389cb553
  csi-external-provisioner                       sha256:a96e2e4a62bca22da0b6903c9e20d7c776bd241f13accf51ede88965b232aca8
  csi-external-resizer                           sha256:12fe384de71c7621d9061f48afafeed3dc337679a66afd8d0a871e200295a1e5
  csi-external-snapshotter                       sha256:0531ff2ccf0ddea76e42cc9951470528bbd7def094884bc569f660376798f40a
  csi-livenessprobe                              sha256:1dcc413b621958f97dfbb3fc998a9e225ef155a80ffb151eb4694bf8370b383a
  csi-node-driver-registrar                      sha256:6cdaecd5dd9df8fd74529be7fa5d8973daf6f4ea95be8acfb2f5ac97773ebe67
  csi-snapshot-controller                        sha256:cfe62d81269929501517e75a7d337f7d8fc78ac9a17665adebfef52a2024584d
  deployer                                       sha256:f42509c18cf5e41201d64cf3a9c1994ffa5318f8d7cee5de45fa2da914e68bbc
  docker-builder                                 sha256:2986a09ed686a571312bcb20d648baac46b422efa072f8b68eb41c7996e94610
  docker-registry                                sha256:f86db3170270fc635dff0d7f1ba6e79a8f45de7e1dcfa5621474d1f6e07352ec
  etcd                                           sha256:a8214df42df962b965e3f4daad0b61932235e57241160861e503d84e38b775d5
  gcp-machine-controllers                        sha256:af009062907bdf0c0ed59e40515e3576f9216b79fb2fe80e154d528d928d040f
  grafana                                        sha256:8248dec0d94b2928aa4d63a22973d9a8f8f173a1431b2ab4ad15fdfe80283d7c
  haproxy-router                                 sha256:52424fd2af6fd7d7a5a1233032eb3f3c67f7691996b209e013e29f1524c5188c
  hyperkube                                      sha256:30094a2d586aa282d85e14f1be19abec1c30ce431673377b0e1c12d83e6bac8c
  insights-operator                              sha256:11c3a8bdddbbb2229bd68bb80b6009588873118881952c702dfebd1484046191
  installer                                      sha256:3f206c2ca0472d318ed03d164c7c1502796974da881136060677154bc5432415
  installer-artifacts                            sha256:8a3d96b12f0140658d06f793abe7c5279c4340c46938870bce652cbb12cbe19a
  ironic                                         sha256:a3b7c580f02c5f0ffaab21ef32bf4796084073c9b28391635685ea0f69248af9
  ironic-hardware-inventory-recorder             sha256:9193b931ea4736d5d26bedb90aef5f1ae2bbcd01539ea04a53485f2e3a160ab0
  ironic-inspector                               sha256:e28947f75883920ce7fb8e171e0350dc7156b1feae094cffff8e33b4cb807854
  ironic-ipa-downloader                          sha256:4d20d0f0d9f0bb521e5ed5355cd7f9d710819935b0c0bdbec9e029983f11ece6
  ironic-machine-os-downloader                   sha256:8bd4dc5aa2bba650fef27e39e40ab29ab11ae13ad45f02af14ad9f6111d86e33
  ironic-static-ip-manager                       sha256:550a230293580919793a404afe9de6bf664c335e5ea9b6211b07bccf0f80efc7
  jenkins                                        sha256:0f28b84725a89504ffd3695ad4c2928e4235643b6135021a9b147f039ef943f3
  jenkins-agent-maven                            sha256:8839c47594ea0955ebd791275f8d2910597dbb8921edbf778d65d7c878671f44
  jenkins-agent-nodejs                           sha256:fac9f86793704ca6415f4f60434355d313e629d7b5cd214211cce4fe717e3afd
  k8s-prometheus-adapter                         sha256:a37a568c63563257309cb0ffb6e185d98f662ff3201d2099cbe0df404b93f0c8
  keepalived-ipfailover                          sha256:827c33240e2e92824087c56a3366ddc19e6ea146cee13ec2d9ed9c64c01f6c4d
  kube-proxy                                     sha256:61c7b95baabf9cf4303632a73d1533ebffdb9118ebfd4946dcf101a0ae117da3
  kube-rbac-proxy                                sha256:c75977f28becdf4f7065cfa37233464dd31208b1767e620c4f19658f53f8ff8c
  kube-state-metrics                             sha256:9e7f0468850aeb13585ef049f687cc42c05d82bc0e0200607d1a93d7f9740fe5
  kube-storage-version-migrator                  sha256:a1aaf99f2ed745c5353d9fc715fa8e9111f42165e3012fad73640c438ba6aa6f
  kuryr-cni                                      sha256:cacfe6aad8d5284f7643aab02e62862c90ffac123bdedc8a958edf3d754779a8
  kuryr-controller                               sha256:f1adee60cabec68dd24cbdc16ad4b4ed2d51f1daa203255d4e099b77a3ca4f84
  libvirt-machine-controllers                    sha256:64aefcedad8ed52ed54b51af16342ee83a2384c85e5b70a75f0f301e41e05740
  local-storage-static-provisioner               sha256:d0383d0a12c1e466fd174b88794d7c711d7f83825ec0a865b6e7cdf7b996e2ee
  machine-api-operator                           sha256:c51d44f380ecff7d36b1de4bb3bdbd4ac66abc6669724f28d81bd4af5741a8ac
  machine-config-operator                        sha256:8923050603588c27d79b33b371afb651288470d5cdeb14f8e10249bca1a1c461
  machine-os-content                             sha256:22f9d04db364c19da1fc219513752484c04bbe1340bc81475bf861aabfb42c8b
  mdns-publisher                                 sha256:6792231e4d68c0ecb99fb6a6b84ac440bdb7b39a6ac2e6301e2ef1e7a42bf49b
  multus-admission-controller                    sha256:92feaeb8763ece68147b522bfa8914bcd429e9825185b9b9c05247ad2857d03f
  multus-cni                                     sha256:8f4882cff3c2f9521215eac681c5abda42876e3e955431c1387fb457940b8344
  multus-route-override-cni                      sha256:71051bdf1b96c953fc1dfd48359915bf5c027613de6f5e2fa8adeea8d3dda311
  multus-whereabouts-ipam-cni                    sha256:6bada08687c20afe316c1f8cf76f001f31bedae317f256f3df3affaa5d0dc25e
  must-gather                                    sha256:fa63640328598f72567027e9cd0d50f00d4ec058dacc61f3be3c6cca7fbefac5
  network-metrics-daemon                         sha256:cf565b2fab365e027962a25a8cffb41aa35cb5a00d001e081d53c7fed5a0c54b
  oauth-apiserver                                sha256:65206861218064576dc092040e9c24b0393b8a07502e351f513400f187f38cc7
  oauth-proxy                                    sha256:12b11e2000b42ce1aaa228d9c1f4c9177395add2fa43835e667b7fc9007e40e6
  oauth-server                                   sha256:076b280e17c6bb4cc618db71403ccec75f8196c8849061a40c680a2808292bb6
  openshift-apiserver                            sha256:7584014b0cb8cb2c5a09b909c79f2f8ad6e49854bcfabf72e96a22330bcf6f56
  openshift-controller-manager                   sha256:dc6a6a1d4a6b2af67421561e53d1af1d40c99ae72de69f4c3cc390d447f12269
  openshift-state-metrics                        sha256:78d3478e632c761c18e2dcb55d26e388ecfd126d4fde60317868133dc2fd57f7
  openstack-machine-controllers                  sha256:56d7f816f3ebac92afbcedc1f70cdf7ce870c199f22c18ec7f0a389b595afb51
  operator-lifecycle-manager                     sha256:ae6a5decd040a6b3adfa074d3211ab92a36b77b2d849962d9a678e1c2c5ef5c1
  operator-marketplace                           sha256:01626d98c80e44e0cd3a522ad019eb236e39c30b0dfff0ac5a6fa98686159286
  operator-registry                              sha256:cf4f2d5c38d111332a5b5c34bb849af1dbb9454a7fdaeb948eebcaeaf54e750a
  ovirt-csi-driver                               sha256:478667370a183265595fabb478757e7995b01c133b5cd5a36c911ae81c4d86d4
  ovirt-csi-driver-operator                      sha256:33b1f9e501382666998662e4e6aed8abfdb49a89636bf9624a8aecd66814a321
  ovirt-machine-controllers                      sha256:f8c96ccd6afc5d922717d156e352411965e09b46c8bfb30183dd73cfd52c5ffd
  ovn-kubernetes                                 sha256:74f3b168a1e08a9871d3ed27fe1ed2c75c8ce7d08284eb096d11787616519112
  pod                                            sha256:8bd90fcca7990c0edead15298dcec963968274d299428da95eae41aa23157b90
  prom-label-proxy                               sha256:b635694dc0663a1404d43a4a9ac8513a3087c7cfead50f6ab413f3c217c40b2a
  prometheus                                     sha256:3d0361f380abf5252b1b640e3ceaaab8274e2af8cdb605b20b513a1a44b3a4dc
  prometheus-alertmanager                        sha256:e5bcf6d786fd218e1ef188eb40c39c31a98d03121fba3b7a1f16e87e45a7478b
  prometheus-config-reloader                     sha256:8c9e61400619c4613db5cc73097d287e3cd5d2125c85d1d84cc30cfdaa1093e7
  prometheus-node-exporter                       sha256:73fdcef5de85e739831c5a8b76dce349a3c8832ff416a46263743d7e61655cbb
  prometheus-operator                            sha256:26f6c930942ee4dea7c1e22d220bba11561c37bdc47101c4490ce0ef77c9203a
  sdn                                            sha256:82d1def7312de8ae5dee32d237ad59fe685923e78668fa3547e6bee445cd8842
  service-ca-operator                            sha256:357e35286fd26fed015c03a9c451f6fdcf61cf0821d959025e7f800e7c533f29
  telemeter                                      sha256:0c42cd3a74176732e8c06105c47674c7d410c7167c8e3fbd80f9a76e9bfda5bd
  tests                                          sha256:712f5587b13f4073a0a7453d3a641de37fee98d9c64c3f4137668a8437455655
  thanos                                         sha256:f644e4b495071a2b9d0d6f5d48cb96dad9f7ea8298cc22c98824bc70229ea9dd
  tools                                          sha256:c3cf658772cc3c17b69f8f3bf307bbcd77ab1e6e105052bebb6a3d0954edf4f4

Comment 12 W. Trevor King 2020-12-15 22:50:07 UTC

Bootstrap-in-place effort for the CVO is being tracked in https://issues.redhat.com/browse/OTA-329.

Comment 13 W. Trevor King 2020-12-16 23:58:24 UTC

Aha.  Back to comment 0's log bundle:

$ grep 'Running sync.*generation\|(51. of 618)\|Result of work' bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log 
I1207 16:49:27.954976       1 sync_worker.go:494] Running sync 4.6.1 (force=false) on generation 1 in state Initializing at attempt 0
I1207 16:49:29.336838       1 task_graph.go:555] Result of work: []
I1207 16:49:40.065578       1 sync_worker.go:701] Running sync for namespace "openshift-vsphere-infra" (510 of 618)
I1207 16:49:42.640428       1 sync_worker.go:713] Done syncing for namespace "openshift-vsphere-infra" (510 of 618)
I1207 16:49:42.640451       1 sync_worker.go:701] Running sync for service "openshift-machine-config-operator/machine-config-daemon" (511 of 618)
I1207 16:49:45.093494       1 sync_worker.go:713] Done syncing for service "openshift-machine-config-operator/machine-config-daemon" (511 of 618)
I1207 16:49:45.093518       1 sync_worker.go:701] Running sync for customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618)
E1207 16:49:45.828734       1 task.go:81] error running apply for customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618): context canceled
I1207 16:49:48.220635       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 618) Could not update console "cluster" (20 of 618) Could not update namespace "openshift-config-managed" (46 of 618) Could not update secret "openshift-etcd-operator/etcd-client" (64 of 618) Could not update namespace "openshift-kube-apiserver-operator" (78 of 618) Could not update serviceaccount "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (113 of 618) Could not update serviceaccount "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator" (122 of 618) Could not update config "cluster" (126 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-openstack" (137 of 618) Could not update serviceaccount "openshift-apiserver-operator/openshift-apiserver-operator" (179 of 618) Could not update kubestorageversionmigrator "cluster" (190 of 618) Could not update rolebinding "openshift-cloud-credential-operator/cloud-credential-operator" (201 of 618) Could not update configmap "openshift-authentication-operator/trusted-ca-bundle" (216 of 618) Could not update serviceaccount "openshift-machine-api/cluster-autoscaler-operator" (228 of 618) Could not update configmap "openshift-cluster-storage-operator/csi-snapshot-controller-operator-config" (246 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/openshift-image-registry-gcs" (265 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/openshift-ingress-azure" (283 of 618) Could not update role "openshift-config-managed/machine-approver" (296 of 618): resource may have been deleted Could not update namespace "openshift-monitoring" (311 of 618) Could not update clusterrole "cluster-node-tuning:tuned" (326 of 618) Could not update clusterrolebinding "system:openshift:operator:openshift-controller-manager-operator" (336 of 618) 
Could not update prometheusrule "openshift-cluster-samples-operator/samples-operator-alerts" (341 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/ovirt-csi-driver-operator" (371 of 618) Could not update oauthclient "console" (380 of 618): the server does not recognize this resource, check extension API servers Could not update clusterrolebinding "insights-operator-gather-reader" (426 of 618) Could not update customresourcedefinition "subscriptions.operators.coreos.com" (448 of 618) Could not update configmap "openshift-marketplace/marketplace-trusted-ca" (471 of 618) Could not update serviceca "cluster" (483 of 618) Cluster operator network is still updating Could not update service "openshift-dns-operator/metrics" (502 of 618) Could not update customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618) Could not update servicemonitor "openshift-cloud-credential-operator/cloud-credential-operator" (525 of 618) Could not update role "openshift-authentication/prometheus-k8s" (528 of 618) Could not update servicemonitor "openshift-image-registry/image-registry" (537 of 618) Could not update servicemonitor "openshift-cluster-machine-approver/cluster-machine-approver" (541 of 618): the server does not recognize this resource, check extension API servers Could not update operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (543 of 618) Could not update configmap "openshift-config-managed/release-verification" (555 of 618) Could not update role "openshift-console-operator/prometheus-k8s" (556 of 618) Could not update servicemonitor "openshift-dns-operator/dns-operator" (561 of 618) Could not update servicemonitor "openshift-etcd-operator/etcd-operator" (565 of 618) Could not update servicemonitor "openshift-ingress-operator/ingress-operator" (568 of 618) Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (572 of 618) Could not update servicemonitor 
"openshift-kube-controller-manager-operator/kube-controller-manager-operator" (580 of 618) Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (587 of 618) Could not update servicemonitor "openshift-machine-api/machine-api-operator" (592 of 618) Could not update servicemonitor "openshift-machine-config-operator/machine-config-daemon" (595 of 618) Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (599 of 618) Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (604 of 618) Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (612 of 618) Could not update servicemonitor "openshift-service-ca-operator/service-ca-operator" (618 of 618)]
I1207 16:49:48.220816       1 sync_worker.go:869] Update error 512 of 618: UpdatePayloadFailed Could not update customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618) (*errors.errorString: context canceled)
* Could not update customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618)
I1207 16:49:48.222197       1 sync_worker.go:494] Running sync 4.6.1 (force=false) on generation 2 in state Updating at attempt 0
I1207 16:49:48.347533       1 task_graph.go:555] Result of work: []
I1207 16:55:30.136417       1 task_graph.go:555] Result of work: [Cluster operator etcd is still updating]
I1207 16:56:20.704109       1 sync_worker.go:494] Running sync 4.6.1 (force=false) on generation 2 in state Updating at attempt 1
I1207 16:56:20.761444       1 task_graph.go:555] Result of work: []
I1207 17:02:02.618796       1 task_graph.go:555] Result of work: [Cluster operator etcd is still updating]
...

So the CVO timed out on 512 of 618, before getting to the MachineConfig CRD at 514 of 618, and then buggily transitioned from Initializing to Updating.  Updates have ordered, blocking reconciliation, while the install stage is a free-for-all [1].  So the fact that etcd is not level blocks further attempts at installing the MachineConfig CRD, and the install eventually hangs.
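
To make the difference between the two modes concrete, here is a deliberately simplified Python toy model. It assumes a flat list of manifests, with "ordered" meaning "stop at the first manifest that is not level"; the real CVO manifest graph described in [1] is more involved. The manifest names and the `attempted` function are hypothetical.

```python
def attempted(manifests, failing, ordered):
    """Return the manifests attempted in one sync pass.

    ordered=False models the install-time free-for-all: everything is tried.
    ordered=True models update-time blocking: stop at the first failure.
    """
    tried = []
    for name in manifests:
        tried.append(name)
        if ordered and name in failing:
            break  # later manifests are never reached in this pass
    return tried


# Hypothetical ordering: an operator that is not level sits ahead of the CRD.
manifests = ["etcd-operator", "containerruntimeconfigs-crd", "machineconfigs-crd"]
failing = {"etcd-operator"}

install_pass = attempted(manifests, failing, ordered=False)
update_pass = attempted(manifests, failing, ordered=True)
print(install_pass)
print(update_pass)
```

In this toy model the update-style pass never reaches the MachineConfig CRD, which is the shape of the hang described above.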

The installer is waiting for ClusterVersion to go Available=True [2].  Not clear to me what the CVO was claiming in this particular install, because the log-bundle failed to capture ClusterVersion:

$ wc resources/clusterversion.json 
0 0 0 resources/clusterversion.json

But we need to double-check the transition out of "Initializing" and fix whatever we're missing here.  Checking on the reporting version:

$ head -n1 bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log
I1207 16:48:48.401667       1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

That's pretty close to the 4.6 tip, with the only subsequent change being unrelated:

$ git --no-pager log --oneline c35a4e1..origin/release-4.6
39a42566 (origin/release-4.6) Merge pull request #480 from openshift-cherrypick-robot/cherry-pick-477-to-release-4.6
42d3c6d7 (origin/pr/480) pkg/cvo/metrics: Abandon child goroutines after shutdownContext expires

[1]: https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/docs/user/reconciliation.md#manifest-graph
[2]: https://github.com/openshift/installer/blob/e235783fe671864e0628771d0e5d13904e4d062a/cmd/openshift-install/create.go#L388-L391

Comment 14 W. Trevor King 2020-12-17 00:05:14 UTC

I am not entirely clear on why nobody has been bitten by this before, but the fact that it was reported in a released 4.6.z means it's not a new-in-4.7 regression to block 4.7 GA or a new-in-4.6.z regression to block future 4.6.z (c35a4e1 went live with 4.6 GA [1]).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1886947#c9

Comment 15 W. Trevor King 2020-12-17 00:13:56 UTC

Created attachment 1739834 [details]
transition portion of the CVO logs from comment 0's log bundle, for convenience

Comment 17 W. Trevor King 2020-12-17 00:56:37 UTC

Bug may be in [1], which lacks the hasNeverReachedLevel guard we have over in [2].  Or maybe equalSyncWork has a bug, because the transition logs include "Work updated" [3].  And before the bit I copied out for the transition logs, we had [4]:

  I1207 16:49:27.954369       1 sync_worker.go:249] Notify the sync worker that new work is available

Looking for the "no change" I expect [5]:

  $ grep 'Update work is equal to current target' bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log | head -n1
  I1207 16:49:27.961015       1 sync_worker.go:222] Update work is equal to current target; no change required

So just after the "nominally new work" round.  Maybe whatever we put in the initial state needs some tweaking, so equalSyncWork does not trip up over the difference between our initial seed and the first round in from the ClusterVersion object.  Or something...

[1]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L402-L408
[2]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/cvo.go#L455-L464
[3]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L280
[4]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L249
[5]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L222

Comment 18 Jack Ottofaro 2020-12-17 23:14:56 UTC

(In reply to W. Trevor King from comment #17)
> Bug may be in [1], which lacks the hasNeverReachedLevel guard we have over
> in [2].  Or maybe equalSyncWork has a bug, because the transition logs
> include "Work updated" [3].  And before the bit I copied out for the
> transition logs, we had [4]:
> 
>   I1207 16:49:27.954369       1 sync_worker.go:249] Notify the sync worker
> that new work is available
> 
> Looking for the "no change" I expect [5]:
> 
>   $ grep 'Update work is equal to current target'
> bootstrap/containers/cluster-version-operator-
> 013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log | head
> -n1
>   I1207 16:49:27.961015       1 sync_worker.go:222] Update work is equal to
> current target; no change required
> 
> So just after the "nominally new work" round.  Maybe whatever we put in the
> initial state needs some tweaking, so equalSyncWork does not trip up over
> the difference between our initial seed and the first round in from the
> ClusterVersion object.  Or something...
> 
> [1]:
> https://github.com/openshift/cluster-version-operator/blob/
> c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L402-L408
> [2]:
> https://github.com/openshift/cluster-version-operator/blob/
> c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/cvo.go#L455-L464
> [3]:
> https://github.com/openshift/cluster-version-operator/blob/
> c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L280
> [4]:
> https://github.com/openshift/cluster-version-operator/blob/
> c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L249
> [5]:
> https://github.com/openshift/cluster-version-operator/blob/
> c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L222

Right. Looking at equalSyncWork [1], logs indicate Desired hasn't changed. Can Overrides be changed during an install?

[1] https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/pkg/cvo/sync_worker.go#L471

Comment 19 Scott Dodson 2021-01-05 17:57:58 UTC

We believe that this is a side effect of a hack being attempted to enable SNO right now. Jack will provide more details on what we do plan to do in order to avoid problems associated with this but we do feel like this is unlikely to be a problem during normal installations. Therefore we're lowering severity and priority to medium.

Comment 20 Jack Ottofaro 2021-01-05 21:46:18 UTC

In a Slack discussion, Eran confirmed that some components are set as unmanaged (a change to Overrides, see https://bugzilla.redhat.com/show_bug.cgi?id=1905221#c18) as a temporary hack during single-node installation. The hack is there to get the single node installed and will be removed once Cluster High-availability Mode is supported. He will try applying the hack after cluster-bootstrap has finished, to verify that this avoids the CVO issue.

Although changing manifests during the initial install is not currently supported, it should not break the CVO. At a minimum, the CVO should be modified so that it does not transition from Initializing to Updating upon such a change.
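The behavior being asked for can be sketched with a small Python state model. This is an illustration under stated assumptions, not the CVO implementation or the actual patch: `SyncWork`, `Worker`, `guard`, and `reached_level` are hypothetical names, loosely inspired by the equalSyncWork comparison and the hasNeverReachedLevel guard discussed in comment 17, and the override entry is reduced to a plain string.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SyncWork:
    """Hypothetical stand-in for the work the sync worker compares."""
    desired: str
    overrides: Tuple[str, ...] = ()


class Worker:
    def __init__(self, work: SyncWork, guard: bool) -> None:
        self.work = work
        self.state = "Initializing"
        self.reached_level = False  # has the payload ever been fully applied?
        self.guard = guard          # True models the requested behavior

    def mark_level(self) -> None:
        """Record that the initial install completed."""
        self.reached_level = True
        self.state = "Reconciling"

    def update(self, new_work: SyncWork) -> str:
        """Handle possibly-changed work and return the resulting state."""
        if new_work == self.work:
            return self.state  # no change required
        self.work = new_work
        if self.guard and not self.reached_level:
            return self.state  # still installing: stay in Initializing
        self.state = "Updating"
        return self.state


seed = SyncWork("4.6.1")
patched = SyncWork("4.6.1", ("DaemonSet/cluster-network-operator",))
```

Without the guard, `Worker(seed, guard=False).update(patched)` moves to Updating even though only the overrides changed, which is the transition seen in this bug. With the guard, the same call leaves the worker in Initializing until `mark_level()` has been called.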

Comment 21 Igal Tsoiref 2021-01-10 08:58:48 UTC

We moved the patch service to after the pivot phase and saw that the CVO no longer has to be restarted. We saw that it is the patching that causes the CVO to get stuck.
By the way, after we moved the patches we saw that, many times after the pivot (the bootstrap phase is really quick), the CVO does not start because it is missing secrets.
We are still not sure why. Any ideas?

Comment 22 Igal Tsoiref 2021-01-10 11:58:42 UTC

Sometimes we are missing cluster-version-operator-serving-cert. I think we have a race; maybe it would be better to wait for it?

Comment 24 Johnny Liu 2021-01-21 03:07:27 UTC

@Eran or @Jack, 

Do you know how to verify this bug from a QE perspective?

Comment 25 W. Trevor King 2021-01-21 03:16:02 UTC

Install a cluster with the fix, and, while it is installing (probably after bootstrap complete, just for access to logs), put a log-tail on the CVO pod and then insert an entry in ClusterVersion's spec.overrides.  Before the patch, that should transition the CVO to update-mode (look for log lines like "Running sync ... in state").  With the patch, the CVO should stay in the Initializing phase despite the overrides touch.

Comment 26 Eran Cohen 2021-01-21 08:59:34 UTC

I added the patch during cluster-bootstrap, once you have a control plane started on the bootstrap.

Comment 27 Johnny Liu 2021-01-22 11:58:33 UTC

Verified this bug with 4.7.0-0.nightly-2021-01-22-063949; it passes.

Trigger an install. Once the bootstrap instance boots up (before bootkube completes), ssh into it and run the following commands:

[root@ip-10-0-7-22 ~]# vi /tmp/version-patch-first-override.yaml
[root@ip-10-0-7-22 ~]# cat /tmp/version-patch-first-override.yaml 
- op: add
  path: /spec/overrides
  value:
  - kind: DaemonSet
    group: apps/v1
    name: cluster-network-operator
    namespace: openshift-cluster-network-operator
    unmanaged: true
[root@ip-10-0-7-22 ~]# cd /etc/kubernetes/
[root@ip-10-0-7-22 kubernetes]# oc patch clusterversion version --type json -p "$(cat /tmp/version-patch-first-override.yaml)"
clusterversion.config.openshift.io/version patched
[root@ip-10-0-7-22 kubernetes]#  oc get all -n openshift-cluster-version
NAME                                           READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-7d5c5dfd8-j5mbh   0/1     Pending   0          7m14s

[root@ip-10-0-7-22 kubernetes]# crictl ps -a |grep version
774777ae3b19f       registry.ci.openshift.org/ocp/release@sha256:7e62d6eced986e77be37153e8e38069163e0457e29859664ff6d016dbdca0b3b            8 minutes ago       Running             cluster-version-operator         0                   ce4c60bed90b7

Note: check the log of the CVO static pod running on the bootstrap instance, not the pod under the openshift-cluster-version namespace via the oc command.

[root@ip-10-0-7-22 kubernetes]# crictl logs 774777ae3b19f 2>&1 |grep "Running sync .* in state"
I0122 11:35:08.677588       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 1
I0122 11:38:24.963065       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 2
I0122 11:42:00.183391       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 3

On an old build, I can reproduce it like this:
[root@ip-10-0-16-5 kubernetes]# crictl logs 9f4b34e61178e 2>&1 |grep "Running sync .* in state"
I0122 11:14:56.528110       1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 1 in state Initializing at attempt 0
I0122 11:18:11.963633       1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 1 in state Initializing at attempt 1
I0122 11:19:31.456647       1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 2 in state Updating at attempt 0
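The grep-based check above can also be scripted. The following Python sketch parses the "Running sync ... in state" lines and flags a direct Initializing-to-Updating jump. The regular expression is derived only from the log lines quoted in this comment, so it is an assumption that other CVO versions log in the same format; `sync_states` and `jumped_to_updating` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Pattern inferred from the log lines quoted in this bug.
SYNC_RE = re.compile(
    r"Running sync (?P<version>\S+) \(force=(?:true|false)\) "
    r"on generation (?P<generation>\d+) in state (?P<state>\w+) "
    r"at attempt (?P<attempt>\d+)"
)


def sync_states(log_text: str) -> List[Tuple[int, str, int]]:
    """Return (generation, state, attempt) for each sync line, in order."""
    found = []
    for line in log_text.splitlines():
        m = SYNC_RE.search(line)
        if m:
            found.append((int(m["generation"]), m["state"], int(m["attempt"])))
    return found


def jumped_to_updating(log_text: str) -> bool:
    """True if an Initializing sync is directly followed by an Updating one."""
    states = [state for _, state, _ in sync_states(log_text)]
    return any(a == "Initializing" and b == "Updating"
               for a, b in zip(states, states[1:]))


# Sample input copied from the old-build and fixed-build logs above.
OLD_LOG = """\
I0122 11:14:56.528110       1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 1 in state Initializing at attempt 0
I0122 11:18:11.963633       1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 1 in state Initializing at attempt 1
I0122 11:19:31.456647       1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 2 in state Updating at attempt 0
"""
NEW_LOG = """\
I0122 11:35:08.677588       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 1
I0122 11:38:24.963065       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 2
I0122 11:42:00.183391       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 3
"""
```

In practice the text would come from the `crictl logs` output of the CVO container on the bootstrap instance; `jumped_to_updating` is expected to be true for the old build's log and false for the fixed build's log.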

Comment 30 errata-xmlrpc 2021-02-24 15:40:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633