Created attachment 1737382 [details] installer-gather

Version:

[root@hpe-c21124gp3-01 installer]# ./bin/openshift-install version
./bin/openshift-install unreleased-master-3767-gdc5b4bc03edcd9c7289cfe80ee946aeeddb3bc7e-dirty
built from commit dc5b4bc03edcd9c7289cfe80ee946aeeddb3bc7e
release image registry.svc.ci.openshift.org/origin/release:4.5

Platform: libvirt

Please specify:
* UPI (semi-manual installation on customized infrastructure)

What happened?
Sometimes (~30% of runs) bootkube hangs while waiting for cluster-bootstrap to complete; cluster-bootstrap fails to apply these manifests (it never converges):

Dec 07 18:08:45 master1 bootkube.sh[2243]: "99_openshift-machineconfig_99-master-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-master-ssh.yaml": no matches for kind "MachineConfig" in >
Dec 07 18:08:45 master1 bootkube.sh[2243]: "99_openshift-machineconfig_99-worker-ssh.yaml": unable to get REST mapping for "99_openshift-machineconfig_99-worker-ssh.yaml": no matches for kind "MachineConfig" in >

According to this commit, the missing CRD should be created by the CVO: https://github.com/openshift/machine-config-operator/commit/1655cc4ec6ce03a6a994f3389b03bf573f5601f9 and I do see it in the log in the ~70% of runs that succeed:

I1207 18:15:10.423230 1 sync_worker.go:701] Running sync for customresourcedefinition "machineconfigs.machineconfiguration.openshift.io" (514 of 618)

I expected cluster-bootstrap to finish applying the manifests.

Anything else we need to know?
I'm looking into this as part of installing single-node OpenShift. I updated cluster-bootstrap not to wait for required pods, since no pod will start running and we just need the manifests to be applied.
You cannot change the behavior of the installer and then claim that it is a bug that the installer is not working the way that you expect. I am happy to work with you on making the changes that you would like to make, but bugzilla is not the right forum for that.
As a further hint, note the following in the cluster-version-operator logs. This is preventing the cluster-version-operator from applying any further manifests.

I1207 18:01:17.479693 1 task_graph.go:555] Result of work: [deployment openshift-cluster-version/cluster-version-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-version-operator-65cf74444d" has timed out progressing.)]
I1207 18:01:17.479712 1 sync_worker.go:865] Summarizing 1 errors
I1207 18:01:17.479716 1 sync_worker.go:869] Update error 5 of 618: WorkloadNotAvailable deployment openshift-cluster-version/cluster-version-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-version-operator-65cf74444d" has timed out progressing.) (*errors.errorString: deployment openshift-cluster-version/cluster-version-operator is not available and not progressing; updated replicas=1 of 1, available replicas=0 of 1)
E1207 18:01:17.479746 1 sync_worker.go:348] unable to synchronize image (waiting 2m50.956499648s): deployment openshift-cluster-version/cluster-version-operator is not available MinimumReplicasUnavailable (Deployment does not have minimum availability.) or progressing ProgressDeadlineExceeded (ReplicaSet "cluster-version-operator-65cf74444d" has timed out progressing.)
This is the cluster-version-operator deployment conditions during bootstrap:

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  cluster-version-operator-566dd969b7 (1/1 replicas created)
NewReplicaSet:   <none>

I guess at some point (trying to reproduce) the Progressing status turns to False as well. According to this code, it will stop all workers when it finds out that it's not available: https://github.com/openshift/cluster-version-operator/blob/2c9d4a2200216ad417206b24e7bf114ec22f4422/lib/resourcebuilder/apps.go#L117
This is the cluster-version-operator deployment conditions during bootstrap (in the failing case):

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    False   ProgressDeadlineExceeded
OldReplicaSets:  cluster-version-operator-65cf74444d (1/1 replicas created)
NewReplicaSet:   <none>
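The two sets of conditions above (Available=False with Progressing=True while bootstrapping, versus Available=False with Progressing=False/ProgressDeadlineExceeded in the failing case) are what drive the worker-stopping behavior in the apps.go code linked earlier. Here is a toy sketch of that decision, with an invented function name and simplified inputs (not the real resourcebuilder code):

```go
package main

import (
	"errors"
	"fmt"
)

// deploymentDone is a hypothetical condensed model of the availability
// check in lib/resourcebuilder/apps.go. A deployment that is unavailable
// but still progressing is retried, while one that is unavailable and no
// longer progressing (ProgressDeadlineExceeded) is surfaced as a hard
// error, which is what stops the CVO's task-graph workers.
func deploymentDone(available, progressing bool) (bool, error) {
	if available {
		return true, nil
	}
	if progressing {
		return false, nil // not done yet; poll again
	}
	return false, errors.New("deployment is not available and not progressing")
}

func main() {
	// Comment 3's conditions: Available=False, Progressing=True -> retry.
	done, err := deploymentDone(false, true)
	fmt.Println(done, err)

	// Failing-case conditions: Available=False, Progressing=False -> hard error.
	done, err = deploymentDone(false, false)
	fmt.Println(done, err)
}
```

Note how the error string matches the "not available and not progressing" wording from the sync_worker log lines in comment 3.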
CVO is perfectly fine. Your (single?) master has requested the configuration but never joined the cluster as a node. In fact, the log bundle is incomplete and doesn't have master info.
Hey, CVO is fine. I'm working on this https://issues.redhat.com/browse/MGMT-3177 (install single-node OpenShift with bootstrap in-place). For this to work I need cluster-bootstrap to apply all the manifests during bootstrap without any node joining the cluster. It currently fails (~30% of the time) to apply the machineconfig CRs because the CVO didn't apply the machineconfig CRD (as described above). Do you have any idea how I can solve this?
> machineconfig CRs because CVO didn't apply the machineconfig CRD (as described above).

That's not how it works. CVO applies manifests, including the machine-config operator. But it doesn't have a node to run on, 'cause your master never joined the cluster.
I'm referring to the CVO running on the bootstrap node. The manifest for it is rendered here: https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L66
From your attached log bundle:

$ grep -o 'Running sync for customresourcedefinition "[^"]*"' bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log | sort | uniq
...
Running sync for customresourcedefinition "machineautoscalers.autoscaling.openshift.io"
Running sync for customresourcedefinition "networks.config.openshift.io"
...

So no entry for machineconfigs.machineconfiguration.openshift.io [1] (from the MCO commit you linked in comment 0). Checking the version of the machine-config operator you actually ran:

$ jq -r '.[].Config.Labels["io.openshift.build.commit.url"]' bootstrap/pods/ed22e15ba54f.inspect
https://github.com/openshift/machine-config-operator/commit/6896f6bc491fde7e8314631652f70841c7f9f31d

That still has the CRD manifest [2]. Not clear to me why the CVO in your failed run didn't pick up this particular CRD manifest. Point us at an example release image pullspec?

[1]: https://github.com/openshift/machine-config-operator/blob/1655cc4ec6ce03a6a994f3389b03bf573f5601f9/install/0000_80_machine-config-operator_01_machineconfig.crd.yaml#L5
[2]: https://github.com/openshift/machine-config-operator/blob/6896f6bc491fde7e8314631652f70841c7f9f31d/install/0000_80_machine-config-operator_01_machineconfig.crd.yaml#L5
The release image is: quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 What do you mean by release image pullspec?
OK, I think it's this:

oc adm release info
Name:      4.6.1
Digest:    sha256:d78292e9730dd387ff6198197c8b0598da340be7678e8e1e4810b557a926c2b9
Created:   2020-10-22T07:18:38Z
OS/Arch:   linux/amd64
Manifests: 444

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:d78292e9730dd387ff6198197c8b0598da340be7678e8e1e4810b557a926c2b9

Release Metadata:
  Version:  4.6.1
  Upgrades: 4.5.14, 4.5.15, 4.5.16, 4.6.0-rc.4, 4.6.0
  Metadata:
    description:
  Metadata:
    url: https://access.redhat.com/errata/RHBA-2020:4196

Component Versions:
  kubernetes 1.19.0
  machine-os 46.82.202010091720-0 Red Hat Enterprise Linux CoreOS

Images:
  NAME  DIGEST
  aws-ebs-csi-driver  sha256:e176509a104b2f96ee4f5c57275c6a712409aa80ac40071345d0a03bdde2b456
  aws-ebs-csi-driver-operator  sha256:2689d3e66bbfdd7d493d969524b8a7da00142d1b4372e3e880bd825beb3da558
  aws-machine-controllers  sha256:b7dc5f4101a8cb88c20d853908982258cab77bb0ac391e965b50b15648ddd854
  aws-pod-identity-webhook  sha256:aef7c7802e877679d62d3d40ca4cac4fa8e84b2974673a8912f14d95eca08a08
  azure-machine-controllers  sha256:9ef5deb841f1f4a8680f91ebb21952f0eaabf500f4523d891c075b69769ec769
  baremetal-installer  sha256:f84c2b8f7ae78bd0e086ff14bae6bc2af85606eda692f1d0300f5550271c40f5
  baremetal-machine-controllers  sha256:6eb0b79a701665269ff5282556fef9dbae69888bcda354c8013479d4d91aa278
  baremetal-operator  sha256:3c9a9d63e4e6746ced1adf0d47fd49d7afac791b4a19e21001a6d7d5dbac12b7
  baremetal-runtimecfg  sha256:0a851f6be3d3ab94ad12a684d40c7c718065d7292fcfe5cfeb8453fc18c64afb
  cli  sha256:6aa4bb97adf2142b0e74ccae7fd3661ada73cbaac803b86bb8712261e916d66d
  cli-artifacts  sha256:79978a34d1ab3b0ed1ad2c93c09bcb2fcdd1806b35e48a53c99d106347e1a59d
  cloud-credential-operator  sha256:5e591cab19b41c7ea26eab6056cd518f6d64b59e8051978de927b1b984abfb1d
  cluster-authentication-operator  sha256:059f0179c0528c6234dbdca7e70fe779cf37be5121f458dd045d2e9662192f06
  cluster-autoscaler  sha256:51b08f319750810ef474b210dae72b0faba34b02e635deb1bae84a63bec67db4
  cluster-autoscaler-operator  sha256:f18151bf70434e1841ed8182c42e819e92e3d1ad3bbd269c667be8b74ff78444
  cluster-bootstrap  sha256:6da72b403f8db3c810d372ec5baedb95767a627de11ee427b07d62a910532730
  cluster-config-operator  sha256:25197b2709c0691c28424c9b07e505a71d13bf481e18bc42636cc84ee8fef033
  cluster-csi-snapshot-controller-operator  sha256:6ca671c810426b8c4f13dd0c7ac19639f9f265b952b8feb5a828e59fab785335
  cluster-dns-operator  sha256:afdf0a3b426ac1c03df52e88a2b884f0714e54a1a03f33091954441a05a7f6b9
  cluster-etcd-operator  sha256:1ad85195e1a180698fe4b8df82e3d72075efb256b53f593d13e29faaf7f3e15a
  cluster-image-registry-operator  sha256:e2b3f973bc5b9e55d2240a556c4648c921a3c8d3e12381757f1990a864208617
  cluster-ingress-operator  sha256:9dbb31ac799b2c30270268714dcb3d11bafb329b98639a446657c8c7db41938c
  cluster-kube-apiserver-operator  sha256:4691dc29704c9cb06d2345894f1a8f074b58a0d208318c5218241388b0916e1b
  cluster-kube-controller-manager-operator  sha256:4289297f0b7ee7edf394348fd07e1fa1b3162655f2a2af2245e23af4b179e7f2
  cluster-kube-scheduler-operator  sha256:45586fd7a5cfd43ff546dbfb282a70a91eaf0f069f604230af958dc802832f89
  cluster-kube-storage-version-migrator-operator  sha256:ee4abe53e80e561239e510a6f9999b4dc80b7b3fdc9848ab43d0bf8df24e815d
  cluster-machine-approver  sha256:f7b9278ef2fbe988f50e4bdeeea79d9373b55689d17b8c6d7c214429f5b3f9a0
  cluster-monitoring-operator  sha256:3cdb4589ee683c85e3d8f3f239187bc089d30cac6c26847a54894f6c328817c3
  cluster-network-operator  sha256:93b3e1246884e357e1654e6c9578481aff9eef07eed1f9fdd0e9c8cc89a3770c
  cluster-node-tuning-operator  sha256:72323ce541f8a26fbad17ef65ff21b51498863bb851635a0faa8d5b1ac6ce0e4
  cluster-openshift-apiserver-operator  sha256:aa4f37543b45bc248db8d9bd2dc45b6e159a8869b044c2310f541afba15b2694
  cluster-openshift-controller-manager-operator  sha256:c6b3aaaa38679b1d752ec09bd68c6d80a8911c74ec16d27c49de88ecb97823ee
  cluster-policy-controller  sha256:9b564f882e31f497f57a0d99d406d5231eb15e9a97f0b450c21bec2bac7ff033
  cluster-samples-operator  sha256:7f93199dcc01838f017030e0e8dd32d1d23fa268d25472e338e6843c8830d364
  cluster-storage-operator  sha256:01250de496444bb624ec7b472ac9b0f7023809c88306a71c6ac87bb302f7dbe3
  cluster-update-keys  sha256:7a282fd4cbbf0996947c52e6d179ebc257be049f607a743dfde3779f585e82e4
  cluster-version-operator  sha256:acdd9a3992699bd912a3ac91c842a0617405615f7baf55c226641dc384222fed
  configmap-reloader  sha256:88ddfbded8bc27b227ed7397ece050b756e522a9ffc34cbfba3c94c5ee58b740
  console  sha256:f52825e9905c926d399cd0b7afbb2b7d0370ae22da0416feac9131d555db0b98
  console-operator  sha256:a2a167f59783ca402118fe35ea5fefbf457e01b64836f8be3be6695aefd76d76
  container-networking-plugins  sha256:4cfd55719faff41e96c2e4be69e3f2381a57b8b3445b80ae4acfe8ee33d7f99b
  coredns  sha256:54fbeb744b82662fd38c0d301ebaad6ca8983707bc44db7235ead0fb7b95808f
  csi-driver-manila  sha256:9d776971c76510e30de7295c8ff20cbb86969b9fb3bebe8e213953034b35ad7d
  csi-driver-manila-operator  sha256:6e95b360acdec4f041893e893db746dcd784360a83331e5e59e677bc3ecc11c0
  csi-driver-nfs  sha256:a6e073b7f4ade77d854620b8120e0a642d9772a30c81778971b78572885f7482
  csi-external-attacher  sha256:82758fbc97d9da98f20eddcfb4a8bc279726b97da96263d4c165b404389cb553
  csi-external-provisioner  sha256:a96e2e4a62bca22da0b6903c9e20d7c776bd241f13accf51ede88965b232aca8
  csi-external-resizer  sha256:12fe384de71c7621d9061f48afafeed3dc337679a66afd8d0a871e200295a1e5
  csi-external-snapshotter  sha256:0531ff2ccf0ddea76e42cc9951470528bbd7def094884bc569f660376798f40a
  csi-livenessprobe  sha256:1dcc413b621958f97dfbb3fc998a9e225ef155a80ffb151eb4694bf8370b383a
  csi-node-driver-registrar  sha256:6cdaecd5dd9df8fd74529be7fa5d8973daf6f4ea95be8acfb2f5ac97773ebe67
  csi-snapshot-controller  sha256:cfe62d81269929501517e75a7d337f7d8fc78ac9a17665adebfef52a2024584d
  deployer  sha256:f42509c18cf5e41201d64cf3a9c1994ffa5318f8d7cee5de45fa2da914e68bbc
  docker-builder  sha256:2986a09ed686a571312bcb20d648baac46b422efa072f8b68eb41c7996e94610
  docker-registry  sha256:f86db3170270fc635dff0d7f1ba6e79a8f45de7e1dcfa5621474d1f6e07352ec
  etcd  sha256:a8214df42df962b965e3f4daad0b61932235e57241160861e503d84e38b775d5
  gcp-machine-controllers  sha256:af009062907bdf0c0ed59e40515e3576f9216b79fb2fe80e154d528d928d040f
  grafana  sha256:8248dec0d94b2928aa4d63a22973d9a8f8f173a1431b2ab4ad15fdfe80283d7c
  haproxy-router  sha256:52424fd2af6fd7d7a5a1233032eb3f3c67f7691996b209e013e29f1524c5188c
  hyperkube  sha256:30094a2d586aa282d85e14f1be19abec1c30ce431673377b0e1c12d83e6bac8c
  insights-operator  sha256:11c3a8bdddbbb2229bd68bb80b6009588873118881952c702dfebd1484046191
  installer  sha256:3f206c2ca0472d318ed03d164c7c1502796974da881136060677154bc5432415
  installer-artifacts  sha256:8a3d96b12f0140658d06f793abe7c5279c4340c46938870bce652cbb12cbe19a
  ironic  sha256:a3b7c580f02c5f0ffaab21ef32bf4796084073c9b28391635685ea0f69248af9
  ironic-hardware-inventory-recorder  sha256:9193b931ea4736d5d26bedb90aef5f1ae2bbcd01539ea04a53485f2e3a160ab0
  ironic-inspector  sha256:e28947f75883920ce7fb8e171e0350dc7156b1feae094cffff8e33b4cb807854
  ironic-ipa-downloader  sha256:4d20d0f0d9f0bb521e5ed5355cd7f9d710819935b0c0bdbec9e029983f11ece6
  ironic-machine-os-downloader  sha256:8bd4dc5aa2bba650fef27e39e40ab29ab11ae13ad45f02af14ad9f6111d86e33
  ironic-static-ip-manager  sha256:550a230293580919793a404afe9de6bf664c335e5ea9b6211b07bccf0f80efc7
  jenkins  sha256:0f28b84725a89504ffd3695ad4c2928e4235643b6135021a9b147f039ef943f3
  jenkins-agent-maven  sha256:8839c47594ea0955ebd791275f8d2910597dbb8921edbf778d65d7c878671f44
  jenkins-agent-nodejs  sha256:fac9f86793704ca6415f4f60434355d313e629d7b5cd214211cce4fe717e3afd
  k8s-prometheus-adapter  sha256:a37a568c63563257309cb0ffb6e185d98f662ff3201d2099cbe0df404b93f0c8
  keepalived-ipfailover  sha256:827c33240e2e92824087c56a3366ddc19e6ea146cee13ec2d9ed9c64c01f6c4d
  kube-proxy  sha256:61c7b95baabf9cf4303632a73d1533ebffdb9118ebfd4946dcf101a0ae117da3
  kube-rbac-proxy  sha256:c75977f28becdf4f7065cfa37233464dd31208b1767e620c4f19658f53f8ff8c
  kube-state-metrics  sha256:9e7f0468850aeb13585ef049f687cc42c05d82bc0e0200607d1a93d7f9740fe5
  kube-storage-version-migrator  sha256:a1aaf99f2ed745c5353d9fc715fa8e9111f42165e3012fad73640c438ba6aa6f
  kuryr-cni  sha256:cacfe6aad8d5284f7643aab02e62862c90ffac123bdedc8a958edf3d754779a8
  kuryr-controller  sha256:f1adee60cabec68dd24cbdc16ad4b4ed2d51f1daa203255d4e099b77a3ca4f84
  libvirt-machine-controllers  sha256:64aefcedad8ed52ed54b51af16342ee83a2384c85e5b70a75f0f301e41e05740
  local-storage-static-provisioner  sha256:d0383d0a12c1e466fd174b88794d7c711d7f83825ec0a865b6e7cdf7b996e2ee
  machine-api-operator  sha256:c51d44f380ecff7d36b1de4bb3bdbd4ac66abc6669724f28d81bd4af5741a8ac
  machine-config-operator  sha256:8923050603588c27d79b33b371afb651288470d5cdeb14f8e10249bca1a1c461
  machine-os-content  sha256:22f9d04db364c19da1fc219513752484c04bbe1340bc81475bf861aabfb42c8b
  mdns-publisher  sha256:6792231e4d68c0ecb99fb6a6b84ac440bdb7b39a6ac2e6301e2ef1e7a42bf49b
  multus-admission-controller  sha256:92feaeb8763ece68147b522bfa8914bcd429e9825185b9b9c05247ad2857d03f
  multus-cni  sha256:8f4882cff3c2f9521215eac681c5abda42876e3e955431c1387fb457940b8344
  multus-route-override-cni  sha256:71051bdf1b96c953fc1dfd48359915bf5c027613de6f5e2fa8adeea8d3dda311
  multus-whereabouts-ipam-cni  sha256:6bada08687c20afe316c1f8cf76f001f31bedae317f256f3df3affaa5d0dc25e
  must-gather  sha256:fa63640328598f72567027e9cd0d50f00d4ec058dacc61f3be3c6cca7fbefac5
  network-metrics-daemon  sha256:cf565b2fab365e027962a25a8cffb41aa35cb5a00d001e081d53c7fed5a0c54b
  oauth-apiserver  sha256:65206861218064576dc092040e9c24b0393b8a07502e351f513400f187f38cc7
  oauth-proxy  sha256:12b11e2000b42ce1aaa228d9c1f4c9177395add2fa43835e667b7fc9007e40e6
  oauth-server  sha256:076b280e17c6bb4cc618db71403ccec75f8196c8849061a40c680a2808292bb6
  openshift-apiserver  sha256:7584014b0cb8cb2c5a09b909c79f2f8ad6e49854bcfabf72e96a22330bcf6f56
  openshift-controller-manager  sha256:dc6a6a1d4a6b2af67421561e53d1af1d40c99ae72de69f4c3cc390d447f12269
  openshift-state-metrics  sha256:78d3478e632c761c18e2dcb55d26e388ecfd126d4fde60317868133dc2fd57f7
  openstack-machine-controllers  sha256:56d7f816f3ebac92afbcedc1f70cdf7ce870c199f22c18ec7f0a389b595afb51
  operator-lifecycle-manager  sha256:ae6a5decd040a6b3adfa074d3211ab92a36b77b2d849962d9a678e1c2c5ef5c1
  operator-marketplace  sha256:01626d98c80e44e0cd3a522ad019eb236e39c30b0dfff0ac5a6fa98686159286
  operator-registry  sha256:cf4f2d5c38d111332a5b5c34bb849af1dbb9454a7fdaeb948eebcaeaf54e750a
  ovirt-csi-driver  sha256:478667370a183265595fabb478757e7995b01c133b5cd5a36c911ae81c4d86d4
  ovirt-csi-driver-operator  sha256:33b1f9e501382666998662e4e6aed8abfdb49a89636bf9624a8aecd66814a321
  ovirt-machine-controllers  sha256:f8c96ccd6afc5d922717d156e352411965e09b46c8bfb30183dd73cfd52c5ffd
  ovn-kubernetes  sha256:74f3b168a1e08a9871d3ed27fe1ed2c75c8ce7d08284eb096d11787616519112
  pod  sha256:8bd90fcca7990c0edead15298dcec963968274d299428da95eae41aa23157b90
  prom-label-proxy  sha256:b635694dc0663a1404d43a4a9ac8513a3087c7cfead50f6ab413f3c217c40b2a
  prometheus  sha256:3d0361f380abf5252b1b640e3ceaaab8274e2af8cdb605b20b513a1a44b3a4dc
  prometheus-alertmanager  sha256:e5bcf6d786fd218e1ef188eb40c39c31a98d03121fba3b7a1f16e87e45a7478b
  prometheus-config-reloader  sha256:8c9e61400619c4613db5cc73097d287e3cd5d2125c85d1d84cc30cfdaa1093e7
  prometheus-node-exporter  sha256:73fdcef5de85e739831c5a8b76dce349a3c8832ff416a46263743d7e61655cbb
  prometheus-operator  sha256:26f6c930942ee4dea7c1e22d220bba11561c37bdc47101c4490ce0ef77c9203a
  sdn  sha256:82d1def7312de8ae5dee32d237ad59fe685923e78668fa3547e6bee445cd8842
  service-ca-operator  sha256:357e35286fd26fed015c03a9c451f6fdcf61cf0821d959025e7f800e7c533f29
  telemeter  sha256:0c42cd3a74176732e8c06105c47674c7d410c7167c8e3fbd80f9a76e9bfda5bd
  tests  sha256:712f5587b13f4073a0a7453d3a641de37fee98d9c64c3f4137668a8437455655
  thanos  sha256:f644e4b495071a2b9d0d6f5d48cb96dad9f7ea8298cc22c98824bc70229ea9dd
  tools  sha256:c3cf658772cc3c17b69f8f3bf307bbcd77ab1e6e105052bebb6a3d0954edf4f4
Bootstrap-in-place effort for the CVO is being tracked in https://issues.redhat.com/browse/OTA-329.
Aha. Back to comment 0's log bundle:

$ grep 'Running sync.*generation\|(51. of 618)\|Result of work' bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log
I1207 16:49:27.954976 1 sync_worker.go:494] Running sync 4.6.1 (force=false) on generation 1 in state Initializing at attempt 0
I1207 16:49:29.336838 1 task_graph.go:555] Result of work: []
I1207 16:49:40.065578 1 sync_worker.go:701] Running sync for namespace "openshift-vsphere-infra" (510 of 618)
I1207 16:49:42.640428 1 sync_worker.go:713] Done syncing for namespace "openshift-vsphere-infra" (510 of 618)
I1207 16:49:42.640451 1 sync_worker.go:701] Running sync for service "openshift-machine-config-operator/machine-config-daemon" (511 of 618)
I1207 16:49:45.093494 1 sync_worker.go:713] Done syncing for service "openshift-machine-config-operator/machine-config-daemon" (511 of 618)
I1207 16:49:45.093518 1 sync_worker.go:701] Running sync for customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618)
E1207 16:49:45.828734 1 task.go:81] error running apply for customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618): context canceled
I1207 16:49:48.220635 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 618) Could not update console "cluster" (20 of 618) Could not update namespace "openshift-config-managed" (46 of 618) Could not update secret "openshift-etcd-operator/etcd-client" (64 of 618) Could not update namespace "openshift-kube-apiserver-operator" (78 of 618) Could not update serviceaccount "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (113 of 618) Could not update serviceaccount "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator" (122 of 618) Could not update config "cluster" (126 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/openshift-machine-api-openstack" (137 of 618) Could not update serviceaccount "openshift-apiserver-operator/openshift-apiserver-operator" (179 of 618) Could not update kubestorageversionmigrator "cluster" (190 of 618) Could not update rolebinding "openshift-cloud-credential-operator/cloud-credential-operator" (201 of 618) Could not update configmap "openshift-authentication-operator/trusted-ca-bundle" (216 of 618) Could not update serviceaccount "openshift-machine-api/cluster-autoscaler-operator" (228 of 618) Could not update configmap "openshift-cluster-storage-operator/csi-snapshot-controller-operator-config" (246 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/openshift-image-registry-gcs" (265 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/openshift-ingress-azure" (283 of 618) Could not update role "openshift-config-managed/machine-approver" (296 of 618): resource may have been deleted Could not update namespace "openshift-monitoring" (311 of 618) Could not update clusterrole "cluster-node-tuning:tuned" (326 of 618) Could not update clusterrolebinding "system:openshift:operator:openshift-controller-manager-operator" (336 of 618) Could not update prometheusrule "openshift-cluster-samples-operator/samples-operator-alerts" (341 of 618) Could not update credentialsrequest "openshift-cloud-credential-operator/ovirt-csi-driver-operator" (371 of 618) Could not update oauthclient "console" (380 of 618): the server does not recognize this resource, check extension API servers Could not update clusterrolebinding "insights-operator-gather-reader" (426 of 618) Could not update customresourcedefinition "subscriptions.operators.coreos.com" (448 of 618) Could not update configmap "openshift-marketplace/marketplace-trusted-ca" (471 of 618) Could not update serviceca "cluster" (483 of 618) Cluster operator network is still updating Could not update service "openshift-dns-operator/metrics" (502 of 618) Could not update customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618) Could not update servicemonitor "openshift-cloud-credential-operator/cloud-credential-operator" (525 of 618) Could not update role "openshift-authentication/prometheus-k8s" (528 of 618) Could not update servicemonitor "openshift-image-registry/image-registry" (537 of 618) Could not update servicemonitor "openshift-cluster-machine-approver/cluster-machine-approver" (541 of 618): the server does not recognize this resource, check extension API servers Could not update operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (543 of 618) Could not update configmap "openshift-config-managed/release-verification" (555 of 618) Could not update role "openshift-console-operator/prometheus-k8s" (556 of 618) Could not update servicemonitor "openshift-dns-operator/dns-operator" (561 of 618) Could not update servicemonitor "openshift-etcd-operator/etcd-operator" (565 of 618) Could not update servicemonitor "openshift-ingress-operator/ingress-operator" (568 of 618) Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (572 of 618) Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (580 of 618) Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (587 of 618) Could not update servicemonitor "openshift-machine-api/machine-api-operator" (592 of 618) Could not update servicemonitor "openshift-machine-config-operator/machine-config-daemon" (595 of 618) Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (599 of 618) Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (604 of 618) Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (612 of 618) Could not update servicemonitor "openshift-service-ca-operator/service-ca-operator" (618 of 618)]
I1207 16:49:48.220816 1 sync_worker.go:869] Update error 512 of 618: UpdatePayloadFailed Could not update customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618) (*errors.errorString: context canceled) * Could not update customresourcedefinition "containerruntimeconfigs.machineconfiguration.openshift.io" (512 of 618)
I1207 16:49:48.222197 1 sync_worker.go:494] Running sync 4.6.1 (force=false) on generation 2 in state Updating at attempt 0
I1207 16:49:48.347533 1 task_graph.go:555] Result of work: []
I1207 16:55:30.136417 1 task_graph.go:555] Result of work: [Cluster operator etcd is still updating]
I1207 16:56:20.704109 1 sync_worker.go:494] Running sync 4.6.1 (force=false) on generation 2 in state Updating at attempt 1
I1207 16:56:20.761444 1 task_graph.go:555] Result of work: []
I1207 17:02:02.618796 1 task_graph.go:555] Result of work: [Cluster operator etcd is still updating]
...

So the CVO timed out on 512 of 618, before getting to the MachineConfig CRD at 514 of 618, and then buggily transitioned from Initializing to Updating. Updates have ordered, blocking reconciliation, while the install stage is a free-for-all [1]. So the fact that etcd is not level blocks further attempts at installing the MachineConfig CRD, and the install eventually hangs. The installer is waiting for ClusterVersion to go Available=True [2]. Not clear to me what the CVO was claiming in this particular install, because the log-bundle failed to capture ClusterVersion:

$ wc resources/clusterversion.json
0 0 0 resources/clusterversion.json

But we need to double-check the transition out of "Initializing" and fix whatever we're missing here.
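The ordered-blocking versus free-for-all distinction described above can be sketched as a toy model (invented names, not the real task-graph code): in Initializing mode a failed manifest is skipped and later manifests still get a chance, while in Updating mode the first stuck manifest blocks everything after it.

```go
package main

import "fmt"

// mode models the CVO sync state discussed above.
type mode int

const (
	initializing mode = iota // install: keep going past failures
	updating                 // update: ordered, blocking reconciliation
)

// apply walks the manifest list; ok[i] says whether manifest i can be
// applied right now. It returns the indices that were applied.
func apply(m mode, ok []bool) []int {
	var applied []int
	for i, canApply := range ok {
		if canApply {
			applied = append(applied, i)
		} else if m == updating {
			// Ordered, blocking reconciliation: give up here.
			break
		}
		// Initializing: skip the failure and keep going.
	}
	return applied
}

func main() {
	// Manifest 1 (think "cluster operator etcd is still updating") is stuck.
	ok := []bool{true, false, true, true}
	fmt.Println("initializing:", apply(initializing, ok)) // [0 2 3]
	fmt.Println("updating:    ", apply(updating, ok))     // [0]
}
```

In this model, a CVO that buggily flips from initializing to updating before manifest 514 is applied never gets past a stuck manifest 512, which is exactly the hang in the logs.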
Checking on the reporting version:

$ head -n1 bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log
I1207 16:48:48.401667 1 start.go:21] ClusterVersionOperator 4.6.0-202010100331.p0-c35a4e1

That's pretty close to the 4.6 tip, with the only subsequent change being unrelated:

$ git --no-pager log --oneline c35a4e1..origin/release-4.6
39a42566 (origin/release-4.6) Merge pull request #480 from openshift-cherrypick-robot/cherry-pick-477-to-release-4.6
42d3c6d7 (origin/pr/480) pkg/cvo/metrics: Abandon child goroutines after shutdownContext expires

[1]: https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/docs/user/reconciliation.md#manifest-graph
[2]: https://github.com/openshift/installer/blob/e235783fe671864e0628771d0e5d13904e4d062a/cmd/openshift-install/create.go#L388-L391
I am not entirely clear on why nobody has been bitten by this before, but the fact that this was reported against a released 4.6.z means it's not a new-in-4.7 regression that should block 4.7 GA, or a new-in-4.6.z regression that should block future 4.6.z releases (c35a4e1 went live with 4.6 GA [1]).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1886947#c9
Created attachment 1739834 [details] transition portion of the CVO logs from comment 0's log bundle, for convenience
Bug may be in [1], which lacks the hasNeverReachedLevel guard we have over in [2]. Or maybe equalSyncWork has a bug, because the transition logs include "Work updated" [3]. And before the bit I copied out for the transition logs, we had [4]:

I1207 16:49:27.954369 1 sync_worker.go:249] Notify the sync worker that new work is available

Looking for the "no change" I expect [5]:

$ grep 'Update work is equal to current target' bootstrap/containers/cluster-version-operator-013ac078178f5421c0f6b986541667d48de14ead8ebe8ef4fdbfca62a2e3e340.log | head -n1
I1207 16:49:27.961015 1 sync_worker.go:222] Update work is equal to current target; no change required

So just after the "nominally new work" round. Maybe whatever we put in the initial state needs some tweaking, so equalSyncWork does not trip up over the difference between our initial seed and the first round in from the ClusterVersion object. Or something...

[1]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L402-L408
[2]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/cvo.go#L455-L464
[3]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L280
[4]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L249
[5]: https://github.com/openshift/cluster-version-operator/blob/c35a4e1e37fdf3559d208cae97cf5c9ded9064a7/pkg/cvo/sync_worker.go#L222
(In reply to W. Trevor King from comment #17)
> Or maybe equalSyncWork has a bug, because the transition logs include
> "Work updated" [...]

Right. Looking at equalSyncWork [1], logs indicate Desired hasn't changed. Can Overrides be changed during an install?

[1] https://github.com/openshift/cluster-version-operator/blob/1e51a0e4750ca110d4659f33bce210a3de6844b9/pkg/cvo/sync_worker.go#L471
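To make the Overrides suspicion concrete, here is a simplified stand-in for the CVO's work comparison (invented types and function name, modeled loosely on SyncWork/equalSyncWork in pkg/cvo/sync_worker.go; not the real code): if Overrides participate in the equality check, adding a spec.overrides entry mid-install counts as "new work" even though Desired never changed.

```go
package main

import (
	"fmt"
	"reflect"
)

// override is a trimmed model of a ClusterVersion spec.overrides entry.
type override struct {
	Kind, Namespace, Name string
	Unmanaged             bool
}

// syncWork is a trimmed model of the worker's desired state.
type syncWork struct {
	Desired   string // release image pullspec
	Overrides []override
}

// equalSyncWorkModel returns true when nothing the worker cares about has
// changed. Because Overrides participate in the comparison, an overrides
// change alone is enough to trigger a "Work updated" notification.
func equalSyncWorkModel(a, b syncWork) bool {
	return a.Desired == b.Desired && reflect.DeepEqual(a.Overrides, b.Overrides)
}

func main() {
	before := syncWork{Desired: "quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64"}
	after := before
	after.Overrides = []override{{
		Kind: "DaemonSet", Namespace: "openshift-cluster-network-operator",
		Name: "cluster-network-operator", Unmanaged: true,
	}}
	fmt.Println(equalSyncWorkModel(before, before)) // true: no change required
	fmt.Println(equalSyncWorkModel(before, after))  // false: Desired unchanged, but Overrides differ
}
```

This matches the observed logs: Desired stayed the same, yet the worker saw "Work updated" and (buggily) left the Initializing state.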
We believe that this is a side effect of a hack currently being attempted to enable SNO. Jack will provide more details on what we plan to do to avoid the problems associated with this, but we feel this is unlikely to be a problem during normal installations. Therefore we're lowering severity and priority to medium.
A Slack discussion with Eran confirmed that some components are set as unmanaged (a change to Overrides, see https://bugzilla.redhat.com/show_bug.cgi?id=1905221#c18) as a temporary hack during single-node installation. The hack is there to get the single node installed and will be removed once Cluster High-availability Mode is supported. He will try applying the hack after cluster-bootstrap has finished, to verify this avoids the CVO issue. Although changing manifests during initial install is not currently supported, it should not break the CVO. Minimally, the CVO should be modified to not transition from Initializing to Updating upon such a change.
We moved the patch service to after the pivot phase and saw that the CVO does not have to be restarted, which confirmed that the patching is what causes the CVO to get stuck. By the way, after we moved the patches we saw that, many times after we pivot (the bootstrap phase is really quick), the CVO is not starting because it is missing secrets. Still not sure why; any ideas?
Sometimes we are missing the cluster-version-operator-serving-cert secret. I think we have a race here; maybe it would be better to wait for it?
@Eran or @Jack, do you know how to verify this bug from a QE perspective?
Install a cluster with the fix and, while it is installing (probably after bootstrap completes, just for access to logs), tail the CVO pod's log and then insert an entry into ClusterVersion's spec.overrides. Before the patch, that would transition the CVO into update mode (look for log lines like "Running sync ... in state"). With the patch, the CVO should stay in the Initializing state despite the overrides change.
I applied the patch during cluster-bootstrap, once a control plane had started on the bootstrap node.
Verified this bug with 4.7.0-0.nightly-2021-01-22-063949, and it passes.

Trigger an install; once the bootstrap instance boots up (before bootkube completes), ssh into it and run the following commands:

[root@ip-10-0-7-22 ~]# vi /tmp/version-patch-first-override.yaml
[root@ip-10-0-7-22 ~]# cat /tmp/version-patch-first-override.yaml
- op: add
  path: /spec/overrides
  value:
  - kind: DaemonSet
    group: apps/v1
    name: cluster-network-operator
    namespace: openshift-cluster-network-operator
    unmanaged: true
[root@ip-10-0-7-22 ~]# cd /etc/kubernetes/
[root@ip-10-0-7-22 kubernetes]# oc patch clusterversion version --type json -p "$(cat /tmp/version-patch-first-override.yaml)"
clusterversion.config.openshift.io/version patched
[root@ip-10-0-7-22 kubernetes]# oc get all -n openshift-cluster-version
NAME                                           READY   STATUS    RESTARTS   AGE
pod/cluster-version-operator-7d5c5dfd8-j5mbh   0/1     Pending   0          7m14s
[root@ip-10-0-7-22 kubernetes]# crictl ps -a |grep version
774777ae3b19f  registry.ci.openshift.org/ocp/release@sha256:7e62d6eced986e77be37153e8e38069163e0457e29859664ff6d016dbdca0b3b  8 minutes ago  Running  cluster-version-operator  0  ce4c60bed90b7

Note: check the log of the CVO static pod running on the bootstrap instance, not the pod under the openshift-cluster-version namespace via the oc command.

[root@ip-10-0-7-22 kubernetes]# crictl logs 774777ae3b19f 2>&1 |grep "Running sync .* in state"
I0122 11:35:08.677588 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 1
I0122 11:38:24.963065 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 2
I0122 11:42:00.183391 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-01-22-063949 (force=false) on generation 1 in state Initializing at attempt 3

On an old build, I can reproduce it like this:

[root@ip-10-0-16-5 kubernetes]# crictl logs 9f4b34e61178e 2>&1 |grep "Running sync .* in state"
I0122 11:14:56.528110 1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 1 in state Initializing at attempt 0
I0122 11:18:11.963633 1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 1 in state Initializing at attempt 1
I0122 11:19:31.456647 1 sync_worker.go:521] Running sync 4.7.0-0.nightly-2021-01-17-065043 (force=false) on generation 2 in state Updating at attempt 0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633