Seen in 4.7.14 -> 4.8.0-0.nightly-2021-06-04-031425 CI [1], the update timed out:

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]  2h31m44s
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-9ly9p0g4/release@sha256:730a1b5b176a89f5ff10a00fd6f6f2dadb92d1e641a09da93fac8cf8d3c72841
Unexpected error:
    <*errors.errorString | 0xc001c6b730>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator machine-config is not available",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator machine-config is not available
occurred

Here's the machine-config ClusterOperator that the CVO blocked on:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "machine-config").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
2021-06-04T05:21:01Z Progressing=True : Working towards 4.8.0-0.nightly-2021-06-04-031425
2021-06-04T05:25:28Z Upgradeable=False PoolUpdating: One or more machine config pools are updating, please see `oc get mcp` for further details
2021-06-04T05:35:27Z Degraded=True RequiredPoolsFailed: Unable to apply 4.8.0-0.nightly-2021-06-04-031425: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-1697134e9eab0c69edf22e218f5164df expected 765dc0d017ca01bffe64933585417d31a9d154d1 has 3c1fc49624d0a9edbbd4ac20223afbdbd4b5ccf4: 3 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-28faa60ab47c4858fd8d3e83efdd3a81, retrying
2021-06-04T05:18:54Z Available=False : Cluster not available for 4.8.0-0.nightly-2021-06-04-031425

And the pool summary from the ClusterOperator:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "machine-config").status.extension'
{
  "master": "3 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-28faa60ab47c4858fd8d3e83efdd3a81",
  "worker": "all 2 nodes are at latest configuration rendered-worker-bc792c195cc1189ee960837829799f56"
}

But all the nodes seem to be both Ready=True and up to date with their desired MachineConfig:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[] | .metadata as $m | .status.conditions[] | select(.type == "Ready") | .status + " " + .lastTransitionTime + " " + $m.name + " " + $m.annotations["machineconfiguration.openshift.io/currentConfig"] + " -> " + $m.annotations["machineconfiguration.openshift.io/desiredConfig"]'
True 2021-06-04T05:30:10Z master-0 rendered-master-28faa60ab47c4858fd8d3e83efdd3a81 -> rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
True 2021-06-04T05:35:59Z master-1 rendered-master-28faa60ab47c4858fd8d3e83efdd3a81 -> rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
True 2021-06-04T05:44:14Z master-2 rendered-master-28faa60ab47c4858fd8d3e83efdd3a81 -> rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
True 2021-06-04T05:34:59Z worker-0 rendered-worker-bc792c195cc1189ee960837829799f56 -> rendered-worker-bc792c195cc1189ee960837829799f56
True 2021-06-04T05:29:49Z worker-1 rendered-worker-bc792c195cc1189ee960837829799f56 -> rendered-worker-bc792c195cc1189ee960837829799f56

And the pool itself doesn't say much about why it considers itself updating:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-06-04T04:04:14Z RenderDegraded=False :
2021-06-04T04:04:49Z Degraded=False :
2021-06-04T04:04:49Z NodeDegraded=False :
2021-06-04T05:25:08Z Updated=False :
2021-06-04T05:25:08Z Updating=True : All nodes are updating to rendered-master-28faa60ab47c4858fd8d3e83efdd3a81

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784
The MCP is still considered updating (and thus unfinished) because one of the masters (master-0) never became "Ready" by the MCO's definition. In this case, it has DiskPressure:

    "lastHeartbeatTime": "2021-06-04T07:08:20Z",
    "lastTransitionTime": "2021-06-04T05:39:31Z",
    "message": "kubelet has disk pressure",
    "reason": "KubeletHasDiskPressure",
    "status": "True",
    "type": "DiskPressure"

From: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json

So the pool/operator status is a bit confusing, since we don't explicitly report why the pool is still considered progressing; one thing we could improve is bubbling that status up. It does get logged by the controller (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-54d599bf48-rskx5_machine-config-controller.log), but only as a not-ready reason.

So the root consideration here is why the node still has disk pressure, and whether that's expected (and requires larger disks). Maybe something for the metal team to look at?
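For reference, the same gathered nodes.json can be filtered for the DiskPressure condition directly; this is just a sketch following the jq pattern above (on a live cluster, oc describe node master-0 would show the same condition):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | select(.type == "DiskPressure" and .status == "True") | $n + " " + .lastTransitionTime + " " + .reason + ": " + .message'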
[1] is up to track getting the MCO to say things like "we are concerned about DiskPressure on master-0", and similarly in its ClusterOperator.

[1]: https://issues.redhat.com/browse/GRPA-3838
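Until that lands, the not-ready reason has to be fished out of the controller log by hand; roughly something like the following (the exact log wording is an assumption, so the grep pattern may need adjusting):

$ oc -n openshift-machine-config-operator logs deployment/machine-config-controller -c machine-config-controller | grep -iE 'not ?ready|DiskPressure'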
We can increase the disk size if needed, but before doing so it'd be helpful to answer the following questions, I think:

* Does this KubeletHasDiskPressure only happen in the e2e-metal-ipi-upgrade job?
* Is this a new issue (specific to upgrading to the latest version)? If so, can we identify when it started failing?
* Can we identify what's using the disk space? (Not sure this will be possible without a local reproduce.)
* What root disk size is used for other platforms, and has there been a recent increase in requirements elsewhere?

If we increase the disk size by default, it'll impact all metal-ipi jobs (and all developers using the same openshift-metal3/dev-scripts), so I'd like to confirm whether the issue is specific to upgrade, and also whether it may indicate some recent regression impacting disk usage. (A sketch of bumping the sizes in a dev-scripts config follows below.)
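For anyone testing a bigger disk locally, a minimal sketch of what that could look like in a dev-scripts config. WORKER_DISK (and its default of 20) is mentioned later in this bug; MASTER_DISK and the chosen sizes here are assumptions for illustration only (values in GB):

# in config_$USER.sh for openshift-metal3/dev-scripts (sketch; names beyond WORKER_DISK are assumptions)
export MASTER_DISK=40   # assumed master-side equivalent of WORKER_DISK
export WORKER_DISK=30   # default has been 20, which fills up under the full parallel suite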
Spread across jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=KubeletHasDiskPressure' | grep 'failures match' | sort
periodic-ci-openshift-ovn-kubernetes-release-4.9-e2e-metal-ipi-ovn-dualstack-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 25 runs, 92% failed, 9% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 67% of failures match = 67% impact
pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-ironic-image-master-e2e-metal-ipi (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-prevalidation (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation (all) - 6 runs, 67% failed, 100% of failures match = 67% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-metal-ipi-ovn-dualstack (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi (all) - 11 runs, 64% failed, 14% of failures match = 9% impact
pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi (all) - 8 runs, 88% failed, 29% of failures match = 25% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
rehearse-18988-pull-ci-openshift-ironic-image-release-4.8-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-19047-periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19070-pull-ci-openshift-coredns-release-4.8-e2e-metal-ipi-dualstack (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

And at least for some jobs it goes back about as far as CI-search's 14-day history:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=build-log&search=KubeletHasDiskPressure' | grep -o '[0-9]* days ago' | sort -n | tail -n1
13 days ago

I'm not sure what's using the space; you might need to poke around on a cluster-bot or other test cluster. One reason updates might take more storage is that they download a bunch of images with the new content, and I'm not sure how fast the kubelet/CRI-O prune the local image cache. I'm not sure what size volumes other platforms use, but for cloud platforms we often use ridiculously oversized storage, because the cloud providers tie things like I/O throttle tokens to storage size, and we need the tokens to keep etcd latency in line.
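For poking around on a cluster-bot or test cluster, a rough sketch of the kind of node-level check that answers "what's using the space" (the paths are the usual RHCOS locations for container storage and kubelet data; crictl imagefsinfo reports the image-filesystem usage as CRI-O sees it):

$ oc debug node/worker-0 -- chroot /host /bin/sh -c '
    df -h /sysroot                                             # overall root filesystem usage
    crictl imagefsinfo                                         # image-store usage as CRI-O reports it
    du -sh /var/lib/containers/storage /var/lib/kubelet/pods   # the two usual big consumers
  '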
After investigating and reproducing the above w3m -dump results, in addition to the upgrade problem (which is related to the masters' disk size), there is another disk-pressure problem for the jobs running the full conformance/parallel suite (which is related to the workers' disk size). When we first set WORKER_DISK to the default of 20, we were not running all ~1000 parallel suite tests. Now openshift-tests runs ~1000 tests on 2 workers, and the storage becomes insufficient at some point.

These are the metrics after running all of conformance/parallel and waiting for it to complete:

$ oc describe node worker-0
Events:
  Type     Reason                 Age                   From     Message
  ----     ------                 ----                  ----     -------
  Warning  EvictionThresholdMet   33m                   kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    33m                   kubelet  Node worker-0 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  28m (x24 over 3h33m)  kubelet  Node worker-0 status is now: NodeHasNoDiskPressure

$ oc describe node worker-1
Events:
  Type     Reason                 Age    From     Message
  ----     ------                 ----   ----     -------
  Warning  EvictionThresholdMet   14m    kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    14m    kubelet  Node worker-1 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  9m27s  kubelet  Node worker-1 status is now: NodeHasNoDiskPressure

[core@worker-0 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G  168K  3.9G   1% /dev/shm
tmpfs           3.9G   59M  3.9G   2% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda4        20G   14G  6.3G  68% /sysroot
tmpfs           3.9G   12K  3.9G   1% /tmp
/dev/sda3       364M  196M  146M  58% /boot
overlay         3.9G   59M  3.9G   2% /etc/NetworkManager/systemConnectionsMerged
tmpfs           795M  8.0K  795M   1% /run/user/1000

[core@worker-0 sysroot]$ sudo du -h --max-depth=2
0       ./boot
3.2G    ./ostree/repo
9.8G    ./ostree/deploy
0       ./ostree/boot.0.1
13G     ./ostree
13G     .

[core@worker-0 lib]$ sudo du -h --max-depth=2
0       ./kubelet/device-plugins
0       ./kubelet/pod-resources
1.4G    ./kubelet/pods
16K     ./kubelet/plugins
4.0K    ./kubelet/plugins_registry
8.0K    ./kubelet/pki
1.4G    ./kubelet
132K    ./systemd/catalog
0       ./systemd/coredump
0       ./systemd/timers
136K    ./systemd
76K     ./cni/networks
190M    ./cni/bin
144K    ./cni/multus
72K     ./cni/results
190M    ./cni
0       ./dbus
24K     ./NetworkManager
0       ./authselect
0       ./alternatives
4.0K    ./chrony
0       ./containers/sigstore
8.0G    ./containers/storage
304K    ./containers/cache
8.0G    ./containers
0       ./dhclient
0       ./dnsmasq
0       ./initramfs
0       ./games
0       ./misc
0       ./fwupd
0       ./gssproxy/clients
0       ./gssproxy/rcache
0       ./gssproxy
0       ./iscsi/ifaces
0       ./iscsi/isns
0       ./iscsi/nodes
0       ./iscsi/send_targets
0       ./iscsi/slp
0       ./iscsi/static
0       ./iscsi
0       ./selinux/tmp
0       ./selinux
0       ./logrotate
0       ./nfs/rpc_pipefs
4.0K    ./nfs/statd
0       ./nfs/v4recovery
4.0K    ./nfs
0       ./openvswitch/pki
0       ./openvswitch
0       ./os-prober
0       ./polkit-1/localauthority
0       ./polkit-1
0       ./samba/lock
0       ./samba/private
0       ./samba
5.3M    ./sss/db
0       ./sss/deskprofile
0       ./sss/gpo_cache
27M     ./sss/mc
0       ./sss/pipes
0       ./sss/pubconf
0       ./sss/secrets
0       ./sss/keytabs
32M     ./sss
0       ./tpm
0       ./unbound
0       ./portables
0       ./private
48K     ./rpm-ostree/history
48K     ./rpm-ostree
4.0K    ./crio
0       ./csi-hostpath-data
9.5G    .
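With around 8G sitting in /var/lib/containers/storage, one thing worth trying during a reproduce is trimming the image cache by hand and seeing how much of the usage comes back, rather than waiting for the kubelet's eviction/GC to do it. A sketch (the --prune flag is assumed to be available in the node's crictl version; it only removes images not referenced by any container):

[core@worker-0 ~]$ sudo crictl images          # list cached images and their sizes
[core@worker-0 ~]$ sudo crictl rmi --prune     # assumed flag: delete images with no container using them
[core@worker-0 ~]$ df -h /sysroot              # compare usage before/after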
Fresh IPv4 dev-scripts installation, master-0 total disk usage: 15 GB (51%).

[core@master-0 sysroot]$ sudo du -h --max-depth=2
0       ./boot
3.2G    ./ostree/repo
12G     ./ostree/deploy
0       ./ostree/boot.0.1
15G     ./ostree
15G     .

[core@master-0 lib]$ sudo du -h --max-depth=2
0       ./kubelet/device-plugins
0       ./kubelet/pod-resources
1.7M    ./kubelet/pods
0       ./kubelet/plugins
0       ./kubelet/plugins_registry
8.0K    ./kubelet/pki
1.7M    ./kubelet
132K    ./systemd/catalog
0       ./systemd/coredump
0       ./systemd/timers
136K    ./systemd
100K    ./cni/networks
190M    ./cni/bin
188K    ./cni/multus
92K     ./cni/results
190M    ./cni
0       ./dbus
24K     ./NetworkManager
0       ./authselect
0       ./alternatives
4.0K    ./chrony
0       ./containers/sigstore
6.8G    ./containers/storage
136K    ./containers/cache
6.8G    ./containers
0       ./dhclient
0       ./dnsmasq
0       ./initramfs
0       ./games
0       ./misc
0       ./fwupd
0       ./gssproxy/clients
0       ./gssproxy/rcache
0       ./gssproxy
0       ./iscsi/ifaces
0       ./iscsi/isns
0       ./iscsi/nodes
0       ./iscsi/send_targets
0       ./iscsi/slp
0       ./iscsi/static
0       ./iscsi
0       ./selinux/tmp
0       ./selinux
0       ./logrotate
0       ./nfs/rpc_pipefs
4.0K    ./nfs/statd
0       ./nfs/v4recovery
4.0K    ./nfs
0       ./openvswitch/pki
0       ./openvswitch
0       ./os-prober
0       ./polkit-1/localauthority
0       ./polkit-1
0       ./samba/lock
0       ./samba/private
0       ./samba
5.3M    ./sss/db
0       ./sss/deskprofile
0       ./sss/gpo_cache
27M     ./sss/mc
0       ./sss/pipes
0       ./sss/pubconf
0       ./sss/secrets
0       ./sss/keytabs
32M     ./sss
0       ./tpm
0       ./unbound
0       ./portables
0       ./private
48K     ./rpm-ostree/history
48K     ./rpm-ostree
4.0K    ./crio
277M    ./etcd/member
277M    ./etcd
3.8G    ./metal3/images
3.8G    ./metal3
12G     .
Happened again in [1], which has lots of NodeFilesystemSpaceFillingUp alerts under "Check if alerts are firing during or after upgrade success".

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1402636541485912064
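To see how close the nodes are to filling up while that alert is firing, one option is to query the node-exporter filesystem metrics directly; this is only a sketch, assuming the usual prometheus-k8s-0 pod layout, that curl is available in the prometheus container, and that filtering on fstype="xfs" picks out the RHCOS root filesystems:

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
    curl -s -G http://localhost:9090/api/v1/query \
    --data-urlencode 'query=node_filesystem_avail_bytes{fstype="xfs"} / node_filesystem_size_bytes{fstype="xfs"}' | jq .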
Arda managed to reproduce this, and we did some debugging. It seems that the script we use to download the RHCOS image can leak a tmpdir (containing the large image files), and in the reproduce environment this meant excessive disk usage by the metal3 pod (about 13G for metal3/images).

I pushed https://github.com/openshift/ironic-rhcos-downloader/pull/44, which should resolve that issue. We may have other opportunities to reduce disk usage in future (for example, having the second-level DaemonSet cache download the pre-compressed image and clean up the uncompressed one), but I think fixing and backporting the leaky-tmpdir issue is the best first step.
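For anyone wanting to confirm a cluster is hitting this before the fix lands, a rough check for leftover temporary download directories under the image cache; the tmp* name pattern and depth are assumptions about where the leaked directory ends up, so adjust as needed:

$ oc debug node/master-0 -- chroot /host /bin/sh -c '
    du -sh /var/lib/metal3/images                                                      # total image-cache usage
    find /var/lib/metal3/images -maxdepth 2 -type d -name "tmp*" -exec du -sh {} +     # assumed leaked-tmpdir pattern
  '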
There may be further fixes needed, but I moved this to MODIFIED so we can get https://github.com/openshift/ironic-rhcos-downloader/pull/44 verified and enable the backport to 4.7 (ref https://github.com/openshift/ironic-rhcos-downloader/pull/45); we need to ensure we can't leak $TMPDIR in the 4.7 pre-upgrade case too.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438