Bug 1968754
| Field | Value |
|---|---|
| Summary | CI: e2e-metal-ipi-upgrade failing on KubeletHasDiskPressure, which triggers machine-config RequiredPoolsFailed |
| Product | OpenShift Container Platform |
| Component | Installer |
| Installer sub component | OpenShift on Bare Metal IPI |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Version | 4.8 |
| Target Release | 4.8.0 |
| Reporter | W. Trevor King <wking> |
| Assignee | Arda Guclu <aguclu> |
| QA Contact | Raviv Bar-Tal <rbartal> |
| CC | aos-bugs, janders, rbartal, rpittau, shardy |
| Keywords | OtherQA, Triaged |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | No Doc Update |
| : | 1971549 (view as bug list) |
| Last Closed | 2021-07-27 23:11:53 UTC |
| Type | Bug |
| Bug Blocks | 1971549 |
Description
W. Trevor King
2021-06-07 22:06:19 UTC
The MCP is still considered updating (and thus unfinished) because one of the masters (master-0) never became "Ready" by the MCO's definition. In this case, it has DiskPressure:

    "lastHeartbeatTime": "2021-06-04T07:08:20Z",
    "lastTransitionTime": "2021-06-04T05:39:31Z",
    "message": "kubelet has disk pressure",
    "reason": "KubeletHasDiskPressure",
    "status": "True",
    "type": "DiskPressure"

From: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json

So the pool/operator status is a bit confusing, since we don't explicitly report why the pool is still considered progressing; one possible improvement is to bubble up that status. It does get logged in the controller (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-54d599bf48-rskx5_machine-config-controller.log), but only as a not-ready reason.

So the root consideration here is why the node still has disk pressure, and whether that is expected (and requires larger disks). Maybe something for the metal team to look at?

[1] is up to track getting the MCO to say things like "we are concerned about DiskPressure on master-0", and similarly in its ClusterOperator.

[1]: https://issues.redhat.com/browse/GRPA-3838

We can increase the disk size if needed, but before doing so it would be helpful to answer the following questions:

* Does this KubeletHasDiskPressure only happen in the e2e-metal-ipi-upgrade job?
* Is this a new issue (specific to upgrading to the latest version)? If so, can we identify when it started failing?
* Can we identify what's using the disk space?
  (Not sure this will be possible without a local reproduction.)
* What root disk size is used for other platforms? Has there been a recent increase in requirements elsewhere?

If we increase the disk space by default, it will impact all metal-ipi jobs (and all developers using the same openshift-metal3/dev-scripts), so I'd like to confirm whether the issue is specific to upgrades, and also whether it may indicate a recent regression impacting disk usage.

Spread across jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=KubeletHasDiskPressure' | grep 'failures match' | sort
periodic-ci-openshift-ovn-kubernetes-release-4.9-e2e-metal-ipi-ovn-dualstack-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 25 runs, 92% failed, 9% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 67% of failures match = 67% impact
pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-ironic-image-master-e2e-metal-ipi (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-prevalidation (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation (all) - 6 runs, 67% failed, 100% of failures match = 67% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-metal-ipi-ovn-dualstack (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi (all) - 11 runs, 64% failed, 14% of failures match = 9% impact
pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi (all) - 8 runs, 88% failed, 29% of failures match = 25% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
rehearse-18988-pull-ci-openshift-ironic-image-release-4.8-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-19047-periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19070-pull-ci-openshift-coredns-release-4.8-e2e-metal-ipi-dualstack (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

And at least for some jobs it goes back about as far as CI-search's 14-day history:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=build-log&search=KubeletHasDiskPressure' | grep -o '[0-9]* days ago' | sort -n | tail -n1
13 days ago

I'm not sure what's using the space; you might need to poke around on a cluster-bot or other test cluster. One reason updates might take more storage is that they download a bunch of images with the new content, and I'm not sure how quickly the kubelet/CRI-O prune the local image cache. I'm not sure what size volumes other platforms use, but for cloud platforms we often use ridiculously oversized storage, because the cloud providers tie things like I/O throttle tokens to storage size, and we need the tokens to keep etcd latency in line.

After investigating and reproducing the above w3m -dump results: in addition to the upgrade problem (which is related to the masters' disk size), there is another DiskPressure problem for the jobs running the full conformance/parallel suite (which is related to the workers' disk size).
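For reference, the "impact" percentages in the CI-search output above reduce to (matching failures / total runs). A minimal sketch of that arithmetic (the `impact` helper name is hypothetical, not part of the search tooling):

```shell
# Hypothetical helper mirroring the CI-search "impact" arithmetic:
# impact = matching failures / total runs, as a rounded percentage.
impact() {
  runs=$1; matched=$2
  awk -v r="$runs" -v m="$matched" 'BEGIN { printf "%.0f%%\n", m / r * 100 }'
}

# e.g. the 4.7->4.8 metal-ipi-upgrade job: 9 runs, 6 matching failures
impact 9 6   # 67%
```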
When WORKER_DISK was first set to its default of 20, we were not running all ~1000 parallel-suite tests. Now openshift-tests runs ~1000 tests on 2 workers, and storage becomes insufficient at some point. These are the metrics after running all of conformance/parallel and waiting for completion:

oc describe node worker-0
Events:
  Type     Reason                 Age                   From     Message
  ----     ------                 ----                  ----     -------
  Warning  EvictionThresholdMet   33m                   kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    33m                   kubelet  Node worker-0 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  28m (x24 over 3h33m)  kubelet  Node worker-0 status is now: NodeHasNoDiskPressure

oc describe node worker-1
Events:
  Type     Reason                 Age    From     Message
  ----     ------                 ----   ----     -------
  Warning  EvictionThresholdMet   14m    kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    14m    kubelet  Node worker-1 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  9m27s  kubelet  Node worker-1 status is now: NodeHasNoDiskPressure

[core@worker-0 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G  168K  3.9G   1% /dev/shm
tmpfs           3.9G   59M  3.9G   2% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda4        20G   14G  6.3G  68% /sysroot
tmpfs           3.9G   12K  3.9G   1% /tmp
/dev/sda3       364M  196M  146M  58% /boot
overlay         3.9G   59M  3.9G   2% /etc/NetworkManager/systemConnectionsMerged
tmpfs           795M  8.0K  795M   1% /run/user/1000

[core@worker-0 sysroot]$ sudo du -h --max-depth=2
0     ./boot
3.2G  ./ostree/repo
9.8G  ./ostree/deploy
0     ./ostree/boot.0.1
13G   ./ostree
13G   .
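For context on the EvictionThresholdMet events above: kubelet's default hard-eviction threshold for node-level ephemeral storage is nodefs.available < 10%, which is what flips the DiskPressure condition. A minimal sketch of the comparison (the `nodefs_available` helper name is hypothetical; on an RHCOS node you would check /sysroot):

```shell
#!/bin/sh
# Sketch: report a filesystem's available percentage, the figure the
# kubelet eviction manager compares against nodefs.available < 10%
# (the default hard-eviction threshold). Pass the mount point to check.
nodefs_available() {
  df -P "$1" | awk 'NR == 2 { pct = $4 / $2 * 100; printf "%s: %.1f%% available%s\n", $6, pct, (pct < 10 ? " (below eviction threshold)" : "") }'
}

nodefs_available /
```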
[core@worker-0 lib]$ sudo du -h --max-depth=2
0     ./kubelet/device-plugins
0     ./kubelet/pod-resources
1.4G  ./kubelet/pods
16K   ./kubelet/plugins
4.0K  ./kubelet/plugins_registry
8.0K  ./kubelet/pki
1.4G  ./kubelet
132K  ./systemd/catalog
0     ./systemd/coredump
0     ./systemd/timers
136K  ./systemd
76K   ./cni/networks
190M  ./cni/bin
144K  ./cni/multus
72K   ./cni/results
190M  ./cni
0     ./dbus
24K   ./NetworkManager
0     ./authselect
0     ./alternatives
4.0K  ./chrony
0     ./containers/sigstore
8.0G  ./containers/storage
304K  ./containers/cache
8.0G  ./containers
0     ./dhclient
0     ./dnsmasq
0     ./initramfs
0     ./games
0     ./misc
0     ./fwupd
0     ./gssproxy/clients
0     ./gssproxy/rcache
0     ./gssproxy
0     ./iscsi/ifaces
0     ./iscsi/isns
0     ./iscsi/nodes
0     ./iscsi/send_targets
0     ./iscsi/slp
0     ./iscsi/static
0     ./iscsi
0     ./selinux/tmp
0     ./selinux
0     ./logrotate
0     ./nfs/rpc_pipefs
4.0K  ./nfs/statd
0     ./nfs/v4recovery
4.0K  ./nfs
0     ./openvswitch/pki
0     ./openvswitch
0     ./os-prober
0     ./polkit-1/localauthority
0     ./polkit-1
0     ./samba/lock
0     ./samba/private
0     ./samba
5.3M  ./sss/db
0     ./sss/deskprofile
0     ./sss/gpo_cache
27M   ./sss/mc
0     ./sss/pipes
0     ./sss/pubconf
0     ./sss/secrets
0     ./sss/keytabs
32M   ./sss
0     ./tpm
0     ./unbound
0     ./portables
0     ./private
48K   ./rpm-ostree/history
48K   ./rpm-ostree
4.0K  ./crio
0     ./csi-hostpath-data
9.5G  .

Fresh IPv4 dev-scripts installation, master-0 total disk usage: 15 GB (51%):

[core@master-0 sysroot]$ sudo du -h --max-depth=2
0     ./boot
3.2G  ./ostree/repo
12G   ./ostree/deploy
0     ./ostree/boot.0.1
15G   ./ostree
15G   .
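The hand-run `du` triage above can be scripted; a minimal sketch (the `biggest_dirs` helper name is hypothetical; run it as root on a node, e.g. `biggest_dirs /var/lib`):

```shell
#!/bin/sh
# Sketch: list the ten largest second-level directories under a root,
# the same triage done by hand with `du -h --max-depth=2` above.
# Sorting raw KiB counts (-k, not -h) keeps the ordering numeric;
# -x stays on one filesystem, as kubelet's nodefs accounting does.
biggest_dirs() {
  du -xk --max-depth=2 "$1" 2>/dev/null | sort -rn | head -n 10
}
```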
[core@master-0 lib]$ sudo du -h --max-depth=2
0     ./kubelet/device-plugins
0     ./kubelet/pod-resources
1.7M  ./kubelet/pods
0     ./kubelet/plugins
0     ./kubelet/plugins_registry
8.0K  ./kubelet/pki
1.7M  ./kubelet
132K  ./systemd/catalog
0     ./systemd/coredump
0     ./systemd/timers
136K  ./systemd
100K  ./cni/networks
190M  ./cni/bin
188K  ./cni/multus
92K   ./cni/results
190M  ./cni
0     ./dbus
24K   ./NetworkManager
0     ./authselect
0     ./alternatives
4.0K  ./chrony
0     ./containers/sigstore
6.8G  ./containers/storage
136K  ./containers/cache
6.8G  ./containers
0     ./dhclient
0     ./dnsmasq
0     ./initramfs
0     ./games
0     ./misc
0     ./fwupd
0     ./gssproxy/clients
0     ./gssproxy/rcache
0     ./gssproxy
0     ./iscsi/ifaces
0     ./iscsi/isns
0     ./iscsi/nodes
0     ./iscsi/send_targets
0     ./iscsi/slp
0     ./iscsi/static
0     ./iscsi
0     ./selinux/tmp
0     ./selinux
0     ./logrotate
0     ./nfs/rpc_pipefs
4.0K  ./nfs/statd
0     ./nfs/v4recovery
4.0K  ./nfs
0     ./openvswitch/pki
0     ./openvswitch
0     ./os-prober
0     ./polkit-1/localauthority
0     ./polkit-1
0     ./samba/lock
0     ./samba/private
0     ./samba
5.3M  ./sss/db
0     ./sss/deskprofile
0     ./sss/gpo_cache
27M   ./sss/mc
0     ./sss/pipes
0     ./sss/pubconf
0     ./sss/secrets
0     ./sss/keytabs
32M   ./sss
0     ./tpm
0     ./unbound
0     ./portables
0     ./private
48K   ./rpm-ostree/history
48K   ./rpm-ostree
4.0K  ./crio
277M  ./etcd/member
277M  ./etcd
3.8G  ./metal3/images
3.8G  ./metal3
12G   .

Happened again in [1], which has lots of NodeFilesystemSpaceFillingUp under "Check if alerts are firing during or after upgrade success".

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1402636541485912064

Arda managed to reproduce this, and we did some debugging. It seems the script we use to download the RHCOS image can leak a tmpdir (containing the large image files), and in the reproduction environment this meant excessive disk usage by the metal3 pod (about 13G for metal3/images).
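The general shape of that class of fix (an illustrative sketch only, not the actual ironic-rhcos-downloader change; `download_image` is a hypothetical name) is to register cleanup with `trap` so the tmpdir is removed even on an early exit:

```shell
#!/bin/sh
# Illustrative sketch of the tmpdir-leak fix pattern: create the
# workspace with mktemp and register cleanup on EXIT, so an early
# failure can no longer strand multi-gigabyte image files under /tmp.
download_image() {
  workdir=$(mktemp -d)
  trap 'rm -rf "$workdir"' EXIT
  # ... download and uncompress the RHCOS image into "$workdir",
  # then move only the finished artifact into the shared cache ...
  echo "$workdir"
}
```

Without the `trap`, any failure between `mktemp` and the final cleanup leaves the workspace (and its image files) behind, which is how the usage accumulated in metal3/images.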
I pushed https://github.com/openshift/ironic-rhcos-downloader/pull/44, which should resolve that issue. We may have other opportunities to reduce disk usage in future (for example, having the second-level DaemonSet cache download the pre-compressed image and clean up the uncompressed one), but I think fixing and backporting the leaky-tmpdir issue is the best first step.

There may be further fixes needed, but I moved this to MODIFIED so we can get https://github.com/openshift/ironic-rhcos-downloader/pull/44 verified to enable backport to 4.7.

Ref https://github.com/openshift/ironic-rhcos-downloader/pull/45 - we need to ensure we can't leak $TMPDIR in the 4.7 pre-upgrade case too.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438