Bug 1971549 - CI: e2e-metal-ipi-upgrade failing on KubeletHasDiskPressure, which triggers machine-config RequiredPoolsFailed
Summary: CI: e2e-metal-ipi-upgrade failing on KubeletHasDiskPressure, which triggers machine-config RequiredPoolsFailed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.z
Assignee: Steven Hardy
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On: 1968754
Blocks:
 
Reported: 2021-06-14 10:19 UTC by Steven Hardy
Modified: 2021-07-06 11:39 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1968754
Environment:
Last Closed: 2021-07-06 11:38:38 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/ironic-rhcos-downloader pull 45 (open): [release-4.7] Bug 1971549: Unconditionally remove TMPDIR (last updated 2021-06-23 01:01:21 UTC)
- Red Hat Product Errata RHBA-2021:2554 (last updated 2021-07-06 11:39:14 UTC)

Description Steven Hardy 2021-06-14 10:19:24 UTC
+++ This bug was initially created as a clone of Bug #1968754 +++

Seen in 4.7.14 -> 4.8.0-0.nightly-2021-06-04-031425 CI [1], the update timed out:

  [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	2h31m44s
  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:153]: during upgrade to registry.build01.ci.openshift.org/ci-op-9ly9p0g4/release@sha256:730a1b5b176a89f5ff10a00fd6f6f2dadb92d1e641a09da93fac8cf8d3c72841
  Unexpected error:
      <*errors.errorString | 0xc001c6b730>: {
          s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator machine-config is not available",
      }
      Cluster did not complete upgrade: timed out waiting for the condition: 
  Cluster operator machine-config is not available
  occurred

Here's the machine-config ClusterOperator that the CVO blocked on:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "machine-config").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2021-06-04T05:21:01Z Progressing=True : Working towards 4.8.0-0.nightly-2021-06-04-031425
  2021-06-04T05:25:28Z Upgradeable=False PoolUpdating: One or more machine config pools are updating, please see `oc get mcp` for further details
  2021-06-04T05:35:27Z Degraded=True RequiredPoolsFailed: Unable to apply 4.8.0-0.nightly-2021-06-04-031425: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-1697134e9eab0c69edf22e218f5164df expected 765dc0d017ca01bffe64933585417d31a9d154d1 has 3c1fc49624d0a9edbbd4ac20223afbdbd4b5ccf4: 3 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-28faa60ab47c4858fd8d3e83efdd3a81, retrying
  2021-06-04T05:18:54Z Available=False : Cluster not available for 4.8.0-0.nightly-2021-06-04-031425

And the pool summary from the ClusterOperator:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "machine-config").status.extension'
  {
    "master": "3 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-28faa60ab47c4858fd8d3e83efdd3a81",
    "worker": "all 2 nodes are at latest configuration rendered-worker-bc792c195cc1189ee960837829799f56"
  }

But all the nodes seem to be both Ready=True and up to date with their desired MachineConfig:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[] | .metadata as $m | .status.conditions[] | select(.type == "Ready") | .status + " " + .lastTransitionTime + " " + $m.name + " " + $m.annotations["machineconfiguration.openshift.io/currentConfig"] + " -> " + $m.annotations["machineconfiguration.openshift.io/desiredConfig"]'
  True 2021-06-04T05:30:10Z master-0 rendered-master-28faa60ab47c4858fd8d3e83efdd3a81 -> rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
  True 2021-06-04T05:35:59Z master-1 rendered-master-28faa60ab47c4858fd8d3e83efdd3a81 -> rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
  True 2021-06-04T05:44:14Z master-2 rendered-master-28faa60ab47c4858fd8d3e83efdd3a81 -> rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
  True 2021-06-04T05:34:59Z worker-0 rendered-worker-bc792c195cc1189ee960837829799f56 -> rendered-worker-bc792c195cc1189ee960837829799f56
  True 2021-06-04T05:29:49Z worker-1 rendered-worker-bc792c195cc1189ee960837829799f56 -> rendered-worker-bc792c195cc1189ee960837829799f56

And the pool itself doesn't say much about why it considers itself updating:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' | sort
2021-06-04T04:04:14Z RenderDegraded=False : 
2021-06-04T04:04:49Z Degraded=False : 
2021-06-04T04:04:49Z NodeDegraded=False : 
2021-06-04T05:25:08Z Updated=False : 
2021-06-04T05:25:08Z Updating=True : All nodes are updating to rendered-master-28faa60ab47c4858fd8d3e83efdd3a81
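
For cross-checking the counters behind the "3 (ready 2)" message, the machine counts can be read straight out of the pool status; the field names below are the usual MachineConfigPool status counters rather than something confirmed from this particular dump, so treat it as a sketch:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[] | select(.metadata.name == "master").status | "machines=\(.machineCount) ready=\(.readyMachineCount) updated=\(.updatedMachineCount) unavailable=\(.unavailableMachineCount)"'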

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784

--- Additional comment from Yu Qi Zhang on 2021-06-07 22:21:32 UTC ---

The MCP is still considered updating (and thus unfinished) because one of the masters (master-0) never became "Ready" by the MCO's definition. In this case, it has DiskPressure:

"lastHeartbeatTime": "2021-06-04T07:08:20Z",
"lastTransitionTime": "2021-06-04T05:39:31Z",
"message": "kubelet has disk pressure",
"reason": "KubeletHasDiskPressure",
"status": "True",
"type": "DiskPressure"

From:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json
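
A quick way to list the DiskPressure condition for every node in that dump, in the same jq style as above (a sketch, using only the standard node conditions):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | select(.type == "DiskPressure") | $n + " DiskPressure=" + .status + " " + .reason'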

So the pool/operator status is a bit confusing, since we don't explicitly report why the pool is still considered progressing; something we could improve is to bubble that status up. It does get logged by the controller (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1400653631547510784/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-54d599bf48-rskx5_machine-config-controller.log), but only as a not-ready reason.

So the root question here is why the node still has disk pressure, and whether that's expected (and simply requires larger disks). Maybe something for the metal team to look at?
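
To see what's actually consuming the space on a live cluster, one option (a sketch; the paths are assumed from the usual RHCOS layout) is to poke at the node via oc debug:

$ oc debug node/master-0 -- chroot /host sh -c 'df -h /sysroot; du -sh /var/lib/containers/storage /var/lib/kubelet'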

--- Additional comment from W. Trevor King on 2021-06-07 22:32:32 UTC ---

[1] is up to track getting the MCO to say things like "we are concerned about DiskPressure on master-0", and similarly in its ClusterOperator.

[1]: https://issues.redhat.com/browse/GRPA-3838

--- Additional comment from Steven Hardy on 2021-06-08 10:57:09 UTC ---

We can increase the disk size if needed, but before doing so I think it'd be helpful to answer the following questions:

* Does this KubeletHasDiskPressure only happen in the e2e-metal-ipi-upgrade job?
* Is this a new issue (specific to upgrading to the latest version)? If so, can we identify when it started failing?
* Can we identify what's using the disk space? (Not sure this will be possible without a local reproducer.)
* What root disk size is used on other platforms, and has there been a recent increase in requirements elsewhere?

If we increase the disk space by default, it'll impact all metal-ipi jobs (and all developers using the same openshift-metal3/dev-scripts), so I'd like to confirm whether the issue is specific to upgrades, and also whether it may indicate some recent regression impacting disk usage.

--- Additional comment from W. Trevor King on 2021-06-08 16:44:49 UTC ---

Spread across jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=KubeletHasDiskPressure' | grep 'failures match' | sort
periodic-ci-openshift-ovn-kubernetes-release-4.9-e2e-metal-ipi-ovn-dualstack-periodic (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi (all) - 25 runs, 92% failed, 9% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 9 runs, 100% failed, 67% of failures match = 67% impact
pull-ci-openshift-cluster-baremetal-operator-master-e2e-metal-ipi (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-ironic-image-master-e2e-metal-ipi (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-prevalidation (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-ironic-image-master-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation (all) - 6 runs, 67% failed, 100% of failures match = 67% impact
pull-ci-openshift-kubernetes-release-4.8-e2e-metal-ipi-ovn-dualstack (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi (all) - 11 runs, 64% failed, 14% of failures match = 9% impact
pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi (all) - 8 runs, 88% failed, 29% of failures match = 25% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-metal-ipi-ovn-dualstack (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
rehearse-18988-pull-ci-openshift-ironic-image-release-4.8-prevalidation-e2e-metal-ipi-virtualmedia-prevalidation (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
rehearse-19047-periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19054-periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
rehearse-19070-pull-ci-openshift-coredns-release-4.8-e2e-metal-ipi-dualstack (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

And at least for some jobs it goes back about as far as CI-search's 14d history:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=336h&type=build-log&search=KubeletHasDiskPressure' | grep -o '[0-9]* days ago' | sort -n | tail -n1
13 days ago

I'm not sure what's using the space; you might need to poke around on a cluster-bot or other test cluster.  One reason updates might take more storage is that they download a bunch of images with the new content, and I'm not sure how quickly the kubelet/CRI-O prunes the local image cache.
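
If image-cache pruning does turn out to be the issue, the kubelet garbage-collection thresholds can in principle be tightened with a KubeletConfig; this is only a sketch of that approach (the thresholds and the pool label are the usual ones from KubeletConfig examples, not something validated against this failure):

$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: aggressive-image-gc
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    # start pruning unused images at 75% disk usage, stop at 60%
    imageGCHighThresholdPercent: 75
    imageGCLowThresholdPercent: 60
EOF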

I'm not sure what size volumes other platforms use, but for cloud platforms, we often use ridiculously oversized storage because the cloud providers tie things like I/O throttle tokens to storage size, and we need the tokens to keep etcd latency in line.

--- Additional comment from Arda Guclu on 2021-06-10 12:07:41 UTC ---

After investigating and reproducing the w3m -dump results above, in addition to the upgrade problem (which is related to the masters' disk size), there is another disk-pressure problem for the jobs running the full conformance/parallel suite (which is related to the workers' disk size). When WORKER_DISK's default of 20 was first set, we were not running the full ~1000-test parallel suite. Now openshift-tests runs ~1000 tests on 2 workers, and the storage becomes insufficient at some point.
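
If we do end up bumping the sizes, in dev-scripts that should just be a config change; a sketch along these lines (WORKER_DISK is the variable mentioned above, MASTER_DISK is assumed by analogy, and the exact names may differ between dev-scripts versions):

$ cat >> config_$USER.sh <<'EOF'
# sizes in GB; MASTER_DISK is assumed by analogy to WORKER_DISK
export MASTER_DISK=40
export WORKER_DISK=40
EOF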

These are the metrics after running all conformance/parallel and waiting to complete;

oc describe node worker-0

Events:
  Type     Reason                 Age                   From     Message
  ----     ------                 ----                  ----     -------
  Warning  EvictionThresholdMet   33m                   kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    33m                   kubelet  Node worker-0 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  28m (x24 over 3h33m)  kubelet  Node worker-0 status is now: NodeHasNoDiskPressure
  
oc describe node worker-1

Events:
  Type     Reason                 Age    From     Message
  ----     ------                 ----   ----     -------
  Warning  EvictionThresholdMet   14m    kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    14m    kubelet  Node worker-1 status is now: NodeHasDiskPressure
  Normal   NodeHasNoDiskPressure  9m27s  kubelet  Node worker-1 status is now: NodeHasNoDiskPressure
  
[core@worker-0 ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G  168K  3.9G   1% /dev/shm
tmpfs           3.9G   59M  3.9G   2% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda4        20G   14G  6.3G  68% /sysroot
tmpfs           3.9G   12K  3.9G   1% /tmp
/dev/sda3       364M  196M  146M  58% /boot
overlay         3.9G   59M  3.9G   2% /etc/NetworkManager/systemConnectionsMerged
tmpfs           795M  8.0K  795M   1% /run/user/1000

[core@worker-0 sysroot]$ sudo du -h --max-depth=2
0	./boot
3.2G	./ostree/repo
9.8G	./ostree/deploy
0	./ostree/boot.0.1
13G	./ostree
13G	.

[core@worker-0 lib]$ sudo du -h --max-depth=2
0	./kubelet/device-plugins
0	./kubelet/pod-resources
1.4G	./kubelet/pods
16K	./kubelet/plugins
4.0K	./kubelet/plugins_registry
8.0K	./kubelet/pki
1.4G	./kubelet
132K	./systemd/catalog
0	./systemd/coredump
0	./systemd/timers
136K	./systemd
76K	./cni/networks
190M	./cni/bin
144K	./cni/multus
72K	./cni/results
190M	./cni
0	./dbus
24K	./NetworkManager
0	./authselect
0	./alternatives
4.0K	./chrony
0	./containers/sigstore
8.0G	./containers/storage
304K	./containers/cache
8.0G	./containers
0	./dhclient
0	./dnsmasq
0	./initramfs
0	./games
0	./misc
0	./fwupd
0	./gssproxy/clients
0	./gssproxy/rcache
0	./gssproxy
0	./iscsi/ifaces
0	./iscsi/isns
0	./iscsi/nodes
0	./iscsi/send_targets
0	./iscsi/slp
0	./iscsi/static
0	./iscsi
0	./selinux/tmp
0	./selinux
0	./logrotate
0	./nfs/rpc_pipefs
4.0K	./nfs/statd
0	./nfs/v4recovery
4.0K	./nfs
0	./openvswitch/pki
0	./openvswitch
0	./os-prober
0	./polkit-1/localauthority
0	./polkit-1
0	./samba/lock
0	./samba/private
0	./samba
5.3M	./sss/db
0	./sss/deskprofile
0	./sss/gpo_cache
27M	./sss/mc
0	./sss/pipes
0	./sss/pubconf
0	./sss/secrets
0	./sss/keytabs
32M	./sss
0	./tpm
0	./unbound
0	./portables
0	./private
48K	./rpm-ostree/history
48K	./rpm-ostree
4.0K	./crio
0	./csi-hostpath-data
9.5G	.

--- Additional comment from Arda Guclu on 2021-06-10 12:18:07 UTC ---

Fresh IPv4 dev-scripts installation, master-0 total disk usage: 15 GB (51%)

[core@master-0 sysroot]$ sudo du -h --max-depth=2
0	./boot
3.2G	./ostree/repo
12G	./ostree/deploy
0	./ostree/boot.0.1
15G	./ostree
15G	.

[core@master-0 lib]$ sudo du -h --max-depth=2
0	./kubelet/device-plugins
0	./kubelet/pod-resources
1.7M	./kubelet/pods
0	./kubelet/plugins
0	./kubelet/plugins_registry
8.0K	./kubelet/pki
1.7M	./kubelet
132K	./systemd/catalog
0	./systemd/coredump
0	./systemd/timers
136K	./systemd
100K	./cni/networks
190M	./cni/bin
188K	./cni/multus
92K	./cni/results
190M	./cni
0	./dbus
24K	./NetworkManager
0	./authselect
0	./alternatives
4.0K	./chrony
0	./containers/sigstore
6.8G	./containers/storage
136K	./containers/cache
6.8G	./containers
0	./dhclient
0	./dnsmasq
0	./initramfs
0	./games
0	./misc
0	./fwupd
0	./gssproxy/clients
0	./gssproxy/rcache
0	./gssproxy
0	./iscsi/ifaces
0	./iscsi/isns
0	./iscsi/nodes
0	./iscsi/send_targets
0	./iscsi/slp
0	./iscsi/static
0	./iscsi
0	./selinux/tmp
0	./selinux
0	./logrotate
0	./nfs/rpc_pipefs
4.0K	./nfs/statd
0	./nfs/v4recovery
4.0K	./nfs
0	./openvswitch/pki
0	./openvswitch
0	./os-prober
0	./polkit-1/localauthority
0	./polkit-1
0	./samba/lock
0	./samba/private
0	./samba
5.3M	./sss/db
0	./sss/deskprofile
0	./sss/gpo_cache
27M	./sss/mc
0	./sss/pipes
0	./sss/pubconf
0	./sss/secrets
0	./sss/keytabs
32M	./sss
0	./tpm
0	./unbound
0	./portables
0	./private
48K	./rpm-ostree/history
48K	./rpm-ostree
4.0K	./crio
277M	./etcd/member
277M	./etcd
3.8G	./metal3/images
3.8G	./metal3
12G	.

--- Additional comment from W. Trevor King on 2021-06-11 05:55:43 UTC ---

This happened again in [1], which has lots of NodeFilesystemSpaceFillingUp alerts under:

  Check if alerts are firing during or after upgrade success

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1402636541485912064

--- Additional comment from Steven Hardy on 2021-06-11 10:44:27 UTC ---

Arda managed to reproduce this, and we did some debugging. It seems that the script we use to download the RHCOS image can leak a tmpdir (containing the large image files); in the reproduced environment this meant excessive disk usage by the metal3 pod (about 13G for metal3/images).

I pushed https://github.com/openshift/ironic-rhcos-downloader/pull/44, which should resolve that issue. We may have other opportunities to reduce disk usage in the future (for example, having the second-level DaemonSet cache download the pre-compressed image and clean up the uncompressed one), but I think fixing and backporting the leaky-tmpdir issue is the best first step.
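
(Not the actual patch, just an illustration of the general shell pattern for making sure a download tmpdir is removed unconditionally:)

# illustrative sketch only -- see the linked PR for the real change
TMPDIR=$(mktemp -d)
# remove the temporary download area no matter how the script exits
trap 'rm -rf "${TMPDIR}"' EXIT
# ... download and uncompress the RHCOS image under ${TMPDIR} here ...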

Comment 4 errata-xmlrpc 2021-07-06 11:38:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.19 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2554

