I'm running a scale regression setup on:
====================================================================================================
OpenShift 4.13.2
OpenShift Virtualization 4.13.1
OpenShift Container Storage - 4.12.4-rhodf
====================================================================================================
This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS.

After the cluster was idle for a few days, I returned to find 96 VMs in the Paused state due to IO errors (Message: VMI was paused, IO error), and the csi-addons-controller-manager was crash-looping due to memory usage:
====================================================================================================
NAME                                             READY   STATUS
csi-addons-controller-manager-58f765c9ff-2pzb4   1/2     OOMKilled
====================================================================================================
I increased the memory limit to 1GiB using "oc edit csv odf-csi-addons-operator.v4.12.4-rhodf -n openshift-storage", which recovered the csi-addons-controller-manager; once it came back up it used ~650MiB of VmRSS.

Now, with the csi-addons-controller-manager up, I looked at the Paused VMs, isolating the nodes that had Paused VMs hosted on them:
====================================================================================================
[root@d16-h06-000-r650 ~]# oc get vm -o wide |grep Paused|xargs -l bash -c 'oc get vmi $0'|grep worker|awk '{print $4}'|uniq
worker010
worker084
worker094
worker005
worker152
worker115
worker090
worker003
worker070
worker090
worker019
====================================================================================================
Looking at dmesg on worker052, which had most of the VMs running on it:
====================================================================================================
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 2977 4128768~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 3086 393216~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd0: failed to lock header: -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write at objno 1018 3948544~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd39: write at objno 4304 3297280~237568 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 2944 16384~4096 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 1015 2244608~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 4031 3149824~8192 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 3079 360448~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write at objno 2944 1024~512 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd0: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 2944 16384~4096 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 768 2101248~8192 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write at objno 922 2019328~4096 result -108
====================================================================================================
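Result -108 is ESHUTDOWN, the same "Cannot send after transport endpoint shutdown" that the CSI map attempts report further down. On a cluster that is otherwise healthy, one cause worth ruling out is the node's krbd client having been evicted/blocklisted by the Ceph cluster, which makes its in-flight and subsequent OSD ops fail with -108. A check along these lines on the external RHCS would show it (a sketch only; the image name is the one from the FailedMapVolume event further down, and older Ceph releases spell the command "blacklist" instead of "blocklist"):
====================================================================================================
# on the external RHCS: list blocklisted client addresses and look for the workers' IPs
ceph osd blocklist ls

# show which clients currently hold a watch on one of the affected images
# (image name taken from the FailedMapVolume event further down)
rbd status ocp_pool/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e
====================================================================================================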
The Ceph cluster didn't experience any issues and reports good health:
====================================================================================================
[ceph: root@d28-h01-000-r750 /]# ceph health detail
HEALTH_OK
====================================================================================================
My next step was to verify that the VMs didn't get corrupted. I did that by exporting the RBD image of a few VMs and running xfs_repair against it; this is for "rhel92-vm5403":
====================================================================================================
rbd export --export-format 1 ocp_pool/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e /mnt/my_mount_point/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e_format1.export

using qemu-nbd to attach the image:
qemu-nbd --connect=/dev/nbd2 -f raw csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e_format1.export

we can get info about the filesystem with:
xfs_info /dev/nbd2p1

we can test that the image is not corrupted using:
xfs_repair -n /dev/nbd2p1

example output:
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 184902, counted 194911
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
====================================================================================================
I also verified that I can create new RBD images:
====================================================================================================
csi-vol-ff6e4494-14ff-49f6-9795-ea99b8f25ef6
csi-vol-ff86834a-c950-4f21-aa5e-023c7b5c6221
csi-vol-ff87e622-fe6c-41dc-aea7-84581756e43c
csi-vol-ff8914a2-9749-46c2-b409-66518a0c1784
csi-vol-ff8a385c-9400-448c-9d3e-6881ffa6e908
csi-vol-ff92a77d-dcc3-4819-a188-058352114ead
csi-vol-ffac9f5d-6ac3-4839-9fec-3f13069a6e4f
csi-vol-ffb4b71a-df89-47d9-901e-4434f532c1d0
csi-vol-ffd35b1f-a2e5-4532-9127-b1f3b3799e51
csi-vol-ffd78cb8-6a4b-4f30-8181-7de112c5b5c9
csi-vol-ffe74211-336e-4efc-a527-36d3e425c6c5
csi-vol-ffe7ff55-ceab-4311-b70d-2cbe7c8749a1
csi-vol-ffedd0c8-e2f8-44fa-a131-5349e78d70cf
csi-vol-fff85a28-8465-486e-a6fe-9b0f153e6a44
csi-vol-fffea38e-6f41-4e63-9b1c-33132a78439a
test          <===== image I created
====================================================================================================
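The listing above is presumably rbd ls output from ocp_pool (the pool the CSI volumes map from). For anyone reproducing the check, creating and listing a throwaway image can be done with something like this (a sketch; the 1G size is arbitrary):
====================================================================================================
# create a small test image in the pool used by the CSI volumes, then confirm it shows up
rbd create ocp_pool/test --size 1G
rbd ls ocp_pool | tail

# remove it again once done
rbd rm ocp_pool/test
====================================================================================================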
Looking at the VM itself, it looks like we are unable to mount the volumes; it's the same issue for all Paused VMs:
====================================================================================================
mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[hotplug-disks private public rhel92-vm5403-root-disk ephemeral-disks container-disks libvirt-runtime sockets]: timed out waiting for the condition
Warning  FailedMount      14m                   kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[rhel92-vm5403-root-disk private public ephemeral-disks container-disks libvirt-runtime sockets hotplug-disks]: timed out waiting for the condition
Warning  FailedMount      12m                   kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[container-disks libvirt-runtime rhel92-vm5403-root-disk sockets hotplug-disks private public ephemeral-disks]: timed out waiting for the condition
Warning  FailedMount      10m                   kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[libvirt-runtime sockets hotplug-disks private public rhel92-vm5403-root-disk ephemeral-disks container-disks]: timed out waiting for the condition
Warning  FailedMount      8m8s                  kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[libvirt-runtime rhel92-vm5403-root-disk sockets hotplug-disks private public ephemeral-disks container-disks]: timed out waiting for the condition
Warning  FailedMount      4m1s (x2 over 6m5s)   kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[hotplug-disks private public ephemeral-disks rhel92-vm5403-root-disk container-disks libvirt-runtime sockets]: timed out waiting for the condition
Warning  FailedMapVolume  2m (x18 over 22m)     kubelet  MapVolume.SetUpDevice failed for volume "pvc-a3ac35c2-931d-4cf7-ab91-c0487c632dbe" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 192.168.108.33:6789,192.168.108.153:6789,192.168.246.96:6789,192.168.246.20:6789,192.168.244.240:6789 --keyfile=**stripped** map ocp_pool/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown
Warning  FailedMount      118s                  kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[ephemeral-disks container-disks libvirt-runtime sockets hotplug-disks rhel92-vm5403-root-disk private public]: timed out waiting for the condition
====================================================================================================
I verified that all of the above RHCS interface IPs are fully accessible from each of the nodes:
====================================================================================================
192.168.108.33
192.168.108.153
192.168.246.96
192.168.246.20
192.168.244.240
====================================================================================================
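One way to spot-check reachability of the mon endpoints listed above from a worker, using only a debug pod and bash's built-in /dev/tcp (a sketch; worker019 is just an example node, and only the msgr v1 port 6789 shown in the map arguments above is probed):
====================================================================================================
# probe TCP port 6789 on every mon address from the node itself
oc debug node/worker019 -- chroot /host bash -c '
  for ip in 192.168.108.33 192.168.108.153 192.168.246.96 192.168.246.20 192.168.244.240; do
    timeout 3 bash -c "</dev/tcp/$ip/6789" && echo "$ip:6789 reachable" || echo "$ip:6789 NOT reachable"
  done'
====================================================================================================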
The above issue can be reproduced for almost any VM I stop and then start again. Having verified that the external RHCS is not the cause of the issue and that the VMs didn't get corrupted, I assume ODF is the culprit; however, I couldn't figure out why we are unable to mount the volumes.

The logs below were collected after I cordoned & drained worker052 and was able to recover the VMs that were running on it, so they became operational again, but here are a few VMs that are still Paused:
====================================================================================================
[root@d16-h06-000-r650 ~]# oc get vm --no-headers |grep Paused|xargs -l bash -c 'oc get vmi --no-headers $0'
rhel92-vm0165   11d   Running   worker010   False
rhel92-vm0416   11d   Running   worker084   False
rhel92-vm1035   11d   Running   worker094   False
rhel92-vm2180   11d   Running   worker005   False
rhel92-vm2190   11d   Running   worker115   False
rhel92-vm2223   11d   Running   worker090   False
rhel92-vm2245   11d   Running   worker003   False
rhel92-vm2588   11d   Running   worker070   False
rhel92-vm2859   11d   Running   worker090   False
rhel92-vm3223   11d   Running   worker019   False
rhel92-vm3458   11d   Running   worker019   False
====================================================================================================
Note: this log is 60GiB in size once extracted:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/ceph_vms_cant_boot.tar.gz

I also collected an SOS report from worker052:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/sosreport-worker052-2023-06-23-tuhkqog.tar.xz

and from worker019:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/sosreport-worker019-2023-06-23-ktqbjor.tar.xz
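For reference, the cordon & drain recovery described above maps to the standard commands below (a sketch, not the exact invocation used; how the virt-launcher pods are evacuated depends on each VM's eviction strategy, and a VM that stays Paused can be bounced with virtctl):
====================================================================================================
# stop scheduling new workloads onto the node, then evict what is running there
oc adm cordon worker052
oc adm drain worker052 --ignore-daemonsets --delete-emptydir-data --force

# allow scheduling again once the node is clean
oc adm uncordon worker052

# restart one of the VMs that is still Paused (example name from the list above)
virtctl restart rhel92-vm3223
====================================================================================================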