Bug 2216964 - VMs IO Errors - Unable to attach or mount volumes.
Summary: VMs IO Errors - Unable to attach or mount volumes.
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Niels de Vos
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-23 12:21 UTC by Boaz
Modified: 2023-08-09 16:37 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments

Description Boaz 2023-06-23 12:21:23 UTC
I'm running a scale regression setup on:
=========================================
OpenShift 4.13.2
OpenShift Virtualization 4.13.1
OpenShift Container Storage - 4.12.4-rhodf

This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS.
After the cluster had been idle for a few days, I returned to find 96 VMs in a Paused state due to IO errors (message: "VMI was paused, IO error"),
and the csi-addons-controller-manager was crash-looping due to memory usage:

============================================================================================================================================
NAME                                               READY   STATUS      
csi-addons-controller-manager-58f765c9ff-2pzb4     1/2     OOMKilled  
============================================================================================================================================
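
For completeness, a quick way to confirm which container was OOMKilled and how often it restarted (a sketch, not part of the original report; pod name taken from the listing above):

============================================================================================================================================
# Show the last termination reason and restart count for each container in the pod
oc -n openshift-storage get pod csi-addons-controller-manager-58f765c9ff-2pzb4 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Current per-container memory usage (requires cluster metrics)
oc adm top pod -n openshift-storage csi-addons-controller-manager-58f765c9ff-2pzb4 --containers
============================================================================================================================================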

I increased the memory limit to 1 GiB using "oc edit csv odf-csi-addons-operator.v4.12.4-rhodf -n openshift-storage",
which recovered the csi-addons-controller-manager; once it came back up it used ~650 MiB of VmRSS.
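
For reference, the same limit bump can be scripted rather than edited interactively; a minimal sketch (the deployment and container indexes in the patch path are assumptions, verify them against the CSV first):

============================================================================================================================================
# Bump the memory limit in the odf-csi-addons-operator CSV to 1Gi.
# NOTE: the /0/.../0/ indexes below are placeholders - check `oc get csv ... -o yaml` before patching.
oc -n openshift-storage patch csv odf-csi-addons-operator.v4.12.4-rhodf --type=json -p \
  '[{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"}]'
============================================================================================================================================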

With the csi-addons-controller-manager back up, I looked at the Paused VMs, isolating the nodes that had Paused VMs hosted on them:

============================================================================================================================================

[root@d16-h06-000-r650 ~]# oc get vm -o wide |grep Paused|xargs -l bash -c 'oc get vmi $0'|grep worker|awk '{print $4}'|uniq
worker010
worker084
worker094
worker005
worker152
worker115
worker090
worker003
worker070
worker090
worker019
============================================================================================================================================
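
The same node list can also be produced in one call with custom columns instead of the per-VM xargs round-trips; a rough equivalent (column and condition names are assumptions and may differ by KubeVirt version):

============================================================================================================================================
# Nodes hosting VMIs that carry a Paused condition
oc get vmi --no-headers \
  -o custom-columns='NAME:.metadata.name,NODE:.status.nodeName,PAUSED:.status.conditions[?(@.type=="Paused")].status' \
  | awk '$3=="True" {print $2}' | sort -u
============================================================================================================================================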


Looking at dmesg on worker052, which had most of the VMs running on it:
============================================================================================================================================

[Thu Jun 22 14:30:29 2023] rbd: rbd45: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 2977 4128768~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 3086 393216~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd0: failed to lock header: -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write at objno 1018 3948544~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd39: write at objno 4304 3297280~237568 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 2944 16384~4096 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write at objno 1015 2244608~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 4031 3149824~8192 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd42: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 3079 360448~16384 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write at objno 2944 1024~512 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd0: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 2944 16384~4096 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write at objno 768 2101248~8192 result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd9: write result -108
[Thu Jun 22 14:30:29 2023] rbd: rbd45: write at objno 922 2019328~4096 result -108
============================================================================================================================================
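
For reference, -108 is ESHUTDOWN, which matches the "Cannot send after transport endpoint shutdown" reported later by rbd map; a quick way to confirm the errno mapping (assuming python3 is available):

============================================================================================================================================
# Translate the kernel's "result -108" into its errno name and message
python3 -c 'import errno, os; print(errno.errorcode[108], "-", os.strerror(108))'
# ESHUTDOWN - Cannot send after transport endpoint shutdown
============================================================================================================================================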

The Ceph cluster didn't experience any issues and reported good health:

============================================================================================================================================
[ceph: root@d28-h01-000-r750 /]# ceph health detail
HEALTH_OK
============================================================================================================================================

My next step was to verify that the VMs didn't get corrupted. I did that by exporting the RBD image of a few VMs and running xfs_repair; this is for "rhel92-vm5403":

============================================================================================================================================
# export the VM's RBD image from the external cluster
rbd export --export-format 1 ocp_pool/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e /mnt/my_mount_point/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e_format1.export
# use qemu-nbd to attach the exported image
qemu-nbd --connect=/dev/nbd2 -f raw csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e_format1.export
# get filesystem info for the VM's root partition
xfs_info /dev/nbd2p1
# test that the image is not corrupted (read-only check)
xfs_repair -n /dev/nbd2p1

# example output:
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 184902, counted 194911
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
============================================================================================================================================
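
A couple of housekeeping steps around the nbd-based check above, shown as a sketch since they are not in the original transcript: the nbd module has to be loaded before qemu-nbd can attach the image, and the device should be detached once the check is done.

============================================================================================================================================
# Load the nbd kernel module (needed once, before qemu-nbd --connect)
modprobe nbd max_part=8

# Re-read the partition table if /dev/nbd2p1 does not appear, then detach when finished
partprobe /dev/nbd2
qemu-nbd --disconnect /dev/nbd2
============================================================================================================================================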

I also verified that I can create new RBD images:

============================================================================================================================================
csi-vol-ff6e4494-14ff-49f6-9795-ea99b8f25ef6
csi-vol-ff86834a-c950-4f21-aa5e-023c7b5c6221
csi-vol-ff87e622-fe6c-41dc-aea7-84581756e43c
csi-vol-ff8914a2-9749-46c2-b409-66518a0c1784
csi-vol-ff8a385c-9400-448c-9d3e-6881ffa6e908
csi-vol-ff92a77d-dcc3-4819-a188-058352114ead
csi-vol-ffac9f5d-6ac3-4839-9fec-3f13069a6e4f
csi-vol-ffb4b71a-df89-47d9-901e-4434f532c1d0
csi-vol-ffd35b1f-a2e5-4532-9127-b1f3b3799e51
csi-vol-ffd78cb8-6a4b-4f30-8181-7de112c5b5c9
csi-vol-ffe74211-336e-4efc-a527-36d3e425c6c5
csi-vol-ffe7ff55-ceab-4311-b70d-2cbe7c8749a1
csi-vol-ffedd0c8-e2f8-44fa-a131-5349e78d70cf
csi-vol-fff85a28-8465-486e-a6fe-9b0f153e6a44
csi-vol-fffea38e-6f41-4e63-9b1c-33132a78439a
test <===== image I created.
============================================================================================================================================
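
The listing above is presumably the tail of the pool's image list; a minimal sketch of the create-and-verify step itself (pool name from the report, image name and size are arbitrary):

============================================================================================================================================
# Create a small throwaway image in the same pool and confirm it shows up
rbd create ocp_pool/test --size 1G
rbd ls ocp_pool | tail

# Clean up afterwards
rbd rm ocp_pool/test
============================================================================================================================================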

Looking at the VM itself, it appears we are unable to attach or mount volumes; it's the same issue for all Paused VMs:

============================================================================================================================================
mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[hotplug-disks private public rhel92-vm5403-root-disk ephemeral-disks container-disks libvirt-runtime sockets]: timed out waiting for the condition
  Warning  FailedMount      14m                  kubelet            Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[rhel92-vm5403-root-disk private public ephemeral-disks container-disks libvirt-runtime sockets hotplug-disks]: timed out waiting for the condition
  Warning  FailedMount      12m                  kubelet            Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[container-disks libvirt-runtime rhel92-vm5403-root-disk sockets hotplug-disks private public ephemeral-disks]: timed out waiting for the condition
  Warning  FailedMount      10m                  kubelet            Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[libvirt-runtime sockets hotplug-disks private public rhel92-vm5403-root-disk ephemeral-disks container-disks]: timed out waiting for the condition
  Warning  FailedMount      8m8s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[libvirt-runtime rhel92-vm5403-root-disk sockets hotplug-disks private public ephemeral-disks container-disks]: timed out waiting for the condition
  Warning  FailedMount      4m1s (x2 over 6m5s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[hotplug-disks private public ephemeral-disks rhel92-vm5403-root-disk container-disks libvirt-runtime sockets]: timed out waiting for the condition
  Warning  FailedMapVolume  2m (x18 over 22m)    kubelet            MapVolume.SetUpDevice failed for volume "pvc-a3ac35c2-931d-4cf7-ab91-c0487c632dbe" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 192.168.108.33:6789,192.168.108.153:6789,192.168.246.96:6789,192.168.246.20:6789,192.168.244.240:6789 --keyfile=**stripped** map ocp_pool/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown
  Warning  FailedMount  118s  kubelet  Unable to attach or mount volumes: unmounted volumes=[rhel92-vm5403-root-disk], unattached volumes=[ephemeral-disks container-disks libvirt-runtime sockets hotplug-disks rhel92-vm5403-root-disk private public]: timed out waiting for the condition
===============================================================================================================================================

I verified that all of the above RHCS monitor IPs are fully accessible from each of the nodes:

===============================================================================================================================================

192.168.108.33
192.168.108.153
192.168.246.96
192.168.246.20
192.168.244.240
===============================================================================================================================================
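
A sketch of how such a reachability check can be run from the nodes (the exact method was not recorded here; nc availability in the debug image and the msgr v2 port 3300 are assumptions):

============================================================================================================================================
# From a debug shell on a node, probe the Ceph monitor ports on each address
for ip in 192.168.108.33 192.168.108.153 192.168.246.96 192.168.246.20 192.168.244.240; do
  nc -zv -w 3 "$ip" 6789   # msgr v1
  nc -zv -w 3 "$ip" 3300   # msgr v2, if enabled
done
============================================================================================================================================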

The above issue can be reproduced with almost any VM I stop and then start again.
Having verified that the external RHCS is not the cause of the issues and that the VMs didn't get corrupted, I assume ODF is the culprit; however, I couldn't figure out why we are unable to mount the volumes.
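
Since the kernel RBD client reports ESHUTDOWN while the cluster itself is healthy, one thing that may be worth checking (a hedged suggestion, not something done in this report) is whether the affected nodes' client addresses were blocklisted on the RHCS side, and whether the image still has live watchers:

============================================================================================================================================
# On the external RHCS cluster: list blocklisted client addresses, if any
# (older Ceph releases use "blacklist" instead of "blocklist")
ceph osd blocklist ls

# From a host with a ceph client and keyring: show watchers on one affected image
rbd status ocp_pool/csi-vol-08720968-ab38-4ff0-bffb-b6104f10aa2e
============================================================================================================================================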


The logs below were collected after I cordoned and drained worker052, which allowed the VMs that had been running on it to recover and become operational again,
but here are a few VMs that are still Paused:


[root@d16-h06-000-r650 ~]# oc get vm --no-headers |grep Paused|xargs -l bash -c 'oc get vmi --no-headers $0'
rhel92-vm0165   11d   Running         worker010   False
rhel92-vm0416   11d   Running         worker084   False
rhel92-vm1035   11d   Running         worker094   False
rhel92-vm2180   11d   Running         worker005   False
rhel92-vm2190   11d   Running         worker115   False
rhel92-vm2223   11d   Running         worker090   False
rhel92-vm2245   11d   Running         worker003   False
rhel92-vm2588   11d   Running         worker070   False
rhel92-vm2859   11d   Running         worker090   False
rhel92-vm3223   11d   Running         worker019   False
rhel92-vm3458   11d   Running         worker019   False

===============================================================================================================================================

Note: this log archive is 60 GiB in size once extracted:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/ceph_vms_cant_boot.tar.gz

I also collected an SOS report from worker052:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/sosreport-worker052-2023-06-23-tuhkqog.tar.xz

and worker019:
http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/sosreport-worker019-2023-06-23-ktqbjor.tar.xz

