Description of problem (please be as detailed as possible and provide log snippets):

[DR] RBD image mount failed on pod with: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd5 but could not correct them: fsck from util-linux 2.32.1

Version of all relevant components (if applicable):
OCP version: 4.9.0-0.nightly-2021-11-08-084355
ODF version: 4.9.0-230.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the pod is stuck in the ContainerCreating state.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a DR cluster
2. Deploy workloads
3. Perform failover of workloads to the secondary site
4. Check the pod events/logs

Actual results:
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               5m41s                 default-scheduler        Successfully assigned busybox-workloads/busybox-2 to prsurve-vm-dev-88rv4-worker-vh6hb
  Normal   CreateResource          5m41s                 subscription             Synchronizer created resource busybox-2 of gvk:/v1, Kind=Pod
  Normal   SuccessfulAttachVolume  5m40s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c1eb1ab8-850f-43f5-b2f9-0982900f9326"
  Warning  FailedMount             84s (x2 over 3m38s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-c8fvj]: timed out waiting for the condition
  Warning  FailedMount             83s (x10 over 5m37s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-c1eb1ab8-850f-43f5-b2f9-0982900f9326" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd5 but could not correct them: fsck from util-linux 2.32.1
  /dev/rbd5 contains a file system with errors, check forced.
  /dev/rbd5: Resize inode not valid.
  /dev/rbd5: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

Expected results:
RBD image mounting should not fail.

Additional info:
I tried deleting the pod, but it did not help; the pod is still in ContainerCreating after the restart.
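For anyone triaging, a minimal sketch of gathering the relevant state, using the node and volume names from the events above; the read-only fsck and the krbd sysfs path are assumptions about what is available on the worker (e2fsprogs on the host, kernel rbd mappings):

# Pod events and the PV backing the failed mount
oc -n busybox-workloads describe pod busybox-2
oc get pv pvc-c1eb1ab8-850f-43f5-b2f9-0982900f9326 -o yaml

# From the worker node the pod was scheduled to, inspect the mapped device read-only
oc debug node/prsurve-vm-dev-88rv4-worker-vh6hb -- chroot /host bash -c '
  lsblk /dev/rbd5
  cat /sys/bus/rbd/devices/*/name   # krbd sysfs: image name behind each /dev/rbdX
  e2fsck -fn /dev/rbd5              # read-only check, answers "no" to all fix prompts
'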
Proposing as a blocker since the consequences are quite severe
Orit asked me to look at this from the ext4 / local filesystem perspective. From what I can see so far, it looks like the block device has been corrupted, overwritten, or damaged in some way such that e2fsck cannot even recognize an ext4 filesystem on it. I don't see any indication that this is an ext4 or e2fsck bug at this point.

Thanks,
-Eric
I'll note that the entire inode table in block group 0 seems to be zeroed out on disk, which is why it can't validate any of the critical inodes (resize, root, journal):

# dumpe2fs csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export | grep -A8 "Group 0:"
dumpe2fs 1.45.6 (20-Mar-2020)
Group 0: (Blocks 0-32767) csum 0x1c82 [ITABLE_ZEROED]
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-1024
  Block bitmap at 1025 (+1025), csum 0xd2a19528
  Inode bitmap at 1041 (+1041), csum 0xc8748330
  Inode table at 1057-1568 (+1057)    <---- inode table
  23513 free blocks, 8180 free inodes, 2 directories, 8180 unused inodes
  Free blocks: 0-1026, 1041-1042, 1057-2080, 9249-32767
  Free inodes: 1, 3, 5-6, 8, 10, 12, 15, 21, 23, 25, 27, 29-30, 35, 39, 42, 44, 46, 48-8192

# dd if=csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export bs=4k skip=1057 count=512 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00200000
#
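In case it helps anyone re-checking this, a minimal sketch of that kind of non-destructive inspection, assuming access to an exported copy of the image; the pool/image names below are placeholders, not taken from this cluster:

# Export the affected image to a file (pool/image names are placeholders)
rbd export ocs-storagecluster-cephblockpool/csi-vol-<uuid> /tmp/csi-vol.export

# Force a read-only filesystem check; -n answers "no" to every fix prompt
e2fsck -fn /tmp/csi-vol.export

# Dump block group 0 metadata, as in the output above
dumpe2fs /tmp/csi-vol.export | grep -A8 "Group 0:"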
Notes from the DR Triage meeting (Nov 12th, Friday 8:00 AM Eastern TZ):
- Based on the one instance, there is not much from RBD or DR orchestration that can be gleaned
- As per comment #8 and comment #9 there are no evident ext4 issues here (thanks @esandeen); the issue seems to be more related to the RBD snapshot that was promoted and was expected to be crash consistent
- Current action is to reproduce this with additional RBD debug logging enabled on the cluster, to understand if there is a code path that is causing this
  - For example, failure to roll back a semi-synced snapshot on a promote request
- Marking this as NEEDINFO to QE for the same
Was attempting to reproduce this issue using a 1 minute RBD snapshot schedule and triggering failovers in a loop, roughly as follows (a scripted sketch follows this list). Post initial deployment of the application/pod using Ramen/ACM:

1) Check that the pod is running on the desired cluster:
     if [[ "Running" == $(kubectl get pods -n busybox-sample busybox --context=west -o jsonpath='{.status.phase}') ]]; then <proceed>; fi
   * Time out in 8 minutes with an error if the pod is not in the Running state by then
2) Try to reset the schedule on the primary (due to bz #2019161)
3) Wait for 2:15 minutes (for the schedules to catch up)
4) Wait for the PeerReady condition in the DRPC:
     if [[ "True" == $(kubectl get drpc -n busybox-sample busybox-drpc --context hub -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}') ]]; then <proceed>; fi
   * Time out in 8 minutes with an error if the condition is not met
5) Failover
- Repeat 1-5 from west->east

Unable to loop and perform the test as in step (2) due to the other BZ and the fact that RBD is not picking up the schedule addition within a tolerance (at times it takes ~60 or more minutes). Leaving the intention here nevertheless, to enable quicker reproduction of the issue once the related bugs have a workaround or a fix as appropriate.
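A rough, untested sketch of that loop, assuming two managed clusters named west/east, the busybox-sample namespace and busybox-drpc DRPC from above, and that the failover is triggered by patching the DRPC (the spec.action/spec.failoverCluster fields follow the Ramen DRPlacementControl spec as I understand it; verify before use):

#!/bin/bash
# Sketch of the failover loop described above; not a verified reproducer.
set -euo pipefail

wait_for() {
  # wait_for <timeout-seconds> <command...>
  local timeout=$1; shift
  local start=$SECONDS
  until "$@"; do
    if (( SECONDS - start > timeout )); then
      echo "timed out waiting for: $*"; exit 1
    fi
    sleep 10
  done
}

pod_running() {
  [[ "Running" == $(kubectl get pods -n busybox-sample busybox \
      --context="$1" -o jsonpath='{.status.phase}') ]]
}

peer_ready() {
  [[ "True" == $(kubectl get drpc -n busybox-sample busybox-drpc --context=hub \
      -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}') ]]
}

from=west; to=east
while true; do
  wait_for 480 pod_running "$from"   # step 1: pod Running on the current primary
  # step 2 (reset the RBD snapshot schedule) is omitted; blocked on bz #2019161
  sleep 135                          # step 3: let the schedules catch up (2:15)
  wait_for 480 peer_ready            # step 4: DRPC PeerReady condition
  # step 5: fail over by pointing the DRPC at the other cluster
  kubectl --context=hub -n busybox-sample patch drpc busybox-drpc --type=merge \
    -p "{\"spec\":{\"action\":\"Failover\",\"failoverCluster\":\"$to\"}}"
  tmp=$from; from=$to; to=$tmp       # repeat 1-5 in the opposite direction
done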
The logs are not enough to take a look at things from the rbd perspective right now. Is there a test run with debug rbd = 30, debug rbd_mirror = 30, debug journaler = 30? If anyone is attempting to reproduce, please set these options before proceeding (a sketch of how to set them is below).
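For reference, a minimal sketch of raising those debug levels from the toolbox; the toolbox deployment name and namespace are the usual ODF defaults and are assumptions here, and the option names are the standard Ceph debug subsystems:

# Run ceph commands through the rook-ceph toolbox (names are assumptions)
TOOLS="oc -n openshift-storage rsh deploy/rook-ceph-tools"

# Raise the debug levels centrally via the mon config store
$TOOLS ceph config set global debug_rbd 30
$TOOLS ceph config set global debug_rbd_mirror 30
$TOOLS ceph config set global debug_journaler 30

# Verify
$TOOLS ceph config dump | grep -E 'debug_(rbd|journaler)'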
No longer a blocker as it is not reproducible.
Not a 4.11 blocker
The last reproducer was not useful (and it was not the same as the original bug), so closing this bug as discussed in some of the previous comments. Please reopen if we see this again; there is nothing to debug as of now.
I created a clone for the new issue to avoid any confusion; let's track and update that issue in the cloned bug. Moving this one out of 4.12.
https://bugzilla.redhat.com/show_bug.cgi?id=2158664
Subham, please check https://bugzilla.redhat.com/show_bug.cgi?id=2021460#c34
The BZ says `ODF version:- 4.9.0-230.ci`, but in the logs I see `v4.12.0-0`. In that case, we check the logRotate configuration every 15 minutes and update the logs accordingly; the default rotation is based on whichever comes first, the periodicity (daily) or maxLogSize (500M). A sketch of checking that configuration is below.
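For reference, a minimal sketch of checking those settings on the CephCluster CR; the CR name is the usual ODF default and the logCollector field names follow the Rook CephCluster spec, both assumptions here:

# CR name is an assumption (the common ODF default); adjust as needed
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.logCollector}{"\n"}'
# Expected shape, matching the defaults described above (an assumption):
#   {"enabled":true,"periodicity":"daily","maxLogSize":"500M"}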