Bug 2021460 - [RDR] Rbd image mount failed on pod saying rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd5 but could not correct them: fsck from util-linux 2.32.1
Summary: [RDR] Rbd image mount failed on pod saying rpc error: code = Internal desc = ...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ilya Dryomov
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 2011326 2030746
 
Reported: 2021-11-09 10:38 UTC by Pratik Surve
Modified: 2023-08-09 16:37 UTC (History)
20 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Failover action reports RADOS block device image mount failed on the pod with RPC error `fsck`
Failing over a disaster recovery (DR) protected workload may result in pods not starting with volume mount errors that state the volume has file system consistency check (fsck) errors. This prevents the workload from failing over to the failover cluster.
Clone Of:
: 2030746 2158560 2158664 (view as bug list)
Environment:
Last Closed: 2023-04-10 07:01:07 UTC
Embargoed:



Description Pratik Surve 2021-11-09 10:38:22 UTC
Description of problem (please be as detailed as possible and provide log snippets):

[DR] Rbd image mount failed on pod saying rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd5 but could not correct them: fsck from util-linux 2.32.1



Version of all relevant components (if applicable):
OCP version:- 4.9.0-0.nightly-2021-11-08-084355

ODF version:- 4.9.0-230.ci

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the pod is stuck in the ContainerCreating state.


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy DR cluster 
2. Deploy Workloads 
3. Perform failover of workloads to secondary site
4. Check pod logs (example commands below)
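
For reference, steps 3 and 4 on this setup might look roughly like the following (the DRPC name, its namespace, the cluster contexts, and the exact DRPC spec fields are assumptions, not taken from this cluster; the pod and workload namespace are from the events below):

$ kubectl --context hub -n busybox-workloads patch drpc busybox-drpc --type merge \
    -p '{"spec":{"action":"Failover","failoverCluster":"<secondary-cluster>"}}'
$ kubectl --context <secondary-cluster> -n busybox-workloads describe pod busybox-2
$ kubectl --context <secondary-cluster> -n busybox-workloads get events --sort-by=.lastTimestamp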


Actual results:
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               5m41s                 default-scheduler        Successfully assigned busybox-workloads/busybox-2 to prsurve-vm-dev-88rv4-worker-vh6hb
  Normal   CreateResource          5m41s                 subscription             Synchronizer created resource busybox-2 of gvk:/v1, Kind=Pod
  Normal   SuccessfulAttachVolume  5m40s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c1eb1ab8-850f-43f5-b2f9-0982900f9326"
  Warning  FailedMount             84s (x2 over 3m38s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-c8fvj]: timed out waiting for the condition
  Warning  FailedMount             83s (x10 over 5m37s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-c1eb1ab8-850f-43f5-b2f9-0982900f9326" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd5 but could not correct them: fsck from util-linux 2.32.1
/dev/rbd5 contains a file system with errors, check forced.
/dev/rbd5: Resize inode not valid.  

/dev/rbd5: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
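
For reference only (not a suggested recovery procedure), the manual check the message asks for might look like the following on the node where the image is mapped, with the filesystem unmounted; the device name is taken from the events above:

# rbd showmapped              # confirm which image is mapped as /dev/rbd5
# e2fsck -fn /dev/rbd5        # read-only check, makes no changes
# e2fsck -fy /dev/rbd5        # attempt repair only if that is acceptable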



Expected results:
rbd image mounting should not fail


Additional info:
I tried deleting the pod, but it did not help; the pod is still in ContainerCreating after the restart.

Comment 3 Elad 2021-11-09 15:08:01 UTC
Proposing as a blocker since the consequences are quite severe

Comment 8 Eric Sandeen 2021-11-11 15:14:40 UTC
Orit asked me to look at this from the ext4 / local filesystem perspective.

From what I can see so far, it looks like the block device has been corrupted, overwritten, or damaged in some way such that e2fsck cannot even recognize an ext4 filesystem on it.  I don't see any indication that this is an ext4 or e2fsck bug at this point.

Thanks,
-Eric

Comment 9 Eric Sandeen 2021-11-11 16:45:07 UTC
I'll note that the entire inode table in block group 0 seems to be zeroed out on disk, which is why it can't validate any of the critical inodes (resize, root, journal):

# dumpe2fs  csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export  | grep -A8 "Group 0:"
dumpe2fs 1.45.6 (20-Mar-2020)
Group 0: (Blocks 0-32767) csum 0x1c82 [ITABLE_ZEROED]
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-1024
  Block bitmap at 1025 (+1025), csum 0xd2a19528
  Inode bitmap at 1041 (+1041), csum 0xc8748330
  Inode table at 1057-1568 (+1057)                                         <---- inode table
  23513 free blocks, 8180 free inodes, 2 directories, 8180 unused inodes
  Free blocks: 0-1026, 1041-1042, 1057-2080, 9249-32767
  Free inodes: 1, 3, 5-6, 8, 10, 12, 15, 21, 23, 25, 27, 29-30, 35, 39, 42, 44, 46, 48-8192

# dd if=csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export bs=4k skip=1057 count=512 |hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00200000
#
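
For what it's worth, the critical inodes can also be inspected directly in the same image export with debugfs; with the group 0 inode table zeroed as shown above, they should come back essentially empty:

# debugfs -R "stat <2>" csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export    # root inode
# debugfs -R "stat <8>" csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export    # journal inode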

Comment 10 Shyamsundar 2021-11-12 14:55:34 UTC
Notes from the DR Triage meeting (Nov, 12th Friday 8:00 AM Eastern TZ):

- Based on the one instance available, there is not much from RBD or DR orchestration that can be gleaned
- As per comment #8 and comment #9 there are no evident ext4 issues here (thanks @esandeen); the issue seems to be more related to the RBD snapshot that was promoted and was expected to be crash consistent (see the example commands after these notes)
- Current action is to reproduce this with additional RBD debug logging enabled on the cluster to understand if there is a code path that is causing this
  - For example failure to roll back a semi synced snapshot on a promote request
  - Marking this as NEEDINFO to QE for the same
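
For reference, once a reproducer with debug logging is available, the mirror/snapshot state of the affected image could be inspected from the toolbox with something like the following (pool and image names are placeholders):

# rbd mirror image status <pool>/<csi-vol-image>    # last mirror snapshot and sync state
# rbd snap ls --all <pool>/<csi-vol-image>          # mirror snapshots are listed under the 'mirror' namespace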

Comment 11 Shyamsundar 2021-11-13 14:20:37 UTC
Was attempting to reproduce this issue using a 1 minute RBD snapshot schedule and triggering failovers in a loop. Something like so,
Post initial deployment of the application/pod using Ramen/ACM:

1) Check pod running on desired cluster
if [[ "Running" == $(kubectl get pods -n busybox-sample busybox --context=west -o jsonpath='{.status.phase}') ]]; then <proceed>; fi
   * Timeout in 8 minutes with error if pod is not in running state by then

2) Try to reset the schedule in the primary (due to bz #2019161)

3) Wait for 2:15 minutes (for the schedules to catch up)

4) Wait for PeersReady condition in DRPC
  if [[ "True" == $(kubectl get drpc -n busybox-sample busybox-drpc --context hub -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}') ]]; then <proceed>; fi
   * Timeout in 8 minutes with error if condition is not met

5) Failover

- Repeat 1-5 from west->east

Unable to loop and perform the test as in step (2) due to the other BZ and the fact that RBD is not picking up the schedule addition within a tolerable time (at times it takes ~60 minutes or more). Leaving the intention here nevertheless, to enable quicker reproduction of the issue once the related bugs have a workaround or a fix as appropriate; a rough sketch of the loop follows.
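
A rough sketch of that loop, with the cluster contexts, namespace, and DRPC name carried over from the steps above as assumptions (step 2 is left as a placeholder because of the other BZ):

#!/bin/bash
# Failover ping-pong between 'west' and 'east' following steps 1-5 above.
set -eu

wait_for() {                      # wait_for <timeout-seconds> <command...>
  local timeout=$1; shift
  local start=$SECONDS
  until "$@"; do
    if (( SECONDS - start > timeout )); then
      echo "timed out waiting for: $*" >&2
      return 1
    fi
    sleep 10
  done
}

current=west
other=east
while true; do
  # 1) pod Running on the current primary (8 minute timeout)
  wait_for 480 bash -c "[[ \$(kubectl --context $current -n busybox-sample get pod busybox -o jsonpath='{.status.phase}') == Running ]]"
  # 2) reset the RBD snapshot schedule on the primary here (bz #2019161) - command intentionally omitted
  # 3) wait for the schedules to catch up
  sleep 135
  # 4) PeerReady condition on the DRPC (8 minute timeout)
  wait_for 480 bash -c "[[ \$(kubectl --context hub -n busybox-sample get drpc busybox-drpc -o jsonpath='{.status.conditions[?(@.type==\"PeerReady\")].status}') == True ]]"
  # 5) fail over to the other cluster and swap directions
  kubectl --context hub -n busybox-sample patch drpc busybox-drpc --type merge \
    -p "{\"spec\":{\"action\":\"Failover\",\"failoverCluster\":\"$other\"}}"
  tmp=$current; current=$other; other=$tmp
done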

Comment 12 Deepika Upadhyay 2021-11-23 11:56:13 UTC
The logs are not enough to take a look at things from the rbd perspective now. Is there a test run with debug rbd = 30, rbd_mirror = 30, journaler = 30? If anyone is attempting to reproduce, please set these options before proceeding.
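
For anyone setting that up on ODF, one possible way to apply those debug levels from the rook-ceph toolbox is below (the toolbox deployment name and namespace are assumptions, and the scope may need to be narrowed from global to the relevant client):

$ kubectl -n openshift-storage exec deploy/rook-ceph-tools -- ceph config set global debug_rbd 30
$ kubectl -n openshift-storage exec deploy/rook-ceph-tools -- ceph config set global debug_rbd_mirror 30
$ kubectl -n openshift-storage exec deploy/rook-ceph-tools -- ceph config set global debug_journaler 30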

Comment 21 Mudit Agarwal 2022-05-31 09:17:01 UTC
No longer a blocker as it is not reproducible.

Comment 22 Mudit Agarwal 2022-07-05 13:12:27 UTC
Not a 4.11 blocker

Comment 29 Mudit Agarwal 2022-10-26 03:15:41 UTC
The last reproducer was not useful (and it was not the same as the original bug either); closing this bug as discussed in some of the previous comments.
Please reopen if we see this again, nothing to debug as of now.

Comment 35 Mudit Agarwal 2023-01-06 02:36:47 UTC
I created a clone for the new issue to avoid any confusion; let's track and update that issue in the cloned bug. Moving this one out of 4.12:
https://bugzilla.redhat.com/show_bug.cgi?id=2158664

Comment 36 Mudit Agarwal 2023-01-06 02:37:30 UTC
Subham, please check https://bugzilla.redhat.com/show_bug.cgi?id=2021460#c34

Comment 43 Subham Rai 2023-01-06 11:12:13 UTC
The BZ says `ODF version:- 4.9.0-230.ci`, but in the logs I see it is `v4.12.0-0`.

So, in that case: we check the logRotate configuration every 15 minutes and update the logs accordingly. The default rotation is triggered by either the periodicity (daily) or the maxLogSize (500M), whichever comes first.
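
If it helps, and assuming the rotation settings live in the CephCluster CR's logCollector section as in upstream Rook, the effective values could be checked with something like:

$ kubectl -n openshift-storage get cephcluster -o jsonpath='{.items[0].spec.logCollector}'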

