Bug 2021460
Summary: [RDR] Rbd image mount failed on pod saying rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd5 but could not correct them: fsck from util-linux 2.32.1

Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: ceph
Assignee: Ilya Dryomov <idryomov>
ceph sub component: RBD-Mirror
QA Contact: Elad <ebenahar>
Status: CLOSED DEFERRED
Docs Contact:
Severity: high
Priority: unspecified
CC: amagrawa, bniver, esandeen, idryomov, jespy, jstrunk, kramdoss, kseeger, mmuench, muagarwa, ocs-bugs, odf-bz-bot, olakra, owasserm, rar, srai, srangana, sunkumar, vashastr, ypadia
Version: 4.9
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Failover action reports RADOS block device image mount failed on the pod with RPC error `fsck`
Failing over a disaster recovery (DR) protected workload may result in pods not starting with volume mount errors that state the volume has file system consistency check (fsck) errors. This prevents the workload from failing over to the failover cluster.
Story Points: ---
Clone Of:
Clones: 2030746, 2158560, 2158664
Environment:
Last Closed: 2023-04-10 07:01:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2011326, 2030746
Description
Pratik Surve
2021-11-09 10:38:22 UTC
Proposing as a blocker since the consequences are quite severe.

Orit asked me to look at this from the ext4 / local filesystem perspective. From what I can see so far, it looks like the block device has been corrupted, overwritten, or damaged in some way such that e2fsck cannot even recognize an ext4 filesystem on it. I don't see any indication that this is an ext4 or e2fsck bug at this point.
Thanks, -Eric

I'll note that the entire inode table in block group 0 seems to be zeroed out on disk, which is why it can't validate any of the critical inodes (resize, root, journal):

# dumpe2fs csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export | grep -A8 "Group 0:"
dumpe2fs 1.45.6 (20-Mar-2020)
Group 0: (Blocks 0-32767) csum 0x1c82 [ITABLE_ZEROED]
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-1024
  Block bitmap at 1025 (+1025), csum 0xd2a19528
  Inode bitmap at 1041 (+1041), csum 0xc8748330
  Inode table at 1057-1568 (+1057)          <---- inode table
  23513 free blocks, 8180 free inodes, 2 directories, 8180 unused inodes
  Free blocks: 0-1026, 1041-1042, 1057-2080, 9249-32767
  Free inodes: 1, 3, 5-6, 8, 10, 12, 15, 21, 23, 25, 27, 29-30, 35, 39, 42, 44, 46, 48-8192

# dd if=csi-vol-13869cb6-411d-11ec-a8ee-0a580a830082.export bs=4k skip=1057 count=512 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00200000
#

Notes from the DR Triage meeting (Nov 12th, Friday 8:00 AM Eastern TZ):
- Based on the one instance there is not much from RBD or DR orchestration that can be gleaned.
- As per comment #8 and comment #9 there are no evident ext4 issues here (thanks @esandeen); the issue seems to be more related to the RBD snapshot that was promoted and was expected to be crash consistent.
- The current action is to reproduce this with additional RBD debug logging enabled on the cluster, to understand if there is a code path causing this, for example a failure to roll back a semi-synced snapshot on a promote request.
- Marking this as NEEDINFO to QE for the same.

Was attempting to reproduce this issue using a 1 minute RBD snapshot schedule and triggering failovers in a loop, something like the following, post initial deployment of the application/pod using Ramen/ACM (a scripted sketch of this loop follows below):

1) Check that the pod is running on the desired cluster:
   if [[ "Running" == $(kubectl get pods -n busybox-sample busybox --context=west -o jsonpath='{.status.phase}') ]]; then <proceed>; fi
   * Time out with an error in 8 minutes if the pod is not in the Running state by then.
2) Try to reset the schedule on the primary (due to bz #2019161).
3) Wait for 2:15 minutes (for the schedules to catch up).
4) Wait for the PeerReady condition in the DRPC:
   if [[ "True" == $(kubectl get drpc -n busybox-sample busybox-drpc --context hub -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}') ]]; then <proceed>; fi
   * Time out with an error in 8 minutes if the condition is not met.
5) Failover.
Repeat 1-5 from west->east.

Unable to loop and perform the test as in step (2) due to the other BZ and the fact that RBD is not picking up the schedule addition within a tolerance (at times it takes ~60 or more minutes). Leaving the intention here nevertheless, to enable quicker reproduction of the issue once related bugs have a workaround or a fix, as appropriate.

The logs are not enough to take a look at things from the rbd perspective now. Is there a test with debug rbd = 30, debug rbd_mirror = 30, debug journaler = 30? If anyone is attempting to reproduce, please set these options before proceeding.

No longer a blocker as it is not reproducible.
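To make the loop above easier to rerun once the related bugs have workarounds, here is a minimal bash sketch of steps 1-5. It reuses the west/east/hub contexts and the busybox-sample names from the comment; the DRPC patch used to trigger the failover and the `ceph config set` lines for the requested debug levels are assumptions about the Ramen DRPlacementControl and Ceph runtime-config interfaces, not commands taken from this bug.

```bash
#!/usr/bin/env bash
# Hedged sketch of the failover loop described in the comment above.
# Assumptions (not verified in this bug): the DRPC is failed over by patching
# spec.action/spec.failoverCluster, and RBD debug levels can be raised at
# runtime with 'ceph config set'.
set -euo pipefail

NS=busybox-sample
DRPC=busybox-drpc
TIMEOUT=480   # 8 minutes, as in steps 1 and 4

# Raise RBD debug logging before reproducing (assumed runtime-config form):
#   ceph config set client debug_rbd 30
#   ceph config set client debug_rbd_mirror 30
#   ceph config set client debug_journaler 30

wait_for() {   # wait_for <description> <command...>: poll until the command succeeds
    local desc=$1; shift
    local deadline=$((SECONDS + TIMEOUT))
    until "$@"; do
        (( SECONDS < deadline )) || { echo "timed out waiting for ${desc}" >&2; exit 1; }
        sleep 10
    done
}

pod_running() {   # step 1: busybox pod Running on the given cluster context
    [[ "Running" == "$(kubectl get pods -n "$NS" busybox --context="$1" \
        -o jsonpath='{.status.phase}')" ]]
}

peer_ready() {    # step 4: PeerReady condition on the DRPC, checked via the hub
    [[ "True" == "$(kubectl get drpc -n "$NS" "$DRPC" --context=hub \
        -o jsonpath='{.status.conditions[?(@.type=="PeerReady")].status}')" ]]
}

failover_to() {   # step 5: trigger a failover to the given cluster (assumed API shape)
    kubectl patch drpc -n "$NS" "$DRPC" --context=hub --type=merge \
        -p "{\"spec\":{\"action\":\"Failover\",\"failoverCluster\":\"$1\"}}"
}

from=west; to=east
while true; do
    wait_for "pod Running on ${from}" pod_running "$from"   # step 1
    # step 2: reset the RBD snapshot schedule on the primary (blocked by bz #2019161)
    sleep 135                                               # step 3: wait 2:15 minutes
    wait_for "PeerReady on ${DRPC}" peer_ready               # step 4
    failover_to "$to"                                        # step 5
    tmp=$from; from=$to; to=$tmp                             # repeat 1-5 in the other direction
done
```

The 10-second poll interval and the patch-based failover trigger are illustrative; adjust them to however failovers are actually driven in the test environment.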
Not a 4.11 blocker.

The last reproducer was not useful (and not the same as the original bug either), so closing this bug as discussed in some of the previous comments. Please reopen if we see this again; there is nothing to debug as of now.

I created a clone for the new issue to avoid any confusion; let's track and update that issue in the cloned bug. Moving this one out of 4.12: https://bugzilla.redhat.com/show_bug.cgi?id=2158664

Subham, please check https://bugzilla.redhat.com/show_bug.cgi?id=2021460#c34

I saw `ODF version:- 4.9.0-230.ci` in the bz, but in the logs I see `v4.12.0-0`. In that case, the logRotate configuration is checked every 15 minutes and the logs are updated accordingly; the default rotation is based on either the periodicity (daily) or maxLogSize (500M), whichever comes first.
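Regarding the log rotation defaults mentioned in the last comment, a small sketch of where they can be inspected. It assumes the rotation knobs are surfaced through the Rook CephCluster `spec.logCollector` block (periodicity and maxLogSize) and uses typical ODF names (`openshift-storage` / `ocs-storagecluster-cephcluster`); neither the field path nor the names are confirmed in this bug.

```bash
# Hypothetical check of the rotation defaults ("daily" periodicity, 500M max size)
# referenced in the comment above; the namespace and CR name are assumptions.
kubectl -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.logCollector}{"\n"}'

# Illustrative only: setting the documented defaults explicitly. In ODF the
# CephCluster is reconciled from the StorageCluster, so a direct patch like
# this may be reverted by the operator.
kubectl -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster \
    --type=merge \
    -p '{"spec":{"logCollector":{"enabled":true,"periodicity":"daily","maxLogSize":"500M"}}}'
```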