Bug 732169

Summary: Superblock hint for external superblock should be .....
Product: [Fedora] Fedora
Reporter: lejeczek <peljasz>
Component: kernel
Assignee: Eric Sandeen <esandeen>
Status: CLOSED UPSTREAM
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 14
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-09-26 18:52:11 UTC

Description lejeczek 2011-08-20 08:03:26 UTC
Description of problem:

After our R815 is shut down in an orderly fashion and any hard disk that is not part of the filesystem is removed, the filesystem fails to mount and fsck reports:

Superblock hint for external superblock should be 0xfd04

The journal for the failing filesystem is external and, again, not on a drive being removed; the journal device is an LVM2 device.

If we then put the removed drives back in, the filesystem mounts fine again.

If we leave the removed drives out, fsck fixes the problem and the filesystem mounts OK.


Version-Release number of selected component (if applicable):

2.6.35.13-92.fc14.x86_64

Actual results:
It seems the kernel and ext4 lose track of what is what when enumerating hard drives after something has been removed. Even if a drive irrelevant to the filesystem in question is replaced by another drive rather than just removed, the problem persists.

Comment 1 Eric Sandeen 2011-08-22 17:30:40 UTC
int e2fsck_fix_ext3_journal_hint(e2fsck_t ctx)
{
...

        uuid_unparse(sb->s_journal_uuid, uuid);
        journal_name = blkid_get_devname(ctx->blkid, "UUID", uuid);
        if (!journal_name)
                return 0;

        if (stat(journal_name, &st) < 0)
                return 0;

        if (st.st_rdev != sb->s_journal_dev) {
                clear_problem_context(&pctx);
                pctx.num = st.st_rdev;
                if (fix_problem(ctx, PR_0_EXTERNAL_JOURNAL_HINT, &pctx)) {

so it looks at the filesystem superblock for s_journal_uuid, and then asks blkid to get the device name containing that uuid.

it then stats the device, and checks whether it has the same device number as is stored in the superblock.

This does seem like a recipe for failure if devices are rearranged...  I'll try to ask Ted, this seems weird.

(but - you said if you switch one non-fs disk with another non-fs disk you get the same problem?  Perhaps they are still enumerated differently...)

Comment 2 Eric Sandeen 2011-08-22 17:35:26 UTC
How did mount fail?

this may be expected, sadly, if device numbers are rearranged.

journal_dev=devnum      When the external journal device's major/minor numbers
                        have changed, this option allows the user to specify
                        the new journal location.  The journal device is
                        identified through its new major/minor numbers encoded
                        in devnum.


This option could be used to specify the new device number after you have rearranged disks.
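
As a rough sketch of how a devnum value relates to major/minor numbers: the helpers below assume the 32-bit layout the kernel uses when encoding device numbers (low 8 bits of the minor, then 12 bits of major, then the remaining minor bits). All function names are made up for illustration, and the option string is formatted in decimal on the assumption that a plain integer is accepted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Pack major/minor into a 32-bit devnum (assumed kernel-style layout). */
uint32_t encode_devnum(unsigned major, unsigned minor)
{
    return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Extract the major number from a devnum in the same layout. */
unsigned devnum_major(uint32_t devnum)
{
    return (devnum & 0xfff00) >> 8;
}

/* Extract the minor number from a devnum in the same layout. */
unsigned devnum_minor(uint32_t devnum)
{
    return (devnum & 0xff) | ((devnum >> 12) & 0xfff00);
}

/*
 * Write "journal_dev=<devnum>" into buf.
 * Returns the string length, or -1 if buf is too small.
 */
int format_journal_dev_option(uint32_t devnum, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);

    n = snprintf(buf, len, "journal_dev=%u", (unsigned)devnum);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Under this layout, the 0xfd04 from the fsck message decodes to major 253, minor 4, and format_journal_dev_option(0xfd04, ...) yields journal_dev=64772, which would then be passed via mount -o.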

Comment 3 lejeczek 2011-09-09 15:14:22 UTC
Hi Eric,
Yes, it does fail in the same fashion when a non-fs drive is replaced with another non-fs drive.
In my case it's a hardware RAID, so I'd reckon that merely rearranging RAID devices, which similarly bear no relation to the failing filesystem, causes ext4 to fail.

It seems that using journal_dev at mount time is a way around the problem, but so is running fsck on the filesystem; journal_dev is just faster, as it does not do all the work fsck does. Used once at mount time, it heals the problem: it is not needed the next time and the FS mounts OK.

All Red Hat-derived distros seem to suffer from this problem. I have checked Oracle 6 and SL 6.1, but have not checked other distros.

Comment 4 Eric Sandeen 2011-09-09 15:22:48 UTC
lejeczek, I'm afraid this behavior is by design... rearranging devices does mess up the external journal device location.

Without a mount.extN mount helper to call blkid and look for the new location, I'm not sure how we could do this differently...
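
A very rough sketch of what the lookup step of such a helper might look like. No such helper exists in this thread; the function names are hypothetical, and instead of calling blkid this version assumes udev maintains a /dev/disk/by-uuid/ symlink for the journal device, which may not hold on every system.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical: build the udev by-uuid path for a journal UUID string.
 * Returns 0 on success, -1 if buf is too small.
 */
int journal_path_from_uuid(const char *uuid, char *buf, size_t len)
{
    int n;

    assert(uuid != NULL && buf != NULL);

    n = snprintf(buf, len, "/dev/disk/by-uuid/%s", uuid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical: look up the journal's current device number by UUID.
 * Returns 0 and fills *rdev on success, -1 if no block device is found.
 */
int journal_rdev_from_uuid(const char *uuid, dev_t *rdev)
{
    char path[256];
    struct stat st;

    assert(rdev != NULL);

    if (journal_path_from_uuid(uuid, path, sizeof(path)) < 0)
        return -1;
    if (stat(path, &st) < 0)        /* stat() follows the symlink */
        return -1;
    if (!S_ISBLK(st.st_mode))       /* only accept block devices */
        return -1;

    *rdev = st.st_rdev;
    return 0;
}
```

The device number found this way would then have to be handed to the kernel at mount time, for instance through the journal_dev option quoted in comment 2.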

Comment 5 lejeczek 2011-09-09 22:22:55 UTC
Sure, it's OK when there is an easy fix for a problem, as there is for this one.
If it is by design, then, whether by negligence or oversight, the mechanism has ended up somewhat dysfunctional. Surely this cannot be a goal set by logic; if intended, then only as a trade-off between whatever the designer(s) had at stake.

Enumeration of block devices has always seemed to be an Achilles heel of Linux; I have come across it before (492456).

Surely it would be great if this design could be rectified in the near future.

Should we mark it as not-a-bug, or should we leave it here as info for others?

Comment 6 Dave Jones 2011-09-26 18:52:11 UTC
I'd suggest bringing this up as a feature request upstream at linux-fsdevel.org

We wouldn't introduce something Fedora specific for this (especially in f14 at this stage), so it would have to have upstream buy-in anyway.