Bug 622576 - fsck.gfs2 segfaults if journals are missing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.1
Hardware: x86_64 Linux
Priority: urgent  Severity: urgent
Target Milestone: rc
Target Release: 6.0
Assigned To: Robert Peterson
QA Contact: Cluster QE
Keywords: ZStream
Depends On: 575968 620384 624689 624691
Blocks: 637699
Reported: 2010-08-09 15:20 EDT by Robert Peterson
Modified: 2011-05-19 08:53 EDT
Fixed In Version: cluster-3.0.12-24.el6
Doc Type: Bug Fix
Clone Of: 620384
Clones: 683104
Last Closed: 2011-05-19 08:53:30 EDT


Attachments
Final patch for 6.1 (18.36 KB, patch)
2010-08-18 13:57 EDT, Robert Peterson
fsck.gfs2 log while rebuilding journals (504.63 KB, application/x-gzip)
2011-03-07 10:51 EST, Nate Straz

Description Robert Peterson 2010-08-09 15:20:38 EDT
+++ This bug was initially created as a clone of Bug #620384 +++
This bug was cloned in order to crosswrite the patch for
bug #620384 to RHEL6.

Hello all,

CentOS 5.5
drbd --> lv --> gfs2
gfs2_tool 0.1.62 (built Mar 31 2010 07:34:45)

The filesystem was created in CentOS 5.2.

Running gfs2_fsck segfaults:

    [root@tweety-2 /]# gfs2_fsck -v /dev/mapper/vg1-data1
    Initializing fsck
    Initializing lists...
    jid=0: Looking at journal...
    jid=0: Journal is clean.
    jid=1: Looking at journal...
    jid=1: Journal is clean.
    jid=2: Looking at journal...
    jid=2: Journal is clean.
    jid=3: Looking at journal...
    jid=3: Journal is clean.
    jid=4: Looking at journal...
    jid=4: Journal is clean.
    jid=5: Looking at journal...
    jid=5: Journal is clean.
    Segmentation fault
    gfs2_fsck[5131]: segfault at 00000000000000f0 rip 000000000040aefa rsp 00007fffd02c2d50 error 4


A side effect is that it also changes the lock protocol defined for the file system to one that does not exist (fsck_dlm):

    [root@tweety-2 /]# mount /mounts
    /sbin/mount.gfs2: error mounting /dev/mapper/vg1-data1 on /mounts: No such file or directory

    GFS2: fsid=: Trying to join cluster "fsck_dlm", "tweety:gfs2-11"
    GFS2: can't find protocol fsck_dlm
    GFS2: fsid=: can't mount proto=fsck_dlm, table=tweety:gfs2-11, hostdata=
    GFS2: fsid=: Trying to join cluster "fsck_dlm", "tweety:gfs2-11"
    GFS2: can't find protocol fsck_dlm
    GFS2: fsid=: can't mount proto=fsck_dlm, table=tweety:gfs2-11, hostdata=
    GFS2: fsid=: Trying to join cluster "fsck_dlm", "tweety:gfs2-11"
    GFS2: can't find protocol fsck_dlm
    GFS2: fsid=: can't mount proto=fsck_dlm, table=tweety:gfs2-11, hostdata=


It gets restored with:

    [root@tweety-2 /]# gfs2_tool sb /dev/mapper/vg1-data1 proto lock_dlm
    You shouldn't change any of these values if the filesystem is mounted.

    Are you sure? [y/n] y

    current lock protocol name = "fsck_dlm"
    new lock protocol name = "lock_dlm"
    Done


    [root@tweety-2 /]# mount /mounts
    [root@tweety-2 /]#

    GFS2: fsid=: Trying to join cluster "lock_dlm", "tweety:gfs2-11"
    GFS2: fsid=tweety:gfs2-11.0: Joined cluster. Now mounting FS...
    GFS2: fsid=tweety:gfs2-11.0: jid=0, already locked for use
    GFS2: fsid=tweety:gfs2-11.0: jid=0: Looking at journal...
    GFS2: fsid=tweety:gfs2-11.0: jid=0: Done
    GFS2: fsid=tweety:gfs2-11.0: jid=1: Trying to acquire journal lock...
    GFS2: fsid=tweety:gfs2-11.0: jid=1: Looking at journal...
    GFS2: fsid=tweety:gfs2-11.0: jid=1: Done
    GFS2: fsid=tweety:gfs2-11.0: jid=2: Trying to acquire journal lock...
    GFS2: fsid=tweety:gfs2-11.0: jid=2: Looking at journal...
    GFS2: fsid=tweety:gfs2-11.0: jid=2: Done
    GFS2: fsid=tweety:gfs2-11.0: jid=3: Trying to acquire journal lock...
    GFS2: fsid=tweety:gfs2-11.0: jid=3: Looking at journal...
    GFS2: fsid=tweety:gfs2-11.0: jid=3: Done
    GFS2: fsid=tweety:gfs2-11.0: jid=4: Trying to acquire journal lock...
    GFS2: fsid=tweety:gfs2-11.0: jid=4: Looking at journal...
    GFS2: fsid=tweety:gfs2-11.0: jid=4: Done
    GFS2: fsid=tweety:gfs2-11.0: jid=5: Trying to acquire journal lock...
    GFS2: fsid=tweety:gfs2-11.0: jid=5: Looking at journal...
    GFS2: fsid=tweety:gfs2-11.0: jid=5: Done
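
The stuck `fsck_dlm` name can be illustrated with a minimal sketch. It assumes the checker blocks mounting by rewriting the `lock_` prefix of the protocol name to `fsck_` and undoes that when it exits; the helper names and the dictionary standing in for the superblock are hypothetical, and this is not the fsck.gfs2 source.

```python
def block_mounters(lockproto: str, block: bool) -> str:
    """Swap the 5-character prefix of the lock protocol name.

    Hypothetical helper: block=True turns "lock_dlm" into "fsck_dlm" so no
    matching protocol can be found at mount time; block=False turns it back.
    """
    old, new = ("lock_", "fsck_") if block else ("fsck_", "lock_")
    if lockproto.startswith(old):
        return new + lockproto[len(old):]
    return lockproto


def run_check(superblock: dict, check) -> None:
    """Run `check` with mounting blocked, then restore the protocol name."""
    superblock["lockproto"] = block_mounters(superblock["lockproto"], True)
    try:
        check()
    finally:
        # A Python exception still reaches this line. A segfault in a C
        # program would not, which would leave "fsck_dlm" in the superblock.
        superblock["lockproto"] = block_mounters(superblock["lockproto"], False)
```

Under that assumption, a checker that dies between the two rewrites leaves the renamed protocol behind, which matches the manual `gfs2_tool sb ... proto lock_dlm` repair shown above.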


I have saved all my data, but before destroying this GFS2, would any developer like me to assist in debugging?

Sincerely,

Theophanis Kontogiannis

--- Additional comment from rpeterso@redhat.com on 2010-08-02 09:52:28 EDT ---

This may be a bug that I've previously found and fixed.  Can
you try (at your own risk) the fsck.gfs2 on my people page?

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2

If that doesn't work, please save off your file system metadata
with gfs2_edit savemeta and post it somewhere (private) where
I can download it and recreate the problem.

--- Additional comment from theophanis_kontogiannis@yahoo.gr on 2010-08-02 10:10:24 EDT ---

Hello Bob,

No change with the mentioned fsck.

[root@tweety-2 ~]# ./fsck.gfs2 -v /dev/mapper/vg1-data1 
Initializing fsck
Initializing lists...
jid=0: Looking at journal...
jid=0: Journal is clean.
jid=1: Looking at journal...
jid=1: Journal is clean.
jid=2: Looking at journal...
jid=2: Journal is clean.
jid=3: Looking at journal...
jid=3: Journal is clean.
jid=4: Looking at journal...
jid=4: Journal is clean.
jid=5: Looking at journal...
jid=5: Journal is clean.
Segmentation fault
fsck.gfs2[7436]: segfault at 00000000000000f8 rip 0000000000410ea3 rsp 00007fff0cd9de80 error 4

It again changed the lock proto:

GFS2: fsid=: Trying to join cluster "fsck_dlm", "tweety:gfs2-11"
GFS2: can't find protocol fsck_dlm
GFS2: fsid=: can't mount proto=fsck_dlm, table=tweety:gfs2-11, hostdata=

I am sending you an e-mail with the link to the metadata.

No worries about my data. I have backed all of it up, so we can do whatever we like with this file system.

BR
TK

--- Additional comment from rpeterso@redhat.com on 2010-08-04 15:29:10 EDT ---

I received a copy of Theophanis's metadata, restored it and
recreated the problem using my latest and greatest code.  The
problem seems to be that there are two journals mysteriously
deleted from the jindex.  The code that looks at the integrity
of the journals is apparently unable to cope with missing journals.
In this case there are supposed to be ten journals (journal0
through journal9) but journal6 and journal7 are gone for some reason.

I changed the code so that it just skips over the missing journals
but it encounters another problem in pass1 when it tries to
recreate them.  I'm investigating that now and hopefully it will
be easy to figure out.
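
The "skip over the missing journals" change can be sketched roughly as follows. This is only an illustration in Python, not the patch itself: the list of journal slots, the use of `None` for a missing journal, and the function name are all assumptions made for the example.

```python
def check_journals(journals):
    """Check every journal slot, skipping slots whose inode is missing.

    `journals` is a hypothetical list indexed by journal id; a missing
    journal is represented as None. Returns (checked, missing) id lists.
    """
    checked, missing = [], []
    for jid, inode in enumerate(journals):
        if inode is None:
            # Using this slot as if it were valid is the kind of access
            # that crashes; record it for later rebuilding instead.
            missing.append(jid)
            continue
        checked.append(jid)  # placeholder for the real journal check
    return checked, missing


# Ten slots with journal6 and journal7 gone, as in the reported file system.
slots = [object() for _ in range(10)]
slots[6] = slots[7] = None
checked, missing = check_journals(slots)
print(missing)  # [6, 7]
```

In the reported transcripts the crash comes right after jid=5 is reported clean, which is consistent with journal6 being the first missing slot.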

It should be noted that this file system has a block size of 512
bytes (1/2K) which I believe is an unsupported configuration.
Normal journals are driven to 4 levels of metadata indirection!
So far I haven't run into any code that can't deal with this block
size, and the missing journal problem would still be there even if
the block size was bigger, so that's not impacting me at the moment.
It might, however, have contributed to the fact that those two
journals are missing (under investigation as well).  It would be
helpful to know if Theophanis knows how the journals went missing.
Was the file system created with ten journals or fewer, and
gfs2_jadd run?
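
The four levels of indirection can be checked with a little arithmetic. The sketch below assumes a 232-byte on-disk dinode, a 24-byte metadata header, 8-byte block pointers, and a 128 MB journal; none of these figures is stated in this report, so treat them as assumptions.

```python
DINODE_SIZE = 232       # assumed size of the on-disk dinode header
META_HEADER_SIZE = 24   # assumed size of the metadata block header
POINTER_SIZE = 8        # 64-bit block pointers


def metadata_height(file_bytes: int, block_size: int) -> int:
    """Return the number of pointer levels needed to address file_bytes."""
    blocks_needed = -(-file_bytes // block_size)  # ceiling division
    dinode_ptrs = (block_size - DINODE_SIZE) // POINTER_SIZE
    indirect_ptrs = (block_size - META_HEADER_SIZE) // POINTER_SIZE
    height, reachable = 1, dinode_ptrs
    while reachable < blocks_needed:
        reachable *= indirect_ptrs
        height += 1
    return height


JOURNAL_BYTES = 128 * 1024 * 1024  # assumed journal size

print(metadata_height(JOURNAL_BYTES, 512))   # 4
print(metadata_height(JOURNAL_BYTES, 4096))  # 2
```

With these assumed sizes, a 512-byte block holds 35 pointers in the dinode and 61 per indirect block, so three levels reach only 35 x 61 x 61 = 130,235 blocks, short of the 262,144 blocks in a 128 MB journal.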

--- Additional comment from theophanis_kontogiannis@yahoo.gr on 2010-08-04 16:59:58 EDT ---

Hi Bob,

In fact until now I did not even know the journals were missing. 

After moving all my files off, I ran gfs2_fsck for no particular reason, and that is how I ended up filing this bug.

The file system was created from the beginning with ten journals, and no alterations of any kind were made throughout its lifecycle.

BR
TK

--- Additional comment from rpeterso@redhat.com on 2010-08-05 10:45:45 EDT ---

I've got a prototype that seems to be working properly.  I'm
testing it now.  Hopefully we can get this into 5.6.  I'm going
to try to figure out how the journals went missing.

--- Additional comment from rpeterso@redhat.com on 2010-08-06 23:56:23 EDT ---

Created an attachment (id=437297)
Preliminary patch

With this patch I was able to fix the broken file system.
This is still a preliminary patch and has not been tested properly.
Comment 1 Robert Peterson 2010-08-18 13:57:03 EDT
Created attachment 439456 [details]
Final patch for 6.1

Here is the final patch I'm hoping to push to the git repo.
It was tested on system roth-08.
Comment 2 Robert Peterson 2010-08-18 14:20:35 EDT
The patch was pushed to the master branch of the gfs2-utils
git repo, and the STABLE3 and RHEL6 branches of the cluster
git repo for inclusion into 6.1.  Changing status to POST
until we start doing 6.1 builds.
Comment 5 David Mair 2010-09-24 09:46:14 EDT
Upping the priority/severity on this.  This is ugly. We don't need this going out the door on 6.0.
Comment 11 Nate Straz 2011-03-04 17:59:21 EST
I came up with two scenarios I would like clarification on.

* removing multiple journals in the middle

  If I have a file system with five journals and I remove the middle three, fsck.gfs2 will only recreate one journal at a time.  I have to run fsck.gfs2 three times to get the journals all back.  This seems like a bug that should be fixed.

* removing the last journal

  If I remove the last journal, fsck.gfs2 will not try to recreate it.  Does GFS2 know how many journals should be in the file system besides the number of entries in the jindex directory?
Comment 12 Robert Peterson 2011-03-07 08:50:43 EST
In answer to comment #11:
fsck.gfs2 should probably recover multiple journals.  Do you
have output I can look at from this scenario where it didn't?
I'd just like to double-check that it didn't act up.

Regarding the removal of the last journal: It entirely depends
on how/why the journals were missing.

When fsck.gfs2 checks the journals, it goes by the jindex
directory and how many dirents it has, taking "." and ".." into
account.  If a journal is removed by manually deleting it
through the metafs, the jindex will be adjusted properly so
fsck.gfs2 won't know it ever existed.  There's no way for it to
know the journal was ever there.  We have a gfs2_jadd command
to add journals, but if there was a gfs2_jdel command, it would
do just that, and we wouldn't want fsck.gfs2 to assume the last
journal ever existed.  In other words, it doesn't "keep count" in
the superblock or anything.  It only goes by the jindex directory.

This patch was primarily created to recover situations where
journals disappear abnormally, not unlinked from the metafs.
In the original scenario, the journal was missing because a
much older version of fsck.gfs2 had detected corruption and
mistakenly tossed it into lost+found.

The need to recover journals this way should be rare, so I
don't think we should hold up the release of 6.1 because of it.
If you want, you can open a new bugzilla and we could add new
code to fsck.gfs2 that analyzes journal file names.
In other words, if it finds "journal0" and "journal5" we could
add the smarts to fill in the missing gap.
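
A rough sketch of that name-analysis idea, written in Python for illustration only. The function name and the plain list of directory entry names are hypothetical; fsck.gfs2 does not necessarily work this way.

```python
import re


def missing_journal_names(jindex_names):
    """Return journal names missing below the highest numbered journal."""
    ids = set()
    for name in jindex_names:
        match = re.fullmatch(r"journal(\d+)", name)
        if match:
            ids.add(int(match.group(1)))
    if not ids:
        return []
    # Only gaps below the highest surviving journal can be inferred.
    return [f"journal{i}" for i in range(max(ids) + 1) if i not in ids]


print(missing_journal_names([".", "..", "journal0", "journal5"]))
# ['journal1', 'journal2', 'journal3', 'journal4']
```

Note that this approach still cannot tell whether a journal past the highest surviving one ever existed, which is the "removing the last journal" case from comment #11.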
Comment 13 Nate Straz 2011-03-07 10:51:57 EST
Created attachment 482716 [details]
fsck.gfs2 log while rebuilding journals

Attached is the complete output of fsck.gfs2 from running it repeatedly until all of the journals were rebuilt.

The interesting parts are probably these lines:

Initializing fsck
File system journal "journal1" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Error: resource group 17 (0x11): free space (0) does not match bitmap (3)
(3 blocks were reclaimed)
The rgrp was fixed.
RGs: Consistent: 799   Inconsistent: 1   Fixed: 1   Total: 800
Starting pass1
Invalid or missing journal1 system inode (should be 4, is 0).
Rebuilding system file "journal1"
Pass1 complete
...
Initializing fsck
File system journal "journal2" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal2 system inode (should be 4, is 0).
Rebuilding system file "journal2"
Pass1 complete
...
Initializing fsck
File system journal "journal3" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal3 system inode (should be 4, is 0).
Rebuilding system file "journal3"
Pass1 complete
Comment 14 Nate Straz 2011-03-08 10:13:51 EST
I split the multiple-journal rebuild issue out to bug 683104 and am marking this bug as VERIFIED.
Comment 15 errata-xmlrpc 2011-05-19 08:53:30 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html