Bug 338651 - GFS2: Not all metadata is revoked that should be
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Steve Whitehouse
QA Contact: Cluster QE
Depends On: 236099
Reported: 2007-10-18 15:33 EDT by Robert Peterson
Modified: 2011-05-12 06:31 EDT (History)

Doc Type: Bug Fix
Last Closed: 2011-05-12 06:31:56 EDT

Attachments:
saved metadata for the corrupt file system (444.06 KB, application/octet-stream)
2007-10-18 15:43 EDT, Robert Peterson
Possible fix (1018 bytes, patch)
2008-01-07 19:30 EST, Robert Peterson

Description Robert Peterson 2007-10-18 15:33:08 EDT
Description of problem:

I've been working on my gfs2_fsck bug (bug #291551) to replay journals, and
therefore testing the sanity of journal replay in the gfs2 kernel as well.
Besides unit-testing the code, I've been doing tests where I:

(1) Mount a gfs2 file system, in this case /dev/sdb1
(2) Start all kinds of I/O
(3) Power-fence the node
(4) Use gfs2_edit's savemeta feature to save the metadata "as is"
(5) Verify everything was saved as it should be
(6) Restore the saved metadata to another device, in this case /dev/sdb2
(7) Run gfs2_fsck on the duplicate (/dev/sdb2)
(8) Check the sanity of the decisions made
(9) Run gfs2_fsck on the original
(10) Make sure they give the same results

In doing this, today I came across an interesting problem.

In this case, the journal was replayed.  The journal contained a log
descriptor with a directory inode "/roth-01" that had two leafs, and
one of those leafs contained a pointer to a given file inode (among
others) for file "/roth-01/laio_rwdirect".  I can illustrate it this way:

0x8166 (dir inode) "/roth-01"
  |---> leaf #1: 0x816b
  |---> leaf #2: 0x8167
                  |---> file inode 0xcd69 "/roth-01/laio_rwdirect"

Later, the journal had a revoke for that file inode ("laio_rwdirect"),
but not the source dir inode block ("/roth-01") nor either directory
leaf, one of which pointed to the file.

When the journal was replayed, the revoke canceled the file inode for
"laio_rwdirect", but since neither the directory inode nor its leafs
were revoked, the changes were still made to "/roth-01" and its leaf
blocks.  What I ended up with is a directory that pointed off into
never-never land.  That was correctly flagged as an error by gfs2_fsck:

[root@roth-01 ../cluster/gfs2/fsck]# ./gfs2_fsck /dev/sdb2
Initializing fsck
Recovering journals (this may take a while).
Journal #1 ("journal0") is dirty.  Okay to replay it? (y/n)y
jid=0: Replayed 0 of 0 journaled data blocks
jid=0: Replayed 7 of 10 metadata blocks

Journal recovery complete.
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Starting pass1
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Directory entry 'laio_rwdirect' at block 52585 (0xcd69) in dir inode 33126
(0x8166) has an invalid block type: 15.
Clear directory entry to non-inode block? (y/n) 

Both the directory leaf and the file inode are in the same log descriptor.
The inode is in a subsequent revoke log descriptor, but neither the directory
inode nor either leaf is revoked.  Here is the gfs2_edit breakdown of the
journal (excerpt from gfs2_edit -p journal0 /dev/sdb1), showing only the
active part.  I've verified that this is the ONLY real section of the journal
that is replayed and needs to be replayed:

Block #2533: Log header: Seq = 0x802a, tail = 0x250a, blk = 0x2533
Block #2534: Log descriptor, type 300 (Metadata) len:11, data1: 10
             0x0000cd6b   0x0000cd6a   0x0000cd69   0x0000805c   
             0x00000014   0x0000805d   0x00000011   0x00000013   
             0x00008166   0x00008167   
Block #253f: Log descriptor, type 301 (Revoke) len:1, data1: 11
             0x0000b9e2   0x0000b9e4   0x0000c9d5   0x0000c7d7   
             0x0000c5d9   0x0000c3db   0x0000c1dd   0x0000bfdf   
             0x0000bde1   0x0000bbe3   0x0000b9e3   
Block #2540: Log header: Seq = 0x802b, tail = 0x2534, blk = 0x2540
Block #2541: Log descriptor, type 301 (Revoke) len:1, data1: 3
             0x0000cd6b   0x0000cd6a   0x0000cd69   
Block #2542: Log header: Seq = 0x802c, tail = 0x2534, blk = 0x2542
--------- end of the journal, as witnessed by the following entry:
Block #2543: Log header: Seq = 0x1a4, tail = 0x0, blk = 0x2543

So journal block #2534 contains 10 metadata blocks:

             0x0000cd6b   0x0000cd6a   0x0000cd69   0x0000805c   
             0x00000014   0x0000805d   0x00000011   0x00000013   
             0x00008166   0x00008167   

Journal block #2541 contains revokes for only the first three of
these metadata blocks, one of which (0xcd69) is for "laio_rwdirect":

             0x0000cd6b   0x0000cd6a   0x0000cd69   

Blocks 0x8166 and 0x8167 were not revoked, but 0xcd69 was.
Therefore, gfs2_fsck replayed the other seven journal blocks and later it
correctly identified leaf 0x8167 pointing to an inode 0xcd69 that does
not exist.

My theory at this time is that all ten metadata blocks from
block #2534 should have been revoked in block #2541, since a subsequent
log header follows them, yet only three were.
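As a sanity check on the counts above, the replay decision can be modeled with a small sketch. This is a toy model, not gfs2_fsck's actual code; the rule "skip any metadata block named in a revoke" is a simplification, and the block numbers are taken from descriptors #2534 and #2541:

```python
# Toy model of journal replay: a metadata block from a log descriptor is
# skipped if a revoke names it; everything else is written back.
# Block numbers are from journal blocks #2534 (metadata) and #2541 (revokes).

metadata = [0xcd6b, 0xcd6a, 0xcd69, 0x805c, 0x0014,
            0x805d, 0x0011, 0x0013, 0x8166, 0x8167]
revoked = {0xcd6b, 0xcd6a, 0xcd69}

replayed = [b for b in metadata if b not in revoked]
print(f"Replayed {len(replayed)} of {len(metadata)} metadata blocks")
# → Replayed 7 of 10 metadata blocks

# The directory inode 0x8166 and leaf 0x8167 survive the replay even
# though the file inode 0xcd69 they point at was revoked:
print(0x8166 in replayed, 0x8167 in replayed, 0xcd69 in replayed)
# → True True False
```

The 7-of-10 count matches the "jid=0: Replayed 7 of 10 metadata blocks" line in the fsck output, and the surviving directory blocks are exactly the ones that end up pointing at the nonexistent inode.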

This is highly speculative, but perhaps those block buffers were
never unpinned, or else they had a bd_ail, and perhaps that caused them
to never be added to the gl_ail_list, which is later used to build
the revoke list.  Excerpt from gfs2_unpin:

        if (bd->bd_ail) {
                /* ... */
        } else {
                struct gfs2_glock *gl = bd->bd_gl;
                list_add(&bd->bd_ail_gl_list, &gl->gl_ail_list);
                /* ... */
        }
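To make that speculation concrete: if the excerpt's if-branch is taken for a buffer that already has a bd_ail, the list_add is skipped, the buffer never reaches gl_ail_list, and no revoke is ever built for it. Here is a toy Python sketch of that bookkeeping; the names mirror the kernel structures, but the logic is a guess at the failure mode, not the actual implementation:

```python
# Hypothetical model of the theory above: only buffers added to
# gl_ail_list at unpin time later get revokes.  A buffer that already
# has bd_ail set is skipped, so its revoke is silently lost.

class BufDesc:
    def __init__(self, block, bd_ail=None):
        self.block = block
        self.bd_ail = bd_ail       # non-None means "already on an AIL"

def unpin(bd, gl_ail_list):
    if bd.bd_ail is not None:
        pass                       # if-branch: no list_add happens
    else:
        gl_ail_list.append(bd)     # else-branch: queued, will be revoked

gl_ail_list = []
# Suppose the file inode unpins cleanly, but the directory inode and
# leaf already carry a bd_ail (purely illustrative block numbers/state):
for bd in (BufDesc(0xcd69), BufDesc(0x8166, bd_ail="ail"),
           BufDesc(0x8167, bd_ail="ail")):
    unpin(bd, gl_ail_list)

revokes = [bd.block for bd in gl_ail_list]
print([hex(b) for b in revokes])   # → ['0xcd69']
```

Under those (assumed) conditions, only 0xcd69 gets a revoke while 0x8166 and 0x8167 do not, which is exactly the pattern seen in journal block #2541.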

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
See steps given above
Actual results:
See gfs2_fsck output above

Expected results:
gfs2_fsck should come up clean.

Additional info:
Comment 1 Robert Peterson 2007-10-18 15:43:43 EDT
Created attachment 231441 [details]
saved metadata for the corrupt file system

This is a gfs2_edit savemeta of the file system in its corrupt state.
You can restore the metadata with commands similar to these:

cd /tmp/
bunzip2 gfsmeta.bz2
gfs2_edit restoremeta /tmp/gfsmeta /dev/sdb2
(given that /dev/sdb2 is a file system you want to overwrite with it)

Assuming, of course, that you have the latest gfs2_edit from the
HEAD branch of CVS.
Comment 2 Robert Peterson 2008-01-07 19:30:12 EST
Created attachment 291028 [details]
Possible fix

I got to thinking that if the code is missing revokes, it might also be
forgetting to release the bd's (buffer descriptors) associated with those
forgotten revokes and that might account for the oom problem with
bug #349271.

Digging around the code, I noticed that there was no log_lock protection
in the revoke_lo_before_commit function, unlike the others, and thought
that maybe introducing such a lock might solve this problem.
I was kind of hoping this would fix our oom problem as well, but
it didn't.

I haven't actually tried to recreate the revoke problem for this bz
with this patch, so I don't know if this patch is a good idea or not.
Comment 9 RHEL Product and Program Management 2010-12-07 04:48:52 EST
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update.

Contact your manager or support representative in case you need to escalate this bug.
Comment 10 Steve Whitehouse 2011-05-12 06:31:56 EDT
This might be a dup of bug #690555. It has been open for a very, very long time and I don't think that it is still relevant. We already have a bug open to track the split transactions issue, which is bug #236099, so I think this one is not something we need to keep open.

Please feel free to reopen if you disagree, but this looks like a historical artifact to me and something we might as well dispose of at this point in time.
