Description of problem:

I've been working on my gfs2_fsck journal replay bug (bug #291551), and therefore testing replay sanity in the gfs2 kernel as well. Besides unit-testing the code, I've been doing tests where I:

(1) Mount a gfs2 file system, in this case /dev/sdb1
(2) Start all kinds of I/O
(3) Power-fence the node
(4) Use gfs2_edit's savemeta feature to save the metadata "as is"
(5) Verify everything was saved as it should be
(6) Restore the saved metadata to another device, in this case /dev/sdb2
(7) Run gfs2_fsck on the duplicate (/dev/sdb2)
(8) Check the sanity of the decisions it made
(9) Run gfs2_fsck on the original
(10) Make sure they give the same results

In doing this, today I came across an interesting problem. In this case, the journal was replayed. The journal contained a log descriptor with a directory inode "/roth-01" that had two leaf blocks, and one of those leaves contained a pointer to a given file inode (among others) for file "/roth-01/laio_rwdirect". I can illustrate it this way:

    Block 0x8166 (dir inode) "/roth-01"
     |
     |---> leaf #1: 0x816b
     |---> leaf #2: 0x8167
            |
            |---> file inode 0xcd69 "/roth-01/laio_rwdirect"

Later, the journal had a revoke for that file inode ("laio_rwdirect"), but not for the directory inode block ("/roth-01") nor for either directory leaf, one of which pointed to the file. When the journal was replayed, the revoke canceled the file inode for "laio_rwdirect", but since neither the directory inode nor its leaves were revoked, the changes were still made to "/roth-01" and its leaf blocks. What I ended up with is a directory that pointed off into never-never land. That was correctly flagged as an error by gfs2_fsck:

    [root@roth-01 ../cluster/gfs2/fsck]# ./gfs2_fsck /dev/sdb2
    Initializing fsck
    Recovering journals (this may take a while).
    Journal #1 ("journal0") is dirty.  Okay to replay it? (y/n)y
    jid=0: Replayed 0 of 0 journaled data blocks
    jid=0: Replayed 7 of 10 metadata blocks
    Journal recovery complete.
    Validating Resource Group index.
    Level 1 RG check.
    (level 1 passed)
    Starting pass1
    Pass1 complete
    Starting pass1b
    Pass1b complete
    Starting pass1c
    Pass1c complete
    Starting pass2
    Directory entry 'laio_rwdirect' at block 52585 (0xcd69) in dir inode 33126 (0x8166) has an invalid block type: 15.
    Clear directory entry to non-inode block? (y/n)

In short: both the directory leaf and the inode are in the same metadata log descriptor. The inode is in a subsequent revoke log descriptor, but neither the directory inode nor either leaf is revoked.
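To make the replay rule concrete, here is a minimal userspace sketch of the decision that produces the "Replayed 7 of 10" result above. This is my own illustration, not the actual fsck or kernel code (is_revoked and the structures are hypothetical stand-ins): a block from a metadata descriptor is written back only when no revoke at the same or a later sequence number covers it.

    #include <stdint.h>
    #include <stdio.h>

    struct revoke { uint64_t blkno; unsigned int seq; };

    /* A block committed at sequence 'seq' is skipped if a revoke for
     * the same block number was committed at 'seq' or later. */
    static int is_revoked(const struct revoke *rv, int n,
                          uint64_t blkno, unsigned int seq)
    {
        int i;
        for (i = 0; i < n; i++)
            if (rv[i].blkno == blkno && rv[i].seq >= seq)
                return 1;
        return 0;
    }

    int main(void)
    {
        /* The 10 blocks of descriptor #2534, committed by the log
         * header at seq 0x802b (see the journal breakdown below) */
        uint64_t desc[] = { 0xcd6b, 0xcd6a, 0xcd69, 0x805c, 0x14,
                            0x805d, 0x11, 0x13, 0x8166, 0x8167 };
        /* The 3 revokes of descriptor #2541, committed at seq 0x802c */
        struct revoke rv[] = { { 0xcd6b, 0x802c },
                               { 0xcd6a, 0x802c },
                               { 0xcd69, 0x802c } };
        int i, replayed = 0;

        for (i = 0; i < 10; i++)
            if (!is_revoked(rv, 3, desc[i], 0x802b))
                replayed++;    /* would be written back to disk */
        printf("Replayed %d of 10 metadata blocks\n", replayed);
        return 0;
    }

Running this model prints "Replayed 7 of 10 metadata blocks", matching the fsck output: 0x8166 and 0x8167 are among the seven replayed, which is exactly how the dangling pointer to 0xcd69 gets written back.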
Here is the gfs2_edit breakdown of the journal (excerpt from gfs2_edit -p journal0 /dev/sdb1), but only the active part. I've verified that this is the ONLY real section of the journal that is replayed and needs to be replayed:

    Block #2533: Log header: Seq = 0x802a, tail = 0x250a, blk = 0x2533
    Block #2534: Log descriptor, type 300 (Metadata) len:11, data1: 10
                 0x0000cd6b 0x0000cd6a 0x0000cd69 0x0000805c 0x00000014
                 0x0000805d 0x00000011 0x00000013 0x00008166 0x00008167
    Block #253f: Log descriptor, type 301 (Revoke) len:1, data1: 11
                 0x0000b9e2 0x0000b9e4 0x0000c9d5 0x0000c7d7 0x0000c5d9
                 0x0000c3db 0x0000c1dd 0x0000bfdf 0x0000bde1 0x0000bbe3
                 0x0000b9e3
    Block #2540: Log header: Seq = 0x802b, tail = 0x2534, blk = 0x2540
    Block #2541: Log descriptor, type 301 (Revoke) len:1, data1: 3
                 0x0000cd6b 0x0000cd6a 0x0000cd69
    Block #2542: Log header: Seq = 0x802c, tail = 0x2534, blk = 0x2542
    --------- end of the journal, as witnessed by the following entry:
    Block #2543: Log header: Seq = 0x1a4, tail = 0x0, blk = 0x2543

So journal block #2534 contains 10 metadata blocks:

    0x0000cd6b 0x0000cd6a 0x0000cd69 0x0000805c 0x00000014
    0x0000805d 0x00000011 0x00000013 0x00008166 0x00008167

Journal block #2541 contains revokes for only the first three of these metadata blocks, one of which (0xcd69) is the inode for "laio_rwdirect":

    0x0000cd6b 0x0000cd6a 0x0000cd69

Blocks 0x8166 and 0x8167 were not revoked, but 0xcd69 was. Therefore, gfs2_fsck replayed the other seven journal blocks, and later it correctly identified leaf 0x8167 as pointing to an inode, 0xcd69, that does not exist.

My theory at this time is that all ten metadata blocks from block #2534 should have been revoked in block #2541, since a subsequent log header follows them and they were not. This is highly speculative, but perhaps those block buffers were never unpinned, or else they had bd_ail set, and perhaps that caused them never to be added to the gl_ail_list, which is later used to build the revoke list (a small model of this is sketched under Additional info below). Excerpt from gfs2_unpin:

	if (bd->bd_ail) {
		list_del(&bd->bd_ail_st_list);
		brelse(bh);
	} else {
		struct gfs2_glock *gl = bd->bd_gl;
		list_add(&bd->bd_ail_gl_list, &gl->gl_ail_list);
		atomic_inc(&gl->gl_ail_count);
	}

Version-Release number of selected component (if applicable):
RHEL51

How reproducible:
Unknown

Steps to Reproduce:
See steps given above

Actual results:
See gfs2_fsck output above

Expected results:
gfs2_fsck should come up clean.

Additional info:
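A minimal userspace model of the unpin theory above. This is my own illustration, not gfs2 kernel code (the struct and the flags are simplified stand-ins): a buffer that enters gfs2_unpin with bd_ail already set takes the first branch, never lands on gl_ail_list, and so would never produce a revoke if revokes are built only from that list.

    #include <stdio.h>

    /* Simplified stand-in for struct gfs2_bufdata */
    struct bufdata {
        unsigned long blkno;
        int bd_ail;            /* nonzero: already on an AIL list */
        int on_gl_ail_list;    /* set when queued on gl_ail_list */
    };

    /* Mirrors the two branches of the gfs2_unpin excerpt above */
    static void unpin(struct bufdata *bd)
    {
        if (bd->bd_ail) {
            /* first branch: buffer is released, never queued */
        } else {
            bd->on_gl_ail_list = 1;   /* second branch: revoke-eligible */
        }
    }

    int main(void)
    {
        struct bufdata blocks[] = {
            { 0xcd69, 0, 0 },  /* got a revoke in the real journal   */
            { 0x8166, 1, 0 },  /* theory: bd_ail set, revoke missed  */
            { 0x8167, 1, 0 },
        };
        int i;

        for (i = 0; i < 3; i++) {
            unpin(&blocks[i]);
            printf("0x%lx: %s\n", blocks[i].blkno,
                   blocks[i].on_gl_ail_list ?
                   "on gl_ail_list, revoke possible" :
                   "skipped, no revoke will be built");
        }
        return 0;
    }

If the theory holds, 0x8166 and 0x8167 would behave like the last two entries here: unpinned via the first branch and silently dropped from revoke generation.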
Created attachment 231441 [details]
saved metadata for the corrupt file system

This is a gfs2_edit savemeta of the file system in its corrupt state. You can restore the metadata with commands similar to these:

    cd /tmp/
    bunzip2 gfsmeta.bz2
    gfs2_edit restoremeta /tmp/gfsmeta /dev/sdb2

(given that /dev/sdb2 is a file system you want to overwrite with it)

This assumes, of course, that you have the latest gfs2_edit from the HEAD branch of CVS.
Created attachment 291028 [details]
Possible fix

I got to thinking that if the code is missing revokes, it might also be forgetting to release the bd's (buffer descriptors) associated with those forgotten revokes, and that might account for the OOM problem in bug #349271. Digging around the code, I noticed that there was no log_lock protection in the revoke_lo_before_commit function, unlike the other before_commit functions, and thought that maybe introducing such a lock might solve this problem. I was kind of hoping this would fix our OOM problem as well, but it didn't. I haven't actually tried to recreate the revoke problem for this bz with this patch, so I don't know whether this patch is a good idea or not.
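For reference, the shape of the change looks roughly like this. The real diff is in the attachment; the body below is a schematic sketch, not the actual function (gfs2_log_lock/gfs2_log_unlock and the sd_log_le_revoke list do exist in the gfs2 source of this era, but the loop body is elided):

	static void revoke_lo_before_commit(struct gfs2_sbd *sdp)
	{
		struct gfs2_log_element *le;

		gfs2_log_lock(sdp);	/* new: serialize against the log
					   lists, as other handlers do */
		list_for_each_entry(le, &sdp->sd_log_le_revoke, le_list) {
			/* ... write each revoke into the descriptor ... */
		}
		gfs2_log_unlock(sdp);	/* new */
	}

The idea is simply that the revoke list should not be walked while another context can add or remove entries under us.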
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.6, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
This might be a dup of bug #690555. It has been open for a very long time and I don't think it is still relevant. We already have a bug open to track the split-transactions issue (bug #236099), so I don't think this one is something we need to keep open. Please feel free to reopen if you disagree, but this looks like a historical artifact to me and something we might as well dispose of at this point.