Description of problem:

While working on a fix for bug #291551 I discovered a problem with gfs2_fsck. Basically, if gfs2_fsck makes changes to any of the system inodes, the changes get reverted later. The problem is that the code doesn't recognize system inodes when it makes changes to them; instead, it treats them as regular inodes. Later, when the program ends, it rewrites the system inodes from entries in memory which do NOT contain the changes made earlier. That means that gfs2_fsck can't fix inode problems with the system inodes.

Version-Release number of selected component (if applicable):

RHEL5, RHEL51

How reproducible:

Always

Steps to Reproduce:
1. Mount a gfs2 file system.
2. Start some gfs2 IO process that creates a bunch of files and/or directories in the root inode (a system inode). I used QE's d_io.
3. Power-fence the system while it's running.
4. When the system comes back up, run gfs2_fsck TWICE.

Actual results:

[root@roth-01 ../cluster/gfs2/fsck]# ./gfs2_fsck /dev/sdb1
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Recovering journals (this may take a while).
Journal recovery complete.
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
root inode 22 (0x16): Entries is 2 - should be 3
Fix entries for root inode 22 (0x16)? (y/n) y
Entries updated
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Link count inconsistent for inode 22 (0x16) has 2 but fsck found 3.
Update link count for inode 22 (0x16) ? (y/n) y
Link count updated for inode 22 (0x16)
Pass4 complete
Starting pass5
Pass5 complete
Writing changes to disk
gfs2_fsck complete

[root@roth-01 ../cluster/gfs2/fsck]# ./gfs2_fsck /dev/sdb1
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Recovering journals (this may take a while).
Journal recovery complete.
Starting pass1
Pass1 complete
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
root inode 22 (0x16): Entries is 2 - should be 3
Fix entries for root inode 22 (0x16)? (y/n) y
Entries updated
Pass2 complete
Starting pass3
Pass3 complete
Starting pass4
Link count inconsistent for inode 22 (0x16) has 2 but fsck found 3.
Update link count for inode 22 (0x16) ? (y/n) y
Link count updated for inode 22 (0x16)
Pass4 complete
Starting pass5
Pass5 complete
Writing changes to disk
gfs2_fsck complete

In this case, the first invocation of gfs2_fsck fixed the problem with the root inode, but the fixes were later reverted. The second invocation found the same problem, as would all future invocations.

Expected results:

The second gfs2_fsck run shouldn't find any errors with inode 22 (the root inode) because the first invocation should have fixed it.

Additional info:

Serious problem. Relatively easy to fix.
Requesting ack flags for inclusion in 5.2.
It's been a long time, so I thought I'd write an update on this problem.

First, the good news: by dummying up a file system with gfs2_edit, I was easily able to recreate this problem. I also debugged the problem, found the issue, and wrote a working prototype patch that fixes it.

Now the bad news: the code impacts all the places where fsck references dinodes, so it's pretty pervasive. In order to test the various code paths, I tried to run my tests from bug #382581. That's when I discovered that, unfortunately, the current version of gfs2_edit does not save the rindex system file, which means I couldn't recreate the proper conditions for the test again. (That doesn't invalidate the testing I did for bug #382581, however, because I used the same rindex file over and over and everything worked.)

So first, I had to make a few more changes to gfs2_edit so that it can save the rindex file properly. Second, I had to rewrite the test conditions. Rather than creating six file systems with varying states of corruption, I decided it made more sense to have two: (1) a clean file system with a bunch of EAs, and (2) a file system that has leukemia, with a wide range of corruption throughout the system; basically one occurrence of each possible type of corruption. So I set up my good file system, saved it with the new, improved gfs2_edit, then set about corrupting it in multiple ways to test code paths.

The test was a success. In fact, it was too much of a success. Not only did it find bugs in my patch to fix this problem, it also uncovered more bugs that are out there lurking in the depths, including bugs that likely exist in gfs1's gfs_fsck. For example, if you have an inode with an extended attribute that is really a duplicate of another inode's data block, fsck crashes. It seems to be a condition that neither gfs1's nor gfs2's fsck can handle properly: the duplicate-resolving code doesn't look at EAs.
I will likely open new bugzilla records to track the gfs2_edit fixes and these new bugs separately from this bug record. Unfortunately, I can't ship the code and mark this bug as modified until those code paths are tested, and I can't really test the code paths of my current fix until those issues are resolved.
Created attachment 271861 [details]
Patch to fix the problem

This patch fixes the problem for this bug record and I'm planning to ship it this way.

This isn't as bad as I previously thought. The only serious bug I found is the one mentioned in my previous comment, namely, that there are problems when gfs2_fsck encounters a block that is duplicated between a dinode and an extended attribute. I'm going to have to open that as a separate bug record, and I believe it exists in gfs1 as well. However, the odds of hitting this are very low, so I recommend deferring that fix to 5.3 or even later. For now, I've pulled that particular piece of corruption out of my test case, and gfs2_fsck seems to handle all the other corruption properly.

One notable exception is when an EA block is out of range, in which case the code doesn't know how to repair it, so the corruption is never fixed and gfs2_fsck will complain about it each time it runs. That could technically be opened as another bug record, but I don't have a good solution or a good thought on how to fix it. Maybe throw away the whole thing? At any rate, the code flags the error but doesn't crash. Again, that's probably a problem with gfs1 as well.

The attached patch allows gfs2_fsck to use either new buffers for dinodes or existing buffers where the dinode happens to be for a system file. In debugging the fix, this had the effect of keeping my buffer management code "honest". In other words, it uncovered several places where gfs2_fsck was mismanaging the buffers but getting away with it because the buffers were discarded. Since the buffers for system dinodes can't be discarded, I had to fix all these problems and manage the buffers properly.

In doing my testing, I did encounter another very serious and ugly bug that has a very simple one-line fix: basically, if duplicates are encountered at all, whole directories can be completely destroyed.
The fix is at the end of pass1.c:

-	b = osi_list_entry(tmp, struct blocks, list);
+	b = osi_list_entry(dup_list.next, struct blocks, list);

This fix is included in this patch, but it almost warrants its own bug. I guess I'll call that a management decision and get some opinions on it.

This patch has been extensively tested by virtue of my new test case, which puts gfs2_fsck through its paces (as evidenced by the number of bugs I shook out of it).
Created attachment 271881 [details]
Saved metadata for "control" fsck test

To run this test, copy the attached bz2 file into /tmp, then:

cd /tmp/
bunzip2 fscktest1.meta.bz2
gfs2_edit restoremeta /tmp/fscktest1.meta /dev/roth_vg/roth_lv
gfs2_fsck /dev/roth_vg/roth_lv

Your destination device (e.g. /dev/roth_vg/roth_lv) should be 72GB or bigger.
Created attachment 271901 [details]
Saved metadata for "leukemia" test

To run this test, copy the attached bz2 file into /tmp, then:

cd /tmp/
bunzip2 fscktest3.meta.bz2
gfs2_edit restoremeta /tmp/fscktest3.meta /dev/roth_vg/roth_lv
gfs2_fsck /dev/roth_vg/roth_lv

Your destination device (e.g. /dev/roth_vg/roth_lv) should be 72GB or bigger.
Correction: toward the end of comment #4, I mentioned a serious and ugly bug that has a simple one-line fix. The code actually appears at the end of pass1b.c (not pass1.c, as stated). That fix already exists in gfs1's gfs_fsck (shipped as part of bug #216203). Apparently it was cross-written to gfs2 incorrectly by yours truly when I did the fix for gfs2 bug #223506 back in January.
I tested on roth-01 using my newly designed "leukemia" test as described above and attached below. I committed this fix to the HEAD and RHEL5 branches of CVS for inclusion in RHEL5.2, so I'm changing the status to modified.

Other notes:
1. I opened bug #404611 to address the problem of blocks that are duplicated between data and extended attributes.
2. After getting management input, I'm shipping the one-line fix mentioned in comment #4 and comment #7 with this patch (as attached).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0350.html