Bug 683104

Summary: fsck.gfs2 only rebuilds one missing journal at a time
Product: Red Hat Enterprise Linux 6
Reporter: Nate Straz <nstraz>
Component: cluster
Assignee: Robert Peterson <rpeterso>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: low
Priority: low
Version: 6.1
CC: ccaulfie, cluster-maint, fdinitto, lhh, rpeterso, swhiteho, teigland
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: cluster-3.0.12.1-4.el6
Doc Type: Bug Fix
Doc Text:
Prior to this patch, the fsck.gfs2 program used the number of entries in the journal index to look for missing journals. As a result, if more than one journal was missing, they were not all rebuilt and subsequent runs of fsck.gfs2 were needed to recover all the journals. Since each node needs its own journal, code was added to fsck.gfs2 to use the "per_node" system directory to determine the correct number of journals to repair. As a result, fsck.gfs2 now repairs all the journals in one run.
Clone Of: 622576
Last Closed: 2011-12-06 14:51:03 UTC
Attachments:
Description                  Flags
Untested patch               none
Patch that works properly    none
Better patch                 none

Description Nate Straz 2011-03-08 15:12:51 UTC
+++ This bug was initially created as a clone of Bug #622576 +++

* removing multiple journals in the middle

  If I have a file system with five journals and I remove the middle three, fsck.gfs2 will only recreate one journal at a time.  I have to run fsck.gfs2 three times to get the journals all back.  This seems like a bug that should be fixed.

--- Additional comment from rpeterso on 2011-03-07 08:50:43 EST ---

In answer to comment #11:
fsck.gfs2 should probably recover multiple journals.  Do you
have output I can look at from this scenario where it didn't?
I'd just like to double-check that it didn't act up.

--- Additional comment from nstraz on 2011-03-07 10:51:57 EST ---

Created attachment 482716 [details]
fsck.gfs2 log while rebuilding journals

Attached is the complete output of fsck.gfs2 from running it repeatedly until all of the journals were rebuilt.

The interesting parts are probably these lines:

Initializing fsck
File system journal "journal1" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Error: resource group 17 (0x11): free space (0) does not match bitmap (3)
(3 blocks were reclaimed)
The rgrp was fixed.
RGs: Consistent: 799   Inconsistent: 1   Fixed: 1   Total: 800
Starting pass1
Invalid or missing journal1 system inode (should be 4, is 0).
Rebuilding system file "journal1"
Pass1 complete
...
Initializing fsck
File system journal "journal2" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal2 system inode (should be 4, is 0).
Rebuilding system file "journal2"
Pass1 complete
...
Initializing fsck
File system journal "journal3" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal3 system inode (should be 4, is 0).
Rebuilding system file "journal3"
Pass1 complete
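
The three runs above are consistent with the Doc Text explanation that the number of journal index entries bounded the search for missing journals. The following is a minimal Python sketch of that behaviour, not code from fsck.gfs2. It assumes the old check scanned journal numbers 0..N-1, where N is the number of entries currently in the journal index. The function names are hypothetical.

```python
def missing_by_index_count(present):
    """Assumed old-style check: only scan as many journal numbers as the
    journal index currently has entries, so higher gaps go unnoticed."""
    scanned = range(len(present))
    return [n for n in scanned if n not in present]


def simulate_runs(present):
    """Repeat the check, rebuilding whatever each run reports, until a run
    finds nothing. Returns the journals rebuilt in each run."""
    present = set(present)
    history = []
    while True:
        missing = missing_by_index_count(present)
        if not missing:
            return history
        present.update(missing)
        history.append(missing)


# Five journals with the middle three removed, as in the description.
print(simulate_runs({0, 4}))
```

Under this assumption the model rebuilds journal1, journal2 and journal3 in three separate runs, which matches the order seen in the log.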

Comment 1 Robert Peterson 2011-06-03 20:00:39 UTC
Created attachment 502895 [details]
Untested patch

I think this patch should do the trick, but I haven't taken
the time to test it yet.

Comment 2 Robert Peterson 2011-06-16 16:18:22 UTC
Created attachment 505093 [details]
Patch that works properly

The previous patch did not work for several reasons.  This one
works and is tested, and will most likely be shipped as is.

Comment 3 Robert Peterson 2011-06-16 16:23:28 UTC
The previously attached patch was tested on system gfs-i24c-01.
The test is as follows:

(1) I restore a metadata set I created that has journals 2-6 missing.
(2) I run the new fsck with -n to verify it doesn't crash or make changes.
    Due to the large amount of output, I redirect the output elsewhere.
(3) I run the new fsck with -y to verify it rebuilds all the journals
    and gives the proper return code of 1.
(4) I run the new fsck again to verify a second run finds no errors
    and gives a return code of 0.

Here are the testing results:

[root@gfs-i24c-01 ../gfs2/fsck]# gfs2_edit restoremeta /home/bob/metadata/gfs2/severaldeadjournals.meta /dev/sasdrives/bob 
File system size: 104792069 (0x63f0005) blocks, aka 399.768GB
There are 104857600 blocks of 4096 bytes in the destination device.

104857600 metadata blocks (100%) processed, 
File /home/bob/metadata/gfs2/severaldeadjournals.meta restore successful.
[root@gfs-i24c-01 ../gfs2/fsck]# ./fsck.gfs2 -n /dev/sasdrives/bob &> /tmp/gronk
[root@gfs-i24c-01 ../gfs2/fsck]# echo $?
4
[root@gfs-i24c-01 ../gfs2/fsck]# ./fsck.gfs2 -y /dev/sasdrives/bob
Initializing fsck
File system journal "journal2" is missing: pass1 will try to recreate it.
File system journal "journal3" is missing: pass1 will try to recreate it.
File system journal "journal4" is missing: pass1 will try to recreate it.
File system journal "journal5" is missing: pass1 will try to recreate it.
File system journal "journal6" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal2 system inode (should be 4, is 0).
Rebuilding system file "journal2"
Invalid or missing journal3 system inode (should be 4, is 0).
Rebuilding system file "journal3"
Invalid or missing journal4 system inode (should be 4, is 0).
Rebuilding system file "journal4"
Invalid or missing journal5 system inode (should be 4, is 0).
Rebuilding system file "journal5"
Invalid or missing journal6 system inode (should be 4, is 0).
Rebuilding system file "journal6"
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Pass4 complete      
Starting pass5
Pass5 complete      
The statfs file is wrong:

Current statfs values:
blocks:  104846384 (0x63fd430)
free:    104745764 (0x63e4b24)
dinodes: 35 (0x23)

Calculated statfs values:
blocks:  104846384 (0x63fd430)
free:    104581594 (0x63bc9da)
dinodes: 40 (0x28)
The statfs file was fixed.
Writing changes to disk
gfs2_fsck complete    
[root@gfs-i24c-01 ../gfs2/fsck]# echo $?
1
[root@gfs-i24c-01 ../gfs2/fsck]# ./fsck.gfs2 /dev/sasdrives/bob
Initializing fsck
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Pass4 complete      
Starting pass5
Pass5 complete      
gfs2_fsck complete    
[root@gfs-i24c-01 ../gfs2/fsck]# echo $?
0
[root@gfs-i24c-01 ../gfs2/fsck]#
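
Steps (2) through (4) of this test could be scripted roughly as follows. This is a hedged sketch, not part of the patch or the QE suite: the helper names and the injectable `run` parameter are hypothetical, and the expected exit codes are simply the ones observed in the session above (4 with -n, 1 with -y, 0 on the follow-up run). Step (1), the gfs2_edit restoremeta, is left out. The -y step writes to the device, so it should only be pointed at a scratch file system.

```python
import subprocess

# Exit codes observed in the session above; they also match the usual
# fsck convention (0 = clean, 1 = errors corrected, 4 = errors left).
FSCK_CLEAN = 0
FSCK_CORRECTED = 1
FSCK_UNCORRECTED = 4


def run_quiet(argv):
    """Run a command with its output discarded and return its exit code."""
    return subprocess.call(argv, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


def check_fsck_sequence(device, fsck="./fsck.gfs2", run=run_quiet):
    """Run fsck with -n, then -y, then no flags, and compare exit codes.

    Returns a list of (argv, expected, actual) for steps that disagreed;
    an empty list means the sequence behaved as in the comment.
    """
    steps = [
        (["-n"], FSCK_UNCORRECTED),  # read-only pass must report damage
        (["-y"], FSCK_CORRECTED),    # repair pass must fix everything
        ([], FSCK_CLEAN),            # follow-up pass must find nothing
    ]
    mismatches = []
    for flags, expected in steps:
        argv = [fsck, *flags, device]
        actual = run(argv)
        if actual != expected:
            mismatches.append((argv, expected, actual))
    return mismatches
```

Passing a stand-in for `run` allows the comparison logic to be exercised without touching a real device.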

Comment 4 Robert Peterson 2011-06-16 19:05:01 UTC
Created attachment 505130 [details]
Better patch

While doing additional testing I discovered a shortcoming of the
previous patch: If the per_node directory was missing and needed
to be built, fsck.gfs2 would crash because it was trying to
rebuild it too early (at a point where the rgrps were not read in).

This patch is able to handle that situation properly.  If the
per_node directory is missing and is rebuilt, fsck.gfs2 may
only build one journal during that run.
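
To illustrate the idea, here is a rough Python model of deriving the journal count from per_node, with a fallback when that directory is unavailable. It is a sketch under stated assumptions, not the patch itself: the function names are hypothetical, the entry names are only illustrative, and the fallback to the journal index size is an assumption made to reproduce the "one journal per run" behaviour described above.

```python
import re


def journals_from_per_node(per_node_names):
    """Count distinct trailing node numbers among per_node entry names.
    Returns None when the directory is missing or has no numbered entries."""
    if not per_node_names:
        return None
    nodes = set()
    for name in per_node_names:
        match = re.search(r"(\d+)$", name)
        if match:
            nodes.add(int(match.group(1)))
    return len(nodes) or None


def missing_journals(present, per_node_names):
    """List the journal numbers to rebuild in a single run."""
    expected = journals_from_per_node(per_node_names)
    if expected is None:
        # Assumed fallback: without per_node, only the index size is known.
        expected = len(present)
    return [n for n in range(expected) if n not in present]


# Hypothetical 8-journal file system with journals 2-6 missing.
per_node = ["statfs_change%d" % n for n in range(8)]
print(missing_journals({0, 1, 7}, per_node))
print(missing_journals({0, 1, 7}, None))
```

In this model, the first call reports journals 2 through 6 in one pass, while the second call (per_node missing) reports only journal 2.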

Comment 5 Robert Peterson 2011-06-16 19:29:38 UTC
This patch was pushed to the master branch of the gfs2-utils
git repository and the RHEL6 branch of the cluster.git
repository.  It was tested on system gfs-i24c-01 as described
in comment #3, plus another test where the per_node directory
was manually removed with gfs2_edit.  Changing status to POST
until we get this into a build.

Comment 8 Nate Straz 2011-08-08 15:54:04 UTC
Verified with gfs2-utils-3.0.12.1-7.el6.x86_64 that multiple journals are recovered in a single run.

Comment 9 Robert Peterson 2011-10-27 14:15:57 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Prior to this patch, the fsck.gfs2 program used the number of entries in the journal index to look for missing journals. As a result, if more than one journal was missing, they were not all rebuilt and subsequent runs of fsck.gfs2 were needed to recover all the journals.  Since each node needs its own journal, code was added to fsck.gfs2 to use the "per_node" system directory to determine the correct number of journals to repair.  As a result, fsck.gfs2 now repairs all the journals in one run.

Comment 10 errata-xmlrpc 2011-12-06 14:51:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html