Bug 683104 - fsck.gfs2 only rebuilds one missing journal at a time
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.1
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assigned To: Robert Peterson
QA Contact: Cluster QE
Reported: 2011-03-08 10:12 EST by Nate Straz
Modified: 2011-12-06 09:51 EST
CC: 7 users

Fixed In Version: cluster-3.0.12.1-4.el6
Doc Type: Bug Fix
Doc Text:
Prior to this patch, the fsck.gfs2 program used the number of entries in the journal index to look for missing journals. As a result, if more than one journal was missing, they were not all rebuilt and subsequent runs of fsck.gfs2 were needed to recover all the journals. Since each node needs its own journal, code was added to fsck.gfs2 to use the "per_node" system directory to determine the correct number of journals to repair. As a result, fsck.gfs2 now repairs all the journals in one run.
Clone Of: 622576
Last Closed: 2011-12-06 09:51:03 EST


Attachments
Untested patch (1.00 KB, patch), 2011-06-03 16:00 EDT, Robert Peterson
Patch that works properly (2.82 KB, patch), 2011-06-16 12:18 EDT, Robert Peterson
Better patch (4.39 KB, patch), 2011-06-16 15:05 EDT, Robert Peterson

Description Nate Straz 2011-03-08 10:12:51 EST
+++ This bug was initially created as a clone of Bug #622576 +++

* removing multiple journals in the middle

  If I have a file system with five journals and I remove the middle three, fsck.gfs2 will only recreate one journal at a time.  I have to run fsck.gfs2 three times to get the journals all back.  This seems like a bug that should be fixed.

--- Additional comment from rpeterso@redhat.com on 2011-03-07 08:50:43 EST ---

In answer to comment #11:
fsck.gfs2 should probably recover multiple journals.  Do you
have output I can look at from this scenario where it didn't?
I'd just like to double-check that it didn't act up.

--- Additional comment from nstraz@redhat.com on 2011-03-07 10:51:57 EST ---

Created attachment 482716 [details]
fsck.gfs2 log while rebuilding journals

Attached is the complete output of fsck.gfs2 while I run it until all of the journals are rebuilt.

The interesting parts are probably these lines:

Initializing fsck
File system journal "journal1" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Error: resource group 17 (0x11): free space (0) does not match bitmap (3)
(3 blocks were reclaimed)
The rgrp was fixed.
RGs: Consistent: 799   Inconsistent: 1   Fixed: 1   Total: 800
Starting pass1
Invalid or missing journal1 system inode (should be 4, is 0).
Rebuilding system file "journal1"
Pass1 complete
...
Initializing fsck
File system journal "journal2" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal2 system inode (should be 4, is 0).
Rebuilding system file "journal2"
Pass1 complete
...
Initializing fsck
File system journal "journal3" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal3 system inode (should be 4, is 0).
Rebuilding system file "journal3"
Pass1 complete
Comment 1 Robert Peterson 2011-06-03 16:00:39 EDT
Created attachment 502895 [details]
Untested patch

I think this patch should do the trick, but I haven't taken
the time to test it yet.
Comment 2 Robert Peterson 2011-06-16 12:18:22 EDT
Created attachment 505093 [details]
Patch that works properly

The previous patch did not work for several reasons.  This one
works and is tested, and will most likely be shipped as is.
Comment 3 Robert Peterson 2011-06-16 12:23:28 EDT
The previously attached patch was tested on system gfs-i24c-01.
The test is as follows:

(1) I restore a metadata set I created that has journals 2-6 missing.
(2) I run the new fsck with -n to verify it doesn't crash or make changes.
    Due to the large amount of output, I redirect the output elsewhere.
(3) I run the new fsck with -y to verify it rebuilds all the journals
    and gives the proper return code of 1.
(4) I run the new fsck again to verify a second run finds no errors
    and gives a return code of 0.

Here are the testing results:

[root@gfs-i24c-01 ../gfs2/fsck]# gfs2_edit restoremeta /home/bob/metadata/gfs2/severaldeadjournals.meta /dev/sasdrives/bob 
File system size: 104792069 (0x63f0005) blocks, aka 399.768GB
There are 104857600 blocks of 4096 bytes in the destination device.

104857600 metadata blocks (100%) processed, 
File /home/bob/metadata/gfs2/severaldeadjournals.meta restore successful.
[root@gfs-i24c-01 ../gfs2/fsck]# ./fsck.gfs2 -n /dev/sasdrives/bob &> /tmp/gronk
[root@gfs-i24c-01 ../gfs2/fsck]# echo $?
4
[root@gfs-i24c-01 ../gfs2/fsck]# ./fsck.gfs2 -y /dev/sasdrives/bob
Initializing fsck
File system journal "journal2" is missing: pass1 will try to recreate it.
File system journal "journal3" is missing: pass1 will try to recreate it.
File system journal "journal4" is missing: pass1 will try to recreate it.
File system journal "journal5" is missing: pass1 will try to recreate it.
File system journal "journal6" is missing: pass1 will try to recreate it.

Journal recovery complete.
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Invalid or missing journal2 system inode (should be 4, is 0).
Rebuilding system file "journal2"
Invalid or missing journal3 system inode (should be 4, is 0).
Rebuilding system file "journal3"
Invalid or missing journal4 system inode (should be 4, is 0).
Rebuilding system file "journal4"
Invalid or missing journal5 system inode (should be 4, is 0).
Rebuilding system file "journal5"
Invalid or missing journal6 system inode (should be 4, is 0).
Rebuilding system file "journal6"
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Pass4 complete      
Starting pass5
Pass5 complete      
The statfs file is wrong:

Current statfs values:
blocks:  104846384 (0x63fd430)
free:    104745764 (0x63e4b24)
dinodes: 35 (0x23)

Calculated statfs values:
blocks:  104846384 (0x63fd430)
free:    104581594 (0x63bc9da)
dinodes: 40 (0x28)
The statfs file was fixed.
Writing changes to disk
gfs2_fsck complete    
[root@gfs-i24c-01 ../gfs2/fsck]# echo $?
1
[root@gfs-i24c-01 ../gfs2/fsck]# ./fsck.gfs2 /dev/sasdrives/bob
Initializing fsck
Validating Resource Group index.
Level 1 rgrp check: Checking if all rgrp and rindex values are good.
(level 1 passed)
Starting pass1
Pass1 complete      
Starting pass1b
Pass1b complete
Starting pass1c
Pass1c complete
Starting pass2
Pass2 complete      
Starting pass3
Pass3 complete      
Starting pass4
Pass4 complete      
Starting pass5
Pass5 complete      
gfs2_fsck complete    
[root@gfs-i24c-01 ../gfs2/fsck]# echo $?
0
[root@gfs-i24c-01 ../gfs2/fsck]#
Comment 4 Robert Peterson 2011-06-16 15:05:01 EDT
Created attachment 505130 [details]
Better patch

While doing additional testing I discovered a shortcoming of the
previous patch: if the per_node directory was missing and needed
to be rebuilt, fsck.gfs2 would crash because it tried to rebuild
it too early, at a point where the rgrps had not yet been read in.

This patch is able to handle that situation properly.  If the
per_node directory is missing and is rebuilt, fsck.gfs2 may
only build one journal during that run.
Comment 5 Robert Peterson 2011-06-16 15:29:38 EDT
This patch was pushed to the master branch of the gfs2-utils
git repository and the RHEL6 branch of the cluster.git
repository.  It was tested on system gfs-i24c-01 as described
in comment #3, plus another test where the per_node directory
was manually removed with gfs2_edit.  Changing status to POST
until we get this into a build.
Comment 8 Nate Straz 2011-08-08 11:54:04 EDT
Verified that multiple journals are recovered at the same time with gfs2-utils-3.0.12.1-7.el6.x86_64.
Comment 9 Robert Peterson 2011-10-27 10:15:57 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Prior to this patch, the fsck.gfs2 program used the number of entries in the journal index to look for missing journals. As a result, if more than one journal was missing, they were not all rebuilt and subsequent runs of fsck.gfs2 were needed to recover all the journals.  Since each node needs its own journal, code was added to fsck.gfs2 to use the "per_node" system directory to determine the correct number of journals to repair.  As a result, fsck.gfs2 now repairs all the journals in one run.
Comment 10 errata-xmlrpc 2011-12-06 09:51:03 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html
