Bug 457557

Summary: GFS2 mount & fsck quit with consistency error
Product: Red Hat Enterprise Linux 5
Reporter: Theophanis Kontogiannis <theophanis_kontogiannis>
Component: kernel
Assignee: Robert Peterson <rpeterso>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: low
Version: 5.2
CC: adas, bmarzins, bstevens, cluster-maint, edamato, rpeterso, swhiteho
Target Milestone: rc
Target Release: 5.5
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-01-27 10:53:23 UTC

Attachments: Possible fix

Description Theophanis Kontogiannis 2008-08-01 14:13:07 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1

Description of problem:
GFS2: fsid=tweety:gfs2-00.0: fatal: filesystem consistency error
GFS2: fsid=tweety:gfs2-00.0:   inode = 4 25
GFS2: fsid=tweety:gfs2-00.0:   function = jhead_scan, file=fs/gfs2/recovery.c, line = 239


Version-Release number of selected component (if applicable):


How reproducible:
Always


Steps to Reproduce:
1. mount -t gfs2 /dev/<GFS2 filesystem> <mountpoint>
2. Observe kernel messages
3. Try gfs2_fsck /dev/<GFS2 filesystem>
4. Observe kernel messages

Actual Results:
GFS2: fsid=tweety:gfs2-00.0: fatal: filesystem consistency error
GFS2: fsid=tweety:gfs2-00.0:   inode = 4 25
GFS2: fsid=tweety:gfs2-00.0:   function = jhead_scan, file=fs/gfs2/recovery.c, line = 239

Expected Results:
Mount the filesystem
Fix the filesystem

Additional info:

Comment 1 Robert Peterson 2008-08-01 15:31:37 UTC
Analysis of journal0 indicates there are two log headers near the
wrap-around point of the log with the same sequence number.
Here is some output from gfs2_edit -p journal0:

0x73fb (j+73a0): Log header: Seq: 0xa9b7, tail: 0x6c30, blk: 0x73a0
0x73fc (j+73a1): Log header: Seq: 0xa9b8, tail: 0x73a1, blk: 0x73a1
0x73fd (j+73a2): Log header: Seq: 0xa9b8, tail: 0x7070, blk: 0x73a2
0x73fe (j+73a3): Log header: Seq: 0xa9b9, tail: 0x73a3, blk: 0x73a3
0x73ff (j+73a4): Log header: Seq: 0xa9ba, tail: 0x73a4, blk: 0x73a4
0x7570 (j+7515): Log descriptor, type 300 (Metadata) len:504, data1: 503
...
Note the two occurrences of sequence number 0xa9b8.

The two 0xa9b8 log headers have completely different contents, so I'm
thinking we need some kind of spin lock for the sequence number, or
make the source value in the sdp atomic.
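
As a purely illustrative sketch of that idea (hypothetical names only -- example_sdp and example_next_log_sequence are stand-ins, not real GFS2 code), the sequence counter would be read and advanced under a lock so that two concurrent log-header writes can never stamp the same value:

/* Hypothetical illustration only -- not the real GFS2 structures.
 * The point is simply that the "next sequence number" must be taken
 * and advanced under a lock (or kept as an atomic), so two concurrent
 * header writes cannot reuse the same value. */
#include <linux/spinlock.h>
#include <linux/types.h>

struct example_sdp {
	spinlock_t	log_lock;	/* assumed to guard all log state */
	u64		log_sequence;	/* next sequence number to stamp */
};

static u64 example_next_log_sequence(struct example_sdp *sdp)
{
	u64 seq;

	spin_lock(&sdp->log_lock);
	seq = sdp->log_sequence++;	/* hand out a unique, increasing value */
	spin_unlock(&sdp->log_lock);

	return seq;
}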

Unfortunately, gfs2_fsck does not try to fix duplicate sequence
numbers, and the file system is unmountable once this occurs.
I've given Theophanis instructions on how to repair this by hand
with gfs2_edit.  Since this is near the wrap-point, there are only
four log headers that need to have their sequence number bumped.

We probably need to fix this for both RHEL5.3 and Fedora.


Comment 3 RHEL Program Management 2008-08-01 15:50:34 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Robert Peterson 2008-08-08 17:39:10 UTC
Details of the two log headers with the same sequence number:

Block #29692    (0x73fc)      of 97255424 (0x5CC0000)  (log header)

Log Header:
  mh_magic              0x01161970                (hex)
  mh_type               8                         0x8
  mh_format             800                       0x320
  lh_sequence           43448                     0xa9b8
  lh_flags              0x00000000                (hex)
  lh_tail               29601                     0x73a1
  lh_blkno              29601                     0x73a1
  lh_hash               0xB60FE59B                (hex)

Block #29693    (0x73fd)      of 97255424 (0x5CC0000)  (log header)

Log Header:
  mh_magic              0x01161970                (hex)
  mh_type               8                         0x8
  mh_format             800                       0x320
  lh_sequence           43448                     0xa9b8
  lh_flags              0x00000000                (hex)
  lh_tail               28784                     0x7070
  lh_blkno              29602                     0x73a2
  lh_hash               0x19E47132                (hex)

Comment 5 Robert Peterson 2008-08-08 21:57:31 UTC
Hi Theophanis,

The errors described in the problem description indicate that corruption
was found in the journal during mount and/or gfs2_fsck.  That corruption
was introduced the last time data was written on that system.

Can you tell me what version of gfs2 you were running when this failure
occurred?  Was it the stock Fedora kernel?  Or gfs2-kmod compiled from
source?  Or Centos5.1, or RHEL5.2, etc.?

The file system obviously had some bad information put into the journal
so the corruption was apparently introduced while lots of data was being
written.  Can you give me more details about what was happening on the
system before the failure occurred?  It almost looks like the file system
was being unmounted while data was being poured into it or something like
that.  Or maybe the file system was unmounted immediately after a large
amount of data had been written.  (Which should be okay and not cause
any problems).

I can tell that no other nodes had written to the other nine journals.

Comment 6 Theophanis Kontogiannis 2008-08-12 13:08:11 UTC
Hi Bob,

The case happened with:
gfs2-utils-0.1.44-1.el5_2.1
2.6.18-92.1.6.el5.centos.plus-x86_64
drbd82-8.2.6-1.el5.centos
kmod-drbd82-8.2.6-1.2.6.18_92.1.6.el5.centos.plus

The module was from this kernel.

The file system is set up like that:

physical disk --> pv --> lv --> vg --> drbd --> gfs2


I cannot find any evidence of the exact system state at the moment the file system crashed. I believe, though, that it happened when DRBD lost its network connection, which resulted in "data in transit".

Comment 8 Theophanis Kontogiannis 2008-08-13 22:54:27 UTC
In addition, I believe you need the following information related to the gfs2 module:

[root@tweety-2 ~]# modinfo gfs2
filename:       /lib/modules/2.6.18-92.1.6.el5.centos.plus/kernel/fs/gfs2/gfs2.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System
srcversion:     B09BC266DD032D7FCEA51E5
depends:
vermagic:       2.6.18-92.1.6.el5.centos.plus SMP mod_unload gcc-4.1
parm:           scand_secs:The number of seconds between scand runs (uint)
module_sig:     883f3504863c71a95c7adb72c4c65e11273cf09e25e5ce3532f2b84d3f5ac7b5ad1043ea692c56409f5deea449c3bc6555cf3ef68bd1468b8a2f3586

Comment 10 Robert Peterson 2009-03-12 22:36:51 UTC
Believe it or not, I may have recreated this problem--twice--in gfs1!
I was testing my latest patch for bug #455696, when I got this error:

GFS: fsid=shell:gfs.0: fatal: filesystem consistency error
GFS: fsid=shell:gfs.0:   function = trans_go_xmote_bh
GFS: fsid=shell:gfs.0:   file = /home/bob/cluster/gfs-kernel/src/gfs/glops.c, line = 538
GFS: fsid=shell:gfs.0:   time = 1236892610
GFS: fsid=shell:gfs.0: about to withdraw from the cluster
GFS: fsid=shell:gfs.0: waiting for outstanding I/O

Examining the journal closely, I found two log headers in journal0
that have the same sequence number.  I'm hoping I can put some
instrumentation into my gfs file system code to show me what is the
root cause of this problem.

Comment 11 Steve Whitehouse 2009-03-13 10:41:16 UTC
Well I have one possible idea on how this can be happening, and if I'm right it can only happen at boot time. We have (in ops_fstype.c):

        error = init_threads(sdp, DO);
        if (error)
                goto fail_per_node;

        if (!(sb->s_flags & MS_RDONLY)) {
                error = gfs2_make_fs_rw(sdp);
                if (error) {
                        fs_err(sdp, "can't make FS RW: %d\n", error);
                        goto fail_threads;
                }
        }

So init_threads starts up gfs2_logd, which looks like it's going to try to do a journal flush pretty much right away. gfs2_make_fs_rw() does this:

       error = gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_SHARED, 0, &t_gh);
        if (error)
                return error;

        j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);

        error = gfs2_find_jhead(sdp->sd_jdesc, &head);
        if (error)
                goto fail;

        if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
                gfs2_consist(sdp);
                error = -EIO;
                goto fail;
        }

        /*  Initialize some head of the log stuff  */
        sdp->sd_log_sequence = head.lh_sequence + 1;
        gfs2_log_pointers_init(sdp, head.lh_blkno);


Now when we lock the transaction lock, we call trans_go_xmote_bh() which also does:

                if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
                        sdp->sd_log_sequence = head.lh_sequence + 1;
                        gfs2_log_pointers_init(sdp, head.lh_blkno);
                }

so in other words we are duplicating the work for some strange reason. I think a number of the items in gfs2_make_fs_rw() are simply duplicates and can be removed. The interesting thing is what happens if we have a log flush after the initial setting of sdp->sd_log_sequence in trans_go_xmote_bh() and before the second updating of this variable in gfs2_make_fs_rw(). It looks to me as if we'd end up resetting the log sequence after we'd already written a header.

Also, when we lock the transaction lock, we'll then be caching the glock so that unless there is a suspension of the filesystem, there will be no further state changes and thus no calls to this place where we might update the sequence number. There is only one other place where the log sequence is updated (aside from the ++ when we flush the log itself) and that is in gfs2_make_fs_rw() so by deduction, that would seem to be the obvious place to look.

Comment 12 Steve Whitehouse 2009-03-13 11:28:13 UTC
Looks like that might not be the whole story, but watch this space.... I'll try to come up with a patch.

Comment 13 Steve Whitehouse 2009-03-13 13:31:07 UTC
There seem to be a number of issues which might or might not be related to this, but ought to be fixed anyway.

On mount, we start up our various kernel threads, gfs2_logd and gfs2_quotad, neither of these should be trying to write to anything until we make the journal live. gfs2_logd seems to have no protection against being run early, and gfs2_quotad just assumes that the transaction will return -EROFS if it is run early.

When gfs2_quotad does get a -EROFS return, it's specifically ignored, suggesting that it can happen frequently. Also, due to the positioning of the test in gfs2_trans_begin(), it will have already tried to lock the transaction lock. It also sets GL_NOCACHE on the lock in this case.

It isn't at all clear why the test for the journal being live is made under the transaction lock, since it's local state and not something that would otherwise need to be synchronised across the cluster.

The duplication mentioned in comment #11 doesn't actually cause a problem in itself since the journal live flag prevents both sets of journal initialisation running. However when mixed with this particular issue, I wonder whether there is a possible cause here.
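
To make the early-run concern concrete, a minimal sketch of the kind of guard being described (hypothetical names -- the real code's flag and structure differ) might look like this:

/* Hypothetical sketch only -- not the real gfs2_logd.  The flag and
 * structure names are placeholders for whatever marks the journal as
 * live once mount has initialised it. */
#include <linux/bitops.h>

enum { EXAMPLE_SDF_JOURNAL_LIVE };

struct example_logd_sdp {
	unsigned long	sd_flags;
};

/* Until the journal has been made live during mount, the log daemon
 * has nothing valid to flush and must not write any log headers. */
static int example_logd_may_flush(struct example_logd_sdp *sdp)
{
	return test_bit(EXAMPLE_SDF_JOURNAL_LIVE, &sdp->sd_flags);
}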

Comment 14 Steve Whitehouse 2009-03-13 13:32:05 UTC
Created attachment 335089 [details]
Possible fix

Comment 15 Steve Whitehouse 2009-03-18 14:07:33 UTC
Looks like gfs2_logd can run during periods when we don't hold the transaction glock. That's obviously not right, but it is also quite a rare event. Once it has happened, though, it makes the journal unusable.

I'm working on a patch which will fix all the issues, but it's quite long and involved.

Comment 16 Robert Peterson 2009-04-20 21:45:50 UTC
FWIW, I went through my NFS/GFS file system corruption patch and
compared all my GFS code changes to the corresponding GFS2 code.
All but two of the code changes are unnecessary in GFS2 for various
reasons.  The two exceptions are: (1) In the GFS patch, I added
code to prevent the transaction glock from being released in function
clear_glock unless it's unmount time.  I'm not entirely sure if we
should do that in GFS2 or if it's adequately covered.  (2) In the
GFS patch, I added locking to the find_jhead code.  This would be
the equivalent of adding gfs2_log_lock to gfs2_find_jhead.  This
addition might just solve this bug but that is purely theoretical
and I haven't dug into the code enough to be certain.  We would have
to study the performance and lock order implications.

Theophanis, have you seen this problem again?  If so, do you have a
way to recreate the failure?  Are you willing to test a GFS2 patch?

Comment 17 Theophanis Kontogiannis 2009-04-25 19:21:23 UTC
Hi Bob and All,

I have not met the same problem ever again.
Also, in the meantime I have upgraded to 5.3 (less than a month ago now).

I am willing to do beta testing, but I will be able to do it sometime during May.

How can I be of help?

Thank you

Comment 19 Robert Peterson 2009-05-01 15:38:49 UTC
Hi Theophanis,

It looks like another person has hit this problem.  In their case,
the problem may have been caused by two nodes mounting the same
gfs2 file system with lock_nolock rather than lock_dlm.  Could this
be what happened to you?

In the meantime, I'm working on a new bugzilla to fix gfs2_fsck to
detect and fix the issue.  I'll add you to the cc list.

Comment 20 Robert Peterson 2009-05-04 20:09:03 UTC
As part of my (huge) patch for bug #455696, I decided we needed to
protect the gfs_find_jhead function from other processes writing
to the journal.  Steve Whitehouse: Do you think that could be what caused
Theophanis's corruption (i.e. log headers with the same sequence
number)?  The GFS change looked something like this, so should we
consider adding similar code to the gfs2_find_jhead function?

diff --git a/gfs-kernel/src/gfs/recovery.c b/gfs-kernel/src/gfs/recovery.c
index 5c2e8e1..2ecee04 100644
--- a/gfs-kernel/src/gfs/recovery.c
+++ b/gfs-kernel/src/gfs/recovery.c
@@ -24,6 +24,7 @@
 #include "glock.h"
 #include "glops.h"
 #include "lm.h"
+#include "log.h"
 #include "lops.h"
 #include "recovery.h"
 
@@ -251,6 +252,7 @@ gfs_find_jhead(struct gfs_sbd *sdp, struct gfs_jindex *jdesc,
 	seg1 = 0;
 	seg2 = jdesc->ji_nsegment - 1;
 
+	gfs_log_lock(sdp);
 	for (;;) {
 		seg_m = (seg1 + seg2) / 2;
 
@@ -262,6 +264,7 @@ gfs_find_jhead(struct gfs_sbd *sdp, struct gfs_jindex *jdesc,
 			error = verify_jhead(sdp, jdesc, gl, &lh1);
 			if (unlikely(error)) {
 				printk("GFS: verify_jhead error=%d\n", error);
+				gfs_log_unlock(sdp);
 				return error;
 			}
 			memcpy(head, &lh1, sizeof(struct gfs_log_header));
@@ -278,6 +281,7 @@ gfs_find_jhead(struct gfs_sbd *sdp, struct gfs_jindex *jdesc,
 			seg2 = seg_m;
 	}
 
+	gfs_log_unlock(sdp);
 	return error;
 }

Note: The gfs2 code has an additional return that would have to be
taken into account, of course.
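
For illustration, a hypothetical shape of the equivalent gfs2 change (every name below is a placeholder; the real change would take and release gfs2_log_lock() around the search in gfs2_find_jhead(), with that additional return path unlocking as well):

/* Hypothetical sketch only -- placeholder names throughout.  It just
 * shows the lock placement: the search runs with writers excluded from
 * the journal, and every exit path drops the lock. */
#include <linux/spinlock.h>

struct example_jhead_sdp {
	spinlock_t	log_lock;	/* stands in for the gfs2 log lock */
};

/* Placeholder for the existing binary search over the journal's log
 * headers; its logic is unchanged by the locking. */
static int example_scan_for_head(struct example_jhead_sdp *sdp);

static int example_find_jhead(struct example_jhead_sdp *sdp)
{
	int error;

	spin_lock(&sdp->log_lock);		/* keep writers out of the journal */

	error = example_scan_for_head(sdp);
	if (error) {
		spin_unlock(&sdp->log_lock);	/* the additional exit must unlock too */
		return error;
	}

	spin_unlock(&sdp->log_lock);
	return 0;
}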

Comment 22 Theophanis Kontogiannis 2009-12-12 21:34:28 UTC
Hello Bob and All,

Now I have upgraded to 5.4 (1.5 months ago now) and there has still been no occurrence of the bug.

I guess it has been permanently fixed?

Comment 23 Steve Whitehouse 2010-01-27 10:53:23 UTC
Looks like this has long since been fixed, so closing it. If it reappears, please reopen this bug.