Bug 457557
Summary: GFS2 mount & fsck quit with consistency error

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.2 |
| Hardware | x86_64 |
| OS | Linux |
| Severity | high |
| Priority | low |
| Status | CLOSED CURRENTRELEASE |
| Reporter | Theophanis Kontogiannis <theophanis_kontogiannis> |
| Assignee | Robert Peterson <rpeterso> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | adas, bmarzins, bstevens, cluster-maint, edamato, rpeterso, swhiteho |
| Target Milestone | rc |
| Target Release | 5.5 |
| Doc Type | Bug Fix |
| Last Closed | 2010-01-27 10:53:23 UTC |
Description
Theophanis Kontogiannis
2008-08-01 14:13:07 UTC
Analysis of journal0 indicates there are two log headers near the wrap-around point of the log with the same sequence number. Here is some output from gfs2_edit -p journal0:

```
0x73fb (j+73a0): Log header: Seq: 0xa9b7, tail: 0x6c30, blk: 0x73a0
0x73fc (j+73a1): Log header: Seq: 0xa9b8, tail: 0x73a1, blk: 0x73a1
0x73fd (j+73a2): Log header: Seq: 0xa9b8, tail: 0x7070, blk: 0x73a2
0x73fe (j+73a3): Log header: Seq: 0xa9b9, tail: 0x73a3, blk: 0x73a3
0x73ff (j+73a4): Log header: Seq: 0xa9ba, tail: 0x73a4, blk: 0x73a4
0x7570 (j+7515): Log descriptor, type 300 (Metadata) len:504, data1: 503
...
```

Note the two occurrences of sequence number 0xa9b8. The two 0xa9b8 log headers have completely different contents, so I'm thinking we need some kind of spin lock for the sequence number, or we need to make the source value in the sdp atomic. Unfortunately, gfs2_fsck does not try to fix duplicate sequence numbers, and the file system is unmountable once this occurs. I've given Theophanis instructions on how to repair this by hand with gfs2_edit. Since this is near the wrap-around point, there are only four log headers that need to have their sequence number bumped. We probably need to fix this for both RHEL 5.3 and Fedora.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Details of the two log headers with the same sequence number:

```
Block #29692 (0x73fc) of 97255424 (0x5CC0000) (log header)

Log Header:
  mh_magic      0x01161970 (hex)
  mh_type       8           0x8
  mh_format     800         0x320
  lh_sequence   43448       0xa9b8
  lh_flags      0x00000000 (hex)
  lh_tail       29601       0x73a1
  lh_blkno      29601       0x73a1
  lh_hash       0xB60FE59B (hex)

Block #29693 (0x73fd) of 97255424 (0x5CC0000) (log header)

Log Header:
  mh_magic      0x01161970 (hex)
  mh_type       8           0x8
  mh_format     800         0x320
  lh_sequence   43448       0xa9b8
  lh_flags      0x00000000 (hex)
  lh_tail       28784       0x7070
  lh_blkno      29602       0x73a2
  lh_hash       0x19E47132 (hex)
```

Hi Theophanis, The errors described in the problem description indicate that corruption was found in the journal during mount and/or gfs2_fsck. That corruption was introduced the last time data was written on that system. Can you tell me what version of gfs2 you were running when this failure occurred? Was it the stock Fedora kernel? Or gfs2-kmod compiled from source? Or CentOS 5.1, or RHEL 5.2, etc.? The file system obviously had some bad information put into the journal, so the corruption was apparently introduced while lots of data was being written. Can you give me more details about what was happening on the system before the failure occurred? It almost looks as if the file system was being unmounted while data was being poured into it, or something like that. Or maybe the file system was unmounted immediately after a large amount of data had been written (which should be okay and not cause any problems). I can tell that no other nodes had written to the other nine journals.

Hi Bob, The case happened with:

```
gfs2-utils-0.1.44-1.el5_2.1
2.6.18-92.1.6.el5.centos.plus-x86_64
drbd82-8.2.6-1.el5.centos
kmod-drbd82-8.2.6-1.2.6.18_92.1.6.el5.centos.plus
```

The module was from this kernel. The file system is set up like this: physical disk --> pv --> lv --> vg --> drbd --> gfs2. I cannot find any evidence of the exact system status at the moment the file system crashed.
I believe, though, that it happened when DRBD lost its network connection, which resulted in "data in transit".

In addition, I believe you need the following information about the gfs2 module:

```
[root@tweety-2 ~]# modinfo gfs2
filename:       /lib/modules/2.6.18-92.1.6.el5.centos.plus/kernel/fs/gfs2/gfs2.ko
license:        GPL
author:         Red Hat, Inc.
description:    Global File System
srcversion:     B09BC266DD032D7FCEA51E5
depends:
vermagic:       2.6.18-92.1.6.el5.centos.plus SMP mod_unload gcc-4.1
parm:           scand_secs:The number of seconds between scand runs (uint)
module_sig:     883f3504863c71a95c7adb72c4c65e11273cf09e25e5ce3532f2b84d3f5ac7b5ad1043ea692c56409f5deea449c3bc6555cf3ef68bd1468b8a2f3586
```

Believe it or not, I may have recreated this problem--twice--in gfs1! I was testing my latest patch for bug #455696 when I got this error:

```
GFS: fsid=shell:gfs.0: fatal: filesystem consistency error
GFS: fsid=shell:gfs.0:   function = trans_go_xmote_bh
GFS: fsid=shell:gfs.0:   file = /home/bob/cluster/gfs-kernel/src/gfs/glops.c, line = 538
GFS: fsid=shell:gfs.0:   time = 1236892610
GFS: fsid=shell:gfs.0: about to withdraw from the cluster
GFS: fsid=shell:gfs.0: waiting for outstanding I/O
```

Examining the journal closely, I found two log headers in journal0 that have the same sequence number. I'm hoping I can put some instrumentation into my gfs file system code to show me the root cause of this problem.

Well, I have one possible idea of how this can be happening, and if I'm right it can only happen at boot time.
We have (in ops_fstype.c):

```c
error = init_threads(sdp, DO);
if (error)
	goto fail_per_node;

if (!(sb->s_flags & MS_RDONLY)) {
	error = gfs2_make_fs_rw(sdp);
	if (error) {
		fs_err(sdp, "can't make FS RW: %d\n", error);
		goto fail_threads;
	}
}
```

So init_threads() starts up gfs2_logd, which looks like it is going to try a journal flush pretty much right away. gfs2_make_fs_rw() does this:

```c
error = gfs2_glock_nq_init(sdp->sd_trans_gl, LM_ST_SHARED, 0, &t_gh);
if (error)
	return error;

j_gl->gl_ops->go_inval(j_gl, DIO_METADATA);

error = gfs2_find_jhead(sdp->sd_jdesc, &head);
if (error)
	goto fail;

if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
	gfs2_consist(sdp);
	error = -EIO;
	goto fail;
}

/* Initialize some head of the log stuff */
sdp->sd_log_sequence = head.lh_sequence + 1;
gfs2_log_pointers_init(sdp, head.lh_blkno);
```

Now, when we lock the transaction lock, we call trans_go_xmote_bh(), which also does:

```c
if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
	sdp->sd_log_sequence = head.lh_sequence + 1;
	gfs2_log_pointers_init(sdp, head.lh_blkno);
}
```

So in other words we are duplicating the work for some strange reason. I think a number of the items in gfs2_make_fs_rw() are simply duplicates and can be removed.

The interesting thing is what happens if we have a log flush after the initial setting of sdp->sd_log_sequence in trans_go_xmote_bh() and before the second updating of this variable in gfs2_make_fs_rw(). It looks to me as if we'd land up resetting the log sequence after we'd already written a header. Also, when we lock the transaction lock, we'll then be caching the glock, so unless there is a suspension of the filesystem there will be no further state changes and thus no calls to this place where we might update the sequence number. There is only one other place where the log sequence is updated (aside from the ++ when we flush the log itself) and that is in gfs2_make_fs_rw(), so by deduction that would seem to be the obvious place to look.
Looks like that might not be the whole story, but watch this space.... I'll try to come up with a patch. There seem to be a number of issues which might or might not be related to this, but ought to be fixed anyway.

On mount, we start up our various kernel threads, gfs2_logd and gfs2_quotad; neither of these should be trying to write anything until we make the journal live. gfs2_logd seems to have no protection against being run early, and gfs2_quotad just assumes that the transaction will return -EROFS if it is run early. When gfs2_quotad does get a -EROFS return, it is specifically ignored, suggesting that it can happen frequently. Also, due to the positioning of the test in gfs2_trans_begin(), it will already have tried to lock the transaction lock. It also sets GL_NOCACHE on the lock in this case. It isn't at all clear why the test for the journal being live is made under the transaction lock, since it's local state and not something that would otherwise need to be synchronised across the cluster.

The duplication mentioned in comment #11 doesn't actually cause a problem in itself, since the journal-live flag prevents both sets of journal initialisation from running. However, when mixed with this particular issue, I wonder whether there is a possible cause here.

Created attachment 335089 [details]
Possible fix
Looks like gfs2_logd can run during periods when we don't hold the transaction glock. That's obviously not right, but it is also quite a rare event. Once it has happened, though, it makes the journal unusable. I'm working on a patch which will fix all the issues, but it's quite long and involved.

FWIW, I went through my NFS/GFS file system corruption patch and compared all my GFS code changes to the corresponding GFS2 code. All but two of the code changes are unnecessary in GFS2 for various reasons. The two exceptions are: (1) In the GFS patch, I added code to prevent the transaction glock from being released in function clear_glock unless it's unmount time. I'm not entirely sure whether we should do that in GFS2 or whether it's adequately covered. (2) In the GFS patch, I added locking to the find_jhead code. This would be the equivalent of adding gfs2_log_lock to gfs2_find_jhead. This addition might just solve this bug, but that is purely theoretical and I haven't dug into the code enough to be certain. We would have to study the performance and lock-order implications.

Theophanis, have you seen this problem again? If so, do you have a way to recreate the failure? Are you willing to test a GFS2 patch?

Hi Bob and All, I have not met the same problem ever again. Also, in the meantime I have upgraded to 5.3 (less than a month now). I am willing to do beta testing, but I will only be able to do it sometime during May. How can I be of help? Thank you.

Hi Theophanis, It looks like another person has hit this problem. In their case, the problem may have been caused by two nodes mounting the same gfs2 file system with lock_nolock rather than lock_dlm. Could this be what happened to you?
In the meantime, I'm working on a new bugzilla to fix gfs2_fsck to detect and fix the issue. I'll add you to the cc list.

As part of my (huge) patch for bug #455696, I decided we needed to protect the gfs_find_jhead function from other processes writing to the journal. Steve Whitehouse: Do you think that might be what caused Theophanis's corruption (i.e. log headers with the same sequence number)? The GFS change looked something like this, so should we consider adding similar code to the gfs2_find_jhead function?

```diff
diff --git a/gfs-kernel/src/gfs/recovery.c b/gfs-kernel/src/gfs/recovery.c
index 5c2e8e1..2ecee04 100644
--- a/gfs-kernel/src/gfs/recovery.c
+++ b/gfs-kernel/src/gfs/recovery.c
@@ -24,6 +24,7 @@
 #include "glock.h"
 #include "glops.h"
 #include "lm.h"
+#include "log.h"
 #include "lops.h"
 #include "recovery.h"
@@ -251,6 +252,7 @@ gfs_find_jhead(struct gfs_sbd *sdp, struct gfs_jindex *jdesc,
 	seg1 = 0;
 	seg2 = jdesc->ji_nsegment - 1;

+	gfs_log_lock(sdp);
 	for (;;) {
 		seg_m = (seg1 + seg2) / 2;
@@ -262,6 +264,7 @@ gfs_find_jhead(struct gfs_sbd *sdp, struct gfs_jindex *jdesc,
 		error = verify_jhead(sdp, jdesc, gl, &lh1);
 		if (unlikely(error)) {
 			printk("GFS: verify_jhead error=%d\n", error);
+			gfs_log_unlock(sdp);
 			return error;
 		}
 		memcpy(head, &lh1, sizeof(struct gfs_log_header));
@@ -278,6 +281,7 @@ gfs_find_jhead(struct gfs_sbd *sdp, struct gfs_jindex *jdesc,
 		seg2 = seg_m;
 	}

+	gfs_log_unlock(sdp);
 	return error;
 }
```

Note: The gfs2 code has an additional return that would have to be taken into account, of course.

Hello Bob and All, Now I have upgraded to 5.4 (1.5 months now) and still no occurrence of the bug. I guess it has been permanently fixed?

Looks like this has long since been fixed, so closing it. If it reappears, please reopen this bug.