Created attachment 358494 [details]
Cluster.conf file for an existing cluster that failed on Aug 13, 14 and 21st with these symptoms

Description of problem:
Somewhat infrequently, we have issues with general cluster instability. I cannot readily reproduce the problem, but it happens often enough (about every 2 months per cluster) that it has become a pretty major problem.

Version-Release number of selected component (if applicable):

How reproducible:
Eventually, this happens to just about every cluster we have, given enough time.

Steps to Reproduce:
1. Wait until the cluster fails (our clusters seem to fail pretty regularly, about once a month).

Actual results:
We get error messages like these from openais in the logs, then one or more nodes get fenced, or the cluster goes inquorate and completely fails:

Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] entering GATHER state from 2.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering GATHER state from 0.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Creating commit token because I am the rep.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Saving state aru e0d3 high seq received e0d3
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Storing new sequence id for ring 1e0
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering COMMIT state.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering RECOVERY state.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] position [0] member 10.55.32.88:
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] previous ring seq 476 rep 10.55.32.84
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] aru e0d3 high delivered e0d3 received flag 1
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Did not need to originate any messages in recovery.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Sending initial ORF token
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] New Configuration:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] 	r(0) ip(10.55.32.88)
Aug 14 18:00:48 qssimes02 clurgmgrd[6620]: <info> State change: qssimes01 DOWN
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Left:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] 	r(0) ip(10.55.32.84)
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Joined:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] New Configuration:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] 	r(0) ip(10.55.32.88)
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Left:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Joined:
Aug 14 18:00:48 qssimes02 openais[5958]: [SYNC ] This node is within the primary component and will provide service.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering OPERATIONAL state.
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] got nodejoin message 10.55.32.88
Aug 14 18:00:48 qssimes02 openais[5958]: [CPG  ] got joinlist message from node 2

Expected results:
Cluster recovers from the temporary network outage and stays up (that's what clustering is for, right?)

Additional info:
qssimes01> rpm -q openais cman
openais-0.80.3-22.el5_3.8
cman-2.0.98-1.el5_3.4
qssimes01> uname -a
Linux qssimes01 2.6.18-128.1.16.el5 #1 SMP Fri Jun 26 10:53:31 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
I'm also now noting this in our logs for this particular cluster; sorry I missed this before.

Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: fatal: invalid metadata block
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: bh = 5241860 (magic number)
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: function = gfs2_rgrp_bh_get, file = fs/gfs2/rgrp.c, line = 763
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: about to withdraw this file system
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: telling LM to withdraw
Aug 13 23:00:28 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: withdrawn
Aug 13 23:00:28 qssimes01 kernel:
Aug 13 23:00:28 qssimes01 kernel: Call Trace:
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8860552a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff800157c8>] sync_buffer+0x0/0x3f
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff800177eb>] ll_rw_block+0x8c/0xab
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8861867f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8861647e>] :gfs2:gfs2_rgrp_bh_get+0x113/0x261
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff88602b63>] :gfs2:glock_work_func+0x0/0xb8
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff88600fd1>] :gfs2:do_promote+0xad/0x137
Steve W: please comment on comment #1 regarding the withdraw.
Dave,

Please include full logs from both nodes so we may evaluate them. I see a withdraw occurred on Aug 13 23:00:28, but the config change from node 2 occurred 8 hours later. What happened during those 8 hours? Are times synced on the cluster?

Any core files in /var/lib/openais/*? If so, attach them.

Thanks,
-steve
Created attachment 358504 [details] logs from qssimes01 (/var/log all tarred up since we have a custom syslog.conf file)
Created attachment 358505 [details] /var/log from qssimes02
Thanks for the comments. I only pasted portions of the logs. In this case, the cluster crashed on 8/13 but also crashed 8 hours later on 8/14, so I confused you by pasting logs from different crashes into the comments. Hopefully, now that you have all my logs, things will be clearer.
The withdraw of comment #1 is caused by the fs attempting to read one of the resource groups on the fs and finding that the block is corrupt (either it's looking in the wrong place, or the block has been overwritten with data). In order to fix that, you'll need to unmount on all nodes and run fsck. If it's possible to save that metadata first, that would be useful for our analysis.

Some questions:
1. Was the filesystem grown at any point?
2. Was the filesystem created as gfs2 originally, or a conversion from gfs1?
3. What kind of activity was going on immediately before the error was discovered?

Also, if you could post the results of fsck here, that would be useful too.

The initial comment says "Cluster recovers from temporary network outage and stays up (that's what clustering is for right?)", which makes me wonder if the temporary outage is being caused by something external (someone plugging/unplugging network cables, or similar). Are these messages really appearing with no external cause?
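The unmount-and-fsck procedure described above might look like the following sketch. The device path and mount point are placeholders, not taken from this cluster's configuration:

```shell
# Stop the services using the filesystem, then unmount it on EVERY node
# (placeholder names -- substitute your own device and mount point):
umount /mnt/app01                # run on each cluster node

# From a single node, check and repair the filesystem. -y answers "yes"
# to every repair prompt; drop it to review each fix interactively.
fsck.gfs2 -y /dev/your/device | tee /tmp/fsck-app01.log
```

Capturing the output with tee makes it easy to attach the full fsck results to the bug afterwards.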
Thanks for getting back to me. Here are the answers to your questions:
1. No, never grown.
2. It was originally created as GFS2.
3. We are running a custom application from a consultant here. I can't categorize _EVERYTHING_, but there is an ISAM database and a bunch of scripts.

The fsck result was enormous and I wasn't expecting it to fill my terminal buffer, but this is some of the output (it was all very similar):

Block #9959441 (0x97f811) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #9959442 (0x97f812) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #9959443 (0x97f813) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #9959444 (0x97f814) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024962 (0x98f802) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024963 (0x98f803) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024964 (0x98f804) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024965 (0x98f805) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024966 (0x98f806) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090484 (0x99f7f4) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090485 (0x99f7f5) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090486 (0x99f7f6) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090487 (0x99f7f7) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090488 (0x99f7f8) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156006 (0x9af7e6) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156007 (0x9af7e7) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156008 (0x9af7e8) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156009 (0x9af7e9) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156010 (0x9af7ea) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221528 (0x9bf7d8) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221529 (0x9bf7d9) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221530 (0x9bf7da) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221531 (0x9bf7db) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221532 (0x9bf7dc) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287050 (0x9cf7ca) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287051 (0x9cf7cb) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287052 (0x9cf7cc) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287053 (0x9cf7cd) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287054 (0x9cf7ce) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352572 (0x9df7bc) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352573 (0x9df7bd) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352574 (0x9df7be) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352575 (0x9df7bf) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352576 (0x9df7c0) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418094 (0x9ef7ae) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418095 (0x9ef7af) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418096 (0x9ef7b0) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418097 (0x9ef7b1) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418098 (0x9ef7b2) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
(level 2 passed)
Starting pass1
40 percent complete.

So as to the "temporary network outages", I don't believe that's the cause here. That was a poor assumption on my part... sorry about that.
And here's another update. After running fsck on the filesystems and then restarting the whole cluster, we continue to have GFS2 problems with these types of error messages:

Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: fatal: invalid metadata block
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: bh = 6909402 (magic number)
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: function = ea_foreach_i, file = fs/gfs2/eattr.c, line = 80
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: about to withdraw this file system
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: telling LM to withdraw
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: withdrawn
Aug 26 16:30:42 qssimes01 kernel:
Aug 26 16:30:42 qssimes01 kernel: Call Trace:
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860752a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80064ae8>] __wait_on_bit+0x60/0x6e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff800157c8>] sync_buffer+0x0/0x3f
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8861a67f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88600557>] :gfs2:ea_foreach_i+0x57/0x12d
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860068c>] :gfs2:ea_foreach+0x5f/0x195
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88600e17>] :gfs2:ea_dealloc_unstuffed+0x0/0x2f8
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff886008fc>] :gfs2:gfs2_ea_dealloc+0x6c/0x587
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88612d56>] :gfs2:gfs2_delete_inode+0xc6/0x191
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88612cd6>] :gfs2:gfs2_delete_inode+0x46/0x191
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009c07a>] queue_delayed_work+0x75/0x7e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88612c90>] :gfs2:gfs2_delete_inode+0x0/0x191
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8002f942>] generic_delete_inode+0xc6/0x143
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88617c88>] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88603019>] :gfs2:do_promote+0xf5/0x137
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff801255c3>] selinux_capable+0x15/0x2a
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860c40d>] :gfs2:gfs2_write_begin+0x116/0x33f
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860de8f>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860e2c5>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80124abe>] avc_has_perm+0x43/0x55
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860e468>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff800c4075>] generic_file_read+0xac/0xc5
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8012541c>] file_has_perm+0x94/0xa3
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860e5b8>] :gfs2:gfs2_file_write+0x49/0xa7
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80016d48>] vfs_write+0xce/0x174
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80017615>] sys_write+0x45/0x6e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Aug 26 16:30:42 qssimes01 kernel:

Here are some various versions of our software (if it helps):

qssimes01> rpm -q cman openais lvm2-cluster
cman-2.0.98-1.el5_3.4
openais-0.80.3-22.el5_3.8
lvm2-cluster-2.02.40-7.el5
qssimes01> rpm -qa | grep -i gfs
gfs-utils-0.1.18-1.el5
gfs2-utils-0.1.53-1.el5_3.3
kmod-gfs-0.1.31-3.el5
qssimes01> uname -rm
2.6.18-128.1.16.el5 x86_64
Created attachment 358795 [details] full fsck output from last crash
It sounds like the filesystem has somehow been corrupted pretty badly. You should never see messages like the one in comment #9 unless something has gone very wrong. The only thing I can suggest is to try running fsck again in case there is something it did not pick up on the first run (which has been an issue in the past with fsck, but I thought it was now fixed). Bob should be able to tell you exactly which fsck versions were affected by that.
Looking again at the trace in comment #9, I have another question. Are you using extended attributes, either directly or via posix ACLs or selinux or something similar?
Created attachment 358983 [details] full "double" fsck output as the last poster suggested.
Regarding comment #12: we are not actively using extended attributes or selinux. But selinux is on in permissive mode with the out-of-the-box targeted policy.
Even in permissive mode, selinux will still label the files. That means the xattr access that failed in comment #9 was almost certainly doing something along those lines, since it was in the middle of the xattr handling code. If you don't actually need selinux on GFS2, then I'd suggest turning it off, since it will slow down some operations.

Thanks for posting the second fsck results.
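Turning selinux off persistently on a RHEL 5 host comes down to one line in /etc/selinux/config plus a reboot. A minimal sketch, demonstrated on a stand-in copy of the file so the edit is easy to verify before touching the real one:

```shell
# Stand-in for /etc/selinux/config so the change can be inspected first.
cfg=/tmp/selinux-config-demo
printf 'SELINUX=permissive\nSELINUXTYPE=targeted\n' > "$cfg"

# The actual change: SELINUX=disabled (applied to the real file, this
# takes effect after a reboot).
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"

grep '^SELINUX=' "$cfg"
```

On the live system you would run the same sed against /etc/selinux/config itself; `getenforce` shows the current mode.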
The fsck results indicate that at some stage the cluster crashed while there were inodes which were open but unlinked. This should be harmless, as the filesystem should clean it up itself the next time it tries to allocate something. More recent versions of fsck can clean this up without waiting for the filesystem to do it. Bob can tell you exactly which version you will need in order to get that feature.
I believe my version of gfs2-utils is the latest available that I see on RHN: "gfs2-utils-0.1.53-1.el5_3.3"

Also, it does look like I have some vmcores that I could upload. However, they are 32 GB in size; what's the best way to attach huge files to this bugzilla report?
Looks like we had another crash at 11:15 am Pacific today; here are some logs from around that time. The update at 10:36 was me bumping "post_fail_delay" to 90 seconds to give kdump more time to create crash dumps. Unfortunately, the FS we are dumping to is full of vmcore files. If someone could let me know how best to upload my 32 GB core dumps, I'd appreciate it, so I can clear out some space.

Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: Joined cluster. Now mounting FS...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=0, already locked for use
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=0: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=0: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=1: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=1: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=1: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=2: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=2: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=2: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=3: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=3: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=3: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=4: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=4: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=4: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=5: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=5: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=5: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=6: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=6: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=6: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=7: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=7: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=7: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=8: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=8: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=8: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=9: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=9: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=9: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=10: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=10: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=10: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=11: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=11: Looking at journal...
Aug 28 10:35:44 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=11: Done
Aug 28 10:36:32 qssimes02 ccsd[12727]: Update of cluster.conf complete (version 10 -> 11).
Aug 28 11:15:13 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Trying to acquire journal lock...
Aug 28 11:15:13 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Busy
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Trying to acquire journal lock...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Looking at journal...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Acquiring the transaction lock...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Replaying journal...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Replayed 21 of 22 blocks
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Found 1 revoke tags
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Journal replayed in 1s
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Done
Aug 28 11:43:52 qssimes02 kernel: GFS2: fsid=QSSIMES:data02.0: fatal: invalid metadata block

<fence seems to have occurred here and machine went offline then rebooted>
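The post_fail_delay change mentioned above (and confirmed by the ccsd "version 10 -> 11" line) corresponds to the fence_daemon element of cluster.conf. A hedged sketch; only post_fail_delay="90", the cluster name and the config_version are taken from this bug, the other attribute is an illustrative default:

```xml
<cluster name="QSSIMES" config_version="11">
  <!-- post_fail_delay: seconds fenced waits after a node failure
       before fencing -- raised to 90 here so kdump can finish.
       post_join_delay shown with its common default. -->
  <fence_daemon post_fail_delay="90" post_join_delay="3"/>
  ...
</cluster>
```

After editing, config_version must be incremented and the change propagated (e.g. with ccs_tool update /etc/cluster/cluster.conf) so all nodes pick it up.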
Hi, If you don't mind, please download this tar ball: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2_fsck_edit.tgz It contains pre-release versions of fsck.gfs2 and gfs2_edit. You may first want to consider saving off all your GFS2 metadata using this command before proceeding: ./gfs2_edit savemeta /dev/your/device /tmp/519049.meta Next, try running this fsck.gfs2. It will hopefully be better at detecting and fixing your file system corruption (and only one run should be necessary). If it works, great. If it has a problem you can use the saved metadata to restore your file system to its condition before the new fsck.gfs2 was run. (using ./gfs2_edit restoremeta /tmp/519049.meta /dev/your/device) If you still have problems, I'd like to get a copy of that metadata you saved with gfs2_edit from before the fsck.gfs2 was run. Of course, the metadata will be too big to attach to the bugzilla record, so perhaps you can zip it with bzip2, and put the ".bz2" file on a ftp or web server somewhere so I can download it for analysis (there are some free ones). I'd like to learn how your file system became corrupt, but that may be nearly impossible to figure out. The metadata might give us some clues. If the problem is fixed by fsck.gfs2 and then happens again, perhaps I can use the saved metadata to recreate the corruption and find the problem. The goal of "gfs2_edit savemeta" is to save only metadata (file names, directory structure journals and system files) and not save any (presumably confidential) "real" data. Let me know how this goes. Bob Peterson
Trying now bob, it may take a while since I actually have 6 filesystems on which to save metadata before trying this and it seems like it takes a while. On a side note, I was able to compress a vmcore file and upload it to dropbox.redhat.com as vmcore-sr1947858-bz519049.vmcore.bz2. I've also updated my kdump.conf file to trim the size of future core dumps so this won't be such a painful ordeal next time. So, the savemeta seems like it'll take about 10 minutes per FS for me, plus another 10 minutes per fs to run the fsck. So, I probably won't be able to post an update until after COB Pacific today. Thanks for your help so far, -Dave.
Sorry for the delayed response. I successfully fsck'ed each filesystem with the new gfs2_fsck and started up the services. I'll check back on Monday with a status on whether the systems are still up. Again, thanks. -Dave.
Hi Dave, Have you had a chance to check in on the filesystems this morning? Thanks again, Frank.
Yes, sorry for not updating sooner. With no load, the systems stayed up all day Saturday and Sunday and so far today. A load test will be started later today to see if the environment remains stable.
Well, the load tests ran for about 18 hours, then failed today at 11:00 Pacific with the exact same sorts of errors:

Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: fatal: invalid metadata block
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: bh = 4556247 (magic number)
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: function = ea_foreach_i, file = fs/gfs2/eattr.c, line = 80
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: about to withdraw this file system
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: telling LM to withdraw
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: gfs2_delete_inode: -5
Sep  1 11:00:43 qssimes02 last message repeated 7 times
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: withdrawn
Sep  1 11:00:43 qssimes02 kernel:
Sep  1 11:00:43 qssimes02 kernel: Call Trace:
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848752a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80064ae8>] __wait_on_bit+0x60/0x6e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff800157c8>] sync_buffer+0x0/0x3f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8849a67f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88480557>] :gfs2:ea_foreach_i+0x57/0x12d
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848068c>] :gfs2:ea_foreach+0x5f/0x195
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88480e17>] :gfs2:ea_dealloc_unstuffed+0x0/0x2f8
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff884808fc>] :gfs2:gfs2_ea_dealloc+0x6c/0x587
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88492d56>] :gfs2:gfs2_delete_inode+0xc6/0x191
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88492cd6>] :gfs2:gfs2_delete_inode+0x46/0x191
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009c07a>] queue_delayed_work+0x75/0x7e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88492c90>] :gfs2:gfs2_delete_inode+0x0/0x191
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8002f942>] generic_delete_inode+0xc6/0x143
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88497c88>] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88483019>] :gfs2:do_promote+0xf5/0x137
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848c40d>] :gfs2:gfs2_write_begin+0x116/0x33f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8008aeb5>] __activate_task+0x56/0x6d
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848de8f>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848e2c5>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80013de2>] find_lock_page+0x26/0xa1
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848e468>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8845b784>] :dlm:dlm_put_lockspace+0x10/0x1f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848e5b8>] :gfs2:gfs2_file_write+0x49/0xa7
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80016d48>] vfs_write+0xce/0x174
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8000bcc0>] fget_light+0x59/0x7c
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80017615>] sys_write+0x45/0x6e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Sep  1 11:00:43 qssimes02 kernel:
Here's a rough timeline of today's events.

1. At 10:29:01 one of our nodes decided to fence the other, and we got these error messages in /var/log/syslog (for us):

Sep  1 10:27:21 qssimes02 kernel: dlm: closing connection to node 1
Sep  1 10:29:01 qssimes02 fenced[625]: qssimes01 not a cluster member after 100 sec post_fail_delay
Sep  1 10:29:01 qssimes02 fenced[625]: fencing node "qssimes01"
Sep  1 10:29:40 qssimes02 ccsd[566]: Attempt to close an unopened CCS descriptor (4757250).
Sep  1 10:29:40 qssimes02 ccsd[566]: Error while processing disconnect: Invalid request descriptor
Sep  1 10:29:40 qssimes02 fenced[625]: fence "qssimes01" success

We also got these in /var/log/syslogs/local4 from openais around that time:

Sep  1 10:21:53 qssimes02 clurgmgrd: [17980]: <info> Executing /opt/dcs_6.0/files/ime_startup status
Sep  1 10:22:15 qssimes02 openais[579]: [TOTEM] Retransmit List: ef5383 ef5384 ef5385 ef5386 ef5387 ef5388 ef5389 ef538a ef538b ef538c
Sep  1 10:22:15 qssimes02 openais[579]: [TOTEM] Retransmit List: ef5383 ef5384 ef5385 ef5386 ef5387 ef5388 ef5389 ef538a ef538b ef538c
Sep  1 10:22:15 qssimes02 openais[579]: [TOTEM] Retransmit List: ef5399 ef539a ef539b ef539c ef539d ef539e
Sep  1 10:22:23 qssimes02 clurgmgrd: [17980]: <info> Executing /opt/dcs_6.0/files/ime_startup status
Sep  1 10:23:23 qssimes02 last message repeated 2 times
Sep  1 10:24:53 qssimes02 last message repeated 3 times
Sep  1 10:26:23 qssimes02 last message repeated 3 times
Sep  1 10:26:53 qssimes02 clurgmgrd: [17980]: <info> Executing /opt/dcs_6.0/files/ime_startup status
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] The token was lost in the OPERATIONAL state.
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] entering GATHER state from 2.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering GATHER state from 0.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Creating commit token because I am the rep.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Saving state aru 117856c high seq received 117856c
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Storing new sequence id for ring 308
Sep  1 10:27:21 qssimes02 clurgmgrd[17980]: <info> State change: qssimes01 DOWN
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering COMMIT state.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering RECOVERY state.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] position [0] member 10.55.32.88:
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] previous ring seq 772 rep 10.55.32.84
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] aru 117856c high delivered 117856c received flag 1
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Did not need to originate any messages in recovery.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Sending initial ORF token
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] CLM CONFIGURATION CHANGE
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] New Configuration:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] 	r(0) ip(10.55.32.88)
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Left:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] 	r(0) ip(10.55.32.84)
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Joined:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] CLM CONFIGURATION CHANGE
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] New Configuration:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] 	r(0) ip(10.55.32.88)
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Left:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Joined:
Sep  1 10:27:21 qssimes02 openais[579]: [SYNC ] This node is within the primary component and will provide service.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering OPERATIONAL state.
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] got nodejoin message 10.55.32.88
Sep  1 10:27:21 qssimes02 openais[579]: [CPG  ] got joinlist message from node 2
Sep  1 10:27:21 qssimes02 clurgmgrd[17980]: <notice> Starting service:imeslave on [ 2 ]
Sep  1 10:27:21 qssimes02 clurgmgrd[17980]: <info> Start failed; node reports: 0 failures, 1 exclusive, 0 dependency errors

2. About 6 minutes later, the remaining cluster node withdrew, core dumped and rebooted, with this error message in /var/log/syslog:

Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: fatal: invalid metadata block
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: bh = 4556247 (magic number)
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: function = ea_foreach_i, file = fs/gfs2/eattr.c, line = 80
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: about to withdraw this file system
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: telling LM to withdraw

3. Then the nodes went into a fencing loop where neither could start up, until we intervened manually and fscked each filesystem.
Note: There are no logs or coredumps from the originally fenced node qssimes01 around 10:29.

We have this in /var/log/syslog at that time:

Sep 1 05:15:24 qssimes01 ntpd[5786]: synchronized to 129.46.51.243, stratum 2
Sep 1 05:47:56 qssimes01 ntpd[5786]: synchronized to 10.45.79.4, stratum 2
Sep 1 06:40:21 qssimes01 ntpd[5786]: synchronized to 129.46.51.243, stratum 2
Sep 1 07:40:37 qssimes01 ntpd[5786]: synchronized to 10.45.78.127, stratum 2
Sep 1 08:14:17 qssimes01 ntpd[5786]: synchronized to 129.46.51.243, stratum 2
Sep 1 10:13:47 qssimes01 ntpd[5786]: synchronized to 10.45.79.4, stratum 2

And this in /var/log/syslogs/local4:

Sep 1 10:20:55 qssimes01 last message repeated 3 times
Sep 1 10:22:25 qssimes01 last message repeated 3 times
Sep 1 10:23:55 qssimes01 last message repeated 3 times
Sep 1 10:25:25 qssimes01 last message repeated 3 times
Sep 1 10:26:55 qssimes01 last message repeated 3 times

So, what I'm concerned about is an apparently random reboot of the machine qssimes01 while it was doing IO operations. None of the logs give me any useful information on why qssimes01 rebooted. Or am I missing something here?
Dave, To confirm, there was no vmcore created on qssimes01 and you're certain that it rebooted for some reason in this instance?
Dave, Also, I see it discussed but were you able to entirely disable selinux on these cluster nodes? Thanks again, Frank.
At least a couple of these failures seem to be with extended attributes. It's interesting that the corruption was noticed shortly after a node was fenced. That makes me wonder if there is a problem with journal replay after the node comes back up. To try to determine that, I'd like to propose an experiment.

1. Set everything up to not mount GFS2 automatically.

2. Run fsck.gfs2 again to make sure the file system is clean.

3. Run the failing scenario again.

4. After qssimes02 decides to fence qssimes01, do not restart the cluster right away. Instead, reboot both nodes.

5. Restart the cluster manually but don't mount the GFS2 file system. Instead, save off the GFS2 metadata, bzip2 it, and put it on a web or ftp server where I can analyze it. I want to look at the journals to see if there is something in there that might cause corruption, or if there are clues to the corruption. To save off the metadata, do:

gfs2_edit savemeta <gfs2 device> <some filename>
bzip2 <some filename>

(Do not fsck.gfs2 this time.)

6. Next, mount the file system normally and use it for a while to see if corruption is detected. If so, let me know the messages it gives you, especially ones that say "bh = <block>" like many of the previous messages.
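For step 5, a small wrapper script can help avoid typos when capturing the metadata. This is only a sketch: the device path and output filename below (/dev/mapper/vg-data05, data05.meta) are placeholders, not the actual values from this cluster, and the script defaults to a dry run that just prints the commands it would execute.

```shell
#!/bin/sh
# Sketch of the metadata capture from step 5. DEV and OUT are
# placeholder values; override them with the real GFS2 device and
# the desired output name. DRY_RUN defaults to 1 so the script
# only prints what it would do.
DEV=${DEV:-/dev/mapper/vg-data05}
OUT=${OUT:-data05.meta}
DRY_RUN=${DRY_RUN:-1}

run() {
    # In dry-run mode, echo the command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

run gfs2_edit savemeta "$DEV" "$OUT"   # dump all metadata (no fsck first)
run bzip2 "$OUT"                       # compress before uploading
```

Once the printed commands look right, run it again with DRY_RUN=0 and the real DEV and OUT values.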
That's confirmed. No vmcore files were created on qssimes01 and there are no log entries indicating any type of problem or issue with it. qssimes01 was not rebooted manually and no one was logged into it at the time. It was fenced, and there was no log message on qssimes02 or qssimes01 indicating to me why it was fenced. All we got was this on qssimes02:

Sep 1 10:27:21 qssimes02 kernel: dlm: closing connection to node 1
Sep 1 10:29:01 qssimes02 fenced[625]: qssimes01 not a cluster member after 100 sec post_fail_delay
Sep 1 10:29:01 qssimes02 fenced[625]: fencing node "qssimes01"
Responding to comment #27: Yes, selinux is disabled on both nodes.

Responding to comment #28: Will do. I've currently turned clustering services off for the next reboot and will do as you propose. I have already done a manual fsck of the filesystems, so we will commence another load test to see if we can reproduce the problem again. It may take a day or two for the failure to occur, as it doesn't appear to happen right away.
Dave, Thanks for the confirmation on those. Frank.
Based on the above, I suspect that an error relating to xattrs causes the initial issue, which is almost certainly a hang on the node in question. That eventually causes the node to leave the cluster and get fenced, whereupon recovery is unable to correctly recover the xattrs, resulting in the reported issue when the node tries to reread the same inode. It certainly looks xattr related, so if selinux can be disabled, at least for the gfs2 filesystem, that will probably be a workaround in the short term.
Selinux has been disabled:

qssimes01> sudo /usr/sbin/sestatus
SELinux status: disabled

qssimes02> sudo /usr/sbin/sestatus
Password:
SELinux status: disabled

It was suggested in comment #15 above, then I confirmed it in comment #29 after Frank asked again in comment #27. I disabled it by editing /etc/selinux/config and rebooting. Selinux has been disabled since August 28th @ 9:56 AM Pacific. Need I also disable selinux somewhere else?
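On the question of whether selinux needs disabling anywhere else: on RHEL 5 the persistent setting is the SELINUX= line in /etc/selinux/config, which sestatus reflects after a reboot. As a sketch, a tiny helper can parse that line from a config file; the /tmp example file below is hypothetical, just to show the expected format.

```shell
#!/bin/sh
# Given the path to an /etc/selinux/config-style file, print the
# boot-time SELinux mode ("disabled", "permissive", or "enforcing").
selinux_boot_mode() {
    sed -n 's/^SELINUX=//p' "$1"
}

# Hypothetical example config, matching the format of the real file:
printf 'SELINUXTYPE=targeted\nSELINUX=disabled\n' > /tmp/selinux-config-example
selinux_boot_mode /tmp/selinux-config-example
```

A kernel command line of selinux=0 would also disable it outright, but editing /etc/selinux/config and rebooting (as done here) is sufficient.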
Hi Dave,

Sorry for the confusion. I'm passing information back to the engineers from some of the earlier runs and asking questions on current ones; it's mixing things up here a bit. We'll get that straightened out; we don't intend to bug you for the same information repeatedly like that.
Currently, we see this error on the qssimes01/2 cluster when running strace:

28186 open("/dcs/data05/DATA/dcs/polldata/Q/QCBREW1/QCTXN2/QCBREW1_QCTXN2_014494.dat", O_RDWR) = -1 EIO (Input/output error)

This file is accessible on qssimes02 at this time, but not qssimes01.
It should be noted that we opened a bug report during the installation of the cluster service as well. https://bugzilla.redhat.com/show_bug.cgi?id=512799
FYI we got another crash at 1:07 PM PDT yesterday. No useful log messages. The fenced machine did attempt to dump its core but failed. I cleaned up our old coredumps so (hopefully) that won't happen again. In any case, where are we with this? What can I do to help more on the ticket?
OK, well, still hoping to hear something I guess. It was proposed in our Service Request that I update these 2 systems to RHEL 5.4 since that was released the other day. Would anyone on this ticket agree or disagree? Is there any other information we can provide on this?
It would be a good idea to upgrade the qssimes01/2 cluster.
I do recommend upgrading to 5.4 but I don't think it will solve the corruption problem at hand. We are currently still analyzing the problem and don't have any answers yet. I'll update the bugzilla record when I know more. A big question in my mind is why the node is getting fenced in the first place. That's something that may get better with 5.4. I don't know for sure whether the node was fenced because of the corruption (via kernel panic or some such) or whether the corruption was caused by a problem with the fencing recovery (i.e. a problem with journal replay). There are still too many unknowns here. It would be helpful if someone could analyze the syslogs and/or console logs of all the nodes around the time the fencing occurs to figure out if there was a kernel panic or some such just prior to the node being fenced.
Also note that I still don't have the metadata for the file systems that are showing occasional corruption, and that may also be helpful.
Sorry Robert, that's my bad. I've uploaded those meta files to dropbox.redhat.com and named them like so:

appvol01.meta-bz519049.bz2
datavol01.meta-bz519049.bz2
datavol02.meta-bz519049.bz2
datavol03.meta-bz519049.bz2
datavol04.meta-bz519049.bz2
datavol05.meta-bz519049.bz2

I am doing the update to 5.4 now; I'm hoping it'll help.

Also, not sure if you reviewed the link that Everett Bennett posted above (https://bugzilla.redhat.com/show_bug.cgi?id=512799). We had this ticket open on this cluster already and added this item to our cluster.conf file to work around it:

<gfs_controld plock_ownership="0" plock_rate_limit="0"/>

Could that previous bug and its recommended fix be in any way related to our current problems?

Thanks,

-Dave.
And in reference to comment #41 where you request log analysis: I've just tarred up everything in our /var/log directory on each host and uploaded that to dropbox.redhat.com:

qssimes01-bz519049.tar.bz2
qssimes02-bz519049.tar.bz2

Feel free to do additional analysis if you like. I've tried to post all the logs I thought were relevant around the time of each crash, but I could've missed something. I also uploaded sosreports for our corresponding Service Request on this ticket, which you may have access to. Please don't hesitate to remind me of things I've forgotten, and thanks for your help on this.

I should have these systems up to RHEL 5.4 in a few minutes here . . .

-Dave.
I've retrieved the metadata from dropbox.redhat.com and examined it. The 12 journals have absolutely no information regarding the corrupted blocks. That means either the journals have wrapped since the corruption occurred (which means a lot of activity since the original failure) or else the corruption isn't related to a transaction in the journal. I tend to think it's the latter.

I also retrieved and analyzed the syslog files from comment #44. I found very little useful information beyond the notes previously posted by Dave. (Dave, you did a good job of isolating and posting the relevant messages.) Syslog made it look like the corruption likely caused a kernel panic, which caused the fencing to happen.

I also did an extensive code review and analysis of the extended attribute code, but again, I found no evidence of logic problems that could cause the corruption we're seeing. Which brings me back to the original question of how the file systems became corrupt.

I've examined the metadata and also found very little useful evidence to go on. The blocks complained about in syslog seem to be used now for normal data blocks. One very interesting thing to note is that the block just prior to the corrupt one (the kernel message said bh = 4556247, which in hex is 0x4585d7, so block 0x4585d6) is marked as "unlinked" in the file system rgrp bitmap.

Which leads me to the next topic: the lack of other evidence of corruption (smoking guns), plus the two extraordinary call traces in comment #9 and comment #24, lead me now to believe that the GFS2 code got confused somewhere near the code path that calls try_rgrp_unlink() in rgrp.c. If you look closely at those two call traces, they show allocates calling unlink, which can only happen due to try_rgrp_unlink. That's my best clue so far.
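The block-number arithmetic above is easy to double-check from the shell: GFS2 withdraw messages report the buffer head (bh) number in decimal, while tools like gfs2_edit address blocks in hex.

```shell
# The withdraw message reported bh = 4556247 (decimal).
bh=4556247
printf 'corrupt block:   0x%x\n' "$bh"          # 0x4585d7
printf 'preceding block: 0x%x\n' "$((bh - 1))"  # 0x4585d6, the block marked "unlinked" in the rgrp bitmap
```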
Although I have no evidence to prove it, I believe that try_rgrp_unlink() is causing an extended attribute block that has already been deleted to be deleted again (after it's already been reused as a data block by another inode), or something similar. That's where I'm focusing my efforts now. Perhaps the fact that the bitmap showed 0x4585d6 as "unlinked" somehow caused it to accidentally try to delete 0x4585d7 by mistake, or something strange. Monday is a holiday in the USA, but not in the UK where GFS2 expert Steve Whitehouse resides. Perhaps on Monday Steve can spot something in that area of code.
The theory in comment #45 sounds plausible. I'm currently investigating.
Created attachment 360000 [details]
Possible fix

Looking at the code, I've come up with a theory that the bug may be caused by a race in the inode deallocation code. If I'm correct, the issue will only occur when two (or more) nodes in the cluster are both trying to deallocate the same inode at the same time. This is a fairly unusual event, but certainly not impossible. In that case it looks like one of the nodes would get stuck waiting for the inode glock while the other node is doing the deallocation. When the other node completes the deallocation and releases the glock, the original node might then try to complete the same operation if its attempt to "try lock" the i_open_gl succeeds.

The attached patch adds an extra check to the deallocation code so that after acquiring the inode glock, we ensure that there really is an inode requiring deallocation before proceeding with the remainder of the procedure. The cost is an extra couple of glocks and also a disk access on each deallocation. This is not a big deal, since both glocks will be needed in the deallocation process anyway (so we cache them earlier) and the disk access is a read of the rgrp block containing the bitmap entry for the unlinked inode. That block has to be written later in the deallocation function anyway, so it doesn't really make any performance difference.

The patch is against the upstream gfs2 kernel, but it's trivial to create a RHEL version, and we'll try to get something built shortly for testing purposes. Due to the requirement for a very similar test in the NFS code, we can share the new test with the GFS2 NFS code.

I've tested the patch on a single-node basis only so far, to ensure that it doesn't break anything obvious. Even if it doesn't solve this particular bug, I'm considering adding this patch to the upstream branch anyway, as it should improve the filesystem's robustness in the face of errors in the on-disk data.
Created attachment 360100 [details] RHEL5 version of the upstream patch
Hi Dave,

A RHEL 5.4 based x86_64 build of a kernel set with this patch can be downloaded at the following URL:

http://people.redhat.com/~fhirtz/.gfs2test/

It's just the stock -164 build with the one patch that we're referencing here:

<snip>
* Tue Sep 08 2009 Bob Peterson <rpeterso> [2.6.18-164.bz519049.el5]
- [gfs] GFS2 Filesystem Withdrawal: fatal: invalid metadata block
</snip>

I'd like to request that this be tested against the failure case that we'd been seeing. We'll work with you on any concerns or issues found, but the patch itself should be low risk.

Please let me know if you have questions or concerns.

Thanks again,

Frank.
Just an update here. We have not yet applied the proposed patches. We have updated to RHEL 5.4 and the environment has been stable for a week now. I'd like to leave this open for a while still, but tentatively I think RHEL 5.4 has fixed the problem.
I don't want to ship the fix for RHEL5.x until we know that the patch fixes the problem. We won't know that unless we can recreate the problem without the patch, then apply the patch and see if it fixes it. If the problem doesn't recreate for six months, perhaps we should just close this without shipping the patch. So I'm setting the NEEDINFO flag until we get more feedback.
Thanks Robert. I think that's fair. Despite our best efforts thus far, I cannot reproduce the error when running a fully patched RHEL 5.4 release. We are continuing acceptance testing of the product and I'll update the ticket if we have any other issues. But for now, all seems well.
I think we'll push this patch into 5.5 since then it has time to be tested before it gets released. If there is a need for this in z-stream, we'll have to clone this bug at some future point.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 367809 [details]
Updated patch with bug fix
in kernel-2.6.18-173.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days