Created attachment 358494 [details]
Cluster.conf file for an existing cluster that failed on Aug 13, 14 and 21st with these symptoms

Description of problem:
Somewhat infrequently, we have issues with general cluster instability. I cannot readily reproduce the problem, but it happens often enough (about every 2 months per cluster) that it has become a pretty major problem.

Version-Release number of selected component (if applicable):

How reproducible:
Eventually, this happens to just about every cluster we have, given enough time.

Steps to Reproduce:
1. Wait until the cluster fails (our clusters seem to fail pretty regularly, about once a month).

Actual results:
We get error messages like these from openais in the logs, then one or more nodes get fenced, or the cluster goes inquorate and completely fails:

Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] The token was lost in the OPERATIONAL state.
Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Aug 14 18:00:43 qssimes02 openais[5958]: [TOTEM] entering GATHER state from 2.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering GATHER state from 0.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Creating commit token because I am the rep.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Saving state aru e0d3 high seq received e0d3
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Storing new sequence id for ring 1e0
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering COMMIT state.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering RECOVERY state.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] position [0] member 10.55.32.88:
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] previous ring seq 476 rep 10.55.32.84
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] aru e0d3 high delivered e0d3 received flag 1
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Did not need to originate any messages in recovery.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] Sending initial ORF token
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] New Configuration:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] 	r(0) ip(10.55.32.88)
Aug 14 18:00:48 qssimes02 clurgmgrd[6620]: <info> State change: qssimes01 DOWN
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Left:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] 	r(0) ip(10.55.32.84)
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Joined:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] New Configuration:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] 	r(0) ip(10.55.32.88)
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Left:
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] Members Joined:
Aug 14 18:00:48 qssimes02 openais[5958]: [SYNC ] This node is within the primary component and will provide service.
Aug 14 18:00:48 qssimes02 openais[5958]: [TOTEM] entering OPERATIONAL state.
Aug 14 18:00:48 qssimes02 openais[5958]: [CLM  ] got nodejoin message 10.55.32.88
Aug 14 18:00:48 qssimes02 openais[5958]: [CPG  ] got joinlist message from node 2

Expected results:
Cluster recovers from the temporary network outage and stays up (that's what clustering is for, right?)

Additional info:
qssimes01> rpm -q openais cman
openais-0.80.3-22.el5_3.8
cman-2.0.98-1.el5_3.4
qssimes01> uname -a
Linux qssimes01 2.6.18-128.1.16.el5 #1 SMP Fri Jun 26 10:53:31 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
I'm also now noting this in our logs for this particular cluster; sorry I missed this before.

Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: fatal: invalid metadata block
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: bh = 5241860 (magic number)
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: function = gfs2_rgrp_bh_get, file = fs/gfs2/rgrp.c, line = 763
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: about to withdraw this file system
Aug 13 23:00:27 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: telling LM to withdraw
Aug 13 23:00:28 qssimes01 kernel: GFS2: fsid=QSSIMES:app01.1: withdrawn
Aug 13 23:00:28 qssimes01 kernel:
Aug 13 23:00:28 qssimes01 kernel: Call Trace:
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8860552a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff800157c8>] sync_buffer+0x0/0x3f
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff800177eb>] ll_rw_block+0x8c/0xab
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8861867f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff8861647e>] :gfs2:gfs2_rgrp_bh_get+0x113/0x261
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff88602b63>] :gfs2:glock_work_func+0x0/0xb8
Aug 13 23:00:28 qssimes01 kernel:  [<ffffffff88600fd1>] :gfs2:do_promote+0xad/0x137
Steve W: please comment on comment #1 regarding the withdraw.
Dave,

Please include full logs from both nodes so we may evaluate them. I see a withdraw occurred on Aug 13 23:00:28, but the config change from node 2 occurred 8 hours later. What happened during those 8 hours? Are times synced on the cluster?

Any core files in /var/lib/openais/*? If so, attach them.

Thanks,
-steve
Created attachment 358504 [details] logs from qssimes01 (/var/log all tarred up since we have a custom syslog.conf file)
Created attachment 358505 [details] /var/log from qssimes02
Thanks for the comments. I only pasted portions of the logs. In this case, the cluster crashed on 8/13 but also crashed 8 hours later on 8/14, so I confused you by pasting logs from different crashes into the comments. Hopefully, now that you have all my logs, things will be clearer.
The withdraw of comment #1 is caused by the fs attempting to read one of the resource groups on the fs and finding that the block is corrupt (either it's looking in the wrong place, or the block has been overwritten with data). In order to fix that, you'll need to unmount on all nodes and run fsck. If it's possible to save that metadata first, that would be useful for our analysis.

Some questions:
1. Was the filesystem grown at any point?
2. Was the filesystem created as gfs2 originally, or a conversion from gfs1?
3. What kind of activity was going on immediately before the error was discovered?

Also, if you could post the results of fsck here, that would be useful too.

The initial comment says "Cluster recovers from temporary network outage and stays up (that's what clustering is for right?)", which makes me wonder if the temporary outage is being caused by something external (someone plugging/unplugging network cables, or similar). Are these messages really appearing with no external cause?
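The unmount-and-fsck procedure described above might look like the following sketch. The device path and mount point are placeholders, not taken from this cluster's configuration:

```shell
# Stop the services using the filesystem, then unmount it on EVERY node
# (placeholder names -- substitute your own device and mount point):
umount /mnt/app01                # run on each cluster node

# From a single node, check and repair the filesystem. -y answers "yes"
# to every repair prompt; drop it to review each fix interactively.
fsck.gfs2 -y /dev/your/device | tee /tmp/fsck-app01.log
```

Capturing the output with tee makes it easy to attach the full fsck results to the bug afterwards.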
Thanks for getting back to me. Here are the answers to your questions:
1. No, never grown.
2. It was originally created as GFS2.
3. We are running a custom application from a consultant here. I can't categorize _EVERYTHING_, but there is an ISAM database and a bunch of scripts.

The fsck result was enormous and I wasn't expecting it to fill my terminal buffer, but this is some of the output (it was all very similar):

Block #9959441 (0x97f811) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #9959442 (0x97f812) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #9959443 (0x97f813) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #9959444 (0x97f814) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024962 (0x98f802) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024963 (0x98f803) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024964 (0x98f804) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024965 (0x98f805) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10024966 (0x98f806) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090484 (0x99f7f4) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090485 (0x99f7f5) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090486 (0x99f7f6) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090487 (0x99f7f7) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10090488 (0x99f7f8) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156006 (0x9af7e6) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156007 (0x9af7e7) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156008 (0x9af7e8) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156009 (0x9af7e9) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10156010 (0x9af7ea) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221528 (0x9bf7d8) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221529 (0x9bf7d9) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221530 (0x9bf7da) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221531 (0x9bf7db) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10221532 (0x9bf7dc) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287050 (0x9cf7ca) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287051 (0x9cf7cb) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287052 (0x9cf7cc) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287053 (0x9cf7cd) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10287054 (0x9cf7ce) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352572 (0x9df7bc) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352573 (0x9df7bd) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352574 (0x9df7be) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352575 (0x9df7bf) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10352576 (0x9df7c0) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418094 (0x9ef7ae) (1 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418095 (0x9ef7af) (2 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418096 (0x9ef7b0) (3 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418097 (0x9ef7b1) (4 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
Block #10418098 (0x9ef7b2) (5 of 5) is neither GFS2_METATYPE_RB nor GFS2_METATYPE_RG. Attempting to repair the RG.
(level 2 passed)
Starting pass1
40 percent complete.

So as to the "temporary network outages", I don't believe that's the cause here. That was a poor assumption on my part... sorry about that.
And here's another update. After running fsck on the filesystems and then restarting the whole cluster, we continue to have GFS2 problems with these types of error messages:

Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: fatal: invalid metadata block
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: bh = 6909402 (magic number)
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: function = ea_foreach_i, file = fs/gfs2/eattr.c, line = 80
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: about to withdraw this file system
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: telling LM to withdraw
Aug 26 16:30:42 qssimes01 kernel: GFS2: fsid=QSSIMES:data02.1: withdrawn
Aug 26 16:30:42 qssimes01 kernel:
Aug 26 16:30:42 qssimes01 kernel: Call Trace:
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860752a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80064ae8>] __wait_on_bit+0x60/0x6e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff800157c8>] sync_buffer+0x0/0x3f
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8861a67f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88600557>] :gfs2:ea_foreach_i+0x57/0x12d
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860068c>] :gfs2:ea_foreach+0x5f/0x195
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88600e17>] :gfs2:ea_dealloc_unstuffed+0x0/0x2f8
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff886008fc>] :gfs2:gfs2_ea_dealloc+0x6c/0x587
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88612d56>] :gfs2:gfs2_delete_inode+0xc6/0x191
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88612cd6>] :gfs2:gfs2_delete_inode+0x46/0x191
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009c07a>] queue_delayed_work+0x75/0x7e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88612c90>] :gfs2:gfs2_delete_inode+0x0/0x191
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8002f942>] generic_delete_inode+0xc6/0x143
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88617c88>] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff88603019>] :gfs2:do_promote+0xf5/0x137
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff801255c3>] selinux_capable+0x15/0x2a
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860c40d>] :gfs2:gfs2_write_begin+0x116/0x33f
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860de8f>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860e2c5>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80124abe>] avc_has_perm+0x43/0x55
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860e468>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff800c4075>] generic_file_read+0xac/0xc5
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8012541c>] file_has_perm+0x94/0xa3
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8860e5b8>] :gfs2:gfs2_file_write+0x49/0xa7
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80016d48>] vfs_write+0xce/0x174
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff80017615>] sys_write+0x45/0x6e
Aug 26 16:30:42 qssimes01 kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Aug 26 16:30:42 qssimes01 kernel:

Here are some various versions of our software (if it helps):

qssimes01> rpm -q cman openais lvm2-cluster
cman-2.0.98-1.el5_3.4
openais-0.80.3-22.el5_3.8
lvm2-cluster-2.02.40-7.el5
qssimes01> rpm -qa | grep -i gfs
gfs-utils-0.1.18-1.el5
gfs2-utils-0.1.53-1.el5_3.3
kmod-gfs-0.1.31-3.el5
qssimes01> uname -rm
2.6.18-128.1.16.el5 x86_64
Created attachment 358795 [details] full fsck output from last crash
It sounds like the filesystem has somehow been corrupted pretty badly. You should never see messages like the one in comment #9 unless something has gone very wrong. The only thing I can suggest is to try running fsck again in case there is something it did not pick up on the first run (which has been an issue in the past with fsck, but I thought it was now fixed). Bob should be able to tell you exactly which fsck versions were affected by that.
Looking again at the trace in comment #9, I have another question. Are you using extended attributes, either directly or via posix ACLs or selinux or something similar?
Created attachment 358983 [details] full "double" fsck output as the last poster suggested.
Regarding comment #12: we are not actively using extended attributes or selinux. But selinux is on in permissive mode with the out-of-the-box targeted policy.
Even in permissive mode, selinux will still label the files. That means the xattr access that failed in comment #9 was almost certainly doing something along those lines, since it was in the middle of the xattr handling code. If you don't actually need selinux on GFS2, then I'd suggest turning it off, since it will slow down some operations.

Thanks for posting the second fsck results.
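Turning selinux off persistently on a RHEL 5 host comes down to one line in /etc/selinux/config plus a reboot. A minimal sketch, demonstrated on a stand-in copy of the file so the edit is easy to verify before touching the real one:

```shell
# Stand-in for /etc/selinux/config so the change can be inspected first.
cfg=/tmp/selinux-config-demo
printf 'SELINUX=permissive\nSELINUXTYPE=targeted\n' > "$cfg"

# The actual change: SELINUX=disabled (applied to the real file, this
# takes effect after a reboot).
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"

grep '^SELINUX=' "$cfg"
```

On the live system you would run the same sed against /etc/selinux/config itself; `getenforce` shows the current mode.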
The fsck results indicate that at some stage the cluster crashed while there were inodes which were open but unlinked. This should be harmless, as the filesystem should clean it up itself the next time it tries to allocate something. More recent versions of fsck can clean this up without waiting for the filesystem to do it. Bob can tell you exactly which version you will need in order to get that feature.
I believe my version of gfs2-utils is the latest available that I see on RHN: "gfs2-utils-0.1.53-1.el5_3.3"

Also, it does look like I have some vmcores that I could upload. However, they are 32 GB in size; what's the best way to attach huge files to this bugzilla report?
Looks like we had another crash at 11:15 am Pacific today; here are some logs from around that time. The update at 10:36 was me bumping "post_fail_delay" to 90 seconds to give kdump more time to create crash dumps. Unfortunately, the FS we are dumping to is full of vmcore files. If someone could let me know how best to upload my 32 GB core dumps, I'd appreciate it, so I can clear out some space.

Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: Joined cluster. Now mounting FS...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=0, already locked for use
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=0: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=0: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=1: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=1: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=1: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=2: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=2: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=2: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=3: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=3: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=3: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=4: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=4: Looking at journal...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=4: Done
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=5: Trying to acquire journal lock...
Aug 28 10:35:42 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=5: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=5: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=6: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=6: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=6: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=7: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=7: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=7: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=8: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=8: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=8: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=9: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=9: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=9: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=10: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=10: Looking at journal...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=10: Done
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=11: Trying to acquire journal lock...
Aug 28 10:35:43 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=11: Looking at journal...
Aug 28 10:35:44 qssimes02 kernel: GFS2: fsid=QSSIMES:app01.0: jid=11: Done
Aug 28 10:36:32 qssimes02 ccsd[12727]: Update of cluster.conf complete (version 10 -> 11).
Aug 28 11:15:13 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Trying to acquire journal lock...
Aug 28 11:15:13 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Busy
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Trying to acquire journal lock...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Looking at journal...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Acquiring the transaction lock...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Replaying journal...
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Replayed 21 of 22 blocks
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Found 1 revoke tags
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Journal replayed in 1s
Aug 28 11:15:14 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: jid=1: Done
Aug 28 11:43:52 qssimes02 kernel: GFS2: fsid=QSSIMES:data02.0: fatal: invalid metadata block

<fence seems to have occurred here and machine went offline then rebooted>
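The post_fail_delay change mentioned above (and confirmed by the ccsd "version 10 -> 11" line) corresponds to the fence_daemon element of cluster.conf. A hedged sketch; only post_fail_delay="90", the cluster name and the config_version are taken from this bug, the other attribute is an illustrative default:

```xml
<cluster name="QSSIMES" config_version="11">
  <!-- post_fail_delay: seconds fenced waits after a node failure
       before fencing -- raised to 90 here so kdump can finish.
       post_join_delay shown with its common default. -->
  <fence_daemon post_fail_delay="90" post_join_delay="3"/>
  ...
</cluster>
```

After editing, config_version must be incremented and the change propagated (e.g. with ccs_tool update /etc/cluster/cluster.conf) so all nodes pick it up.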
Hi, If you don't mind, please download this tar ball: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2_fsck_edit.tgz It contains pre-release versions of fsck.gfs2 and gfs2_edit. You may first want to consider saving off all your GFS2 metadata using this command before proceeding: ./gfs2_edit savemeta /dev/your/device /tmp/519049.meta Next, try running this fsck.gfs2. It will hopefully be better at detecting and fixing your file system corruption (and only one run should be necessary). If it works, great. If it has a problem you can use the saved metadata to restore your file system to its condition before the new fsck.gfs2 was run. (using ./gfs2_edit restoremeta /tmp/519049.meta /dev/your/device) If you still have problems, I'd like to get a copy of that metadata you saved with gfs2_edit from before the fsck.gfs2 was run. Of course, the metadata will be too big to attach to the bugzilla record, so perhaps you can zip it with bzip2, and put the ".bz2" file on a ftp or web server somewhere so I can download it for analysis (there are some free ones). I'd like to learn how your file system became corrupt, but that may be nearly impossible to figure out. The metadata might give us some clues. If the problem is fixed by fsck.gfs2 and then happens again, perhaps I can use the saved metadata to recreate the corruption and find the problem. The goal of "gfs2_edit savemeta" is to save only metadata (file names, directory structure journals and system files) and not save any (presumably confidential) "real" data. Let me know how this goes. Bob Peterson
Trying now bob, it may take a while since I actually have 6 filesystems on which to save metadata before trying this and it seems like it takes a while. On a side note, I was able to compress a vmcore file and upload it to dropbox.redhat.com as vmcore-sr1947858-bz519049.vmcore.bz2. I've also updated my kdump.conf file to trim the size of future core dumps so this won't be such a painful ordeal next time. So, the savemeta seems like it'll take about 10 minutes per FS for me, plus another 10 minutes per fs to run the fsck. So, I probably won't be able to post an update until after COB Pacific today. Thanks for your help so far, -Dave.
Sorry for the delayed response. I successfully fsck'ed each filesystem with the new gfs2_fsck and started up the services. I'll check back on Monday with a status on whether the systems are still up. Again, thanks. -Dave.
Hi Dave, Have you had a chance to check in on the filesystems this morning? Thanks again, Frank.
Yes, sorry for not updating sooner. With no load, the systems stayed up all day Saturday and Sunday and so far today. A load test will be started later today to see if the environment remains stable.
Well, the load tests ran for about 18 hours, then failed today at 11:00 Pacific with the exact same sorts of errors:

Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: fatal: invalid metadata block
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: bh = 4556247 (magic number)
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: function = ea_foreach_i, file = fs/gfs2/eattr.c, line = 80
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: about to withdraw this file system
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: telling LM to withdraw
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: gfs2_delete_inode: -5
Sep  1 11:00:43 qssimes02 last message repeated 7 times
Sep  1 11:00:43 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: withdrawn
Sep  1 11:00:43 qssimes02 kernel:
Sep  1 11:00:43 qssimes02 kernel: Call Trace:
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848752a>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80064ae8>] __wait_on_bit+0x60/0x6e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff800157c8>] sync_buffer+0x0/0x3f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8849a67f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88480557>] :gfs2:ea_foreach_i+0x57/0x12d
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848068c>] :gfs2:ea_foreach+0x5f/0x195
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88480e17>] :gfs2:ea_dealloc_unstuffed+0x0/0x2f8
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff884808fc>] :gfs2:gfs2_ea_dealloc+0x6c/0x587
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80064b62>] out_of_line_wait_on_bit+0x6c/0x78
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009ebd2>] wake_bit_function+0x0/0x23
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88492d56>] :gfs2:gfs2_delete_inode+0xc6/0x191
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88492cd6>] :gfs2:gfs2_delete_inode+0x46/0x191
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009c07a>] queue_delayed_work+0x75/0x7e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88492c90>] :gfs2:gfs2_delete_inode+0x0/0x191
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8002f942>] generic_delete_inode+0xc6/0x143
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88497c88>] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff88483019>] :gfs2:do_promote+0xf5/0x137
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848c40d>] :gfs2:gfs2_write_begin+0x116/0x33f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8008aeb5>] __activate_task+0x56/0x6d
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848de8f>] :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848e2c5>] :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80013de2>] find_lock_page+0x26/0xa1
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848e468>] :gfs2:gfs2_file_write_nolock+0xaa/0x10f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8845b784>] :dlm:dlm_put_lockspace+0x10/0x1f
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8009eba4>] autoremove_wake_function+0x0/0x2e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8848e5b8>] :gfs2:gfs2_file_write+0x49/0xa7
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80016d48>] vfs_write+0xce/0x174
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8000bcc0>] fget_light+0x59/0x7c
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff80017615>] sys_write+0x45/0x6e
Sep  1 11:00:43 qssimes02 kernel:  [<ffffffff8005e116>] system_call+0x7e/0x83
Sep  1 11:00:43 qssimes02 kernel:
Here's a rough timeline of today's events.

1. At 10:29:01 one of our nodes decided to fence the other, and we got these error messages in /var/log/syslog (for us):

Sep  1 10:27:21 qssimes02 kernel: dlm: closing connection to node 1
Sep  1 10:29:01 qssimes02 fenced[625]: qssimes01 not a cluster member after 100 sec post_fail_delay
Sep  1 10:29:01 qssimes02 fenced[625]: fencing node "qssimes01"
Sep  1 10:29:40 qssimes02 ccsd[566]: Attempt to close an unopened CCS descriptor (4757250).
Sep  1 10:29:40 qssimes02 ccsd[566]: Error while processing disconnect: Invalid request descriptor
Sep  1 10:29:40 qssimes02 fenced[625]: fence "qssimes01" success

We also got these in /var/log/syslogs/local4 from openais around that time:

Sep  1 10:21:53 qssimes02 clurgmgrd: [17980]: <info> Executing /opt/dcs_6.0/files/ime_startup status
Sep  1 10:22:15 qssimes02 openais[579]: [TOTEM] Retransmit List: ef5383 ef5384 ef5385 ef5386 ef5387 ef5388 ef5389 ef538a ef538b ef538c
Sep  1 10:22:15 qssimes02 openais[579]: [TOTEM] Retransmit List: ef5383 ef5384 ef5385 ef5386 ef5387 ef5388 ef5389 ef538a ef538b ef538c
Sep  1 10:22:15 qssimes02 openais[579]: [TOTEM] Retransmit List: ef5399 ef539a ef539b ef539c ef539d ef539e
Sep  1 10:22:23 qssimes02 clurgmgrd: [17980]: <info> Executing /opt/dcs_6.0/files/ime_startup status
Sep  1 10:23:23 qssimes02 last message repeated 2 times
Sep  1 10:24:53 qssimes02 last message repeated 3 times
Sep  1 10:26:23 qssimes02 last message repeated 3 times
Sep  1 10:26:53 qssimes02 clurgmgrd: [17980]: <info> Executing /opt/dcs_6.0/files/ime_startup status
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] The token was lost in the OPERATIONAL state.
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Sep  1 10:27:16 qssimes02 openais[579]: [TOTEM] entering GATHER state from 2.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering GATHER state from 0.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Creating commit token because I am the rep.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Saving state aru 117856c high seq received 117856c
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Storing new sequence id for ring 308
Sep  1 10:27:21 qssimes02 clurgmgrd[17980]: <info> State change: qssimes01 DOWN
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering COMMIT state.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering RECOVERY state.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] position [0] member 10.55.32.88:
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] previous ring seq 772 rep 10.55.32.84
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] aru 117856c high delivered 117856c received flag 1
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Did not need to originate any messages in recovery.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] Sending initial ORF token
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] CLM CONFIGURATION CHANGE
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] New Configuration:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] 	r(0) ip(10.55.32.88)
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Left:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] 	r(0) ip(10.55.32.84)
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Joined:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] CLM CONFIGURATION CHANGE
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] New Configuration:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] 	r(0) ip(10.55.32.88)
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Left:
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] Members Joined:
Sep  1 10:27:21 qssimes02 openais[579]: [SYNC ] This node is within the primary component and will provide service.
Sep  1 10:27:21 qssimes02 openais[579]: [TOTEM] entering OPERATIONAL state.
Sep  1 10:27:21 qssimes02 openais[579]: [CLM  ] got nodejoin message 10.55.32.88
Sep  1 10:27:21 qssimes02 openais[579]: [CPG  ] got joinlist message from node 2
Sep  1 10:27:21 qssimes02 clurgmgrd[17980]: <notice> Starting service:imeslave on [ 2 ]
Sep  1 10:27:21 qssimes02 clurgmgrd[17980]: <info> Start failed; node reports: 0 failures, 1 exclusive, 0 dependency errors

2. About 6 minutes later, the remaining cluster node withdrew, core dumped and rebooted, with this error message in /var/log/syslog:

Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: fatal: invalid metadata block
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: bh = 4556247 (magic number)
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: function = ea_foreach_i, file = fs/gfs2/eattr.c, line = 80
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: about to withdraw this file system
Sep  1 10:36:05 qssimes02 kernel: GFS2: fsid=QSSIMES:data05.0: telling LM to withdraw

3. Then the nodes went into a fencing loop where neither could start up, until we intervened manually and fscked each filesystem.
Note: There are no logs or coredumps from the originally fenced node qssimes01 around 10:29.

We have this in /var/log/syslog at that time:

Sep 1 05:15:24 qssimes01 ntpd[5786]: synchronized to 129.46.51.243, stratum 2
Sep 1 05:47:56 qssimes01 ntpd[5786]: synchronized to 10.45.79.4, stratum 2
Sep 1 06:40:21 qssimes01 ntpd[5786]: synchronized to 129.46.51.243, stratum 2
Sep 1 07:40:37 qssimes01 ntpd[5786]: synchronized to 10.45.78.127, stratum 2
Sep 1 08:14:17 qssimes01 ntpd[5786]: synchronized to 129.46.51.243, stratum 2
Sep 1 10:13:47 qssimes01 ntpd[5786]: synchronized to 10.45.79.4, stratum 2

And this in /var/log/syslogs/local4:

Sep 1 10:20:55 qssimes01 last message repeated 3 times
Sep 1 10:22:25 qssimes01 last message repeated 3 times
Sep 1 10:23:55 qssimes01 last message repeated 3 times
Sep 1 10:25:25 qssimes01 last message repeated 3 times
Sep 1 10:26:55 qssimes01 last message repeated 3 times

So, what I'm concerned about is an apparently random reboot of the machine qssimes01 while it was doing IO operations. None of the logs give me any useful information on why qssimes01 rebooted. Or am I missing something here?
Dave, To confirm, there was no vmcore created on qssimes01 and you're certain that it rebooted for some reason in this instance?
Dave, Also, I see it discussed but were you able to entirely disable selinux on these cluster nodes? Thanks again, Frank.
At least a couple of these failures seem to be with extended attributes. It's interesting that the corruption was noticed shortly after a node was fenced. That makes me wonder if there is a problem with journal replay after the node comes back up. To try to determine that, I'd like to propose an experiment.

1. Set everything up to not mount GFS2 automatically.

2. Run fsck.gfs2 again to make sure the file system is clean.

3. Run the failing scenario again.

4. After qssimes02 decides to fence qssimes01, do not restart the cluster right away. Instead, reboot both nodes.

5. Restart the cluster manually but don't mount the GFS2 file system. Instead, save off the GFS2 metadata, bzip2 it, and put it on a web or ftp server where I can analyze it. I want to look at the journals to see if there is something in there that might cause corruption, or if there are clues to the corruption. To save off the metadata, do:

gfs2_edit savemeta <gfs2 device> <some filename>
bzip2 <some filename>

(Do not fsck.gfs2 this time.)

6. Next, mount the file system normally and use it for a while to see if corruption is detected. If so, let me know the messages it gives you, especially ones that say "bh = <block>" like many of the previous messages.
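For step 5, a small wrapper script can help avoid typos when capturing the metadata. This is only a sketch: the device path and output filename below (/dev/mapper/vg-data05, data05.meta) are placeholders, not the actual values from this cluster, and the script defaults to a dry run that just prints the commands it would execute.

```shell
#!/bin/sh
# Sketch of the metadata capture from step 5. DEV and OUT are
# placeholder values; override them with the real GFS2 device and
# the desired output name. DRY_RUN defaults to 1 so the script
# only prints what it would do.
DEV=${DEV:-/dev/mapper/vg-data05}
OUT=${OUT:-data05.meta}
DRY_RUN=${DRY_RUN:-1}

run() {
    # In dry-run mode, echo the command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

run gfs2_edit savemeta "$DEV" "$OUT"   # dump all metadata (no fsck first)
run bzip2 "$OUT"                       # compress before uploading
```

Once the printed commands look right, run it again with DRY_RUN=0 and the real DEV and OUT values.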
That's confirmed. No vmcore files were created on qssimes01 and there are no log entries indicating any type of problem or issue with it. qssimes01 was not rebooted manually and no one was logged into it at the time. It was fenced, and there was no log message on qssimes02 or qssimes01 indicating to me why it was fenced. All we got was this on qssimes02:

Sep 1 10:27:21 qssimes02 kernel: dlm: closing connection to node 1
Sep 1 10:29:01 qssimes02 fenced[625]: qssimes01 not a cluster member after 100 sec post_fail_delay
Sep 1 10:29:01 qssimes02 fenced[625]: fencing node "qssimes01"
Responding to comment #27: Yes, selinux is disabled on both nodes.

Responding to comment #28: Will do. I've currently turned clustering services off for the next reboot and will do as you propose. I have already done a manual fsck of the filesystems, so we will commence another load test to see if we can reproduce the problem again. It may take a day or two for the failure to occur, as it doesn't appear to happen right away.
Dave, Thanks for the confirmation on those. Frank.
Based on the above, I suspect that an error relating to xattrs causes the initial issue, which is almost certainly a hang on the node in question. That eventually causes the node to leave the cluster and get fenced, whereupon recovery is unable to correctly recover the xattrs, resulting in the reported issue when the node tries to reread the same inode. It certainly looks xattr related, so if selinux can be disabled, at least for the gfs2 filesystem, that will probably be a workaround in the short term.
Selinux has been disabled:

qssimes01> sudo /usr/sbin/sestatus
SELinux status: disabled

qssimes02> sudo /usr/sbin/sestatus
Password:
SELinux status: disabled

It was suggested in comment #15 above, then I confirmed it in comment #29 after Frank asked again in comment #27. I disabled it by editing /etc/selinux/config and rebooting. Selinux has been disabled since August 28th @ 9:56 AM Pacific. Need I also disable selinux somewhere else?
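On the question of whether selinux needs disabling anywhere else: on RHEL 5 the persistent setting is the SELINUX= line in /etc/selinux/config, which sestatus reflects after a reboot. As a sketch, a tiny helper can parse that line from a config file; the /tmp example file below is hypothetical, just to show the expected format.

```shell
#!/bin/sh
# Given the path to an /etc/selinux/config-style file, print the
# boot-time SELinux mode ("disabled", "permissive", or "enforcing").
selinux_boot_mode() {
    sed -n 's/^SELINUX=//p' "$1"
}

# Hypothetical example config, matching the format of the real file:
printf 'SELINUXTYPE=targeted\nSELINUX=disabled\n' > /tmp/selinux-config-example
selinux_boot_mode /tmp/selinux-config-example
```

A kernel command line of selinux=0 would also disable it outright, but editing /etc/selinux/config and rebooting (as done here) is sufficient.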
Hi Dave,

Sorry for the confusion. I'm passing information back to the engineers from some of the earlier runs and asking questions on current ones; it's mixing things up here a bit. We'll get that straightened out; we don't intend to bug you for the same information repeatedly like that.
Currently, we see this error on the qssimes01/2 cluster when running strace:

28186 open("/dcs/data05/DATA/dcs/polldata/Q/QCBREW1/QCTXN2/QCBREW1_QCTXN2_014494.dat", O_RDWR) = -1 EIO (Input/output error)

This file is accessible on qssimes02 at this time, but not qssimes01.
It should be noted that we opened a bug report during the installation of the cluster service as well. https://bugzilla.redhat.com/show_bug.cgi?id=512799
FYI we got another crash at 1:07 PM PDT yesterday. No useful log messages. The fenced machine did attempt to dump its core but failed. I cleaned up our old coredumps so (hopefully) that won't happen again. In any case, where are we with this? What can I do to help more on the ticket?
OK, well, still hoping to hear something I guess. It was proposed in our Service Request that I update these 2 systems to RHEL 5.4 since that was released the other day. Would anyone on this ticket agree or disagree? Is there any other information we can provide on this?
It would be a good idea to upgrade the qssimes01/2 cluster.
I do recommend upgrading to 5.4 but I don't think it will solve the corruption problem at hand. We are currently still analyzing the problem and don't have any answers yet. I'll update the bugzilla record when I know more. A big question in my mind is why the node is getting fenced in the first place. That's something that may get better with 5.4. I don't know for sure whether the node was fenced because of the corruption (via kernel panic or some such) or whether the corruption was caused by a problem with the fencing recovery (i.e. a problem with journal replay). There are still too many unknowns here. It would be helpful if someone could analyze the syslogs and/or console logs of all the nodes around the time the fencing occurs to figure out if there was a kernel panic or some such just prior to the node being fenced.
Also note that I still don't have the metadata for the file systems that are showing occasional corruption, and that may also be helpful.
Sorry Robert, that's my bad. I've uploaded those meta files to dropbox.redhat.com and named them like so:

appvol01.meta-bz519049.bz2
datavol01.meta-bz519049.bz2
datavol02.meta-bz519049.bz2
datavol03.meta-bz519049.bz2
datavol04.meta-bz519049.bz2
datavol05.meta-bz519049.bz2

I am doing the update to 5.4 now; I'm hoping it'll help.

Also, not sure if you reviewed the link that Everett Bennett posted above (https://bugzilla.redhat.com/show_bug.cgi?id=512799). We had this ticket open on this cluster already and added this item to our cluster.conf file to work around it:

<gfs_controld plock_ownership="0" plock_rate_limit="0"/>

Could that previous bug and its recommended fix be in any way related to our current problems?

Thanks,

-Dave.
And in reference to comment #41 where you request log analysis: I've just tarred up everything in our /var/log directory on each host and uploaded that to dropbox.redhat.com:

qssimes01-bz519049.tar.bz2
qssimes02-bz519049.tar.bz2

Feel free to do additional analysis if you like. I've tried to post all the logs I thought were relevant around the time of each crash, but I could've missed something. I also uploaded sosreports for our corresponding Service Request on this ticket, which you may have access to. Please don't hesitate to remind me of things I've forgotten, and thanks for your help on this.

I should have these systems up to RHEL 5.4 in a few minutes here . . .

-Dave.
I've retrieved the metadata from dropbox.redhat.com and examined it. The 12 journals have absolutely no information regarding the corrupted blocks. That means either the journals have wrapped since the corruption occurred (which means a lot of activity since the original failure) or else the corruption isn't related to a transaction in the journal. I tend to think it's the latter.

I also retrieved and analyzed the syslog files from comment #44. I found very little useful information beyond the notes previously posted by Dave. (Dave, you did a good job of isolating and posting the relevant messages.) Syslog made it look like the corruption likely caused a kernel panic, which caused the fencing to happen.

I also did an extensive code review and analysis of the extended attribute code, but again, I found no evidence of logic problems that could cause the corruption we're seeing. Which brings me back to the original question of how the file systems became corrupt.

I've examined the metadata and also found very little useful evidence to go on. The blocks complained about in syslog seem to be used now for normal data blocks. One very interesting thing to note is that the block just prior to the corrupt one (the kernel message said bh = 4556247, which in hex is 0x4585d7, so block 0x4585d6) is marked as "unlinked" in the file system rgrp bitmap.

Which leads me to the next topic: the lack of other evidence of corruption (smoking guns), plus the two extraordinary call traces in comment #9 and comment #24, lead me now to believe that the GFS2 code got confused somewhere near the code path that calls try_rgrp_unlink() in rgrp.c. If you look closely at those two call traces, they show allocates calling unlink, which can only happen due to try_rgrp_unlink. That's my best clue so far.
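The block-number arithmetic above is easy to double-check from the shell: GFS2 withdraw messages report the buffer head (bh) number in decimal, while tools like gfs2_edit address blocks in hex.

```shell
# The withdraw message reported bh = 4556247 (decimal).
bh=4556247
printf 'corrupt block:   0x%x\n' "$bh"          # 0x4585d7
printf 'preceding block: 0x%x\n' "$((bh - 1))"  # 0x4585d6, the block marked "unlinked" in the rgrp bitmap
```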
Although I have no evidence to prove it, I believe that try_rgrp_unlink() is causing an extended attribute block that has already been deleted to be deleted again (after it's already been reused as a data block by another inode), or something similar. That's where I'm focusing my efforts now. Perhaps the fact that the bitmap showed 0x4585d6 as "unlinked" somehow caused it to accidentally try to delete 0x4585d7 by mistake, or something strange. Monday is a holiday in the USA, but not in the UK where GFS2 expert Steve Whitehouse resides. Perhaps on Monday Steve can spot something in that area of code.
The theory in comment #45 sounds plausible. I'm currently investigating.
Created attachment 360000 [details]
Possible fix

Looking at the code, I've come up with a theory that the bug may be caused by a race in the inode deallocation code. If I'm correct, the issue will only occur when two (or more) nodes in the cluster are both trying to deallocate the same inode at the same time. This is a fairly unusual event, but certainly not impossible. In that case it looks like one of the nodes would get stuck waiting for the inode glock while the other node is doing the deallocation. When the other node completes the deallocation and releases the glock, the original node might then try to complete the same operation if its attempt to "try lock" the i_open_gl succeeds.

The attached patch adds an extra check to the deallocation code so that after acquiring the inode glock, we ensure that there really is an inode requiring deallocation before proceeding with the remainder of the procedure. The cost is an extra couple of glocks and also a disk access on each deallocation. This is not a big deal, since both glocks will be needed in the deallocation process anyway (so we cache them earlier) and the disk access is a read of the rgrp block containing the bitmap entry for the unlinked inode. That block has to be written later in the deallocation function anyway, so it doesn't really make any performance difference.

The patch is against the upstream gfs2 kernel, but it's trivial to create a RHEL version, and we'll try to get something built shortly for testing purposes. Due to the requirement for a very similar test in the NFS code, we can share the new test with the GFS2 NFS code.

I've tested the patch on a single-node basis only so far, to ensure that it doesn't break anything obvious. Even if it doesn't solve this particular bug, I'm considering adding this patch to the upstream branch anyway, as it should improve the filesystem's robustness in the face of errors in the on-disk data.
Created attachment 360100 [details] RHEL5 version of the upstream patch
Hi Dave,

A RHEL 5.4 based x86_64 build of a kernel set with this patch can be downloaded at the following URL:

http://people.redhat.com/~fhirtz/.gfs2test/

It's just the stock -164 build with the one patch that we're referencing here:

<snip>
* Tue Sep 08 2009 Bob Peterson <rpeterso> [2.6.18-164.bz519049.el5]
- [gfs] GFS2 Filesystem Withdrawal: fatal: invalid metadata block
</snip>

I'd like to request that this be tested against the failure case that we'd been seeing. We'll work with you on any concerns or issues found, but the patch itself should be low risk.

Please let me know if you have questions or concerns.

Thanks again,

Frank.
Just an update here. We have not yet applied the proposed patches. We have updated to RHEL 5.4 and the environment has been stable for a week now. I'd like to leave this open for a while still, but tentatively I think RHEL 5.4 has fixed the problem.
I don't want to ship the fix for RHEL5.x until we know that the patch fixes the problem. We won't know that unless we can recreate the problem without the patch, then apply the patch and see if it fixes it. If the problem doesn't recreate for six months, perhaps we should just close this without shipping the patch. So I'm setting the NEEDINFO flag until we get more feedback.
Thanks Robert. I think that's fair. Despite our best efforts thus far, I cannot reproduce the error when running a fully patched RHEL 5.4 release. We are continuing acceptance testing of the product and I'll update the ticket if we have any other issues. But for now, all seems well.
I think we'll push this patch into 5.5 since then it has time to be tested before it gets released. If there is a need for this in z-stream, we'll have to clone this bug at some future point.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 367809 [details]
Updated patch with bug fix
in kernel-2.6.18-173.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days