471258 – fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	gfs-kmod
Sub Component:
Version:	5.2
Hardware:	i386
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Abhijith Das
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	520985
Depends On:
Blocks:	613107

Reported:	2008-11-12 17:34 UTC by Michael Worsham
Modified:	2013-01-11 02:29 UTC
CC List:	9 users
Fixed In Version:	gfs-kmod-0.1.34-8.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	612624
Environment:
Last Closed:	2010-03-30 08:56:08 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Trial patch to manage racing gfs_creates when selinux is in permissive mode. (384 bytes, patch) 2009-09-15 22:07 UTC, Abhijith Das

Links
System	ID	Private	Priority	Status	Summary	Last Updated
CentOS	3138	0	None	None	None	Never
Red Hat Product Errata	RHSA-2010:0291	0	normal	SHIPPED_LIVE	Moderate: gfs-kmod security, bug fix and enhancement update	2010-03-29 14:12:22 UTC

Description Michael Worsham 2008-11-12 17:34:35 UTC

Description of problem:

Suffering the same problem as posted in CentOS-5 bug report #3138, except we are running RHEL 5.2 and PAE kernel.

[root@app018 log]# uname -a
Linux app018 2.6.18-92.1.13.el5PAE #1 SMP Thu Sep 4 04:05:54 EDT 2008 i686 i686 i386 GNU/Linux

[root@app018 log]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.2 (Tikanga)

Problem Description (detailed):

We are running a 4-node cluster with GFS. One of the systems will remove itself from the cluster, and the remaining 3 nodes seem to be locked out of the GFS file systems. Any processes that were interacting with the file system are in an uninterruptible sleep state, as if they were waiting for IO, but the IO wait is very low, often zero.

The only message in the log appears in /var/log/kernel.log on the system that removes itself from the cluster. All 4 systems need to be restarted to be able to use the GFS systems again.

Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2:   function = gfs_trans_add_gl
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2:   file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_PAE/src/gfs/trans.c, line = 237
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2:   time = 1226500292
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: about to withdraw from the cluster
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: telling LM to withdraw


Version-Release number of selected component (if applicable):

kmod-gfs-PAE-0.1.23-5.el5_2.2
gfs-utils-0.1.17-1.el5


How reproducible:

Not reproducible on demand. This is the 2nd time it has happened in a month.

Comment 1 Robert Peterson 2008-11-12 19:05:49 UTC

Reassigning to Abhi since this looks like a similar bugzilla he worked on.

Comment 2 Michael Worsham 2008-11-12 19:48:48 UTC

Additional notes...

In a 4-node RHEL 5.2 GFS (v1) cluster, we have seen this failure at seemingly random times on different nodes. In each case, there were no device errors in dmesg or in the available logs on our SAN (EMC Clarion) that would indicate a hardware problem.
 
The failure has always resulted in at least one cluster member being fenced.
 
It sounds like the error described in http://rhn.redhat.com/errata/RHBA-2008-0854.html but that ERRATA has been applied to all systems already.

Comment 3 Abhijith Das 2008-11-14 04:08:46 UTC

That errata is outdated by this http://rhn.redhat.com/errata/RHBA-2008-0942.html

It seems like gfs_trans_add_gl is called without the glock in question being held. Can you tell me what kind of load you are running on the filesystem? I can try to recreate the problem on my test cluster and debug it.

Comment 4 Charles Long 2008-11-19 20:50:06 UTC

We applied the updated errata on 11/14, but the error came back today (11/19)
http://rhn.redhat.com/errata/RHBA-2008-0942.html

I have an active RHN support case open on it (1873060).  All four node members are running the PAE kernel and have SELinux set to permissive (I see AVC warnings routinely because we have apache writing logs in an unexpected location).

As for load, each of the failures has occurred during very low utilization periods.  One of the failures occurred while all apache processes were stopped (apache is the only process that should be writing to the fs that failed) and no users were logged on interactively.  It seems to be random.

Comment 5 Tom Lanyon 2009-05-12 04:17:31 UTC

We have had this problem surface on a 5.2 x86_64 cluster with kmod-gfs 0.1.23-5.el5.

Three nodes of a five node cluster were participating in a GFS filesystem holding web application data being served by httpd. One node produced the same error message as the bug reporter (different cluster name + GFS lock table name, obviously) and all other nodes hung IO for this filesystem until we intervened.

Forcefully removing the problematic node from the cluster (by shutting the node down) let the other nodes continue operating without having to restart the whole cluster; the problem node then rejoined the cluster without causing any issues.

Comment 6 Tom Lanyon 2009-05-12 04:45:41 UTC

Further information on the environment we produced this in:

The cluster is made up of 5.2 x86_64 paravirtualised virtual machines running on top of a 5.2 x86_64 xen dom0 cluster. The virtual machines are running:

kernel-xen-2.6.18-92.1.10.el5
kmod-gfs-xen-0.1.23-5.el5
gfs-utils-0.1.17-1.el5


Of specific interest for us is why the node which experiences this error cannot withdraw cleanly from the GFS service and let the other nodes continue to function? Why is removing this node from the cluster necessary?

Comment 7 Charles Long 2009-05-12 12:58:17 UTC

Update on the cluster we used to originally report this bug.  We have now gone several months without seeing the gfs_glock_is_locked_by_me assertion errors.  The last thing we changed was setting SELinux to disabled.  It was previously in permissive mode.

Comment 8 Robert Peterson 2009-09-03 15:46:14 UTC

*** Bug 520985 has been marked as a duplicate of this bug. ***

Comment 10 Tom Lanyon 2009-09-04 01:34:22 UTC

Following up on my previous information - we have seen this 2 or 3 times over a 9 month period. After Charles' comment above, we set SELinux to disabled mode on all hosts and haven't yet had a recurrence of the problem.

This has only been stable for 1 month though, so given the prior infrequency of the issue it may still resurface...

Comment 11 Abhijith Das 2009-09-15 22:07:57 UTC

Created attachment 361146 [details]
Trial patch to manage racing gfs_creates when selinux is in permissive mode.

With selinux in permissive mode, I've been able to hit this bug pretty easily.

It arises when two processes (on different nodes) race each other to create the same file. Upon successful creation of the inode, a security xattr for it needs to be written when selinux is in permissive mode.

One of the racing processes creates the inode and succeeds in acquiring an EXclusive lock on it to set the xattr. The other process fails to actually create the inode (seeing that it exists by now), and does a lookup (which returns a SHared lock) instead to complete the operation. However, this process goes on and incorrectly attempts to write the xattr, which fails with the above assert because it doesn't hold an EX lock on the inode.

The process on the second node should not be attempting to write xattrs since it did not create the inode in the first place. This patch ensures that.
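
To make the race above easier to follow, here is a minimal user-space sketch of it. It is not the attached patch and not GFS code: every name prefixed with sim_ is hypothetical, the lock is reduced to a three-value enum, and the assertion in gfs_trans_add_gl is modelled as a -1 return instead of a cluster withdrawal. It only illustrates why skipping the xattr write on the process that did not create the inode avoids the failed check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lock states, loosely modelled on the SH/EX glock modes described above. */
enum lock_state { LOCK_UNLOCKED, LOCK_SHARED, LOCK_EXCLUSIVE };

/* Hypothetical stand-in for the glock a single process holds on the inode. */
struct sim_glock {
    enum lock_state state;
};

/* Hypothetical stand-in for the inode that both racing processes target. */
struct sim_inode {
    bool exists;
    bool has_security_xattr;
};

/* Models the check that fails in gfs_trans_add_gl: 0 if EX is held,
 * -1 otherwise (the point where real GFS would assert and withdraw). */
static int sim_trans_add_gl(const struct sim_glock *gl)
{
    assert(gl != NULL);
    return gl->state == LOCK_EXCLUSIVE ? 0 : -1;
}

/* Writing the security xattr needs the inode's glock in the transaction. */
static int sim_write_security_xattr(struct sim_inode *ip,
                                    const struct sim_glock *gl)
{
    if (sim_trans_add_gl(gl) != 0)
        return -1;
    ip->has_security_xattr = true;
    return 0;
}

/* The creator gets an EX lock; a process that finds the inode already
 * present falls back to a lookup and only gets a SH lock. */
static void sim_create_or_lookup(struct sim_inode *ip, struct sim_glock *gl,
                                 bool *created)
{
    if (!ip->exists) {
        ip->exists = true;
        gl->state = LOCK_EXCLUSIVE;
        *created = true;
    } else {
        gl->state = LOCK_SHARED;
        *created = false;
    }
}

/* Behaviour described in the bug: the xattr write is attempted
 * regardless of who created the inode. */
static int sim_create_unguarded(struct sim_inode *ip)
{
    struct sim_glock gl = { LOCK_UNLOCKED };
    bool created;

    sim_create_or_lookup(ip, &gl, &created);
    return sim_write_security_xattr(ip, &gl);
}

/* Guarded variant: only the process that created the inode writes
 * the xattr; the loser of the race returns without touching it. */
static int sim_create_guarded(struct sim_inode *ip)
{
    struct sim_glock gl = { LOCK_UNLOCKED };
    bool created;

    sim_create_or_lookup(ip, &gl, &created);
    if (!created)
        return 0;
    return sim_write_security_xattr(ip, &gl);
}
```

Calling sim_create_unguarded twice on the same sim_inode plays both racers in sequence: the first call succeeds, and the second returns -1 because it only holds the shared lock. With sim_create_guarded, both calls return 0 and the xattr is written exactly once, by the creator.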

Comment 12 Abhijith Das 2009-09-22 19:57:42 UTC

Pushed above patch to RHEL55, STABLE2, STABLE3 and master git branches.

Comment 14 Robert Peterson 2009-12-08 16:23:22 UTC

Build 2137969 complete and successful.  This is fixed in
gfs-kmod-0.1.34-8.el5.  Changing status to Modified.

Comment 19 errata-xmlrpc 2010-03-30 08:56:08 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0291.html
