Red Hat Bugzilla – Bug 471258
fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed
Last modified: 2013-01-10 21:29:47 EST
Description of problem:
We are suffering from the same problem as posted in CentOS-5 bug report #3138, except we are running RHEL 5.2 with the PAE kernel.
[root@app018 log]# uname -a
Linux app018 2.6.18-92.1.13.el5PAE #1 SMP Thu Sep 4 04:05:54 EDT 2008 i686 i686 i386 GNU/Linux
[root@app018 log]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Problem Description (detailed):
We are running a 4-node cluster with GFS. One of the systems will remove itself from the cluster, and the remaining 3 nodes seem to be locked out of the GFS file systems. Any processes that were interacting with the file system are in an uninterruptible sleep state as if they are waiting for IO, but the IO wait is very low, often zero.
The only log message appears in /var/log/kernel.log on the system that removes itself from the cluster. All 4 systems need to be restarted before the GFS file systems are usable again.
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: function = gfs_trans_add_gl
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: file = /builddir/build/BUILD/gfs-kmod-0.1.23/_kmod_build_PAE/src/gfs/trans.c, line = 237
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: time = 1226500292
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: about to withdraw from the cluster
Nov 12 09:31:32 app018 kernel: GFS: fsid=app-lamp-prod:phptmp.2: telling LM to withdraw
How reproducible:
Not at all. This is the 2nd time it has happened in a month.
Reassigning to Abhi since this looks like a similar bugzilla he worked on.
In a 4-node RHEL 5.2 GFS (v1) cluster, we have seen this failure at seemingly random times on different nodes. In each case, there were no device errors in dmesg or in the available logs on our SAN (EMC CLARiiON) that would indicate a hardware problem.
The failure has always resulted in at least one cluster member being fenced.
It sounds like the error described in http://rhn.redhat.com/errata/RHBA-2008-0854.html, but that erratum has already been applied to all systems.
That erratum has been superseded by http://rhn.redhat.com/errata/RHBA-2008-0942.html
It seems like gfs_trans_add_gl is called without the glock in question being held. Can you tell me what kind of load you are running on the filesystem? I can try to recreate the problem on my test cluster and debug it.
We applied the updated errata on 11/14, but the error came back today (11/19)
I have an active RHN support case open on it (1873060). All four node members are running the PAE kernel and have SELinux set to permissive (I see AVC warnings routinely because we have apache writing logs in an unexpected location).
As for load, each of the failures has occurred during very low utilization periods. One of the failures occurred while all apache processes were stopped (the only processes that should have been writing to the filesystem that failed) and no users were logged on interactively. It seems to be random.
We have had this problem surface on a 5.2 x86_64 cluster with kmod-gfs 0.1.23-5.el5.
Three nodes of a five-node cluster were participating in a GFS filesystem holding web application data served by httpd. One node produced the same error message as the bug reporter (different cluster name + GFS lock table name, obviously), and all of the other nodes hung IO on this filesystem until we intervened.
Forcefully removing the problematic node from the cluster (shutting down the node) let the other nodes continue operating without having to restart the whole cluster; the problem node then rejoined the cluster without causing any issues.
Further information on the environment we produced this in:
The cluster is made up of 5.2 x86_64 paravirtualised virtual machines running on top of a 5.2 x86_64 xen dom0 cluster. Virtual machines are running:
Of specific interest to us: why can't the node that experiences this error withdraw cleanly from the GFS service and let the other nodes continue to function? Why is removing this node from the cluster necessary?
Update on the cluster we used to originally report this bug. We have now gone several months without seeing the gfs_glock_is_locked_by_me assertion errors. The last thing we changed was setting SELinux to disabled. It was previously in permissive mode.
*** Bug 520985 has been marked as a duplicate of this bug. ***
Following up on my previous information - we have seen this 2 or 3 times over a 9 month period. After Charles' comment above, we set SELinux to disabled mode on all hosts and haven't yet had a recurrence of the problem.
This has only been stable for 1 month though, so given the prior infrequency of the issue it may still resurface...
Created attachment 361146 [details]
Trial patch to manage racing gfs_creates when selinux is in permissive mode.
With selinux in permissive mode, I've been able to hit this bug pretty easily.
It arises when two processes (on different nodes) race each other to create the same file. Upon successful creation of the inode, a security xattr needs to be written for it when SELinux is in permissive mode.
One of the racing processes creates the inode and succeeds in acquiring an EXclusive lock on it to set the xattr. The other process fails to actually create the inode (seeing that it exists by now), and instead does a lookup (which returns a SHared lock) to complete the operation. However, this process goes on and incorrectly attempts to write the xattr, which fails with the above assert because it does not hold an EX lock on the inode.
The process on the second node should not be attempting to write xattrs since it did not create the inode in the first place. This patch ensures that.
Pushed the above patch to the RHEL55, STABLE2, STABLE3 and master git branches.
Build 2137969 complete and successful. This is fixed in
gfs-kmod-0.1.34-8.el5. Changing status to Modified.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.