Bug 471258
Summary: | fatal: assertion "gfs_glock_is_locked_by_me(gl) && gfs_glock_is_held_excl(gl)" failed
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | Michael Worsham <michael.worsham>
Component: | gfs-kmod
Assignee: | Abhijith Das <adas>
Status: | CLOSED ERRATA
QA Contact: | Cluster QE <mspqa-list>
Severity: | high
Priority: | urgent
Version: | 5.2
CC: | charles.long, edamato, jkortus, jwest, nstraz, plyons, rpeterso, swhiteho, tom
Target Milestone: | rc
Keywords: | ZStream
Hardware: | i386
OS: | Linux
Fixed In Version: | gfs-kmod-0.1.34-8.el5
Doc Type: | Bug Fix
: | 612624
Last Closed: | 2010-03-30 08:56:08 UTC
Bug Blocks: | 613107
Description
Michael Worsham
2008-11-12 17:34:35 UTC
Reassigning to Abhi since this looks similar to a bugzilla he worked on. Additional notes...

In a 4-node RHEL 5.2 GFS (v1) cluster, we have seen this failure at seemingly random times on different nodes. In each case, there were no device errors in dmesg or in the available logs on our SAN (EMC CLARiiON) that would indicate a hardware problem. The failure has always resulted in at least one cluster member being fenced. It sounds like the error described in http://rhn.redhat.com/errata/RHBA-2008-0854.html but that erratum has already been applied to all systems.

That erratum has been superseded by this one: http://rhn.redhat.com/errata/RHBA-2008-0942.html
It seems like gfs_trans_add_gl is called without the glock in question being held. Can you tell me what kind of load you are running on the filesystem? I can try to recreate the problem on my test cluster and debug it.

We applied the updated erratum ( http://rhn.redhat.com/errata/RHBA-2008-0942.html ) on 11/14, but the error came back today (11/19). I have an active RHN support case open on it (1873060). All four cluster nodes are running the PAE kernel and have SELinux set to permissive (I see avc warnings routinely because we have apache writing logs in an unexpected location).
As for load, each of the failures has occurred during very low utilization periods. One of the failures occurred while all apache processes were stopped (apache is the only process that should be writing to the fs that failed) and no users were logged on interactively. It seems to be random.

We have had this problem surface on a 5.2 x86_64 cluster with kmod-gfs 0.1.23-5.el5. Three nodes of a five-node cluster were participating in a GFS filesystem holding web application data being served by httpd. One node produced the same error message as the bug reporter (different cluster name + GFS lock table name, obviously) and all other nodes hung IO for this filesystem until we intervened.
Removing the problematic node from the cluster forcefully (shutting the node down) let the other nodes continue operating without having to restart the whole cluster; the problem node then rejoined the cluster without causing any issues.
Further information on the environment we produced this in: the cluster is made up of 5.2 x86_64 paravirtualised virtual machines running on top of a 5.2 x86_64 xen dom0 cluster. The virtual machines are running:
kernel-xen-2.6.18-92.1.10.el5
kmod-gfs-xen-0.1.23-5.el5
gfs-utils-0.1.17-1.el5
Of specific interest for us is why the node which experiences this error cannot withdraw cleanly from the GFS service and let the other nodes continue to function. Why is removing this node from the cluster necessary?

Update on the cluster we used to originally report this bug: we have now gone several months without seeing the gfs_lock_by_me errors. The last thing we changed was setting SELinux to disabled. It was previously in permissive mode.

*** Bug 520985 has been marked as a duplicate of this bug. ***

Following up on my previous information: we have seen this 2 or 3 times over a 9 month period. After Charles' comment above, we set SELinux to disabled mode on all hosts and haven't yet had a recurrence of the problem. This has only been stable for 1 month though, so given the prior infrequency of the issue it may still resurface...

Created attachment 361146
Trial patch to manage racing gfs_creates when selinux is in permissive mode.
With selinux in permissive mode, I've been able to hit this bug pretty easily.
It arises when two processes (on different nodes) race each other to create the same file. Upon successful creation of the inode, a security xattr needs to be written for it when selinux is in permissive mode.
One of the racing processes creates the inode and succeeds in acquiring an EXclusive lock on it to set the xattr. The other process fails to actually create the inode (seeing that it exists by now), and does a lookup (which returns a SHared lock) instead to complete the operation. However, this process then goes on and incorrectly attempts to write the xattr, which fails with the above assert because it doesn't hold an EX lock on the inode.
The process on the second node should not be attempting to write xattrs since it did not create the inode in the first place. This patch ensures that.
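The race described above can be sketched as a small user-space model. This is only an illustration of the reasoning, not the actual patch and not GFS code: every name in it (toy_fs, toy_create_file, toy_wants_xattr_fixed and so on) is hypothetical, and the real create path is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock states; illustrative names, not the GFS constants. */
enum toy_lock { TOY_UNLOCKED, TOY_SHARED, TOY_EXCLUSIVE };

/* Toy stand-in for the shared filesystem: a single directory entry. */
struct toy_fs {
    bool inode_exists;
};

/* What one process ends up with after its create call returns. */
struct toy_create {
    enum toy_lock held; /* lock held on the inode */
    bool created;       /* true only for the process that made the inode */
};

/*
 * Model of the create path as described in the comment above: the winner
 * creates the inode and holds it EX; the loser finds it already present
 * and falls back to a lookup, which hands back an SH lock.
 */
static struct toy_create toy_create_file(struct toy_fs *fs)
{
    struct toy_create r;

    assert(fs != NULL);
    if (!fs->inode_exists) {
        fs->inode_exists = true;
        r.held = TOY_EXCLUSIVE;
        r.created = true;
    } else {
        r.held = TOY_SHARED;
        r.created = false;
    }
    return r;
}

/* In this model, writing the xattr is only legal under the EX lock. */
static bool toy_xattr_write_allowed(const struct toy_create *r)
{
    return r->held == TOY_EXCLUSIVE;
}

/* Buggy behaviour: every caller goes on to write the security xattr. */
static bool toy_wants_xattr_buggy(const struct toy_create *r)
{
    (void)r;
    return true;
}

/* Behaviour after the fix as described: only the creator writes it. */
static bool toy_wants_xattr_fixed(const struct toy_create *r)
{
    return r->created;
}
```

Calling toy_create_file twice on the same toy_fs plays the two racing processes. With the buggy predicate the second caller wants to write the xattr while holding only an SH lock, which is the condition the assert in the summary catches; with the fixed predicate it never attempts the write.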
Pushed above patch to RHEL55, STABLE2, STABLE3 and master git branches.

Build 2137969 complete and successful. This is fixed in gfs-kmod-0.1.34-8.el5. Changing status to Modified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0291.html