Bug 678585
Summary: dlm_controld: fcntl F_SETLKW should be interruptible in GFS2
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Tomas Herfert <therfert>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: adas, anmiller, bmarzins, ccaulfie, cluster-maint, djansa, fdinitto, jawilson, jeder, jwest, lhh, mharvey, mschick, rdassen, rpeterso, rrajasek, rsvoboda, sbradley, swhiteho, teigland, therfert
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: cluster-3.0.12.1-2.el6
Doc Type: Bug Fix
Doc Text:
Cause: GFS2 posix lock operations (implemented in the dlm) are not interruptible while they wait for another posix lock; they were originally implemented this way for simplicity.
Consequence: processes that created a deadlock with posix locks (e.g. AB/BA) could not be killed to resolve the problem, and one node would need to be reset.
Fix: the dlm uses a new kernel feature that allows the waiting process to be killed, and information about the killed process is now passed to dlm_controld so it can clean up.
Result: processes deadlocked on gfs2 posix locks can now be recovered by killing one or more of them.
Story Points: ---
Cloned To: 707005 (view as bug list)
Last Closed: 2011-12-06 14:50:43 UTC
Bug Blocks: 695824, 707005
Attachments: 479549 (debug command results), 481967 (kernel patch), 481968 (dlm_controld patch)
Description
Tomas Herfert
2011-02-18 14:32:46 UTC
Can you tell me which NICs the cluster is using, and which firmware version the NICs are using, if applicable.

I assume from the report that the application is making use of fcntl POSIX locks. That appears to be the code path in question. Depending on the application requirements, there may be better solutions. Can you confirm that there was no process which was holding a lock and thus blocking the process in question?

To debug a posix lock problem, the following information from both nodes would be a good start:

    cman_tool nodes
    corosync-objctl
    group_tool -n
    dlm_tool log_plock
    dlm_tool plocks <fsname>
    /etc/cluster/cluster.conf
    /var/log/messages
    /proc/locks
    ps ax -o pid,stat,cmd,wchan

(In reply to comment #2)
> Can you tell me which NICs the cluster is using, and which firmware version
> the NICs are using, if applicable.

Used NICs: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz

> I assume from the report that the application is making use of fcntl POSIX
> locks. That appears to be the code path in question. Depending on the
> application requirements, there may be better solutions. Can you confirm
> that there was no process which was holding a lock and thus blocking the
> process in question?

There is only one process on each node which accesses the shared files, as I explained. However, I don't know exactly what operations those processes are doing.

Created attachment 479549 [details]
Results of the commands/content of the files you wanted in comment #3

Please find the results attached.
nodeA = messaging-22
nodeB = messaging-23

(In reply to comment #3)
> To debug a posix lock problem, the following information from both nodes
> would be a good start: [...]

When I was looking at the result of "corosync-objctl", I noticed there is a multicast address "totem.interface.mcastaddr=239.192.149.169". I hadn't expected it to use multicast when it wasn't configured anywhere. Multicast is routed via a different interface, bond0. That is a bonding interface (mode=balance-tlb) which uses the following two physical interfaces: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection

(In reply to comment #4)
> There is only one process on each node which accesses the shared files, as I
> explained. However, I don't know exactly what operations those processes are
> doing.

It appears that the application has gotten itself into a standard A,B / B,A deadlock with posix locks. Our clustered posix locks do not do EDEADLK detection.
2394838 WR 2-2 nodeid 22 pid 29389 owner ffff8804de8c2368 rown 0
2394838 WR 1-1 nodeid 23 pid 28907 owner ffff88050fc29050 rown 0
2394838 WR 1-1 nodeid 22 pid 29389 owner ffff8804de8c2368 rown 0 WAITING
2394846 WR 1-1 nodeid 22 pid 29389 owner ffff8804de8c2368 rown 0
2394846 WR 2-2 nodeid 23 pid 28907 owner ffff88050fc29050 rown 0
2394846 WR 1-1 nodeid 23 pid 28907 owner ffff88050fc29050 rown 0 WAITING

38.1 = byte 1 of inode 38 (2394838) = /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock
38.2 = byte 2 of same
46.1 = byte 1 of inode 46 (2394846) = /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock
46.2 = byte 2 of same

node 22 holds WR on 38.2, 46.1
node 22 waits WR on 38.1 (held by node 23)
node 23 holds WR on 38.1, 46.2
node 23 waits WR on 46.1 (held by node 22)

The standard deadlock avoidance methods (lock ordering, non-blocking requests) are the only two options for avoiding this.

Thanks much for this fast finding, Dave.

Comment from Andy Taylor from the HornetQ team:

We basically hold 2 locks at bytes 1 and 2: the live server obtains the lock at byte 1 and holds it; the backup obtains the lock at byte 2 and holds it; the backup then tries the lock at byte 1 and will block until the live server dies. Once the live server dies and the java process is killed, the lock at byte 1 is removed, and the backup server obtains it and releases its backup lock for other backup nodes.

We are using the Java file channel classes for doing the locking, so if the lock is not released once the java process running the live server is killed, then this is an issue with how the OS works with the file system. I'm not an expert in this field so can't really comment on that. However, the following code should recreate the issue; run it twice from the same directory and then kill one node.
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;

    public class ServerLockTest {
        public static void main(String[] args) throws Exception {
            File file = new File(".", "server.lock");
            if (!file.exists()) file.createNewFile();
            RandomAccessFile raFile = new RandomAccessFile(file, "rw");
            FileChannel channel = raFile.getChannel();
            channel.lock(1, 1, false);
            System.out.println("lock obtained");
            System.in.read(); // hold the lock until the process is killed
        }
    }

If I've understood this correctly, the problem is that the fcntl F_SETLKW call on GFS2 is not interruptible? The Linux man page says:

    F_SETLKW (struct flock *)
        As for F_SETLK, but if a conflicting lock is held on the file, then
        wait for that lock to be released. If a signal is caught while
        waiting, then the call is interrupted and (after the signal handler
        has returned) returns immediately (with return value -1 and errno
        set to EINTR; see signal(7)).

So I'd expect that the process should be able to continue and not get stuck as a zombie if it is killed. I'm not sure whether that is a requirement of fcntl locks or something that is Linux specific, though. We have a bug open for flock, which has a similar non-interruptible problem, in bz #472380, but so far as I know, nobody has an application for which that matters. I'm surprised that the fcntl lock implementation is not interruptible though, since we use the library functions provided, like other filesystems. If we can confirm the problem, then I'll have a look at it shortly. I'm currently travelling, otherwise I'd check the POSIX docs to see what they have to say as well.

It's using wait_event() to wait for a response from dlm_controld. One hard part about wait_event_interruptible() is the kernel and userland state getting out of sync, i.e. corrupted lock state. But perhaps we could handle process termination as a special case, since all lock state is being cleared. If the dlm could check whether process termination was the cause of wait_event_interruptible() returning, it could possibly let dlm_controld know somehow, so that dlm_controld could do a special lock cleanup.
Can somebody please comment on whether this could be fixed on the GFS2 side, or whether you still need changes on the application side?

It will require quite a bit of work to know whether it's possible to handle this in dlm/gfs2 or not. If it is possible, the change will probably be too complicated to have ready any time soon.

(In reply to comment #13)
> It will require quite a bit of work to know whether it's possible to handle
> this in dlm/gfs2 or not. If it is possible, the change will probably be too
> complicated to have ready any time soon.

We potentially have a customer that this issue will affect. What kind of priority does this have?

Based on the explanation of what these locks are used for (detecting a node failure), there's another possibly simple way of avoiding the deadlock: use two different processes to lock the two files.

(In reply to comment #17)
> Based on the explanation of what these locks are used for (detecting a node
> failure), there's another possibly simple way of avoiding the deadlock: use
> two different processes to lock the two files.

There are two different processes. The first JVM (process) on NodeA is the active node, and the second JVM (process) is on NodeB, which is the backup node. So, there are two processes on two different nodes locking the file at two positions.

I mean use more than one process on each node.

(In reply to comment #20)
> There are two different processes. [...] So, there are two processes on two
> different nodes locking the file at two positions.

Okay, but that's actually possible.
(In reply to comment #22)
> Okay, but that's actually possible.

I meant NOT possible.

> we basically hold 2 locks at bytes 1 and 2: the live server obtains the lock
> at byte 1 and holds it; the backup obtains the lock at byte 2 and holds it;
> the backup then tries the lock at byte 1 and will block until the live
> server dies.

> 38.1 = byte 1 of inode 38 (2394838) = /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock
> 38.2 = byte 2 of same
> 46.1 = byte 1 of inode 46 (2394846) = /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock
> 46.2 = byte 2 of same

> node 22 holds WR on 38.2, 46.1
> node 22 waits WR on 38.1 (held by node 23)
> node 23 holds WR on 38.1, 46.2
> node 23 waits WR on 46.1 (held by node 22)

deadlock:

node22,pid1: hold lock live-nodeA (fileA,byte1)
node22,pid1: wait lock live-nodeB (fileB,byte1)
node22,pid1: hold lock backup-nodeB (fileB,byte2)

node23,pid1: hold lock live-nodeB (fileB,byte1)
node23,pid1: wait lock live-nodeA (fileA,byte1)
node23,pid1: hold lock backup-nodeA (fileA,byte2)

no deadlock:

node22,pid1: hold lock live-nodeA (fileA,byte1)
node22,pid2: wait lock live-nodeB (fileB,byte1)
node22,pid3: hold lock backup-nodeB (fileB,byte2)

node23,pid1: hold lock live-nodeB (fileB,byte1)
node23,pid2: wait lock live-nodeA (fileA,byte1)
node23,pid3: hold lock backup-nodeA (fileA,byte2)

What I was suggesting above is that you could fork/exec a process to acquire each of the server locks to avoid the deadlock. (I am not suggesting that we shouldn't try to handle this in gfs/dlm; I very much think we should. I am simply offering possible workarounds.)

(In reply to comment #24)
> What I was suggesting above is that you could fork/exec a process to acquire
> each of the server locks to avoid the deadlock.

I appreciate the attempt at a workaround, but the process is a Java Virtual Machine. Doing a fork/exec from that is not a feasible workaround.

Created attachment 481967 [details]
kernel patch
Experimental kernel patch to allow a process blocked on plock to be killed, and cleaned up.
Created attachment 481968 [details]
dlm_controld patch
dlm_controld patch that goes along with the previous kernel patch.
Using the two patches in comments 29 and 30, I created a simple AB/BA deadlock with plocks, resolved it by killing one of the processes, and had all the lock state properly cleaned up. I expect this will resolve the specific problem in this bz. The patches don't have any testing beyond the trivial proof-of-concept test. This also includes a change to the user/kernel interface, but I don't believe it creates any incompatibilities with previous user or kernel versions.

The only question is what happens if we have a "new" kernel with "old" userspace? So far as I can see, all the other combinations would work correctly, and otherwise this looks like a good solution.

With a new kernel and old userspace, the kernel would complain ("dev_write no op ...") when userspace responded to an unlock caused by a close (as opposed to an unlock the process called). It would still work.

Ok, excellent. Sounds good.

I'm waiting to do anything further with this patch until I hear if it works.

Dave, can you please provide a test package with these patches applied so that middleware QE can apply the new package to our test environment and execute our tests? Thanks, Mike

Scratch kernel build including the patch:
https://brewweb.devel.redhat.com/taskinfo?taskID=3177437

Here's a dlm_controld x86_64 binary with the patch; I'm hoping this will work, but I'm not certain:
http://people.redhat.com/~teigland/dlm_controld

http://download.devel.redhat.com/brewroot/scratch/adas/task_3328652/
build with patch in comment #29

This issue requires a change to both dlm (in kernel) and dlm_controld (in cluster userspace). I had thought the original bug 678585 was for the dlm kernel component, but it was actually for cluster. I've cloned the original bz, so we now have two bugs:

bug 678585: for dlm_controld in userspace
bug 707005: for dlm in kernel

posted: https://www.redhat.com/archives/cluster-devel/2011-May/msg00068.html
(Waiting to push to cluster.git RHEL6 branch until the 6.2 branch is ready.)
To test, create a deadlock between two processes on separate nodes, then kill one of the processes:

node1: lock fileA
node2: lock fileB
node1: lock fileB
node2: lock fileA

Kill the process on node1, and node2 should get the lock on fileA.

Verified against cman-3.0.12.1-7.el6.x86_64:

[root@buzz-01 shared]# /tmp/lockhold A B
Press enter to lock A
Attempting to lock A
Lock Acquired
Press enter to lock B
Attempting to lock B
^C

[root@buzz-02 shared]# /tmp/lockhold B A
Press enter to lock B
Attempting to lock B
Lock Acquired
Press enter to lock A
Attempting to lock A
Lock Acquired

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: GFS2 posix lock operations (implemented in the dlm) are not interruptible while they wait for another posix lock; they were originally implemented this way for simplicity.
Consequence: processes that created a deadlock with posix locks (e.g. AB/BA) could not be killed to resolve the problem, and one node would need to be reset.
Fix: the dlm uses a new kernel feature that allows the waiting process to be killed, and information about the killed process is now passed to dlm_controld so it can clean up.
Result: processes deadlocked on gfs2 posix locks can now be recovered by killing one or more of them.

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html