Created attachment 479523 [details]
/sys/kernel/debug/gfs2/jbm:jbm/glocks from both nodes

Description of problem:
The JBoss QA team hit the following problem while testing failover of a HornetQ application. Testing is based on a RHEL6 cluster with clvmd and GFS2. The application runs on two nodes, with one process on each node. The processes on both nodes use shared files[1] on the GFS2 filesystem.

The problem is that when the process is killed on either node with SIGKILL (kill -9), it always becomes a zombie that cannot be reaped. It appears to be hung on a GFS2/DLM lock - see the trace in [2]. When the process is killed on one node it becomes a zombie, while the process on the other node keeps working without a problem. When that process is then killed as well, it also becomes a zombie, leaving an unkillable zombie process on both nodes. When one node is rebooted, the zombie process on the other node exits automatically.

I saved /sys/kernel/debug/gfs2/jbm:jbm/glocks before and after the kill and there are no differences. The file from both nodes is attached.

[1] The following files are opened when the process becomes a zombie:

Node A:
java 29389 therfert 45u REG 253,4 1048576 2394856 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-bindings-1.bindings
java 29389 therfert 57u REG 253,4 1048576 2395114 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-bindings-2.bindings
java 29389 therfert 64u REG 253,4 10485760 2395372 /mnt/jbm/common/therfert/nodeA/hornetq/journal/hornetq-data-1.hq
java 29389 therfert 65u REG 253,4 10485760 2397940 /mnt/jbm/common/therfert/nodeA/hornetq/journal/hornetq-data-2.hq
java 29389 therfert 81u REG 253,4 1048576 2421052 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-jms-1.jms
java 29389 therfert 84u REG 253,4 1048576 2421310 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-jms-2.jms
java 29389 therfert 90uw REG 253,4 19 2394838 /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock
java 29389 therfert 118uw REG 253,4 19 2394846 /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock

Node B:
java 29389 therfert 118uw REG 253,4 19 2394846 /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock
java 29389 therfert 45u REG 253,4 1048576 2394856 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-bindings-1.bindings
java 29389 therfert 57u REG 253,4 1048576 2395114 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-bindings-2.bindings
java 29389 therfert 64u REG 253,4 10485760 2395372 /mnt/jbm/common/therfert/nodeA/hornetq/journal/hornetq-data-1.hq
java 29389 therfert 65u REG 253,4 10485760 2397940 /mnt/jbm/common/therfert/nodeA/hornetq/journal/hornetq-data-2.hq
java 29389 therfert 81u REG 253,4 1048576 2421052 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-jms-1.jms
java 29389 therfert 84u REG 253,4 1048576 2421310 /mnt/jbm/common/therfert/nodeA/hornetq/bindings/hornetq-jms-2.jms
java 29389 therfert 90uw REG 253,4 19 2394838 /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock

[2]
INFO: task java:29180 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java D 0000000000000000 3432 29180 28852 0x00000084
 ffff8804e233fd98 0000000000000046 0000000000000000 ffffffffa05b6ec0
 ffffffffa05b6ed8 0000000000000046 ffff88003b7d6fd8 00000001007ffeaf
 ffff8804e21e1500 ffff8804e233ffd8 0000000000010608 ffff8804e21e1500
Call Trace:
 [<ffffffffa05aace5>] dlm_posix_lock+0x1b5/0x2d0 [dlm]
 [<ffffffff81096e90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa05e247b>] gfs2_lock+0x7b/0xf0 [gfs2]
 [<ffffffff811d4623>] vfs_lock_file+0x23/0x40
 [<ffffffff811d487f>] fcntl_setlk+0x17f/0x340
 [<ffffffff8119b65d>] sys_fcntl+0x19d/0x580
 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):
RHEL6.0 up to date. Kernel: 2.6.32-71.14.1.el6.x86_64.debug

How reproducible:
https://issues.jboss.org/browse/JBPAPP-5956

Actual results:
hung zombie process after the kill

Expected results:
the process should be killed immediately

Additional info:
Let me know what else you need.

Thanks
Tomas
Can you tell me which NICs the cluster is using, and which firmware version the NICs are using, if applicable?

I assume from the report that the application is making use of fcntl POSIX locks. That appears to be the code path in question. Depending on the application requirements, there may be better solutions. Can you confirm that there was no process holding a lock and thus blocking the process in question?
To debug a posix lock problem, the following information from both nodes would be a good start:

cman_tool nodes
corosync-objctl
group_tool -n
dlm_tool log_plock
dlm_tool plocks <fsname>
/etc/cluster/cluster.conf
/var/log/messages
/proc/locks
ps ax -o pid,stat,cmd,wchan
(In reply to comment #2)
> Can you tell me which NICs the cluster is using, and which firmware version
> the NICs are using, if applicable.

Used NICs: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz

> I assume from the report that the application is making use of fcntl POSIX
> locks. That appears to be the code path in question. Depending on the
> application requirements, there may be better solutions. Can you confirm that
> there was no process holding a lock and thus blocking the process in question?

There is only one process on each node that accesses the shared files, as explained above. However, I don't know exactly which operations those processes are performing.
Created attachment 479549 [details]
Results of the commands/content of the files requested in comment #3

Please find the results attached.

nodeA = messaging-22
nodeB = messaging-23

(In reply to comment #3)
> To debug a posix lock problem, the following information from both nodes
> would be a good start:
>
> cman_tool nodes
> corosync-objctl
> group_tool -n
> dlm_tool log_plock
> dlm_tool plocks <fsname>
> /etc/cluster/cluster.conf
> /var/log/messages
> /proc/locks
> ps ax -o pid,stat,cmd,wchan
When I was looking at the output of "corosync-objctl", I noticed the multicast address "totem.interface.mcastaddr=239.192.149.169". I hadn't expected it to use multicast, since multicast wasn't configured anywhere. Multicast is routed via a different interface, bond0. It is a bonding interface (mode=balance-tlb) over the following two physical interfaces: Intel Corporation 82598EB 10-Gigabit AF Dual Port Network Connection.

(In reply to comment #4)
> (In reply to comment #2)
> > Can you tell me which NICs the cluster is using, and which firmware version
> > the NICs are using, if applicable.
>
> Used NICs: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz
>
> > I assume from the report that the application is making use of fcntl POSIX
> > locks. That appears to be the code path in question. Depending on the
> > application requirements, there may be better solutions. Can you confirm that
> > there was no process holding a lock and thus blocking the process in question?
>
> There is only one process on each node that accesses the shared files, as
> explained above. However, I don't know exactly which operations those
> processes are performing.
It appears that the application has gotten itself into a standard A,B / B,A deadlock with posix locks. Our clustered posix lock implementation does not do EDEADLK detection.

2394838 WR 2-2 nodeid 22 pid 29389 owner ffff8804de8c2368 rown 0
2394838 WR 1-1 nodeid 23 pid 28907 owner ffff88050fc29050 rown 0
2394838 WR 1-1 nodeid 22 pid 29389 owner ffff8804de8c2368 rown 0 WAITING
2394846 WR 1-1 nodeid 22 pid 29389 owner ffff8804de8c2368 rown 0
2394846 WR 2-2 nodeid 23 pid 28907 owner ffff88050fc29050 rown 0
2394846 WR 1-1 nodeid 23 pid 28907 owner ffff88050fc29050 rown 0 WAITING

> 38.1 = byte 1 of inode 38 (2394838) = /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock
> 38.2 = byte 2 of same
> 46.1 = byte 1 of inode 46 (2394846) = /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock
> 46.2 = byte 2 of same

node 22 holds WR on 38.2, 46.1
node 22 waits WR on 38.1 (held by node 23)
node 23 holds WR on 38.1, 46.2
node 23 waits WR on 46.1 (held by node 22)

The standard deadlock avoidance methods (lock ordering and non-blocking requests) are the only two options for avoiding this.
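For anyone who wants to experiment with the second of those options, here is a minimal C sketch of the non-blocking approach (the file names and retry policy are assumptions for the example, not code from this report): both byte locks are requested with F_SETLK, and if the second one is busy the first is dropped and the attempt retried, so a process never holds one server lock while blocked waiting for the other, and the A,B / B,A cycle cannot form.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to take a 1-byte write lock at 'offset' without blocking. */
static int try_lock_byte(int fd, off_t offset)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = offset, .l_len = 1 };
    return fcntl(fd, F_SETLK, &fl);    /* F_SETLK never blocks */
}

int main(void)
{
    /* Hypothetical lock files standing in for the two server.lock files. */
    int fd_a = open("serverA.lock", O_RDWR | O_CREAT, 0644);
    int fd_b = open("serverB.lock", O_RDWR | O_CREAT, 0644);
    if (fd_a < 0 || fd_b < 0)
        return 1;

    for (;;) {
        if (try_lock_byte(fd_a, 1) == 0) {
            if (try_lock_byte(fd_b, 1) == 0)
                break;                 /* got both locks */
            /* Could not get B: drop A, so we never hold one lock while
             * waiting for the other - holding while waiting is what
             * forms the AB/BA cycle. */
            struct flock un = { .l_type = F_UNLCK, .l_whence = SEEK_SET,
                                .l_start = 1, .l_len = 1 };
            fcntl(fd_a, F_SETLK, &un);
        }
        sleep(1);                      /* back off, then retry */
    }
    printf("both locks acquired\n");
    pause();                           /* hold the locks */
    return 0;
}

The same idea is available to the Java application through FileChannel.tryLock() instead of FileChannel.lock().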
Thanks much for this fast finding, Dave.
Comment from Andy Taylor from the HornetQ team:

We basically hold two locks, at bytes 1 and 2. The live server obtains the lock at byte 1 and holds it. The backup server obtains the lock at byte 2 and holds it, then tries the lock at byte 1 and blocks until the live server dies. Once the live server dies and its java process is killed, the lock at byte 1 is released; the backup server obtains it and releases its backup lock for other backup nodes.

We are using the Java FileChannel classes for the locking, so if the lock is not released once the java process running the live server is killed, then this is an issue with how the OS works with the file system. I'm not an expert in this field, so I can't really comment on that. However, the following code should recreate the issue; run it twice from the same directory and then kill one node.

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
public class ServerLockTest {
    public static void main(String[] args) throws Exception {
        File file = new File(".", "server.lock");
        if (!file.exists()) {
            file.createNewFile();
        }
        RandomAccessFile raFile = new RandomAccessFile(file, "rw");
        FileChannel channel = raFile.getChannel();
        channel.lock(1, 1, false); // blocks until the 1-byte lock at offset 1 is free
        System.out.println("lock obtained");
        Thread.sleep(Long.MAX_VALUE); // keep holding the lock until the process is killed
    }
}
If I've understood this correctly, the problem is that the fcntl F_SETLKW call on GFS2 is not interruptible? The Linux man page says:

F_SETLKW (struct flock *)
    As for F_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).

So I'd expect that the process should be able to continue and not get stuck as a zombie if it is killed. I'm not sure whether that is a requirement of fcntl locks or something that is Linux specific, though.

We have a bug open for flock, which has a similar non-interruptible problem, in bz #472380, but so far as I know nobody has an application for which that matters. I'm surprised that the fcntl lock implementation is not interruptible though, since we use the library functions provided, like other filesystems. If we can confirm the problem, then I'll have a look at it shortly. I'm currently travelling, otherwise I'd check the POSIX docs to see what they have to say as well.
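As a reference point, here is a minimal C sketch (the lock file name is an assumption, not from this report) of the behaviour the man page describes on a local filesystem: a process blocked in F_SETLKW returns -1 with errno set to EINTR when a signal is caught. On GFS2 at the time of this report, the wait inside dlm_posix_lock() was not interruptible, so signals, including SIGKILL, could not break it.

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; /* only purpose: interrupt the syscall */ }

int main(void)
{
    /* Hypothetical lock file; run a second copy first so the lock is already held. */
    int fd = open("test.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    /* Install a handler without SA_RESTART so the blocked fcntl() is
     * interrupted rather than transparently restarted. */
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigaction(SIGALRM, &sa, NULL);
    alarm(5);                          /* deliver SIGALRM in 5 seconds */

    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 1, .l_len = 1 };
    if (fcntl(fd, F_SETLKW, &fl) == -1)
        printf("fcntl: %s\n", strerror(errno));   /* EINTR expected here */
    else
        printf("lock acquired\n");
    return 0;
}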
It's using wait_event() to wait for a response from dlm_controld. One hard part about wait_event_interruptible() is the kernel and userland state getting out of sync, i.e. corrupted lock state. But perhaps we could handle process termination as a special case since all lock state is being cleared. If the dlm could check whether process termination was the cause of wait_event_interruptible returning, then it could possibly let dlm_controld know somehow, so that dlm_controld could do a special lock cleanup.
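For readers unfamiliar with the kernel side, the idiom under discussion looks roughly like the fragment below. This is a generic illustration under stated assumptions, not the dlm patch itself: wait_event() cannot be interrupted at all, while wait_event_killable() lets a fatal signal such as SIGKILL end the wait, after which the code must unhook the orphaned request and tell dlm_controld to discard its copy of the state. The op structure, wait queue, and helper name are hypothetical.

/* Illustrative kernel-style fragment; names are hypothetical, and this is
 * not the actual fs/dlm/plock.c change. */

/* Before: sleep until dlm_controld replies, regardless of signals. */
wait_event(recv_wq, op->done);

/* After: a fatal signal (e.g. SIGKILL) can end the wait early. */
rv = wait_event_killable(recv_wq, op->done);
if (rv == -ERESTARTSYS) {
        /* Remove op from the pending list and notify dlm_controld
         * (hypothetical helper) so its lock state is cleaned up too. */
        cancel_pending_plock(op);
        return -EINTR;
}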
Can somebody please comment on whether this could be fixed on the GFS2 side, or whether changes are still needed on the application side?
It will require quite a bit of work to know if it's possible to handle this in dlm/gfs2 or not. If it is possible, the change will probably be too complicated to have ready any time soon.
(In reply to comment #13)
> It will require quite a bit of work to know if it's possible to handle this
> in dlm/gfs2 or not. If it is possible, the change will probably be too
> complicated to have ready any time soon.

We potentially have a customer that this issue will affect. What kind of priority does this have?
Based on the explanation of what these locks are used for (detecting a node failure), there's another possibly simple way of avoiding the deadlock: use two different processes to lock the two files.
(In reply to comment #17)
> Based on the explanation of what these locks are used for (detecting a node
> failure), there's another possibly simple way of avoiding the deadlock: use
> two different processes to lock the two files.

There are two different processes. The first JVM (process) on NodeA is the active node, and the second JVM (process) is on NodeB, which is the backup node. So, there are two processes on two different nodes locking the file at two positions.
I mean use more than one process on each node.
(In reply to comment #20)
> (In reply to comment #17)
> > Based on the explanation of what these locks are used for (detecting a node
> > failure), there's another possibly simple way of avoiding the deadlock: use
> > two different processes to lock the two files.
>
> There are two different processes. The first JVM (process) on NodeA is the
> active node, and the second JVM (process) is on NodeB, which is the backup
> node. So, there are two processes on two different nodes locking the file at
> two positions.

Okay, but that's actually possible.
(In reply to comment #22)
> (In reply to comment #20)
> > (In reply to comment #17)
> > > Based on the explanation of what these locks are used for (detecting a
> > > node failure), there's another possibly simple way of avoiding the
> > > deadlock: use two different processes to lock the two files.
> >
> > There are two different processes. The first JVM (process) on NodeA is the
> > active node, and the second JVM (process) is on NodeB, which is the backup
> > node. So, there are two processes on two different nodes locking the file
> > at two positions.
>
> Okay, but that's actually possible.

I meant NOT possible.
> we basically hold 2 locks at bytes 1 and 2, the live server obtains the lock
> at byte 1 and holds it, the backup lock holds the lock at byte 2 and then
> hold it, the backup then tries the lock at byte 1 and will block until the
> live server dies. Once the live server dies and the java process is killed
> the lock at byte 1 is removed and the backup server obtains it and releases
> its backup lock for other backup nodes

> 38.1 = byte 1 of inode 38 (2394838) = /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock
> 38.2 = byte 2 of same
> 46.1 = byte 1 of inode 46 (2394846) = /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock
> 46.2 = byte 2 of same

> node 22 holds WR on 38.2, 46.1
> node 22 waits WR on 38.1 (held by node 23)
> node 23 holds WR on 38.1, 46.2
> node 23 waits WR on 46.1 (held by node 22)

deadlock:

node22,pid1: hold lock live-nodeA (fileA,byte1)
node22,pid1: wait lock live-nodeB (fileB,byte1)
node22,pid1: hold lock backup-nodeB (fileB,byte2)

node23,pid1: hold lock live-nodeB (fileB,byte1)
node23,pid1: wait lock live-nodeA (fileA,byte1)
node23,pid1: hold lock backup-nodeA (fileA,byte2)

no deadlock:

node22,pid1: hold lock live-nodeA (fileA,byte1)
node22,pid2: wait lock live-nodeB (fileB,byte1)
node22,pid3: hold lock backup-nodeB (fileB,byte2)

node23,pid1: hold lock live-nodeB (fileB,byte1)
node23,pid2: wait lock live-nodeA (fileA,byte1)
node23,pid3: hold lock backup-nodeA (fileA,byte2)

What I was suggesting above is that you could fork/exec a process to acquire each of the server locks, to avoid the deadlock.

(I am not suggesting that we shouldn't try to handle this in gfs/dlm; I very much think we should. I am simply offering possible workarounds.)
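To make the suggestion concrete, here is a minimal C sketch of the "one process per lock" workaround (hypothetical paths and helper name, not HornetQ code): each server lock is taken by its own forked helper, so no single process ever holds one lock while blocking on another, and killing a helper releases exactly that one lock.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Fork a child that takes a 1-byte write lock at 'offset' in 'path' and
 * then just sits on it.  POSIX locks are per-process, so killing that one
 * child releases only its lock and leaves the others alone. */
static pid_t hold_lock_in_child(const char *path, off_t offset)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                    /* parent (or fork error) */

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = offset, .l_len = 1 };
    if (fd >= 0 && fcntl(fd, F_SETLKW, &fl) == 0)   /* may block, but only
                                                        this child blocks */
        printf("holding %s byte %ld\n", path, (long)offset);
    pause();                           /* hold the lock until killed */
    _exit(0);
}

int main(void)
{
    /* Hypothetical paths standing in for the two server.lock files. */
    pid_t live   = hold_lock_in_child("nodeA/server.lock", 1);
    pid_t backup = hold_lock_in_child("nodeB/server.lock", 2);
    printf("helper pids: live %d, backup %d\n", (int)live, (int)backup);
    pause();                           /* the main process never blocks on a plock */
    return 0;
}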
(In reply to comment #24)
> > we basically hold 2 locks at bytes 1 and 2, the live server obtains the lock
> > at byte 1 and holds it, the backup lock holds the lock at byte 2 and then
> > hold it, the backup then tries the lock at byte 1 and will block until the
> > live server dies. Once the live server dies and the java process is killed
> > the lock at byte 1 is removed and the backup server obtains it and releases
> > its backup lock for other backup nodes
>
> > 38.1 = byte 1 of inode 38 (2394838) = /mnt/jbm/common/therfert/nodeB/hornetq/journal/server.lock
> > 38.2 = byte 2 of same
> > 46.1 = byte 1 of inode 46 (2394846) = /mnt/jbm/common/therfert/nodeA/hornetq/journal/server.lock
> > 46.2 = byte 2 of same
>
> > node 22 holds WR on 38.2, 46.1
> > node 22 waits WR on 38.1 (held by node 23)
> > node 23 holds WR on 38.1, 46.2
> > node 23 waits WR on 46.1 (held by node 22)
>
> deadlock:
>
> node22,pid1: hold lock live-nodeA (fileA,byte1)
> node22,pid1: wait lock live-nodeB (fileB,byte1)
> node22,pid1: hold lock backup-nodeB (fileB,byte2)
>
> node23,pid1: hold lock live-nodeB (fileB,byte1)
> node23,pid1: wait lock live-nodeA (fileA,byte1)
> node23,pid1: hold lock backup-nodeA (fileA,byte2)
>
> no deadlock:
>
> node22,pid1: hold lock live-nodeA (fileA,byte1)
> node22,pid2: wait lock live-nodeB (fileB,byte1)
> node22,pid3: hold lock backup-nodeB (fileB,byte2)
>
> node23,pid1: hold lock live-nodeB (fileB,byte1)
> node23,pid2: wait lock live-nodeA (fileA,byte1)
> node23,pid3: hold lock backup-nodeA (fileA,byte2)
>
> What I was suggesting above is that you could fork/exec a process to acquire
> each of the server locks, to avoid the deadlock.
>
> (I am not suggesting that we shouldn't try to handle this in gfs/dlm; I very
> much think we should. I am simply offering possible workarounds.)

I appreciate the attempt at a workaround, but the process is a Java Virtual Machine. Doing a fork/exec from that is not a feasible workaround.
Created attachment 481967 [details]
kernel patch

Experimental kernel patch to allow a process blocked on a plock to be killed and cleaned up.
Created attachment 481968 [details]
dlm_controld patch

dlm_controld patch that goes along with the previous kernel patch.
Using the two patches in comments 29 and 30, I created a simple AB/BA deadlock with plocks, resolved it by killing one of the processes, and had all the lock state properly cleaned up. I expect this will resolve the specific problem in this bz.

The patches have not had any testing beyond that trivial proof-of-concept test. They also include a change to the user/kernel interface, but I don't believe it creates any incompatibilities with previous userspace or kernel versions.
The only question is what happens if we have a "new" kernel with "old" userspace. As far as I can see, all the other combinations would work correctly, and otherwise this looks like a good solution.
With a new kernel and old userspace, the kernel would complain ("dev_write no op ...") when userspace responded to an unlock caused by a close (as opposed to an unlock the process called). It would still work.
Ok, excellent. Sounds good.
I'm holding off on doing anything further with this patch until I hear whether it works.
Dave,

Can you please provide a test package with these patches applied, so that middleware QE can apply the new package to our test environment and execute our tests?

Thanks,
Mike
Scratch kernel build including the patch:
https://brewweb.devel.redhat.com/taskinfo?taskID=3177437

Here's a dlm_controld x86_64 binary with the patch; I'm hoping this will work, but I'm not certain:
http://people.redhat.com/~teigland/dlm_controld
http://download.devel.redhat.com/brewroot/scratch/adas/task_3328652/

Build with the patch in comment #29.
This issue requires a change to both dlm (in kernel) and dlm_controld (in cluster userspace). I had thought the original bug 678585 was for the dlm kernel component, but it was actually for cluster. I've cloned the original bz, so we now have two bugs:

bug 678585: for dlm_controld in userspace
bug 707005: for dlm in kernel
Posted: https://www.redhat.com/archives/cluster-devel/2011-May/msg00068.html

(Waiting to push to the cluster.git RHEL6 branch until the 6.2 branch is ready.)
To test, create a deadlock between two processes on separate nodes, then kill one of the processes:

node1: lock fileA
node2: lock fileB
node1: lock fileB
node2: lock fileA

Kill the process on node1, and node2 should get the lock on fileA.
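For reference, here is a minimal sketch of a lockhold-style test program that can drive this procedure (an assumption modelled on the output in the next comment, not the actual utility used): it takes a write lock on each file named on the command line, in order, waiting for Enter before each attempt. Running it as "lockhold A B" on one node and "lockhold B A" on the other, and pressing Enter in the order shown above, produces the AB/BA deadlock; after killing one copy, the other should acquire its second lock.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        printf("Press enter to lock %s\n", argv[i]);
        getchar();

        int fd = open(argv[i], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 1 };
        printf("Attempting to lock %s\n", argv[i]);
        if (fcntl(fd, F_SETLKW, &fl) == -1) {   /* blocks while another
                                                   process holds the lock */
            perror("fcntl");
            return 1;
        }
        printf("Lock Acquired\n");
    }
    pause();   /* hold all locks until the process is killed */
    return 0;
}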
Verified against cman-3.0.12.1-7.el6.x86_64

[root@buzz-01 shared]# /tmp/lockhold A B
Press enter to lock A
Attempting to lock A
Lock Acquired
Press enter to lock B
Attempting to lock B
^C

[root@buzz-02 shared]# /tmp/lockhold B A
Press enter to lock B
Attempting to lock B
Lock Acquired
Press enter to lock A
Attempting to lock A
Lock Acquired
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: gfs2 posix lock operations (implemented in dlm) are not interruptible while they wait for another posix lock. They were originally implemented this way for simplicity.
Consequence: processes that created a deadlock with posix locks, e.g. AB/BA, could not be killed to resolve the problem, and one node would need to be reset.
Fix: the dlm uses a new kernel feature that allows the waiting process to be killed, and information about the killed process is now passed to dlm_controld so it can clean up.
Result: processes deadlocked on gfs2 posix locks can now be recovered by killing one or more of them.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1516.html