Description of problem:

Two-node cluster consisting of node1 and node2. /usr/local is placed on a GFS
filesystem mounted on both nodes. The lock manager is dlm.

Loads of processes on both nodes are in the "D" state:

(...)
23055 D  ls /usr/local/sw   glock_wait_internal
26000 D  ls /usr/local/sw   glock_wait_internal
5583  D  ls /usr/local/sw   glock_wait_internal
21220 D  ls /usr/local/sw   glock_wait_internal
24169 D  ls /usr/local/sw   glock_wait_internal
4566  D  ls -l /usr/local   glock_wait_internal
25800 D  ls --color=tty -   glock_wait_internal
29006 D  ls -l /usr/local   glock_wait_internal
5856  Dl java EDXPut -pfi   glock_wait_internal
(...)

An strace of "ls -l /usr/local/swadmin/mnx/xml" ends up in
lstat("/usr/local/swadmin/mnx/xml",

This happens on both cluster nodes. All processes trying to access the
directory /usr/local/swadmin/mnx/xml are in "waiting for IO (D)" state. It
seems that the process with pid=5856 running on node1 is holding the locks.

A lockdump analysis with decipher_lockstate_dump and parse_lockdump shows the
following relevant output:

4 chain:

lockdump.node1.dec:
Glock (inode[2], 1114343)
  gl_flags = lock[1]
  gl_count = 5
  gl_state = shared[3]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 1
  ail_bufs = no
  Request
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

lockdump.node2.dec:
Glock (inode[2], 1114343)
  gl_flags =
  gl_count = 2
  gl_state = unlocked[0]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Inode:
    num = 1114343/1114343
    type = regular[1]
    i_count = 1
    i_flags =
    vnode = yes

lockdump.node1.dec:
Glock (inode[2], 627732)
  gl_flags = dirty[5]
  gl_count = 379
  gl_state = exclusive[1]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 58
  ail_bufs = no
  Holder
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter2
    owner = none[-1]
    gh_state = shared[3]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 32753
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  [...loads of Waiter3 entries...]
  Waiter3
    owner = 4566
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

lockdump.node2.dec:
Glock (inode[2], 627732)
  gl_flags = lock[1]
  gl_count = 375
  gl_state = unlocked[0]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = 20187
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 20187
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  [...loads of Waiter3 entries...]
  Waiter3
    owner = 10460
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

2 requests

The dlm lock dump shows:

node1:
Resource 0000010001218088 (parent 0000000000000000). Name (len=24) " 2 1100e7"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
5eb00178 PR (EX) Master: 3eeb0117 LQ: 0,0x5

node2:
Resource 00000107e462c8c8 (parent 0000000000000000). Name (len=24) " 2 1100e7"
Master Copy
Granted Queue
3eeb0117 PR Remote: 1 5eb00178
Conversion Queue
Waiting Queue

Version-Release number of selected component (if applicable):
RHEL4u4 + GFS + CS

The kernel backtrace of pid 5856 on node1 is:

crash> bt 5856
PID: 5856  TASK: 10bd26427f0  CPU: 0  COMMAND: "java"
 #0 [10bd20cfbc8] schedule at ffffffff8030a1d1
 #1 [10bd20cfca0] wait_for_completion at ffffffff8030a415
 #2 [10bd20cfd20] glock_wait_internal at ffffffffa018574e
 #3 [10bd20cfd60] gfs_glock_nq_m at ffffffffa01860ce
 #4 [10bd20cfda0] gfs_unlink at ffffffffa019ce41
 #5 [10bd20cfea0] vfs_unlink at ffffffff801889fa
 #6 [10bd20cfed0] sys_unlink at ffffffff80188b19
 #7 [10bd20cff30] filp_close at ffffffff80178e48
 #8 [10bd20cff50] error_exit at ffffffff80110d91
    RIP: 0000002a9593f649  RSP: 0000007fbfffbca0  RFLAGS: 00010206
    RAX: 0000000000000057  RBX: ffffffff8011026a  RCX: 0000002a9cc9c870
    RDX: 0000002ae5989000  RSI: 0000002a962fa3a8  RDI: 0000002ae5989000
    RBP: 0000000000000000  R8:  0000002a9630abb0  R9:  0000000000000ffc
    R10: 0000002a9630abc0  R11: 0000000000000206  R12: 0000000040115700
    R13: 0000002ae23294b0  R14: 0000007fbfffc300  R15: 0000002ae5989000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b

How reproducible:
I have no idea.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
There is also a discussion on the public mailing list. Subject is
"gfs deadlock situation":
https://www.redhat.com/archives/linux-cluster/2007-February/msg00110.html

> node1:
> Resource 0000010001218088 (parent 0000000000000000). Name (len=24) " 2
> 1100e7"
> Local Copy, Master is node 2
> Granted Queue
> Conversion Queue
> Waiting Queue
> 5eb00178 PR (EX) Master: 3eeb0117 LQ: 0,0x5
>
> node2:
> Resource 00000107e462c8c8 (parent 0000000000000000). Name (len=24) " 2
> 1100e7"
> Master Copy
> Granted Queue
> 3eeb0117 PR Remote: 1 5eb00178
> Conversion Queue
> Waiting Queue

The state of the lock on node1 looks bad. I'm studying the code and struggling
to understand how it could possibly arrive in that state. Some things to
notice:

- The lock is converting, so it should be on the Conversion Queue, not the
  Waiting Queue.
- lockqueue_state is 0, so either node1 has not sent a remote request to node2
  at all, or node1 did send something and has already received some kind of
  reply, so it's no longer waiting for a reply.
- The state of the lock on node2 looks normal.

Did you check for suspicious syslog messages on both nodes? Did any nodes on
this fs mount, unmount or fail around the time this happened? Has this
happened before?

If you'd like to try to reproduce this with some dlm debugging, I could send
you a patch (although this is such an odd state that I'm not sure yet where
I'd begin to add debugging).
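(Aside, for anyone gathering the same data: dumps like the ones above come
from the standard GFS and DLM debug interfaces. The following is a sketch from
memory; the exact script invocations and /proc paths can vary between
releases, and the lockspace name is illustrative, taken from the dlm_debug
output later in this bug.)

    # GFS glock dump for the filesystem mounted on /usr/local, decoded
    # with the scripts shipped with the gfs debugging tools
    gfs_tool lockdump /usr/local > lockdump.node1
    decipher_lockstate_dump lockdump.node1 > lockdump.node1.dec

    # combine the decoded dumps from both nodes and look for lock chains
    parse_lockdump lockdump.node1.dec lockdump.node2.dec

    # dlm lock dump for one lockspace (RHEL4 /proc/cluster interface)
    echo "lt_sharedroot" > /proc/cluster/dlm_locks
    cat /proc/cluster/dlm_locks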
As mentioned above, the problem is a dlm lock that's in a nonsensical state.
There are no clues as to how it might have gotten into that state. Adding some
new sanity checking to the dlm might catch the problem sooner (if it's
reproducible) and give us some hints about where to look. I have no ideas yet
about where to add the new sanity checks.
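To illustrate the kind of sanity check being discussed, here is a minimal
sketch. The struct, field names, and constants are hypothetical stand-ins, not
the actual RHEL4 dlm-kernel code and not the patch attached later in this bug;
it only shows the shape of a check that would flag a converting lock sitting
on the wrong queue, which is exactly the inconsistency in the dumps above.

    /* Hypothetical sketch of a dlm queue-consistency check.  The types,
     * field names and constants are invented for illustration only. */

    #include <linux/kernel.h>

    enum lkb_queue { LKB_GRANTED, LKB_CONVERT, LKB_WAITING };

    struct lkb {
            unsigned int lkb_id;
            int lkb_grmode;          /* granted mode, -1 if none        */
            int lkb_rqmode;          /* requested mode, -1 if none      */
            int lkb_lockqueue_state; /* 0 = not waiting on remote reply */
            enum lkb_queue lkb_queue;
    };

    /* A lock that holds one mode and requests another is converting and
     * must live on the conversion queue.  The dumps show "PR (EX)" on
     * the Waiting Queue with lockqueue_state 0, which a check like this
     * would report as soon as it happened. */
    static int lkb_queue_sane(const struct lkb *lkb)
    {
            if (lkb->lkb_grmode >= 0 && lkb->lkb_rqmode >= 0 &&
                lkb->lkb_queue != LKB_CONVERT) {
                    printk(KERN_ERR
                           "dlm: lkb %x converting on queue %d lq %d\n",
                           lkb->lkb_id, lkb->lkb_queue,
                           lkb->lkb_lockqueue_state);
                    return 0;
            }
            return 1;
    }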
There are no suspicious syslog messages on either node. This just happens
sometimes, but I was not able to reproduce the problem. It seems that mostly
GFS directory dinodes are affected. I also observed the problem with PHP
session-file directories.
Hi, I just saw that we have 163 processes in D state. I attached a sysreport
and some GFS logs.

root@zhlr421a:~# ps aux | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18860 0.0 0.0 99624 4168 ? Ss Apr10 0:14 /opt/hp/hpsmh/sbin/hpsmhd -DSSL -f /opt/hp/hpsmh/conf/smhpd.conf
hpsmh 18883 0.0 0.0 383088 5356 ? Sl Apr10 0:00 /opt/hp/hpsmh/sbin/hpsmhd -DSSL -f /opt/hp/hpsmh/conf/smhpd.conf
swadmin 24128 0.1 0.3 1476300 417424 ? Dl 05:35 0:36 /usr/java/jdk1.5.0_07/bin/java EDXPut -pfile ../../mnx.pfile
swadmin 17871 0.0 0.0 57432 680 ? D 10:10 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 17878 0.0 0.0 5740 508 ? D 10:10 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_17824.xml /usr/local/swadmin/mnx/xml
swadmin 26678 0.0 0.0 57432 680 ? D 10:15 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 26708 0.0 0.0 5740 508 ? D 10:15 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_26587.xml /usr/local/swadmin/mnx/xml
swadmin 28658 0.0 0.0 57432 680 ? D 10:16 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 2972 0.0 0.0 57432 680 ? D 10:20 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 2976 0.0 0.0 5740 508 ? D 10:20 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_2924.xml /usr/local/swadmin/mnx/xml
swadmin 12513 0.0 0.0 57432 680 ? D 10:25 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 12548 0.0 0.0 5740 508 ? D 10:25 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_12424.xml /usr/local/swadmin/mnx/xml
swadmin 13745 0.0 0.0 57432 680 ? D 10:26 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 21534 0.0 0.0 57432 680 ? D 10:30 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 21538 0.0 0.0 5740 508 ? D 10:30 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_21485.xml /usr/local/swadmin/mnx/xml
swadmin 30268 0.0 0.0 57432 680 ? D 10:35 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 30303 0.0 0.0 5740 508 ? D 10:35 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_30178.xml /usr/local/swadmin/mnx/xml
swadmin 32252 0.0 0.0 57432 680 ? D 10:36 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 6608 0.0 0.0 57432 680 ? D 10:40 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 6612 0.0 0.0 5740 508 ? D 10:40 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_6559.xml /usr/local/swadmin/mnx/xml
swadmin 16335 0.0 0.0 57432 680 ? D 10:45 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 16369 0.0 0.0 5740 508 ? D 10:45 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_16241.xml /usr/local/swadmin/mnx/xml
swadmin 17488 0.0 0.0 57432 680 ? D 10:46 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 25239 0.0 0.0 57432 680 ? D 10:50 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 25243 0.0 0.0 5740 508 ? D 10:50 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_25190.xml /usr/local/swadmin/mnx/xml
swadmin 1507 0.0 0.0 57432 680 ? D 10:55 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 1542 0.0 0.0 5740 508 ? D 10:55 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_1414.xml /usr/local/swadmin/mnx/xml
swadmin 3510 0.0 0.0 57432 680 ? D 10:56 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 10311 0.0 0.0 57432 680 ? D 11:00 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 10315 0.0 0.0 5740 508 ? D 11:00 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_10262.xml /usr/local/swadmin/mnx/xml
swadmin 20144 0.0 0.0 57432 680 ? D 11:05 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 20174 0.0 0.0 5740 508 ? D 11:05 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_20058.xml /usr/local/swadmin/mnx/xml
swadmin 21288 0.0 0.0 57432 680 ? D 11:06 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 28948 0.0 0.0 57432 680 ? D 11:10 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 28952 0.0 0.0 5740 508 ? D 11:10 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_28899.xml /usr/local/swadmin/mnx/xml
swadmin 5245 0.0 0.0 57432 680 ? D 11:15 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 5280 0.0 0.0 5740 508 ? D 11:15 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_5169.xml /usr/local/swadmin/mnx/xml
swadmin 7252 0.0 0.0 57432 680 ? D 11:16 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 14094 0.0 0.0 57432 680 ? D 11:20 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 14110 0.0 0.0 5740 508 ? D 11:20 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_14049.xml /usr/local/swadmin/mnx/xml
swadmin 23853 0.0 0.0 57432 680 ? D 11:25 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 23887 0.0 0.0 5740 508 ? D 11:25 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_23772.xml /usr/local/swadmin/mnx/xml
swadmin 25033 0.0 0.0 57432 680 ? D 11:26 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 32685 0.0 0.0 57432 680 ? D 11:30 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 32689 0.0 0.0 5740 508 ? D 11:30 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_32638.xml /usr/local/swadmin/mnx/xml
swadmin 8981 0.0 0.0 57432 680 ? D 11:35 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 9016 0.0 0.0 5740 508 ? D 11:35 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_8895.xml /usr/local/swadmin/mnx/xml
swadmin 10973 0.0 0.0 57432 680 ? D 11:36 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 17950 0.0 0.0 57432 680 ? D 11:40 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 17954 0.0 0.0 5740 508 ? D 11:40 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_17903.xml /usr/local/swadmin/mnx/xml
swadmin 27647 0.0 0.0 5740 508 ? D 11:45 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_27554.xml /usr/local/swadmin/mnx/xml
swadmin 27648 0.0 0.0 57432 680 ? D 11:45 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 29511 0.0 0.0 57432 680 ? D 11:46 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 4015 0.0 0.0 57432 680 ? D 11:50 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 4019 0.0 0.0 5740 508 ? D 11:50 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_3966.xml /usr/local/swadmin/mnx/xml
swadmin 12745 0.0 0.0 57432 680 ? D 11:55 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 12780 0.0 0.0 5740 508 ? D 11:55 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_12659.xml /usr/local/swadmin/mnx/xml
swadmin 14914 0.0 0.0 57432 680 ? D 11:56 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 21812 0.0 0.0 57432 680 ? D 12:00 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 21816 0.0 0.0 5740 508 ? D 12:00 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_21763.xml /usr/local/swadmin/mnx/xml
swadmin 31395 0.0 0.0 57432 680 ? D 12:05 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 31429 0.0 0.0 5740 508 ? D 12:05 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_31310.xml /usr/local/swadmin/mnx/xml
swadmin 877 0.0 0.0 57432 680 ? D 12:06 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 7761 0.0 0.0 57432 680 ? D 12:10 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 7764 0.0 0.0 5740 508 ? D 12:10 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_7714.xml /usr/local/swadmin/mnx/xml
swadmin 16646 0.0 0.0 57432 680 ? D 12:15 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 16682 0.0 0.0 5740 508 ? D 12:15 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_16550.xml /usr/local/swadmin/mnx/xml
swadmin 18678 0.0 0.0 57432 680 ? D 12:16 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 25644 0.0 0.0 57432 680 ? D 12:20 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 25648 0.0 0.0 5740 508 ? D 12:20 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_25597.xml /usr/local/swadmin/mnx/xml
swadmin 2773 0.0 0.0 57432 680 ? D 12:25 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 2809 0.0 0.0 5740 508 ? D 12:25 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_2684.xml /usr/local/swadmin/mnx/xml
swadmin 4737 0.0 0.0 57432 680 ? D 12:26 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 11590 0.0 0.0 57432 680 ? D 12:30 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 11594 0.0 0.0 5740 508 ? D 12:30 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_11542.xml /usr/local/swadmin/mnx/xml
swadmin 20617 0.0 0.0 57432 680 ? D 12:35 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 20690 0.0 0.0 5740 508 ? D 12:35 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_20527.xml /usr/local/swadmin/mnx/xml
swadmin 22627 0.0 0.0 57432 680 ? D 12:36 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 29544 0.0 0.0 57432 680 ? D 12:40 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 29548 0.0 0.0 5740 508 ? D 12:40 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_29494.xml /usr/local/swadmin/mnx/xml
swadmin 6688 0.0 0.0 57432 680 ? D 12:45 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 6725 0.0 0.0 5740 508 ? D 12:45 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_6592.xml /usr/local/swadmin/mnx/xml
swadmin 8594 0.0 0.0 57432 680 ? D 12:46 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 15683 0.0 0.0 57432 680 ? D 12:50 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 15687 0.0 0.0 5740 508 ? D 12:50 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_15636.xml /usr/local/swadmin/mnx/xml
swadmin 24548 0.0 0.0 57432 680 ? D 12:55 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 24583 0.0 0.0 5740 508 ? D 12:55 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_24459.xml /usr/local/swadmin/mnx/xml
swadmin 26565 0.0 0.0 57432 680 ? D 12:56 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 935 0.0 0.0 57432 680 ? D 13:00 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 939 0.0 0.0 5740 508 ? D 13:00 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_885.xml /usr/local/swadmin/mnx/xml
swadmin 10584 0.0 0.0 57432 680 ? D 13:05 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 10618 0.0 0.0 5740 508 ? D 13:05 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_10495.xml /usr/local/swadmin/mnx/xml
swadmin 12581 0.0 0.0 57432 680 ? D 13:06 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 19624 0.0 0.0 57432 680 ? D 13:10 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 19628 0.0 0.0 5740 508 ? D 13:10 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_19576.xml /usr/local/swadmin/mnx/xml
swadmin 28491 0.0 0.0 57432 680 ? D 13:15 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 28529 0.0 0.0 5740 508 ? D 13:15 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_28402.xml /usr/local/swadmin/mnx/xml
swadmin 30511 0.0 0.0 57432 680 ? D 13:16 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 4883 0.0 0.0 57432 680 ? D 13:20 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 4887 0.0 0.0 5740 508 ? D 13:20 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_4836.xml /usr/local/swadmin/mnx/xml
swadmin 14659 0.0 0.0 57432 680 ? D 13:25 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 14710 0.0 0.0 5740 508 ? D 13:25 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_14571.xml /usr/local/swadmin/mnx/xml
swadmin 16596 0.0 0.0 57432 680 ? D 13:26 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 23596 0.0 0.0 57432 680 ? D 13:30 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 23599 0.0 0.0 5740 508 ? D 13:30 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_23547.xml /usr/local/swadmin/mnx/xml
swadmin 32400 0.0 0.0 57432 680 ? D 13:35 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 32435 0.0 0.0 5740 508 ? D 13:35 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_32314.xml /usr/local/swadmin/mnx/xml
swadmin 1951 0.0 0.0 57432 680 ? D 13:36 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 8793 0.0 0.0 57432 680 ? D 13:40 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 8797 0.0 0.0 5740 508 ? D 13:40 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_8744.xml /usr/local/swadmin/mnx/xml
swadmin 18650 0.0 0.0 57432 680 ? D 13:45 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 18689 0.0 0.0 5740 508 ? D 13:45 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_18556.xml /usr/local/swadmin/mnx/xml
swadmin 19875 0.0 0.0 57432 680 ? D 13:46 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 27598 0.0 0.0 57432 680 ? D 13:50 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 27602 0.0 0.0 5740 508 ? D 13:50 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_27546.xml /usr/local/swadmin/mnx/xml
swadmin 3931 0.0 0.0 5740 508 ? D 13:55 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_3840.xml /usr/local/swadmin/mnx/xml
swadmin 3932 0.0 0.0 57432 680 ? D 13:55 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 5942 0.0 0.0 57432 680 ? D 13:56 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 12828 0.0 0.0 57432 680 ? D 14:00 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 12832 0.0 0.0 5740 508 ? D 14:00 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_12778.xml /usr/local/swadmin/mnx/xml
swadmin 22051 0.0 0.0 57432 680 ? D 14:05 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 22576 0.0 0.0 5740 508 ? D 14:05 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_21971.xml /usr/local/swadmin/mnx/xml
swadmin 24061 0.0 0.0 57432 680 ? D 14:06 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 31889 0.0 0.0 57432 680 ? D 14:10 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 31893 0.0 0.0 5740 508 ? D 14:10 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_31842.xml /usr/local/swadmin/mnx/xml
swadmin 8378 0.0 0.0 57432 680 ? D 14:15 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 8412 0.0 0.0 5740 508 ? D 14:15 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_8283.xml /usr/local/swadmin/mnx/xml
swadmin 10393 0.0 0.0 57432 680 ? D 14:16 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 17534 0.0 0.0 57432 680 ? D 14:20 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 17537 0.0 0.0 5740 508 ? D 14:20 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_17486.xml /usr/local/swadmin/mnx/xml
root 21391 8.0 0.0 80764 12020 pts/3 D+ 14:21 0:20 /usr/lib/rpm/rpmv -V --all
swadmin 29982 0.0 0.0 57432 680 ? D 14:25 0:00 ls /usr/local/swadmin/mnx/xml
swadmin 30015 0.0 0.0 5740 508 ? D 14:25 0:00 mv /usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_29894.xml /usr/local/swadmin/mnx/xml
swadmin 32556 0.0 0.0 57432 680 ? D 14:26 0:00 ls /usr/local/swadmin/mnx/xml
root 1078 0.0 0.0 53136 708 pts/8 S+ 14:26 0:00 grep D

root@zhlr421a:~# ps aux | grep D | wc -l
136

Do you need any other information?

Thanks, mike
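One note on the counting: "ps aux | grep D" also matches the grep process
itself and any line that merely contains a capital D elsewhere (the hpsmhd
-DSSL entries above, for example), so totals collected this way are slightly
inflated. A tighter count, using only standard procps options:

    # list only processes whose state actually starts with D
    # (uninterruptible sleep)
    ps -eo state,pid,user,args | awk '$1 ~ /^D/'

    # just the count
    ps -eo state= | grep -c '^D'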
Created attachment 157546 [details]
dlm logs etc
Created attachment 157548 [details]
dlm logs b-member
This looks like the same thing as last time:

b:
Resource 00000103ddff26e8 (parent 0000000000000000). Name (len=24) " 2 71fc6b"
Master Copy
Granted Queue
44f00042 PR Remote: 1 c9580352
Conversion Queue
Waiting Queue

a:
Resource 000001014d140948 (parent 0000000000000000). Name (len=24) " 2 71fc6b"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
c9580352 PR (EX) Master: 44f00042 LQ: 0,0x5

a:dlm_debug
lt_sharedroot grant lock on lockqueue 2
lt_sharedroot process_lockqueue_reply id c4740348 state 0
lt_sharedroot grant lock on lockqueue 2
lt_sharedroot process_lockqueue_reply id cb0f0057 state 0
lt_sharedroot grant lock on lockqueue 2
lt_sharedroot process_lockqueue_reply id c7f20004 state 0
lt_sharedroot grant lock on lockqueue 3
lt_sharedroot process_lockqueue_reply id c4de02c3 state 0
lt_sharedroot grant lock on lockqueue 3
lt_sharedroot process_lockqueue_reply id c9580352 state 0   <---
lt_sharedroot grant lock on lockqueue 3
lt_sharedroot process_lockqueue_reply id c7b10203 state 0
lt_sharedroot send einval to 2

I've been working through a number of different theories about what could have
happened, but none of them are holding up. I'll try again to come up with a
debugging patch that might narrow in on what's happening.
Created attachment 157569 [details]
debugging patch

Could you add this patch to the system? It adds some extra info to some
existing debug statements, switches one log_debug to log_error, and adds a
couple of new sanity checks.
Hi Dave, that's maybe possible; can you send me a howto? Should I expect any
negative impact from this patch? This is not a test system.

Thanks, mike
Do we need a reboot to enable this patch?

Thanks, mike
./configure --kernel_src=/lib/modules/`uname -r`/build/
cd cluster/dlm-kernel/src; make

You should save off the current dlm.ko somewhere, then make install.

The patch will probably produce a little extra debugging info in dmesg and
/var/log/messages. The first and third chunks of the patch are new sanity
checks that shouldn't appear at all. If the dlm debugging messages are too
"noisy" with the patch, then please send me the output and go back to the
original dlm.ko.

Given that it's a shared-root setup you may need to reboot, and my
instructions at the top may not work, since you may need to include the new
dlm in the initrd. You probably need some advice from the atix experts on how
to go about this.
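Putting the steps above together as one sequence. This is a sketch only: the
source-tree location and the patch file name are assumptions, and as noted, on
a shared-root cluster the new module also has to end up in the initrd.

    cd /usr/src/cluster-source        # assumption: unpacked cluster source
    patch -p1 < dlm-debug.patch       # hypothetical name for the attachment
    ./configure --kernel_src=/lib/modules/`uname -r`/build/
    cd cluster/dlm-kernel/src
    make

    # save the running module first so it is easy to roll back
    cp "$(modinfo -n dlm)" /root/dlm.ko.orig
    make install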
Hi Dave, thanks for your howto, but as you said I need help here, because all
of our clusters are shared root. I will ask Marc to make us a proper initrd.

Thanks, mike
*** Bug 368761 has been marked as a duplicate of this bug. ***
This is probably a duplicate of bug 349001, which was recently fixed.
*** This bug has been marked as a duplicate of bug 349001 ***