Bug 228916 - corrupt dlm lock
corrupt dlm lock
Status: CLOSED DUPLICATE of bug 349001
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm-kernel (Show other bugs)
4
All Linux
medium Severity medium
: ---
: ---
Assigned To: David Teigland
GFS Bugs
:
: 368761 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2007-02-15 15:52 EST by Mark Hlawatschek
Modified: 2009-04-16 16:31 EDT (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 15:37:23 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
dlm logs etc (2.06 MB, application/octet-stream)
2007-06-21 10:55 EDT, Michael Hagmann
no flags Details
dlm logs b-member (592.28 KB, application/octet-stream)
2007-06-21 11:09 EDT, Michael Hagmann
no flags Details
debugging patch (1.61 KB, text/plain)
2007-06-21 17:01 EDT, David Teigland
no flags Details

  None (edit)
Description Mark Hlawatschek 2007-02-15 15:52:30 EST
Description of problem:

2 node cluster consisting of node1 and node2. 
/usr/local is placed on a GFS filesystem mounted on both nodes. 
Lockmanager is dlm.

Loads of processes on both nodes are in the "D" state
(...)
23055 D    ls /usr/local/sw glock_wait_internal
26000 D    ls /usr/local/sw glock_wait_internal
 5583 D    ls /usr/local/sw glock_wait_internal
21220 D    ls /usr/local/sw glock_wait_internal
24169 D    ls /usr/local/sw glock_wait_internal
 4566 D    ls -l /usr/local glock_wait_internal
25800 D    ls --color=tty - glock_wait_internal
29006 D    ls -l /usr/local glock_wait_internal
 5856 Dl   java EDXPut -pfi glock_wait_internal
(...)

a strace to ls -l /usr/local/swadmin/mnx/xml ends up in
lstat("/usr/local/swadmin/mnx/xml",                                                                

This happens on both cluster nodes.

All processes trying to access the directory /usr/local/swadmin/mnx/xml are 
in "Waiting for IO (D)" state.

It seems that the process with pid=5856 running on node1 is holding the locks.

a lockdump analysis with the decipher_lockstate_dump and parse_lockdump shows 
the following relevant output: 

4 chain:
lockdump.node1.dec Glock (inode[2], 1114343)
  gl_flags = lock[1]
  gl_count = 5
  gl_state = shared[3]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 1
  ail_bufs = no
  Request
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
lockdump.node2.dec Glock (inode[2], 1114343)
  gl_flags =
  gl_count = 2
  gl_state = unlocked[0]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Inode:
    num = 1114343/1114343
    type = regular[1]
    i_count = 1
    i_flags =
    vnode = yes
lockdump.node1.dec Glock (inode[2], 627732)
  gl_flags = dirty[5]
  gl_count = 379
  gl_state = exclusive[1]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 58
  ail_bufs = no
  Holder
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter2
    owner = none[-1]
    gh_state = shared[3]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 32753
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  [...loads of Waiter3 entries...]
  Waiter3
    owner = 4566
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
lockdump.node2.dec Glock (inode[2], 627732)
  gl_flags = lock[1]
  gl_count = 375
  gl_state = unlocked[0]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = 20187
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 20187
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  [...loads of Waiter3 entries...]
  Waiter3
    owner = 10460
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
2 requests

The dlm lock dump shows:
node1:
Resource 0000010001218088 (parent 0000000000000000). Name (len=24) "       2          
1100e7"
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
5eb00178 PR (EX) Master:     3eeb0117  LQ: 0,0x5

node2:
Resource 00000107e462c8c8 (parent 0000000000000000). Name (len=24) "       2          
1100e7"
Master Copy
Granted Queue
3eeb0117 PR Remote:   1 5eb00178
Conversion Queue
Waiting Queue
Version-Release number of selected component (if applicable):
RHEL4u4 + GFS + CS

The kernel backtrace of pid 5856 on node1 is:

crash> bt 5856
 PID: 5856   TASK: 10bd26427f0       CPU: 0   COMMAND: "java"
  #0 [10bd20cfbc8] schedule at ffffffff8030a1d1
  #1 [10bd20cfca0] wait_for_completion at ffffffff8030a415
  #2 [10bd20cfd20] glock_wait_internal at ffffffffa018574e
  #3 [10bd20cfd60] gfs_glock_nq_m at ffffffffa01860ce
  #4 [10bd20cfda0] gfs_unlink at ffffffffa019ce41
  #5 [10bd20cfea0] vfs_unlink at ffffffff801889fa
  #6 [10bd20cfed0] sys_unlink at ffffffff80188b19
  #7 [10bd20cff30] filp_close at ffffffff80178e48
  #8 [10bd20cff50] error_exit at ffffffff80110d91
     RIP: 0000002a9593f649  RSP: 0000007fbfffbca0  RFLAGS: 00010206
     RAX: 0000000000000057  RBX: ffffffff8011026a  RCX: 0000002a9cc9c870
     RDX: 0000002ae5989000  RSI: 0000002a962fa3a8  RDI: 0000002ae5989000
     RBP: 0000000000000000   R8: 0000002a9630abb0   R9: 0000000000000ffc
     R10: 0000002a9630abc0  R11: 0000000000000206  R12: 0000000040115700
     R13: 0000002ae23294b0  R14: 0000007fbfffc300  R15: 0000002ae5989000
     ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b

How reproducible:
I have no idea.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
There is also a discussion on the public mailing list. Subject is "gfs 
deadlock situation"
Comment 1 David Teigland 2007-02-20 14:28:15 EST
https://www.redhat.com/archives/linux-cluster/2007-February/msg00110.html

> node1:
> Resource 0000010001218088 (parent 0000000000000000). Name (len=24) "       2
> 1100e7"
> Local Copy, Master is node 2
> Granted Queue
> Conversion Queue
> Waiting Queue
> 5eb00178 PR (EX) Master:     3eeb0117  LQ: 0,0x5

> node2:
> Resource 00000107e462c8c8 (parent 0000000000000000). Name (len=24) "       2
> 1100e7"
> Master Copy
> Granted Queue
> 3eeb0117 PR Remote:   1 5eb00178
> Conversion Queue
> Waiting Queue

The state of the lock on node1 looks bad.  I'm studying the code and
struggling to understand how it could possibly arrive in that state.

Some things to notice:
- the lock is converting, it should be on the Conversion Queue, not the
  Waiting Queue
- lockqueue_state is 0, so either node1 has not sent a remote request to
  node2 at all, or node1 did send something and already received some kind
  of reply so it's not waiting for a reply any longer
- the state of the lock on node2 looks normal

Did you check for suspicious syslog messages on both nodes?  Did any nodes
on this fs mount, unmount or fail around the time this happened?  Has this
happened before?  If you'd like to try to reproduce this with some dlm
debugging I could send you a patch (although this is such an odd state I'm
not sure yet where I'd begin to add debugging.)

Comment 2 David Teigland 2007-02-20 14:36:15 EST
As mentioned above, the problem is a dlm lock that's in a nonsensical
state.  There are no clues how it might have gotten into that state.
Adding some new sanity checking to the dlm might catch the problem
sooner (if it's reproducable) and give us some hints about where to look.
No ideas yet where to add new sanity checks.
Comment 3 Mark Hlawatschek 2007-02-21 09:43:30 EST
There are no suspicious syslog messages on both nodes. This just happens 
sometimes but I was not able to reproduce the problem.
It seems, that mostly gfs directory dinodes are affected. I also observed the 
problem with php session-file directories.
Comment 4 Michael Hagmann 2007-06-21 10:50:22 EDT
Hi I just see that we have 163 processes in D - State, I attached a sysreport
and some GFS - logs.

root@zhlr421a:~# ps aux | grep D
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root     18860  0.0  0.0 99624 4168 ?        Ss   Apr10   0:14
/opt/hp/hpsmh/sbin/hpsmhd -DSSL -f /opt/hp/hpsmh/conf/smhpd.conf
hpsmh    18883  0.0  0.0 383088 5356 ?       Sl   Apr10   0:00
/opt/hp/hpsmh/sbin/hpsmhd -DSSL -f /opt/hp/hpsmh/conf/smhpd.conf
swadmin  24128  0.1  0.3 1476300 417424 ?    Dl   05:35   0:36
/usr/java/jdk1.5.0_07/bin/java EDXPut -pfile ../../mnx.pfile
swadmin  17871  0.0  0.0 57432  680 ?        D    10:10   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  17878  0.0  0.0  5740  508 ?        D    10:10   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_17824.xml
/usr/local/swadmin/mnx/xml
swadmin  26678  0.0  0.0 57432  680 ?        D    10:15   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  26708  0.0  0.0  5740  508 ?        D    10:15   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_26587.xml
/usr/local/swadmin/mnx/xml
swadmin  28658  0.0  0.0 57432  680 ?        D    10:16   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   2972  0.0  0.0 57432  680 ?        D    10:20   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   2976  0.0  0.0  5740  508 ?        D    10:20   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_2924.xml
/usr/local/swadmin/mnx/xml
swadmin  12513  0.0  0.0 57432  680 ?        D    10:25   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  12548  0.0  0.0  5740  508 ?        D    10:25   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_12424.xml
/usr/local/swadmin/mnx/xml
swadmin  13745  0.0  0.0 57432  680 ?        D    10:26   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  21534  0.0  0.0 57432  680 ?        D    10:30   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  21538  0.0  0.0  5740  508 ?        D    10:30   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_21485.xml
/usr/local/swadmin/mnx/xml
swadmin  30268  0.0  0.0 57432  680 ?        D    10:35   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  30303  0.0  0.0  5740  508 ?        D    10:35   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_30178.xml
/usr/local/swadmin/mnx/xml
swadmin  32252  0.0  0.0 57432  680 ?        D    10:36   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   6608  0.0  0.0 57432  680 ?        D    10:40   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   6612  0.0  0.0  5740  508 ?        D    10:40   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_6559.xml
/usr/local/swadmin/mnx/xml
swadmin  16335  0.0  0.0 57432  680 ?        D    10:45   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  16369  0.0  0.0  5740  508 ?        D    10:45   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_16241.xml
/usr/local/swadmin/mnx/xml
swadmin  17488  0.0  0.0 57432  680 ?        D    10:46   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  25239  0.0  0.0 57432  680 ?        D    10:50   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  25243  0.0  0.0  5740  508 ?        D    10:50   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_25190.xml
/usr/local/swadmin/mnx/xml
swadmin   1507  0.0  0.0 57432  680 ?        D    10:55   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   1542  0.0  0.0  5740  508 ?        D    10:55   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_1414.xml
/usr/local/swadmin/mnx/xml
swadmin   3510  0.0  0.0 57432  680 ?        D    10:56   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  10311  0.0  0.0 57432  680 ?        D    11:00   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  10315  0.0  0.0  5740  508 ?        D    11:00   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_10262.xml
/usr/local/swadmin/mnx/xml
swadmin  20144  0.0  0.0 57432  680 ?        D    11:05   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  20174  0.0  0.0  5740  508 ?        D    11:05   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_20058.xml
/usr/local/swadmin/mnx/xml
swadmin  21288  0.0  0.0 57432  680 ?        D    11:06   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  28948  0.0  0.0 57432  680 ?        D    11:10   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  28952  0.0  0.0  5740  508 ?        D    11:10   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_28899.xml
/usr/local/swadmin/mnx/xml
swadmin   5245  0.0  0.0 57432  680 ?        D    11:15   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   5280  0.0  0.0  5740  508 ?        D    11:15   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_5169.xml
/usr/local/swadmin/mnx/xml
swadmin   7252  0.0  0.0 57432  680 ?        D    11:16   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  14094  0.0  0.0 57432  680 ?        D    11:20   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  14110  0.0  0.0  5740  508 ?        D    11:20   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_14049.xml
/usr/local/swadmin/mnx/xml
swadmin  23853  0.0  0.0 57432  680 ?        D    11:25   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  23887  0.0  0.0  5740  508 ?        D    11:25   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_23772.xml
/usr/local/swadmin/mnx/xml
swadmin  25033  0.0  0.0 57432  680 ?        D    11:26   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  32685  0.0  0.0 57432  680 ?        D    11:30   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  32689  0.0  0.0  5740  508 ?        D    11:30   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_32638.xml
/usr/local/swadmin/mnx/xml
swadmin   8981  0.0  0.0 57432  680 ?        D    11:35   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   9016  0.0  0.0  5740  508 ?        D    11:35   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_8895.xml
/usr/local/swadmin/mnx/xml
swadmin  10973  0.0  0.0 57432  680 ?        D    11:36   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  17950  0.0  0.0 57432  680 ?        D    11:40   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  17954  0.0  0.0  5740  508 ?        D    11:40   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_17903.xml
/usr/local/swadmin/mnx/xml
swadmin  27647  0.0  0.0  5740  508 ?        D    11:45   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_27554.xml
/usr/local/swadmin/mnx/xml
swadmin  27648  0.0  0.0 57432  680 ?        D    11:45   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  29511  0.0  0.0 57432  680 ?        D    11:46   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   4015  0.0  0.0 57432  680 ?        D    11:50   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   4019  0.0  0.0  5740  508 ?        D    11:50   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_3966.xml
/usr/local/swadmin/mnx/xml
swadmin  12745  0.0  0.0 57432  680 ?        D    11:55   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  12780  0.0  0.0  5740  508 ?        D    11:55   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_12659.xml
/usr/local/swadmin/mnx/xml
swadmin  14914  0.0  0.0 57432  680 ?        D    11:56   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  21812  0.0  0.0 57432  680 ?        D    12:00   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  21816  0.0  0.0  5740  508 ?        D    12:00   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_21763.xml
/usr/local/swadmin/mnx/xml
swadmin  31395  0.0  0.0 57432  680 ?        D    12:05   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  31429  0.0  0.0  5740  508 ?        D    12:05   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_31310.xml
/usr/local/swadmin/mnx/xml
swadmin    877  0.0  0.0 57432  680 ?        D    12:06   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   7761  0.0  0.0 57432  680 ?        D    12:10   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   7764  0.0  0.0  5740  508 ?        D    12:10   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_7714.xml
/usr/local/swadmin/mnx/xml
swadmin  16646  0.0  0.0 57432  680 ?        D    12:15   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  16682  0.0  0.0  5740  508 ?        D    12:15   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_16550.xml
/usr/local/swadmin/mnx/xml
swadmin  18678  0.0  0.0 57432  680 ?        D    12:16   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  25644  0.0  0.0 57432  680 ?        D    12:20   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  25648  0.0  0.0  5740  508 ?        D    12:20   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_25597.xml
/usr/local/swadmin/mnx/xml
swadmin   2773  0.0  0.0 57432  680 ?        D    12:25   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   2809  0.0  0.0  5740  508 ?        D    12:25   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_2684.xml
/usr/local/swadmin/mnx/xml
swadmin   4737  0.0  0.0 57432  680 ?        D    12:26   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  11590  0.0  0.0 57432  680 ?        D    12:30   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  11594  0.0  0.0  5740  508 ?        D    12:30   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_11542.xml
/usr/local/swadmin/mnx/xml
swadmin  20617  0.0  0.0 57432  680 ?        D    12:35   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  20690  0.0  0.0  5740  508 ?        D    12:35   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_20527.xml
/usr/local/swadmin/mnx/xml
swadmin  22627  0.0  0.0 57432  680 ?        D    12:36   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  29544  0.0  0.0 57432  680 ?        D    12:40   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  29548  0.0  0.0  5740  508 ?        D    12:40   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_29494.xml
/usr/local/swadmin/mnx/xml
swadmin   6688  0.0  0.0 57432  680 ?        D    12:45   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   6725  0.0  0.0  5740  508 ?        D    12:45   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_6592.xml
/usr/local/swadmin/mnx/xml
swadmin   8594  0.0  0.0 57432  680 ?        D    12:46   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  15683  0.0  0.0 57432  680 ?        D    12:50   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  15687  0.0  0.0  5740  508 ?        D    12:50   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_15636.xml
/usr/local/swadmin/mnx/xml
swadmin  24548  0.0  0.0 57432  680 ?        D    12:55   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  24583  0.0  0.0  5740  508 ?        D    12:55   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_24459.xml
/usr/local/swadmin/mnx/xml
swadmin  26565  0.0  0.0 57432  680 ?        D    12:56   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin    935  0.0  0.0 57432  680 ?        D    13:00   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin    939  0.0  0.0  5740  508 ?        D    13:00   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_885.xml
/usr/local/swadmin/mnx/xml
swadmin  10584  0.0  0.0 57432  680 ?        D    13:05   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  10618  0.0  0.0  5740  508 ?        D    13:05   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_10495.xml
/usr/local/swadmin/mnx/xml
swadmin  12581  0.0  0.0 57432  680 ?        D    13:06   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  19624  0.0  0.0 57432  680 ?        D    13:10   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  19628  0.0  0.0  5740  508 ?        D    13:10   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_19576.xml
/usr/local/swadmin/mnx/xml
swadmin  28491  0.0  0.0 57432  680 ?        D    13:15   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  28529  0.0  0.0  5740  508 ?        D    13:15   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_28402.xml
/usr/local/swadmin/mnx/xml
swadmin  30511  0.0  0.0 57432  680 ?        D    13:16   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   4883  0.0  0.0 57432  680 ?        D    13:20   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   4887  0.0  0.0  5740  508 ?        D    13:20   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_4836.xml
/usr/local/swadmin/mnx/xml
swadmin  14659  0.0  0.0 57432  680 ?        D    13:25   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  14710  0.0  0.0  5740  508 ?        D    13:25   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_14571.xml
/usr/local/swadmin/mnx/xml
swadmin  16596  0.0  0.0 57432  680 ?        D    13:26   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  23596  0.0  0.0 57432  680 ?        D    13:30   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  23599  0.0  0.0  5740  508 ?        D    13:30   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_23547.xml
/usr/local/swadmin/mnx/xml
swadmin  32400  0.0  0.0 57432  680 ?        D    13:35   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  32435  0.0  0.0  5740  508 ?        D    13:35   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_32314.xml
/usr/local/swadmin/mnx/xml
swadmin   1951  0.0  0.0 57432  680 ?        D    13:36   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   8793  0.0  0.0 57432  680 ?        D    13:40   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   8797  0.0  0.0  5740  508 ?        D    13:40   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_8744.xml
/usr/local/swadmin/mnx/xml
swadmin  18650  0.0  0.0 57432  680 ?        D    13:45   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  18689  0.0  0.0  5740  508 ?        D    13:45   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_18556.xml
/usr/local/swadmin/mnx/xml
swadmin  19875  0.0  0.0 57432  680 ?        D    13:46   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  27598  0.0  0.0 57432  680 ?        D    13:50   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  27602  0.0  0.0  5740  508 ?        D    13:50   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_27546.xml
/usr/local/swadmin/mnx/xml
swadmin   3931  0.0  0.0  5740  508 ?        D    13:55   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_3840.xml
/usr/local/swadmin/mnx/xml
swadmin   3932  0.0  0.0 57432  680 ?        D    13:55   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   5942  0.0  0.0 57432  680 ?        D    13:56   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  12828  0.0  0.0 57432  680 ?        D    14:00   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  12832  0.0  0.0  5740  508 ?        D    14:00   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_12778.xml
/usr/local/swadmin/mnx/xml
swadmin  22051  0.0  0.0 57432  680 ?        D    14:05   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  22576  0.0  0.0  5740  508 ?        D    14:05   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_21971.xml
/usr/local/swadmin/mnx/xml
swadmin  24061  0.0  0.0 57432  680 ?        D    14:06   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  31889  0.0  0.0 57432  680 ?        D    14:10   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  31893  0.0  0.0  5740  508 ?        D    14:10   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_31842.xml
/usr/local/swadmin/mnx/xml
swadmin   8378  0.0  0.0 57432  680 ?        D    14:15   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin   8412  0.0  0.0  5740  508 ?        D    14:15   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_8283.xml
/usr/local/swadmin/mnx/xml
swadmin  10393  0.0  0.0 57432  680 ?        D    14:16   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  17534  0.0  0.0 57432  680 ?        D    14:20   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  17537  0.0  0.0  5740  508 ?        D    14:20   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_17486.xml
/usr/local/swadmin/mnx/xml
root     21391  8.0  0.0 80764 12020 pts/3   D+   14:21   0:20 /usr/lib/rpm/rpmv
-V --all
swadmin  29982  0.0  0.0 57432  680 ?        D    14:25   0:00 ls
/usr/local/swadmin/mnx/xml
swadmin  30015  0.0  0.0  5740  508 ?        D    14:25   0:00 mv
/usr/local/swadmin/mnx/tmp/msg_mnx_zhlr421a_rds_s_total_29894.xml
/usr/local/swadmin/mnx/xml
swadmin  32556  0.0  0.0 57432  680 ?        D    14:26   0:00 ls
/usr/local/swadmin/mnx/xml
root      1078  0.0  0.0 53136  708 pts/8    S+   14:26   0:00 grep D
root@zhlr421a:~# ps aux | grep D | wc -l
136

do you need any other Information?

thx mike
Comment 5 Michael Hagmann 2007-06-21 10:55:30 EDT
Created attachment 157546 [details]
dlm logs etc
Comment 6 Michael Hagmann 2007-06-21 11:09:54 EDT
Created attachment 157548 [details]
dlm logs b-member
Comment 7 David Teigland 2007-06-21 14:59:38 EDT
This looks like the same thing as last time:

b
Resource 00000103ddff26e8 (parent 0000000000000000). Name (len=24) "       2   
      71fc6b" 
Master Copy
Granted Queue
44f00042 PR Remote:   1 c9580352
Conversion Queue
Waiting Queue

a
Resource 000001014d140948 (parent 0000000000000000). Name (len=24) "       2   
      71fc6b" 
Local Copy, Master is node 2
Granted Queue
Conversion Queue
Waiting Queue
c9580352 PR (EX) Master:     44f00042  LQ: 0,0x5

a:dlm_debug
lt_sharedroot grant lock on lockqueue 2
lt_sharedroot process_lockqueue_reply id c4740348 state 0
lt_sharedroot grant lock on lockqueue 2
lt_sharedroot process_lockqueue_reply id cb0f0057 state 0
lt_sharedroot grant lock on lockqueue 2
lt_sharedroot process_lockqueue_reply id c7f20004 state 0
lt_sharedroot grant lock on lockqueue 3
lt_sharedroot process_lockqueue_reply id c4de02c3 state 0
lt_sharedroot grant lock on lockqueue 3
lt_sharedroot process_lockqueue_reply id c9580352 state 0   <---
lt_sharedroot grant lock on lockqueue 3
lt_sharedroot process_lockqueue_reply id c7b10203 state 0
lt_sharedroot send einval to 2

I've been working through a number of different theories about what could
have happened, but none of them are holding up.  I'll try again to come up
with a debugging patch that might narrow in on what's happening.
Comment 8 David Teigland 2007-06-21 17:01:57 EDT
Created attachment 157569 [details]
debugging patch

Could you add this patch to the system?  It adds some extra info to some
existing debug statements, switches one log_debug to log_error, and adds
a couple new sanity checks.
Comment 9 Michael Hagmann 2007-06-21 17:13:03 EDT
Hi Dave

that't maybe possible can you send me a Howto? Do I expect some negative Impact
with this patch? This is not a Testsystem.

thx mike
Comment 10 Michael Hagmann 2007-06-21 17:16:04 EDT
Do we need a reboot to enable this patch?

thx mike 
Comment 11 David Teigland 2007-06-21 17:27:30 EDT
./configure --kernel_src=/lib/modules/`uname -r`/build/
cd cluster/dlm-kernel/src; make
you should save off the current dlm.ko somewhere, then
make install.

The patch will probably produce a little extra debugging info
in dmesg and /var/log/messages.  The first and third chunks of
the patch are new sanity checks that shouldn't appear at all.
If the dlm debugging messages are too "noisy" with the patch,
then please send me the output and go back to the original dlm.ko.

Given that it's a shared root setup you may need to reboot, and my instructions
at the top may not work since you may need to include the new dlm in the initrd.
 You probably need some advice from the atix experts on how to go about this.

Comment 12 Michael Hagmann 2007-06-21 17:47:28 EDT
Hi Dave

thx for your Howto, but as you said I need help here because all of our Clusters
are sharedroot.

I will ask Marc to made us a proper initrd.

thx mike
Comment 13 David Teigland 2008-01-29 13:33:40 EST
*** Bug 368761 has been marked as a duplicate of this bug. ***
Comment 14 David Teigland 2008-01-29 13:35:25 EST
This is probably a duplicate of bug 349001 which was recently fixed.
Comment 17 David Teigland 2009-01-20 15:37:23 EST

*** This bug has been marked as a duplicate of bug 349001 ***

Note You need to log in before you can comment on or make changes to this bug.