Bug 351321 - messages stranded in requestqueue
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm-kernel
Hardware/OS: All / Linux
Priority: low  Severity: low
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2007-10-24 16:45 EDT by David Teigland
Modified: 2010-01-05 15:33 EST

Doc Type: Bug Fix
Last Closed: 2010-01-05 15:33:21 EST

Attachments:
patch to test (2.27 KB, text/plain) - 2007-10-25 16:54 EDT, David Teigland
patch to try (2.27 KB, text/plain) - 2007-10-30 14:06 EDT, David Teigland
Description David Teigland 2007-10-24 16:45:33 EDT
Description of problem:

Hit when running the test in bug 299061 comment 52.

add_to_requestqueue() can add a new message to the requestqueue
just after process_requestqueue() checks it and determines it's
empty.  This means dlm_recvd will spin forever in wait_requestqueue()
waiting for the message to be removed.
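The interleaving can be sketched outside the kernel. In the broken ordering, the empty-check and the "drain finished" decision are two separate steps, so a producer can slip a message in between them and the message is stranded. A common fix pattern (a minimal pthread sketch with hypothetical names that only mirror the roles of add_to_requestqueue()/process_requestqueue(); this is not the actual dlm patch) is to make the empty-check and the completion transition a single step under the queue lock, and to have the producer fall back to direct delivery once the drain has finished:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical miniature of the requestqueue handoff. */
struct reqqueue {
	pthread_mutex_t lock;
	int count;            /* messages waiting on the requestqueue */
	bool drain_finished;  /* recovery has decided the queue is empty */
};

void rq_init(struct reqqueue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->count = 0;
	q->drain_finished = false;
}

/* Producer side: queue the message only while the drain is still open.
 * Returns false once the drain has finished, telling the caller to
 * deliver the message directly instead of stranding it on the queue. */
bool rq_add(struct reqqueue *q)
{
	bool queued = false;

	pthread_mutex_lock(&q->lock);
	if (!q->drain_finished) {
		q->count++;
		queued = true;
	}
	pthread_mutex_unlock(&q->lock);
	return queued;
}

/* Consumer side: the empty-check and the transition to "finished" happen
 * as one step under the lock, so a message cannot sneak in between them.
 * Returns false if a message arrived; the caller must drain again. */
bool rq_finish_drain(struct reqqueue *q)
{
	bool done = false;

	pthread_mutex_lock(&q->lock);
	if (q->count == 0) {
		q->drain_finished = true;
		done = true;
	}
	pthread_mutex_unlock(&q->lock);
	return done;
}
```

With this shape, a message that arrives after the drain closes is never left on the queue with nobody to process it, which is exactly the stranding described above.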

[root@marathon-02 ~]# cat /proc/cluster/dlm_debug  | grep lv0
lv0 total nodes 3
lv0 rebuild resource directory
lv0 rebuilt 1 resources
lv0 purge requests
lv0 purged 0 requests
lv0 mark waiting requests
lv0 marked 0 requests
lv0 purge locks of departed nodes
lv0 purged 0 locks
lv0 update remastered resources
lv0 updated 12 resources
lv0 rebuild locks
lv0 rebuilt 0 locks
lv0 recover event 12936 done
lv0 move flags 0,0,1 ids 12929,12936,12936
lv0 process held requests
lv0 add_to_requestq cmd 1 fr 4
lv0 processed 0 requests
lv0 resend marked requests
lv0 resent 0 requests
lv0 recover event 12936 finished

0000010080caff68 0000000000000000 000001007d14fc28 0000000000000000
       000001007d14fdc8 ffffffff802397c0 00000000000000c3 ffffffff8011c6b2
       0000010078d6a400 ffffffff80110b69
Call Trace:<IRQ> <ffffffff802397c0>{showacpu+45}
       <ffffffff80110b69>{call_function_interrupt+133}  <EOI>
       <ffffffffa0244117>{:dlm:dlm_recvd+289} <ffffffffa0243ff6>{:dlm:dlm_recvd+0}
       <ffffffff8014ba3f>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014ba68>{keventd_create_kthread+0} <ffffffff8014b977>{kthread+0}

Comment 1 David Teigland 2007-10-25 16:54:13 EDT
Created attachment 238011 [details]
patch to test

The same problem has been found and fixed in the RHEL5 code
(and then changed again recently).  This patch is the equivalent
for RHEL4.
Comment 2 David Teigland 2007-10-29 13:50:12 EDT
Hit this again, without the patch above applied:


lv2 process held requests
lv2 processed 0 requests
lv2 resend marked requests
lv2 resent 0 requests
lv2 recover event 58418 finished
lv2 move flags 1,0,0 ids 58418,58418,58418
lv2 move flags 0,1,0 ids 58418,58423,58418
lv2 move use event 58423
lv2 recover event 58423
lv2 remove node 3
lv2 total nodes 3
lv2 rebuild resource directory
lv2 rebuilt 2 resources
lv2 purge requests
lv2 purged 0 requests
lv2 mark waiting requests
lv2 marked 0 requests
lv2 purge locks of departed nodes
lv2 purged 0 locks
lv2 update remastered resources
lv2 updated 1 resources
lv2 rebuild locks
lv2 rebuilt 0 locks
lv2 recover event 58423 done
lv2 move flags 0,0,1 ids 58418,58423,58423
lv2 process held requests
lv2 processed 0 requests
lv2 resend marked requests
lv2 resent 0 requests
lv2 recover event 58423 finished
lv2 add_to_requestq cmd 1 fr 4

00000100cffe3f68 0000000000000000 0000010124e9db58 0000000000000000
       0000010074107000 ffffffff802397c0 0000000000000002 ffffffff8011c6b2
       0000010122032000 ffffffff80110b69
Call Trace:<IRQ> <ffffffff802397c0>{showacpu+45}
       <ffffffff80110b69>{call_function_interrupt+133}  <EOI>
       <ffffffff8030bd43>{thread_return+166} <ffffffff80133159>{__might_sleep+173}
       <ffffffffa01edff6>{:dlm:dlm_recvd+0} <ffffffff8014ba3f>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014ba68>{keventd_create_kthread+0}
       <ffffffff8014b977>{kthread+0} <ffffffff80110f3f>{child_rip+0}

[root@marathon-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 2 3 4]

DLM Lock Space:  "lv1"                             29049 13496 update    U-11,9,4
[3 1 2]

DLM Lock Space:  "lv2"                             29094 13498 run       -
[1 4 2]

GFS Mount Group: "lv1"                             29092 13497 run       -
[1 2]

GFS Mount Group: "lv2"                             29144 13499 update   
[1 4 2]
Comment 3 David Teigland 2007-10-30 14:06:00 EDT
Created attachment 243711 [details]
patch to try

Fixed the patch.
Comment 4 David Teigland 2008-01-14 10:58:08 EST
Fix checked into the RHEL4 branch:

Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/lockqueue.c,v  <--  lockqueue.c
new revision:; previous revision:
Comment 5 David Teigland 2010-01-05 15:33:21 EST
I have no idea which update this fix went into.
