Bug 351321 - messages stranded in requestqueue
Summary: messages stranded in requestqueue
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: dlm-kernel
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-10-24 20:45 UTC by David Teigland
Modified: 2010-01-05 20:33 UTC
CC: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2010-01-05 20:33:21 UTC
Embargoed:


Attachments (Terms of Use)
patch to test (2.27 KB, text/plain)
2007-10-25 20:54 UTC, David Teigland
no flags
patch to try (2.27 KB, text/plain)
2007-10-30 18:06 UTC, David Teigland
no flags

Description David Teigland 2007-10-24 20:45:33 UTC
Description of problem:

Hit when running the test in bug 299061 comment 52.

add_to_requestqueue() can add a new message to the requestqueue
just after process_requestqueue() checks it and determines it's
empty.  This means dlm_recvd will spin forever in wait_requestqueue()
waiting for the message to be removed.

[root@marathon-02 ~]# cat /proc/cluster/dlm_debug  | grep lv0
lv0 total nodes 3
lv0 rebuild resource directory
lv0 rebuilt 1 resources
lv0 purge requests
lv0 purged 0 requests
lv0 mark waiting requests
lv0 marked 0 requests
lv0 purge locks of departed nodes
lv0 purged 0 locks
lv0 update remastered resources
lv0 updated 12 resources
lv0 rebuild locks
lv0 rebuilt 0 locks
lv0 recover event 12936 done
lv0 move flags 0,0,1 ids 12929,12936,12936
lv0 process held requests
lv0 add_to_requestq cmd 1 fr 4
lv0 processed 0 requests
lv0 resend marked requests
lv0 resent 0 requests
lv0 recover event 12936 finished

CPU2:
0000010080caff68 0000000000000000 000001007d14fc28 0000000000000000
       000001007d14fdc8 ffffffff802397c0 00000000000000c3 ffffffff8011c6b2
       0000010078d6a400 ffffffff80110b69
Call Trace:<IRQ> <ffffffff802397c0>{showacpu+45}
<ffffffff8011c6b2>{smp_call_function_interrupt+64}
       <ffffffff80110b69>{call_function_interrupt+133}  <EOI>
<ffffffff802a6b7c>{pci_conf1_read+0}
       <ffffffffa0240993>{:dlm:wait_requestqueue+42}
<ffffffffa0240989>{:dlm:wait_requestqueue+32}
       <ffffffffa02413c5>{:dlm:process_cluster_request+104}
       <ffffffffa0249f0c>{:dlm:rcom_process_message+1069}
       <ffffffffa0245a11>{:dlm:midcomms_process_incoming_buffer+782}
       <ffffffffa024b03a>{:dlm:restbl_rsb_update_recv+280}
       <ffffffffa0243b36>{:dlm:receive_from_sock+623}
<ffffffff8014ba68>{keventd_create_kthread+0}
       <ffffffffa0244117>{:dlm:dlm_recvd+289} <ffffffffa0243ff6>{:dlm:dlm_recvd+0}
       <ffffffff8014ba3f>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014ba68>{keventd_create_kthread+0} <ffffffff8014b977>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}



Comment 1 David Teigland 2007-10-25 20:54:13 UTC
Created attachment 238011 [details]
patch to test

The same problem has been found and fixed in the RHEL5 code
(and then changed again recently).  This patch is the equivalent
for RHEL4.

Comment 2 David Teigland 2007-10-29 17:50:12 UTC
Hit this again, without the patch above applied.

marathon-01

lv2 process held requests
lv2 processed 0 requests
lv2 resend marked requests
lv2 resent 0 requests
lv2 recover event 58418 finished
lv2 move flags 1,0,0 ids 58418,58418,58418
lv2 move flags 0,1,0 ids 58418,58423,58418
lv2 move use event 58423
lv2 recover event 58423
lv2 remove node 3
lv2 total nodes 3
lv2 rebuild resource directory
lv2 rebuilt 2 resources
lv2 purge requests
lv2 purged 0 requests
lv2 mark waiting requests
lv2 marked 0 requests
lv2 purge locks of departed nodes
lv2 purged 0 locks
lv2 update remastered resources
lv2 updated 1 resources
lv2 rebuild locks
lv2 rebuilt 0 locks
lv2 recover event 58423 done
lv2 move flags 0,0,1 ids 58418,58423,58423
lv2 process held requests
lv2 processed 0 requests
lv2 resend marked requests
lv2 resent 0 requests
lv2 recover event 58423 finished
lv2 add_to_requestq cmd 1 fr 4


CPU1:
00000100cffe3f68 0000000000000000 0000010124e9db58 0000000000000000
       0000010074107000 ffffffff802397c0 0000000000000002 ffffffff8011c6b2
       0000010122032000 ffffffff80110b69
Call Trace:<IRQ> <ffffffff802397c0>{showacpu+45}
<ffffffff8011c6b2>{smp_call_function_interrupt+64}
       <ffffffff80110b69>{call_function_interrupt+133}  <EOI>
<ffffffff8030b192>{schedule+120}
       <ffffffff8030bd43>{thread_return+166} <ffffffff80133159>{__might_sleep+173}
       <ffffffffa01ea9c2>{:dlm:wait_requestqueue+89}
<ffffffffa01eb3c5>{:dlm:process_cluster_request+104}
       <ffffffffa01f3f1e>{:dlm:rcom_process_message+1087}
       <ffffffffa01efa11>{:dlm:midcomms_process_incoming_buffer+782}
       <ffffffff80135c64>{autoremove_wake_function+0}
<ffffffffa01edb36>{:dlm:receive_from_sock+623}
       <ffffffff8014ba68>{keventd_create_kthread+0}
<ffffffffa01ee117>{:dlm:dlm_recvd+289}
       <ffffffffa01edff6>{:dlm:dlm_recvd+0} <ffffffff8014ba3f>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014ba68>{keventd_create_kthread+0}
       <ffffffff8014b977>{kthread+0} <ffffffff80110f3f>{child_rip+0}


[root@marathon-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 2 3 4]

DLM Lock Space:  "lv1"                             29049 13496 update    U-11,9,4
[3 1 2]

DLM Lock Space:  "lv2"                             29094 13498 run       -
[1 4 2]

GFS Mount Group: "lv1"                             29092 13497 run       -
[1 2]

GFS Mount Group: "lv2"                             29144 13499 update   
SU-10,280,36,2,2
[1 4 2]


Comment 3 David Teigland 2007-10-30 18:06:00 UTC
Created attachment 243711 [details]
patch to try

fixed the patch

Comment 4 David Teigland 2008-01-14 15:58:08 UTC
fix checked into RHEL4 branch

Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/lockqueue.c,v  <--  lockqueue.c
new revision: 1.37.2.12; previous revision: 1.37.2.11


Comment 5 David Teigland 2010-01-05 20:33:21 UTC
I have no idea which update this fix went into:
http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=c8d815e711e20c54f38c381df40cf5a6ca75884b

