Bug 351321 - messages stranded in requestqueue
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm-kernel
Hardware/OS: All / Linux
Priority: low  Severity: low
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2007-10-24 16:45 EDT by David Teigland
Modified: 2010-01-05 15:33 EST

Doc Type: Bug Fix
Last Closed: 2010-01-05 15:33:21 EST

Attachments:
patch to test (2.27 KB, text/plain) - 2007-10-25 16:54 EDT, David Teigland
patch to try (2.27 KB, text/plain) - 2007-10-30 14:06 EDT, David Teigland
Description David Teigland 2007-10-24 16:45:33 EDT
Description of problem:

Hit when running the test in bug 299061 comment 52.

add_to_requestqueue() can add a new message to the requestqueue
just after process_requestqueue() checks it and determines it's
empty.  This means dlm_recvd will spin forever in wait_requestqueue()
waiting for the message to be removed.
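The interleaving can be sketched outside the kernel. In the broken ordering, the empty-check and the "drain finished" decision are two separate steps, so a producer can slip a message in between them and the message is stranded. A common fix pattern (a minimal pthread sketch with hypothetical names that only mirror the roles of add_to_requestqueue()/process_requestqueue(); this is not the actual dlm patch) is to make the empty-check and the completion transition a single step under the queue lock, and to have the producer fall back to direct delivery once the drain has finished:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical miniature of the requestqueue handoff. */
struct reqqueue {
	pthread_mutex_t lock;
	int count;            /* messages waiting on the requestqueue */
	bool drain_finished;  /* recovery has decided the queue is empty */
};

void rq_init(struct reqqueue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->count = 0;
	q->drain_finished = false;
}

/* Producer side: queue the message only while the drain is still open.
 * Returns false once the drain has finished, telling the caller to
 * deliver the message directly instead of stranding it on the queue. */
bool rq_add(struct reqqueue *q)
{
	bool queued = false;

	pthread_mutex_lock(&q->lock);
	if (!q->drain_finished) {
		q->count++;
		queued = true;
	}
	pthread_mutex_unlock(&q->lock);
	return queued;
}

/* Consumer side: the empty-check and the transition to "finished" happen
 * as one step under the lock, so a message cannot sneak in between them.
 * Returns false if a message arrived; the caller must drain again. */
bool rq_finish_drain(struct reqqueue *q)
{
	bool done = false;

	pthread_mutex_lock(&q->lock);
	if (q->count == 0) {
		q->drain_finished = true;
		done = true;
	}
	pthread_mutex_unlock(&q->lock);
	return done;
}
```

With this shape, a message that arrives after the drain closes is never left on the queue with nobody to process it, which is exactly the stranding described above.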

[root@marathon-02 ~]# cat /proc/cluster/dlm_debug  | grep lv0
lv0 total nodes 3
lv0 rebuild resource directory
lv0 rebuilt 1 resources
lv0 purge requests
lv0 purged 0 requests
lv0 mark waiting requests
lv0 marked 0 requests
lv0 purge locks of departed nodes
lv0 purged 0 locks
lv0 update remastered resources
lv0 updated 12 resources
lv0 rebuild locks
lv0 rebuilt 0 locks
lv0 recover event 12936 done
lv0 move flags 0,0,1 ids 12929,12936,12936
lv0 process held requests
lv0 add_to_requestq cmd 1 fr 4
lv0 processed 0 requests
lv0 resend marked requests
lv0 resent 0 requests
lv0 recover event 12936 finished

0000010080caff68 0000000000000000 000001007d14fc28 0000000000000000
       000001007d14fdc8 ffffffff802397c0 00000000000000c3 ffffffff8011c6b2
       0000010078d6a400 ffffffff80110b69
Call Trace:<IRQ> <ffffffff802397c0>{showacpu+45}
       <ffffffff80110b69>{call_function_interrupt+133}  <EOI>
       <ffffffffa0244117>{:dlm:dlm_recvd+289} <ffffffffa0243ff6>{:dlm:dlm_recvd+0}
       <ffffffff8014ba3f>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014ba68>{keventd_create_kthread+0} <ffffffff8014b977>{kthread+0}

Comment 1 David Teigland 2007-10-25 16:54:13 EDT
Created attachment 238011 [details]
patch to test

The same problem has been found and fixed in the RHEL5 code
(and then changed again recently).  This patch is the equivalent
for RHEL4.
Comment 2 David Teigland 2007-10-29 13:50:12 EDT
Hit this again, without the patch above applied:


lv2 process held requests
lv2 processed 0 requests
lv2 resend marked requests
lv2 resent 0 requests
lv2 recover event 58418 finished
lv2 move flags 1,0,0 ids 58418,58418,58418
lv2 move flags 0,1,0 ids 58418,58423,58418
lv2 move use event 58423
lv2 recover event 58423
lv2 remove node 3
lv2 total nodes 3
lv2 rebuild resource directory
lv2 rebuilt 2 resources
lv2 purge requests
lv2 purged 0 requests
lv2 mark waiting requests
lv2 marked 0 requests
lv2 purge locks of departed nodes
lv2 purged 0 locks
lv2 update remastered resources
lv2 updated 1 resources
lv2 rebuild locks
lv2 rebuilt 0 locks
lv2 recover event 58423 done
lv2 move flags 0,0,1 ids 58418,58423,58423
lv2 process held requests
lv2 processed 0 requests
lv2 resend marked requests
lv2 resent 0 requests
lv2 recover event 58423 finished
lv2 add_to_requestq cmd 1 fr 4

00000100cffe3f68 0000000000000000 0000010124e9db58 0000000000000000
       0000010074107000 ffffffff802397c0 0000000000000002 ffffffff8011c6b2
       0000010122032000 ffffffff80110b69
Call Trace:<IRQ> <ffffffff802397c0>{showacpu+45}
       <ffffffff80110b69>{call_function_interrupt+133}  <EOI>
       <ffffffff8030bd43>{thread_return+166} <ffffffff80133159>{__might_sleep+173}
       <ffffffffa01edff6>{:dlm:dlm_recvd+0} <ffffffff8014ba3f>{kthread+200}
       <ffffffff80110f47>{child_rip+8} <ffffffff8014ba68>{keventd_create_kthread+0}
       <ffffffff8014b977>{kthread+0} <ffffffff80110f3f>{child_rip+0}

[root@marathon-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3 4]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 2 3 4]

DLM Lock Space:  "lv1"                             29049 13496 update    U-11,9,4
[3 1 2]

DLM Lock Space:  "lv2"                             29094 13498 run       -
[1 4 2]

GFS Mount Group: "lv1"                             29092 13497 run       -
[1 2]

GFS Mount Group: "lv2"                             29144 13499 update   
[1 4 2]
Comment 3 David Teigland 2007-10-30 14:06:00 EDT
Created attachment 243711 [details]
patch to try

Fixed the patch.
Comment 4 David Teigland 2008-01-14 10:58:08 EST
Fix checked into the RHEL4 branch:

Checking in lockqueue.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/lockqueue.c,v  <--  lockqueue.c
new revision:; previous revision:
Comment 5 David Teigland 2010-01-05 15:33:21 EST
I have no idea which update this fix went into.
