Bug 607745
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | cpg may have issue with join/leave in tight loop | | |
| Product: | [Retired] Corosync Cluster Engine | Reporter: | Steven Dake <sdake> |
| Component: | cpg | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 1.3 | CC: | 2clarkd, agk, asalkeld, fdinitto, jfriesse, sdake |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-07-29 08:15:29 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Steven Dake, 2010-06-24 17:31:02 UTC

Created attachment 426659 [details]: test case
During tests with two nodes, code coverage of the daemon during the failure indicates that the process_info_list_head list seems to be holding onto an entry for the group name, so message_handler_req_lib_cpg_join() repeatedly returns CPG_ERR_TRYAGAIN. It may be that the cpd_state for the entry is not moving out of CPD_STATE_JOIN_STARTED. Perhaps there are race conditions between the TOTEM messages exchanged between daemons and the local requests arriving from a client's IPC library for a new group? I have witnessed cases where the leave request occurs with the cpd_state at CPD_STATE_JOIN_COMPLETED, but more often the cpd_state is CPD_STATE_UNJOINED. Are there state transitions directly from CPD_STATE_JOIN_STARTED to CPD_STATE_LEAVE_STARTED?

After augmenting the logging a bit and analyzing code coverage of the corosync daemon, the re-use of the same 'conn_info' pointer addresses across subsequent finalize/initialize cycles with the corosync daemon may exacerbate the bug when a particular client application attempts to re-establish connectivity to a group.

The root cause of the problem (lots of speculation here) may actually be a race condition between the reception of messages from the client application and the processing of messages from remote daemons through the totem protocol. Perhaps the processing of the leave request does not occur in time to clean up the cpg_pd_list (other times it is the process_info_list) before the application's next cpg_join request comes in. Is it possible that the cpg_pd_list and the process_info_list that are part of services/cpg.c are not being properly cleaned up early enough during the finalize operation?

Note in the log below how the 'try_again' error occurs, yet a gratuitous dump of the process_info_list still shows the group despite the finalize request. Note also the similarity in conn pointer values!
```
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13c
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 13d to 13e
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 13e to pending delivery queue
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_exec_cpg_procjoin:855 got procjoin message from cluster node 996481452 name ./stress_finalize_groupName-0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:871 got procleave message from cluster node 996481452 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:888 procleave message from cluster node 996481452 deleted process_info_list name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG ] cpg.c:cpg_lib_exit_fn:662 exit_fn for conn=0x7f4bbc00a430
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 13e
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13d
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 13e to 13f
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 13f to pending delivery queue
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_exec_cpg_procleave:871 got procleave message from cluster node 996481452 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG ] cpg.c:cpg_lib_init_fn:1051 lib_init_fn: conn=0x7f4bbc011f40, cpd=0x7f4bbc011770
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 13f
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13e
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1064 got lib_cpg_join request on conn=0x7f4bbc011f40, cpd=0x7f4bbc011770 cpd state 0 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name /0 conn=0x7f4bbc011f40 cpd=0x7f4bbc011770 cpd state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name /0 conn=0x7f4bbc011f40 cpd=0x7f4bbc011770 cpd state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1087 lib_cpg_join: dump process_info_list name ./stress_finalize_groupName-0 conn=0x7f4bbc011f40, cpd=0x7f4bbc011770 cpd state 0
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13f
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 13f to 140
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 140 to pending delivery queue
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_mcast:1206 got mcast request on conn 0x7f4bbc011f40 cpd 0x7f4bbc011770 state 2 name ./stress_finalize_groupName-1/30
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_exec_cpg_procjoin:855 got procjoin message from cluster node 996481452 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_leave:1137 got leave request on conn 0x7f4bbc013020 cpd 0x7f4bbc013560 state 0 name ./stress_finalize_groupName-0
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 140
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_finalize:1175 cpg finalize for conn=0x7f4bbc013020 state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:cpg_lib_exit_fn:662 exit_fn for conn=0x7f4bbc013020
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 140 to 141
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 141 to pending delivery queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 141
Jun 28 17:51:14 corosync [CPG ] cpg.c:cpg_lib_init_fn:1051 lib_init_fn: conn=0x7f4bbc013020, cpd=0x7f4bbc016280
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 140
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 141
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1064 got lib_cpg_join request on conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0 name ./stress_finalize_groupName-0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name /0 conn=0x7f4bbc013020 cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name ./stress_finalize_groupName-1/30 conn=0x7f4bbc013020 cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1087 lib_cpg_join: dump process_info_list name ./stress_finalize_groupName-0 conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1092 lib_cpg_join: err_try_again conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG ] cpg.c:message_handler_req_lib_cpg_join:1064 got lib_cpg_join request on conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0 name ./stress_finalize_groupName-0
```

This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle. Changing version to '14'. More information and the reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

I've tried to reproduce this issue and was not very successful. Between 1.2.x and 1.4.x there were a lot of fixes, and I've also fixed totem_mcast error handling in cpg in Bug 664969. It may be related (and I'm pretty sure it is), so can anybody confirm:
- whether the bug is still present
- whether the patches from bz#664969 fix the problem

It was easily reproducible before. It may be fixed. Recommend closing the bug. If someone comes across this problem again, it can be reopened.