Bug 607745 - cpg may have issue with join/leave in tight loop
Status: CLOSED INSUFFICIENT_DATA
Product: Corosync Cluster Engine
Classification: Community
Component: cpg
Version: 1.3
Hardware/OS: All Linux
Priority: low  Severity: medium
Assigned To: Jan Friesse
Reported: 2010-06-24 13:31 EDT by Steven Dake
Modified: 2016-04-26 20:40 EDT
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2011-07-29 04:15:29 EDT

Attachments
test case (8.31 KB, text/plain)
2010-06-24 13:31 EDT, Steven Dake

Description Steven Dake 2010-06-24 13:31:02 EDT
Description of problem:
If the attached test program is run with an iteration count greater than
about 10, group joins for the specified group name tend to start failing
with CS_ERR_TRY_AGAIN and never recover (trying again doesn't help :).
This test was run on a single node of a 3-node system, but it may
reproduce similar problems on a smaller number of nodes.
./stress_finalize -i 10 -j 1 junk
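
For reference, the heart of such a test is just a tight initialize/join/leave/finalize loop against libcpg. A minimal sketch of that loop follows (NOT the attached program itself; the group name, iteration count, and lack of option parsing are placeholders):

/* Minimal sketch of the join/leave loop the attached test case exercises.
 * NOT the attached program; group name and iteration count are placeholders.
 * Build with: gcc repro.c -o repro -lcpg */
#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

int main(void)
{
	struct cpg_name group;
	cpg_callbacks_t callbacks;
	int i;

	memset(&callbacks, 0, sizeof(callbacks));
	strcpy(group.value, "junk");
	group.length = strlen(group.value);

	for (i = 0; i < 10; i++) {
		cpg_handle_t handle;
		cs_error_t err;

		if (cpg_initialize(&handle, &callbacks) != CS_OK)
			return 1;
		err = cpg_join(handle, &group);
		if (err == CS_ERR_TRY_AGAIN)
			printf("iteration %d: cpg_join returned CS_ERR_TRY_AGAIN\n", i);
		else if (err == CS_OK)
			cpg_leave(handle, &group);
		cpg_finalize(handle);
	}
	return 0;
}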

Version-Release number of selected component (if applicable):
corosync-1.2.5

How reproducible:
low rate

Steps to Reproduce:
1. Run the attached test case as shown in the description above.

Actual results:
cpg_join() repeatedly fails with CS_ERR_TRY_AGAIN and never recovers.

Expected results:
Joins succeed; any CS_ERR_TRY_AGAIN is transient.

Additional info:
Comment 1 Steven Dake 2010-06-24 13:31:30 EDT
Created attachment 426659 [details]
test case
Comment 2 dan clark 2010-06-28 21:11:38 EDT
During tests with two nodes, code coverage of the daemon during the failure suggests that the process_info_list_head list is holding onto an entry for the group name, so message_handler_req_lib_cpg_join() repeatedly returns CS_ERR_TRY_AGAIN.  It may be that the cpd_state for the entry never moves on from CPD_STATE_JOIN_STARTED.  Perhaps there is a race between the TOTEM messages exchanged between daemons and the local requests a client makes through the IPC library for a new group?  I have witnessed cases where the leave request occurs with the cpd_state at CPD_STATE_JOIN_COMPLETED, but more often the cpd_state is CPD_STATE_UNJOINED.  Are there state transitions directly from CPD_STATE_JOIN_STARTED to CPD_STATE_LEAVE_STARTED?  (The states are sketched below for orientation.)
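
For orientation, the per-connection states named above form a small enum in the daemon. The declaration below is a sketch: only the state names are taken from this discussion, and their ordering/numbering is an assumption, so do not map it literally onto the "cpd state N" values in the log further down.

/* Sketch of the per-connection CPG join states discussed above.  Only the
 * state names come from this comment; their order/numbering here is an
 * assumption, not copied from services/cpg.c. */
enum cpd_state {
	CPD_STATE_UNJOINED,
	CPD_STATE_LEAVE_STARTED,
	CPD_STATE_JOIN_STARTED,
	CPD_STATE_JOIN_COMPLETED
};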

Augmenting the logging a bit and analyzing code coverage of the corosync daemon suggests that the re-use of 'conn_info' pointer addresses between subsequent finalize/initialize of communications with the corosync daemon may exacerbate the bug when a particular client application attempts to re-establish connectivity to a group.  The root cause of the problem (lots of speculation here) may actually be a race condition between the reception of messages from the client application and the processing of messages from remote daemons through the totem protocol.  Perhaps the processing of the leave request does not occur in time to clean up the cpg_pd_list (other times it is the process_info_list) before a request from the application comes in to attempt a new cpg_join.

Is it possible that the cpg_pd_list and the process_info_list that are part of services/cpg.c are not being cleaned up early enough during the finalize operation?  (A sketch of this suspected failure mode follows.)
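
To make the suspicion concrete, the suspected failure mode has roughly this shape. This is a speculative, self-contained sketch, NOT the actual services/cpg.c code; every type and name below is illustrative:

/* Speculative sketch of the suspected failure mode; NOT services/cpg.c.
 * If the leave/finalize cleanup loses the race and never removes the stale
 * process_info entry, every retry of the join is refused forever. */
#include <stdio.h>
#include <string.h>

enum { SKETCH_OK, SKETCH_TRY_AGAIN };  /* stand-ins for CS_OK/CS_ERR_TRY_AGAIN */

struct process_info {
	char group[128];
	struct process_info *next;
};

/* Entry left behind because the leave/finalize cleanup never ran. */
static struct process_info stale = { "./stress_finalize_groupName-0", NULL };
static struct process_info *process_info_list = &stale;

static int handle_cpg_join(const char *group)
{
	const struct process_info *pi;

	for (pi = process_info_list; pi != NULL; pi = pi->next)
		if (strcmp(pi->group, group) == 0)
			return SKETCH_TRY_AGAIN;  /* stale entry => join refused */
	return SKETCH_OK;
}

int main(void)
{
	int i;

	/* Retrying cannot help: nothing ever reaps the stale entry. */
	for (i = 0; i < 3; i++)
		printf("join attempt %d: %s\n", i,
		       handle_cpg_join("./stress_finalize_groupName-0")
		       == SKETCH_TRY_AGAIN ? "TRY_AGAIN" : "OK");
	return 0;
}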

Note in the log below how the 'try_again' error occurs, yet a gratuitous dump of the process_info_list still shows the group despite the finalize request!  Note also the similarity in conn pointer values!

Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13c
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 13d to 13e
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 13e to pending delivery queue
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_exec_cpg_procjoin:855 got procjoin message from cluster node 996481452 name ./stress_finalize_groupName-0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_exec_cpg_procleave:871 got procleave message from cluster node 996481452 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_exec_cpg_procleave:888 procleave message from cluster node 996481452 deleted process_info_list name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG   ] cpg.c:cpg_lib_exit_fn:662 exit_fn for conn=0x7f4bbc00a430
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 13e
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13d
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 13e to 13f
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 13f to pending delivery queue
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_exec_cpg_procleave:871 got procleave message from cluster node 996481452 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG   ] cpg.c:cpg_lib_init_fn:1051 lib_init_fn: conn=0x7f4bbc011f40, cpd=0x7f4bbc011770
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 13f
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13e
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1064 got lib_cpg_join request on conn=0x7f4bbc011f40, cpd=0x7f4bbc011770 cpd state 0 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name /0 conn=0x7f4bbc011f40 cpd=0x7f4bbc011770 cpd state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name /0 conn=0x7f4bbc011f40 cpd=0x7f4bbc011770 cpd state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1087 lib_cpg_join: dump process_info_list name ./stress_finalize_groupName-0 conn=0x7f4bbc011f40, cpd=0x7f4bbc011770 cpd state 0
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 13f
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 13f to 140
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 140 to pending delivery queue
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_mcast:1206 got mcast request on conn 0x7f4bbc011f40 cpd 0x7f4bbc011770 state 2 name ./stress_finalize_groupName-1/30
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_exec_cpg_procjoin:855 got procjoin message from cluster node 996481452 name ./stress_finalize_groupName-1
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_leave:1137 got leave request on conn 0x7f4bbc013020 cpd 0x7f4bbc013560 state 0 name ./stress_finalize_groupName-0
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 140
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_finalize:1175 cpg finalize for conn=0x7f4bbc013020 state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:cpg_lib_exit_fn:662 exit_fn for conn=0x7f4bbc013020
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:totemsrp_mcast:2156 mcasted message added to pending queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3644 Delivering 140 to 141
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_deliver_to_app:3713 Delivering MCAST message with seq 141 to pending delivery queue
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:message_handler_mcast:3809 Received ringid(172.25.101.59:4296) seq 141
Jun 28 17:51:14 corosync [CPG   ] cpg.c:cpg_lib_init_fn:1051 lib_init_fn: conn=0x7f4bbc013020, cpd=0x7f4bbc016280
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 140
Jun 28 17:51:14 corosync [TOTEM ] totemsrp.c:messages_free:2279 releasing messages up to and including 141
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1064 got lib_cpg_join request on conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0 name ./stress_finalize_groupName-0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name /0 conn=0x7f4bbc013020 cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1068 lib_cpg_join: dump cpg_pd_list name ./stress_finalize_groupName-1/30 conn=0x7f4bbc013020 cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1087 lib_cpg_join: dump process_info_list name ./stress_finalize_groupName-0 conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1092 lib_cpg_join: err_try_again conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0
Jun 28 17:51:14 corosync [CPG   ] cpg.c:message_handler_req_lib_cpg_join:1064 got lib_cpg_join request on conn=0x7f4bbc013020, cpd=0x7f4bbc016280 cpd state 0 name ./stress_finalize_groupName-0
Comment 4 Bug Zapper 2010-07-30 08:13:38 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle.
Changing version to '14'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 5 Jan Friesse 2011-07-28 11:09:14 EDT
I've tried to reproduce this issue and was not very successful. Between 1.2.x and 1.4.x there were a lot of fixes, and I've also fixed totem_mcast error handling in cpg in Bug 664969. It may be related (and I'm pretty sure it is), so is there anybody who can confirm:
- whether the bug is still present
- whether the patches from bz#664969 fix the problem?
Comment 6 Steven Dake 2011-07-28 14:17:02 EDT
It was easily reproducible before.  It may be fixed.  Recommend closing bug.  If someone comes across this problem again, it can be reopened.
