Bug 605313
| Summary: | isolating a node and then removing the isolation at 30 seconds causes a protocol meltdown 15% of time |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | corosync |
| Version: | 6.0 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | urgent |
| Priority: | high |
| Reporter: | David Teigland <teigland> |
| Assignee: | Steven Dake <sdake> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | asalkeld, cluster-maint, nstraz, sdake, syeghiay |
| Target Milestone: | rc |
| Target Release: | --- |
| Fixed In Version: | corosync-1.2.3-11.el6 |
| Doc Type: | Bug Fix |
| Doc Text: | Isolating a node and then removing the isolation potentially resulted in membership misbehavior. |
| Clones: | 610334 611129 (view as bug list) |
| Last Closed: | 2010-11-10 22:07:11 UTC |
Description
David Teigland
2010-06-17 15:58:53 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

Angus, is this fixed upstream by your recent cpg work? Nice test name :)
I suspect it is but it is not super clear to me what is happening above.
We might still be sensitive to removing the lowest nodeid on flatiron.
David can you confirm my understanding of the output/problem below?
1) what is "fenced:daemon" and "fenced:default"?
2) what message is coming from what node?
<all joined, 4 leaving> - {all ok}
1276717625 fenced:daemon conf 4 0 1 memb 5 6 7 8 join left 4
1276717625 fenced:default conf 4 0 1 memb 5 6 7 8 join left 4
1276717625 fenced:daemon ring 5:3656 4 memb 5 6 7 8
1276717625 fenced:default ring 5:3656 4 memb 5 6 7 8
<5 leaving> - {all ok}
1276717626 fenced:daemon conf 3 0 1 memb 6 7 8 join left 5
1276717626 fenced:default conf 3 0 1 memb 6 7 8 join left 5
Problem:
node 4 thinks there are still 5 members? {should be 1 - node 4 is isolated}
node 5 thinks there are still 4 members? {should be 1 - node 5 is isolated}
1276717626 fenced:daemon ring 4:3664 5 memb 4 5 6 7 8
1276717626 fenced:default ring 4:3664 5 memb 4 5 6 7 8
1276718259 fenced:daemon ring 5:3668 4 memb 5 6 7 8
m=4,5,6,7,8
-4
-5
Angus
fenced:daemon and fenced:default are the names of two cpgs used by fenced. `fenced:foo conf ...` is the confchg callback for cpg fenced:foo; `fenced:foo ring ...` is the ringid callback for cpg fenced:foo. There are no messages in the logs I pasted. The line formats are:

conf <memb_count> <join_count> <left_count> memb <memb_ids> join <join_ids> left <left_ids>
ring <ringid> <memb_count> memb <memb_ids>

Only node 4 was partitioned/merged. The correct output on nodes 5-8 would be:

1276717591 fenced:default conf 5 1 0 memb 4 5 6 7 8 join 8 left (starting state)
1276717625 fenced:default conf 4 0 1 memb 5 6 7 8 join left 4 (-4, partition)
1276717626 fenced:default conf 5 1 0 memb 4 5 6 7 8 join 4 left (+4, merge)

At this point nodes 5-8 would kill node 4, so they'd see another:

?????????? fenced:default conf 4 0 1 memb 5 6 7 8 join left 4 (-4, killed)

The ringid callbacks appear to be fine, except that node 5 is missing:

1276718259 fenced:default ring 5:3668 4 memb 5 6 7 8

Is there a test case for this problem from QE?

Yes, laryngitis reproduces this easily.

Nate reproduced with corosync-1.2.3-4.el6 using laryngitis. This version has all the fixes for cpg issues we are currently aware of. Angus, can you look into fixing this in the laryngitis environment?

When isolating a node via iptables for more than 30 seconds and then removing that isolation, the totem membership becomes unstable 15% of the time:

token = 10 sec
consensus = 20 sec
boundary = 30 sec (token + consensus)

Based upon Nate's test, I've distilled this down to a faster reproducer using only corosync (code attached). To reproduce, run a 3-node cluster with the attached reproduce.tar.gz config file. On one of the nodes, run the loop script in the reproducer directory; it loops isolating and un-isolating one of the nodes. On the remaining two nodes, apply force-exit.patch, which triggers corosync to exit when the configuration melts down. It takes about 45 seconds to 1 minute to trigger a meltdown.
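For reference, the timeouts above would correspond to a totem stanza along these lines (values are in milliseconds; this is a sketch reconstructed from the numbers given, not necessarily the config shipped in reproduce.tar.gz):

```
totem {
    version: 2
    # token loss timeout: 10 s
    token: 10000
    # consensus timeout: 20 s; token + consensus gives the 30 s boundary
    consensus: 20000
}
```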
Created attachment 429290 [details]
reproducer for corosync
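When sifting these logs it helps to parse the confchg lines mechanically. A hypothetical helper (not part of fenced or the reproducer; the field layout follows the format described above) might look like:

```python
import re

# <ts> <cpg> conf <memb_count> <join_count> <left_count>
#      memb <ids> join <ids> left <ids>
CONF_RE = re.compile(
    r"(?P<ts>\d+) (?P<cpg>\S+) conf (?P<nm>\d+) (?P<nj>\d+) (?P<nl>\d+)"
    r" memb(?P<memb>( \d+)*) join(?P<join>( \d+)*) left(?P<left>( \d+)*)$"
)

def parse_conf(line):
    """Return a confchg log line as a dict, or None if it doesn't match."""
    m = CONF_RE.match(line.strip())
    if m is None:
        return None
    ids = lambda s: [int(x) for x in s.split()]
    rec = {
        "ts": int(m.group("ts")),
        "cpg": m.group("cpg"),
        "memb": ids(m.group("memb")),
        "join": ids(m.group("join")),
        "left": ids(m.group("left")),
    }
    # The counts printed in the line should agree with the id lists.
    assert len(rec["memb"]) == int(m.group("nm"))
    assert len(rec["join"]) == int(m.group("nj"))
    assert len(rec["left"]) == int(m.group("nl"))
    return rec
```

Feeding it the first partition line from the logs above yields memb [5, 6, 7, 8] with left [4], which makes the count/list mismatches in a melted-down run easy to spot.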
Based upon instrumentation of the membership protocol, the Totem specification corosync implements is in error for a particular case of its membership algorithm. Specifically, page 17, figure 4:

"Join message from processor q received:
do some work to process the join message
...
if my_id in message.fail_set then
    add message.sender_id to my_fail_set
else
    merge message.fail_set into my_fail_set"

The purpose of this code is to allow two processors to see each other as faulty and form singleton rings. Unfortunately, the specification does not account for the fact that there may be other members that receive the same message and execute the else branch when in fact the fail set is invalid for the current membership. This doesn't occur all the time because the if branch is sometimes executed first. Since there is no synchrony in the membership algorithm, different timings can expose this problem in the protocol.

Here is an example:

partition 1: my_proc_list = A
partition 2: my_proc_list = B,C

partition 2 is in gather state
partition 1 is in gather state
isolation removed

B sends a join message with proc set=A,B,C, fail set=A.
A receives the join message. my_id is in the fail set, so it adds B to its fail set (i.e. executes the if part of the branch).
C receives the join message (executes the else part of the branch).
C and B agree on consensus and form a configuration.
Sometime before A receives a join message from C (which would put it in consensus and trigger it to start a new membership round), A sends a join message with proc set=A,B,C, fail set=B.
C receives this message (while in operational). C adds B (which it is happily communicating with at the moment) to its failed set (the merge in the else branch).
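The quoted branch from figure 4 can be sketched as follows (illustrative Python, not corosync source; the function and field names are mine):

```python
def on_join(my_id, my_fail_set, message):
    """Figure-4 rule from the quoted Totem spec excerpt.

    If the sender claims *we* failed, we declare the sender failed.
    Otherwise we merge its fail set wholesale -- the step identified
    above as unsafe, because the merged fail set may be stale for the
    current membership."""
    if my_id in message["fail_set"]:
        my_fail_set.add(message["sender_id"])
    else:
        my_fail_set |= message["fail_set"]
    return my_fail_set
```

In the scenario above, C executes the else branch on A's stale join message and blindly adopts fail set={B}, even though C is in a working ring with B at that moment.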
C sends a join message to A,B,C with proc set=A,B,C, fail set=B.
A receives fail set=B, adding B to its fail set.
B receives C's join message (proc=A,B,C, fail=B) and adds C to its fail set (if part of the branch).
A sends a join message (proc=A,B,C, fail=B).
B receives A's join message and adds A to its fail set (if part of the branch).

At this point, every processor ends up with a singleton ring, and more confusion occurs in the totem membership protocol after this mess. The proper solution is to use the last known good membership when making the decision to add a failed processor to the failed list.

Created attachment 429298 [details]
flatiron revision 2985 to fix the problem
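One reading of that fix, sketched in the same illustrative style (this is my interpretation of the idea, not the actual flatiron r2985 patch): a join message arriving from outside our last known good ring should not be able to mark members of that ring as failed, since we are still in a working configuration with them.

```python
def on_join_fixed(my_id, my_fail_set, last_good_ring, message):
    """Figure-4 rule guarded by the last known good membership
    (a sketch of the proposed fix, not the actual patch).

    Fail-set entries from a merging partition that name members of
    our current good ring are treated as stale and dropped."""
    if my_id in message["fail_set"]:
        my_fail_set.add(message["sender_id"])
    elif message["sender_id"] in last_good_ring:
        # A member of our own ring reports a failure: trust it.
        my_fail_set |= message["fail_set"]
    else:
        # Stale view from outside the ring: keep only failures of
        # processors that are not in our current good ring.
        my_fail_set |= (message["fail_set"] - last_good_ring)
    return my_fail_set
```

With this guard, C (in ring {B,C}) ignores A's claim that B failed, A and C then reach consensus normally, and the cascade into three singleton rings never starts.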
I'm trying to understand how the bug described in comments 18-23 relates to what I first reported. In the original description, the totem membership appeared to be ok (see [QUORUM] and ringid messages), but the cpg membership was bad. But maybe the appearance is misleading? The most recent tests don't include cpg and only focus on totem errors. Are there totem errors in the original bug that just aren't visible in the log messages? I'm guessing it was the test in comment 19 that ran for 12 hours? Have we tried the original test to verify that it's fixed as well?

Angus tried laryngitis for 100 iterations and it passes (it previously failed at 5-7 iterations).

Yes dct, after investigating with laryngitis, we found that there was a condition of totem entering a singleton node configuration which broke the rest of the stack. Perhaps cpg should also behave better in this condition.

Ran through laryngitis just fine with corosync-1.2.3-19.el6.

Red Hat Enterprise Linux 6.0 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Isolating a node and then removing the isolation potentially resulted in membership misbehavior.