| Summary: | 2-node cluster without fencing: after a network partition, one node fails to rejoin the cluster when the network resumes | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Dan Frincu <frincu.dan> |
| Component: | pacemaker | Assignee: | Andrew Beekhof <andrew> |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16 | CC: | abeekhof, andrew, asalkeld, fdinitto, jfriesse, jreznik, lhh, sdake |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-26 14:31:07 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
|
Description
Dan Frincu
2011-11-09 16:13:48 UTC
Please run corosync-blackbox on both nodes after the reconnect. This will tell us if corosync reforms membership properly.

Created attachment 532780 [details]
output from corosync blackbox on both nodes

Added attachment with corosync blackbox output.
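A dump like the one attached can be captured on each node with something along these lines (a sketch; the output file name is illustrative):

corosync-blackbox > /tmp/blackbox-$(uname -n)-$(date +%Y%m%d-%H%M%S).txt

Because the flight recorder is a circular buffer, this should be run as soon after the event as possible, before the records of interest are overwritten.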
Dan, it looks like you took the blackbox long after the occurrence. The blackbox is a circular buffer, so eventually the data we are interested in will be written over. Can you take the blackbox right after the event? Thanks -steve

Hi Steven, unfortunately no, the nodes are now being used for a different purpose. Anyway, it can be considered a very rare case, especially since no fencing was used on those test nodes, so I think it can be closed for the time being as I can't provide any more relevant data. Thank you. Dan

Adding Andrew as cc and closing for now then. As we discussed on IRC, the case where fencing fails is problematic for the stack, and we have some work underway to address this in the future via a watchdog (secondary fencing mechanism). Regards -steve

I think this is actually a pacemaker bug. Messages from both nodes are being exchanged, so there is no reason to think corosync itself is misbehaving. I saw something similar the other day, but there were insufficient logs.
Basically the problem boils down to cluster1 thinking that cluster2 has no pacemaker processes running (proc=00000000000000000000000000000000):

Nov 8 11:45:05 cluster1 crmd: [26839]: info: crm_update_peer: Node cluster2: id=1836915884 state=member (new) addr=r(0) ip(172.28.125.109) votes=1 born=60 seen=108 proc=00000000000000000000000000000000
Likewise from the other direction:
Nov 8 08:57:39 cluster2 crmd: [27342]: info: crm_update_peer: Node cluster1: id=1870470316 state=member (new) addr=r(0) ip(172.28.125.111) votes=1 born=84 seen=100 proc=00000000000000000000000000000002
So unless corosync is losing messages, this looks to be a bug in the pacemaker plugin. However, it's hard to debug further since corosync.conf (and thus the plugin) logs to a file not included in the tarball:
logfile: /var/log/corosync/corosync.log
to_syslog: no
Any chance you could reproduce?
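For future reports, a logging stanza along these lines (a sketch extending the two directives quoted above; the extra values are illustrative) would copy the plugin output to syslog as well, so it lands in the hb_report tarball:

logging {
        to_syslog: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        timestamp: on
}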
Created attachment 535847 [details]
corosync logs from both nodes with records from the same timeframe as the hb_report

Logs, as requested.
Not sure how this one got lost. By coincidence, David is working on similar bugs from the upstream bugzilla:

#5040 UNCONFIRMED - andrew - "Not in our membership" after node reboot
#5041 UNCONFIRMED - andrew - "udpu cluster see node what dead"

I think we're getting close to a solution.

Both nodes appear to have changed node ids at runtime... exactly how was the network lost?

Nov 08 07:49:08 corosync [pcmk ] WARN: pcmk_update_nodeid: Detected local node id change: 1836915884 -> 16777343
Nov 08 07:52:44 corosync [pcmk ] WARN: pcmk_update_nodeid: Detected local node id change: 1870470316 -> 16777343

I take back comment #7. If I'm reading the log snippets below correctly, both nodes agree that 172.28.125.111 is down, and there is only a 3-second window during which it is properly connected ("Nov 08 08:56:22" to "Nov 08 08:56:25"). Additionally, someone appears to be playing with the network. These two logs indicate that rather than being stopped, corosync is binding to loopback, then ethX, then loopback again.

Nov 08 07:52:44 corosync [pcmk ] WARN: pcmk_update_nodeid: Detected local node id change: 1870470316 -> 16777343
Nov 08 08:56:25 corosync [pcmk ] WARN: pcmk_update_nodeid: Detected local node id change: 16777343 -> 1870470316

Perhaps it would be advisable to set a nodeid? And to not bring the entire interface up/down with ifconfig (see the sketches after the log excerpts below).

# grep corosync cluster2.corosync.log | grep -v -e config_query_callback -e pcmk
Nov 08 07:49:04 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 08 07:49:04 corosync [TOTEM ] The network interface is down.
Nov 08 07:49:08 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 07:49:08 corosync [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:2 left:2)
Nov 08 07:49:08 corosync [MAIN ] Completed service synchronization, ready to provide service.
Nov 08 08:57:35 corosync [TOTEM ] The network interface [172.28.125.109] is now up.
Nov 08 08:57:39 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:57:39 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:1 left:0)
Nov 08 08:57:39 corosync [MAIN ] Completed service synchronization, ready to provide service.

# grep corosync cluster1.corosync.log | grep -v -e config_query_callback -e pcmk
Nov 08 07:49:04 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 08 07:49:08 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 07:49:08 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:2 left:1)
Nov 08 07:49:08 corosync [MAIN ] Completed service synchronization, ready to provide service.
Nov 08 07:52:41 corosync [TOTEM ] The network interface is down.
Nov 08 07:52:44 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 07:52:44 corosync [CPG ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:1 left:1)
Nov 08 07:52:44 corosync [MAIN ] Completed service synchronization, ready to provide service.
Nov 08 08:56:22 corosync [TOTEM ] The network interface [172.28.125.111] is now up.
Nov 08 08:56:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:56:25 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:1 left:1)
Nov 08 08:56:25 corosync [MAIN ] Completed service synchronization, ready to provide service.
Nov 08 08:57:39 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:57:39 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:1 left:0)
Nov 08 08:57:39 corosync [MAIN ] Completed service synchronization, ready to provide service.
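Two sketches of what that advice might look like in practice; both are illustrative and not taken from this cluster's configuration. First, pinning the node id in the totem section of corosync.conf (corosync 1.x syntax) so it cannot be rederived when the interface rebinds:

totem {
        version: 2
        # illustrative value: pin the id so a rebind to loopback cannot change it
        nodeid: 1
        # (remaining totem options unchanged)
}

Second, to break connectivity for a test without taking the interface down (which is what drives corosync onto loopback), the peer's traffic can be dropped with iptables instead of using ifconfig:

# illustrative: simulate the partition towards the peer at 172.28.125.109
iptables -A INPUT -s 172.28.125.109 -j DROP
iptables -A OUTPUT -d 172.28.125.109 -j DROP
# later, heal the partition by deleting the same rules
iptables -D INPUT -s 172.28.125.109 -j DROP
iptables -D OUTPUT -d 172.28.125.109 -j DROP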
Setting back to NEW as the logs tell a different story from the description. We either need the problem reproduced again (with blackbox data included) or to close.

NetworkManager is not supported with the cluster stack at this time. It takes interfaces out of service during runtime, which wreaks all kinds of havoc (it will show the interface binding to loopback and then binding back to ethX). Are you running NetworkManager? Regards -steve

This message is a reminder that Fedora 16 is nearing its end of life. Approximately four weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.