Bug 1289209
Summary: Corosync process dies when mate cluster node interface turned down and back up

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 6 |
| Component | corosync |
| Version | 6.7 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | unspecified |
| Reporter | kelvin |
| Assignee | Jan Friesse <jfriesse> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | ccaulfie, cluster-maint, david.w.h.chin, jruemker, mkelly |
| Target Milestone | rc |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2015-12-07 20:39:51 UTC |
Hello,

We've been looking into your issue through the support case you engaged us in, and that has led us to some conclusions about what happened here, some of which is expected, and some of which may be part of another bug report we have open. I'll give a quick summary here, and then we're going to follow up in your case to further detail the analysis and present evidence from your logs and system to demonstrate what happened.

When you take down the network in a two-node cluster, both nodes see the other as having left and initiate fencing, as they both still have quorum (one of the perks/downsides of two-node clusters). Both fail to fence since the network is down, so they sit there blocked until something changes. When you bring the network back up, they discover each other again and form a new membership together, but one of the rules implemented by the cluster stack is that a node cannot leave and then rejoin without having fully restarted; there is no way to merge their histories together, so the node that has just rejoined will be killed by the other. See the following for reference:

One or more nodes is killed in a RHEL 6 cluster after nodes report "fenced[xxxx]: telling cman to remove nodeid 2 from cluster"
https://access.redhat.com/solutions/308583

Earlier releases of RHEL 6 had no mechanism to prevent both nodes from killing each other simultaneously in this situation, or to make the direction of the killing deterministic, so a change was made to have the lower node ID be the preferred winner, as described in:

After a network split and recovery in a RHEL 5 or 6 cluster using fence_scsi, one node is killed and the remaining node cannot access storage devices
https://access.redhat.com/solutions/308633

However, this doesn't always align with what we want to happen, such as in your case.
When both nodes are racing to fence each other, one node will usually initiate its fence action slightly ahead of the other, meaning it will eventually be the one to power off the other. However, that node may not be the same one to issue a kill message to the other, meaning the node that gets fenced could send off that kill to the remaining node right before it is powered off. The end result is that the one node left standing may be left without its cluster processes running, because it was killed. This is explained here:

Cluster is blocked and unmanageable after one node reports "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" in RHEL 5 or 6
https://access.redhat.com/solutions/751503

In summary, the killing of one node when you perform this network split is entirely expected, as that is how the cluster copes with the sudden rejoining of a node that hasn't restarted. However, it's also expected that the node that was killed is the same node that gets fenced, and not the other way around. This latter condition is problematic, and is being addressed in Bug #1077890. As such, I'm closing this as a duplicate of that bug.

If you have further questions, please don't hesitate to let us know through your support case. I'll continue to be engaged in that case along with your case owner, and we'll make sure you've got what you need to prevent this from recurring.

Thanks,
John Ruemker, RHCA
Principal Software Maintenance Engineer
Red Hat Global Support Services

*** This bug has been marked as a duplicate of bug 1077890 ***
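For context, the two-node behavior described above (both nodes retaining quorum after a split) comes from the `two_node` quorum setting in cluster.conf. Below is a minimal sketch of the relevant fragment, with hypothetical node names; the reporter's actual configuration is in the attachment:

```xml
<?xml version="1.0"?>
<cluster config_version="1" name="example">
  <!-- two_node="1" lets either node keep quorum on its own, which is
       what allows both halves of a split to stay active and race to
       fence each other -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node-a.example.com" nodeid="1"/>
    <clusternode name="node-b.example.com" nodeid="2"/>
  </clusternodes>
</cluster>
```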
Created attachment 1103287 [details]
/etc/cluster/cluster.conf (with hostnames changed)

Description of problem:
In a 2-node cluster where the service is a VM, the corosync daemon on one node dies (terminates) if an ifdown/ifup is performed on the peer node.

Version-Release number of selected component (if applicable):
RHEL 6.7. Both hosts have the same package versions installed:

# rpm -q cman ccs omping rgmanager resource-agents fence-virt fence-virtd fence-virtd-libvirt fence-virtd-multicast fence-virtd-serial corosync kernel
cman-3.0.12.1-73.el6.1.x86_64
ccs-0.16.2-81.el6.x86_64
omping-0.0.4-1.el6.x86_64
rgmanager-3.0.12.1-22.el6.x86_64
resource-agents-3.9.5-24.el6_7.1.x86_64
fence-virt-0.2.3-19.el6.x86_64
fence-virtd-0.2.3-19.el6.x86_64
fence-virtd-libvirt-0.2.3-19.el6.x86_64
fence-virtd-multicast-0.2.3-19.el6.x86_64
fence-virtd-serial-0.2.3-19.el6.x86_64
corosync-1.4.7-2.el6.x86_64
kernel-2.6.32-573.el6.x86_64
kernel-2.6.32-573.3.1.el6.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Set up the attached cluster.conf on 2 hosts.
2. Configure /etc/fence_virtd.conf.
3. Deploy a VM named jobhistory on both host A and host B, using bridge br0 associated with eth0.
4. At this point, clustat shows status correctly and clusvcadm -r jobhistory can be used to migrate the VM between hosts.
5. On the server where the VM is running (call this host A), execute "ifdown eth0; sleep 300; ifup eth0".

Actual results:
The corosync daemon on host B dies after host A's eth0 comes back up, and will not stay up. Host B does not process a 'reboot' or a 'shutdown -r' command in a timely fashion, and seems to require a physical reset in order to successfully rejoin the cluster.

Expected results:
The corosync daemons on hosts A and B re-establish communication and decide where the VM service should be running once host A's eth0 interface returns to service.

Additional info:
Environment: 2x HP DL380 Gen9. cluster.conf file attached.
Host B's corosync daemon was up at about 11:15:10.
Host A's interface came back up at about 11:15:40.
Host B's corosync daemon died before 11:15:56.

[root@host-b ~]# date; for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s " $SERVICE; service $SERVICE status; done
Mon Dec 7 11:15:10 EST 2015
ricci        ricci (pid 5530) is running...
fence_virtd  fence_virtd (pid 4880) is running...
cman         cluster is running.
rgmanager    rgmanager (pid 5559) is running...

[root@host-b ~]# date; for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s " $SERVICE; service $SERVICE status; done
Mon Dec 7 11:15:56 EST 2015
ricci        ricci (pid 5530) is running...
fence_virtd  fence_virtd (pid 4880) is running...
cman         Found stale pid file
rgmanager    rgmanager (pid 5559) is running...

Below are clips from cluster logs for the time period in question. Logs have been edited to substitute the hostnames host-a and host-b, and to replace the FQDN.

===========================
Log files from host A
===========================

[root@host-a ~]# for FILE in /var/log/cluster/*log; do echo -e "\n\n$FILE"; grep 'Dec .7 11' $FILE; done

/var/log/cluster/corosync.log
Dec 07 11:00:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:00:58 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:10:05 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:10:07 corosync [QUORUM] Members[1]: 1
Dec 07 11:10:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:10:07 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:2 left:1)
Dec 07 11:10:07 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:15:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:15:36 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:15:48 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:15:50 corosync [QUORUM] Members[1]: 1
Dec 07 11:15:50 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:15:50 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:2 left:1)
Dec 07 11:15:50 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:26:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:26:07 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:26:07 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:26:07 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:26:07 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:26:21 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:26:23 corosync [QUORUM] Members[1]: 1
Dec 07 11:26:23 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:26:23 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:2 left:1)
Dec 07 11:26:23 corosync [MAIN ] Completed service synchronization, ready to provide service.
/var/log/cluster/dlm_controld.log

/var/log/cluster/fenced.log
Dec 07 11:10:07 fenced fencing node host-b.fqdn
Dec 07 11:10:37 fenced fence host-b.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:10:37 fenced fence host-b.fqdn failed
Dec 07 11:10:40 fenced fencing node host-b.fqdn
Dec 07 11:11:10 fenced fence host-b.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:10 fenced fence host-b.fqdn failed
Dec 07 11:11:13 fenced fencing node host-b.fqdn
Dec 07 11:11:43 fenced fence host-b.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:43 fenced fence host-b.fqdn failed
Dec 07 11:15:38 fenced telling cman to remove nodeid 2 from cluster
Dec 07 11:15:38 fenced fence host-b.fqdn success

/var/log/cluster/gfs_controld.log

/var/log/cluster/rgmanager.log
Dec 07 11:00:59 rgmanager Restarting changed resources.
Dec 07 11:00:59 rgmanager Starting changed resources.
Dec 07 11:04:07 rgmanager State change: host-b.fqdn UP
Dec 07 11:07:38 rgmanager Stopping service vm:jobhistory
Dec 07 11:07:39 rgmanager [vm] Virtual machine jobhistory is running
Dec 07 11:07:49 rgmanager Service vm:jobhistory is stopped
Dec 07 11:07:50 rgmanager Service vm:jobhistory is now running on member 2
Dec 07 11:08:32 rgmanager Starting stopped service vm:jobhistory
Dec 07 11:08:32 rgmanager [vm] virsh -c qemu:///system start jobhistory
Dec 07 11:08:33 rgmanager Service vm:jobhistory started
Dec 07 11:10:07 rgmanager State change: host-b.fqdn DOWN
Dec 07 11:15:36 rgmanager State change: host-b.fqdn UP
Dec 07 11:15:50 rgmanager State change: host-b.fqdn DOWN

===========================
Log files from host B
===========================

[root@host-b ~]# for FILE in /var/log/cluster/*log; do echo -e "\n\n$FILE"; grep 'Dec .7 11' $FILE; done

/var/log/cluster/corosync.log
Dec 07 11:00:22 corosync [MAIN ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Dec 07 11:00:22 corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
Dec 07 11:00:22 corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
Dec 07 11:00:22 corosync [MAIN ] Successfully parsed cman config
Dec 07 11:00:22 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Dec 07 11:00:22 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 07 11:00:22 corosync [TOTEM ] The network interface [135.121.112.79] is now up.
Dec 07 11:00:22 corosync [QUORUM] Using quorum provider quorum_cman
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Dec 07 11:00:22 corosync [CMAN ] CMAN 3.0.12.1 (built Jul 6 2015 05:30:35) started
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync CMAN membership service 2.90
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: openais checkpoint service B.01.01
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync extended virtual synchrony service
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync configuration service
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync cluster config database access v1.01
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync profile loading service
Dec 07 11:00:22 corosync [QUORUM] Using quorum provider quorum_cman
Dec 07 11:00:22 corosync [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Dec 07 11:00:22 corosync [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Dec 07 11:00:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:00:22 corosync [CMAN ] quorum regained, resuming activity
Dec 07 11:00:22 corosync [QUORUM] This node is within the primary component and will provide service.
Dec 07 11:00:22 corosync [QUORUM] Members[1]: 2
Dec 07 11:00:22 corosync [QUORUM] Members[1]: 2
Dec 07 11:00:22 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.79) ; members(old:0 left:0)
Dec 07 11:00:22 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:00:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:00:58 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:10:05 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:10:07 corosync [QUORUM] Members[1]: 2
Dec 07 11:10:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:10:07 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.79) ; members(old:2 left:1)
Dec 07 11:10:07 corosync [MAIN ] Completed service synchronization, ready to provide service.
Dec 07 11:15:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [CPG ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:15:36 corosync [MAIN ] Completed service synchronization, ready to provide service.
/var/log/cluster/dlm_controld.log
Dec 07 11:00:26 dlm_controld dlm_controld 3.0.12.1 started
Dec 07 11:15:38 dlm_controld cluster is down, exiting
Dec 07 11:15:38 dlm_controld daemon cpg_dispatch error 2

/var/log/cluster/fenced.log
Dec 07 11:00:26 fenced fenced 3.0.12.1 started
Dec 07 11:10:07 fenced fencing node host-a.fqdn
Dec 07 11:10:37 fenced fence host-a.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:10:37 fenced fence host-a.fqdn failed
Dec 07 11:10:40 fenced fencing node host-a.fqdn
Dec 07 11:11:10 fenced fence host-a.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:10 fenced fence host-a.fqdn failed
Dec 07 11:11:13 fenced fencing node host-a.fqdn
Dec 07 11:11:43 fenced fence host-a.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:43 fenced fence host-a.fqdn failed
Dec 07 11:15:41 fenced fence host-a.fqdn success
Dec 07 11:15:43 fenced cpg_mcast_joined error 2 handle 3006c83e00000001 victim_done
Dec 07 11:15:45 fenced cpg_mcast_joined error 2 handle 3006c83e00000001 complete
Dec 07 11:15:45 fenced cpg_dispatch error 2
Dec 07 11:15:45 fenced cman_get_cluster error -1 112
Dec 07 11:15:45 fenced cman_get_cluster error -1 112
Dec 07 11:15:45 fenced cluster is down, exiting
Dec 07 11:15:45 fenced daemon cpg_dispatch error 2
Dec 07 11:15:45 fenced cpg_dispatch error 2

/var/log/cluster/gfs_controld.log
Dec 07 11:00:27 gfs_controld gfs_controld 3.0.12.1 started
Dec 07 11:15:38 gfs_controld cluster is down, exiting
Dec 07 11:15:38 gfs_controld daemon cpg_dispatch error 2

/var/log/cluster/rgmanager.log
Dec 07 11:03:49 rgmanager I am node #2
Dec 07 11:03:49 rgmanager Resource Group Manager Starting
Dec 07 11:03:49 rgmanager Loading Service Data
Dec 07 11:03:52 rgmanager Initializing Services
Dec 07 11:04:07 rgmanager [vm] Virtual machine jobhistory is shut off
Dec 07 11:04:07 rgmanager Services Initialized
Dec 07 11:04:07 rgmanager State change: Local UP
Dec 07 11:04:07 rgmanager State change: host-a.fqdn UP
Dec 07 11:04:08 rgmanager Migration: vm:jobhistory is running on 1
Dec 07 11:07:49 rgmanager Starting stopped service vm:jobhistory
Dec 07 11:07:49 rgmanager [vm] virsh -c qemu:///system start jobhistory
Dec 07 11:07:50 rgmanager Service vm:jobhistory started
Dec 07 11:08:21 rgmanager Stopping service vm:jobhistory
Dec 07 11:08:22 rgmanager [vm] Virtual machine jobhistory is running
Dec 07 11:08:32 rgmanager Service vm:jobhistory is stopped
Dec 07 11:08:33 rgmanager Service vm:jobhistory is now running on member 1
Dec 07 11:10:07 rgmanager State change: host-a.fqdn DOWN
Dec 07 11:15:36 rgmanager State change: host-a.fqdn UP
Dec 07 11:15:38 rgmanager #67: Shutting down uncleanly
Dec 07 11:15:38 rgmanager [vm] Virtual machine jobhistory is shut off
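The sequence in the logs above (host A tells cman to remove nodeid 2 at 11:15:38; host B's cluster daemons exit between 11:15:38 and 11:15:45) can be pulled out of a fenced.log excerpt mechanically. A minimal sketch, using two lines reproduced verbatim from this report:

```shell
#!/bin/sh
# Scan a fenced.log excerpt for the two events that matter in this bug:
# the "telling cman to remove nodeid" kill message and the fence result.
log='Dec 07 11:15:38 fenced telling cman to remove nodeid 2 from cluster
Dec 07 11:15:38 fenced fence host-b.fqdn success'

# Count the kill messages in the excerpt
printf '%s\n' "$log" | grep -c 'telling cman to remove nodeid'

# Print the name of the node that was successfully fenced (field 6)
printf '%s\n' "$log" | awk '/fence .* success/ {print $6}'
```

Run against the full logs on both hosts, the same two greps show the asymmetry discussed in comment 1: host A both killed and fenced host-b, while host B's own fence success arrived only after its daemons had already been told to exit.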