Bug 1289209

Summary: Corosync process dies when mate cluster node interface turned down and back up
Product: Red Hat Enterprise Linux 6
Component: corosync
Version: 6.7
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: kelvin
Assignee: Jan Friesse <jfriesse>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cluster-maint, david.w.h.chin, jruemker, mkelly
Target Milestone: rc
Target Release: ---
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2015-12-07 20:39:51 UTC

Attachments: /etc/cluster/cluster.conf (with hostnames changed)

Description kelvin 2015-12-07 16:40:57 UTC
Created attachment 1103287 [details]
/etc/cluster/cluster.conf (with hostnames changed)

Description of problem:
In a 2-node cluster where the service is a VM, the corosync daemon on one node dies (terminates) if an ifdown/ifup is performed on the peer node.


Version-Release number of selected component (if applicable):
RHEL 6.7

Both hosts have the same package versions installed:
# rpm -q cman ccs omping rgmanager resource-agents fence-virt fence-virtd fence-virtd-libvirt fence-virtd-multicast fence-virtd-serial corosync kernel
cman-3.0.12.1-73.el6.1.x86_64
ccs-0.16.2-81.el6.x86_64
omping-0.0.4-1.el6.x86_64
rgmanager-3.0.12.1-22.el6.x86_64
resource-agents-3.9.5-24.el6_7.1.x86_64
fence-virt-0.2.3-19.el6.x86_64
fence-virtd-0.2.3-19.el6.x86_64
fence-virtd-libvirt-0.2.3-19.el6.x86_64
fence-virtd-multicast-0.2.3-19.el6.x86_64
fence-virtd-serial-0.2.3-19.el6.x86_64
corosync-1.4.7-2.el6.x86_64
kernel-2.6.32-573.el6.x86_64
kernel-2.6.32-573.3.1.el6.x86_64


How reproducible:
always

Steps to Reproduce:
1. Set up attached cluster.conf on 2 hosts
2. Configure /etc/fence_virtd.conf
3. Deploy a VM named jobhistory on both host A and host B, using bridge br0, which is associated with eth0
4. At this point, clustat shows status correctly and clusvcadm -r jobhistory can be used to migrate the VM between hosts.
5. On the server where the VM is running (call this host A), execute "ifdown eth0; sleep 300; ifup eth0" (see the sketch below).
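
For reference, the disruption in step 5 together with a status check on the peer can be run roughly as follows. This is only a sketch of the procedure described above, using the same hostnames, interface name, and timing:

# On host A (the node currently running the VM): take the cluster interface down for 5 minutes, then bring it back
ifdown eth0; sleep 300; ifup eth0

# In parallel on host B: watch whether corosync/cman survive host A's recovery
while true; do date; pidof corosync || echo "corosync not running"; service cman status; sleep 10; done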

Actual results:
corosync daemon on host B dies after host A eth0 comes back up.
corosync daemon on host B will not stay up.
host B does not process a 'reboot' or a 'shutdown -r' command in a timely fashion
host B seems to require a physical reset in order to successfully rejoin the cluster


Expected results:
Corosync daemons on hosts A and B re-establish communication and decide where the VM service should run once host A's eth0 interface returns to service.

Additional info:
Environment: 2x HP DL380 Gen9

cluster.conf file attached.

Host B's corosync daemon was up at about 11:15:10
Host A's interface came back up at about 11:15:40
Host B's corosync daemon died before     11:15:56

[root@host-b ~]# date; for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s   " $SERVICE; service $SERVICE status; done
Mon Dec  7 11:15:10 EST 2015
ricci          ricci (pid  5530) is running...
fence_virtd    fence_virtd (pid  4880) is running...
cman           cluster is running.
rgmanager      rgmanager (pid  5559) is running...

[root@host-b ~]# date; for SERVICE in ricci fence_virtd cman rgmanager; do printf "%-12s   " $SERVICE; service $SERVICE status; done
Mon Dec  7 11:15:56 EST 2015
ricci          ricci (pid  5530) is running...
fence_virtd    fence_virtd (pid  4880) is running...
cman           Found stale pid file
rgmanager      rgmanager (pid  5559) is running...



Below are clips from cluster logs for the time period in question.
Logs have been edited to replace the real hostnames with host-a and host-b, and to mask the FQDN.
===========================
===========================
Log files from host A
[root@host-a ~]# for FILE in  /var/log/cluster/*log; do echo -e "\n\n$FILE"; grep 'Dec .7 11' $FILE; done


/var/log/cluster/corosync.log
Dec 07 11:00:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:00:58 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:10:05 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:10:07 corosync [QUORUM] Members[1]: 1
Dec 07 11:10:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:10:07 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:2 left:1)
Dec 07 11:10:07 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:15:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:15:36 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:15:48 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:15:50 corosync [QUORUM] Members[1]: 1
Dec 07 11:15:50 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:15:50 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:2 left:1)
Dec 07 11:15:50 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:26:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:26:07 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:26:07 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:26:07 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:26:07 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:26:21 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:26:23 corosync [QUORUM] Members[1]: 1
Dec 07 11:26:23 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:26:23 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:2 left:1)
Dec 07 11:26:23 corosync [MAIN  ] Completed service synchronization, ready to provide service.


/var/log/cluster/dlm_controld.log


/var/log/cluster/fenced.log
Dec 07 11:10:07 fenced fencing node host-b.fqdn
Dec 07 11:10:37 fenced fence host-b.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:10:37 fenced fence host-b.fqdn failed
Dec 07 11:10:40 fenced fencing node host-b.fqdn
Dec 07 11:11:10 fenced fence host-b.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:10 fenced fence host-b.fqdn failed
Dec 07 11:11:13 fenced fencing node host-b.fqdn
Dec 07 11:11:43 fenced fence host-b.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:43 fenced fence host-b.fqdn failed
Dec 07 11:15:38 fenced telling cman to remove nodeid 2 from cluster
Dec 07 11:15:38 fenced fence host-b.fqdn success


/var/log/cluster/gfs_controld.log


/var/log/cluster/rgmanager.log
Dec 07 11:00:59 rgmanager Restarting changed resources.
Dec 07 11:00:59 rgmanager Starting changed resources.
Dec 07 11:04:07 rgmanager State change: host-b.fqdn UP
Dec 07 11:07:38 rgmanager Stopping service vm:jobhistory
Dec 07 11:07:39 rgmanager [vm] Virtual machine jobhistory is running
Dec 07 11:07:49 rgmanager Service vm:jobhistory is stopped
Dec 07 11:07:50 rgmanager Service vm:jobhistory is now running on member 2
Dec 07 11:08:32 rgmanager Starting stopped service vm:jobhistory
Dec 07 11:08:32 rgmanager [vm] virsh -c qemu:///system start jobhistory
Dec 07 11:08:33 rgmanager Service vm:jobhistory started
Dec 07 11:10:07 rgmanager State change: host-b.fqdn DOWN
Dec 07 11:15:36 rgmanager State change: host-b.fqdn UP
Dec 07 11:15:50 rgmanager State change: host-b.fqdn DOWN


===========================
===========================
Log files from host B
[root@host-b ~]# for FILE in  /var/log/cluster/*log; do echo -e "\n\n$FILE"; grep 'Dec .7 11' $FILE; done


/var/log/cluster/corosync.log
Dec 07 11:00:22 corosync [MAIN  ] Corosync Cluster Engine ('1.4.7'): started and ready to provide service.
Dec 07 11:00:22 corosync [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Dec 07 11:00:22 corosync [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Dec 07 11:00:22 corosync [MAIN  ] Successfully parsed cman config
Dec 07 11:00:22 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
Dec 07 11:00:22 corosync [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Dec 07 11:00:22 corosync [TOTEM ] The network interface [135.121.112.79] is now up.
Dec 07 11:00:22 corosync [QUORUM] Using quorum provider quorum_cman
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Dec 07 11:00:22 corosync [CMAN  ] CMAN 3.0.12.1 (built Jul  6 2015 05:30:35) started
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync configuration service
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync profile loading service
Dec 07 11:00:22 corosync [QUORUM] Using quorum provider quorum_cman
Dec 07 11:00:22 corosync [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Dec 07 11:00:22 corosync [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Dec 07 11:00:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:00:22 corosync [CMAN  ] quorum regained, resuming activity
Dec 07 11:00:22 corosync [QUORUM] This node is within the primary component and will provide service.
Dec 07 11:00:22 corosync [QUORUM] Members[1]: 2
Dec 07 11:00:22 corosync [QUORUM] Members[1]: 2
Dec 07 11:00:22 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.79) ; members(old:0 left:0)
Dec 07 11:00:22 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:00:58 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:00:58 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:00:58 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:10:05 corosync [TOTEM ] A processor failed, forming new configuration.
Dec 07 11:10:07 corosync [QUORUM] Members[1]: 2
Dec 07 11:10:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:10:07 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.79) ; members(old:2 left:1)
Dec 07 11:10:07 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Dec 07 11:15:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [QUORUM] Members[2]: 1 2
Dec 07 11:15:36 corosync [CPG   ] chosen downlist: sender r(0) ip(135.121.112.78) ; members(old:1 left:0)
Dec 07 11:15:36 corosync [MAIN  ] Completed service synchronization, ready to provide service.


/var/log/cluster/dlm_controld.log
Dec 07 11:00:26 dlm_controld dlm_controld 3.0.12.1 started
Dec 07 11:15:38 dlm_controld cluster is down, exiting
Dec 07 11:15:38 dlm_controld daemon cpg_dispatch error 2


/var/log/cluster/fenced.log
Dec 07 11:00:26 fenced fenced 3.0.12.1 started
Dec 07 11:10:07 fenced fencing node host-a.fqdn
Dec 07 11:10:37 fenced fence host-a.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:10:37 fenced fence host-a.fqdn failed
Dec 07 11:10:40 fenced fencing node host-a.fqdn
Dec 07 11:11:10 fenced fence host-a.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:10 fenced fence host-a.fqdn failed
Dec 07 11:11:13 fenced fencing node host-a.fqdn
Dec 07 11:11:43 fenced fence host-a.fqdn dev 0.0 agent fence_xvm result: error from agent
Dec 07 11:11:43 fenced fence host-a.fqdn failed
Dec 07 11:15:41 fenced fence host-a.fqdn success
Dec 07 11:15:43 fenced cpg_mcast_joined error 2 handle 3006c83e00000001 victim_done
Dec 07 11:15:45 fenced cpg_mcast_joined error 2 handle 3006c83e00000001 complete
Dec 07 11:15:45 fenced cpg_dispatch error 2
Dec 07 11:15:45 fenced cman_get_cluster error -1 112
Dec 07 11:15:45 fenced cman_get_cluster error -1 112
Dec 07 11:15:45 fenced cluster is down, exiting
Dec 07 11:15:45 fenced daemon cpg_dispatch error 2
Dec 07 11:15:45 fenced cpg_dispatch error 2


/var/log/cluster/gfs_controld.log
Dec 07 11:00:27 gfs_controld gfs_controld 3.0.12.1 started
Dec 07 11:15:38 gfs_controld cluster is down, exiting
Dec 07 11:15:38 gfs_controld daemon cpg_dispatch error 2


/var/log/cluster/rgmanager.log
Dec 07 11:03:49 rgmanager I am node #2
Dec 07 11:03:49 rgmanager Resource Group Manager Starting
Dec 07 11:03:49 rgmanager Loading Service Data
Dec 07 11:03:52 rgmanager Initializing Services
Dec 07 11:04:07 rgmanager [vm] Virtual machine jobhistory is shut off
Dec 07 11:04:07 rgmanager Services Initialized
Dec 07 11:04:07 rgmanager State change: Local UP
Dec 07 11:04:07 rgmanager State change: host-a.fqdn UP
Dec 07 11:04:08 rgmanager Migration: vm:jobhistory is running on 1
Dec 07 11:07:49 rgmanager Starting stopped service vm:jobhistory
Dec 07 11:07:49 rgmanager [vm] virsh -c qemu:///system start jobhistory
Dec 07 11:07:50 rgmanager Service vm:jobhistory started
Dec 07 11:08:21 rgmanager Stopping service vm:jobhistory
Dec 07 11:08:22 rgmanager [vm] Virtual machine jobhistory is running
Dec 07 11:08:32 rgmanager Service vm:jobhistory is stopped
Dec 07 11:08:33 rgmanager Service vm:jobhistory is now running on member 1
Dec 07 11:10:07 rgmanager State change: host-a.fqdn DOWN
Dec 07 11:15:36 rgmanager State change: host-a.fqdn UP
Dec 07 11:15:38 rgmanager #67: Shutting down uncleanly
Dec 07 11:15:38 rgmanager [vm] Virtual machine jobhistory is shut off

Comment 2 John Ruemker 2015-12-07 20:39:51 UTC
Hello,
We've been looking into your issue through the support case you opened with us, and that has led us to some conclusions about what happened here: some of it is expected behavior, and some of it may be part of another bug report we have open.  I'll give a quick summary here, and then we'll follow up in your case to detail the analysis further and present evidence from your logs and system to demonstrate what happened.

When you take down the network in a two-node cluster, both nodes see the other as having left and initiate fencing, as they both still have quorum (one of the perks/downsides of two-node clusters).  Both fail to fence since the network is down, so they just sit there blocked until something changes.  When you bring the network back up, they both discover each other again and form a new membership together, but one of the rules implemented by the cluster stack is that a node cannot leave and then rejoin without having fully restarted; there is no way to merge their histories together, so the node that has just rejoined will be killed by the other.  See the following for reference:

  One or more nodes is killed in a RHEL 6 cluster after nodes report "fenced[xxxx]: telling cman to remove nodeid 2 from cluster"
  https://access.redhat.com/solutions/308583
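
For illustration only (this is not taken from your attached cluster.conf): the two-node behavior described above comes from cman's two_node mode, which lets each node retain quorum on its own, along the lines of:

  <cman two_node="1" expected_votes="1"/>

With that set, neither node loses quorum when its peer disappears, so both proceed straight to fencing.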

Earlier releases of RHEL 6 had no mechanism to prevent both nodes from killing each other simultaneously in this situation, or to make the direction of the killing deterministic, so a change was made to make the node with the lower node ID the preferred winner, as described in:

  After a network split and recovery in a RHEL 5 or 6 cluster using fence_scsi, one node is killed and the remaining node cannot access storage devices
  https://access.redhat.com/solutions/308633

However, this doesn't always align with what we want to happen, as in your case.  When both nodes race to fence each other, one node will usually initiate its fence action slightly ahead of the other, meaning it will eventually be the one to power the other off.  However, that node may not be the same one that issues the kill message: the node that ends up getting fenced can send its kill to the surviving node right before it is powered off.  The end result is that the one node left standing may be left without its cluster processes running, because it was killed.  This is explained here:

  Cluster is blocked and unmanageable after one node reports "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" in RHEL 5 or 6
  https://access.redhat.com/solutions/751503

In summary, the killing of one node when you perform this network split is entirely expected, as that is how the cluster copes with the sudden rejoining of a node that hasn't restarted.  However, it's also expected that the node that was killed is the same node that gets fenced, not the other way around.  This latter condition is the problematic part, and it is being addressed in Bug #1077890.  As such, I'm closing this as a duplicate of that bug.

If you have further questions, please don't hesitate to let us know through your support case.  I'll continue to be engaged in that case along with your case owner, and we'll make sure you've got what you need to prevent this from occurring again.

Thanks,
John Ruemker, RHCA
Principal Software Maintenance Engineer
Red Hat Global Support Services

*** This bug has been marked as a duplicate of bug 1077890 ***