Bug 752477

Summary: In a 2-node cluster without fencing, one node fails to rejoin the cluster when the network resumes after a partition
Product: Fedora
Reporter: Dan Frincu <frincu.dan>
Component: pacemaker
Assignee: Andrew Beekhof <andrew>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: medium
Version: 16
CC: abeekhof, andrew, asalkeld, fdinitto, jfriesse, jreznik, lhh, sdake
Keywords: Reopened
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2013-02-26 14:31:07 UTC
Attachments:
- hb_report from the time the network failure occurred until the present moment, when the first node is still showing the bad status (flags: none)
- output from the corosync blackbox on both nodes (flags: none)
- corosync logs from both nodes with records from the same timeframe as the hb_report (flags: none)

Description Dan Frincu 2011-11-09 16:13:48 UTC
Created attachment 532604 [details]
hb_report from the time the network failure occurred until the present moment, when the first node is still showing the bad status

Description of problem:
In a 2-node cluster configured without fencing, after the network connection between the nodes is lost and then restored, one of the two nodes sees itself and its peer as offline in crm_mon, while the other node sees both nodes as online.

Version-Release number of selected component (if applicable):
Pacemaker 1.1.5-2~bpo60+1, Corosync 1.4.1-1~bpo60+1, Debian Squeeze on 2.6.32-5-amd64
Pacemaker and Corosync from http://people.debian.org/~madkiss/pacemaker-1.1

How to reproduce:
Link the 2 nodes to a switch via one Ethernet cable each, configure corosync with one ring (mcastaddr 239.255.1.1, mcastport 5405), disable stonith and quorum, set up the Pacemaker resources and let them reach a stable working state, then cut the power to the switch, let things go awry, and power the switch back on.
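
For reference, a minimal sketch of that setup (the bindnetaddr value and the exact crm commands for disabling fencing and quorum are assumptions, not copied from the test systems; only the multicast settings come from the report):

# /etc/corosync/corosync.conf -- single-ring multicast setup, sketch only
totem {
        version: 2
        interface {
                ringnumber: 0
                bindnetaddr: 172.28.125.0    # assumed from the node IPs seen later in the report
                mcastaddr: 239.255.1.1
                mcastport: 5405
        }
}
service {
        name: pacemaker    # pacemaker started as a corosync plugin, as in these Debian packages
        ver: 0
}

# one common way to disable fencing and quorum handling from the crm shell
crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore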

As seen in the attached hb_report, neither node would run resources, because the VIP went into a "can't run anywhere" state due to the INF score. Issued "crm resource cleanup all" on both nodes; only cluster2 cleaned up and started the resources, while cluster1 hung for a while when trying to connect to the other node.

Stopped pacemaker and corosync on cluster1, then started them again. They both came up, but crm_mon -fr on cluster1 shows:

============
Last updated: Wed Nov  9 11:03:04 2011
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
8 Resources configured.
============

OFFLINE: [ cluster1 cluster2 ]

Full list of resources:

 Resource Group: all
     virtual_ip (ocf::heartbeat:IPaddr2):       Stopped
     fs_log_fai (ocf::heartbeat:Filesystem):    Stopped
     fs_mysql   (ocf::heartbeat:Filesystem):    Stopped
     fs_data_fai        (ocf::heartbeat:Filesystem):    Stopped
     fs_data_mirror     (ocf::heartbeat:Filesystem):    Stopped
     fs_puppet  (ocf::heartbeat:Filesystem):    Stopped
     mysqld     (ocf::heartbeat:mysql): Stopped
     apache     (ocf::heartbeat:apache):        Stopped
     nginx      (lsb:nginx):    Stopped
     ld-server  (lsb:ld-server):        Stopped
 Master/Slave Set: ms_drbd_data_fai [drbd_data_fai]
     Stopped: [ drbd_data_fai:0 drbd_data_fai:1 ]
 Master/Slave Set: ms_drbd_data_mirror [drbd_data_mirror]
     Stopped: [ drbd_data_mirror:0 drbd_data_mirror:1 ]
 Master/Slave Set: ms_drbd_log_fai [drbd_log_fai]
     Stopped: [ drbd_log_fai:0 drbd_log_fai:1 ]
 Master/Slave Set: ms_drbd_mysql [drbd_mysql]
     Stopped: [ drbd_mysql:0 drbd_mysql:1 ]
 Master/Slave Set: ms_drbd_puppet [drbd_puppet]
     Stopped: [ drbd_puppet:0 drbd_puppet:1 ]
 Resource Group: asdf
     dummy      (ocf::heartbeat:Dummy): Stopped
     dummy2     (ocf::heartbeat:Dummy): Stopped
     dummy3     (ocf::heartbeat:Dummy): Stopped
 cluster_mon    (ocf::pacemaker:ClusterMon):    Stopped

While cluster2 shows:

============
Last updated: Wed Nov  9 11:03:43 2011
Stack: openais
Current DC: cluster2 - partition with quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
8 Resources configured.
============

Online: [ cluster1 cluster2 ]

Full list of resources:

 Resource Group: all
     virtual_ip (ocf::heartbeat:IPaddr2):       Started cluster2
     fs_log_fai (ocf::heartbeat:Filesystem):    Started cluster2
     fs_mysql   (ocf::heartbeat:Filesystem):    Started cluster2
     fs_data_fai        (ocf::heartbeat:Filesystem):    Started cluster2
     fs_data_mirror     (ocf::heartbeat:Filesystem):    Started cluster2
     fs_puppet  (ocf::heartbeat:Filesystem):    Started cluster2
     mysqld     (ocf::heartbeat:mysql): Started cluster2
     apache     (ocf::heartbeat:apache):        Started cluster2
     nginx      (lsb:nginx):    Started cluster2
     ld-server  (lsb:ld-server):        Started cluster2
 Master/Slave Set: ms_drbd_data_fai [drbd_data_fai]
     Masters: [ cluster2 ]
     Slaves: [ cluster1 ]
 Master/Slave Set: ms_drbd_data_mirror [drbd_data_mirror]
     Masters: [ cluster2 ]
     Slaves: [ cluster1 ]
 Master/Slave Set: ms_drbd_log_fai [drbd_log_fai]
     Masters: [ cluster2 ]
     Slaves: [ cluster1 ]
 Master/Slave Set: ms_drbd_mysql [drbd_mysql]
     Masters: [ cluster2 ]
     Slaves: [ cluster1 ]
 Master/Slave Set: ms_drbd_puppet [drbd_puppet]
     Masters: [ cluster2 ]
     Slaves: [ cluster1 ]
 Resource Group: asdf
     dummy      (ocf::heartbeat:Dummy): Started cluster2
     dummy2     (ocf::heartbeat:Dummy): Started cluster2
     dummy3     (ocf::heartbeat:Dummy): Started cluster2
 cluster_mon    (ocf::pacemaker:ClusterMon):    Stopped

Migration summary:
* Node cluster2:
* Node cluster1:

root@cluster1:# corosync-cfgtool -s
Printing ring status.
Local node ID 1870470316
RING ID 0
        id      = 172.28.125.111
        status  = ring 0 active with no faults
root@cluster1:# crmadmin -D
Designated Controller is: cluster2

root@cluster2:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1836915884
RING ID 0
        id      = 172.28.125.109
        status  = ring 0 active with no faults
root@cluster2:~# crmadmin -D
Designated Controller is: cluster2

Network connectivity works, selinux disabled, no firewall rules set.

I can provide additional information if needed.

Comment 1 Steven Dake 2011-11-09 18:07:29 UTC
Please run corosync-blackbox on both nodes after the reconnect.  This will tell us if corosync reforms membership properly.
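
For example, something along these lines on each node (a sketch; the output path and filename are arbitrary):

# dump the corosync flight-recorder contents to a file, as root, on each node
corosync-blackbox > /tmp/blackbox-$(hostname).txt 2>&1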

Comment 2 Dan Frincu 2011-11-10 11:17:52 UTC
Created attachment 532780 [details]
output from corosync blackbox on both nodes

Added attachment with corosync blackbox output.

Comment 3 Steven Dake 2011-11-18 17:11:54 UTC
Dan,

It looks like you took the blackbox long after the occurrence. The blackbox is a circular buffer, so eventually the data we are interested in gets overwritten.

Can you take the blackbox right after the event?

Thanks
-steve

Comment 4 Dan Frincu 2011-11-18 17:51:47 UTC
Hi Steven,

Unfortunately no, the nodes are now being used for a different purpose. Anyway, it can be considered a very rare case, especially since no fencing was used on those test nodes, so I think it can be closed as I can provide any more relevant data.

Thank you.
Dan

Comment 5 Dan Frincu 2011-11-18 18:01:18 UTC
I meant _can't_ provide any more relevant data, it can be closed for the time being.

Comment 6 Steven Dake 2011-11-18 18:16:40 UTC
Adding Andrew as cc and closing for now then.

As we discussed on IRC, the case where fencing fails is problematic for the stack, and we have some work underway to address this in the future via a watchdog (secondary fencing mechanism).

Regards
-steve

Comment 7 Andrew Beekhof 2011-11-24 01:30:59 UTC
I think this is actually a pacemaker bug.
Messages from both nodes are being exchanged, so there is no reason to think corosync itself is misbehaving.

Comment 8 Andrew Beekhof 2011-11-24 02:27:27 UTC
I saw something similar the other day but there were insufficient logs.
Basically the problem boils down to cluster1 thinking that cluster2 has no pacemaker processes running:

    proc=00000000000000000000000000000000

Nov  8 11:45:05 cluster1 crmd: [26839]: info: crm_update_peer: Node cluster2: id=1836915884 state=member (new) addr=r(0) ip(172.28.125.109)  votes=1 born=60 seen=108 proc=00000000000000000000000000000000

Likewise from the other direction:

Nov  8 08:57:39 cluster2 crmd: [27342]: info: crm_update_peer: Node cluster1: id=1870470316 state=member (new) addr=r(0) ip(172.28.125.111)  votes=1 born=84 seen=100 proc=00000000000000000000000000000002


So unless corosync is losing messages, this looks to be a bug in the pacemaker plugin.  However, it's hard to debug further since corosync.conf (and thus the plugin) logs to a file not included in the tarball:

        logfile: /var/log/corosync/corosync.log
        to_syslog: no
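
For future runs, a logging stanza along these lines (a sketch, not the configuration actually in use) would keep the plugin's messages in syslog as well, so they end up in the hb_report tarball:

logging {
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        syslog_facility: daemon
        timestamp: on
}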

Any chance you could reproduce?

Comment 9 Dan Frincu 2011-11-24 10:27:37 UTC
Created attachment 535847 [details]
corosync logs from both nodes with records from the same timeframe as the hb_report

Logs, as requested.

Comment 10 Andrew Beekhof 2012-03-20 22:08:28 UTC
Not sure how this one got lost.
By coincidence, David is working on similar bugs from the upstream bugzilla:

#5040 UNCONFIRMED - andrew - "Not in our membership" after node reboot.
#5041 UNCONFIRMED - andrew - udpu cluster see node what dead

I think we're getting close to a solution.

Comment 11 Andrew Beekhof 2012-03-26 23:57:07 UTC
Both nodes appear to have changed node ids at runtime... exactly how was the network lost?

Nov 08 07:49:08 corosync [pcmk  ] WARN: pcmk_update_nodeid: Detected local node id change: 1836915884 -> 16777343

Nov 08 07:52:44 corosync [pcmk  ] WARN: pcmk_update_nodeid: Detected local node id change: 1870470316 -> 16777343

Comment 12 Andrew Beekhof 2012-03-27 00:17:47 UTC
I take back comment #7

If I'm reading the log snippets below correctly, both nodes agree that 172.28.125.111 is down.
And that there are only 3 seconds during which it is properly connected, from "Nov 08 08:56:22" to "Nov 08 08:56:25".

Additionally, someone appears to be playing with the network. These two logs indicate that rather than being stopped, corosync is binding to loopback, then ethX, then loopback again.

Nov 08 07:52:44 corosync [pcmk  ] WARN: pcmk_update_nodeid: Detected local node id change: 1870470316 -> 16777343
Nov 08 08:56:25 corosync [pcmk  ] WARN: pcmk_update_nodeid: Detected local node id change: 16777343 -> 1870470316

Perhaps it would be advisable to set a nodeid?  And not to bring the entire interface up and down with ifconfig.
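
For example, something like this in each node's totem section (values are illustrative only, not taken from the reporter's configuration):

totem {
        # pin the node id so it cannot change when the interface is rebound
        nodeid: 1    # use a different value, e.g. 2, on the second node
        # keep the existing version/interface settings as they are
}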



# grep corosync cluster2.corosync.log | grep -v -e config_query_callback -e pcmk 
Nov 08 07:49:04 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 08 07:49:04 corosync [TOTEM ] The network interface is down.
Nov 08 07:49:08 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 07:49:08 corosync [CPG   ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:2 left:2)
Nov 08 07:49:08 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 08 08:57:35 corosync [TOTEM ] The network interface [172.28.125.109] is now up.
Nov 08 08:57:39 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:57:39 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:1 left:0)
Nov 08 08:57:39 corosync [MAIN  ] Completed service synchronization, ready to provide service.

# grep corosync cluster1.corosync.log | grep -v -e config_query_callback -e pcmk 
Nov 08 07:49:04 corosync [TOTEM ] A processor failed, forming new configuration.
Nov 08 07:49:08 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 07:49:08 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:2 left:1)
Nov 08 07:49:08 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 08 07:52:41 corosync [TOTEM ] The network interface is down.
Nov 08 07:52:44 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 07:52:44 corosync [CPG   ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:1 left:1)
Nov 08 07:52:44 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 08 08:56:22 corosync [TOTEM ] The network interface [172.28.125.111] is now up.
Nov 08 08:56:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:56:25 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:1 left:1)
Nov 08 08:56:25 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Nov 08 08:57:39 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 08 08:57:39 corosync [CPG   ] chosen downlist: sender r(0) ip(172.28.125.111) ; members(old:1 left:0)
Nov 08 08:57:39 corosync [MAIN  ] Completed service synchronization, ready to provide service.

Comment 13 Andrew Beekhof 2012-03-27 00:19:13 UTC
Setting back to NEW as the logs tell a different story to the description.
We either need the problem reproduced again (with blackbox data included) or to close this.

Comment 14 Steven Dake 2012-03-27 00:36:58 UTC
NetworkManager is not supported with the cluster stack at this time.  It takes interfaces out of service at runtime, which wreaks all kinds of havoc (it will show the interface binding to loopback and then binding back to ethX).  Are you running NetworkManager?
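
On the Debian Squeeze nodes used here, something like the following would show whether it is installed and running (command and package names are assumptions for that release, not taken from the report):

dpkg -l network-manager 2>/dev/null | grep '^ii'    # is the package installed?
ps -C NetworkManager -o pid,cmd                     # is the daemon running?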

Regards
-steve

Comment 15 Fedora End Of Life 2013-01-16 21:28:20 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 16 Jaroslav Reznik 2013-02-26 14:31:07 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.