Bug 1114852 - Sometimes cluster unexpectedly dies and fails to start.
Summary: Sometimes cluster unexpectedly dies and fails to start.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libqb
Version: 7.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1276345 1277538
 
Reported: 2014-07-01 07:20 UTC by Leonid Natapov
Modified: 2019-09-12 07:55 UTC (History)
CC: 26 users

Fixed In Version: libqb-1.0-1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-04 04:39:51 UTC
Target Upstream Version:


Attachments (Terms of Use)
crm report (164.08 KB, application/x-bzip)
2014-07-01 07:20 UTC, Leonid Natapov


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1405553 0 None None None Never
Red Hat Product Errata RHBA-2016:2363 0 normal SHIPPED_LIVE libqb bug fix and enhancement update 2016-11-03 13:49:55 UTC

Description Leonid Natapov 2014-07-01 07:20:17 UTC
Created attachment 913689 [details]
crm report

Sometimes Pacemaker unexpectedly dies and fails to start.
I have a cluster with 3 nodes that runs successfully. Sometimes the cluster dies on one of the nodes: suddenly "pcs status" shows "Error: cluster is not currently running on this node", and trying to start the cluster fails. This happens from time to time on all 3 nodes.

Here is the /var/log/messages log.

crm_report is attached.

pacemaker-cluster-libs-1.1.10-31.el7_0.x86_64
pacemaker-cli-1.1.10-31.el7_0.x86_64
pacemaker-libs-1.1.10-31.el7_0.x86_64
pacemaker-1.1.10-31.el7_0.x86_64

--------------------------------------------------------------------------
Jul  1 10:09:15 puma43 systemd: Started Corosync Cluster Engine.
Jul  1 10:09:15 puma43 systemd: Starting Pacemaker High Availability Cluster Manager...
Jul  1 10:09:15 puma43 systemd: Started Pacemaker High Availability Cluster Manager.
Jul  1 10:09:15 puma43 systemd: Failed to reset devices.list on /system.slice/system-epmd.slice: Invalid argument
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 pacemakerd: Could not establish pacemakerd connection: Connection refused (111)
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: mcp_read_config: Configured corosync to accept connections from group 189: Library error (2)
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: main: Starting Pacemaker 1.1.10-31.el7_0 (Build: 368c726):  generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc systemd nagios  corosync-native
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 2
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: cluster_connect_quorum: Quorum acquired
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Jul  1 10:09:15 puma43 cib[108543]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 lrmd[108545]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 attrd[108546]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 stonith-ng[108544]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul  1 10:09:15 puma43 attrd[108546]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul  1 10:09:15 puma43 stonith-ng[108544]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul  1 10:09:15 puma43 pengine[108547]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 crmd[108548]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Jul  1 10:09:15 puma43 corosync[102268]: [MAIN  ] Denied connection attempt from 189:189
Jul  1 10:09:15 puma43 corosync[102268]: [QB    ] Invalid IPC credentials (102268-108546-2).
Jul  1 10:09:15 puma43 attrd[108546]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Jul  1 10:09:15 puma43 attrd[108546]: error: main: HA Signon failed
Jul  1 10:09:15 puma43 attrd[108546]: error: main: Aborting startup
Jul  1 10:09:15 puma43 crmd[108548]: notice: main: CRM Git Version: 368c726
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul  1 10:09:15 puma43 cib[108543]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Jul  1 10:09:15 puma43 corosync[102268]: [MAIN  ] Denied connection attempt from 189:189
Jul  1 10:09:15 puma43 corosync[102268]: [QB    ] Invalid IPC credentials (102268-108543-3).
Jul  1 10:09:15 puma43 cib[108543]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 1
Jul  1 10:09:15 puma43 cib[108543]: crit: cib_init: Cannot sign in to the cluster... terminating
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 1
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1] - state is now member (was (null))
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: crm_update_peer_state: pcmk_quorum_notification: Node puma43.scl.lab.tlv.redhat.com[2] - state is now member (was (null))
Jul  1 10:09:15 puma43 stonith-ng[108544]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul  1 10:09:15 puma43 stonith-ng[108544]: notice: get_node_name: Could not obtain a node name for corosync nodeid 2
Jul  1 10:09:15 puma43 stonith-ng[108544]: notice: corosync_node_name: Unable to get node name for nodeid 2
Jul  1 10:09:15 puma43 stonith-ng[108544]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 3
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 3
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 3
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: corosync_node_name: Unable to get node name for nodeid 3
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: get_node_name: Could not obtain a node name for corosync nodeid 3
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[3] - state is now member (was (null))
Jul  1 10:09:15 puma43 pacemakerd[108542]: error: pcmk_child_exit: Child process cib (108543) exited: Network is down (100)
Jul  1 10:09:15 puma43 pacemakerd[108542]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Jul  1 10:09:15 puma43 pacemakerd[108542]: error: pcmk_child_exit: Child process attrd (108546) exited: Network is down (100)
Jul  1 10:09:15 puma43 pacemakerd[108542]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping crmd: Sent -15 to process 108548
Jul  1 10:09:15 puma43 crmd[108548]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Jul  1 10:09:15 puma43 crmd[108548]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Jul  1 10:09:15 puma43 crmd[108548]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Jul  1 10:09:15 puma43 crmd[108548]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Jul  1 10:09:15 puma43 crmd[108548]: notice: terminate_cs_connection: Disconnecting from Corosync
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping pengine: Sent -15 to process 108547
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping lrmd: Sent -15 to process 108545
Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 108544
Jul  1 10:09:24 puma43 stonith-ng[108544]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Jul  1 10:09:24 puma43 pacemakerd[108542]: notice: pcmk_shutdown_worker: Shutdown complete

Comment 2 Andrew Beekhof 2014-07-01 22:34:24 UTC
This looks to be the root cause of the restart issues:

Jul  1 10:09:15 puma43 pacemakerd[108542]: notice: mcp_read_config: Configured corosync to accept connections from group 189: Library error (2)

Something to do with the corosync API by the looks of it
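The denied-connection loop can be probed directly. mcp_read_config asks corosync, through the cmap API, to whitelist the Pacemaker group for IPC; if that call failed, the grant will simply be absent from cmap. A minimal check, assuming root on the affected node, that gid 189 is haclient as in the log, and that the grant lands under a runtime key of the form uidgid.gid.&lt;gid&gt;:

```shell
# Diagnostic sketch: is the haclient gid known, and did corosync record an
# IPC grant for it? If mcp_read_config's cmap_set_uint8 failed with
# CS_ERR_LIBRARY, no uidgid key will be present and corosync will keep
# printing "Denied connection attempt from 189:189".
getent group 189 || echo "gid 189 not present on this host"
corosync-cmapctl 2>/dev/null | grep '^uidgid\.' || echo "no uidgid grants found in cmap"
```

If the second command prints only the fallback message while corosync is running, the grant never made it in and the Pacemaker children are being rejected for that reason.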

Comment 3 Andrew Beekhof 2014-07-04 00:37:16 UTC
Chrissie: I need your eyes here....

I'm looking in corosync-blackbox-live.txt ... 
The last lines I see are:

trace   Jul 01 09:21:44 messages_deliver_to_app(3901):12: Delivering 35f3c to 35f3d
trace   Jul 01 09:21:44 messages_deliver_to_app(3970):12: Delivering MCAST message with seq 0 to pending delivery queue

yet pacemakerd is doing things (like trying to connect) that I'd have thought would produce logging much later than that.

Eg. this call that fails:

Jul  1 09:33:04 puma43 pacemakerd[104109]: notice: mcp_read_config: Configured corosync to accept connections from group 189: Library error (2)


Could you take a look? It's starting to smell like a corosync bug to me.

Comment 6 Jan Friesse 2014-07-11 08:02:22 UTC
Andrew,
for me interesting lines are:

trace   Jul 01 09:21:34 messages_deliver_toERROR: qb_rb_chunk_read failed: Connection timed out
[trace] ENTERING qb_rb_close()
[debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header

So I believe messages AFTER these lines are just result of reading of uninitialized memory.

What also seems interesting is "Could not establish pacemakerd connection: Connection refused (111)". Is this normal?

But there is one thing I don't understand at all: cmap_set_uint8 returning CS_ERR_LIBRARY. If cmap_initialize failed, I could believe it was a corosync problem (or an already-fixed one involving kernel and unix socket credentials). But a cmap_set_uint8 call simply cannot return CS_ERR_LIBRARY (directly) in its own code path. The only place where it can is libqb (qb_ipcc_sendv_recv, probably?).

This is why I believe the problem is hidden somewhere in libqb.

Dave, any idea? Maybe you've already fixed this bug upstream? Please bounce it back if you believe this is a corosync problem rather than a libqb one.

Comment 7 Andrew Beekhof 2014-07-14 02:15:37 UTC
(In reply to Jan Friesse from comment #6)
> Andrew,
> for me interesting lines are:
> 
> trace   Jul 01 09:21:34 messages_deliver_toERROR: qb_rb_chunk_read failed:
> Connection timed out
> [trace] ENTERING qb_rb_close()
> [debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header
> 
> So I believe messages AFTER these lines are just result of reading of
> uninitialized memory.

Is that not from the subsequent calling of qb-blackbox though?  Not the actual running of corosync

> What also seems interesting is "Could not establish pacemakerd connection:
> Connection refused (111)". Is this normal?

Yes, that's pacemakerd checking for another instance of itself.

Comment 8 David Vossel 2014-07-14 21:22:15 UTC
(In reply to Andrew Beekhof from comment #7)
> (In reply to Jan Friesse from comment #6)
> > Andrew,
> > for me interesting lines are:
> > 
> > trace   Jul 01 09:21:34 messages_deliver_toERROR: qb_rb_chunk_read failed:
> > Connection timed out
> > [trace] ENTERING qb_rb_close()
> > [debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header
> > 
> > So I believe messages AFTER these lines are just result of reading of
> > uninitialized memory.
> 
> Is that not from the subsequent calling of qb-blackbox though?  Not the
> actual running of corosync

qb-create_from_file-header is a blackbox thing. I'm not seeing anything here that indicates a libqb bug.  I haven't encountered anything like this before.

Comment 9 Jan Friesse 2014-07-15 06:04:07 UTC
Leonid,
let's try to isolate the faulty component. When you hit the bug again, can you please log in to the nodes, run corosync-cmapctl (on every node), and send the result (output + exit code)?
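A wrapper along these lines would capture both pieces of information on each node; the output path is only an example:

```shell
# Run corosync-cmapctl, preserving its full output and its exit code,
# as requested above. Execute on every node while the failure is present.
out_file="/tmp/cmap-$(hostname).txt"
corosync-cmapctl > "$out_file" 2>&1
rc=$?
echo "corosync-cmapctl exit code: $rc (output in $out_file)"
```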

Comment 12 Jan Friesse 2014-08-05 11:10:47 UTC
Leonid,
can you please provide the requested info?

Comment 14 Kostiantyn Ponomarenko 2015-03-11 17:35:13 UTC
I faced the same issue.
The OS is Debian.
Two-node cluster.
The problem occurs when the cluster is configured for two nodes but the second node is absent: "node-1" is the only node I try to start, and "node-0" is not used (it is shut down).

The software:
-libqb0_0.17.1
-corosync_2.3.4
-pacemaker_1.1.12

/var/log/messages:
------------------
Mar 11 17:17:02 daemon.notice<29> corosync[101978]:   [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Mar 11 17:17:02 daemon.info<30> corosync[101978]:   [MAIN  ] Corosync built-in features: pie relro bindnow
Mar 11 17:17:02 daemon.notice<29> corosync[101979]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Mar 11 17:17:02 daemon.notice<29> corosync[101979]:   [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Mar 11 17:17:02 daemon.notice<29> corosync[101979]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Mar 11 17:17:02 daemon.notice<29> corosync[101979]:   [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] The network interface [169.254.0.3] is now up.
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Mar 11 17:17:03 daemon.info<30> corosync[101979]:   [QB    ] server name: cmap
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Mar 11 17:17:03 daemon.info<30> corosync[101979]:   [QB    ] server name: cfg
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Mar 11 17:17:03 daemon.info<30> corosync[101979]:   [QB    ] server name: cpg
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [QUORUM] Using quorum provider corosync_votequorum
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Mar 11 17:17:03 daemon.info<30> corosync[101979]:   [QB    ] server name: votequorum
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Mar 11 17:17:03 daemon.info<30> corosync[101979]:   [QB    ] server name: quorum
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] adding new UDPU member {169.254.0.2}
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] adding new UDPU member {169.254.0.3}
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] The network interface [169.254.1.3] is now up.
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] adding new UDPU member {169.254.1.2}
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] adding new UDPU member {169.254.1.3}
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [TOTEM ] A new membership (169.254.0.3:848) was formed. Members joined: 2
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [QUORUM] Members[1]: 2
Mar 11 17:17:03 daemon.notice<29> corosync[101979]:   [MAIN  ] Completed service synchronization, ready to provide service.
Mar 11 17:17:03 daemon.notice<29> pacemaker: Starting Pacemaker Cluster Manager
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Mar 11 17:17:03 daemon.err<27> pacemakerd[101990]:    error: mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2)
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: main: Starting Pacemaker 1.1.12 (Build: 561c4cf):  generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing upstart nagios  corosync-native snmp libesmtp acls
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: cluster_connect_quorum: Quorum lost
Mar 11 17:17:03 daemon.notice<29> stonithd[101993]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 11 17:17:03 daemon.notice<29> attrd[101995]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 11 17:17:03 daemon.err<27> corosync[101979]:   [MAIN  ] Denied connection attempt from 105:107
Mar 11 17:17:03 daemon.err<27> attrd[101995]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Mar 11 17:17:03 daemon.err<27> attrd[101995]:    error: main: Cluster connection failed
Mar 11 17:17:03 daemon.notice<29> attrd[101995]:   notice: main: Cleaning up before exit
Mar 11 17:17:03 daemon.notice<29> cib[101992]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 11 17:17:03 daemon.err<27> corosync[101979]:   [QB    ] Invalid IPC credentials (101979-101995-2).
Mar 11 17:17:03 kern.info<6> kernel: [447239.222441] attrd[101995]: segfault at 1b8 ip 00007f5e62f0e9e1 sp 00007fffe35300a0 error 4 in libqb.so.0.17.1[7f5e62eff000+22000]
Mar 11 17:17:03 daemon.err<27> corosync[101979]:   [MAIN  ] Denied connection attempt from 105:107
Mar 11 17:17:03 daemon.err<27> corosync[101979]:   [QB    ] Invalid IPC credentials (101979-101992-2).
Mar 11 17:17:03 daemon.err<27> cib[101992]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Mar 11 17:17:03 daemon.crit<26> cib[101992]:     crit: cib_init: Cannot sign in to the cluster... terminating
Mar 11 17:17:03 daemon.notice<29> crmd[101997]:   notice: main: CRM Git Version: 561c4cf
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node-1[2] - state is now member (was (null))
Mar 11 17:17:03 daemon.err<27> pacemakerd[101990]:    error: pcmk_child_exit: Child process cib (101992) exited: Network is down (100)
Mar 11 17:17:03 daemon.warning<28> pacemakerd[101990]:  warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Mar 11 17:17:03 daemon.err<27> pacemakerd[101990]:    error: child_waitpid: Managed process 101995 (attrd) dumped core
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: pcmk_child_exit: Child process attrd terminated with signal 11 (pid=101995, core=1)
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: stop_child: Stopping crmd: Sent -15 to process 101997
Mar 11 17:17:03 daemon.warning<28> crmd[101997]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Mar 11 17:17:03 daemon.notice<29> crmd[101997]:   notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Mar 11 17:17:03 daemon.warning<28> crmd[101997]:  warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Mar 11 17:17:03 daemon.notice<29> crmd[101997]:   notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Mar 11 17:17:03 daemon.notice<29> crmd[101997]:   notice: terminate_cs_connection: Disconnecting from Corosync
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: stop_child: Stopping pengine: Sent -15 to process 101996
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: stop_child: Stopping lrmd: Sent -15 to process 101994
Mar 11 17:17:03 daemon.notice<29> pacemakerd[101990]:   notice: stop_child: Stopping stonith-ng: Sent -15 to process 101993
Mar 11 17:17:12 daemon.err<27> stonithd[101993]:    error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Mar 11 17:17:12 daemon.notice<29> pacemakerd[101990]:   notice: pcmk_shutdown_worker: Shutdown complete
Mar 11 17:17:12 daemon.notice<29> pacemakerd[101990]:   notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error



corosync-cmapctl:
-----------------
config.totemconfig_reload_in_progress (u8) = 0
internal_configuration.service.0.name (str) = corosync_cmap
internal_configuration.service.0.ver (u32) = 0
internal_configuration.service.1.name (str) = corosync_cfg
internal_configuration.service.1.ver (u32) = 0
internal_configuration.service.2.name (str) = corosync_cpg
internal_configuration.service.2.ver (u32) = 0
internal_configuration.service.3.name (str) = corosync_quorum
internal_configuration.service.3.ver (u32) = 0
internal_configuration.service.4.name (str) = corosync_pload
internal_configuration.service.4.ver (u32) = 0
internal_configuration.service.5.name (str) = corosync_votequorum
internal_configuration.service.5.ver (u32) = 0
logging.debug (str) = off
logging.fileline (str) = off
logging.logfile (str) = /var/log/cluster/corosync.log
logging.logfile_priority (str) = err
logging.logger_subsys.QUORUM.debug (str) = off
logging.logger_subsys.QUORUM.subsys (str) = QUORUM
logging.timestamp (str) = on
logging.to_logfile (str) = yes
logging.to_stderr (str) = no
logging.to_syslog (str) = yes
nodelist.local_node_pos (u32) = 1
nodelist.node.0.name (str) = node-0
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.ring0_addr (str) = 169.254.0.2
nodelist.node.0.ring1_addr (str) = 169.254.1.2
nodelist.node.1.name (str) = node-1
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.ring0_addr (str) = 169.254.0.3
nodelist.node.1.ring1_addr (str) = 169.254.1.3
quorum.provider (str) = corosync_votequorum
runtime.blackbox.dump_flight_data (str) = no
runtime.blackbox.dump_state (str) = no
runtime.connections.active (u64) = 1
runtime.connections.closed (u64) = 36
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.client_pid (u32) = 102017
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.dispatched (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.flow_control (u32) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.flow_control_count (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.invalid_request (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.name (str) = corosync-cmapct
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.overload (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.queue_size (u32) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.recv_retries (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.requests (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.responses (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.send_retries (u64) = 0
runtime.connections.corosync-cmapct:102017:0x7f8fd0008000.service_id (u32) = 0
runtime.services.cfg.0.rx (u64) = 0
runtime.services.cfg.0.tx (u64) = 0
runtime.services.cfg.1.rx (u64) = 0
runtime.services.cfg.1.tx (u64) = 0
runtime.services.cfg.2.rx (u64) = 0
runtime.services.cfg.2.tx (u64) = 0
runtime.services.cfg.3.rx (u64) = 0
runtime.services.cfg.3.tx (u64) = 0
runtime.services.cfg.service_id (u16) = 1
runtime.services.cmap.0.rx (u64) = 1
runtime.services.cmap.0.tx (u64) = 1
runtime.services.cmap.service_id (u16) = 0
runtime.services.cpg.0.rx (u64) = 2
runtime.services.cpg.0.tx (u64) = 2
runtime.services.cpg.1.rx (u64) = 2
runtime.services.cpg.1.tx (u64) = 2
runtime.services.cpg.2.rx (u64) = 0
runtime.services.cpg.2.tx (u64) = 0
runtime.services.cpg.3.rx (u64) = 12
runtime.services.cpg.3.tx (u64) = 12
runtime.services.cpg.4.rx (u64) = 0
runtime.services.cpg.4.tx (u64) = 0
runtime.services.cpg.5.rx (u64) = 1
runtime.services.cpg.5.tx (u64) = 1
runtime.services.cpg.service_id (u16) = 2
runtime.services.pload.0.rx (u64) = 0
runtime.services.pload.0.tx (u64) = 0
runtime.services.pload.1.rx (u64) = 0
runtime.services.pload.1.tx (u64) = 0
runtime.services.pload.service_id (u16) = 4
runtime.services.quorum.service_id (u16) = 3
runtime.services.votequorum.0.rx (u64) = 3
runtime.services.votequorum.0.tx (u64) = 2
runtime.services.votequorum.1.rx (u64) = 0
runtime.services.votequorum.1.tx (u64) = 0
runtime.services.votequorum.2.rx (u64) = 0
runtime.services.votequorum.2.tx (u64) = 0
runtime.services.votequorum.3.rx (u64) = 0
runtime.services.votequorum.3.tx (u64) = 0
runtime.services.votequorum.service_id (u16) = 5
runtime.totem.pg.mrp.rrp.0.faulty (u8) = 0
runtime.totem.pg.mrp.rrp.1.faulty (u8) = 0
runtime.totem.pg.mrp.srp.avg_backlog_calc (u32) = 0
runtime.totem.pg.mrp.srp.avg_token_workload (u32) = 0
runtime.totem.pg.mrp.srp.commit_entered (u64) = 1
runtime.totem.pg.mrp.srp.commit_token_lost (u64) = 0
runtime.totem.pg.mrp.srp.consensus_timeouts (u64) = 0
runtime.totem.pg.mrp.srp.continuous_gather (u32) = 0
runtime.totem.pg.mrp.srp.continuous_sendmsg_failures (u32) = 0
runtime.totem.pg.mrp.srp.firewall_enabled_or_nic_failure (u8) = 0
runtime.totem.pg.mrp.srp.gather_entered (u64) = 1
runtime.totem.pg.mrp.srp.gather_token_lost (u64) = 0
runtime.totem.pg.mrp.srp.mcast_retx (u64) = 0
runtime.totem.pg.mrp.srp.mcast_rx (u64) = 0
runtime.totem.pg.mrp.srp.mcast_tx (u64) = 22
runtime.totem.pg.mrp.srp.memb_commit_token_rx (u64) = 2
runtime.totem.pg.mrp.srp.memb_commit_token_tx (u64) = 3
runtime.totem.pg.mrp.srp.memb_join_rx (u64) = 1
runtime.totem.pg.mrp.srp.memb_join_tx (u64) = 1
runtime.totem.pg.mrp.srp.memb_merge_detect_rx (u64) = 2062
runtime.totem.pg.mrp.srp.memb_merge_detect_tx (u64) = 2062
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(169.254.0.3) r(1) ip(169.254.1.3) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.mtt_rx_token (u32) = 180
runtime.totem.pg.mrp.srp.operational_entered (u64) = 1
runtime.totem.pg.mrp.srp.operational_token_lost (u64) = 0
runtime.totem.pg.mrp.srp.orf_token_rx (u64) = 4511
runtime.totem.pg.mrp.srp.orf_token_tx (u64) = 1
runtime.totem.pg.mrp.srp.recovery_entered (u64) = 1
runtime.totem.pg.mrp.srp.recovery_token_lost (u64) = 0
runtime.totem.pg.mrp.srp.rx_msg_dropped (u64) = 0
runtime.totem.pg.mrp.srp.token_hold_cancel_rx (u64) = 6
runtime.totem.pg.mrp.srp.token_hold_cancel_tx (u64) = 6
runtime.totem.pg.msg_queue_avail (u32) = 0
runtime.totem.pg.msg_reserved (u32) = 1
runtime.votequorum.ev_barrier (u32) = 2
runtime.votequorum.this_node_id (u32) = 2
runtime.votequorum.two_node (u8) = 0
totem.cluster_name (str) = avid_storage_cluster
totem.crypto_cipher (str) = none
totem.crypto_hash (str) = none
totem.interface.0.bindnetaddr (str) = 169.254.0.3
totem.interface.0.mcastaddr (str) = 239.192.90.15
totem.interface.0.mcastport (u16) = 5405
totem.interface.1.bindnetaddr (str) = 169.254.1.3
totem.interface.1.mcastaddr (str) = 239.192.90.16
totem.interface.1.mcastport (u16) = 5405
totem.rrp_mode (str) = passive
totem.transport (str) = udpu
totem.version (u32) = 2

Comment 21 Jan Friesse 2015-04-07 07:08:23 UTC
John,
do you have coredumps of crashed corosync?

Comment 24 Jan Pokorný [poki] 2015-04-10 06:53:59 UTC
See also http://clusterlabs.org/pipermail/users/2015-March/000162.html

Comment 25 Kostiantyn Ponomarenko 2015-04-14 13:43:24 UTC
According to "Mar 27 14:00:50 daemon.err<27> pacemakerd[111069]:    error: child_waitpid: Managed process 111074 (attrd) dumped core", there is a core dump in "/var/cores/core.attrd-111074-1427464849".

How can I attach the core dump here?
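One way to share the relevant information without uploading the whole dump is to attach a gdb backtrace instead. A sketch, assuming the Debian attrd binary lives at /usr/lib/pacemaker/attrd (an assumption; confirm with `file` on the core):

```shell
# Extract a full backtrace from the attrd core named in comment 25.
# The binary path is a guess for this Debian build; `file "$core"` reports
# which executable actually produced the dump.
core=/var/cores/core.attrd-111074-1427464849
gdb -batch -ex 'thread apply all bt full' \
    /usr/lib/pacemaker/attrd "$core" > attrd-backtrace.txt 2>&1
echo "backtrace written to attrd-backtrace.txt ($(wc -l < attrd-backtrace.txt) lines)"
```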

Comment 26 Andrew Beekhof 2015-04-14 22:42:59 UTC
While interesting, attrd crashing is not related to this bug.
You need a crash from corosync.

Comment 27 Kostiantyn Ponomarenko 2015-04-15 13:00:22 UTC
"/var/cores/" doesn't have any other core dumps.
Does Corosync use the same place for dumps?
I am not sure that Corosync actually crashed.
I will try to reproduce this state, and as soon as I get more info I will get back with it.
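In the meantime, a quick check run right after the failure would show whether corosync itself died; the /var/cores location comes from comment 25 and is an assumption for other setups:

```shell
# Did corosync survive the incident? Three independent signals:
# its process, any corosync core files, and a segfault line in dmesg.
pidof corosync || echo "corosync is not running"
ls /var/cores/core.corosync-* 2>/dev/null || echo "no corosync core dumps found"
dmesg 2>/dev/null | grep -i 'corosync.*segfault' || echo "no corosync segfault in dmesg"
```

If corosync is still up with its original pid and no core or segfault shows, the crash is confined to the Pacemaker children (as the attrd segfault above already suggests).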

Comment 28 Kostiantyn Ponomarenko 2015-04-22 10:55:55 UTC
I faced this issue one more time.
Now I can say with certainty that Corosync doesn't crash.

On a working machine I stopped Pacemaker and Corosync.
Then I started them with the following commands and got this:
------------------
# /etc/init.d/corosync start
Starting Corosync Cluster Engine (corosync): [  OK  ]
# /etc/init.d/corosync status
corosync (pid 100837) is running...
# /etc/init.d/pacemaker start
Starting Pacemaker Cluster Manager[  OK  ]
# /etc/init.d/pacemaker status
pacemakerd is stopped
------------------


/var/log/messages:
------------------
Apr 22 10:49:08 daemon.notice<29> pacemaker: Starting Pacemaker Cluster Manager
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error: mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2)
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: main: Starting Pacemaker 1.1.12 (Build: 561c4cf):  generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing upstart nagios  corosync-native snmp libesmtp acls
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: cluster_connect_quorum: Quorum lost
Apr 22 10:49:08 daemon.notice<29> stonithd[114136]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 22 10:49:08 daemon.notice<29> attrd[114138]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied connection attempt from 105:107
Apr 22 10:49:08 daemon.err<27> attrd[114138]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Apr 22 10:49:08 daemon.err<27> attrd[114138]:    error: main: Cluster connection failed
Apr 22 10:49:08 daemon.notice<29> attrd[114138]:   notice: main: Cleaning up before exit
Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]: segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in libqb.so.0.17.1[7f375480d000+22000]
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [QB    ] Invalid IPC credentials (100837-114138-2).
Apr 22 10:49:08 daemon.notice<29> cib[114135]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 22 10:49:08 daemon.err<27> cib[114135]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Apr 22 10:49:08 daemon.crit<26> cib[114135]:     crit: cib_init: Cannot sign in to the cluster... terminating
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied connection attempt from 105:107
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [QB    ] Invalid IPC credentials (100837-114135-3).
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice: main: CRM Git Version: 561c4cf
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node-0[1] - state is now member (was (null))
Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error: pcmk_child_exit: Child process cib (114135) exited: Network is down (100)
Apr 22 10:49:08 daemon.warning<28> pacemakerd[114133]:  warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error: child_waitpid: Managed process 114138 (attrd) dumped core
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: pcmk_child_exit: Child process attrd terminated with signal 11 (pid=114138, core=1)
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: stop_child: Stopping crmd: Sent -15 to process 114140
Apr 22 10:49:08 daemon.warning<28> crmd[114140]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Apr 22 10:49:08 daemon.warning<28> crmd[114140]:  warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice: terminate_cs_connection: Disconnecting from Corosync
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: stop_child: Stopping pengine: Sent -15 to process 114139
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: stop_child: Stopping lrmd: Sent -15 to process 114137
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: stop_child: Stopping stonith-ng: Sent -15 to process 114136
Apr 22 10:49:17 daemon.err<27> stonithd[114136]:    error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]:   notice: pcmk_shutdown_worker: Shutdown complete
Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]:   notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
------------------


"/var/cores/" contains only "core.attrd-*".
What else can I do?

Comment 29 Kostiantyn Ponomarenko 2015-04-22 11:10:08 UTC
Could the problem be in 'libqb'?
I noticed this line in the log:
Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]: segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in libqb.so.0.17.1[7f375480d000+22000]

Comment 30 Jan Friesse 2015-04-22 12:30:53 UTC
Kostiantyn,
of course it can be a bug in libqb, but I don't think that's very probable. The really important line is:

Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2)

What exactly is your reproducer? Just stop pacemaker + corosync and then start them again? How reliable is that reproducer?

Comment 31 Kostiantyn Ponomarenko 2015-04-22 13:05:50 UTC
Yes, that is correct.
You need to sequentially start/stop Corosync/Pacemaker a number of times (~5-10).

Comment 32 Jan Friesse 2015-06-19 15:34:08 UTC
Kostiantyn,
there was bug in libqb bug 1211375. After installing fixed version, is the problem still there?

Comment 33 Kostiantyn Ponomarenko 2015-06-19 16:07:36 UTC
Jan,
I haven't tried the fixed version yet. I am still using "v0.17.1 Final Release".

Comment 34 Arkadiusz Bubała 2015-08-03 08:42:20 UTC
Hello,
I've got the same issue. Updating libqb to the latest version didn't resolve the problem. It seems much harder to reproduce now, but it's still here.

The steps to reproduce this bug are:
1. Halt the secondary node.
2. Stop Corosync/Pacemaker.
3. Start Corosync/Pacemaker. Corosync is running but Pacemaker exits.
4. Start the secondary node.
5. Start Corosync/Pacemaker on the secondary node. Corosync is running but Pacemaker exits.
6. There is no segmentation fault; Pacemaker exits because it can't connect to Corosync.

After a few retries corosync and pacemaker start correctly.

Comment 35 Jan Friesse 2015-08-04 06:16:17 UTC
Arkadiusz,
is halting the secondary node needed, or does sequential start/stop of pacemaker on one node also reproduce the problem?

Comment 36 Arkadiusz Bubała 2015-08-04 08:51:33 UTC
Jan,
I've never reproduced this problem with a sequential start/stop of pacemaker, but I could try to test it with stop/start in a loop. This issue also seldom happens while rebooting the secondary node.
The second thing is that this issue only occurs on a very powerful environment.

Currently both nodes are running on Intel Server Board S2600IP with two Intel Xeon E5-2643 3.30GHz CPUs and 128 GB of RAM.

Comment 37 Kostiantyn Ponomarenko 2015-08-04 09:19:26 UTC
Jan,
I can surely say that it is reproducible with only one node up.
I don't remember whether it is reproducible with two nodes up.

BTW, my environment: 
Intel(R) Xeon(R) CPU E5-2618L v2 @ 2.00GHz
32 GB RAM

Comment 38 Arkadiusz Bubała 2015-08-04 10:28:59 UTC
Jan, Kostiantyn,

I tested it with a stop/start sequence and it is reproducible, but both corosync and pacemaker have to be restarted. When I stopped/started pacemaker only, this issue didn't occur. It is also impossible to reproduce with the debug log level enabled, so I've added some log prints to libqb. Below are the results:


When everything works fine:

Aug 04 12:09:06 [61070] node2 pacemakerd:     info: crm_ipc_connect: 	Could not establish pacemakerd connection: Connection refused (111)
Aug 04 12:09:06 [61070] node2 pacemakerd:     info: get_cluster_type: 	Detected an active 'corosync' cluster
Aug 04 12:09:06 [61070] node2 pacemakerd:     info: mcp_read_config: 	Reading configure for stack: corosync
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 44: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 43: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 62: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 44: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 40: Resource temporarily unavailable (11)
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 2
Aug 04 12:09:06 [61070] node2 pacemakerd:  warning: qb_ipcc_sendv_recv: 	res: 24: Success (0)
Aug 04 12:09:06 [61070] node2 pacemakerd:   notice: mcp_read_config: 	Configured corosync to accept connections from group 112: OK (1)

When the error occurs:


Aug 04 12:13:22 [110652] node2 pacemakerd:     info: crm_ipc_connect: 	Could not establish pacemakerd connection: Connection refused (111)
Aug 04 12:13:22 [110652] node2 pacemakerd:     info: get_cluster_type: 	Detected an active 'corosync' cluster
Aug 04 12:13:22 [110652] node2 pacemakerd:     info: mcp_read_config: 	Reading configure for stack: corosync
Aug 04 12:13:22 [110652] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm_fc_get_rc = 1
Aug 04 12:13:22 [110652] node2 pacemakerd:    error: qb_ipcc_sendv_recv: 	fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 1
Aug 04 12:13:22 [110652] node2 pacemakerd:    error: qb_ipcc_sendv_recv: 	fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 1
Aug 04 12:13:22 [110652] node2 pacemakerd:    error: qb_ipcc_sendv_recv: 	fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 0
Aug 04 12:13:22 [110652] node2 pacemakerd:    error: qb_ipcc_sendv_recv: 	fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 0
Aug 04 12:13:22 [110652] node2 pacemakerd:    error: qb_ipcc_sendv_recv: 	fc_get failed, result = -107: Resource temporarily unavailable (11)
Aug 04 12:13:22 [110652] node2 pacemakerd:  warning: qb_ipc_shm_fc_get: 	shm _fc_get rc = 0
Aug 04 12:13:22 [110652] node2 pacemakerd:    error: qb_ipcc_sendv_recv: 	fc_get failed, result = -107: Success (0)
Aug 04 12:13:22 [110652] node2 pacemakerd:   notice: mcp_read_config: 	Configured corosync to accept connections from group 112: Library error (2)

I'll continue debugging and let you know when I find something.

Comment 39 Jan Friesse 2015-08-04 14:34:25 UTC
Ok,
so it looks like some kind of funny race condition, which may be why I'm unable to reproduce it.

Guys, what exactly are you using to restart? Just service (or systemctl) pacemaker stop; service corosync stop; service corosync start; service pacemaker start?

Comment 40 Arkadiusz Bubała 2015-08-05 08:40:06 UTC
I found the root of the issue. In the pacemaker code, the function mcp_read_config contains:

cmap_initialize(&local_handle);
...
stack = get_cluster_type();

This creates two cmap connections from the pacemakerd daemon in a short time.

From pacemaker.log:
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_ipcc_shm_connect: 	(1438758850834812) ipcc_shm_connect: cmap
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_open_2: 	Increase ref_count for /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_open_2: 	Increase ref_count for /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_open_2: 	Increase ref_count for /dev/shm/qb-cmap-event-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_ipcc_shm_connect: 	(1438758850838617) ipcc_shm_connect: cmap
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_open_2: 	Increase ref_count for /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_open_2: 	Increase ref_count for /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_open_2: 	Increase ref_count for /dev/shm/qb-cmap-event-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_ipcc_shm_disconnect: 	shm_disconnect
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_close: 	Decrease ref_count for /dev/shm/qb-cmap-request-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_close: 	Decrease ref_count for /dev/shm/qb-cmap-response-110743-112876-1-data
Aug 05 09:14:10 [112876] node2 pacemakerd:  warning: qb_rb_close: 	Decrease ref_count for /dev/shm/qb-cmap-event-110743-112876-1-data


From corosync.log:
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] (1438758850832743) ipcs_shm_connect: cmap (pid: 112876)
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] creae new ref_count on /dev/shm/qb-cmap-request-110743-112876-1-data
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] creae new ref_count on /dev/shm/qb-cmap-response-110743-112876-1-data
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] creae new ref_count on /dev/shm/qb-cmap-event-110743-112876-1-data
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] (1438758850836802) ipcs_shm_connect: cmap (pid: 112876)
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] creae new ref_count on /dev/shm/qb-cmap-request-110743-112876-1-data
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] creae new ref_count on /dev/shm/qb-cmap-response-110743-112876-1-data
Aug  5 09:14:10 node2 corosync[110743]:   [QB    ] creae new ref_count on /dev/shm/qb-cmap-event-110743-112876-1-data

The reference counter was overwritten when get_cluster_type() created a new connection, and then decreased when it closed that connection.
When I moved the get_cluster_type() call before cmap_initialize() in the pacemaker code, this issue didn't occur.

Comment 41 Arkadiusz Bubała 2015-08-05 10:27:46 UTC
Update:

I found what really generates this problem. It is the length of the CONNECTION_DESCRIPTION field, which is defined as 16 by default in libqb's ipc_int.h:
#define CONNECTION_DESCRIPTION (16)

When both corosync and pacemaker have PIDs >= 10000 and the fd is >= 10, the description field will need at least 17 chars. Increasing the length of the description field resolved this issue.

Comment 42 David Vossel 2015-08-05 20:45:20 UTC
(In reply to Arkadiusz Bubała from comment #41)
> Update:
> 
> I found what really generates this problem. It is length of
> CONNECTION_DESCRIPTION field. It is defined as 16 by default in libqb
> ipc_int.h:
> #define CONNECTION_DESCRIPTION (16)
> 
> When both corosync and pacemaker has PID >= 10000 and the fd >= 10 the
> description field will have at least 17 chars. Increasing the length of
> description field resolved this issue.

The format string is

pid-pid-fd

10000-10000-10 is only 14 characters.

Looking at your debug output, the pids are actually greater than 100,000, which is way larger than the default maximum pid used in rhel6 and rhel7.

# cat /proc/sys/kernel/pid_max 
32768

Under normal operation 16 characters should work. You all have increased the maximum pid size (likely for a really good reason) which has unexpectedly broken IPC.

As a workaround, is it possible to lower the pid_max value back to the default range, or to something <= 99,999? This will resolve the issue in the short term.

Of course we'll fix this in libqb. I just wanted to make you aware there's a workaround.

Comment 43 Arkadiusz Bubała 2015-08-06 06:22:34 UTC
David,
my mistake, of course the pid is larger than 100000, not 10000; I dropped a zero. The pid_max on the tested environments was 122880.

Comment 44 Kostiantyn Ponomarenko 2015-08-12 07:54:09 UTC
Yes, you are right. Here is what I have:

# cat /proc/sys/kernel/pid_max
122880

Which version of "libqb" will get the fix, the next one - v0.17.2?

Comment 46 Christine Caulfield 2015-08-24 08:22:06 UTC
We should be able to get this in 0.17.2, yes.

Comment 47 Christine Caulfield 2015-08-24 08:28:50 UTC
Sorry, my mistake. I should have said it will be in 0.17.3; 0.17.2 will be released this week, and the fix won't land that soon.

Comment 48 marcindulak 2015-08-26 08:12:31 UTC
I guess a backport to RHEL 6.7 is needed too.
I'm seeing pacemaker crashing, and have:
# cat /proc/sys/kernel/pid_max
147456

Comment 49 Shinobu KINJO 2015-10-23 03:09:32 UTC
When will there be an erratum for this bug?

Comment 55 John Ruemker 2015-10-29 13:45:15 UTC
(In reply to marcindulak from comment #48)
> I guess a backport ot RHEL6.7 is needed too.
> I'm seeing pacemaker crashing, and have:
> # cat /proc/sys/kernel/pid_max
> 147456

Hello,
Do you have a case open with Red Hat Global Support Services on this issue?  If not, I'd like to request you please open one so we can work with you on getting this fixed in the release you need it for.

Thanks,
John Ruemker
Principal Software Maintenance Engineer
Red Hat Global Support Services

Comment 64 Christine Caulfield 2016-01-19 16:02:10 UTC
This will be covered by the rebase BZ #1299968

Comment 68 errata-xmlrpc 2016-11-04 04:39:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2363.html

