Bug 1211375
Summary: | Assertion in libqb via corosync
Product: | Red Hat Enterprise Linux 7 | Reporter: | Frank Enderle <frank.enderle>
Component: | libqb | Assignee: | Christine Caulfield <ccaulfie>
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe>
Severity: | urgent | Docs Contact: |
Priority: | urgent
Version: | 7.1 | CC: | amoralej, cluster-maint, crose, ctowsley, fdinitto, jherrman, jkortus, jruemker, linux-bugs, mjuricek, mzheng, tlavigne
Target Milestone: | rc | Keywords: | ZStream
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: | libqb-0.17.1-2.el7 | Doc Type: | Bug Fix
Doc Text: |
Cause:
An unrelated change caused the libqb interprocess communication (IPC) mainloop not to free some resources when a connection was terminated.
Consequence:
When a total of 64K connections had been made, libqb would assert and the application would exit.
Fix:
libqb now cleans up correctly after a connection terminates.
Result:
More than 64K connections can be made via libqb, and the process no longer exits.
|
Story Points: | ---
Clone Of: |
: | 1212297 (view as bug list) | Environment: |
Last Closed: | 2015-11-19 04:50:09 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1212297
Attachments: |
|
Description
Frank Enderle
2015-04-13 19:26:07 UTC
(In reply to Frank Enderle from comment #0)
> Description of problem:
>
> Please note that I'm running CentOS, however I wanted to report the problem
> also here.
>
> Approx. once per day corosync restarts on my current cluster config. Through the
> backtrace it is visible that the problem is an assertion in libqb:
>
> #4 0x00007faca2d3aa26 in _poll_add_ (l=l@entry=0x7faca3986cd0,
>     p=QB_LOOP_HIGH, fd=fd@entry=27, events=35, data=0x7faca46ed3b0,
>     pe_pt=pe_pt@entry=0x7fffad747988) at loop_poll.c:305
>         pe = 0x7faca47e6148
>         install_pos = 4294967274

install_pos looks weird here because it is actually an error number, -EINVAL to be exact. This is most likely caused by the libqb array growing beyond 65536 elements, or 2^16. It appears the libqb mainloop fd list is growing without bound for some reason. This is probably either an issue with libqb not properly reclaiming memory in mainloop after a file descriptor has been removed, or an issue with corosync not deleting entries from mainloop.

(A short illustration of how 4294967274 maps to -EINVAL follows the log excerpts below.)

-- David

Frank, do you have any steps to reproduce this? What is corosync doing when this happens? Is this a pacemaker environment?

(In reply to David Vossel from comment #3)
> Frank, do you have any steps to reproduce this? What is corosync doing when
> this happens? Is this a pacemaker environment?

Sorry, I have no steps to reproduce, but I did some further checking:

1. The problems seem to have started on Apr 02, which is interesting since I upgraded from CentOS 7.0 to 7.1 on Apr 01. So it _might_ be a regression. I will try downgrading libqb and corosync to check, starting with corosync and proceeding to libqb if that has no effect.

2. The interval is somewhere around 9 to 10 hours, so the 'running out of resources' theory seems to be a valid one.

3. I also noted that it seems as if only the non-DC node hits the assertion, but I'm not really sure of that.

The cluster is a pacemaker cluster which I set up pretty much as described in the RHEL7 HA Add-On manual. I'll attach the output of pcs config show.

After the failure, the following entries show up in the logs:

Apr 14 05:45:30 sio-c01-db02 abrt-hook-ccpp: Saved core dump of pid 16142 (/usr/sbin/corosync) to /var/tmp/abrt/ccpp-2015-04-14-05:45:30-16142 (30638080 bytes)
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 crmd[16175]: error: crmd_quorum_destroy: connection terminated
Apr 14 05:45:30 sio-c01-db02 stonith-ng[16173]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 pacemakerd[16171]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: crit: attrd_cpg_destroy: Lost connection to Corosync service!
Apr 14 05:45:30 sio-c01-db02 pacemakerd[16171]: error: mcp_cpg_destroy: Connection destroyed
Apr 14 05:45:30 sio-c01-db02 stonith-ng[16173]: error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: notice: main: Cleaning up before exit
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: notice: crm_client_disconnect_all: Disconnecting client 0xe56c30, pid=16175...
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: crm_ipc_read: Connection to stonith-ng failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: mainloop_gio_callback: Connection to stonith-ng[0x21b3560] closed (I/O condition=17)
Apr 14 05:45:30 sio-c01-db02 crmd[16175]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 14 05:45:30 sio-c01-db02 cib[16172]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 cib[16172]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: stonith_connection_destroy_cb: LRMD lost STONITH connection
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: stonith_connection_failed: STONITH connection failed, finalizing 1 pending operations.
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: qb_ipcs_event_sendv: new_event_notification (20237-16175-8): Broken pipe (32)
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 systemd: corosync.service: main process exited, code=dumped, status=6/ABRT
Apr 14 05:45:30 sio-c01-db02 systemd: pacemaker.service: main process exited, code=exited, status=107/n/a
Apr 14 05:45:30 sio-c01-db02 systemd: Unit pacemaker.service entered failed state.
Apr 14 05:45:30 sio-c01-db02 systemd: Unit corosync.service entered failed state.
Apr 14 05:45:30 sio-c01-db02 abrt-server: Generating core_backtrace
Apr 14 05:45:30 sio-c01-db02 systemd: pacemaker.service holdoff time over, scheduling restart.
Apr 14 05:45:30 sio-c01-db02 systemd: Stopping Pacemaker High Availability Cluster Manager...
Apr 14 05:45:30 sio-c01-db02 systemd: Starting Corosync Cluster Engine...
Apr 14 05:45:30 sio-c01-db02 corosync[11280]: [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Apr 14 05:45:30 sio-c01-db02 corosync[11280]: [MAIN  ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Apr 14 05:45:30 sio-c01-db02 corosync[11281]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Apr 14 05:45:30 sio-c01-db02 corosync[11281]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] The network interface [10.1.1.3] is now up.
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync configuration map access [0]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: cmap
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync configuration service [1]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: cfg
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: cpg
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QUORUM] Using quorum provider corosync_votequorum
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: votequorum
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: quorum
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] adding new UDPU member {10.1.1.2}
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] adding new UDPU member {10.1.1.3}
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] A new membership (10.1.1.3:28) was formed. Members joined: 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] A new membership (10.1.1.2:32) was formed. Members joined: 1
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QUORUM] This node is within the primary component and will provide service.
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QUORUM] Members[2]: 1 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr 14 05:45:31 sio-c01-db02 corosync: Starting Corosync Cluster Engine (corosync): [ OK ]
Apr 14 05:45:31 sio-c01-db02 systemd: Started Corosync Cluster Engine.
Apr 14 05:45:31 sio-c01-db02 systemd: Starting Pacemaker High Availability Cluster Manager...
Apr 14 05:45:31 sio-c01-db02 systemd: Started Pacemaker High Availability Cluster Manager.
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Apr 14 05:45:31 sio-c01-db02 pacemakerd: crm_ipc_connect: Could not establish pacemakerd connection: Connection refused (111)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: mcp_read_config: Configured corosync to accept connections from group 189: OK (1)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: main: Starting Pacemaker 1.1.12 (Build: a14efad): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc upstart systemd nagios corosync-native atomic-attrd acls
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: find_and_track_existing_processes: Tracking existing lrmd process (pid=20237)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: find_and_track_existing_processes: Tracking existing pengine process (pid=20239)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: cluster_connect_quorum: Quorum acquired
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node sio-c01-db01a[1] - state is now member (was (null))
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node sio-c01-db02a[2] - state is now member (was (null))

and so on...

Created attachment 1014181 [details]
pcs config output
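As a side note on David's reading of the backtrace above: the value 4294967274 in install_pos is simply -22, i.e. -EINVAL on Linux, viewed as an unsigned 32-bit number (2^32 - 22 = 4294967274). The following is a minimal, purely illustrative C snippet, not libqb code; it only demonstrates the two's-complement reinterpretation, assuming a platform where int is 32 bits and EINVAL is 22 (as on RHEL/CentOS 7 x86_64).

#include <errno.h>
#include <stdio.h>

int main(void)
{
    /* The kind of value an insert/grow helper might return when an
     * internal array hits its hard limit. */
    int rc = -EINVAL;                       /* -22 */

    /* Viewed through an unsigned 32-bit lens, -22 becomes 4294967274,
     * the exact number printed for install_pos in the backtrace. */
    unsigned int as_unsigned = (unsigned int)rc;

    printf("signed: %d  unsigned: %u\n", rc, as_unsigned);
    return 0;
}

So install_pos was never a valid slot index at all; it was an error return that ended up being displayed as a huge unsigned value.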
Ok, so I just downgraded corosync from:

corosync-2.3.4-4.el7.x86_64
corosynclib-2.3.4-4.el7.x86_64

to

corosync-2.3.3-2.el7.x86_64
corosynclib-2.3.3-2.el7.x86_64

We'll see tomorrow.

I've tracked this down. The bug is in libqb 0.17.1. I'll have a fix shortly.

-- David

I've posted the patch upstream.

https://github.com/ClusterLabs/libqb/pull/147

This issue was exposed by a recent change made to the IPC server code. The bug has nothing to do with the IPC changes themselves, other than that those changes altered the IPC server's usage of mainloop slightly. Unfortunately, this alteration causes libqb's mainloop code to not reclaim resources correctly. Eventually the unreclaimed resources hit a hard limit within libqb, which triggers an assert and causes corosync to crash.

(A purely illustrative sketch of this leak-until-hard-limit pattern appears at the end of this report.)

-- David

Ok - so I upgraded back to the latest corosync. Thank you for hunting it down and submitting the patch. I'm really looking forward to the fixed package.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2200.html
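For readers who want to picture the failure mode David describes, here is a minimal sketch of the general pattern: a poll-entry table that is never given its slots back on disconnect grows by one entry per connection until a hard cap is hit, at which point the insert path returns -EINVAL and the caller asserts. This is not libqb's actual implementation or the code changed in the pull request; all names (poll_entry_install, poll_entry_remove, MAX_ENTRIES) are hypothetical and only the shape of the bug is shown.

#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_ENTRIES (1 << 16)           /* hypothetical 2^16 hard limit */

struct poll_table {
    size_t used;                        /* grows monotonically if slots leak */
};

/* Hypothetical helper: claim a slot for a new fd.
 * Correct behaviour would first reuse a previously released slot. */
static int poll_entry_install(struct poll_table *t)
{
    if (t->used >= MAX_ENTRIES)
        return -EINVAL;                 /* table full: the -22 seen as 4294967274 */
    return (int)t->used++;
}

/* Buggy teardown: the slot index is forgotten rather than recycled. */
static void poll_entry_remove(struct poll_table *t, int pos)
{
    (void)t;
    (void)pos;                          /* nothing reclaimed -> slow leak */
}

int main(void)
{
    struct poll_table t = { 0 };

    /* Each IPC connection installs an entry; each disconnect should free one.
     * Because remove leaks the slot, the assert fires on the 65537th connection. */
    for (unsigned long i = 0; i < 70000; i++) {
        int pos = poll_entry_install(&t);
        assert(pos >= 0);
        poll_entry_remove(&t, pos);
    }
    return 0;
}

At roughly one connect/disconnect cycle every half second, 65536 leaked slots take on the order of 9-10 hours to accumulate, which matches the crash interval Frank observed. The actual fix in libqb-0.17.1-2.el7 makes the mainloop clean up after a terminated connection so the table no longer grows without bound.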