Bug 1211375
Summary: | Assertion in libqb via corosync
Product: | Red Hat Enterprise Linux 7 | Reporter: | Frank Enderle <frank.enderle>
Component: | libqb | Assignee: | Christine Caulfield <ccaulfie>
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe>
Severity: | urgent | Docs Contact: |
Priority: | urgent
Version: | 7.1 | CC: | amoralej, cluster-maint, crose, ctowsley, fdinitto, jherrman, jkortus, jruemker, linux-bugs, mjuricek, mzheng, tlavigne
Target Milestone: | rc | Keywords: | ZStream
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: | libqb-0.17.1-2.el7 | Doc Type: | Bug Fix
Doc Text: |
Cause:
An unrelated change caused the libqb interprocess communication (IPC) mainloop not to free some resources when a connection was terminated.
Consequence:
When a total of 64K connections had been made, libqb would assert and the application would exit.
Fix:
libqb now cleans up correctly after a connection terminates.
Result:
More than 64K connections can be made via libqb, and the process no longer exits.
|
Story Points: | ---
Clone Of: |
: | 1212297 (view as bug list) | Environment: |
Last Closed: | 2015-11-19 04:50:09 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1212297
Attachments: |
|
Description
Frank Enderle
2015-04-13 19:26:07 UTC
(In reply to Frank Enderle from comment #0)
> Description of problem:
>
> Please note that I'm running CentOS, however I wanted to report the problem
> also here.
>
> Approx. once per day corosync restarts on my current cluster config. Through the
> backtrace it is visible that the problem is an assertion in libqb:
>
> #4 0x00007faca2d3aa26 in _poll_add_ (l=l@entry=0x7faca3986cd0,
>     p=QB_LOOP_HIGH, fd=fd@entry=27, events=35, data=0x7faca46ed3b0,
>     pe_pt=pe_pt@entry=0x7fffad747988) at loop_poll.c:305
>         pe = 0x7faca47e6148
>         install_pos = 4294967274

install_pos looks weird here because it is actually an error number, -EINVAL to be exact. This is most likely caused by the libqb array growing beyond 65536 elements, or 2^16. It appears the libqb mainloop fd list is growing without bound for some reason. This is probably either an issue with libqb not properly reclaiming memory in mainloop after a file descriptor has been removed, or an issue with corosync not deleting entries from mainloop.

(A short illustration of how 4294967274 maps to -EINVAL follows the log excerpts below.)

-- David

Frank, do you have any steps to reproduce this? What is corosync doing when this happens? Is this a pacemaker environment?

(In reply to David Vossel from comment #3)
> Frank, do you have any steps to reproduce this? What is corosync doing when
> this happens? Is this a pacemaker environment?

Sorry, I have no steps to reproduce, but I did some further checking:

1. The problems seem to have started on Apr 02, which is interesting since I upgraded from CentOS 7.0 to 7.1 on Apr 01. So it _might_ be a regression. I will try downgrading libqb and corosync to check, starting with corosync and proceeding to libqb if that has no effect.

2. The interval is somewhere around 9 to 10 hours, so the 'running out of resources' theory seems to be a valid one.

3. I also noted that it seems as if only the non-DC node hits the assertion, but I'm not really sure of that.

The cluster is a pacemaker cluster which I set up pretty much as described in the RHEL7 HA Add-On manual. I'll attach the output of pcs config show.

After the failure, the following entries show up in the logs:

Apr 14 05:45:30 sio-c01-db02 abrt-hook-ccpp: Saved core dump of pid 16142 (/usr/sbin/corosync) to /var/tmp/abrt/ccpp-2015-04-14-05:45:30-16142 (30638080 bytes)
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 crmd[16175]: error: crmd_quorum_destroy: connection terminated
Apr 14 05:45:30 sio-c01-db02 stonith-ng[16173]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 pacemakerd[16171]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: crit: attrd_cpg_destroy: Lost connection to Corosync service!
Apr 14 05:45:30 sio-c01-db02 pacemakerd[16171]: error: mcp_cpg_destroy: Connection destroyed
Apr 14 05:45:30 sio-c01-db02 stonith-ng[16173]: error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: notice: main: Cleaning up before exit
Apr 14 05:45:30 sio-c01-db02 attrd[16174]: notice: crm_client_disconnect_all: Disconnecting client 0xe56c30, pid=16175...
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: crm_ipc_read: Connection to stonith-ng failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: mainloop_gio_callback: Connection to stonith-ng[0x21b3560] closed (I/O condition=17)
Apr 14 05:45:30 sio-c01-db02 crmd[16175]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 14 05:45:30 sio-c01-db02 cib[16172]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 14 05:45:30 sio-c01-db02 cib[16172]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: stonith_connection_destroy_cb: LRMD lost STONITH connection
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: error: stonith_connection_failed: STONITH connection failed, finalizing 1 pending operations.
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: qb_ipcs_event_sendv: new_event_notification (20237-16175-8): Broken pipe (32)
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 lrmd[20237]: warning: send_client_notify: Notification of client crmd/013980c5-4ffe-4e78-a788-f89057f52e09 failed
Apr 14 05:45:30 sio-c01-db02 systemd: corosync.service: main process exited, code=dumped, status=6/ABRT
Apr 14 05:45:30 sio-c01-db02 systemd: pacemaker.service: main process exited, code=exited, status=107/n/a
Apr 14 05:45:30 sio-c01-db02 systemd: Unit pacemaker.service entered failed state.
Apr 14 05:45:30 sio-c01-db02 systemd: Unit corosync.service entered failed state.
Apr 14 05:45:30 sio-c01-db02 abrt-server: Generating core_backtrace
Apr 14 05:45:30 sio-c01-db02 systemd: pacemaker.service holdoff time over, scheduling restart.
Apr 14 05:45:30 sio-c01-db02 systemd: Stopping Pacemaker High Availability Cluster Manager...
Apr 14 05:45:30 sio-c01-db02 systemd: Starting Corosync Cluster Engine...
Apr 14 05:45:30 sio-c01-db02 corosync[11280]: [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
Apr 14 05:45:30 sio-c01-db02 corosync[11280]: [MAIN  ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
Apr 14 05:45:30 sio-c01-db02 corosync[11281]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Apr 14 05:45:30 sio-c01-db02 corosync[11281]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] The network interface [10.1.1.3] is now up.
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync configuration map access [0]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: cmap
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync configuration service [1]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: cfg
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: cpg
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync profile loading service [4]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QUORUM] Using quorum provider corosync_votequorum
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: votequorum
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QB    ] server name: quorum
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] adding new UDPU member {10.1.1.2}
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] adding new UDPU member {10.1.1.3}
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] A new membership (10.1.1.3:28) was formed. Members joined: 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [TOTEM ] A new membership (10.1.1.2:32) was formed. Members joined: 1
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QUORUM] This node is within the primary component and will provide service.
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [QUORUM] Members[2]: 1 2
Apr 14 05:45:31 sio-c01-db02 corosync[11281]: [MAIN  ] Completed service synchronization, ready to provide service.
Apr 14 05:45:31 sio-c01-db02 corosync: Starting Corosync Cluster Engine (corosync): [ OK ]
Apr 14 05:45:31 sio-c01-db02 systemd: Started Corosync Cluster Engine.
Apr 14 05:45:31 sio-c01-db02 systemd: Starting Pacemaker High Availability Cluster Manager...
Apr 14 05:45:31 sio-c01-db02 systemd: Started Pacemaker High Availability Cluster Manager.
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
Apr 14 05:45:31 sio-c01-db02 pacemakerd: crm_ipc_connect: Could not establish pacemakerd connection: Connection refused (111)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: mcp_read_config: Configured corosync to accept connections from group 189: OK (1)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: main: Starting Pacemaker 1.1.12 (Build: a14efad): generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc upstart systemd nagios corosync-native atomic-attrd acls
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: find_and_track_existing_processes: Tracking existing lrmd process (pid=20237)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: find_and_track_existing_processes: Tracking existing pengine process (pid=20239)
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: cluster_connect_quorum: Quorum acquired
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node sio-c01-db01a[1] - state is now member (was (null))
Apr 14 05:45:31 sio-c01-db02 pacemakerd[11297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node sio-c01-db02a[2] - state is now member (was (null))

and so on...

Created attachment 1014181 [details]
pcs config output
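As a side note on David's reading of the backtrace above: the value 4294967274 in install_pos is simply -22, i.e. -EINVAL on Linux, viewed as an unsigned 32-bit number (2^32 - 22 = 4294967274). The following is a minimal, purely illustrative C snippet, not libqb code; it only demonstrates the two's-complement reinterpretation, assuming a platform where int is 32 bits and EINVAL is 22 (as on RHEL/CentOS 7 x86_64).

#include <errno.h>
#include <stdio.h>

int main(void)
{
    /* The kind of value an insert/grow helper might return when an
     * internal array hits its hard limit. */
    int rc = -EINVAL;                       /* -22 */

    /* Viewed through an unsigned 32-bit lens, -22 becomes 4294967274,
     * the exact number printed for install_pos in the backtrace. */
    unsigned int as_unsigned = (unsigned int)rc;

    printf("signed: %d  unsigned: %u\n", rc, as_unsigned);
    return 0;
}

So install_pos was never a valid slot index at all; it was an error return that ended up being displayed as a huge unsigned value.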
Ok, so I just downgraded corosync from:

corosync-2.3.4-4.el7.x86_64
corosynclib-2.3.4-4.el7.x86_64

to

corosync-2.3.3-2.el7.x86_64
corosynclib-2.3.3-2.el7.x86_64

We'll see tomorrow.

I've tracked this down. The bug is in libqb 0.17.1. I'll have a fix shortly.

-- David

I've posted the patch upstream.

https://github.com/ClusterLabs/libqb/pull/147

This issue was exposed by a recent change made to the IPC server code. The bug has nothing to do with the IPC changes themselves, other than that those changes altered the IPC server's usage of mainloop slightly. Unfortunately, this alteration causes libqb's mainloop code to not reclaim resources correctly. Eventually the unreclaimed resources hit a hard limit within libqb, which triggers an assert and causes corosync to crash.

(A purely illustrative sketch of this leak-until-hard-limit pattern appears at the end of this report.)

-- David

Ok - so I upgraded back to the latest corosync. Thank you for hunting it down and submitting the patch. I'm really looking forward to the fixed package.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2200.html
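For readers who want to picture the failure mode David describes, here is a minimal sketch of the general pattern: a poll-entry table that is never given its slots back on disconnect grows by one entry per connection until a hard cap is hit, at which point the insert path returns -EINVAL and the caller asserts. This is not libqb's actual implementation or the code changed in the pull request; all names (poll_entry_install, poll_entry_remove, MAX_ENTRIES) are hypothetical and only the shape of the bug is shown.

#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_ENTRIES (1 << 16)           /* hypothetical 2^16 hard limit */

struct poll_table {
    size_t used;                        /* grows monotonically if slots leak */
};

/* Hypothetical helper: claim a slot for a new fd.
 * Correct behaviour would first reuse a previously released slot. */
static int poll_entry_install(struct poll_table *t)
{
    if (t->used >= MAX_ENTRIES)
        return -EINVAL;                 /* table full: the -22 seen as 4294967274 */
    return (int)t->used++;
}

/* Buggy teardown: the slot index is forgotten rather than recycled. */
static void poll_entry_remove(struct poll_table *t, int pos)
{
    (void)t;
    (void)pos;                          /* nothing reclaimed -> slow leak */
}

int main(void)
{
    struct poll_table t = { 0 };

    /* Each IPC connection installs an entry; each disconnect should free one.
     * Because remove leaks the slot, the assert fires on the 65537th connection. */
    for (unsigned long i = 0; i < 70000; i++) {
        int pos = poll_entry_install(&t);
        assert(pos >= 0);
        poll_entry_remove(&t, pos);
    }
    return 0;
}

At roughly one connect/disconnect cycle every half second, 65536 leaked slots take on the order of 9-10 hours to accumulate, which matches the crash interval Frank observed. The actual fix in libqb-0.17.1-2.el7 makes the mainloop clean up after a terminated connection so the table no longer grows without bound.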