Bug 1245653 - repeating lrmd segfaults and exit leading to fencing
Summary: repeating lrmd segfaults and exit leading to fencing
Keywords:
Status: CLOSED DUPLICATE of bug 1246291
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Andrew Beekhof
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-22 13:23 UTC by John Ruemker
Modified: 2019-07-11 09:41 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-27 01:29:55 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
Red Hat Knowledge Base (Solution) 1538173 (last updated: never)

Description John Ruemker 2015-07-22 13:23:06 UTC
Description of problem: The customer has a 3-node RHEL-OSP HA cluster that they powered off entirely for a planned power outage.  Upon bringing the nodes back into service, they began experiencing repeated reboots/fencing, and this has not stopped even after several recovery attempts (stopping/starting the cluster, putting nodes in standby, enabling maintenance-mode, etc.).

There is a lot of activity in the logs, so it's very difficult to get a clear picture of what exactly is happening.  Here's what appears to happen (seemingly in the order listed), but it's not clear which of these events are related to or causing the others:

- Failure to retrieve metadata for galera agent

  Jul  6 08:57:00 slnec1ctl1 crmd[33886]: error: generic_get_metadata: Failed to retrieve meta-data for ocf:heartbeat:galera
  Jul  6 08:57:00 slnec1ctl1 crmd[33886]: warning: get_rsc_metadata: No metadata found for galera::ocf:heartbeat: Input/output error (-5)

- lrmd assertion in lrmd_ipc_dispatch:

Jul  6 08:57:02 slnec1ctl1 lrmd[4164]: error: crm_abort: lrmd_ipc_dispatch: Triggered assert at main.c:123 : flags & crm_ipc_client_response
Jul  6 08:57:02 slnec1ctl1 lrmd[4164]: error: lrmd_ipc_dispatch: Invalid client request: 0x1219ce0

- crmd disconnect and exit, followed by respawn:

Jul  6 08:57:12 slnec1ctl1 crmd[33886]: crit: lrm_connection_destroy: LRM Connection failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: warning: do_update_resource: Resource pcmk-slnec1ctl1 no longer exists in the lrmd
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: qb_ipcs_event_sendv: new_event_notification (4164-33886-8): Bad file descriptor (9)
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: notice: process_lrm_event: Operation pcmk-slnec1ctl1_stop_0: ok (node=pcmk-slnec1ctl1, call=2, rc=0, cib-update=0, confirmed=true)
Jul  6 08:57:12 slnec1ctl1 attrd[4166]: notice: attrd_peer_remove: Removing all pcmk-slnec1ctl1 attributes for pcmk-slnec1ctl1
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: warning: do_recover: Fast-tracking shutdown in response to errors
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 attrd[4166]: notice: attrd_peer_remove: Removing all pcmk-slnec1ctl2 attributes for pcmk-slnec1ctl2
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: notice: do_lrm_control: Disconnected from the LRM
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: notice: terminate_cs_connection: Disconnecting from Corosync
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 crmd[33886]: error: crmd_fast_exit: Could not recover from internal error
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul  6 08:57:12 slnec1ctl1 pacemakerd[4050]: error: pcmk_child_exit: The crmd process (33886) exited: Generic Pacemaker error (201)
Jul  6 08:57:12 slnec1ctl1 pacemakerd[4050]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jul  6 08:57:12 slnec1ctl1 crmd[36596]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log

- In _some_ instances, lrmd then segfaults:

Jul  6 09:57:28 slnec1ctl1 kernel: lrmd[4236]: segfault at 1 ip 00007f9c2d90e8b8 sp 00007fffaca1edb8 error 4 in libxml2.so.2.9.1[7f9c2d860000+15e000]
Jul  6 09:57:28 slnec1ctl1 pacemakerd[4107]: error: child_waitpid: Managed process 4236 (lrmd) dumped core
Jul  6 09:57:28 slnec1ctl1 pacemakerd[4107]: error: pcmk_child_exit: The lrmd process (4236) terminated with signal 11 (core=1)
Jul  6 09:57:28 slnec1ctl1 pacemakerd[4107]: notice: pcmk_process_exit: Respawning failed child process: lrmd

- In some instances the system gets fenced.


This has been going on for weeks, and it is constant.  They gave us /var/lib/pacemaker/cores from node 1, and it contained over 1300 lrmd cores.  A random sampling of those backtraces follows:


Thread 1 (Thread 0x7f137e8bc880 (LWP 49039)):
#0  0x00007f137cfa88b8 in xmlStrEqual () from /lib64/libxml2.so.2
#1  0x00007f137cf565ed in xmlHasProp () from /lib64/libxml2.so.2
#2  0x00007f137e4838b7 in crm_element_value (data=<optimized out>, name=<optimized out>) at xml.c:6068
#3  0x00007f137e48bf8c in hash2field (key=0x6f6974617265706f, value=0x1, user_data=user_data@entry=0x1aa2aa0) at xml.c:4932
#4  0x0000000000403e82 in send_cmd_complete_notify (cmd=0x1ae1a60) at lrmd.c:471
#5  cmd_finalize (cmd=cmd@entry=0x1ae1a60, rsc=rsc@entry=0x0) at lrmd.c:540
#6  0x000000000040525e in action_complete (action=<optimized out>) at lrmd.c:892
#7  0x00007f137e255950 in operation_finalize (op=op@entry=0x1ae5b00) at services_linux.c:254
#8  0x00007f137e25bd86 in systemd_unit_exec_with_unit (op=op@entry=0x1ae5b00, unit=0x1af1f84 "/org/freedesktop/systemd1/unit/openstack_2dceilometer_2dalarm_2devaluator_2eservice")
    at systemd.c:643
#9  0x00007f137e25c041 in systemd_loadunit_result (reply=reply@entry=0x1ae6c00, op=op@entry=0x1ae5b00) at systemd.c:175
#10 0x00007f137e25c341 in systemd_loadunit_cb (pending=0x1ade1d0, user_data=0x1ae5b00) at systemd.c:197
#11 0x00007f137de05862 in complete_pending_call_and_unlock () from /lib64/libdbus-1.so.3
#12 0x00007f137de08b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#13 0x00007f137e257eb0 in pcmk_dbus_connection_dispatch (connection=0x1aacca0, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:392
#14 0x00007f137de07260 in _dbus_connection_update_dispatch_status_and_unlock () from /lib64/libdbus-1.so.3
#15 0x00007f137de07386 in _dbus_connection_handle_watch () from /lib64/libdbus-1.so.3
#16 0x00007f137de1f63a in dbus_watch_handle () from /lib64/libdbus-1.so.3
#17 0x00007f137e25783a in pcmk_dbus_watch_dispatch (userdata=0x1aa8bd0) at dbus.c:430
#18 0x00007f137e49c627 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x1aa6d70) at mainloop.c:665
#19 0x00007f137c43b9ba in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#20 0x00007f137c43bd08 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#21 0x00007f137c43bfda in g_main_loop_run () from /lib64/libglib-2.0.so.0
#22 0x000000000040286e in main (argc=<optimized out>, argv=0x7ffd2b834ed8) at main.c:350
(gdb) frame 4
#4  0x0000000000403e82 in send_cmd_complete_notify (cmd=0x1ae1a60) at lrmd.c:471
471                 hash2field((gpointer) key, (gpointer) value, args);
(gdb) l
466
467             xmlNode *args = create_xml_node(notify, XML_TAG_ATTRS);
468
469             g_hash_table_iter_init(&iter, cmd->params);
470             while (g_hash_table_iter_next(&iter, (gpointer *) & key, (gpointer *) & value)) {
471                 hash2field((gpointer) key, (gpointer) value, args);
472             }
473         }
474
475         if (cmd->client_id && (cmd->call_opts & lrmd_opt_notify_orig_only)) {
(gdb) p cmd
$1 = (lrmd_cmd_t *) 0x1ae1a60
(gdb) p key
$2 = 0x6f6974617265706f <Address 0x6f6974617265706f out of bounds>
(gdb) p value
$3 = 0x1 <Address 0x1 out of bounds>
(gdb) p args
$4 = (xmlNode *) 0x1aa2aa0
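
One detail worth noting from this first core: the "key" value above is not a pointer at all.  Read as little-endian bytes, 0x6f6974617265706f spells "operatio", i.e. the start of a word like "operation", which suggests the hash-table entry (or the cmd it belongs to) was overwritten with string data (heap corruption or a use-after-free) before send_cmd_complete_notify() walked cmd->params.  A minimal, standalone C sketch (illustration only, not pacemaker code) showing the decoding:

/* Decodes the bogus "key" pointer from the first backtrace as raw bytes.
 * On an x86_64 (little-endian) host this reproduces the byte order the
 * debugger saw in memory and prints "operatio". */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint64_t bogus_key = 0x6f6974617265706fULL;   /* value of "key" in the core */
    char bytes[sizeof(bogus_key) + 1] = { 0 };

    memcpy(bytes, &bogus_key, sizeof(bogus_key)); /* copies the low-order byte first */
    printf("key bytes as ASCII: \"%s\"\n", bytes);
    return 0;
}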



Thread 1 (Thread 0x7f6104d78880 (LWP 49488)):
#0  0x00007f61028e7caa in g_hash_table_iter_init () from /lib64/libglib-2.0.so.0
#1  0x0000000000403e6c in send_cmd_complete_notify (cmd=0x1f7e450) at lrmd.c:469
#2  cmd_finalize (cmd=cmd@entry=0x1f7e450, rsc=rsc@entry=0x0) at lrmd.c:540
#3  0x000000000040525e in action_complete (action=<optimized out>) at lrmd.c:892
#4  0x00007f6104711950 in operation_finalize (op=op@entry=0x1f6f800) at services_linux.c:254
#5  0x00007f6104717d86 in systemd_unit_exec_with_unit (op=op@entry=0x1f6f800, unit=0x1fbeff4 "/org/freedesktop/systemd1/unit/httpd_2eservice") at systemd.c:643
#6  0x00007f6104718041 in systemd_loadunit_result (reply=reply@entry=0x1f69ea0, op=op@entry=0x1f6f800) at systemd.c:175
#7  0x00007f6104718341 in systemd_loadunit_cb (pending=0x1f6eea0, user_data=0x1f6f800) at systemd.c:197
#8  0x00007f61042c1862 in complete_pending_call_and_unlock () from /lib64/libdbus-1.so.3
#9  0x00007f61042c4b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#10 0x00007f6104713eb0 in pcmk_dbus_connection_dispatch (connection=0x1f6a020, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:392
#11 0x00007f61042c3260 in _dbus_connection_update_dispatch_status_and_unlock () from /lib64/libdbus-1.so.3
#12 0x00007f61042c3386 in _dbus_connection_handle_watch () from /lib64/libdbus-1.so.3
#13 0x00007f61042db63a in dbus_watch_handle () from /lib64/libdbus-1.so.3
#14 0x00007f610471383a in pcmk_dbus_watch_dispatch (userdata=0x1f69c90) at dbus.c:430
#15 0x00007f6104958627 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x1f6a8a0) at mainloop.c:665
#16 0x00007f61028f89ba in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#17 0x00007f61028f8d08 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#18 0x00007f61028f8fda in g_main_loop_run () from /lib64/libglib-2.0.so.0
#19 0x000000000040286e in main (argc=<optimized out>, argv=0x7fffd709ade8) at main.c:350
(gdb) frame 1
#1  0x0000000000403e6c in send_cmd_complete_notify (cmd=0x1f7e450) at lrmd.c:469
469             g_hash_table_iter_init(&iter, cmd->params);
(gdb) l
464             char *value = NULL;
465             GHashTableIter iter;
466
467             xmlNode *args = create_xml_node(notify, XML_TAG_ATTRS);
468
469             g_hash_table_iter_init(&iter, cmd->params);
470             while (g_hash_table_iter_next(&iter, (gpointer *) & key, (gpointer *) & value)) {
471                 hash2field((gpointer) key, (gpointer) value, args);
472             }
473         }
(gdb) p cmd
$1 = (lrmd_cmd_t *) 0x1f7e450
(gdb) p iter
$2 = {dummy1 = 0x1fbced8, dummy2 = 0x7f61042cccb7 <_dbus_message_byteswap+39>, dummy3 = 0x7fffd709a738, dummy4 = -687233204, dummy5 = 32767, dummy6 = 0x1f6e970}




Thread 1 (Thread 0x7f2a52f24880 (LWP 49948)):
#0  0x00007f2a528bd996 in operation_finalize (op=op@entry=0x213a4b0) at services_linux.c:248
#1  0x00007f2a528c3d86 in systemd_unit_exec_with_unit (op=op@entry=0x213a4b0, unit=0x214ed84 "/org/freedesktop/systemd1/unit/openstack_2dceilometer_2dalarm_2devaluator_2eservice")
    at systemd.c:643
#2  0x00007f2a528c4041 in systemd_loadunit_result (reply=reply@entry=0x2148d20, op=op@entry=0x213a4b0) at systemd.c:175
#3  0x00007f2a528c4341 in systemd_loadunit_cb (pending=0x213fb20, user_data=0x213a4b0) at systemd.c:197
#4  0x00007f2a5246d862 in complete_pending_call_and_unlock () from /lib64/libdbus-1.so.3
#5  0x00007f2a52470b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#6  0x00007f2a528bfeb0 in pcmk_dbus_connection_dispatch (connection=0x210ecc0, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:392
#7  0x00007f2a5246f260 in _dbus_connection_update_dispatch_status_and_unlock () from /lib64/libdbus-1.so.3
#8  0x00007f2a5246f386 in _dbus_connection_handle_watch () from /lib64/libdbus-1.so.3
#9  0x00007f2a5248763a in dbus_watch_handle () from /lib64/libdbus-1.so.3
#10 0x00007f2a528bf83a in pcmk_dbus_watch_dispatch (userdata=0x210ef30) at dbus.c:430
#11 0x00007f2a52b04627 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x2108c90) at mainloop.c:665
#12 0x00007f2a50aa49ba in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#13 0x00007f2a50aa4d08 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#14 0x00007f2a50aa4fda in g_main_loop_run () from /lib64/libglib-2.0.so.0
#15 0x000000000040286e in main (argc=<optimized out>, argv=0x7fff38a90078) at main.c:350
(gdb) frame 0
#0  0x00007f2a528bd996 in operation_finalize (op=op@entry=0x213a4b0) at services_linux.c:248
248                 op->opaque->repeat_timer = g_timeout_add(op->interval,
(gdb) l
243             if (op->cancel) {
244                 op->status = PCMK_LRM_OP_CANCELLED;
245                 cancel_recurring_action(op);
246             } else {
247                 recurring = 1;
248                 op->opaque->repeat_timer = g_timeout_add(op->interval,
249                                                          recurring_action_timer, (void *)op);
250             }
251         }
252
(gdb) p op
$1 = (svc_action_t *) 0x213a4b0
(gdb) p op->opaque
$2 = (svc_action_private_t *) 0x90
(gdb) p recurring_action_timer
$3 = {gboolean (gpointer)} 0x7f2a528bceb0 <recurring_action_timer>
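
All three samples die in the completion path for an asynchronous systemd unit-load reply (systemd_loadunit_cb -> systemd_unit_exec_with_unit -> operation_finalize / cmd_finalize), and in each case the code dereferences memory that no longer looks like a live object: a "key" holding string bytes, an iterator whose fields look like unrelated data, an opaque pointer of 0x90.  That is consistent with a late or duplicate completion touching an operation that has already been freed.  Purely as an illustration of one defensive pattern against that failure class (hypothetical names, GLib-based, not pacemaker's implementation): have the completion callback look the operation up in a table of live operations instead of trusting a stored pointer, so a stale completion becomes a harmless no-op.

/* Illustrative sketch only: a generic guard against late/duplicate async
 * completions.  The names (async_op_t, live_ops, complete_cb) are
 * hypothetical and this is not pacemaker code.  Build with:
 *   gcc demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

typedef struct {
    char *id;            /* operation key, owned by the op */
    guint repeat_timer;
} async_op_t;

static GHashTable *live_ops;    /* id -> async_op_t*, owns the ops */

static void op_free(gpointer data)
{
    async_op_t *op = data;

    g_free(op->id);
    g_free(op);
}

static void finalize(const char *id)
{
    /* Removing the entry frees the op via op_free(). */
    g_hash_table_remove(live_ops, id);
}

static void complete_cb(const char *id)
{
    /* Look the operation up by id rather than dereferencing a stored
     * pointer; a stale completion for an already-finalized op is then a
     * no-op instead of a use-after-free. */
    async_op_t *op = g_hash_table_lookup(live_ops, id);

    if (op == NULL) {
        fprintf(stderr, "stale completion for %s ignored\n", id);
        return;
    }
    op->repeat_timer = 0;   /* safe: op is known to be live */
    finalize(id);
}

int main(void)
{
    live_ops = g_hash_table_new_full(g_str_hash, g_str_equal, NULL, op_free);

    async_op_t *op = g_new0(async_op_t, 1);
    op->id = g_strdup("httpd_monitor_60000");
    g_hash_table_insert(live_ops, op->id, op);

    finalize("httpd_monitor_60000");    /* first completion path frees the op */
    complete_cb("httpd_monitor_60000"); /* late callback is safely ignored */

    g_hash_table_destroy(live_ops);
    return 0;
}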


What we need:

- Fix the segfault, of course
- Understand exactly what sequence of events is triggering all of this, so we can figure out how to address it in the short term and get the cluster running properly again
 

Version-Release number of selected component (if applicable): pacemaker-1.1.12-22.el7_1.2, libqb-0.17.1-1.el7_1.2


How reproducible: Constantly for this customer.  No idea otherwise. 


Steps to Reproduce: Unknown


Actual results: Constant fencing/rebooting of nodes


Expected results: Stable cluster


Additional info:

Comment 2 Andrew Beekhof 2015-07-23 11:26:43 UTC
Something is very very messed up here.

Can someone 

0. stop pacemaker everywhere
1. install valgrind 
2. add the following to the bottom of /etc/sysconfig/pacemaker

G_SLICE=always-malloc
MALLOC_PERTURB_=221
MALLOC_CHECK_=3    
PCMK_valgrind_enabled=crmd,lrmd
VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p"

3. start pacemaker on one node only
4. wait for things to go mental
5. send us /var/lib/pacemaker/valgrind-*

If 4 doesn't happen, start more nodes until it does

Comment 4 Andrew Beekhof 2015-07-24 06:18:25 UTC
The request was for:

> 5. send us /var/lib/pacemaker/valgrind-*

Which are not included in the sosreports.  
Do they not exist or did sosreport not grab them?

Comment 5 Andrew Beekhof 2015-07-24 07:23:00 UTC
If we must run sosreport, can we please make sure the cluster module is active?
It has a tonne of additional information.

Comment 6 Andrew Beekhof 2015-07-24 07:24:08 UTC
And finally, slnec1ctl1.rhelosp.vzw.demo/installed-rpms suggests that valgrind is not even installed

Comment 9 Andrew Beekhof 2015-07-27 01:29:55 UTC
From comment #17 of bug #1246291

I'm almost certain that this patch would fix the problem:

commit d119e21a2b37e31bc3676ae8a19ad0473bdd217f
Author: David Vossel <dvossel>
Date:   Fri Jun 5 15:07:53 2015 -0400

    Fix: crmd: handle resources named the same as cluster nodes


Reason being that I see this in the cib:

          <lrm_resource id="..." type="remote" class="ocf" provider="pacemaker">

Even though pacemaker-remote is not intended to be in use there.

Work-around: add a "fence-" prefix to all the fencing resource definitions so that they no longer exactly match the node names.

*** This bug has been marked as a duplicate of bug 1246291 ***

