Description of problem: Customer has a 3-node RHEL-OSP HA cluster that they powered off entirely for a planned power outage. Upon bringing the nodes back into service, they began experiencing repeated reboots/fencing, and this has not stopped even after several attempts to stop/start the cluster, standby, maintenance-mode, etc.
There is a lot of activity in the logs, so it's very difficult to get a clear picture of what exactly is happening. Here's what appears to happen (seemingly in the order listed), but it's not clear what is related to, or causing, what else:
- Failure to retrieve metadata for galera agent
Jul 6 08:57:00 slnec1ctl1 crmd[33886]: error: generic_get_metadata: Failed to retrieve meta-data for ocf:heartbeat:galera
Jul 6 08:57:00 slnec1ctl1 crmd[33886]: warning: get_rsc_metadata: No metadata found for galera::ocf:heartbeat: Input/output error (-5)
- lrmd assertion in lrmd_ipc_dispatch:
Jul 6 08:57:02 slnec1ctl1 lrmd[4164]: error: crm_abort: lrmd_ipc_dispatch: Triggered assert at main.c:123 : flags & crm_ipc_client_response
Jul 6 08:57:02 slnec1ctl1 lrmd[4164]: error: lrmd_ipc_dispatch: Invalid client request: 0x1219ce0
- crmd disconnect and exit, followed by respawn:
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: crit: lrm_connection_destroy: LRM Connection failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: warning: do_update_resource: Resource pcmk-slnec1ctl1 no longer exists in the lrmd
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: qb_ipcs_event_sendv: new_event_notification (4164-33886-8): Bad file descriptor (9)
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: notice: process_lrm_event: Operation pcmk-slnec1ctl1_stop_0: ok (node=pcmk-slnec1ctl1, call=2, rc=0, cib-update=0, confirmed=true)
Jul 6 08:57:12 slnec1ctl1 attrd[4166]: notice: attrd_peer_remove: Removing all pcmk-slnec1ctl1 attributes for pcmk-slnec1ctl1
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: warning: do_recover: Fast-tracking shutdown in response to errors
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 attrd[4166]: notice: attrd_peer_remove: Removing all pcmk-slnec1ctl2 attributes for pcmk-slnec1ctl2
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: notice: do_lrm_control: Disconnected from the LRM
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: notice: terminate_cs_connection: Disconnecting from Corosync
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 crmd[33886]: error: crmd_fast_exit: Could not recover from internal error
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 lrmd[4164]: warning: send_client_notify: Notification of client crmd/8988b67f-ff65-4a39-a330-69efcbf12567 failed
Jul 6 08:57:12 slnec1ctl1 pacemakerd[4050]: error: pcmk_child_exit: The crmd process (33886) exited: Generic Pacemaker error (201)
Jul 6 08:57:12 slnec1ctl1 pacemakerd[4050]: notice: pcmk_process_exit: Respawning failed child process: crmd
Jul 6 08:57:12 slnec1ctl1 crmd[36596]: notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
- In _some_ instances, lrmd then segfaults:
Jul 6 09:57:28 slnec1ctl1 kernel: lrmd[4236]: segfault at 1 ip 00007f9c2d90e8b8 sp 00007fffaca1edb8 error 4 in libxml2.so.2.9.1[7f9c2d860000+15e000]
Jul 6 09:57:28 slnec1ctl1 pacemakerd[4107]: error: child_waitpid: Managed process 4236 (lrmd) dumped core
Jul 6 09:57:28 slnec1ctl1 pacemakerd[4107]: error: pcmk_child_exit: The lrmd process (4236) terminated with signal 11 (core=1)
Jul 6 09:57:28 slnec1ctl1 pacemakerd[4107]: notice: pcmk_process_exit: Respawning failed child process: lrmd
- In some instances the system gets fenced.
This has been going on for weeks, and it is constant. They gave us /var/lib/pacemaker/cores from node 1, and it contained over 1300 lrmd cores. A random sampling of those backtraces follows:
Thread 1 (Thread 0x7f137e8bc880 (LWP 49039)):
#0 0x00007f137cfa88b8 in xmlStrEqual () from /lib64/libxml2.so.2
#1 0x00007f137cf565ed in xmlHasProp () from /lib64/libxml2.so.2
#2 0x00007f137e4838b7 in crm_element_value (data=<optimized out>, name=<optimized out>) at xml.c:6068
#3 0x00007f137e48bf8c in hash2field (key=0x6f6974617265706f, value=0x1, user_data=user_data@entry=0x1aa2aa0) at xml.c:4932
#4 0x0000000000403e82 in send_cmd_complete_notify (cmd=0x1ae1a60) at lrmd.c:471
#5 cmd_finalize (cmd=cmd@entry=0x1ae1a60, rsc=rsc@entry=0x0) at lrmd.c:540
#6 0x000000000040525e in action_complete (action=<optimized out>) at lrmd.c:892
#7 0x00007f137e255950 in operation_finalize (op=op@entry=0x1ae5b00) at services_linux.c:254
#8 0x00007f137e25bd86 in systemd_unit_exec_with_unit (op=op@entry=0x1ae5b00, unit=0x1af1f84 "/org/freedesktop/systemd1/unit/openstack_2dceilometer_2dalarm_2devaluator_2eservice")
at systemd.c:643
#9 0x00007f137e25c041 in systemd_loadunit_result (reply=reply@entry=0x1ae6c00, op=op@entry=0x1ae5b00) at systemd.c:175
#10 0x00007f137e25c341 in systemd_loadunit_cb (pending=0x1ade1d0, user_data=0x1ae5b00) at systemd.c:197
#11 0x00007f137de05862 in complete_pending_call_and_unlock () from /lib64/libdbus-1.so.3
#12 0x00007f137de08b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#13 0x00007f137e257eb0 in pcmk_dbus_connection_dispatch (connection=0x1aacca0, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:392
#14 0x00007f137de07260 in _dbus_connection_update_dispatch_status_and_unlock () from /lib64/libdbus-1.so.3
#15 0x00007f137de07386 in _dbus_connection_handle_watch () from /lib64/libdbus-1.so.3
#16 0x00007f137de1f63a in dbus_watch_handle () from /lib64/libdbus-1.so.3
#17 0x00007f137e25783a in pcmk_dbus_watch_dispatch (userdata=0x1aa8bd0) at dbus.c:430
#18 0x00007f137e49c627 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x1aa6d70) at mainloop.c:665
#19 0x00007f137c43b9ba in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#20 0x00007f137c43bd08 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#21 0x00007f137c43bfda in g_main_loop_run () from /lib64/libglib-2.0.so.0
#22 0x000000000040286e in main (argc=<optimized out>, argv=0x7ffd2b834ed8) at main.c:350
(gdb) frame 4
#4 0x0000000000403e82 in send_cmd_complete_notify (cmd=0x1ae1a60) at lrmd.c:471
471 hash2field((gpointer) key, (gpointer) value, args);
(gdb) l
466
467 xmlNode *args = create_xml_node(notify, XML_TAG_ATTRS);
468
469 g_hash_table_iter_init(&iter, cmd->params);
470 while (g_hash_table_iter_next(&iter, (gpointer *) & key, (gpointer *) & value)) {
471 hash2field((gpointer) key, (gpointer) value, args);
472 }
473 }
474
475 if (cmd->client_id && (cmd->call_opts & lrmd_opt_notify_orig_only)) {
(gdb) p cmd
$1 = (lrmd_cmd_t *) 0x1ae1a60
(gdb) p key
$2 = 0x6f6974617265706f <Address 0x6f6974617265706f out of bounds>
(gdb) p value
$3 = 0x1 <Address 0x1 out of bounds>
(gdb) p args
$4 = (xmlNode *) 0x1aa2aa0
Thread 1 (Thread 0x7f6104d78880 (LWP 49488)):
#0 0x00007f61028e7caa in g_hash_table_iter_init () from /lib64/libglib-2.0.so.0
#1 0x0000000000403e6c in send_cmd_complete_notify (cmd=0x1f7e450) at lrmd.c:469
#2 cmd_finalize (cmd=cmd@entry=0x1f7e450, rsc=rsc@entry=0x0) at lrmd.c:540
#3 0x000000000040525e in action_complete (action=<optimized out>) at lrmd.c:892
#4 0x00007f6104711950 in operation_finalize (op=op@entry=0x1f6f800) at services_linux.c:254
#5 0x00007f6104717d86 in systemd_unit_exec_with_unit (op=op@entry=0x1f6f800, unit=0x1fbeff4 "/org/freedesktop/systemd1/unit/httpd_2eservice") at systemd.c:643
#6 0x00007f6104718041 in systemd_loadunit_result (reply=reply@entry=0x1f69ea0, op=op@entry=0x1f6f800) at systemd.c:175
#7 0x00007f6104718341 in systemd_loadunit_cb (pending=0x1f6eea0, user_data=0x1f6f800) at systemd.c:197
#8 0x00007f61042c1862 in complete_pending_call_and_unlock () from /lib64/libdbus-1.so.3
#9 0x00007f61042c4b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#10 0x00007f6104713eb0 in pcmk_dbus_connection_dispatch (connection=0x1f6a020, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:392
#11 0x00007f61042c3260 in _dbus_connection_update_dispatch_status_and_unlock () from /lib64/libdbus-1.so.3
#12 0x00007f61042c3386 in _dbus_connection_handle_watch () from /lib64/libdbus-1.so.3
#13 0x00007f61042db63a in dbus_watch_handle () from /lib64/libdbus-1.so.3
#14 0x00007f610471383a in pcmk_dbus_watch_dispatch (userdata=0x1f69c90) at dbus.c:430
#15 0x00007f6104958627 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x1f6a8a0) at mainloop.c:665
#16 0x00007f61028f89ba in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#17 0x00007f61028f8d08 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#18 0x00007f61028f8fda in g_main_loop_run () from /lib64/libglib-2.0.so.0
#19 0x000000000040286e in main (argc=<optimized out>, argv=0x7fffd709ade8) at main.c:350
(gdb) frame 1
#1 0x0000000000403e6c in send_cmd_complete_notify (cmd=0x1f7e450) at lrmd.c:469
469 g_hash_table_iter_init(&iter, cmd->params);
(gdb) l
464 char *value = NULL;
465 GHashTableIter iter;
466
467 xmlNode *args = create_xml_node(notify, XML_TAG_ATTRS);
468
469 g_hash_table_iter_init(&iter, cmd->params);
470 while (g_hash_table_iter_next(&iter, (gpointer *) & key, (gpointer *) & value)) {
471 hash2field((gpointer) key, (gpointer) value, args);
472 }
473 }
(gdb) p cmd
$1 = (lrmd_cmd_t *) 0x1f7e450
(gdb) p iter
$2 = {dummy1 = 0x1fbced8, dummy2 = 0x7f61042cccb7 <_dbus_message_byteswap+39>, dummy3 = 0x7fffd709a738, dummy4 = -687233204, dummy5 = 32767, dummy6 = 0x1f6e970}
Thread 1 (Thread 0x7f2a52f24880 (LWP 49948)):
#0 0x00007f2a528bd996 in operation_finalize (op=op@entry=0x213a4b0) at services_linux.c:248
#1 0x00007f2a528c3d86 in systemd_unit_exec_with_unit (op=op@entry=0x213a4b0, unit=0x214ed84 "/org/freedesktop/systemd1/unit/openstack_2dceilometer_2dalarm_2devaluator_2eservice")
at systemd.c:643
#2 0x00007f2a528c4041 in systemd_loadunit_result (reply=reply@entry=0x2148d20, op=op@entry=0x213a4b0) at systemd.c:175
#3 0x00007f2a528c4341 in systemd_loadunit_cb (pending=0x213fb20, user_data=0x213a4b0) at systemd.c:197
#4 0x00007f2a5246d862 in complete_pending_call_and_unlock () from /lib64/libdbus-1.so.3
#5 0x00007f2a52470b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#6 0x00007f2a528bfeb0 in pcmk_dbus_connection_dispatch (connection=0x210ecc0, new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:392
#7 0x00007f2a5246f260 in _dbus_connection_update_dispatch_status_and_unlock () from /lib64/libdbus-1.so.3
#8 0x00007f2a5246f386 in _dbus_connection_handle_watch () from /lib64/libdbus-1.so.3
#9 0x00007f2a5248763a in dbus_watch_handle () from /lib64/libdbus-1.so.3
#10 0x00007f2a528bf83a in pcmk_dbus_watch_dispatch (userdata=0x210ef30) at dbus.c:430
#11 0x00007f2a52b04627 in mainloop_gio_callback (gio=<optimized out>, condition=G_IO_IN, data=0x2108c90) at mainloop.c:665
#12 0x00007f2a50aa49ba in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#13 0x00007f2a50aa4d08 in g_main_context_iterate.isra.24 () from /lib64/libglib-2.0.so.0
#14 0x00007f2a50aa4fda in g_main_loop_run () from /lib64/libglib-2.0.so.0
#15 0x000000000040286e in main (argc=<optimized out>, argv=0x7fff38a90078) at main.c:350
(gdb) frame 0
#0 0x00007f2a528bd996 in operation_finalize (op=op@entry=0x213a4b0) at services_linux.c:248
248 op->opaque->repeat_timer = g_timeout_add(op->interval,
(gdb) l
243 if (op->cancel) {
244 op->status = PCMK_LRM_OP_CANCELLED;
245 cancel_recurring_action(op);
246 } else {
247 recurring = 1;
248 op->opaque->repeat_timer = g_timeout_add(op->interval,
249 recurring_action_timer, (void *)op);
250 }
251 }
252
(gdb) p op
$1 = (svc_action_t *) 0x213a4b0
(gdb) p op->opaque
$2 = (svc_action_private_t *) 0x90
(gdb) p recurring_action_timer
$3 = {gboolean (gpointer)} 0x7f2a528bceb0 <recurring_action_timer>
What we need:
- Fix the segfault, of course
- Understand the exact sequence of events triggering all of this, so we can address it in the short term and get the cluster running properly again
Version-Release number of selected component (if applicable): pacemaker-1.1.12-22.el7_1.2, libqb-0.17.1-1.el7_1.2
How reproducible: Constantly for this customer. No idea otherwise.
Steps to Reproduce: Unknown
Actual results: Constant fencing/rebooting of nodes
Expected results: Stable cluster
Additional info:
Something is very very messed up here.
Can someone
0. stop pacemaker everywhere
1. install valgrind
2. add the following to the bottom of /etc/sysconfig/pacemaker
G_SLICE=always-malloc
MALLOC_PERTURB_=221
MALLOC_CHECK_=3
PCMK_valgrind_enabled=crmd,lrmd
VALGRIND_OPTS="--leak-check=full --trace-children=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p"
3. start pacemaker on one node only
4. wait for things to go mental
5. send us /var/lib/pacemaker/valgrind-*
If step 4 doesn't happen, start more nodes until it does
The request was for:
> 5. send us /var/lib/pacemaker/valgrind-*
Which are not included in the sosreports.
Do they not exist or did sosreport not grab them?
From comment #17 of bug #1246291
I'm almost certain that this patch would fix the problem:
commit d119e21a2b37e31bc3676ae8a19ad0473bdd217f
Author: David Vossel <dvossel>
Date: Fri Jun 5 15:07:53 2015 -0400
Fix: crmd: handle resources named the same as cluster nodes
Reason being that I see this in the cib:
<lrm_resource id="..." type="remote" class="ocf" provider="pacemaker">
Even though pacemaker-remote is not intended to be in use here.
Work-around: add a "fence-" prefix to all the fencing resource definitions so that they no longer exactly match the node names.
*** This bug has been marked as a duplicate of bug 1246291 ***