Description of problem:
When a RHEL OSP7 RC overcloud is configured by OSP-director with controller HA using 3 nodes (and fencing configured using fence_xvm), lrmd sometimes crashes with a segmentation fault.

Version-Release number of selected component:
pacemaker-1.1.12-22.el7_1.2

Additional info:
reporter:       libreport-2.1.11
cmdline:        /usr/libexec/pacemaker/lrmd
executable:     /usr/libexec/pacemaker/lrmd
global_uuid:    566d6d8d0ad5cef6e984df4fa09b28d7355d8e53
kernel:         3.10.0-229.7.2.el7.x86_64
runlevel:       N 3
type:           CCpp
uid:            0
Created attachment 1055540 [details] File: cgroup
Created attachment 1055541 [details] File: core_backtrace
Created attachment 1055542 [details] File: dso_list
Created attachment 1055543 [details] File: environ
Created attachment 1055544 [details] File: limits
Created attachment 1055545 [details] File: machineid
Created attachment 1055546 [details] File: maps
Created attachment 1055547 [details] File: open_fds
Created attachment 1055548 [details] File: proc_pid_status
Created attachment 1055549 [details] File: var_log_messages
Created attachment 1055550 [details] File: sosreport-overcloud-controller-0.localdomain-20150723195046.tar.xz
Created attachment 1055551 [details] File: sosreport.tar.xz
Need crm_report archives from all nodes in the cluster, covering the time the crash occurred, before I can say much. Filtered log files and near-empty sosreports don't help much. crm_report comes with detailed --help text.
Created attachment 1055820 [details]
crm_report log

The crm_report file is attached.

Thanks,
Tomoki
Very helpful, thank you.

If I look at core.6034 from controller 0 and go to stack frame 2, I see:

(gdb) #2 cmd_finalize (cmd=cmd@entry=0xefab40, rsc=rsc@entry=0x0) at lrmd.c:540
540         send_cmd_complete_notify(cmd);
(gdb) p *cmd
$1 = {timeout = 15654448, interval = 0, start_delay = 15697312, timeout_orig = 0,
  call_id = 223, exec_rc = 3, lrmd_op_status = 1, call_opts = 4, delay_id = 0,
  stonith_recurring_id = 0, rsc_deleted = 1,
  client_id = 0xf0d0a0 "@\222", <incomplete sequence \356>,
  origin = 0xef4070 "`\325", <incomplete sequence \354>,
  rsc_id = 0xefb020 "`", <incomplete sequence \357>,
  action = 0xf060b0 "\320", <incomplete sequence \356>,
  real_action = 0x0, exit_reason = 0x0, output = 0x0,
  userdata_str = 0xef6d60 "\240\347", <incomplete sequence \356>,
  t_first_run = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0},
  t_run = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0},
  t_queue = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0},
  t_rcchange = {time = 1437749769, millitm = 703, timezone = 240, dstflag = 0},
  first_notify_sent = 1, last_notify_rc = 3, last_notify_op_status = 1,
  last_pid = 0, params = 0xd0}

which is complete and utter garbage. It's hard to escape the conclusion that this is a use-after-free.

HOWEVER, there is a flurry of activity in the logs prior to the segfault, beginning with:

Jul 24 10:56:09 [6034] overcloud-controller-0.localdomain lrmd: info: cancel_recurring_action: Cancelling operation mongod_status_60000
Jul 24 10:56:09 [6034] overcloud-controller-0.localdomain lrmd: warning: qb_ipcs_event_sendv: new_event_notification (6034-6033-8): Bad file descriptor (9)

Something in the IPC code could be the root cause here (since process 6033 is still alive at the time). It would be useful to know if re-testing with 0.17.1-1 results in the same behaviour.
Another alternative is based on these logs:

Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: notice: te_rsc_command: Initiating action 39: stop overcloud-controller-2_stop_0 on overcloud-controller-2
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_ipc_connection_destroy: IPC connection destroyed
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: crit: lrm_connection_destroy: LRM Connection failed
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: warning: do_update_resource: Resource overcloud-controller-0 no longer exists in the lrmd

The logs here could use some work; it's not clear which connections are being closed, nor whether they are being closed pro-actively or in response to an error. If the latter, that would support the IPC theory above. If the former, then whatever is triggering the disconnection would be the root cause (I'll continue investigating).

To be clear though, either way we need to stop the lrmd from crashing.
I'm almost certain that this patch would fix the problem:

commit d119e21a2b37e31bc3676ae8a19ad0473bdd217f
Author: David Vossel <dvossel>
Date:   Fri Jun 5 15:07:53 2015 -0400

    Fix: crmd: handle resources named the same as cluster nodes

The reason being that I see this in the cib:

<lrm_resource id="overcloud-controller-1" type="remote" class="ocf" provider="pacemaker">

even though pacemaker-remote is not intended to be in use.

Work-around: add a "fence-" prefix to all the fencing resource definitions so that they no longer exactly match the node names.

This explains why the behaviour occurs when processing the result of the node's fencing resource:

Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: warning: do_update_resource: Resource overcloud-controller-0 no longer exists in the lrmd
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: match_graph_event: Action overcloud-controller-0_stop_0 (37) confirmed on overcloud-controller-0 (rc=0)
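For reference, the work-around above amounts to a cib fragment along these lines (a hypothetical sketch with made-up ids and attributes, not taken from this deployment's configuration): the fencing resource id carries the "fence-" prefix so it no longer collides with the node name.

```xml
<!-- hypothetical: stonith resource renamed so its id no longer
     matches the node name "overcloud-controller-0" -->
<primitive id="fence-overcloud-controller-0" class="stonith" type="fence_xvm">
  <instance_attributes id="fence-overcloud-controller-0-attrs">
    <nvpair id="fence-overcloud-controller-0-hosts"
            name="pcmk_host_list" value="overcloud-controller-0"/>
  </instance_attributes>
</primitive>
```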
*** Bug 1245653 has been marked as a duplicate of this bug. ***
Fixes for the lrmd side of things are:

Andrew Beekhof (25 hours ago) b2b4950: Fix: dbus: Remove redundant ref/unref of pending call records
Ken Gaillot (27 hours ago) 016f29e: Fix: libservices: add DBus call reference when getting properties (origin/pr/766)
Andrew Beekhof (35 hours ago) c3ab812: Fix: upstart: Ensure pending structs are correctly unreferenced
Andrew Beekhof (2 days ago) c99a372: Fix: systemd: Ensure pending structs are correctly unreferenced
Andrew Beekhof (4 days ago) 2091b55: Fix: systemd: Track pending operations so they can be safely cancelled

We'll pick these up for 7.3. In the meantime, the crmd patch is present in pacemaker-1.1.12-22.el7_1.4 (available soon) and is sufficient to prevent this problem.
These were included some time ago but I forgot to update the bug.
adding to errata
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2383.html