Bug 1246291
| Summary: | lrmd killed by SIGSEGV | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Tomoki Sekiyama <tsekiyam> |
| Component: | pacemaker | Assignee: | Andrew Beekhof <abeekhof> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.1 | CC: | cfeist, cluster-maint, djansa, fdinitto, jkortus, jruemker, lmiksik, mtanino |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| URL: | https://api.access.redhat.com/rs/telemetry/abrt/reports/bthash/c060d3a4a86c68a93275db70569ad33e071e8bc9 | | |
| Whiteboard: | abrt_hash:566d6d8d0ad5cef6e984df4fa09b28d7355d8e53 | | |
| Fixed In Version: | pacemaker-1.1.13-10.el7 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1247019 (view as bug list) | Environment: | |
| Last Closed: | 2015-11-19 12:11:59 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1133060, 1247019 | | |
| Attachments: | | | |
Description
Tomoki Sekiyama
2015-07-24 00:01:20 UTC
Created attachment 1055540 [details]
File: cgroup
Created attachment 1055541 [details]
File: core_backtrace
Created attachment 1055542 [details]
File: dso_list
Created attachment 1055543 [details]
File: environ
Created attachment 1055544 [details]
File: limits
Created attachment 1055545 [details]
File: machineid
Created attachment 1055546 [details]
File: maps
Created attachment 1055547 [details]
File: open_fds
Created attachment 1055548 [details]
File: proc_pid_status
Created attachment 1055549 [details]
File: var_log_messages
Created attachment 1055550 [details]
File: sosreport-overcloud-controller-0.localdomain-20150723195046.tar.xz
Created attachment 1055551 [details]
File: sosreport.tar.xz
Need crm_report archives for all nodes in the cluster at the time the crash occurred before I can say much. Filtered log files and near-empty sosreports don't help much. crm_report comes with detailed --help text.

Created attachment 1055820 [details]
crm_report log
The crm_report file is attached.
Thanks,
Tomoki
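For reference, the kind of crm_report invocation being asked for above might look like the following sketch; the time window and archive name are illustrative placeholders, not values taken from this bug.

```
# Run on (or against) each cluster node to gather logs around the crash window.
# The timestamps and the "crash-report" name below are placeholders.
crm_report --from "2015-07-24 10:00:00" --to "2015-07-24 11:30:00" crash-report
```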
Very helpful, thank you.
If I look at core.6034 from controller 0, and go to stack frame 2 I see:
(gdb)
#2 cmd_finalize (cmd=cmd@entry=0xefab40, rsc=rsc@entry=0x0) at lrmd.c:540
540 send_cmd_complete_notify(cmd);
(gdb) p *cmd
$1 = {timeout = 15654448, interval = 0, start_delay = 15697312, timeout_orig = 0, call_id = 223, exec_rc = 3, lrmd_op_status = 1, call_opts = 4, delay_id = 0, stonith_recurring_id = 0, rsc_deleted = 1, client_id = 0xf0d0a0 "@\222", <incomplete sequence \356>,
origin = 0xef4070 "`\325", <incomplete sequence \354>, rsc_id = 0xefb020 "`", <incomplete sequence \357>, action = 0xf060b0 "\320", <incomplete sequence \356>, real_action = 0x0, exit_reason = 0x0, output = 0x0, userdata_str = 0xef6d60 "\240\347", <incomplete sequence \356>, t_first_run = {
time = 1437749769, millitm = 692, timezone = 240, dstflag = 0}, t_run = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0}, t_queue = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0}, t_rcchange = {time = 1437749769, millitm = 703, timezone = 240, dstflag = 0},
first_notify_sent = 1, last_notify_rc = 3, last_notify_op_status = 1, last_pid = 0, params = 0xd0}
Which is complete and utter garbage. It's hard to escape the conclusion that this is a use-after-free.
HOWEVER, there is a flurry of activity in the logs prior to the segfault, beginning with:
Jul 24 10:56:09 [6034] overcloud-controller-0.localdomain lrmd: info: cancel_recurring_action: Cancelling operation mongod_status_60000
Jul 24 10:56:09 [6034] overcloud-controller-0.localdomain lrmd: warning: qb_ipcs_event_sendv: new_event_notification (6034-6033-8): Bad file descriptor (9)
Something in the IPC code could be the root cause here (since process 6033 is still alive at the time).
It would be useful to know if re-testing with 0.17.1-1 results in the same behaviour.
Another alternative is based on these logs:
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: notice: te_rsc_command: Initiating action 39: stop overcloud-controller-2_stop_0 on overcloud-controller-2
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_ipc_connection_destroy: IPC connection destroyed
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: crit: lrm_connection_destroy: LRM Connection failed
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: warning: do_update_resource: Resource overcloud-controller-0 no longer exists in the lrmd
The logs here could use some work; it's not clear which connections are being closed, nor whether they are being closed proactively or in response to an error.
If the latter, then it would support the IPC theory above. If the former, then whatever is triggering the disconnection would be the root cause (I'll continue investigating).
To be clear though, either way we need to stop the lrmd from crashing.
I'm almost certain that this patch would fix the problem:
commit d119e21a2b37e31bc3676ae8a19ad0473bdd217f
Author: David Vossel <dvossel>
Date: Fri Jun 5 15:07:53 2015 -0400
Fix: crmd: handle resources named the same as cluster nodes
Reason being that I see this in the cib:
<lrm_resource id="overcloud-controller-1" type="remote" class="ocf" provider="pacemaker">
Even though pacemaker-remote is not intended to be in use here.
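One way to check whether a deployment is affected (a suggestion, not something from the original report) is to search the live CIB for lrm_resource entries whose id matches a cluster node name:

```
# Look for resource history entries whose id collides with a node name.
# The grep pattern uses the node names from this report; adjust for other clusters.
cibadmin --query | grep 'lrm_resource id="overcloud-controller'
```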
Work-around: add a "fence-" prefix to all the fencing resource definitions so that they no longer exactly match the node names.
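A sketch of what that workaround could look like when creating fencing resources with pcs, assuming fence_ipmilan devices; the agent and all parameter values shown are illustrative, not taken from this deployment.

```
# Hypothetical example: name the fencing resource "fence-<node>" rather than
# "<node>" so it can never collide with a cluster node name.
pcs stonith create fence-overcloud-controller-0 fence_ipmilan \
    pcmk_host_list="overcloud-controller-0" \
    ipaddr="192.0.2.10" login="admin" passwd="secret" lanplus=1
```

Existing resources could instead simply be renamed; the essential point is only that the fencing resource id must differ from the node name.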
This explains why the behaviour occurs when processing the result of the node's fencing resource:
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: warning: do_update_resource: Resource overcloud-controller-0 no longer exists in the lrmd
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: match_graph_event: Action overcloud-controller-0_stop_0 (37) confirmed on overcloud-controller-0 (rc=0)
*** Bug 1245653 has been marked as a duplicate of this bug. ***

Fixes for the lrmd side of things are:

- Andrew Beekhof (25 hours ago) b2b4950: Fix: dbus: Remove redundant ref/unref of pending call records
- Ken Gaillot (27 hours ago) 016f29e: Fix: libservices: add DBus call reference when getting properties (origin/pr/766)
- Andrew Beekhof (35 hours ago) c3ab812: Fix: upstart: Ensure pending structs are correctly unreferenced
- Andrew Beekhof (2 days ago) c99a372: Fix: systemd: Ensure pending structs are correctly unreferenced
- Andrew Beekhof (4 days ago) 2091b55: Fix: systemd: Track pending operations so they can be safely cancelled

We'll pick these up for 7.3. In the meantime, the crmd patch is present in pacemaker-1.1.12-22.el7_1.4 (available soon) and is sufficient to prevent this problem.

These were included some time ago but I forgot to update the bug.

Adding to errata.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2383.html