Description of problem:
When a RHEL OSP7 RC overcloud is configured by OSP-director with controller HA using 3 nodes (and fencing configured using fence_xvm), lrmd sometimes crashes with a segmentation fault.

Version-Release number of selected component:
pacemaker-1.1.12-22.el7_1.2

Additional info:
reporter:       libreport-2.1.11
cmdline:        /usr/libexec/pacemaker/lrmd
executable:     /usr/libexec/pacemaker/lrmd
global_uuid:    566d6d8d0ad5cef6e984df4fa09b28d7355d8e53
kernel:         3.10.0-229.7.2.el7.x86_64
runlevel:       N 3
type:           CCpp
uid:            0
Created attachment 1055540 [details] File: cgroup
Created attachment 1055541 [details] File: core_backtrace
Created attachment 1055542 [details] File: dso_list
Created attachment 1055543 [details] File: environ
Created attachment 1055544 [details] File: limits
Created attachment 1055545 [details] File: machineid
Created attachment 1055546 [details] File: maps
Created attachment 1055547 [details] File: open_fds
Created attachment 1055548 [details] File: proc_pid_status
Created attachment 1055549 [details] File: var_log_messages
Created attachment 1055550 [details] File: sosreport-overcloud-controller-0.localdomain-20150723195046.tar.xz
Created attachment 1055551 [details] File: sosreport.tar.xz
Need crm_report archives from all nodes in the cluster, covering the time the crash occurred, before I can say much. Filtered log files and near-empty sosreports don't help much. crm_report comes with detailed --help text.
Created attachment 1055820 [details]
crm_report log

The crm_report file is attached.

Thanks,
Tomoki
Very helpful, thank you.

If I look at core.6034 from controller 0 and go to stack frame 2, I see:

(gdb) #2 cmd_finalize (cmd=cmd@entry=0xefab40, rsc=rsc@entry=0x0) at lrmd.c:540
540         send_cmd_complete_notify(cmd);
(gdb) p *cmd
$1 = {timeout = 15654448, interval = 0, start_delay = 15697312, timeout_orig = 0,
  call_id = 223, exec_rc = 3, lrmd_op_status = 1, call_opts = 4, delay_id = 0,
  stonith_recurring_id = 0, rsc_deleted = 1,
  client_id = 0xf0d0a0 "@\222", <incomplete sequence \356>,
  origin = 0xef4070 "`\325", <incomplete sequence \354>,
  rsc_id = 0xefb020 "`", <incomplete sequence \357>,
  action = 0xf060b0 "\320", <incomplete sequence \356>,
  real_action = 0x0, exit_reason = 0x0, output = 0x0,
  userdata_str = 0xef6d60 "\240\347", <incomplete sequence \356>,
  t_first_run = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0},
  t_run = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0},
  t_queue = {time = 1437749769, millitm = 692, timezone = 240, dstflag = 0},
  t_rcchange = {time = 1437749769, millitm = 703, timezone = 240, dstflag = 0},
  first_notify_sent = 1, last_notify_rc = 3, last_notify_op_status = 1,
  last_pid = 0, params = 0xd0}

which is complete and utter garbage. It's hard to escape the conclusion that this is a use-after-free.

HOWEVER, there is a flurry of activity in the logs prior to the segfault, beginning with:

Jul 24 10:56:09 [6034] overcloud-controller-0.localdomain lrmd: info: cancel_recurring_action: Cancelling operation mongod_status_60000
Jul 24 10:56:09 [6034] overcloud-controller-0.localdomain lrmd: warning: qb_ipcs_event_sendv: new_event_notification (6034-6033-8): Bad file descriptor (9)

Something in the IPC code could be the root cause here (since process 6033 is still alive at the time). It would be useful to know if re-testing with 0.17.1-1 results in the same behaviour.
Another alternative is based on these logs:

Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: notice: te_rsc_command: Initiating action 39: stop overcloud-controller-2_stop_0 on overcloud-controller-2
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_ipc_connection_destroy: IPC connection destroyed
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: crit: lrm_connection_destroy: LRM Connection failed
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: lrmd_api_disconnect: Disconnecting from lrmd service
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: warning: do_update_resource: Resource overcloud-controller-0 no longer exists in the lrmd

The logs here could use some work; it's not clear which connections are being closed, nor whether they are being closed pro-actively or in response to an error. If the latter, that would support the IPC theory above. If the former, then whatever is triggering the disconnection would be the root cause (I'll continue investigating).

To be clear though, either way we need to stop the lrmd from crashing.
I'm almost certain that this patch would fix the problem:

commit d119e21a2b37e31bc3676ae8a19ad0473bdd217f
Author: David Vossel <dvossel>
Date:   Fri Jun 5 15:07:53 2015 -0400

    Fix: crmd: handle resources named the same as cluster nodes

The reason being that I see this in the cib:

<lrm_resource id="overcloud-controller-1" type="remote" class="ocf" provider="pacemaker">

even though pacemaker-remote is not intended to be in use.

Work-around: add a "fence-" prefix to all the fencing resource definitions so that they no longer exactly match the node names.

This explains why the behaviour occurs when processing the result of the node's fencing resource:

Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: warning: do_update_resource: Resource overcloud-controller-0 no longer exists in the lrmd
Jul 24 10:56:09 [6033] overcloud-controller-0.localdomain crmd: info: match_graph_event: Action overcloud-controller-0_stop_0 (37) confirmed on overcloud-controller-0 (rc=0)
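For reference, the work-around above amounts to a cib fragment along these lines (a hypothetical sketch with made-up ids and attributes, not taken from this deployment's configuration): the fencing resource id carries the "fence-" prefix so it no longer collides with the node name.

```xml
<!-- hypothetical: stonith resource renamed so its id no longer
     matches the node name "overcloud-controller-0" -->
<primitive id="fence-overcloud-controller-0" class="stonith" type="fence_xvm">
  <instance_attributes id="fence-overcloud-controller-0-attrs">
    <nvpair id="fence-overcloud-controller-0-hosts"
            name="pcmk_host_list" value="overcloud-controller-0"/>
  </instance_attributes>
</primitive>
```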
*** Bug 1245653 has been marked as a duplicate of this bug. ***
Fixes for the lrmd side of things are:

Andrew Beekhof (25 hours ago) b2b4950: Fix: dbus: Remove redundant ref/unref of pending call records
Ken Gaillot (27 hours ago) 016f29e: Fix: libservices: add DBus call reference when getting properties (origin/pr/766)
Andrew Beekhof (35 hours ago) c3ab812: Fix: upstart: Ensure pending structs are correctly unreferenced
Andrew Beekhof (2 days ago) c99a372: Fix: systemd: Ensure pending structs are correctly unreferenced
Andrew Beekhof (4 days ago) 2091b55: Fix: systemd: Track pending operations so they can be safely cancelled

We'll pick these up for 7.3. In the meantime, the crmd patch is present in pacemaker-1.1.12-22.el7_1.4 (available soon) and is sufficient to prevent this problem.
These were included some time ago but I forgot to update the bug.
adding to errata
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-2383.html