Bug 1452196 - Crash in lrmd
Summary: Crash in lrmd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1451170
Blocks:
 
Reported: 2017-05-18 14:08 UTC by Tom Lavigne
Modified: 2020-12-14 08:41 UTC
CC List: 12 users

Fixed In Version: pacemaker-1.1.15-11.el7_3.5
Doc Type: Bug Fix
Doc Text:
Cause: When cancelling a systemd operation, Pacemaker was unable to detect whether it was in-flight.
Consequence: If the operation was in-flight, Pacemaker would free its memory, and lrmd would core-dump with a segmentation fault.
Fix: Pacemaker can now detect in-flight systemd operations properly.
Result: lrmd does not crash when cancelling an in-flight systemd operation.
Clone Of: 1451170
Environment:
Last Closed: 2017-06-28 17:00:37 UTC
Target Upstream Version:
Embargoed:


Links:
- Cluster Labs 5311 (last updated 2017-05-18 14:09:08 UTC)
- Red Hat Product Errata RHBA-2017:1608, SHIPPED_LIVE: pacemaker bug fix update (last updated 2017-06-28 20:56:44 UTC)

Description Tom Lavigne 2017-05-18 14:08:53 UTC
This bug has been copied from bug #1451170 and has been proposed for backport to the 7.3 z-stream (EUS).

Comment 3 Ken Gaillot 2017-05-18 16:16:42 UTC
QA: Here is a reproducer (from parent bz):

1. Configure a cluster with at least two nodes.
2. Pick a node to use for testing. On it, install the pacemaker-cts and pacemaker-debuginfo packages, and set "MALLOC_PERTURB_=221" in /etc/sysconfig/pacemaker.
3. Start the cluster, and configure a systemd resource (any service will do, or a dummy service) named "bz1451170" with a monitor interval of 5 seconds. Ban the resource from all nodes but the one chosen for testing. (Steps 2 through 4a are sketched as shell commands after this list.)
4. Open three terminal windows to the test node, prepare the commands below, and execute them at the appropriate times:
4a. Find the PID of the lrmd. When ready to proceed, attach to it with "gdb /usr/libexec/pacemaker/lrmd $PID", and copy-and-paste this script at the gdb prompt:
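# Break whenever lrmd registers a new in-flight operation; for the test
# resource, pause twice (sleep 5) so the tester can first freeze systemd
# and then cancel the monitor while the operation is still in flight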
break services_add_inflight_op
commands
        if $_streq(op->rsc,"bz1451170")
                echo *** FREEZE SYSTEMD NOW ***\n
                call sleep(5)
                finish
                echo *** CANCEL MONITOR NOW ***\n
                call sleep(5)
                echo *** CONTINUING ***\n
        end
        continue
end
continue
# script end
4b. In a second window, freeze systemd with "gdb -p 1" immediately after gdb in the first window outputs "FREEZE SYSTEMD NOW".
4c. In a third window, cancel the monitor with "/usr/libexec/pacemaker/lrmd_test -c cancel -a monitor -r bz1451170 -i 5000" immediately after gdb in the first window outputs "CANCEL MONITOR NOW".
4d. If the gdb in the first window shows "script end" and goes back to a prompt, enter "c" at that prompt.
4e. Wait a little while, then unfreeze systemd by entering control-D at the gdb prompt in the second window, to exit gdb.
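
For convenience, here is a rough shell sketch of steps 2 through 4a, assuming a pcs-managed cluster, two nodes named node1 (the test node) and node2, and crond as the example systemd service; the node and service names are placeholders, so adjust them to your environment:

# On the test node, install the debugging packages (step 2)
yum install -y pacemaker-cts pacemaker-debuginfo

# Poison freed memory so a use-after-free is detectable (step 2)
echo 'MALLOC_PERTURB_=221' >> /etc/sysconfig/pacemaker

# Start the cluster everywhere (step 3)
pcs cluster start --all

# Create the systemd resource with a 5-second monitor (step 3)
pcs resource create bz1451170 systemd:crond op monitor interval=5s

# Keep the resource on the test node by banning it from the other node (step 3)
pcs resource ban bz1451170 node2

# Find the lrmd PID for the gdb session (step 4a)
pidof lrmd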

Before the fix, lrmd will usually dump core, though the reproducer is timing-sensitive even with all of the above. Check the pacemaker detail log for a message like "Managed process ... (lrmd) dumped core" (systemd may respawn lrmd immediately, so there may be no other noticeable effect). If there is no such message, try disabling and re-enabling the bz1451170 resource, then check again; lrmd will usually dump core at one of these points. If you want to retry the test, stop and start the cluster on the test node between attempts to ensure a clean environment. After the fix, lrmd will never dump core; instead you will see the message "Will cancel systemd op bz1451170_status_5000 when in-flight instance completes".
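
A sketch of that log check, assuming the detail log is at /var/log/cluster/corosync.log (a common default on RHEL 7; check PCMK_logfile in /etc/sysconfig/pacemaker if it lives elsewhere):

# Before the fix: look for evidence of the crash
grep "(lrmd) dumped core" /var/log/cluster/corosync.log

# If nothing is found, cycle the resource and check again
pcs resource disable bz1451170
pcs resource enable bz1451170

# After the fix: this message appears instead of a crash
grep "Will cancel systemd op" /var/log/cluster/corosync.log

# Reset the test node to a clean state between attempts
pcs cluster stop && pcs cluster start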

Comment 9 errata-xmlrpc 2017-06-28 17:00:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1608

