Bug 1452196

Summary: Crash in lrmd
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.3
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Tom Lavigne <tlavigne>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cfeist, cluster-maint, ctowsley, fdinitto, jliberma, jmelvin, jthomas, kgaillot, mjuricek, mschuppe, ohochman
Keywords: ZStream
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.15-11.el7_3.5
Doc Type: Bug Fix
Doc Text:
Cause: When cancelling a systemd operation, Pacemaker was unable to detect whether it was in-flight.
Consequence: If the operation were in-flight, Pacemaker would free its memory, and lrmd would core-dump with a segmentation fault.
Fix: Pacemaker can now detect in-flight systemd operations properly.
Result: lrmd does not crash when cancelling an in-flight systemd operation.
Clone Of: 1451170
Last Closed: 2017-06-28 17:00:37 UTC
Bug Depends On: 1451170

Description Tom Lavigne 2017-05-18 14:08:53 UTC
This bug has been copied from bug #1451170 and has been proposed
to be backported to 7.3 z-stream (EUS).

Comment 3 Ken Gaillot 2017-05-18 16:16:42 UTC
QA: Here is a reproducer (from parent bz):

1. Configure a cluster with at least two nodes.
2. Pick a node to use for testing. On it, install the pacemaker-cts and pacemaker-debuginfo packages, and set "MALLOC_PERTURB_=221" in /etc/sysconfig/pacemaker so that glibc poisons freed memory and the use-after-free crashes reliably. (Example commands for this step and the next are sketched after this list.)
3. Start the cluster, and configure a systemd resource named "bz1451170" with a monitor interval of 5 seconds (any systemd service will do, including a dummy one). Ban the resource from all nodes except the one chosen for testing.
4. Open three terminal windows to the test node, prepare the commands below, and execute them at the appropriate times:
4a. Find the PID of the lrmd. When ready to proceed, attach to it with "gdb /usr/libexec/pacemaker/lrmd $PID", and copy-and-paste this script at the gdb prompt:
break services_add_inflight_op
commands
        if $_streq(op->rsc,"bz1451170")
                echo *** FREEZE SYSTEMD NOW ***\n
                call sleep(5)
                finish
                echo *** CANCEL MONITOR NOW ***\n
                call sleep(5)
                echo *** CONTINUING ***\n
        end
        continue
end
continue
# script end
4b. In a second window, freeze systemd with "gdb -p 1" immediately after gdb in the first window outputs "FREEZE SYSTEMD NOW".
4c. In a third window, cancel the monitor with "/usr/libexec/pacemaker/lrmd_test -c cancel -a monitor -r bz1451170 -i 5000" immediately after gdb in the first window outputs "CANCEL MONITOR NOW".
4d. If the gdb in the first window shows "script end" and goes back to a prompt, enter "c" at that prompt.
4e. Wait a little while, then unfreeze systemd by entering control-D at the gdb prompt in the second window, to exit gdb.
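
For reference, steps 2 and 3 (and finding the lrmd PID for step 4a) might look like the following on the test node. This is a sketch, not taken from the bug report: "systemd:chronyd" is an arbitrary example service, "<some-other-node>" is a placeholder, and debuginfo-install assumes the yum-utils package is available.

# step 2: test packages, plus MALLOC_PERTURB_ so glibc poisons freed
# memory and the use-after-free segfaults reliably
yum install -y pacemaker-cts yum-utils
debuginfo-install -y pacemaker
echo 'MALLOC_PERTURB_=221' >> /etc/sysconfig/pacemaker

# step 3: start the cluster, create the systemd resource with a 5-second
# monitor, and ban it from every node except the test node
pcs cluster start --all
pcs resource create bz1451170 systemd:chronyd op monitor interval=5s
pcs resource ban bz1451170 <some-other-node>   # repeat per non-test node

# step 4a: find the lrmd PID to attach gdb to
pgrep -x lrmd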

Before the fix, lrmd will usually dump core. Even with all of this, the reproducer is still timing-sensitive. Check the pacemaker detail log for a message like "Managed process ... (lrmd) dumped core" (systemd may respawn lrmd immediately, so there might not be any other noticeable effect). If there is no such message, try disabling and re-enabling the bz1451170 resource, and check again; usually lrmd will dump core at one of these points. If you want to retry the test, stop and start the cluster on the test node between tests to ensure a clean environment.

After the fix, lrmd will never dump core, but you will see a message like "Will cancel systemd op bz1451170_status_5000 when in-flight instance completes".
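
To check the result, commands along these lines should work; the detail log path is an assumption here, since it depends on configuration (/var/log/pacemaker.log is one common location on RHEL 7, /var/log/cluster/corosync.log another):

# look for the crash (before the fix) or the deferred-cancel message (after)
grep -e 'dumped core' -e 'Will cancel systemd op' /var/log/pacemaker.log

# if neither message appears, poke the resource and check the log again
pcs resource disable bz1451170
pcs resource enable bz1451170

# between attempts, restart the cluster on the test node for a clean slate
pcs cluster stop
pcs cluster start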

Comment 9 errata-xmlrpc 2017-06-28 17:00:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1608