Bug 1452196
| Summary: | Crash in lrmd | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Tom Lavigne <tlavigne> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.3 | CC: | abeekhof, cfeist, cluster-maint, ctowsley, fdinitto, jliberma, jmelvin, jthomas, kgaillot, mjuricek, mschuppe, ohochman |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | 7.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.15-11.el7_3.5 | Doc Type: | Bug Fix |
| Doc Text: | Cause: When cancelling a systemd operation, Pacemaker was unable to detect whether it was in-flight.<br>Consequence: If the operation was in-flight, Pacemaker would free its memory, and lrmd would core-dump with a segmentation fault.<br>Fix: Pacemaker can now detect in-flight systemd operations properly.<br>Result: lrmd does not crash when cancelling an in-flight systemd operation. | Story Points: | --- |
| Clone Of: | 1451170 | Environment: | |
| Last Closed: | 2017-06-28 17:00:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1451170 | | |
| Bug Blocks: | | | |
Description

Tom Lavigne, 2017-05-18 14:08:53 UTC
QA: Here is a reproducer (from the parent bz):

1. Configure a cluster with at least two nodes.
2. Pick a node to use for testing. On it, install the pacemaker-cts and pacemaker-debuginfo packages, and set "MALLOC_PERTURB_=221" in /etc/sysconfig/pacemaker.
3. Start the cluster, and configure a systemd resource (any service will do, or a dummy service) named "bz1451170" with a monitor interval of 5 seconds. Ban the resource from all nodes but the one chosen for testing.
4. Open three terminal windows to the test node, prepare the commands below, and execute them at the appropriate times:

   4a. Find the PID of the lrmd. When ready to proceed, attach to it with "gdb /usr/libexec/pacemaker/lrmd $PID", and copy-and-paste this script at the gdb prompt:

   ```
   break services_add_inflight_op
   commands
     if $_streq(op->rsc,"bz1451170")
       echo *** FREEZE SYSTEMD NOW ***\n
       call sleep(5)
       finish
       echo *** CANCEL MONITOR NOW ***\n
       call sleep(5)
       echo *** CONTINUING ***\n
     end
     continue
   end
   continue
   # script end
   ```

   4b. In a second window, freeze systemd with "gdb -p 1" immediately after the gdb in the first window outputs "FREEZE SYSTEMD NOW".

   4c. In a third window, cancel the monitor with "/usr/libexec/pacemaker/lrmd_test -c cancel -a monitor -r bz1451170 -i 5000" immediately after the gdb in the first window outputs "CANCEL MONITOR NOW".

   4d. If the gdb in the first window shows "script end" and returns to a prompt, enter "c" at that prompt.

   4e. Wait a little while, then unfreeze systemd by entering Ctrl-D at the gdb prompt in the second window to exit gdb.

Before the fix, lrmd will usually dump core. Unfortunately, the reproducer is still timing-sensitive even with all of this. Check the pacemaker detail log for a message like "Managed process ... (lrmd) dumped core" (systemd may respawn lrmd immediately, so there might not be any other noticeable effect). If there is no such message, try disabling and re-enabling the bz1451170 resource, and check again. Usually it will dump core at one of these points.
If you want to retry the test, stop and start the cluster on the test node between tests to ensure a clean environment. After the fix, lrmd will never dump core, but you will see a message "Will cancel systemd op bz1451170_status_5000 when in-flight instance completes".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1608