Bug 1346726
Summary: | Backport upstream bug systemd: Return PCMK_OCF_UNKNOWN_ERROR instead of PCMK_OCF_NOT_INSTALLED for uncertain errors on LoadUnit | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Chen <cchen> |
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | low | Docs Contact: | Steven J. Levine <slevine> |
Priority: | medium | ||
Version: | 7.2 | CC: | abeekhof, cfeist, cluster-maint, djansa |
Target Milestone: | rc | ||
Target Release: | 7.3 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | pacemaker-1.1.15-2.el7 | Doc Type: | Release Note |
Doc Text: |
Pacemaker now distinguishes transient failures from fatal failures when loading systemd units
Previously, Pacemaker treated all errors loading a *systemd* unit as fatal. As a consequence, Pacemaker would not start a *systemd* resource on a node where it could not load the *systemd* unit, even if the load failed due to transient conditions such as CPU load. With this update, Pacemaker now distinguishes transient failures from fatal failures when loading *systemd* units. Logs and cluster status now show more appropriate messages, and the resource can start on the node once the transient error clears.
|
Last Closed: | 2016-11-03 18:59:57 UTC | Type: | Bug |
Description
Chen
2016-06-15 09:01:06 UTC
Hi,

This fix is included in the latest build planned for RHEL 7.3.

QA: A decent test for the change in functionality is:

1. Configure a cluster with a systemd resource.
2. "pcs resource disable" the resource.
3. Remove the service's systemd unit file from /usr/lib/systemd/system (or move it to a temporary location), and run "systemctl daemon-reload".
4. "pcs resource enable" the resource.
5. "pcs cluster status" should show that the start fails on the original node, and the resource is moved to a different node.

After the fix, the log on the original node should show an error like:

Could not issue start for ...: Unit ... failed to load: No such file or directory.

Before the fix, that line will not be present, and I believe it will show an "Unexpected DBus type" error instead.

This does not test the new possibility of returning PCMK_OCF_UNKNOWN_ERROR, but I'm not aware of a way to reproduce a transient DBus error at a specific point, and this will test the code paths that changed in a way that should be sufficient for that case.
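For anyone reading along without the patch in front of them, here is a minimal sketch of the distinction the summary describes. It is not Pacemaker's actual C code. The function name classify_load_error is hypothetical, and the list of DBus error names treated as definite failures is an assumption made for illustration. The numeric values follow the OCF return codes, where 5 is the 'not installed' code and 1 is the generic error code.

```python
# OCF return codes (numeric values per the OCF resource agent convention).
PCMK_OCF_UNKNOWN_ERROR = 1   # generic failure; the cause may be transient
PCMK_OCF_NOT_INSTALLED = 5   # definite failure; service is missing on this node

# Assumption for this sketch: DBus error names that mean the unit
# definitely cannot be loaded on this node.
DEFINITE_LOAD_ERRORS = (
    "org.freedesktop.systemd1.InvalidName",
    "org.freedesktop.systemd1.LoadFailed",
    "org.freedesktop.systemd1.NoSuchUnit",
)


def classify_load_error(dbus_error_name):
    """Return the OCF code this sketch assigns to a failed LoadUnit call."""
    if any(name in dbus_error_name for name in DEFINITE_LOAD_ERRORS):
        return PCMK_OCF_NOT_INSTALLED
    # Anything else (timeout, no reply, ...) is uncertain, so the service
    # is not reported as permanently missing.
    return PCMK_OCF_UNKNOWN_ERROR
```

With a mapping like this, a missing unit file still yields 'not installed', while an uncertain DBus failure is reported as a generic error that can clear later.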
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep 9 17:34:07 2016 Last change: Fri Sep 9 17:33:52 2016 by root via cibadmin on tardis-01

3 nodes and 10 resources configured

Online: [ tardis-01 tardis-02 tardis-03 ]

Full list of resources:

 fencing-tardis01 (stonith:fence_ipmilan): Started tardis-01
 fencing-tardis02 (stonith:fence_ipmilan): Started tardis-02
 fencing-tardis03 (stonith:fence_ipmilan): Started tardis-03
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 apache (systemd:httpd): Started tardis-01

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@tardis-01 ~]# pcs resource disable apache
[root@tardis-01 ~]# rpm -e httpd
[root@tardis-01 ~]# pcs resource enable apache
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep 9 17:35:38 2016 Last change: Fri Sep 9 17:35:23 2016 by root via crm_resource on tardis-01

3 nodes and 10 resources configured

Online: [ tardis-01 tardis-02 tardis-03 ]

Full list of resources:

 fencing-tardis01 (stonith:fence_ipmilan): Started tardis-01
 fencing-tardis02 (stonith:fence_ipmilan): Started tardis-02
 fencing-tardis03 (stonith:fence_ipmilan): Started tardis-03
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 apache (systemd:httpd): Started tardis-02

Failed Actions:
* apache_start_0 on tardis-01 'not installed' (5): call=38, status=Not installed, exitreason='none',
    last-rc-change='Fri Sep 9 17:35:23 2016', queued=0ms, exec=101ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

tardis-01 journal:
lrmd[2127]: error: Could not issue start for apache: Unit not found.
crmd[2130]: error: Result of start operation for apache on tardis-01: Not installed | call=38 key=apache_start_0 confirmed=true status=7 cib-update=268
crmd[2130]: warning: Action 37 (apache_start_0) on tardis-01 failed (target: 0 vs. rc: 5): Error

works as expected with pacemaker-1.1.15-10.el7.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html