Bug 1346726
| Summary: | Backport upstream bug systemd: Return PCMK_OCF_UNKNOWN_ERROR instead of PCMK_OCF_NOT_INSTALLED for uncertain errors on LoadUnit | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Chen <cchen> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Docs Contact: | Steven J. Levine <slevine> |
| Priority: | medium | ||
| Version: | 7.2 | CC: | abeekhof, cfeist, cluster-maint, djansa |
| Target Milestone: | rc | ||
| Target Release: | 7.3 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | pacemaker-1.1.15-2.el7 | Doc Type: | Release Note |
| Doc Text: |
Pacemaker now distinguishes transient failures from fatal failures when loading systemd units
Previously, Pacemaker treated all errors loading a *systemd* unit as fatal. As a consequence, Pacemaker would not start a *systemd* resource on a node where it could not load the *systemd* unit, even if the load failed due to transient conditions such as CPU load. With this update, Pacemaker now distinguishes transient failures from fatal failures when loading *systemd* units. Logs and cluster status now show more appropriate messages, and the resource can start on the node once the transient error clears.
| Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-03 18:59:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
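The distinction described in the Doc Text above can be sketched as a small mapping from a LoadUnit D-Bus error to an OCF exit code. This is only an illustration, not Pacemaker's actual C implementation: the helper name `classify_load_unit_error` and the set `DEFINITE_LOAD_ERRORS` are hypothetical, and the exact list of error names Pacemaker checks is an assumption, since this report does not state it.

```python
# OCF exit codes referenced in this bug (numeric values from the OCF
# resource agent API; rc 5 also appears in the pcs status output).
PCMK_OCF_UNKNOWN_ERROR = 1   # generic error; the node is not ruled out
PCMK_OCF_NOT_INSTALLED = 5   # the unit definitely cannot run on this node

# Assumption: D-Bus error names that indicate a definite load failure.
# The actual set checked by Pacemaker is not given in this report.
DEFINITE_LOAD_ERRORS = (
    "org.freedesktop.systemd1.NoSuchUnit",
    "org.freedesktop.systemd1.LoadFailed",
)


def classify_load_unit_error(dbus_error_name):
    """Map a LoadUnit D-Bus error name to an OCF exit code (hypothetical helper)."""
    if dbus_error_name in DEFINITE_LOAD_ERRORS:
        # The unit is known to be missing or broken on this node.
        return PCMK_OCF_NOT_INSTALLED
    # Anything else (timeout, no reply, ...) is uncertain, so report a
    # generic error instead of claiming the unit is not installed.
    return PCMK_OCF_UNKNOWN_ERROR
```

With this mapping, a missing unit file still yields "not installed", while an uncertain error such as a D-Bus timeout yields a generic error, which matches the intent of the bug summary.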
Description
Chen
2016-06-15 09:01:06 UTC
Hi,

This fix is included in the latest build planned for RHEL 7.3.

QA: A decent test for the change in functionality is:

1. Configure a cluster with a systemd resource.
2. "pcs resource disable" the resource.
3. Remove the service's systemd unit file from /usr/lib/systemd/system (or move it to a temporary location), and run "systemctl daemon-reload".
4. "pcs resource enable" the resource.
5. "pcs cluster status" should show that the start fails on the original node, and the resource is moved to a different node.

After the fix, the log on the original node should show an error like:

Could not issue start for ...: Unit ... failed to load: No such file or directory.

Before the fix, that line will not be present, and I believe it will show an "Unexpected DBus type" error instead.

This does not test the new possibility of returning PCMK_OCF_UNKNOWN_ERROR, but I'm not aware of a way to reproduce a transient DBus error at a specific point, and this will test the code paths that changed in a way that should be sufficient for that case.

[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep 9 17:34:07 2016 Last change: Fri Sep 9 17:33:52 2016 by root via cibadmin on tardis-01
3 nodes and 10 resources configured
Online: [ tardis-01 tardis-02 tardis-03 ]
Full list of resources:
fencing-tardis01 (stonith:fence_ipmilan): Started tardis-01
fencing-tardis02 (stonith:fence_ipmilan): Started tardis-02
fencing-tardis03 (stonith:fence_ipmilan): Started tardis-03
Clone Set: dlm-clone [dlm]
Started: [ tardis-01 tardis-02 tardis-03 ]
Clone Set: clvmd-clone [clvmd]
Started: [ tardis-01 tardis-02 tardis-03 ]
apache (systemd:httpd): Started tardis-01
Daemon Status:
corosync: active/disabled
pacemaker: active/enabled
pcsd: active/enabled
[root@tardis-01 ~]# pcs resource disable apache
[root@tardis-01 ~]# rpm -e httpd
[root@tardis-01 ~]# pcs resource enable apache
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep 9 17:35:38 2016 Last change: Fri Sep 9 17:35:23 2016 by root via crm_resource on tardis-01
3 nodes and 10 resources configured
Online: [ tardis-01 tardis-02 tardis-03 ]
Full list of resources:
fencing-tardis01 (stonith:fence_ipmilan): Started tardis-01
fencing-tardis02 (stonith:fence_ipmilan): Started tardis-02
fencing-tardis03 (stonith:fence_ipmilan): Started tardis-03
Clone Set: dlm-clone [dlm]
Started: [ tardis-01 tardis-02 tardis-03 ]
Clone Set: clvmd-clone [clvmd]
Started: [ tardis-01 tardis-02 tardis-03 ]
apache (systemd:httpd): Started tardis-02
Failed Actions:
* apache_start_0 on tardis-01 'not installed' (5): call=38, status=Not installed, exitreason='none',
last-rc-change='Fri Sep 9 17:35:23 2016', queued=0ms, exec=101ms
Daemon Status:
corosync: active/disabled
pacemaker: active/enabled
pcsd: active/enabled
tardis-01 journal:
lrmd[2127]: error: Could not issue start for apache: Unit not found.
crmd[2130]: error: Result of start operation for apache on tardis-01: Not installed | call=38 key=apache_start_0 confirmed=true status=7 cib-update=268
crmd[2130]: warning: Action 37 (apache_start_0) on tardis-01 failed (target: 0 vs. rc: 5): Error
Works as expected with pacemaker-1.1.15-10.el7.x86_64.
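For anyone repeating this verification, the check on the "Failed Actions" entry could be automated along these lines. This is a minimal sketch: the function name `parse_failed_action` is hypothetical, and the regular expression is derived only from the single sample line in the output above, so other pcs versions may format the entry differently.

```python
import re

# Pattern for the first line of a "Failed Actions" entry, based solely on
# the sample shown in the pcs status output above (format is an assumption).
FAILED_ACTION_RE = re.compile(
    r"\*\s+(?P<action>\S+) on (?P<node>\S+) '(?P<reason>[^']*)' \((?P<rc>\d+)\)"
)


def parse_failed_action(line):
    """Return action, node, reason and rc from a failed-action line, or None."""
    match = FAILED_ACTION_RE.search(line)
    if match is None:
        return None
    return {
        "action": match.group("action"),
        "node": match.group("node"),
        "reason": match.group("reason"),
        "rc": int(match.group("rc")),
    }


# Sample line copied from the transcript above.
sample = ("* apache_start_0 on tardis-01 'not installed' (5): call=38, "
          "status=Not installed, exitreason='none',")
result = parse_failed_action(sample)
```

Here `result["rc"]` is 5, which corresponds to the "not installed" result expected when the unit is definitely missing from the node.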
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html