Bug 1346726

Summary: Backport upstream bug systemd: Return PCMK_OCF_UNKNOWN_ERROR instead of PCMK_OCF_NOT_INSTALLED for uncertain errors on LoadUnit
Product: Red Hat Enterprise Linux 7
Reporter: Chen <cchen>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: low
Docs Contact: Steven J. Levine <slevine>
Priority: medium
Version: 7.2
CC: abeekhof, cfeist, cluster-maint, djansa
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pacemaker-1.1.15-2.el7
Doc Type: Release Note
Doc Text:
Pacemaker now distinguishes transient failures from fatal failures when loading systemd units

Previously, Pacemaker treated all errors loading a *systemd* unit as fatal. As a consequence, Pacemaker would not start a *systemd* resource on a node where it could not load the *systemd* unit, even if the load failed due to transient conditions such as CPU load. With this update, Pacemaker distinguishes transient failures from fatal failures when loading *systemd* units. Logs and cluster status now show more appropriate messages, and the resource can start on the node once the transient error clears.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 18:59:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Chen 2016-06-15 09:01:06 UTC
Description of problem:

Please backport the following upstream fix:

https://github.com/ClusterLabs/pacemaker/pull/824

Comment 2 Ken Gaillot 2016-06-15 14:15:02 UTC
Hi,

This fix is included in the latest build planned for RHEL 7.3.

Comment 4 Ken Gaillot 2016-06-23 17:17:47 UTC
QA: A decent test for the change in functionality is:

1. Configure a cluster with a systemd resource.
2. "pcs resource disable" the resource
3. Remove the service's systemd unit file from /usr/lib/systemd/system (or move to a temporary location), and run "systemctl daemon-reload"
4. "pcs resource enable" the resource
5. "pcs cluster status" should show that the start fails on the original node, and the resource is moved to a different node

After the fix, the log on the original node should show an error like:

Could not issue start for ...: Unit ... failed to load: No such file or directory.

Before the fix, that line will not be present, and I believe it will show an "Unexpected DBus type" error instead.

This does not test the new possibility of returning PCMK_OCF_UNKNOWN_ERROR, but I'm not aware of a way to reproduce a transient DBus error at a specific point, and this will test the code paths that changed in a way that should be sufficient for that case.
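
To illustrate the distinction this bug is about, here is a minimal Python sketch (not Pacemaker's actual C code) of mapping a LoadUnit failure to a result code. The numeric values are the standard OCF exit codes (1 is the generic error, 5 is "not installed", matching the "rc: 5" seen when the unit is missing). The function name classify_load_unit_error and the exact set of D-Bus error names treated as fatal are assumptions made for illustration; the upstream patch may check a different set.

```python
# Standard OCF exit codes, as exposed in Pacemaker under these names.
PCMK_OCF_UNKNOWN_ERROR = 1   # generic error: the node remains eligible later
PCMK_OCF_NOT_INSTALLED = 5   # hard error: resource is not retried on this node

# ASSUMPTION: D-Bus error names taken to mean "the unit definitely cannot be
# loaded here". Both are error names defined by systemd; whether the upstream
# fix checks exactly these is not confirmed by this report.
FATAL_LOAD_ERRORS = frozenset({
    "org.freedesktop.systemd1.NoSuchUnit",
    "org.freedesktop.systemd1.LoadFailed",
})


def classify_load_unit_error(error_name):
    """Hypothetical helper: map a D-Bus error name from LoadUnit to an OCF code.

    Only errors known to be fatal are reported as "not installed". Anything
    else (for example a D-Bus timeout under heavy load) is reported as a
    generic error, so the resource can start on the node once it clears.
    """
    if error_name in FATAL_LOAD_ERRORS:
        return PCMK_OCF_NOT_INSTALLED
    return PCMK_OCF_UNKNOWN_ERROR


if __name__ == "__main__":
    for name in ("org.freedesktop.systemd1.NoSuchUnit",
                 "org.freedesktop.DBus.Error.NoReply"):
        print(name, "->", classify_load_unit_error(name))
```

The test steps above exercise only the first branch (a missing unit file); the second branch corresponds to the transient case that is hard to reproduce on demand.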

Comment 5 Jaroslav Kortus 2016-09-09 15:39:06 UTC
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep  9 17:34:07 2016		Last change: Fri Sep  9 17:33:52 2016 by root via cibadmin on tardis-01

3 nodes and 10 resources configured

Online: [ tardis-01 tardis-02 tardis-03 ]

Full list of resources:

 fencing-tardis01	(stonith:fence_ipmilan):	Started tardis-01
 fencing-tardis02	(stonith:fence_ipmilan):	Started tardis-02
 fencing-tardis03	(stonith:fence_ipmilan):	Started tardis-03
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 apache	(systemd:httpd):	Started tardis-01

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@tardis-01 ~]# pcs resource disable apache
[root@tardis-01 ~]# rpm -e httpd
[root@tardis-01 ~]# pcs resource enable apache
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep  9 17:35:38 2016		Last change: Fri Sep  9 17:35:23 2016 by root via crm_resource on tardis-01

3 nodes and 10 resources configured

Online: [ tardis-01 tardis-02 tardis-03 ]

Full list of resources:

 fencing-tardis01	(stonith:fence_ipmilan):	Started tardis-01
 fencing-tardis02	(stonith:fence_ipmilan):	Started tardis-02
 fencing-tardis03	(stonith:fence_ipmilan):	Started tardis-03
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 apache	(systemd:httpd):	Started tardis-02

Failed Actions:
* apache_start_0 on tardis-01 'not installed' (5): call=38, status=Not installed, exitreason='none',
    last-rc-change='Fri Sep  9 17:35:23 2016', queued=0ms, exec=101ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled


tardis-01 journal:
lrmd[2127]:    error: Could not issue start for apache: Unit not found.
crmd[2130]:    error: Result of start operation for apache on tardis-01: Not installed | call=38 key=apache_start_0 confirmed=true status=7 cib-update=268
crmd[2130]:  warning: Action 37 (apache_start_0) on tardis-01 failed (target: 0 vs. rc: 5): Error

Works as expected with pacemaker-1.1.15-10.el7.x86_64.

Comment 7 errata-xmlrpc 2016-11-03 18:59:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html