Bug 1346726 - Backport upstream bug systemd: Return PCMK_OCF_UNKNOWN_ERROR instead of PCMK_OCF_NOT_INSTALLED for uncertain errors on LoadUnit
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: rc
Target Release: 7.3
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2016-06-15 09:01 UTC by Chen
Modified: 2016-11-03 18:59 UTC (History)

Pacemaker now distinguishes transient failures from fatal failures when loading systemd units

Previously, Pacemaker treated all errors loading a *systemd* unit as fatal. As a consequence, Pacemaker would not start a *systemd* resource on a node where it could not load the *systemd* unit, even if the load failed due to transient conditions such as CPU load. With this update, Pacemaker now distinguishes transient failures from fatal failures when loading *systemd* units. Logs and cluster status now show more appropriate messages, and the resource can start on the node once the transient error clears.
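The distinction described above can be sketched in a few lines. This is only an illustration, not Pacemaker's code (which is written in C): the function name `classify_load_error` and the choice of which D-Bus error names count as fatal are assumptions made for this example. The numeric values follow the OCF return-code convention, where 1 is a generic (unknown) error and 5 is "not installed".

```python
# OCF return codes; the values follow the OCF resource agent convention.
PCMK_OCF_UNKNOWN_ERROR = 1   # uncertain failure, may clear on its own
PCMK_OCF_NOT_INSTALLED = 5   # unit is definitively unavailable on this node

# Assumption for this sketch: these systemd error names mean the unit
# cannot be loaded at all, so retrying on the same node is pointless.
FATAL_LOAD_ERRORS = frozenset({
    "org.freedesktop.systemd1.NoSuchUnit",
    "org.freedesktop.systemd1.LoadFailed",
})


def classify_load_error(dbus_error_name):
    """Map a D-Bus error name from a LoadUnit call to an OCF return code.

    Hypothetical helper: anything not known to be fatal is reported as an
    unknown error, so the resource may start on the node later.
    """
    if dbus_error_name in FATAL_LOAD_ERRORS:
        return PCMK_OCF_NOT_INSTALLED
    return PCMK_OCF_UNKNOWN_ERROR
```

With this mapping, a timeout such as `org.freedesktop.DBus.Error.NoReply` yields `PCMK_OCF_UNKNOWN_ERROR`, while a missing unit yields `PCMK_OCF_NOT_INSTALLED`.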
Clone Of:
Last Closed: 2016-11-03 18:59:57 UTC


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2016:2578 | normal | SHIPPED_LIVE | Moderate: pacemaker security, bug fix, and enhancement update | 2016-11-03 12:07:24 UTC

Description Chen 2016-06-15 09:01:06 UTC
Description of problem:

Please backport the following upstream fix:

https://github.com/ClusterLabs/pacemaker/pull/824

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Ken Gaillot 2016-06-15 14:15:02 UTC
Hi,

This fix is included in the latest build planned for RHEL 7.3.

Comment 4 Ken Gaillot 2016-06-23 17:17:47 UTC
QA: A decent test for the change in functionality is:

1. Configure a cluster with a systemd resource.
2. "pcs resource disable" the resource
3. Remove the service's systemd unit file from /usr/lib/systemd/system (or move it to a temporary location), and run "systemctl daemon-reload"
4. "pcs resource enable" the resource
5. "pcs cluster status" should show that the start fails on the original node, and the resource is moved to a different node

After the fix, the log on the original node should show an error like:

Could not issue start for ...: Unit ... failed to load: No such file or directory.

Before the fix, that line will not be present, and I believe it will show an "Unexpected DBus type" error instead.

This does not test the new possibility of returning PCMK_OCF_UNKNOWN_ERROR, but I'm not aware of a way to reproduce a transient DBus error at a specific point, and this will test the code paths that changed in a way that should be sufficient for that case.
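The log check in the test above could be automated roughly as follows. This is a sketch under assumptions: the function name `classify_journal` and its three result labels are invented for illustration, and the pre-fix marker relies on the "Unexpected DBus type" message that is only believed, not confirmed, to appear before the fix.

```python
# Marker expected in the log after the fix (taken from the message above).
FIXED_MARKER = "Could not issue start for"
# Marker believed to appear before the fix; unconfirmed.
UNFIXED_MARKER = "Unexpected DBus type"


def classify_journal(lines):
    """Classify log lines from the original node after the failed start.

    Returns "fixed" if the post-fix error is present, "unfixed" if only the
    pre-fix error is present, and "inconclusive" otherwise.
    """
    lines = list(lines)  # allow generators; the lines are scanned twice
    if any(FIXED_MARKER in line for line in lines):
        return "fixed"
    if any(UNFIXED_MARKER in line for line in lines):
        return "unfixed"
    return "inconclusive"
```

The lines could come from the node's journal or from the Pacemaker log file; how they are collected is left out of the sketch.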

Comment 5 Jaroslav Kortus 2016-09-09 15:39:06 UTC
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep  9 17:34:07 2016		Last change: Fri Sep  9 17:33:52 2016 by root via cibadmin on tardis-01

3 nodes and 10 resources configured

Online: [ tardis-01 tardis-02 tardis-03 ]

Full list of resources:

 fencing-tardis01	(stonith:fence_ipmilan):	Started tardis-01
 fencing-tardis02	(stonith:fence_ipmilan):	Started tardis-02
 fencing-tardis03	(stonith:fence_ipmilan):	Started tardis-03
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 apache	(systemd:httpd):	Started tardis-01

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@tardis-01 ~]# pcs resource disable apache
[root@tardis-01 ~]# rpm -e httpd
[root@tardis-01 ~]# pcs resource enable apache
[root@tardis-01 ~]# pcs status
Cluster name: tardis
Stack: corosync
Current DC: tardis-01 (version 1.1.15-10.el7-e174ec8) - partition with quorum
Last updated: Fri Sep  9 17:35:38 2016		Last change: Fri Sep  9 17:35:23 2016 by root via crm_resource on tardis-01

3 nodes and 10 resources configured

Online: [ tardis-01 tardis-02 tardis-03 ]

Full list of resources:

 fencing-tardis01	(stonith:fence_ipmilan):	Started tardis-01
 fencing-tardis02	(stonith:fence_ipmilan):	Started tardis-02
 fencing-tardis03	(stonith:fence_ipmilan):	Started tardis-03
 Clone Set: dlm-clone [dlm]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ tardis-01 tardis-02 tardis-03 ]
 apache	(systemd:httpd):	Started tardis-02

Failed Actions:
* apache_start_0 on tardis-01 'not installed' (5): call=38, status=Not installed, exitreason='none',
    last-rc-change='Fri Sep  9 17:35:23 2016', queued=0ms, exec=101ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled


tardis-01 journal:
lrmd[2127]:    error: Could not issue start for apache: Unit not found.
crmd[2130]:    error: Result of start operation for apache on tardis-01: Not installed | call=38 key=apache_start_0 confirmed=true status=7 cib-update=268
crmd[2130]:  warning: Action 37 (apache_start_0) on tardis-01 failed (target: 0 vs. rc: 5): Error

Works as expected with pacemaker-1.1.15-10.el7.x86_64.
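For scripted verification, the "Failed Actions" entry shown above can be parsed to confirm the return code. This is a minimal sketch: the helper name `parse_failed_action` and the regular expression are assumptions based solely on the single line format visible in this output, and other pcs versions may format the entry differently.

```python
import re

# Matches the first line of a "Failed Actions" entry as printed above, e.g.
# * apache_start_0 on tardis-01 'not installed' (5): call=38, ...
FAILED_ACTION_RE = re.compile(
    r"^\*\s+(?P<operation>\S+) on (?P<node>\S+) "
    r"'(?P<rc_text>[^']*)' \((?P<rc>\d+)\):"
)


def parse_failed_action(line):
    """Return operation, node, rc_text and rc (as int), or None if no match."""
    match = FAILED_ACTION_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["rc"] = int(fields["rc"])
    return fields
```

For the entry above, the parsed `rc` is 5 ("not installed"), which is the expected result when the unit file has been removed outright.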

Comment 7 errata-xmlrpc 2016-11-03 18:59:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html

