1446669 – order pacemaker after resource-agents-deps

Bug 1446669 - order pacemaker after resource-agents-deps

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.3
Hardware:	All
OS:	All
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	7.4
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1507411
Depends On:	1316130 1449419
Blocks:

Reported:	2017-04-28 14:39 UTC by Ken Gaillot
Modified:	2020-12-11 20:31 UTC
CC List:	21 users
Fixed In Version:	pacemaker-1.1.16-9.el7
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:	1316130
Environment:
Last Closed:	2017-08-01 17:54:39 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3120531	0	None	None	None	2020-12-11 20:31:53 UTC
Red Hat Product Errata	RHEA-2017:1862	0	normal	SHIPPED_LIVE	pacemaker bug fix and enhancement update	2017-08-01 18:04:15 UTC

Description Ken Gaillot 2017-04-28 14:39:41 UTC

+++ This bug was initially created as a clone of Bug #1316130 +++

Description of problem:

During a cluster node reboot, the node always gets fenced if pacemaker is not stopped beforehand.


Version-Release number of selected component (if applicable):

pacemaker-1.1.13-10.el7_2.2.x86_64
resource-agents-3.9.5-54.el7_2.6.x86_64

How reproducible:

Nearly always, with three gfs2 volumes.


Steps to Reproduce:
1. configure multipath
2. configure gfs2 according to https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Global_File_System_2/ch-clustsetup-GFS2.html
3. reboot a node

Actual results:

node gets fenced


Expected results:

node reboots


Additional info:

Using RHEL 7.2 with pacemaker, dlm, clvmd, and gfs2 on SAN-based block devices. The block devices are multipath devices.


cluster log:


notice: Scheduling Node nodeb for shutdown
notice: Initiating action 70: stop clvmd_stop_0 on nodeb
warning: Action 70 (clvmd_stop_0) on nodeb failed (target: 0 vs. rc: 1): Error
warning: Node nodeb will be fenced because of resource failure(s)
warning: Scheduling Node nodeb for STONITH



the journal log on nodeb:


ERROR: Volume group "vg_sys" not found
Cannot process volume group vg_sys


Analysis:

The ocf::heartbeat::clvm resource agent determines the available volume groups and then proceeds to stop them. It is a shell script, so these two steps are not an atomic operation. See https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/clvm#L269

When nodeb shuts down, the systemd units blk-availability.service and/or multipathd.service remove the multipath block devices. Without the block devices, the volume group vanishes.

If the volume groups vanish after the resource agent determined the volumes (line 269), it can't stop them in line 273.
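
The race described above can be illustrated with a small shell sketch. This is not the clvm agent's code: a temporary directory stands in for the set of visible volume groups, and the function name `stop_all` and its hook argument are hypothetical, introduced only to show how a list-then-act sequence fails when entries disappear between the two steps.

```shell
#!/bin/sh
# Illustration only. Each file in "$dir" stands in for one volume group.
# stop_all DIR [HOOK]: list the entries, run HOOK, then "stop" (remove) each.
stop_all() {
    dir=$1
    hook=${2:-:}              # hypothetical hook; defaults to the no-op ":"
    groups=$(ls "$dir")       # step 1: determine what exists right now
    $hook                     # window in which entries may vanish
    rc=0
    for g in $groups; do
        # step 2: act on the names collected in step 1
        if ! rm "$dir/$g" 2>/dev/null; then
            echo "cannot stop $g: no longer present" >&2
            rc=1
        fi
    done
    return $rc
}
```

If the hook removes an entry after the listing, `stop_all` returns a non-zero status, which is the kind of stop failure that leads pacemaker to fence the node. Ordering pacemaker's shutdown before the units that remove the devices closes that window.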

The problem can be solved by adding an ordering dependency to the pacemaker.service unit through a drop-in file:

/etc/systemd/system/pacemaker.service.d/order.conf

[Unit]
After=multipathd.service
After=blk-availability.service

<snip>

--- Additional comment from Ken Gaillot on 2016-12-19 13:46:58 EST ---

(In reply to Jan Pokorný from comment #18)
> (it could also be s/\.service/\.target/ though no experience here)

Yes, a .target is essentially identical to a .service, but with only [Unit] dependency information (Before/After/Wants/Requires), no [Service] section.

I'm leaning toward this solution:

* resource-agents would deploy a systemd target for agent dependencies (basically just a name, no actual dependencies listed)

* pacemaker's systemd unit file would add After= and Wants= with the new target

* If a particular resource agent has a systemd unit dependency for something that cannot be managed by pacemaker as a resource, that agent could create a drop-in adding the dependency to the target when it is started. For example, clvmd and LVM require blk-availability, but blk-availability would never be a pacemaker resource. I would avoid automating any other dependencies, because we don't know whether pacemaker will manage them -- for example, LVM might depend on iSCSI or multipathd, but we wouldn't want drop-in dependencies for them if pacemaker is managing them.

* System administrators would be required to manually create drop-ins for the new target for any local dependencies. Resource agent man pages and meta-data, and any relevant online documentation, would be updated to mention how to do this. Resource agents could mention common dependencies (such as iSCSI and multipathd for Filesystem).
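
As a rough sketch of how the proposal above might look on disk, assuming the new target is named resource-agents-deps.target (as the bug title suggests) and that multipathd is a local dependency not managed by pacemaker. The drop-in file name "multipath.conf" is an arbitrary choice made for this example.

```ini
# Fragment of pacemaker's unit file: start after the target, stop before it.
[Unit]
After=resource-agents-deps.target
Wants=resource-agents-deps.target

# Administrator drop-in (hypothetical file name):
# /etc/systemd/system/resource-agents-deps.target.d/multipath.conf
[Unit]
After=multipathd.service
```

Because systemd reverses start-up ordering at shutdown, pacemaker would stop its resources before multipathd removes the block devices. After creating or changing a drop-in, `systemctl daemon-reload` makes systemd pick it up.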

Comment 2 Ken Gaillot 2017-05-03 22:27:30 UTC

Fixed upstream as of:

https://github.com/ClusterLabs/pacemaker/pull/1270/commits/06e2e269091ba69e699301d8c86c58ef94809be0

QA: This is simply to support Bug 1316130, so testing that one (while using these packages) is sufficient to test this also.

Comment 3 Ken Gaillot 2017-05-09 21:57:32 UTC

Docs: Any documentation for the parent Bug 1316130 will be sufficient for this as well.

Comment 5 michal novacek 2017-05-26 08:12:29 UTC

Verification is here: bz1316130 comment #30

Comment 6 errata-xmlrpc 2017-08-01 17:54:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862

Comment 7 Ken Gaillot 2017-10-31 21:21:49 UTC

*** Bug 1507411 has been marked as a duplicate of this bug. ***
