Bug 1303969 - resource (un)manage: add optional switch to dis-/en-able monitor operations as well
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
Keywords: FutureFeature
Depends On:
Blocks:
Reported: 2016-02-02 09:50 EST by Jan Pokorný
Modified: 2017-08-01 14:22 EDT
CC List: 10 users

See Also:
Fixed In Version: pcs-0.9.158-2.el7
Doc Type: Release Note
Doc Text:
New option to the "pcs resource unmanage" command to disable monitor operations

Even when a resource is in unmanaged mode, monitor operations are still run by the cluster. That may cause the cluster to report errors the user is not interested in, as those errors may be expected for a particular use case when the resource is unmanaged. The "pcs resource unmanage" command now supports the "--monitor" option, which disables monitor operations when putting a resource into unmanaged mode. Additionally, the "pcs resource manage" command also supports the "--monitor" option, which enables the monitor operations when putting a resource back into managed mode.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-01 14:22:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
proposed fix + tests (134.01 KB, patch)
2017-03-17 11:22 EDT, Tomas Jelinek
additional fix + tests (33.41 KB, patch)
2017-04-24 11:38 EDT, Tomas Jelinek

Description Jan Pokorný 2016-02-02 09:50:58 EST
There are cases when setting a resource to unmanaged is not enough,
as the underlying recurring monitor operations are not paused [1].
This can cause issues when values other than those indicating
"started" or "stopped" are signalled (simply because the environment
is in some kind of reconstruction).

Proposed solution:

pcs resource unmanage RESOURCE --with-monitor
pcs resource manage RESOURCE --with-monitor

--with-monitor for unmanage:
  for all respective recurring monitors: enabled=false 

--with-monitor for manage:
  for all respective recurring monitors: enabled=true
  (or rather, drop the property and use the identical default)

Alternatively, monitor=1 instead of --with-monitor.


[1] http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/#_monitoring_resources_when_administration_is_disabled
Comment 1 Jan Pokorný 2016-02-03 10:26:12 EST
Real world use case:
http://oss.clusterlabs.org/pipermail/users/2016-February/002217.html

Note that even more elaborate shortcut requests regarding the
maintenance/monitor combination may be desired (as one can gather
from that post).
Comment 3 Tomas Jelinek 2016-07-13 11:05:22 EDT
another real life use case:
http://clusterlabs.org/pipermail/users/2016-July/003490.html
Comment 4 digimer 2016-07-13 14:46:54 EDT
I would also like to see this option.
Comment 6 Ken Gaillot 2017-01-24 12:56:32 EST
Pacemaker continues to monitor unmanaged resources so it can provide accurate status output during maintenance. The behavior is described upstream at:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-monitoring-unmanaged

The example in Comment 1 is not actually relevant here. That example is a problem regardless of monitors -- resources should not be moved to another node while unmanaged. In that case, the correct resolution was to configure the resource appropriately for live migration.

It is a good idea to offer users the option of disabling monitors, in case they don't want to see the failures cluttering the status output. But many users will want to know whether the service is functioning or not, regardless of maintenance mode, so I wouldn't make it the default. For example, a user might not want to leave maintenance mode until a monitor comes back successful, or they might want to know if maintenance on one resource causes problems for another (also unmanaged) resource.

To disable a monitor, set enabled=FALSE in the operation definition.
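
For illustration, a recurring monitor disabled this way looks as follows in the CIB (resource id "R" assumed here, matching the verification transcript below):

  <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>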
Comment 8 Tomas Jelinek 2017-02-28 09:08:23 EST
We do not want to disable / enable monitor operations by default. However, running a resource with all monitors disabled may cause issues. Pcs should therefore display a warning when managing a resource with all monitors disabled, unless the user requested enabling the monitors.
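
A minimal sketch of the sequence that should trigger this warning (resource name R assumed; the exact warning text is not quoted in this report):

pcs resource unmanage R --monitor   # recurring monitors get enabled=false
pcs resource manage R               # no --monitor: R becomes managed, its
                                    # monitors stay disabled, pcs should warn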
Comment 9 Tomas Jelinek 2017-03-17 11:22 EDT
Created attachment 1264062 [details]
proposed fix + tests
Comment 10 Ivan Devat 2017-04-10 11:57:59 EDT
After Fix:

[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.157-1.el7.x86_64

[vm-rhel72-1 ~] $ pcs resource
 R      (ocf::heartbeat:Dummy): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs resource unmanage R --monitor
[vm-rhel72-1 ~] $ pcs cluster cib|grep '<primitive.*id="R"' -A5
      <primitive class="ocf" id="R" provider="heartbeat" type="Dummy">
        <meta_attributes id="R-meta_attributes">
          <nvpair id="R-meta_attributes-is-managed" name="is-managed" value="false"/>
        </meta_attributes>
        <operations>
          <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>
[vm-rhel72-1 ~] $ pcs resource manage R --monitor
[vm-rhel72-1 ~] $ pcs cluster cib|grep '<primitive.*id="R"' -A2
      <primitive class="ocf" id="R" provider="heartbeat" type="Dummy">
        <operations>
          <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20"/>


> The code was completely rewritten, so testing all combinations is required.
> It must work for clone, master and group resources as well.
Comment 15 Tomas Jelinek 2017-04-24 11:38 EDT
Created attachment 1273637 [details]
additional fix + tests

My bad, I did not notice that the monitor operation not being disabled was what was reported.


The table specifying which resources to set as unmanaged had to be updated to accommodate correct behavior with respect to disabled monitor operations:

resource hierarchy - specified resource - what to return
a primitive - the primitive - the primitive

a cloned primitive - the primitive - the primitive
a cloned primitive - the clone - the primitive
  The resource will run on all nodes after unclone. However that doesn't
  seem to be bad behavior. Moreover, if monitor operations were disabled,
  they wouldn't enable on unclone, but the resource would become managed,
  which is definitely bad.

a primitive in a group - the primitive - the primitive
  Otherwise all primitives in the group would become unmanaged.
a primitive in a group - the group - all primitives in the group
  If only the group was set to unmanaged, setting any primitive in the
  group to managed would set all the primitives in the group to managed.
  If the group as well as all its primitives were set to unmanaged, any
  primitive added to the group would become unmanaged. This new primitive
  would become managed if any original group primitive becomes managed.
  Therefore changing one primitive influences another one, which we do
  not want to happen.

a primitive in a cloned group - the primitive - the primitive
a primitive in a cloned group - the group - all primitives in the group
  See group notes above
a primitive in a cloned group - the clone - all primitives in the group
  See clone notes above


Test:
[root@rh73-node1:~]# pcs resource create CloneDummy ocf:heartbeat:Dummy clone
[root@rh73-node1:~]# pcs resource unmanage CloneDummy-clone --monitor
[root@rh73-node1:~]# pcs resource show CloneDummy-clone
 Clone: CloneDummy-clone
  Resource: CloneDummy (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: is-managed=false
   Operations: monitor enabled=false interval=10 timeout=20 (CloneDummy-monitor-interval-10)
               start interval=0s timeout=20 (CloneDummy-start-interval-0s)
               stop interval=0s timeout=20 (CloneDummy-stop-interval-0s)
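
For the group rows of the table above, a similar check can be sketched (hypothetical resource and group names, not taken from the test above):

[root@rh73-node1:~]# pcs resource create D1 ocf:heartbeat:Dummy --group G
[root@rh73-node1:~]# pcs resource create D2 ocf:heartbeat:Dummy --group G
[root@rh73-node1:~]# pcs resource unmanage G --monitor

Per the table, both D1 and D2 should now carry is-managed=false with their monitor operations set to enabled=false, while the group element itself is left untouched, so that later managing one primitive cannot affect the other.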
Comment 16 Tomas Jelinek 2017-05-26 06:55:46 EDT
After fix:

[root@rh73-node1:~]# rpm -q pcs
pcs-0.9.158-2.el7.x86_64
[root@rh73-node1:~]# pcs resource show dummy-clone
 Clone: dummy-clone
  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
   Operations: monitor interval=10 timeout=20 (dummy-monitor-interval-10)
               start interval=0s timeout=20 (dummy-start-interval-0s)
               stop interval=0s timeout=20 (dummy-stop-interval-0s)
[root@rh73-node1:~]# pcs resource unmanage dummy-clone --monitor
[root@rh73-node1:~]# pcs resource show dummy-clone
 Clone: dummy-clone
  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
   Meta Attrs: is-managed=false 
   Operations: monitor enabled=false interval=10 timeout=20 (dummy-monitor-interval-10)
               start interval=0s timeout=20 (dummy-start-interval-0s)
               stop interval=0s timeout=20 (dummy-stop-interval-0s)
Comment 23 errata-xmlrpc 2017-08-01 14:22:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1958
