Bug 1303969

Summary: resource (un)manage: add optional switch to dis-/en-able monitor operations as well
Product: Red Hat Enterprise Linux 7
Component: pcs
Version: 7.2
Status: CLOSED ERRATA
Severity: unspecified
Priority: high
Reporter: Jan Pokorný [poki] <jpokorny>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
Docs Contact: Steven J. Levine <slevine>
CC: cfeist, cluster-maint, idevat, kgaillot, mkelly, omular, rsteiger, sbradley, slevine, tlavigne, tojeline
Target Milestone: rc
Keywords: FutureFeature
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pcs-0.9.158-2.el7
Doc Type: Release Note
Doc Text:
New option to the "pcs resource unmanage" command to disable monitor operations

Even when a resource is in unmanaged mode, monitor operations are still run by the cluster. This may cause the cluster to report errors the user is not interested in, as those errors may be expected for a particular use case while the resource is unmanaged. The "pcs resource unmanage" command now supports the "--monitor" option, which disables monitor operations when putting a resource into unmanaged mode. Likewise, the "pcs resource manage" command supports the "--monitor" option, which enables monitor operations when putting a resource back into managed mode.
Last Closed: 2017-08-01 18:22:57 UTC
Type: Bug
Attachments:
  proposed fix + tests (flags: none)
  additional fix + tests (flags: none)

Description Jan Pokorný [poki] 2016-02-02 14:50:58 UTC
There are cases when setting a resource to unmanaged is not enough,
because the underlying recurring monitor operations are not paused [1].
This can cause issues when values other than those indicating "started"
or "stopped" are signalled (simply because the environment is undergoing
some kind of reconstruction).

Proposed solution:

pcs resource unmanage RESOURCE --with-monitor
pcs resource manage RESOURCE --with-monitor

--with-monitor for unmanage:
  for all respective recurring monitors: enabled=false 

--with-monitor for manage:
  for all respective recurring monitors: enabled=true
  (or rather, drop the property and use the identical default)

Alternatively, monitor=1 instead of --with-monitor.
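
For illustration, a sketch of the CIB change such a switch would make
(the resource name R and the operation id are hypothetical; enabled="false"
is the standard Pacemaker attribute for disabling an operation):

pcs resource unmanage R --with-monitor
# each recurring monitor of R would then carry enabled="false":
<op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>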


[1] http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/#_monitoring_resources_when_administration_is_disabled

Comment 1 Jan Pokorný [poki] 2016-02-03 15:26:12 UTC
Real world use case:
http://oss.clusterlabs.org/pipermail/users/2016-February/002217.html

Note that even more elaborate shortcuts for the maintenance/monitor
combination may be desired (as one can gather from that post).

Comment 3 Tomas Jelinek 2016-07-13 15:05:22 UTC
Another real-life use case:
http://clusterlabs.org/pipermail/users/2016-July/003490.html

Comment 4 Madison Kelly 2016-07-13 18:46:54 UTC
I would also like to see this option.

Comment 6 Ken Gaillot 2017-01-24 17:56:32 UTC
Pacemaker continues to monitor unmanaged resources so it can provide accurate status output during maintenance. The behavior is described upstream at:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-monitoring-unmanaged

The example in Comment 1 is not actually relevant here. That example is a problem regardless of monitors -- resources should not be moved to another node while unmanaged. In that case, the correct resolution was to configure the resource appropriately for live migration.

It is a good idea to offer users the option of disabling monitors, in case they don't want to see the failures cluttering the status output. But many users will want to know whether the service is functioning or not, regardless of maintenance mode, so I wouldn't make it the default. For example, a user might not want to leave maintenance mode until a monitor comes back successful, or they might want to know if maintenance on one resource causes problems for another (also unmanaged) resource.

To disable a monitor, set enabled=FALSE in the operation definition.
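
For example, a disabled monitor operation in the CIB looks like this
(a minimal sketch; the operation id follows the format pcs generates,
as shown in the later comments):

<op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>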

Comment 8 Tomas Jelinek 2017-02-28 14:08:23 UTC
We do not want to disable or enable monitor operations by default. However, running a resource with all monitors disabled may cause issues. Pcs should therefore display a warning when managing a resource that has all monitor operations disabled, unless the user requested enabling the monitors.
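
A sketch of the intended behavior (the warning text is illustrative,
not necessarily the verbatim pcs output):

[vm-rhel72-1 ~] $ pcs resource manage R
Warning: Resource 'R' has no enabled monitor operations. Re-run with '--monitor' to enable them.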

Comment 9 Tomas Jelinek 2017-03-17 15:22:44 UTC
Created attachment 1264062 [details]
proposed fix + tests

Comment 10 Ivan Devat 2017-04-10 15:57:59 UTC
After Fix:

[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.157-1.el7.x86_64

[vm-rhel72-1 ~] $ pcs resource
 R      (ocf::heartbeat:Dummy): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs resource unmanage R --monitor
[vm-rhel72-1 ~] $ pcs cluster cib|grep '<primitive.*id="R"' -A5
      <primitive class="ocf" id="R" provider="heartbeat" type="Dummy">
        <meta_attributes id="R-meta_attributes">
          <nvpair id="R-meta_attributes-is-managed" name="is-managed" value="false"/>
        </meta_attributes>
        <operations>
          <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>
[vm-rhel72-1 ~] $ pcs resource manage R --monitor
[vm-rhel72-1 ~] $ pcs cluster cib|grep '<primitive.*id="R"' -A2
      <primitive class="ocf" id="R" provider="heartbeat" type="Dummy">
        <operations>
          <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20"/>


> Code was completely rewritten, so testing all combinations is required.
> It must work for clones, masters, and groups as well.
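
For instance, the master case could be exercised analogously (a sketch
only; it assumes this pcs version accepts a trailing "master" keyword
like the "clone" keyword used in the test in Comment 15 below, and that
unmanaging the master mirrors the clone behavior):

[vm-rhel72-1 ~] $ pcs resource create StatefulDummy ocf:pacemaker:Stateful master
[vm-rhel72-1 ~] $ pcs resource unmanage StatefulDummy-master --monitor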

Comment 15 Tomas Jelinek 2017-04-24 15:38:11 UTC
Created attachment 1273637 [details]
additional fix + tests

My bad, I did not notice that the reported issue was the monitor operation not being disabled.


The table specifying which resources to set as unmanaged had to be updated to accommodate correct behavior with respect to disabled monitor operations:

resource hierarchy - specified resource - what to return
a primitive - the primitive - the primitive

a cloned primitive - the primitive - the primitive
a cloned primitive - the clone - the primitive
  The resource will keep running on all nodes after unclone. However,
  that does not seem to be bad behavior. Moreover, if monitor operations
  were disabled, they would not be re-enabled on unclone, but the
  resource would become managed, which is definitely bad.

a primitive in a group - the primitive - the primitive
  Otherwise all primitives in the group would become unmanaged.
a primitive in a group - the group - all primitives in the group
  If only the group was set to unmanaged, setting any primitive in the
  group to managed would set all the primitives in the group to managed.
  If the group as well as all its primitives were set to unmanaged, any
  primitive added to the group would become unmanaged. This new primitive
  would become managed if any original group primitive becomes managed.
  Therefore changing one primitive influences another one, which we do
  not want to happen.

a primitive in a cloned group - the primitive - the primitive
a primitive in a cloned group - the group - all primitives in the group
  See group notes above
a primitive in a cloned group - the clone - all primitives in the group
  See clone notes above
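
To make the group rule concrete, a sketch (resource and group names are
hypothetical, start/stop operation lines are omitted for brevity, and the
expected output is extrapolated from the clone test below; per the table,
is-managed=false lands on each primitive in the group, not on the group):

[root@rh73-node1:~]# pcs resource create D1 ocf:heartbeat:Dummy --group GR
[root@rh73-node1:~]# pcs resource create D2 ocf:heartbeat:Dummy --group GR
[root@rh73-node1:~]# pcs resource unmanage GR --monitor
[root@rh73-node1:~]# pcs resource show GR
 Group: GR
  Resource: D1 (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: is-managed=false
   Operations: monitor enabled=false interval=10 timeout=20 (D1-monitor-interval-10)
  Resource: D2 (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: is-managed=false
   Operations: monitor enabled=false interval=10 timeout=20 (D2-monitor-interval-10)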


Test:
[root@rh73-node1:~]# pcs resource create CloneDummy ocf:heartbeat:Dummy clone
[root@rh73-node1:~]# pcs resource unmanage CloneDummy-clone --monitor
[root@rh73-node1:~]# pcs resource show CloneDummy-clone
 Clone: CloneDummy-clone
  Resource: CloneDummy (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: is-managed=false
   Operations: monitor enabled=false interval=10 timeout=20 (CloneDummy-monitor-interval-10)
               start interval=0s timeout=20 (CloneDummy-start-interval-0s)
               stop interval=0s timeout=20 (CloneDummy-stop-interval-0s)

Comment 16 Tomas Jelinek 2017-05-26 10:55:46 UTC
After fix:

[root@rh73-node1:~]# rpm -q pcs
pcs-0.9.158-2.el7.x86_64
[root@rh73-node1:~]# pcs resource show dummy-clone
 Clone: dummy-clone
  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
   Operations: monitor interval=10 timeout=20 (dummy-monitor-interval-10)
               start interval=0s timeout=20 (dummy-start-interval-0s)
               stop interval=0s timeout=20 (dummy-stop-interval-0s)
[root@rh73-node1:~]# pcs resource unmanage dummy-clone --monitor
[root@rh73-node1:~]# pcs resource show dummy-clone
 Clone: dummy-clone
  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
   Meta Attrs: is-managed=false 
   Operations: monitor enabled=false interval=10 timeout=20 (dummy-monitor-interval-10)
               start interval=0s timeout=20 (dummy-start-interval-0s)
               stop interval=0s timeout=20 (dummy-stop-interval-0s)

Comment 23 errata-xmlrpc 2017-08-01 18:22:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1958