Bug 1303969 - resource (un)manage: add optional switch to dis-/en-able monitor operations as well
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
Keywords: FutureFeature
Depends On:
Blocks:
Reported: 2016-02-02 09:50 EST by Jan Pokorný
Modified: 2017-08-01 14:22 EDT
CC List: 10 users

See Also:
Fixed In Version: pcs-0.9.158-2.el7
Doc Type: Release Note
Doc Text:
New option to the "pcs resource unmanage" command to disable monitor operations

Even when a resource is in unmanaged mode, monitor operations are still run by the cluster. That may cause the cluster to report errors the user is not interested in, as those errors may be expected for a particular use case when the resource is unmanaged. The "pcs resource unmanage" command now supports the "--monitor" option, which disables monitor operations when putting a resource into unmanaged mode. Additionally, the "pcs resource manage" command also supports the "--monitor" option, which enables the monitor operations when putting a resource back into managed mode.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-01 14:22:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
proposed fix + tests (134.01 KB, patch)
2017-03-17 11:22 EDT, Tomas Jelinek
additional fix + tests (33.41 KB, patch)
2017-04-24 11:38 EDT, Tomas Jelinek

Description Jan Pokorný 2016-02-02 09:50:58 EST
There are cases when setting a resource to unmanaged is not enough,
as the underlying recurring monitor operations are not paused [1].
This can cause issues when values other than those indicating
"started" or "stopped" are signalled (simply because the environment
is in some kind of reconstruction).

Proposed solution:

pcs resource unmanage RESOURCE --with-monitor
pcs resource manage RESOURCE --with-monitor

--with-monitor for unmanage:
  for all respective recurring monitors: enabled=false 

--with-monitor for manage:
  for all respective recurring monitors: enabled=true
  (or rather, drop the property and use the identical default)

Alternatively, monitor=1 instead of --with-monitor.


[1] http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/#_monitoring_resources_when_administration_is_disabled
Comment 1 Jan Pokorný 2016-02-03 10:26:12 EST
Real world use case:
http://oss.clusterlabs.org/pipermail/users/2016-February/002217.html

Note that even more elaborate shortcut requests regarding the
maintenance/monitor combination may be desired (as one can gather
from that post).
Comment 3 Tomas Jelinek 2016-07-13 11:05:22 EDT
another real life use case:
http://clusterlabs.org/pipermail/users/2016-July/003490.html
Comment 4 digimer 2016-07-13 14:46:54 EDT
I would also like to see this option.
Comment 6 Ken Gaillot 2017-01-24 12:56:32 EST
Pacemaker continues to monitor unmanaged resources so it can provide accurate status output during maintenance. The behavior is described upstream at:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-monitoring-unmanaged

The example in Comment 1 is not actually relevant here. That example is a problem regardless of monitors -- resources should not be moved to another node while unmanaged. In that case, the correct resolution was to configure the resource appropriately for live migration.

It is a good idea to offer users the option of disabling monitors, in case they don't want to see the failures cluttering the status output. But many users will want to know whether the service is functioning or not, regardless of maintenance mode, so I wouldn't make it the default. For example, a user might not want to leave maintenance mode until a monitor comes back successful, or they might want to know if maintenance on one resource causes problems for another (also unmanaged) resource.

To disable a monitor, set enabled=FALSE in the operation definition.
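
For illustration, a recurring monitor disabled this way looks as follows in the CIB (resource id "R" assumed here, matching the verification transcript below):

  <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>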
Comment 8 Tomas Jelinek 2017-02-28 09:08:23 EST
We do not want to disable / enable monitor operations by default. However, running a resource with all monitors disabled may cause issues. Pcs should therefore display a warning when managing a resource with all monitors disabled, unless the user requested enabling the monitors.
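
A minimal sketch of the sequence that should trigger this warning (resource name R assumed; the exact warning text is not quoted in this report):

pcs resource unmanage R --monitor   # recurring monitors get enabled=false
pcs resource manage R               # no --monitor: R becomes managed, its
                                    # monitors stay disabled, pcs should warn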
Comment 9 Tomas Jelinek 2017-03-17 11:22 EDT
Created attachment 1264062 [details]
proposed fix + tests
Comment 10 Ivan Devat 2017-04-10 11:57:59 EDT
After Fix:

[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.157-1.el7.x86_64

[vm-rhel72-1 ~] $ pcs resource
 R      (ocf::heartbeat:Dummy): Started vm-rhel72-3
[vm-rhel72-1 ~] $ pcs resource unmanage R --monitor
[vm-rhel72-1 ~] $ pcs cluster cib|grep '<primitive.*id="R"' -A5
      <primitive class="ocf" id="R" provider="heartbeat" type="Dummy">
        <meta_attributes id="R-meta_attributes">
          <nvpair id="R-meta_attributes-is-managed" name="is-managed" value="false"/>
        </meta_attributes>
        <operations>
          <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20" enabled="false"/>
[vm-rhel72-1 ~] $ pcs resource manage R --monitor
[vm-rhel72-1 ~] $ pcs cluster cib|grep '<primitive.*id="R"' -A2
      <primitive class="ocf" id="R" provider="heartbeat" type="Dummy">
        <operations>
          <op id="R-monitor-interval-10" interval="10" name="monitor" timeout="20"/>


> The code was completely rewritten, so testing all combinations is required.
> It must work for clone, master and group resources as well.
Comment 15 Tomas Jelinek 2017-04-24 11:38 EDT
Created attachment 1273637 [details]
additional fix + tests

My bad, I did not notice that the monitor operation not being disabled was what was reported.


The table specifying which resources to set as unmanaged had to be updated to accommodate correct behavior with respect to disabled monitor operations:

resource hierarchy - specified resource - what to return
a primitive - the primitive - the primitive

a cloned primitive - the primitive - the primitive
a cloned primitive - the clone - the primitive
  The resource will run on all nodes after unclone. However that doesn't
  seem to be bad behavior. Moreover, if monitor operations were disabled,
  they wouldn't enable on unclone, but the resource would become managed,
  which is definitely bad.

a primitive in a group - the primitive - the primitive
  Otherwise all primitives in the group would become unmanaged.
a primitive in a group - the group - all primitives in the group
  If only the group was set to unmanaged, setting any primitive in the
  group to managed would set all the primitives in the group to managed.
  If the group as well as all its primitives were set to unmanaged, any
  primitive added to the group would become unmanaged. This new primitive
  would become managed if any original group primitive becomes managed.
  Therefore changing one primitive influences another one, which we do
  not want to happen.

a primitive in a cloned group - the primitive - the primitive
a primitive in a cloned group - the group - all primitives in the group
  See group notes above
a primitive in a cloned group - the clone - all primitives in the group
  See clone notes above


Test:
[root@rh73-node1:~]# pcs resource create CloneDummy ocf:heartbeat:Dummy clone
[root@rh73-node1:~]# pcs resource unmanage CloneDummy-clone --monitor
[root@rh73-node1:~]# pcs resource show CloneDummy-clone
 Clone: CloneDummy-clone
  Resource: CloneDummy (class=ocf provider=heartbeat type=Dummy)
   Meta Attrs: is-managed=false
   Operations: monitor enabled=false interval=10 timeout=20 (CloneDummy-monitor-interval-10)
               start interval=0s timeout=20 (CloneDummy-start-interval-0s)
               stop interval=0s timeout=20 (CloneDummy-stop-interval-0s)
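
For the group rows of the table above, a similar check can be sketched (hypothetical resource and group names, not taken from the test above):

[root@rh73-node1:~]# pcs resource create D1 ocf:heartbeat:Dummy --group G
[root@rh73-node1:~]# pcs resource create D2 ocf:heartbeat:Dummy --group G
[root@rh73-node1:~]# pcs resource unmanage G --monitor

Per the table, both D1 and D2 should now carry is-managed=false with their monitor operations set to enabled=false, while the group element itself is left untouched, so that later managing one primitive cannot affect the other.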
Comment 16 Tomas Jelinek 2017-05-26 06:55:46 EDT
After fix:

[root@rh73-node1:~]# rpm -q pcs
pcs-0.9.158-2.el7.x86_64
[root@rh73-node1:~]# pcs resource show dummy-clone
 Clone: dummy-clone
  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
   Operations: monitor interval=10 timeout=20 (dummy-monitor-interval-10)
               start interval=0s timeout=20 (dummy-start-interval-0s)
               stop interval=0s timeout=20 (dummy-stop-interval-0s)
[root@rh73-node1:~]# pcs resource unmanage dummy-clone --monitor
[root@rh73-node1:~]# pcs resource show dummy-clone
 Clone: dummy-clone
  Resource: dummy (class=ocf provider=pacemaker type=Dummy)
   Meta Attrs: is-managed=false 
   Operations: monitor enabled=false interval=10 timeout=20 (dummy-monitor-interval-10)
               start interval=0s timeout=20 (dummy-start-interval-0s)
               stop interval=0s timeout=20 (dummy-stop-interval-0s)
Comment 23 errata-xmlrpc 2017-08-01 14:22:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1958
