Bug 1631519 - [RFE] The cluster should not be allowed to disable a resource if dependent resources are still online
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcs
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: 8.2
Assignee: Tomas Jelinek
QA Contact: pkomarov
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 1759305 1770973 1787598
 
Reported: 2018-09-20 19:08 UTC by Ryan
Modified: 2020-04-28 15:28 UTC
CC: 11 users

Fixed In Version: pcs-0.10.3-2.el8
Doc Type: Enhancement
Doc Text:
.New command options to disable a resource only if this would not affect other resources

It is sometimes necessary to disable resources only if this would not have an effect on other resources. Ensuring that this is the case can be impossible to do by hand when complex resource relations are set up. To address this need, the `pcs resource disable` command now supports the following options:

* `pcs resource disable --simulate`: show the effects of disabling the specified resource(s) without changing the cluster configuration
* `pcs resource disable --safe`: disable the specified resource(s) only if no other resources would be affected in any way, such as being migrated from one node to another
* `pcs resource disable --safe --no-strict`: disable the specified resource(s) only if no other resources would be stopped or demoted

In addition, the `pcs resource safe-disable` command has been introduced as an alias for `pcs resource disable --safe`.
Clone Of:
Clones: 1759305 1770973 (view as bug list)
Environment:
Last Closed: 2020-04-28 15:27:56 UTC
Type: Bug
Target Upstream Version:


Attachments
proposed fix + tests (87.89 KB, patch)
2019-10-07 08:54 UTC, Tomas Jelinek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 773484 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Bugzilla 1631514 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Bugzilla 1782887 0 unspecified CLOSED invalid transition produced in crm_simulate with "start A then stop B" constraint 2021-12-13 15:01:57 UTC
Red Hat Knowledge Base (Solution) 3621001 0 None None None 2018-09-21 19:44:19 UTC
Red Hat Product Errata RHEA-2020:1568 0 None None None 2020-04-28 15:28:35 UTC

Internal Links: 773484 1631514 1782887 1833114

Description Ryan 2018-09-20 19:08:24 UTC
1. Proposed title of this feature request
Do not disable a resource if dependent resources are still online

3. What is the nature and description of the request?
When disabling a resource, we want to know if resources that rely on it are still operating. For example, we don't want to pull an lvm resource out from under a filesystem resource, or a filesystem resource out from under a db resource.

4. Why does the customer need this? (List the business requirements here)
Prevent data loss/corruption by stopping resources inappropriately.

5. How would the customer like to achieve this? (List the functional requirements here)
An error or at least a warning should be thrown when a pcs resource disable <resource id> command targets a resource which is a dependency of other running resources.

6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
Before: pcs resource disable <resource id> succeeds.
After: pcs resource disable <resource id> throws error or warning.

7. Is there already an existing RFE upstream or in Red Hat Bugzilla?
No.

8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?
7.6 if possible.

9. Is the sales team involved in this request and do they have any additional input?
No.

10. List any affected packages or components.
pcs

11. Would the customer be able to assist in testing this functionality if implemented?
Yes.

Comment 7 Tomas Jelinek 2019-10-07 08:54:04 UTC
Created attachment 1623047 [details]
proposed fix + tests

The original behavior of the "pcs resource disable" command has been preserved to maintain backward compatibility. However, new options have been added to the command providing the requested functionality:
* "pcs resource disable --simulate": show effects of disabling specified resource(s) while not changing the cluster configuration
* "pcs resource disable --safe": disable specified resource(s) only if no other resources would be stopped or demoted
* "pcs resource disable --safe --strict": disable specified resource(s) only if no other resources would be affected in any way, e.g. migrated from one node to another
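pcs delegates the actual impact analysis to Pacemaker's crm_simulate, but the strict check conceptually reduces to a transitive walk over constraint relations. A minimal Python sketch of that idea (the constraint data and helper are hypothetical illustrations, not pcs code):

```python
# Sketch of the "safe disable" idea: find every resource that transitively
# depends on the one being disabled. The constraint list mirrors the
# locking-clone -> lvm -> fs -> web chain used in the tests below; pcs
# itself asks crm_simulate for the real answer.

# "start A then start B" / "B with A" both mean: B depends on A.
constraints = [
    ("locking-clone", "lvm"),
    ("lvm", "fs"),
    ("fs", "web"),
]

def affected_by_disabling(resource, constraints):
    """Return all resources that would be stopped along with `resource`."""
    affected = {resource}
    changed = True
    while changed:
        changed = False
        for dependency, dependent in constraints:
            if dependency in affected and dependent not in affected:
                affected.add(dependent)
                changed = True
    return affected - {resource}

# Disabling lvm drags down fs and web, so a safe disable must refuse it.
print(sorted(affected_by_disabling("lvm", constraints)))  # ['fs', 'web']
```

When the returned set is non-empty, a safe disable reports "Disabling specified resources would have an effect on other resources" and exits non-zero, as the transcripts below show.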


Test:

1. Set some resource dependencies, e.g.:
[root@rh80-node1:~]# pcs constraint
Location Constraints:
Ordering Constraints:
  start d1 then start d2 (kind:Mandatory)
Colocation Constraints:
Ticket Constraints:


2. Check effects of disabling a resource:
[root@rh80-node1:~]# pcs resource disable --simulate d1
1 of 4 resources DISABLED and 0 BLOCKED from being started due to failures

Current cluster status:
Online: [ rh80-node1 rh80-node2 ]

 xvm    (stonith:fence_xvm):    Started rh80-node1
 d2     (ocf::pacemaker:Dummy): Started rh80-node2
 d1     (ocf::pacemaker:Dummy): Started rh80-node1 (disabled)
 d3     (ocf::pacemaker:Dummy): Started rh80-node2

Transition Summary:
 * Stop       d2      ( rh80-node2 )   due to required d1 start
 * Stop       d1      ( rh80-node1 )   due to node availability

Executing cluster transition:
 * Resource action: d2              stop on rh80-node2
 * Resource action: d1              stop on rh80-node1

Revised cluster status:
Online: [ rh80-node1 rh80-node2 ]

 xvm    (stonith:fence_xvm):    Started rh80-node1
 d2     (ocf::pacemaker:Dummy): Stopped
 d1     (ocf::pacemaker:Dummy): Stopped (disabled)
 d3     (ocf::pacemaker:Dummy): Started rh80-node2


3. Try disabling a resource while not stopping other resources:
[root@rh80-node1:~]# pcs resource disable --safe d1
Error: Disabling specified resources would have an effect on other resources

1 of 4 resources DISABLED and 0 BLOCKED from being started due to failures

Current cluster status:
Online: [ rh80-node1 rh80-node2 ]

 xvm    (stonith:fence_xvm):    Started rh80-node1
 d2     (ocf::pacemaker:Dummy): Started rh80-node2
 d1     (ocf::pacemaker:Dummy): Started rh80-node1 (disabled)
 d3     (ocf::pacemaker:Dummy): Started rh80-node2

Transition Summary:
 * Stop       d2      ( rh80-node2 )   due to required d1 start
 * Stop       d1      ( rh80-node1 )   due to node availability

Executing cluster transition:
 * Resource action: d2              stop on rh80-node2
 * Resource action: d1              stop on rh80-node1

Revised cluster status:
Online: [ rh80-node1 rh80-node2 ]

 xvm    (stonith:fence_xvm):    Started rh80-node1
 d2     (ocf::pacemaker:Dummy): Stopped
 d1     (ocf::pacemaker:Dummy): Stopped (disabled)
 d3     (ocf::pacemaker:Dummy): Started rh80-node2


4. Disable the resource anyway:
[root@rh80-node1:~]# pcs resource disable d1
[root@rh80-node1:~]# pcs status resources
d2     (ocf::pacemaker:Dummy): Stopped
d1     (ocf::pacemaker:Dummy): Stopped (disabled)
d3     (ocf::pacemaker:Dummy): Started rh80-node2

Comment 8 Miroslav Lisik 2019-10-23 15:40:43 UTC
After fix:

[root@r81-node-01 ~]# rpm -q pcs
pcs-0.10.3-1.el8.x86_64

[root@r81-node-01 ~]# pcs resource
 Clone Set: locking-clone [locking]
     Started: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Started r81-node-01
 fs     (ocf::pacemaker:Dummy): Started r81-node-01
 web    (ocf::pacemaker:Dummy): Started r81-node-01
[root@r81-node-01 ~]# pcs constraint
Location Constraints:
Ordering Constraints:
  start locking-clone then start lvm (kind:Mandatory)
  start lvm then start fs (kind:Mandatory)
  start fs then start web (kind:Mandatory)
Colocation Constraints:
  lvm with locking-clone (score:INFINITY)
  fs with lvm (score:INFINITY)
  web with fs (score:INFINITY)
Ticket Constraints:

A) --simulate

[root@r81-node-01 ~]# pcs resource disable --simulate dlm
4 of 9 resources DISABLED and 0 BLOCKED from being started due to failures

Current cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01      (stonith:fence_xvm):    Started r81-node-01
 fence-r81-node-02      (stonith:fence_xvm):    Started r81-node-02
 Clone Set: locking-clone [locking]
     Started: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Started r81-node-01
 fs     (ocf::pacemaker:Dummy): Started r81-node-01
 web    (ocf::pacemaker:Dummy): Started r81-node-01

Transition Summary:
 * Stop       dlm:0          ( r81-node-02 )   due to node availability
 * Stop       lvmlockd:0     ( r81-node-02 )   due to node availability
 * Stop       dlm:1          ( r81-node-01 )   due to node availability
 * Stop       lvmlockd:1     ( r81-node-01 )   due to node availability
 * Stop       lvm            ( r81-node-01 )   due to node availability
 * Stop       fs             ( r81-node-01 )   due to node availability
 * Stop       web            ( r81-node-01 )   due to node availability

Executing cluster transition:
 * Resource action: web             stop on r81-node-01
 * Resource action: fs              stop on r81-node-01
 * Resource action: lvm             stop on r81-node-01
 * Pseudo action:   locking-clone_stop_0
 * Pseudo action:   locking:0_stop_0
 * Resource action: lvmlockd        stop on r81-node-02
 * Pseudo action:   locking:1_stop_0
 * Resource action: lvmlockd        stop on r81-node-01
 * Resource action: dlm             stop on r81-node-02
 * Resource action: dlm             stop on r81-node-01
 * Pseudo action:   locking:0_stopped_0
 * Pseudo action:   locking:1_stopped_0
 * Pseudo action:   locking-clone_stopped_0

Revised cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01      (stonith:fence_xvm):    Started r81-node-01
 fence-r81-node-02      (stonith:fence_xvm):    Started r81-node-02
 Clone Set: locking-clone [locking]
     Stopped: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Stopped
 fs     (ocf::pacemaker:Dummy): Stopped
 web    (ocf::pacemaker:Dummy): Stopped
[root@r81-node-01 ~]# echo $?
0

B) --safe

[root@r81-node-01 ~]# pcs resource disable lvm --safe
Error: Disabling specified resources would have an effect on other resources

1 of 9 resources DISABLED and 0 BLOCKED from being started due to failures

Current cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01      (stonith:fence_xvm):    Started r81-node-01
 fence-r81-node-02      (stonith:fence_xvm):    Started r81-node-02
 Clone Set: locking-clone [locking]
     Started: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Started r81-node-01 (disabled)
 fs     (ocf::pacemaker:Dummy): Started r81-node-01
 web    (ocf::pacemaker:Dummy): Started r81-node-01

Transition Summary:
 * Stop       lvm     ( r81-node-01 )   due to node availability
 * Stop       fs      ( r81-node-01 )   due to node availability
 * Stop       web     ( r81-node-01 )   due to node availability

Executing cluster transition:
 * Resource action: web             stop on r81-node-01
 * Resource action: fs              stop on r81-node-01
 * Resource action: lvm             stop on r81-node-01

Revised cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01      (stonith:fence_xvm):    Started r81-node-01
 fence-r81-node-02      (stonith:fence_xvm):    Started r81-node-02
 Clone Set: locking-clone [locking]
     Started: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Stopped (disabled)
 fs     (ocf::pacemaker:Dummy): Stopped
 web    (ocf::pacemaker:Dummy): Stopped
[root@r81-node-01 ~]# echo $?
1

C) --strict

[root@r81-node-01 ~]# pcs resource disable fs --strict
Error: Disabling specified resources would have an effect on other resources

1 of 9 resources DISABLED and 0 BLOCKED from being started due to failures

Current cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01      (stonith:fence_xvm):    Started r81-node-01
 fence-r81-node-02      (stonith:fence_xvm):    Started r81-node-02
 Clone Set: locking-clone [locking]
     Started: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Started r81-node-01
 fs     (ocf::pacemaker:Dummy): Started r81-node-01 (disabled)
 web    (ocf::pacemaker:Dummy): Started r81-node-01

Transition Summary:
 * Stop       fs      ( r81-node-01 )   due to node availability
 * Stop       web     ( r81-node-01 )   due to node availability

Executing cluster transition:
 * Resource action: web             stop on r81-node-01
 * Resource action: fs              stop on r81-node-01

Revised cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01      (stonith:fence_xvm):    Started r81-node-01
 fence-r81-node-02      (stonith:fence_xvm):    Started r81-node-02
 Clone Set: locking-clone [locking]
     Started: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Started r81-node-01
 fs     (ocf::pacemaker:Dummy): Stopped (disabled)
 web    (ocf::pacemaker:Dummy): Stopped
[root@r81-node-01 ~]# echo $?
1

D) disable without long options

[root@r81-node-01 ~]# pcs resource disable dlm
[root@r81-node-01 ~]# pcs status resources
 Clone Set: locking-clone [locking]
     Stopped: [ r81-node-01 r81-node-02 ]
 lvm    (ocf::pacemaker:Dummy): Stopped
 fs     (ocf::pacemaker:Dummy): Stopped
 web    (ocf::pacemaker:Dummy): Stopped

Comment 11 Ondrej Mular 2019-11-08 12:38:20 UTC
Additional commit:
https://github.com/ClusterLabs/pcs/commit/b2be7b5482232910a490544659b34d833783346d

Changes:
 * added command 'pcs resource safe-disable' which is an alias of 'pcs resource disable --safe'
 * default behavior of 'pcs resource disable --safe' has been changed to strict mode, therefore '--strict' has been replaced by '--no-strict' option

Test:
[root@rhel82-devel2 pcs]# pcs resource safe-disable dummy1 --no-strict
[root@rhel82-devel2 pcs]# echo $?
0
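The resulting option semantics -- safe-disable as an alias for disable --safe, with strict checking as the new default -- can be summarized in a small Python sketch (the helper is a hypothetical model of the documented behavior, not the pcs option parser):

```python
def effective_mode(command, flags):
    """Map a pcs invocation to its (safe, strict) behavior.

    Hypothetical model of the documented semantics, not pcs internals:
    - "safe-disable" is an alias for "disable --safe"
    - with --safe, strict checking is the default; --no-strict relaxes it
      to only guard against other resources being stopped or demoted
    """
    safe = command == "safe-disable" or "--safe" in flags
    strict = safe and "--no-strict" not in flags
    return safe, strict

# pcs resource safe-disable d1  ==  pcs resource disable --safe d1
print(effective_mode("safe-disable", []))                    # (True, True)
print(effective_mode("disable", ["--safe"]))                 # (True, True)
print(effective_mode("disable", ["--safe", "--no-strict"]))  # (True, False)
```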

Comment 12 Miroslav Lisik 2019-11-19 11:03:13 UTC
Test:

[root@r81-node-01 ~]# rpm -q pcs
pcs-0.10.3-2.el8.x86_64

[root@r81-node-01 ~]# pcs resource
  * dummy-01	(ocf::pacemaker:Dummy):	 Started r81-node-01
  * dummy-02	(ocf::pacemaker:Dummy):	 Started r81-node-02
[root@r81-node-01 ~]# pcs constraint order 
Ordering Constraints:
  start dummy-01 then start dummy-02 (kind:Mandatory)


[root@r81-node-01 ~]# pcs resource safe-disable dummy-01 --no-strict
Error: Disabling specified resources would have an effect on other resources

1 of 4 resource instances DISABLED and 0 BLOCKED from further action due to failure

Current cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01	(stonith:fence_xvm):	Started r81-node-01
 fence-r81-node-02	(stonith:fence_xvm):	Started r81-node-02
 dummy-01	(ocf::pacemaker:Dummy):	Started r81-node-01 (disabled)
 dummy-02	(ocf::pacemaker:Dummy):	Started r81-node-02

Transition Summary:
 * Stop       dummy-01     ( r81-node-01 )   due to node availability
 * Stop       dummy-02     ( r81-node-02 )   due to required dummy-01 start

Executing cluster transition:
 * Resource action: dummy-02        stop on r81-node-02
 * Resource action: dummy-01        stop on r81-node-01

Revised cluster status:
Online: [ r81-node-01 r81-node-02 ]

 fence-r81-node-01	(stonith:fence_xvm):	Started r81-node-01
 fence-r81-node-02	(stonith:fence_xvm):	Started r81-node-02
 dummy-01	(ocf::pacemaker:Dummy):	Stopped (disabled)
 dummy-02	(ocf::pacemaker:Dummy):	Stopped
[root@r81-node-01 ~]# echo $?
1

Comment 19 Ken Gaillot 2019-12-11 17:49:24 UTC
(In reply to Ondrej Mular from comment #11)
> Additional commit:
> https://github.com/ClusterLabs/pcs/commit/b2be7b5482232910a490544659b34d833783346d
> 
> Changes:
>  * added command 'pcs resource safe-disable' which is an alias of 'pcs
> resource disable --safe'
>  * default behavior of 'pcs resource disable --safe' has been changed to
> strict mode, therefore '--strict' has been replaced by '--no-strict' option
> 
> Test:
> [root@rhel82-devel2 pcs]# pcs resource safe-disable dummy1 --no-strict
> [root@rhel82-devel2 pcs]# echo $?
> 0

This feedback may be late, but thinking about the general problem, maybe pcs could have a "safe mode" that would change the defaults for a wide range of commands. When someone chooses safety or speed they generally want it for everything or nothing.

For example, "pcs resource disable" would take complementary options such as --safe/--no-safe or --safe=true/false (and similarly for strict). "pcs mode safe" (or whatever) would set a flag in pcs_settings.conf (cluster-wide) and make "pcs resource disable" default to --safe --strict, and the user would have to specify --no-safe/--safe=false etc. to get the usual behavior.

The benefit is that users don't have to remember separate commands, and sites can set general policies that are enforced with all users. Also it avoids having to add "safeX" equivalents for a bunch of other commands in the future. (And it avoids the cringy "pcs safedisable --force" == "pcs disable".)

Alternatively, pcs could take a cue from rm/cp/mv, and take an -i/--interactive option. If specified, for any potentially "dangerous" command, pcs would show what would happen and ask the user for confirmation. As is commonly done for rm/cp/mv, the user could alias 'pcs' to 'pcs -i' in their bashrc.

Comment 21 Ken Gaillot 2019-12-11 18:08:09 UTC
>> Perhaps pcs should ignore crm_simulate returning 1 and just go through the
>> transitions? In that case, pcs would proceed and stop the resource as only
>> the resource being stopped is mentioned in the transitions (meaning no other
>> resources would be affected).
>
> I wouldn't; an invalid transition means there's something wrong with the graph.
> I think it's better to require the user to force the command in that case. It
> should be very rare since it indicates a bug.

Also, it means that the live cluster won't be able to execute the transition if you commit the change, and will likely be blocked from all further action. So even forcing it would be a bad idea.
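The policy settled on here -- simulate first, and refuse outright when the simulation itself fails -- can be sketched as a small decision helper (hypothetical illustration of the reasoning above, not the actual pcs implementation):

```python
# Decision sketch for a safe disable. The two booleans stand in for the
# outcome of running crm_simulate; they are a hypothetical model, not the
# real pcs or Pacemaker API.

def safe_disable_decision(simulation_ok, others_affected, force=False):
    """Return the action a safe disable should take.

    simulation_ok:   did the simulated transition compute cleanly?
    others_affected: would resources besides the target be changed?
    """
    if not simulation_ok:
        # An invalid transition graph indicates a bug, and committing it
        # could block the live cluster from all further action -- so
        # refuse even a forced request.
        return "error: invalid transition, refusing"
    if others_affected and not force:
        return "error: would affect other resources"
    return "disable"

print(safe_disable_decision(True, False))              # disable
print(safe_disable_decision(True, True))               # error: would affect other resources
print(safe_disable_decision(False, True, force=True))  # error: invalid transition, refusing
```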

Comment 23 pkomarov 2019-12-15 15:34:04 UTC
Verified,

[root@controller-0 ~]# rpm -q pcs
pcs-0.10.3-2.el8.x86_64

[root@controller-0 ~]# pcs config|grep -B 1 'ip-192.168.24.45 then'
Ordering Constraints:
  start ip-192.168.24.45 then start haproxy-bundle (kind:Optional) (id:order-ip-192.168.24.45-haproxy-bundle-Optional)



[root@controller-0 ~]#  pcs resource disable --safe haproxy-bundle
Error: Disabling specified resources would have an effect on other resources

3 of 50 resources DISABLED and 0 BLOCKED from being started due to failures

 [...]

Revised cluster status:
Online: [ controller-0 controller-1 controller-2 ]
[...]
 ip-192.168.24.45	(ocf::heartbeat:IPaddr2):	Stopped
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.1.70	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.1.31	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.3.135	(ocf::heartbeat:IPaddr2):	Stopped
 ip-172.17.4.52	(ocf::heartbeat:IPaddr2):	Stopped
 Container bundle set: haproxy-bundle [undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0	(ocf::heartbeat:podman):	Stopped (disabled)
   haproxy-bundle-podman-1	(ocf::heartbeat:podman):	Stopped (disabled)
   haproxy-bundle-podman-2	(ocf::heartbeat:podman):	Stopped (disabled)
[...]

Comment 26 errata-xmlrpc 2020-04-28 15:27:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:1568

