Bug 1298585

Summary: [RFE] pcs status output could be simpler when constraints are in place
Product: Red Hat Enterprise Linux 7
Reporter: Michele Baldessari <michele>
Component: pcs
Assignee: Ivan Devat <idevat>
Status: CLOSED ERRATA
QA Contact: cluster-qe <cluster-qe>
Severity: low
Docs Contact:
Priority: low
Version: 7.3
CC: cfeist, cluster-maint, kgaillot, michele, rmarigny, royoung, rsteiger, tojeline, vcojot
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: pcs-0.9.152-7.el7
Doc Type: Enhancement
Doc Text:
Feature: Allow hiding inactive resources in pcs status output.
Reason: Ability to show shorter, more readable output.
Result: pcs is able to hide inactive resources in pcs status output, so the output is shorter and more readable.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-11-03 20:56:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1361533    
Bug Blocks:    
Attachments:
  proposed fix 2 (flags: none)

Description Michele Baldessari 2016-01-14 13:43:23 UTC
(I realize this one might not be trivial, but it is a bit of a pain point
in bigger OpenStack installations, so here we go ;)

In an OSP installation with Instance HA the compute/hypervisor nodes use
pacemaker_remoted to manage services.

There are two types of nodes in this context: controllers and compute nodes.
Each node has an "osprole" node attribute set to either "compute" or "controller", depending on its role.

Now certain services have a constraint restricting them to compute nodes, and
other services have a constraint restricting them to controllers.
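
For illustration only: constraints of this kind are typically expressed as location rules on the osprole node attribute. The exact commands used in this deployment are not shown in this report, so the following is just a sketch, using galera-master from the status output further down:

   # pcs constraint location galera-master rule resource-discovery=exclusive score=0 osprole eq controller

With resource-discovery=exclusive, the resource is probed and allowed to run only on nodes where the rule matches.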

What would be nice is if pcs status did not show the Stopped state for services
on nodes where they may *never* run due to the aforementioned constraints.

Here is an example:
Current DC: overcloud-controller-1 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
5 nodes and 207 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
RemoteOnline: [ overcloud-novacompute-0 overcloud-novacompute-1 ]

Full list of resources:

 ip-192.0.2.14  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
 ip-192.0.2.15  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
     Stopped: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-novacompute-0 overcloud-novacompute-1 ]

....

We already know from the constraints that redis-master may never run
on any compute node, so it is a bit confusing to see all the compute nodes
listed there in Stopped state. On top of that, the number of compute nodes
can potentially scale to very large numbers (hundreds), which would make
the output almost unreadable.

Comment 2 Ken Gaillot 2016-01-19 16:41:12 UTC
The requested behavior is the default in upstream pacemaker since 1.1.13; crm_mon prints Stopped clone instances only if --inactive is specified. Is pcs using --inactive when calling crm_mon?
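
For illustration, the difference is just the inactive flag (--one-shot is the long form of -1, and --inactive is the long form of the -r option mentioned in the next comment):

   $ crm_mon --one-shot              # stopped clone instances are omitted
   $ crm_mon --one-shot --inactive   # Stopped entries are listed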

Comment 3 Tomas Jelinek 2016-01-20 09:59:21 UTC
Ken,

You are right, pcs calls crm_mon with the -r option specified. I tested crm_mon from pacemaker-1.1.13-10.el7.x86_64 and it works as requested in comment 0.

We will handle this in pcs then.

Thanks.

Comment 4 Ivan Devat 2016-02-02 10:05:12 UTC
Michele,

could you run "crm_mon --one-shot" on your cluster and take a look at its output? If you are satisfied with it, we can add an option to pcs to show resource status like this. Be aware, however, that in this output all stopped resources are omitted, not only those constrained to run on specific nodes.
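
To illustrate the point (the resource name "some-dummy" is made up): a disabled resource disappears from this output just like one that is constrained away from a node:

   $ pcs resource disable some-dummy
   $ crm_mon --one-shot | grep -c some-dummy             # 0 - hidden along with every other stopped resource
   $ crm_mon --one-shot --inactive | grep -c some-dummy  # 1 - still listed, as Stopped (disabled)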

Thanks

Comment 5 Michele Baldessari 2016-02-03 15:13:15 UTC
Hi Ivan,

I think it is definitely an improvement. What is needed is that stopped
resources are not printed when they are expected not to be running on a node.

So if there is a constraint preventing a resource from running on a node, pcs should
either not print Stopped for it at all, or it should somehow indicate that it is
stopped on that node because of constraint rules and not because it failed there.

We can split this BZ in two if you want:
1) We keep this one, to add a "crm_mon --one-shot" based option to pcs
2) We track via another BZ whether we can add some logic to differentiate resources stopped due to failure from resources stopped due to constraints

Does that make sense?

Comment 6 Ken Gaillot 2016-02-03 15:34:12 UTC
The output as proposed here would follow your first suggestion, i.e. stopped resources would not be printed at all.

I suspect the output would get too cluttered with the other suggestion, trying to show a reason for every stopped resource. I think that level of detail would be better suited to a GUI or HTML output, where the user could click on a resource for more detail, and/or a separate command-line option, for example "pcs resource why".

Comment 7 Michele Baldessari 2016-02-03 15:55:28 UTC
(In reply to Ken Gaillot from comment #6)
> The output as proposed here would follow your first suggestion, i.e. stopped
> resources would not be printed at all.
> 
> I suspect the output would get too cluttered with the other suggestion,
> trying to show a reason for every stopped resource. I think that level of
> detail would be better using a GUI or HTML, where the user could click on a
> resource for more detail; and/or a separate command-line option, for example
> "pcs resource why".

I see, that makes sense. How about we aim for something like:
"Show me stopped only in case of failure" and leave anything else out (e.g. stopped due to constraints)?

I think that would be the most useful output for a sysadmin managing a somewhat
complex cluster via the CLI.

Comment 8 Ken Gaillot 2016-02-03 22:14:06 UTC
Any failed actions are already listed in their own section. However, there is not necessarily a direct connection between a failed action and the resource being stopped; a resource might fail but then be recovered successfully, or it might be the stop action itself that failed.
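
For illustration only (the heading text may differ between pacemaker versions), that failed-action section can be inspected on its own, independently of whether stopped resources are displayed:

   $ crm_mon --one-shot | grep -A 10 'Failed Actions'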

I think the "crm_mon --one-shot" output should be sufficient.

Comment 9 Ivan Devat 2016-02-05 14:56:55 UTC
proposed fix:
https://github.com/feist/pcs/commit/c4e49ba368cf636337a276f814aac242987c4222

Setup:
[vm-rhel72-1 ~pcs/pcs] $ ./pcs resource create resource-dummy Dummy
[vm-rhel72-1 ~pcs/pcs] $ ./pcs constraint location resource-dummy avoids vm-rhel72-2=INFINITY
[vm-rhel72-1 ~pcs/pcs] $ ./pcs status | grep Stopped:
     Stopped: [ vm-rhel72-2 ]

Test:
[vm-rhel72-1 ~pcs/pcs] $ ./pcs status --hide-inactive | grep Stopped:
[vm-rhel72-1 ~pcs/pcs] $ ./pcs status --hide-inactive --full
Error: you cannot specify both --hide-inactive and --full

Cleanup:
[vm-rhel72-1 ~pcs/pcs] $ ./pcs resource delete resource-dummy
Attempting to stop: resource-dummy...Stopped

Comment 10 Mike McCune 2016-03-28 22:42:18 UTC
This bug was accidentally moved from POST to MODIFIED by an error in automation; please contact mmccune with any questions.

Comment 11 Ivan Devat 2016-05-31 12:07:19 UTC
Setup:
[vm-rhel72-1 ~] $ pcs resource create resource-dummy Dummy
[vm-rhel72-1 ~] $ pcs constraint location resource-dummy avoids vm-rhel72-3=INFINITY
[vm-rhel72-1 ~] $ pcs resource clone resource-dummy
[vm-rhel72-1 ~] $ pcs status | grep Stopped:
     Stopped: [ vm-rhel72-3 ]


Before fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.143-15.el7.x86_64

No way to display status without stopped resources.


After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.151-1.el7.x86_64

[vm-rhel72-1 ~] $ pcs status --hide-inactive | grep Stopped:
[vm-rhel72-1 ~] $ pcs status --hide-inactive --full
Error: you cannot specify both --hide-inactive and --full
[vm-rhel72-1 ~] $ pcs resource delete resource-dummy
Attempting to stop: resource-dummy...Stopped

Comment 14 Tomas Jelinek 2016-08-08 11:39:11 UTC
Created attachment 1188649 [details]
proposed fix 2

Test:

[root@rh72-node1:~]# pcs resource 
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 d1     (ocf::heartbeat:Dummy): Started rh72-node2
 d2     (ocf::heartbeat:Dummy): Stopped (disabled)
 d3     (ocf::heartbeat:Dummy): Started rh72-node2
 d4     (ocf::heartbeat:Dummy): Stopped (disabled)
 d5     (ocf::heartbeat:Dummy): Started rh72-node1
[root@rh72-node1:~]# pcs resource --hide-inactive
 dummy  (ocf::heartbeat:Dummy): Started rh72-node1
 d1     (ocf::heartbeat:Dummy): Started rh72-node2
 d3     (ocf::heartbeat:Dummy): Started rh72-node2
 d5     (ocf::heartbeat:Dummy): Started rh72-node1

Comment 15 Ivan Devat 2016-08-19 12:27:25 UTC
Setup:
[vm-rhel72-1 ~] $ pcs resource create d1 Dummy
[vm-rhel72-1 ~] $ pcs resource create d2 Dummy
[vm-rhel72-1 ~] $ pcs resource disable d1
[vm-rhel72-1 ~] $ pcs resource
 d1     (ocf::heartbeat:Dummy): Stopped (disabled)
 d2     (ocf::heartbeat:Dummy): Started vm-rhel72-3


Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-6.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource --hide-inactive
 d1     (ocf::heartbeat:Dummy): Stopped (disabled)
 d2     (ocf::heartbeat:Dummy): Started vm-rhel72-3


After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-7.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource --hide-inactive
 d2     (ocf::heartbeat:Dummy): Started vm-rhel72-3

Comment 21 errata-xmlrpc 2016-11-03 20:56:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html