Bug 799070

Summary: crm_resource reports incorrect data about resource location
Product: Red Hat Enterprise Linux 6
Reporter: Jaroslav Kortus <jkortus>
Component: pacemaker
Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: urgent
Version: 6.3
CC: cluster-maint, dvossel
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.7-3.el6
Doc Type: Bug Fix
Doc Text:
Cause: The logic for determining whether a resource was active was faulty.
Consequence: Resources that were active but on an unclean node were ignored by tools that relied on this logic.
Fix: The check was fixed so that all tools agree.
Clones: 816881 (view as bug list)
Bug Blocks: 816881
Last Closed: 2012-06-20 13:48:50 UTC
Attachments:
cibadmin -Q output

Description Jaroslav Kortus 2012-03-01 17:28:45 UTC
Description of problem:
crm_resource -W -r resource reports the last known node even when the partition is without quorum; other tools report the state differently.


Version-Release number of selected component (if applicable):
pacemaker-1.1.7-1.el6.x86_64


How reproducible:
always

Steps to Reproduce:
1. setup resource in pacemaker
2. fail enough nodes to lose quorum
3. run crm_resource -W -r resource
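
For reference, a minimal command-level sketch of these steps (node and resource names are placeholders; the actual configuration used is shown under Additional info):

# 1. configure a resource on a three-node cluster
crm configure primitive webserver ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval="30s"
# 2. power off two of the three nodes so the surviving partition loses quorum
# 3. from the surviving node, ask where the resource is running
crm_resource -W -r webserver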
  
Actual results:
The last node on which the resource was running is reported.

Expected results:
The service is reported as failed, stopped, or not running anywhere.

Additional info:
[root@node01:/]$ crm_mon -1
============
Last updated: Thu Mar  1 11:25:38 2012
Last change: Thu Mar  1 11:15:32 2012 via crm_shadow on node01
Stack: cman
Current DC: node01 - partition WITHOUT quorum
Version: 1.1.7-1.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, unknown expected votes
3 Resources configured.
============

Node node03: UNCLEAN (offline)
Online: [ node01 ]
OFFLINE: [ node02 ]

 virt-fencing   (stonith:fence_xvm):    Started node01

[root@node01:/]$ crm_resource -W -r webserver
resource webserver is running on: node03 

[root@node01:/]$ crm configure show
node node01
node node02
node node03
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.100.11" cidr_netmask="32" \
        op monitor interval="30s"
primitive virt-fencing stonith:fence_xvm \
        params pcmk_host_check="static-list" pcmk_host_list="node01,node02,node03" action="reboot" debug="1"
primitive webserver ocf:heartbeat:apache \
        params configfile="/etc/httpd/conf/httpd.conf" \
        op monitor interval="30s"
group group01 webserver ClusterIP
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-1.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="cman"

Comment 2 Andrew Beekhof 2012-03-02 08:46:08 UTC
Can you include the output of cibadmin -Q when the cluster is in this state?

Comment 3 Jaroslav Kortus 2012-03-02 13:28:35 UTC
Created attachment 567063
cibadmin -Q output

cibadmin -Q attached.

Comment 4 Andrew Beekhof 2012-03-05 03:30:37 UTC
Strangely crm_simulate shows the same as crm_resource:

# CIB_file=~/Downloads/cibadmin.xml tools/crm_simulate -L

Current cluster status:
Node node03: UNCLEAN (offline)
Online: [ node01 ]
OFFLINE: [ node02 ]

 virt-fencing	(stonith:fence_xvm):	Started node01
 Resource Group: group01
     webserver	(ocf::heartbeat:apache):	Started node03
     ClusterIP	(ocf::heartbeat:IPaddr2):	Started node03

Comment 5 Andrew Beekhof 2012-03-05 03:33:42 UTC
Hmm, not so strange: it appears to be telling the truth, and it is crm_mon that lies:

The last op for webserver on node03 was a successful start action:

            <lrm_rsc_op id="webserver_last_0" operation_key="webserver_start_0" operation="start" crm-debug-origin="do_update_resource" crm_feature_set="3.0.6" transition-key="13:441:0:22406839-5bf5-4990-9755-00c39457b51e" transition-magic="0:0;13:441:0:22406839-5bf5-4990-9755-00c39457b51e" call-id="5" rc-code="0" op-status="0" interval="0" last-run="1330694649" last-rc-change="1330694649" exec-time="90" queue-time="0" op-digest="88eb8382443cc988d0e6ddee48ebac1a"/>
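
For anyone retracing this from the attachment: one way to pull that entry out of the saved CIB dump is an XPath query. This assumes xmllint from libxml2 is available and the dump was saved as cibadmin.xml:

# show the last recorded operation for webserver
xmllint --xpath '//lrm_rsc_op[@id="webserver_last_0"]' cibadmin.xml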

Comment 6 Andrew Beekhof 2012-03-13 12:04:20 UTC
A related patch has been committed upstream:
  https://github.com/beekhof/pacemaker/commit/31f6ca3

with subject:

   Medium: PE: Bug rhbz#799070 - Report resources as active in crm_mon if they are located on an unclean node

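With the fix applied, the same saved CIB can be replayed offline to check that the tools now agree; most pacemaker command-line tools honour the CIB_file environment variable, as already used in comment 4 (the file name here is a placeholder):

CIB_file=cibadmin.xml crm_mon -1
CIB_file=cibadmin.xml crm_resource -W -r webserver
CIB_file=cibadmin.xml crm_simulate -L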

Comment 11 Jaroslav Kortus 2012-04-05 15:24:48 UTC
Now it behaves consistently:
]$ crm_mon -1
============
Last updated: Thu Apr  5 10:22:13 2012
Last change: Thu Apr  5 10:06:46 2012 via crmd on m3c1-node01
Stack: cman
Current DC: m3c1-node01 - partition WITHOUT quorum
Version: 1.1.7-5.el6-148fccfd5985c5590cc601123c6c16e966b85d14
3 Nodes configured, unknown expected votes
2 Resources configured.
============

Node m3c1-node03: UNCLEAN (offline)
Node m3c1-node02: UNCLEAN (offline)
Online: [ m3c1-node01 ]

 virt-fencing   (stonith:fence_xvm):    Started m3c1-node01
 webserver      (ocf::heartbeat:apache):        Started m3c1-node02


However, there is still a difference compared to cman: when rgmanager loses quorum, no services are reported as running (they are not reported at all).

I would like to see something similar here as well. Pacemaker should report a service as running only when the cluster has quorum and the service is working properly. In all other states it should not report Started but some other state (failed, pending, unclean, ...).

Would this be possible?

pacemaker-1.1.7-5.el6.x86_64

Comment 12 Andrew Beekhof 2012-04-10 09:46:09 UTC
(In reply to comment #11)

> However, it is still a difference compared to cman. When rgmanager looses
> quorum no services are reported as running (they are not reported at all).

Pacemaker is not rgmanager.
Pacemaker reports resources as running if they are running, not based on whether quorum is true or false. Resources may or may not be allowed to run when quorum is lost; that's up to the admin.
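
Whether resources may run without quorum is governed by the no-quorum-policy cluster property; a sketch using the crm shell syntax already shown above ("stop" is the default):

crm configure property no-quorum-policy="stop"

Here "stop" stops all resources in the quorum-less partition, "ignore" continues managing them as if quorum were intact, "freeze" keeps already-active resources but recovers nothing, and "suicide" fences every node in the partition.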

Comment 16 Andrew Beekhof 2012-05-08 11:42:34 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: The logic for determining whether a resource was active was faulty.
Consequence: Resources that were active but on an unclean node were ignored by tools that relied on this logic.
Fix: The check was fixed so that all tools agree.

Comment 18 errata-xmlrpc 2012-06-20 13:48:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0846.html