Bug 1322387

Summary: Standby/unstandby of a node and standby of 2 nodes causes a random node to be fenced.
Product: Red Hat Enterprise Linux 7 Reporter: Jaison Raju <jraju>
Component: pacemaker Assignee: Andrew Beekhof <abeekhof>
Status: CLOSED DUPLICATE QA Contact: cluster-qe <cluster-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.1 CC: abeekhof, cfeist, cluster-maint, jraju
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-09 04:33:32 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1296673    

Description Jaison Raju 2016-03-30 12:14:10 UTC
Description of problem:
Standby/unstandby of a node and standby of 2 nodes causes a random node to be fenced.

Version-Release number of selected component (if applicable):
pacemaker 1.1.12-22.el7_1.1 - Red Hat x86_64
pacemaker-cli 1.1.12-22.el7_1.1 - Red Hat x86_64
pacemaker-cluster-libs 1.1.12-22.el7_1.1 - Red Hat x86_64
pacemaker-libs 1.1.12-22.el7_1.1 - Red Hat x86_64
resource-agents 3.9.5-40.el7_1.3 - Red Hat x86_64

How reproducible:
Always at customer end

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 6 Andrew Beekhof 2016-04-05 04:19:47 UTC
Immediate guess: one of the services is refusing to stop when required.
Will check the crm report to confirm.

Comment 9 Ken Gaillot 2016-04-11 17:49:46 UTC
I suspect this is a duplicate of BZ#1257414. Some of the constraints added as a result of that BZ are present in this cluster, but not all of them. For example, I don't see:

pcs constraint order start haproxy-clone then openstack-keystone-clone
pcs constraint order promote redis-master then start openstack-ceilometer-central-clone require-all=false

I'm reassigning this bz to Andrew Beekhof so he can review the deployment, as he is more familiar with this type of setup than I am.

If the configuration review doesn't solve all the issues, I do think an upgrade of the RHEL and OSP packages would be beneficial.

Comment 10 Andrew Beekhof 2016-04-20 01:10:39 UTC
Where are the attachments?  yank can't find any

[abeekhof@collab-shell ~]$ yank 01588802
* searching for attachments for ticket 01588802.
* the ticket 01588802 doesn't appear to have any attachments
* [searching] dropbox for case related attachments
* [renaming] filenames and checking for duplicates
* [erasing] empty directories

Comment 12 Andrew Beekhof 2016-05-09 02:21:20 UTC
Is there somewhere persistent we can put these?
I'm back from summit now and they've been wiped from collab :-(

Comment 14 Andrew Beekhof 2016-05-09 04:33:32 UTC
Basically neutron is not stopping in time:

Feb 25 19:07:06 ncerdlabdell400 pengine[3834]: warning: unpack_rsc_op_failure: Processing failed op stop for neutron-server:0 on pcmk-ncerdlabdell400: OCF_TIMEOUT (198)
Feb 25 19:07:06 ncerdlabdell400 pengine[3834]: warning: unpack_rsc_op_failure: Processing failed op stop for neutron-server:0 on pcmk-ncerdlabdell400: OCF_TIMEOUT (198)

This is leading to:

Feb 25 19:07:06 ncerdlabdell400 pengine[3834]: warning: pe_fence_node: Node pcmk-ncerdlabdell400 will be fenced because of resource failure(s)
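
For context: when fencing is enabled, Pacemaker's default response to a failed stop operation is to fence the node, which is why a stop timeout shows up as a fence. A minimal sketch for spotting this pattern in a log follows. It assumes the pengine message format matches the lines quoted above; the names summarize, FAILED_OP, FENCE and SAMPLE are hypothetical helpers for illustration and are not part of Pacemaker.

```python
import re

# Assumed format: the pengine "failed op" warning exactly as quoted in this comment.
FAILED_OP = re.compile(
    r"unpack_rsc_op_failure: Processing failed op (?P<op>\S+) "
    r"for (?P<rsc>\S+) on (?P<node>\S+): (?P<reason>.+?) \((?P<rc>\d+)\)"
)
# Assumed format: the pengine fencing decision exactly as quoted in this comment.
FENCE = re.compile(r"pe_fence_node: Node (?P<node>\S+) will be fenced")


def summarize(lines):
    """Return (failed_stops, fenced_nodes) found in pengine log lines."""
    failed_stops = set()
    fenced = set()
    for line in lines:
        m = FAILED_OP.search(line)
        if m and m.group("op") == "stop":
            failed_stops.add((m.group("rsc"), m.group("node"), m.group("reason")))
        m = FENCE.search(line)
        if m:
            fenced.add(m.group("node"))
    return failed_stops, fenced


# Sample input copied from the log excerpts in this comment.
SAMPLE = [
    "Feb 25 19:07:06 ncerdlabdell400 pengine[3834]: warning: "
    "unpack_rsc_op_failure: Processing failed op stop for neutron-server:0 "
    "on pcmk-ncerdlabdell400: OCF_TIMEOUT (198)",
    "Feb 25 19:07:06 ncerdlabdell400 pengine[3834]: warning: "
    "pe_fence_node: Node pcmk-ncerdlabdell400 will be fenced because of "
    "resource failure(s)",
]

if __name__ == "__main__":
    stops, fenced = summarize(SAMPLE)
    for rsc, node, reason in sorted(stops):
        flag = "fenced" if node in fenced else "not fenced"
        print(f"{rsc} failed to stop on {node} ({reason}); node {flag}")
```

Run against a full log, this lists every resource whose stop failed and whether the same node was later scheduled for fencing, which helps separate stop timeouts from other fencing causes.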


Basically this is a dup of Bug 1295835.
You can see https://bugzilla.redhat.com/show_bug.cgi?id=1290599#c20 for the work-around.

*** This bug has been marked as a duplicate of bug 1295835 ***