Bug 1480311

Summary:	Deleting a guestnode resource without a node removal may lead to node fencing
Product:	Red Hat Enterprise Linux 7	Reporter:	Radek Steiger <rsteiger>
Component:	pcs	Assignee:	Tomas Jelinek <tojeline>
Status:	CLOSED WONTFIX	QA Contact:	cluster-qe <cluster-qe>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	cfeist, cluster-maint, idevat, omular, tojeline
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-11-13 13:02:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Radek Steiger 2017-08-10 16:24:18 UTC

> Description of problem:

In the upstream version of pcs we have a safety check that prevents deleting a guest node resource unless the node has been removed from the node list first. In RHEL, however, this check has been downgraded to a warning:

"Warning: This command is not sufficient for removing remote and guest nodes. To complete the removal, remove pacemaker authkey and stop and disable pacemaker_remote on the node(s) manually."

The problem is that in rare cases this can lead to a stonith action, so we might want to make it an error instead and require the --force flag, so that the user takes full responsibility for the action.

It is possible, though, that a proper fix would be more suitable on the pacemaker side.


> Version-Release number of selected component (if applicable):

pcs-0.9.158-6.el7.x86_64
pacemaker-1.1.16-12.el7.x86_64


> How reproducible:

Rarely.


> Steps to Reproduce:

Run the loop below on one of the cluster nodes, where GUEST is the hostname of the guest node that is repeatedly added and removed. Also make sure fencing is configured properly for all nodes, including the guest node.

The qarsh command is optional; it only serves to break out of the loop when the guest node dies. Also, a Dummy resource isn't the right resource for a guest node, but it should be enough to get the idea across.

i=0; while [ $? -eq 0 ]; do 
  let i+=1;  echo PASS $i;
  /usr/bin/qarsh -l root -t 5 GUEST "uptime" || break;
  pcs resource create GuestResource ocf:heartbeat:Dummy  --disabled;
  pcs cluster node add-guest GUEST GuestResource;
  pcs resource delete GuestResource;
done

Note: Adding sleeps between actions makes absolutely no difference.


> Actual results:

It may take a few dozen attempts to reproduce, but the guest node eventually gets fenced.


> Expected results:

Get an error when trying to delete a guest node resource whose node hasn't first been removed by running 'pcs cluster node remove-guest'. Or, ideally, fix the real race condition that leads to fencing (which is probably not on the pcs side).

Comment 1 Tomas Jelinek 2018-11-13 13:02:51 UTC

We cannot make this a forcible error because that would break OpenStack; see bz1459503. The correct way to remove a guest node is to use the 'pcs cluster node remove-guest' command, which should be working correctly.
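
For illustration only, the recommended order of operations could be sketched roughly as follows, using the GUEST and GuestResource names from the reproducer. The helper name remove_guest_safely and the PCS override variable are hypothetical and exist only so the sequence can be dry-run; they are not part of pcs. The sketch assumes the resource is still present after the guest node has been removed and can then be deleted separately.

```shell
#!/bin/bash
# Hypothetical helper (not part of pcs): remove the guest node first,
# and delete its resource only if that removal succeeded.
# PCS can be overridden for a dry run, e.g. PCS=echo.
PCS="${PCS:-pcs}"

remove_guest_safely() {
  # $1 = guest node hostname, $2 = resource id backing the guest node
  "$PCS" cluster node remove-guest "$1" || return 1
  "$PCS" resource delete "$2"
}

# Example call on a cluster node:
#   remove_guest_safely GUEST GuestResource
```

Stopping on a failed 'remove-guest' keeps the resource in place, so the deletion never runs while the node is still a cluster member, which is the situation the reproducer exercises.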