Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1188949

Summary: [HA] start/stop ordering constraints are not correct and can cause cluster to fail on shutdown
Product: Red Hat OpenStack
Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: distribution
Assignee: Fabio Massimo Di Nitto <fdinitto>
Status: CLOSED CURRENTRELEASE
QA Contact: Ami Jeain <ajeain>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.0 (Juno)
CC: markmc, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1189921 (view as bug list)
Environment:
Last Closed: 2015-04-08 20:28:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1189921

Description Fabio Massimo Di Nitto 2015-02-04 06:06:24 UTC
There is currently an error in the way the pacemaker start/stop ordering constraints are expressed.

This can potentially lead to a cluster meltdown when issuing:

pcs cluster stop --all

because some services will fail to stop: on shutdown, a service tries to contact another API to notify it of the shutdown, but the VIP for that API is already down at that stage.

Workaround:

pcs cluster disable keystone
wait for keystone to be Stopped
pcs cluster stop --all

on start:

pcs cluster start --all
pcs cluster enable keystone
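
To verify that keystone has actually reached the Stopped state before running "pcs cluster stop --all", something like the following should work (a rough check only; the exact output format depends on the pcs version):

pcs status resources | grep keystone

The resource should be reported as Stopped on every node before proceeding.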

This only affects shutting down all controller nodes at once. It does NOT affect rebooting or shutting down one node at a time (for upgrade purposes, for example), hence the medium severity.

I am currently working on a new constraint set that should prevent this problem; in the meantime, this should be documented in the release notes for GA.

Comment 1 Fabio Massimo Di Nitto 2015-02-04 07:14:32 UTC
Reference arch: https://docs.google.com/a/redhat.com/document/d/1iO41-wcY81xKn46UDkjZ-HGFR80ARXwEElRqd2HtDI8/edit# has now been updated with the new constraint order in v0.5.

For expert users:

- in previous setups the VIPs and lb-haproxy-clone start order was expressed as:

  pcs constraint order start lb-haproxy-clone then vip-...

- this needs to be reversed by:

  pcs constraint delete order-lb-haproxy-clone-vip-...-mandatory

  pcs constraint order start vip-... then lb-haproxy-clone

Replace "..." with the actual name of the vip- resource; this has to be done for every VIP.

This change can be applied on a live cluster (no need to stop any service to perform this change).
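
For reference, the IDs of the existing order constraints (needed for the delete command above) can be listed before making the change; assuming a standard pcs installation, something like:

  pcs constraint list --full

should show the VIP ordering constraints with IDs of the form order-lb-haproxy-clone-vip-...-mandatory.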

Comment 3 Fabio Massimo Di Nitto 2015-02-13 19:31:14 UTC
>   pcs constraint order start vip-... then lb-haproxy-clone
> 
> Replace "..." with the current name of the vip- service and it has to be
> done for all vip.
> 
> This change can be applied on a live cluster (no need to stop any service to
> perform this change).

This turns out not to be the correct command. The correct command is:

pcs constraint order start vip-... then lb-haproxy-clone kind=Optional

(note the extra option)

This change is required to avoid a problematic chain of events when starting or stopping all services and a VIP needs to move from one node to another.
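
As an illustration only (vip-keystone is a made-up resource name here; substitute the actual vip- resource names from the deployment), the full replacement for one VIP would therefore be:

pcs constraint delete order-lb-haproxy-clone-vip-keystone-mandatory

pcs constraint order start vip-keystone then lb-haproxy-clone kind=Optional

The same pair of commands has to be repeated for every VIP resource.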