Bug 1188949 - [HA] start/stop ordering constraint are not correct and can cause cluster to fail on shutdown
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: distribution
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.0 (Juno)
Assignee: Fabio Massimo Di Nitto
QA Contact: Ami Jeain
URL:
Whiteboard:
Depends On:
Blocks: 1189921
Reported: 2015-02-04 06:06 UTC by Fabio Massimo Di Nitto
Modified: 2015-04-08 20:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1189921 (view as bug list)
Environment:
Last Closed: 2015-04-08 20:28:11 UTC
Target Upstream Version:
Embargoed:


Description Fabio Massimo Di Nitto 2015-02-04 06:06:24 UTC
There is currently an error in the way the pacemaker start/stop ordering constraints are expressed.

This can potentially lead to a cluster meltdown when issuing:

pcs cluster stop --all

because some services will fail to stop. A stopping service tries to contact another API to notify it of the shutdown, but the VIP for that API is already down at that stage.

Workaround:

pcs resource disable keystone
wait for keystone to be Stopped
pcs cluster stop --all

on start:

pcs cluster start --all
pcs resource enable keystone
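
The workaround can be sketched as a small shell helper. This is only an illustration under stated assumptions: the resource is literally named "keystone" as in this report, single resources are stopped and started with `pcs resource disable` / `pcs resource enable`, and "Stopped" is detected by polling `crm_resource --locate` until it no longer reports "is running on". The function names and the POLL_INTERVAL variable are hypothetical.

```shell
#!/bin/sh
# Sketch of the full-cluster shutdown/startup workaround.
# Assumption: the resource is named "keystone"; adjust to your cluster.

# Hypothetical helper: poll until crm_resource no longer reports the
# resource as running anywhere. Gives up after $2 tries (default 60).
wait_until_stopped() {
    resource=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        if ! crm_resource --resource "$resource" --locate 2>&1 \
                | grep -q 'is running on'; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1
}

# Stop keystone first, while the VIPs are still up, then stop everything.
stop_whole_cluster() {
    pcs resource disable keystone || return 1
    wait_until_stopped keystone || return 1
    pcs cluster stop --all
}

# Bring the cluster back, then re-enable keystone.
start_whole_cluster() {
    pcs cluster start --all || return 1
    pcs resource enable keystone
}
```

Sourcing the file and calling `stop_whole_cluster` (and later `start_whole_cluster`) runs the steps in the order given above; if keystone never reaches Stopped, the cluster stop is not attempted.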

This problem affects only the case where all controller nodes are shut down at once. It does NOT affect rebooting or shutting down one node at a time (for upgrade purposes, for example), hence the medium severity.

I am currently working on a new constraint set that should prevent this problem. In the meantime, this should be documented in the release notes for GA.

Comment 1 Fabio Massimo Di Nitto 2015-02-04 07:14:32 UTC
The reference architecture at https://docs.google.com/a/redhat.com/document/d/1iO41-wcY81xKn46UDkjZ-HGFR80ARXwEElRqd2HtDI8/edit# has been updated with the new constraint order in v0.5.

For expert users:

- In previous setups, the start order of the VIPs and lb-haproxy-clone was expressed as:

  pcs constraint order start lb-haproxy-clone then vip-...

- This needs to be reversed with:

  pcs constraint delete order-lb-haproxy-clone-vip-...-mandatory

  pcs constraint order start vip-... then lb-haproxy-clone

Replace "..." with the actual name of the vip- resource. This has to be done for every VIP.

This change can be applied on a live cluster (no need to stop any service to perform this change).
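
To see which ordering constraints currently involve lb-haproxy-clone, and the ids that `pcs constraint delete` expects, the existing constraints can be listed first. A minimal sketch, assuming a pcs version where `pcs constraint order --full` prints one constraint per line together with its id; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only the ordering constraints that mention
# lb-haproxy-clone, including their ids (shown because of --full).
list_haproxy_orders() {
    pcs constraint order --full | grep 'lb-haproxy-clone'
}
```

Each line printed this way names one constraint to review; no cluster state is changed by listing.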

Comment 3 Fabio Massimo Di Nitto 2015-02-13 19:31:14 UTC
>   pcs constraint order start vip-... then lb-haproxy-clone
> 
> Replace "..." with the current name of the vip- service and it has to be
> done for all vip.
> 
> This change can be applied on a live cluster (no need to stop any service to
> perform this change).

This turns out not to be the correct command. The correct command is:

pcs constraint order start vip-... then lb-haproxy-clone kind=Optional

(note the extra option)

This change is required to avoid triggering a chain of stop/start events across all services when a VIP needs to move from one node to another.
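
Since the change has to be repeated for every VIP, it can be sketched as a loop. This is an illustration only: the function name is hypothetical, the VIP resource names are passed in by the caller, and the old constraint id is assumed to follow the order-lb-haproxy-clone-vip-...-mandatory pattern quoted earlier in this bug.

```shell
#!/bin/sh
# Hypothetical helper: for each VIP resource name given as an argument,
# drop the old "haproxy then VIP" constraint and add the reversed,
# optional one ("VIP then haproxy").
reverse_vip_order() {
    for vip in "$@"; do
        # Assumed id pattern: order-lb-haproxy-clone-<vip>-mandatory
        pcs constraint delete "order-lb-haproxy-clone-${vip}-mandatory" || return 1
        pcs constraint order start "$vip" then lb-haproxy-clone kind=Optional || return 1
    done
}
```

It would be called with the full vip- resource names of the cluster as arguments. The loop stops at the first failing pcs call so that a mistyped name does not leave later VIPs half-converted.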

