Bug 1290121 - [Bug] [HA] Remove keystone constraints and add the openstack-core dummy resource in its place
Summary: [Bug] [HA] Remove keystone constraints and add the openstack-core dummy resource in its place
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 9.0 (Mitaka)
Assignee: Michele Baldessari
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On: 1356997
Blocks:
 
Reported: 2015-12-09 17:12 UTC by Raoul Scarazzini
Modified: 2019-10-10 10:42 UTC (History)
20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Almost all of the Overcloud Pacemaker resources depended on the Keystone resource. This meant that restarting the Keystone resource after a configuration change would restart all dependent resources, which was disruptive. This fix introduces a dummy openstack-core resource that the Overcloud Pacemaker resources (including keystone) use as a dependency instead. As a result, restarting the Keystone resource no longer causes any disruption to other services.
Clone Of:
Environment:
Last Closed: 2016-08-11 11:29:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1537885 0 None None None 2016-01-25 19:53:03 UTC
OpenStack gerrit 286446 0 None None None 2016-03-01 08:50:35 UTC
Red Hat Product Errata RHEA-2016:1599 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 9 director Release Candidate Advisory 2016-08-11 15:25:37 UTC

Internal Links: 1317507

Description Raoul Scarazzini 2015-12-09 17:12:28 UTC
As discussed on rhos-dev mailing list in this thread:

http://post-office.corp.redhat.com/archives/rh-openstack-dev/2015-December/msg00010.html

there's no longer any need to have each service depend on keystone. This means we can safely remove that dependency and replace it by adding a new cloned resource named openstack-core to take its place.
In terms of commands, this can be done as described here:

http://post-office.corp.redhat.com/archives/rh-openstack-dev/2015-December/msg00119.html

However, the steps described in that message only apply if you already have an OSP-d environment deployed (i.e. with Keystone as the central dependency). The sequence for a from-scratch creation will be something like this:

1) Core services creation (Galera, Redis, Mongo, RabbitMQ, Memcached, Haproxy);
2) Dummy service creation (openstack-core-clone), dependent on the core services;
3) Other services creation (all the others, including keystone), dependent on openstack-core-clone;

This should be easy to do by replacing keystone with openstack-core everywhere and then adding keystone separately.
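
In pcs terms, steps 2) and 3) might look roughly like the following sketch (illustrative only; the resource names follow this report, and the exact agents and options used by the templates may differ):

  # 2) Create the cloned dummy resource and order it after the core services
  pcs resource create openstack-core ocf:heartbeat:Dummy --clone interleave=true
  pcs constraint order promote galera-master then start openstack-core-clone
  pcs constraint order start rabbitmq-clone then start openstack-core-clone
  pcs constraint order start haproxy-clone then start openstack-core-clone
  # 3) Hang keystone and the other services off the dummy instead of off keystone
  pcs constraint order start openstack-core-clone then start openstack-keystone-clone
  pcs constraint order start openstack-core-clone then start openstack-glance-registry-clone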

Don't hesitate to ask for additional information; we will be happy to support you as best we can.

Comment 2 Jaromir Coufal 2015-12-10 14:03:26 UTC
What are the impacts of this change here?

Comment 3 Fabio Massimo Di Nitto 2015-12-10 14:11:49 UTC
(In reply to Jaromir Coufal from comment #2)
> What are the impacts of this change here?

It's all in the email thread, but in summary it allows restart of keystone without bouncing all of OpenStack.

Comment 7 Jaromir Coufal 2015-12-14 13:48:19 UTC
Based on discussion with Hugh, I would consider this one part of the OSP8 upgrade strategy. It sounds like an effort that would not just improve UX but also help us eliminate conflicts and bugs for feature development. This approach needs to be POC'ed and tested before we decide to go down this path, to make sure that it does not affect OSP8 delivery. I would call it best effort for OSP8, but we won't block the release on it.

Comment 8 Michele Baldessari 2016-01-13 09:26:31 UTC
Just a note so we do not forget: this change has implications for the upgrade experience, as today we disable keystone to shut down all services during an upgrade.

Comment 9 Hugh Brock 2016-01-15 06:59:26 UTC
Would like to get this in at least for new deployments with 8.0.

Comment 10 Mike Burns 2016-01-19 20:23:26 UTC
I have some concerns about this being a blocker.  This isn't currently part of the upgrade strategy at all.  It's actually somewhat irrelevant since the entire control plane is down during upgrades.

I think the right way to handle constraints in general is to have a list of constraints maintained somewhere.  Then we have some process or tool that reads the constraints that we want and generates the right commands to create them based on what is already deployed.  This command would also drop constraints that are no longer needed.  For the case where someone has added a service to pacemaker that isn't part of our OOTB architecture (partner extensions), we could have a location to drop files for additional constraints.  This particular tooling is going to be beyond our delivery ability for OSP 8 however, so we're limited to a hard coded set of commands for the upgrade to make the constraints match.  This generic tooling really should run through an upstream tripleo spec.
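
Purely as a hypothetical sketch of that tooling idea (the file name, format, and loop below are invented for illustration and are not an existing tool), the desired constraints could be kept in a flat list that a small script turns into pcs calls:

  # constraints.list (hypothetical): one "<first> <then>" pair per line, e.g.
  #   openstack-core-clone openstack-keystone-clone
  #   openstack-core-clone openstack-glance-registry-clone
  while read -r first second; do
      pcs constraint order start "$first" then start "$second" || true
  done < constraints.list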

The fact that we're limited to hardcoded commands for this upgrade means that we have to decide essentially *now* whether this is in or out for 8.  Changing the decision later means more work and testing for upgrades.  

I don't think we can organize this so that it's just on new deployments and not applied on upgrades.  We'd need to handle both.

As Michele mentioned in comment 8, updates rely on the fact that disabling keystone disables everything else, so that needs to be changed.

As for the consequence of *not* fixing this, assuming I read the thread correctly, the issues laid out in the email thread were about services going down *on the same node* as the failing keystone service, not cluster-wide. If you take keystone down cluster-wide, everything would come down.

If we do fix this, other services will continue to run, but any that actually need keystone will start throwing errors when accessed.

Comment 11 Jiri Stransky 2016-01-20 09:30:13 UTC
I agree with Mike. I think the only added value of this currently is that an operator can restart keystone manually without bringing the whole of OpenStack down, but there's no practical effect on updates or upgrades, unless we implement smart handling of service restarts in Puppet too.

The main issue we have is the lack of proper wiring between Puppet and Pacemaker to do selective service restarts based on whether the config was changed or not. Without this we need to restart all services unconditionally. (We cannot let Puppet restart pacemaker services the usual way on config change, as Puppet doesn't have information about when the *other* nodes in a cluster have applied their config file value changes.) I already have an idea how to make that happen, but it's not completely trivial, so it's perhaps more of a Mitaka thing, and would justify a blueprint upstream. (Provided I get to carve out enough time for it.)

Comment 12 Fabio Massimo Di Nitto 2016-01-20 09:43:48 UTC
Actually, if you only look at the current situation you are both right, but that said, you are forgetting another important problem, which is moving keystone under apache (that just came up in the list of things that need to be done).

If you keep keystone where it is now, you will have a bootstrap loop with horizon that will be more problematic than having keystone isolated on its own.

Also, Mike, the point is that the keystone team made it clear that there are some configurations of keystone that allow keystone to be down and auth to still work (details were in the email thread).

I personally have no strong opinions either way, but:

1) changing the keystone dependencies is reasonably safe
2) it improves the operational experience (not just updates/upgrades)
3) it will make it easier to move keystone under apache (which might become a requirement for 8.0 or 8.x)

Cons:

1) we need to recheck updates and upgrades to change "pcs resource disable" from keystone to the new dummy entry (sketched below)

So all in all there are more pros than cons. The final word goes to the people asking for this change.
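
As a rough illustration of the con above (a sketch based on this discussion, not a verified upgrade procedure), the updates/upgrades tooling would switch the resource it disables:

  # Today: disabling keystone stops everything ordered after it
  pcs resource disable openstack-keystone-clone
  # With the dummy resource: openstack-core becomes the switch for the stack
  pcs resource disable openstack-core-clone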

Comment 13 Nathan Kinder 2016-01-20 19:01:44 UTC
(In reply to Fabio Massimo Di Nitto from comment #12)
> Actually, if you only look at the current situation you are both right, but
> that said, you are forgetting another important problem, which is moving
> keystone under apache (that just came up in the list of things that need to
> be done).

One longer term thing we should consider is that all WSGI services running within Apache httpd will all start and stop together.  If we move more services to run in httpd in the future, we'll be controlling the run state of services all together unless we have separate httpd processes for each service.

> Also, Mike, the point is that the keystone team made it clear that there are
> some configurations of keystone that allow keystone to be down and auth to
> still work (details were in the email thread).

While it is true that other services can validate PKI tokens even when Keystone is down, it's important to note that the PKI token format is discouraged nowadays.  We default to UUID tokens with a Director deployment, and the Fernet token format is seen as the future recommended format.  Both of these formats require Keystone to be running for the purposes of token validation.  We should just assume that Keystone needs to be running somewhere for the other services to service any API requests.
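
For context, the format in question is the [token]/provider option in keystone.conf; one way to check it on a controller (the crudini call and path below are illustrative, not part of this bug) is:

  # Illustrative check of the configured token provider (assumes crudini is installed)
  crudini --get /etc/keystone/keystone.conf token provider
  # 'uuid' (the Director default) and 'fernet' both need a running Keystone to
  # validate tokens; only the deprecated 'pki' format can be validated offline.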

Comment 14 Fabio Massimo Di Nitto 2016-01-21 04:31:05 UTC
(In reply to Nathan Kinder from comment #13)
> (In reply to Fabio Massimo Di Nitto from comment #12)
> > Actually, if you only look at the current situation you are both right, but
> > that said, you are forgetting another important problem, which is moving
> > keystone under apache (that just came up in the list of things that need
> > to be done).
> 
> One longer term thing we should consider is that all WSGI services running
> within Apache httpd will all start and stop together.  If we move more
> services to run in httpd in the future, we'll be controlling the run state
> of services all together unless we have separate httpd processes for each
> service.

This is something we were wondering about in the HA team just on the last call. How are those services designed to run? Under one httpd instance? Multiple instances of httpd?

Considering that we are putting a great deal of effort into being able to restart each service standalone to minimize operational downtime, we should probably aim to have them all in separate httpd processes, I guess. Though this might require some extra work generating the different httpd configs, and that becomes more of a deployment problem.

Comment 18 Felipe Alfaro Solana 2016-03-14 08:21:18 UTC
I don't quite understand why there are Pacemaker dependencies at all. Don't all OpenStack services communicate either via the API (which is HAproxy load-balanced) or RabbitMQ? Why does, for example, Glance have a dependency on Keystone?

# pcs constraint | grep keystone
  start memcached-clone then start openstack-keystone-clone (kind:Mandatory)
  start rabbitmq-clone then start openstack-keystone-clone (kind:Mandatory)
  start openstack-keystone-clone then start openstack-ceilometer-central-clone (kind:Mandatory)
  start openstack-keystone-clone then start openstack-glance-registry-clone (kind:Mandatory)
  start openstack-keystone-clone then start openstack-cinder-api-clone (kind:Mandatory)
  start openstack-keystone-clone then start neutron-server-clone (kind:Mandatory)
  start openstack-keystone-clone then start openstack-nova-consoleauth-clone (kind:Mandatory)
  promote galera-master then start openstack-keystone-clone (kind:Mandatory)
  start haproxy-clone then start openstack-keystone-clone (kind:Mandatory)
  start openstack-keystone-clone then start openstack-heat-api-clone (kind:Mandatory)

I don't get it.

Comment 19 Raoul Scarazzini 2016-03-14 08:31:45 UTC
(In reply to Felipe Alfaro Solana from comment #18)
> I don't quite understand why there are Pacemaker dependencies at all. Don't
> all OpenStack services communicate either via the API (which is HAproxy
> load-balanced) or RabbitMQ? Why does, for example, Glance have a dependency
> on Keystone?

Because in the past Glance (and a lot of other services) was unable to start without Keystone. And note that with the constraints you mentioned, it's enough to have just one keystone instance running for the constraint to be satisfied.
Maybe this will change in the future, but as long as a service is unable to start properly without a core service already running, the constraint will be necessary.

> # pcs constraint | grep keystone
>   start memcached-clone then start openstack-keystone-clone (kind:Mandatory)
>   start rabbitmq-clone then start openstack-keystone-clone (kind:Mandatory)
>   start openstack-keystone-clone then start
> openstack-ceilometer-central-clone (kind:Mandatory)
>   start openstack-keystone-clone then start openstack-glance-registry-clone
> (kind:Mandatory)
>   start openstack-keystone-clone then start openstack-cinder-api-clone
> (kind:Mandatory)
>   start openstack-keystone-clone then start neutron-server-clone
> (kind:Mandatory)
>   start openstack-keystone-clone then start openstack-nova-consoleauth-clone
> (kind:Mandatory)
>   promote galera-master then start openstack-keystone-clone (kind:Mandatory)
>   start haproxy-clone then start openstack-keystone-clone (kind:Mandatory)
>   start openstack-keystone-clone then start openstack-heat-api-clone
> (kind:Mandatory)
> 
> I don't get it.

Comment 20 Sadique Puthen 2016-03-14 09:00:28 UTC
Raoul,

Can you reconfirm this? Say there are 3 nodes in the cluster: node1, node2 and node3.

Are you saying that keystone on node3 will start if rabbitmq has failed on node3 and node2 and is running only on node1?

Also, if keystone is running only on node1 and has failed on node2 and node3, is it true that glance will start on all 3 nodes?

Comment 21 Raoul Scarazzini 2016-03-14 11:13:56 UTC
To clarify, what does the trick here is the interleave parameter inside the clone declaration. With interleave set to true, once the local instance of a resource is started, the dependent ones (on the same node) can start too.

But we're talking about the *same* node. In the specific case you're mentioning, we've got the rabbitmq clone, which has two properties: interleave=true and ordered=true. The first one does what I explained above; the second one starts the clone instances in series instead of in parallel. So if we've got rabbitmq started on node1, keystone can start on that node too, and then all the other services. But, answering your questions:

1) No, keystone on node3 will NOT start if rabbitmq has failed on node3;

2) Glance will start only on node1.

I hope this makes things clear.
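
For reference, the clone options mentioned above would typically appear in the clone definitions along these lines (an illustrative sketch; the agents and operations in a real deployment may differ):

  pcs resource create rabbitmq systemd:rabbitmq-server \
      --clone ordered=true interleave=true
  pcs resource create openstack-keystone systemd:openstack-keystone \
      --clone interleave=true
  pcs constraint order start rabbitmq-clone then start openstack-keystone-clone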

Comment 25 Felipe Alfaro Solana 2016-06-17 15:25:36 UTC
Sorry to get into this conversation too late. I actually filed a support case for RHOSP7 some time ago (support case 1594170) and proposed removing Keystone's dependencies on RabbitMQ and MySQL. May I ask what's the point of having a dummy PCS resource? Why not simply remove the constraints altogether?

Comment 26 Raoul Scarazzini 2016-06-17 15:36:39 UTC
(In reply to Felipe Alfaro Solana from comment #25)
> Sorry to get into this conversation too late. I actually filed a support
> case for RHOSP7 some time ago (support case 1594170) and proposed removing
> Keystone's dependencies on RabbitMQ and MySQL. May I ask what's the point of
> having a dummy PCS resource? Why not simply remove the constraints altogether?

Having this bug solved was the first step; the next one will be the one you're thinking about. There are a lot of things involved here. Given the high impact of each change, everything must be done gradually and carefully.

Comment 28 errata-xmlrpc 2016-08-11 11:29:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1599.html

