Bug 1390632 - nova-compute service fails to start if galera is not available
Summary: nova-compute service fails to start if galera is not available
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Assignee: Eoghan Glynn
QA Contact: Prasanth Anbalagan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-01 14:52 UTC by Marian Krcmarik
Modified: 2020-12-21 19:34 UTC (History)
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-01-26 05:01:38 UTC
Target Upstream Version:


Attachments (Terms of Use)
pcs status xml - correctly working cluster with Instance HA (46.28 KB, application/xml)
2016-11-01 14:52 UTC, Marian Krcmarik
pcmk logs (760 bytes, application/octet-stream)
2016-11-02 08:27 UTC, Marian Krcmarik
AVC messages (8.40 KB, text/plain)
2016-11-02 08:27 UTC, Marian Krcmarik

Description Marian Krcmarik 2016-11-01 14:52:47 UTC
Created attachment 1216138 [details]
pcs status xml - correctly working cluster with Instance HA

Description of problem:
The nova-compute OpenStack service fails with SystemExit(1) when it is started by pacemaker within an OpenStack Instance HA environment. We have an OpenStack compute node where the nova-compute service is running and managed by pacemaker; to get Instance HA configured, the following steps are performed:
- systemctl stop openstack-nova-compute
- systemctl disable openstack-nova-compute
- Install and setup pacemaker remote on the node
- Create pacemaker resource where pacemaker manages nova-compute with using systemd on pacemaker remote nodes: "pcs resource create nova-compute systemd:openstack-nova-compute --clone interleave=true --disabled --force"
- Enable the resource: "pcs resource enable nova-compute"

The result is that the start of the resource fails:
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-10-24 20:44:27 UTC; 54min ago
 Main PID: 14952 (code=exited, status=0/SUCCESS)

Oct 24 20:44:22 compute-1.localdomain nova-compute[14952]: Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored i...e future.
Oct 24 20:44:27 compute-1.localdomain systemd[1]: Started Cluster Controlled openstack-nova-compute.
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: Traceback (most recent call last):
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: File "/usr/lib/python2.7/site-packages/eventlet/queue.py", line 118, in switch
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: self.greenlet.switch(value)
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: result = function(*args, **kwargs)
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 711, in run_service
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: raise SystemExit(1)
Oct 24 20:44:27 compute-1.localdomain nova-compute[14952]: SystemExit: 1

If I start the service directly using systemd (systemctl start ...), the service starts correctly. I am not able to get the service recovered by doing "pcs resource cleanup".

I did not observe this doing the same steps on RHOS9 with RHEL7.2

Version-Release number of selected component (if applicable):
pacemaker-1.1.15-11.el7.x86_64
pacemaker-libs-1.1.15-11.el7.x86_64
pacemaker-remote-1.1.15-11.el7.x86_64
puppet-pacemaker-0.3.0-0.20161007163831.4ea11b8.el7ost.noarch
pacemaker-cli-1.1.15-11.el7.x86_64
pacemaker-cluster-libs-1.1.15-11.el7.x86_64

How reproducible:
Very Often

Attaching the cluster output; the nova-compute service was brought up by starting it through systemctl and then doing "pcs resource cleanup".

Comment 1 Ken Gaillot 2016-11-01 22:09:22 UTC
Can you attach the result of the following command?

crm_report --from "2016-10-24 20:30:00" --to "2016-10-24 21:00:00"

Comment 2 Andrew Beekhof 2016-11-02 00:57:04 UTC
One of the reasons we split nova-compute and nova-compute-wait is that nova would bork when we started it.  

I'm struggling to imagine how the cluster could possibly cause this as the extent of our involvement is to signal systemd via dbus that we'd like the service to start.

Unless perhaps it's something in the override file?

I would start by grabbing a copy of that file when pacemaker attempts to start the service (it should be /run/systemd/system/openstack-nova-compute.service.d/50-pacemaker.conf ).

Next step would be to put it there before trying to start it manually and if that fails, remove lines until it works.

Comment 3 Andrew Beekhof 2016-11-02 00:58:34 UTC
Expected contents of that file:

            char *override = crm_strdup_printf(
                "[Unit]\n"
                "Description=Cluster Controlled %s\n"
                "Before=pacemaker.service\n"
                "\n"
                "[Service]\n"
                "Restart=no\n",
                op->agent);

Still can't imagine how that could cause the service not to start.
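For reference, the drop-in that template produces can be rendered with a quick Python sketch (this is just string formatting to show the expected file contents for this bug's unit name, not pacemaker's actual code path):

```python
# The template from the C snippet above, with %s standing in for op->agent.
TEMPLATE = (
    "[Unit]\n"
    "Description=Cluster Controlled %s\n"
    "Before=pacemaker.service\n"
    "\n"
    "[Service]\n"
    "Restart=no\n"
)

def render_override(agent):
    """Render the expected 50-pacemaker.conf contents for a given agent."""
    return TEMPLATE % agent

print(render_override("openstack-nova-compute.service"))
```

The rendered text is what should appear in /run/systemd/system/openstack-nova-compute.service.d/50-pacemaker.conf while pacemaker manages the unit.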

Comment 4 Fabio Massimo Di Nitto 2016-11-02 05:01:18 UTC
Marian,

can you check for any selinux/avc entries?

Comment 5 Andrew Beekhof 2016-11-02 06:02:14 UTC
(In reply to Fabio Massimo Di Nitto from comment #4)
> Marian,
> 
> can you check for any selinux/avc entries?

In theory even that shouldn't be a factor as it's still the systemd process that is spinning up these daemons.

A good idea to check though

Comment 6 Marian Krcmarik 2016-11-02 08:27:00 UTC
Created attachment 1216388 [details]
pcmk logs

The logs were collected by the following command:
crm_report --from "2016-11-02 00:15:00" --to "2016-11-02 00:45:00"

related to this failure of nova-compute:
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2016-11-02 00:28:36 UTC; 7h ago
 Main PID: 22911 (code=exited, status=0/SUCCESS)

Nov 02 00:28:34 compute-0.localdomain nova-compute[22911]: Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  It...uture.
Nov 02 00:28:35 compute-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-compute.
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: Traceback (most recent call last):
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: File "/usr/lib/python2.7/site-packages/eventlet/queue.py", line 118, in switch
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: self.greenlet.switch(value)
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 21...n main
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: result = function(*args, **kwargs)
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 71...ervice
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: raise SystemExit(1)
Nov 02 00:28:35 compute-0.localdomain nova-compute[22911]: SystemExit: 1

Comment 7 Marian Krcmarik 2016-11-02 08:27:55 UTC
Created attachment 1216389 [details]
AVC messages

AVC messages from audit.log - all seem to be related to the dhclient, nothing else.

Comment 8 Andrew Beekhof 2016-11-02 10:53:46 UTC
Crazy idea... can you remove/fix the rpc_backend deprecation warning and re-test?
There is precedent that this might help.

Comment 9 Marian Krcmarik 2016-11-02 14:56:46 UTC
(In reply to Andrew Beekhof from comment #8)
> Crazy idea... can you remove/fix the rpc_backend deprication warning and
> re-test.
> There is precedent that this might help.

That did not help unfortunately:
● openstack-nova-compute.service - OpenStack Nova Compute Server
   Loaded: loaded (/usr/lib/systemd/system/openstack-nova-compute.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2016-11-02 14:24:16 UTC; 21min ago
 Main PID: 15309 (code=exited, status=0/SUCCESS)

Nov 02 14:24:10 compute-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-compute...
Nov 02 14:24:15 compute-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-compute.
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: Traceback (most recent call last):
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: File "/usr/lib/python2.7/site-packages/eventlet/queue.py", line 118, in switch
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: self.greenlet.switch(value)
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 21...n main
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: result = function(*args, **kwargs)
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 71...ervice
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: raise SystemExit(1)
Nov 02 14:24:16 compute-0.localdomain nova-compute[15309]: SystemExit: 1

Btw. if I wait some time (I am not sure how long) and then do "pcs resource cleanup", the service is started correctly by pacemaker - so I need to wait several minutes between "pcs resource enable nova-compute" and "pcs resource cleanup". Starting the service right after enabling the resource fails, but a later cleanup is successful and nova-compute starts; the cleanup does not work if I run it immediately after the failed start.
Should I use such a workaround and sleep for some time?

Comment 10 Andrew Beekhof 2016-11-03 03:30:36 UTC
Can we get the nova versions too please?

Comment 11 Andrew Beekhof 2016-11-03 03:46:59 UTC
Can you change 'Restart=always' in /usr/lib/systemd/system/openstack-nova-compute.service to say 'Restart=no' and retry the "not via pacemaker" case?

I wonder if the error happens more often but is hidden by systemd auto-restarting nova.

Comment 12 Marian Krcmarik 2016-11-03 15:54:34 UTC
(In reply to Andrew Beekhof from comment #11)
> Can you change 'Restart=always' in
> /usr/lib/systemd/system/openstack-nova-compute.service to say 'Restart=no'
> and retry the "not via pacemaker" case?
> 
> I wonder if the error happens more often but is hidden by systemd
> auto-restarting nova.

I tried that, and even used the same unit file as is used when starting through pacemaker (just added the ExecStart line), and it worked for me.
It's really strange since it only fails when enabling the resource, on the first attempt to start.

Comment 13 Andrew Beekhof 2016-11-04 01:24:29 UTC
I believe that this is caused by a race condition in nova-compute.

If the database goes away while nova-compute is running, it will attempt to reconnect.
However if the database is not available when nova-compute tries to start, it will terminate immediately.

Without pacemaker, systemd will automatically restart it.
However pacemaker explicitly disables this behaviour, so we notice the failure.


Reproducer:

sed -i 's/Restart=.*/Restart=no/' /usr/lib/systemd/system/openstack-nova-compute.service
pcs resource disable galera
systemctl restart openstack-nova-compute


The short term solution is arguably to add an ordering constraint between galera and nova-compute.

# pcs constraint order galera-master then nova-compute-clone
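What this comment asks of nova - treat a database that is unavailable at startup the same way as one that goes away at runtime - could be sketched roughly like this (a minimal illustration with hypothetical names, not nova's actual code):

```python
import time

def connect_with_retry(connect, retries=-1, interval=10, sleep=time.sleep):
    """Keep retrying the initial connection instead of exiting on failure.

    retries=-1 means retry forever, mirroring the reconnect behaviour nova
    already has for connections lost at runtime.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            attempt += 1
            if retries != -1 and attempt > retries:
                raise SystemExit(1)  # the current fail-fast behaviour
            sleep(interval)

# Example: the first two attempts fail, the third succeeds.
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("galera not available yet")
    return "connected"

print(connect_with_retry(flaky_connect, interval=0))  # -> connected
```

With retries=-1, the service would come up and sit in this loop until galera appears, instead of exiting with SystemExit(1) on the first failed attempt.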

Comment 14 Andrew Beekhof 2016-11-04 04:05:33 UTC
tl;dr - if the database goes away while nova-compute is running, it tries to reconnect. The same behaviour is required if the database is not available at startup.

Comment 15 melanie witt 2016-11-11 15:48:00 UTC
We discussed this in the compute team bug triage call this morning. The behavior of nova-compute isn't a race condition -- it was an intentional choice to have the service fail fast if it is unable to perform its initialization tasks and serve requests. This way, an operator is immediately aware of any such problem with nova-compute instead of allowing nova-compute to start and run when it is, in fact, not actually working. If we desire to change the behavior, it would take the same consideration and discussion upstream as it did to arrive at the present behavior.

As you mentioned, usually systemd will restart the process. What is the reason you don't want to use Restart=yes in the pacemaker conf or have the ordering constraint that ensures galera is up and running before starting nova-compute that depends on it?

Comment 16 Fabio Massimo Di Nitto 2016-11-11 16:06:17 UTC
(In reply to melanie witt from comment #15)
> We discussed this in the compute team bug triage call this morning. The
> behavior of nova-compute isn't a race condition -- it was an intentional
> choice to have the service fail fast if it is unable to perform its
> initialization tasks and serve requests. This way, an operator is
> immediately aware of any such problem with nova-compute instead of allowing
> nova-compute to start and run when it is, in fact, not actually working. If
> we desire to change the behavior, it would take the same consideration and
> discussion upstream as it did to arrive at the present behavior.

the problem is that this behavior goes against the HA NG architecture that was proposed/accepted in early Jan 2016, and no objections were made by any team. This architecture is now a reality and implemented in OSP10 (aka we can't roll back).

The short version is:

"Every service has to be able to start/stop without their dependencies running: for example galera/rabbit. and need to be able to connect/reconnect at run time to those services as they are made available".

This is necessary to avoid unnecessary failures due to services on different nodes starting at different times, since startup is not ordered or controlled.

Similar concepts apply to composable roles.

All services on the controller nodes are already operating with this concept in mind. nova-compute appears to be the exception at this point.

> 
> As you mentioned, usually systemd will restart the process. What is the
> reason you don't want to use Restart=yes in the pacemaker conf or have the
> ordering constraint that ensures galera is up and running before starting
> nova-compute that depends on it?

systemd and pacemaker are both service managers. Having them both take decisions on the same service is a bad idea. Restart=yes delegates that decision to systemd and won't allow pacemaker to operate properly.

The reason why we don't want constraints is that they cause other issues at runtime: for example, a galera restart would trigger a nova-compute restart.
This behavior is something customers don't like in general, since nova-compute is capable of recovering its connection to galera once the service is started.

Comment 17 Stephen Gordon 2016-11-11 16:35:09 UTC
(In reply to Fabio Massimo Di Nitto from comment #16)
> (In reply to melanie witt from comment #15)
> > We discussed this in the compute team bug triage call this morning. The
> > behavior of nova-compute isn't a race condition -- it was an intentional
> > choice to have the service fail fast if it is unable to perform its
> > initialization tasks and serve requests. This way, an operator is
> > immediately aware of any such problem with nova-compute instead of allowing
> > nova-compute to start and run when it is, in fact, not actually working. If
> > we desire to change the behavior, it would take the same consideration and
> > discussion upstream as it did to arrive at the present behavior.
> 
> the problem is that this behavior goes against the HA NG architecture that
> was proposed/accepted in early Jan 2016 and no objections were made by any
> team. This architecture is now a reality and implemented in OSP10 (aka we
> can´t roll back).

I think the problem is it was discussed and implemented only on the control plane (and honestly I am struggling to find relevant threads on rh-openstack-dev). After all if it was actually implemented on the non-control plane nodes we wouldn't be having this discussion. Can you point to the archive of discussion where this was agreed to for e.g. the compute nodes?

> All services on the controller nodes are already operating with this concept in mind. nova-compute appears to be the exception at this point.

Has anyone actually done any investigation of non-control plane nodes (e.g. storage) to confirm this or is that just the assumption because this is the first case where we have run it (I assume due to the intersection with Instance HA - for cases that don't intersect with Instance HA it seems like the proposed action of having systemd restart it wouldn't be an issue)?

Comment 18 Fabio Massimo Di Nitto 2016-11-11 17:27:20 UTC
(In reply to Stephen Gordon from comment #17)
> (In reply to Fabio Massimo Di Nitto from comment #16)
> > (In reply to melanie witt from comment #15)
> > > We discussed this in the compute team bug triage call this morning. The
> > > behavior of nova-compute isn't a race condition -- it was an intentional
> > > choice to have the service fail fast if it is unable to perform its
> > > initialization tasks and serve requests. This way, an operator is
> > > immediately aware of any such problem with nova-compute instead of allowing
> > > nova-compute to start and run when it is, in fact, not actually working. If
> > > we desire to change the behavior, it would take the same consideration and
> > > discussion upstream as it did to arrive at the present behavior.
> > 
> > the problem is that this behavior goes against the HA NG architecture that
> > was proposed/accepted in early Jan 2016 and no objections were made by any
> > team. This architecture is now a reality and implemented in OSP10 (aka we
> > can´t roll back).
> 
> I think the problem is it was discussed and implemented only on the control
> plane (and honestly I am struggling to find relevant threads on
> rh-openstack-dev). After all if it was actually implemented on the
> non-control plane nodes we wouldn't be having this discussion.

but the whole point of composable roles (leaving aside for one second pcmk and HA) is that there is no more control plane or compute plane. You can have everything mixed up as you like (or almost), so the behavior of services needs to be consistent.

At the time we started working on HA NG, we knew that composable roles were on the way, and we didn't differentiate between controller and compute.

Also, a reply to one more point from comment #15: the fact that nova-compute might or might not start, and that you get a notification from a potential monitoring tool (which we don't have yet in the product, btw), doesn't confirm whether a given compute node is working until it's visible in "nova service-list" (IIRC the command). So the daemon can start, but the monitoring tool should check for its visibility from inside nova to confirm functionality (or at least part of it).
There is also a corner case that could be considered. Think of a big infrastructure going down (say a loss of power) with the usual 3/4 controller nodes and hundreds of computes. Do we really expect the operator to go and manually restart hundreds of nova-compute services because galera took 1 second longer to bootstrap? I am just trying to think more from an end-user perspective. The monitoring tool would still give the info that a given compute is down via "nova service-list" if there is a problem (IMHO).

I'll try to dig up the thread, but I don't have it in my email archive. It was posted between Jan and Feb by Rob Young.

> Can you point
> to the archive of discussion where this was agreed to for e.g. the compute
> nodes?
> 
> > All services on the controller nodes are already operating with this concept in mind. nova-compute appears to be the exception at this point.
> 
> Has anyone actually done any investigation of non-control plane nodes (e.g.
> storage) to confirm this or is that just the assumption because this is the
> first case where we have run it (I assume due to the intersection with
> Instance HA - for cases that don't intersect with Instance HA it seems like
> the proposed action of having systemd restart it wouldn't be an issue)?

Storage is working fine, and it was never really managed by pacemaker (if you are thinking of storage as Ceph). Our CI currently checks the entire control plane and the ceph nodes. We have been able to CI/test IHA only recently due to many bugs that didn't allow us to deploy/test. All other services on the compute nodes are working fine (and follow the behavior that we are discussing in this bug).

I am no systemd expert, so I'll leave that tech detail to Andrew or whoever knows more than I do. I think at the end of the day we want to make sure nova-compute is started even if galera is still down or starting up, and is able to connect to it once galera is up and running. The details of the how, I am sure people more technically skilled than me can figure out :-)

Comment 19 Andrew Beekhof 2016-11-13 23:07:41 UTC
(In reply to Stephen Gordon from comment #17)
> (In reply to Fabio Massimo Di Nitto from comment #16)
> > (In reply to melanie witt from comment #15)
> > > We discussed this in the compute team bug triage call this morning. The
> > > behavior of nova-compute isn't a race condition -- it was an intentional
> > > choice to have the service fail fast if it is unable to perform its
> > > initialization tasks and serve requests. This way, an operator is
> > > immediately aware of any such problem with nova-compute instead of allowing
> > > nova-compute to start and run when it is, in fact, not actually working. If
> > > we desire to change the behavior, it would take the same consideration and
> > > discussion upstream as it did to arrive at the present behavior.
> > 
> > the problem is that this behavior goes against the HA NG architecture that
> > was proposed/accepted in early Jan 2016 and no objections were made by any
> > team. This architecture is now a reality and implemented in OSP10 (aka we
> > can´t roll back).
> 
> I think the problem is it was discussed and implemented only on the control
> plane (and honestly I am struggling to find relevant threads on
> rh-openstack-dev). After all if it was actually implemented on the
> non-control plane nodes we wouldn't be having this discussion. Can you point
> to the archive of discussion where this was agreed to for e.g. the compute
> nodes?

The official thread was "[rhos-dev] PLEASE READ: RHOSP High Availability Next and Component blocker bugs for OSP 10".

I don't see any explicit "we agree for computes"; however, nothing suggests the scope is limited to controllers either.

> 
> > All services on the controller nodes are already operating with this concept in mind. nova-compute appears to be the exception at this point.
> 
> Has anyone actually done any investigation of non-control plane nodes (e.g.
> storage) to confirm this or is that just the assumption because this is the
> first case where we have run it (I assume due to the intersection with
> Instance HA - for cases that don't intersect with Instance HA it seems like
> the proposed action of having systemd restart it wouldn't be an issue)?

I would expect it _would_ be an issue as systemd will also give up after a sufficient number of failed starts.

Comment 20 Andrew Beekhof 2016-11-13 23:26:14 UTC
(In reply to melanie witt from comment #15)
> We discussed this in the compute team bug triage call this morning. The
> behavior of nova-compute isn't a race condition -- it was an intentional
> choice to have the service fail fast if it is unable to perform its
> initialization tasks and serve requests. This way, an operator is
> immediately aware of any such problem with nova-compute instead of allowing
> nova-compute to start and run when it is, in fact, not actually working. If
> we desire to change the behavior, it would take the same consideration and
> discussion upstream as it did to arrive at the present behavior.
> 
> As you mentioned, usually systemd will restart the process. What is the
> reason you don't want to use Restart=yes in the pacemaker conf 

Pacemaker and systemd are both service managers.
Using Restart=yes would create an internal split-brain scenario if the two ever get out of sync.

Consider a failed service: the best case is that systemd notices the failure and restarts the service. This sounds like a good idea until you remember that Pacemaker also has policies configured for what to do in the event of a failure - none of which will be observed/triggered.

Worse is if Pacemaker _does_ notice, and now there are two entities starting/stopping the service with no regard for each other. The worst case is that systemd sends a start after Pacemaker sends a stop; when Pacemaker polls to confirm the stop was successful, it will find the service up, which will result in the node being fenced (powered off).

> or have the
> ordering constraint that ensures galera is up and running before starting
> nova-compute that depends on it?

We are doing that as a short-term fix, see comment #13

However since the current behaviour is classed as a blocker for OSP10 ([rhos-dev] PLEASE READ: RHOSP High Availability Next and Component blocker bugs for OSP 10) it still needs to be addressed in nova.

This will become more important as Pacemaker is phased out because such a constraint will no longer be possible and it will result in a lot of "noise" from kubernetes (or some equivalent) that the nova-compute container is in a tight restart loop.

Comment 21 melanie witt 2016-11-14 19:06:56 UTC
Thanks Fabio and Andrew for explaining those details.

I looked into the behavior of various nova services and found that nova-scheduler also doesn't stay up and running if the database isn't available when it starts. When it starts, it immediately does a database call (which fails with connection refused) and then it does a retry countdown from 10, waiting 10 seconds between each attempt. This retry behavior isn't done by Nova, it's done by the underlying oslo.db library. So nova-scheduler takes 100 seconds before it dies, if the database isn't running.

The nova-compute behavior is different because it does all database access through nova-conductor. I found that:

* If nova-conductor is running and has never been able to connect to the database, nova-compute will keep trying because nova-conductor isn't responding to it and the service doesn't die.

* If nova-conductor is running and has been able to connect to the database any time in the past, nova-compute will die immediately when nova-conductor responds to it with the DBConnectionError.

* If nova-conductor is not running, nova-compute will keep trying to contact it and the service doesn't die.

So it appears you must be experiencing the situation where nova-conductor has been able to talk to the database in the past but the database has become unavailable and nova-compute has been started?
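The nova-scheduler startup behavior described above (a countdown from 10 attempts with 10 seconds between them, dying after roughly 100 seconds) can be sketched as follows (an illustration of the described behavior only, not oslo.db's actual implementation; in real use sleep would be time.sleep):

```python
def retry_countdown(connect, max_retries=10, interval=10, sleep=lambda s: None):
    """Count down from max_retries, sleeping between attempts, then give up.

    Mirrors the described oslo.db startup behaviour: ~max_retries * interval
    seconds elapse before the service dies if the database never comes up.
    """
    waited = 0
    for remaining in range(max_retries, 0, -1):
        try:
            return connect(), waited
        except ConnectionError:
            sleep(interval)
            waited += interval
    raise SystemExit(1)  # dies ~100 seconds after startup with the defaults

def always_down():
    raise ConnectionError("connection refused")

try:
    retry_countdown(always_down)
except SystemExit:
    print("gave up after ~100 seconds of retrying")
```

This is also why a retry=-1-style setting (mentioned in comment #22) avoids the problem: the countdown never reaches zero, so the service waits out the database outage instead of exiting.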

Comment 22 Fabio Massimo Di Nitto 2016-11-15 09:21:01 UTC
Melanie, thanks for the investigation. I'll set needinfo on Raoul, who's running our tests and would know about timers and such.

Also, I believe most of the services on controller nodes are configured with retry=-1 (which means try to connect forever at startup). So we should cross-check the configs generated by director vs the upstream code.

Comment 23 Raoul Scarazzini 2016-11-15 14:17:29 UTC
So we did not get this error in our tests until now because our HA pipeline was blocked on another... blocker [1].
We run this test: stop the core resources (galera, rabbit and redis), wait 20 minutes, then start the core resources again, checking for failed actions on the cluster. After this we deploy an instance.
This is where we usually see problems if something is not working. Once the problem above was solved, instance deployment failed on a different problem, not strictly related to nova-compute, but seemingly connected to the issue described in this bug:

ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'sqlalchemy.exc.OperationalError'> (HTTP 500) (Request-ID: req-75a9095c-f08d-470b-886e-8a8ec05f78bb)

I uploaded all the sosreports from this test here [2], maybe it can be helpful while triaging this one.

[1] https://bugs.launchpad.net/tripleo/+bug/1637443
[2] http://file.rdu.redhat.com/~rscarazz/BZ1390632/

Comment 24 melanie witt 2016-11-16 01:40:16 UTC
(In reply to Raoul Scarazzini from comment #23)
> I uploaded all the sosreports from this test here [2], maybe it can be
> helpful while triaging this one.
> 
> [1] https://bugs.launchpad.net/tripleo/+bug/1637443
> [2] http://file.rdu.redhat.com/~rscarazz/BZ1390632/

Thanks for uploading the logs. I had a look at the Nova logs and didn't see that the nova-compute service ever stopped during the downtime of the core resources. From the logs, it stayed running, trying to reconnect until galera/rabbit returned. As for nova-api, I saw in the logs that requests failed with a 500 until galera returned, which seems expected to me as Nova (and Keystone) can't fulfill any requests while the database is down.

What is the behavior you expected to see?

Comment 25 Fabio Massimo Di Nitto 2016-11-16 04:27:10 UTC
(In reply to melanie witt from comment #24)
> (In reply to Raoul Scarazzini from comment #23)
> > I uploaded all the sosreports from this test here [2], maybe it can be
> > helpful while triaging this one.
> > 
> > [1] https://bugs.launchpad.net/tripleo/+bug/1637443
> > [2] http://file.rdu.redhat.com/~rscarazz/BZ1390632/
> 
> Thanks for uploading the logs. I had a look at the Nova logs and didn't see
> that the nova-compute service ever stopped during the downtime of the core
> resources. 

I think Raoul misunderstood my question above :-). Clearly there seems to be a lack of CI testing here and nova-compute is not being covered properly.

>From the logs, it staying running, trying to reconnect until
> galera/rabbit returned. As for nova-api, I saw in the logs that requests
> failed with a 500 until galera returned, which seems expected to me as Nova
> (and Keystone) can't fulfill any requests while the database is down.
> 
> What is the behavior you expected to see?

This is expected and correct.

Comment 26 Raoul Scarazzini 2016-11-16 07:37:32 UTC
(In reply to Fabio Massimo Di Nitto from comment #25)
> (In reply to melanie witt from comment #24)
> > (In reply to Raoul Scarazzini from comment #23)
> > > I uploaded all the sosreports from this test here [2], maybe it can be
> > > helpful while triaging this one.
> > > 
> > > [1] https://bugs.launchpad.net/tripleo/+bug/1637443
> > > [2] http://file.rdu.redhat.com/~rscarazz/BZ1390632/
> > 
> > Thanks for uploading the logs. I had a look at the Nova logs and didn't see
> > that the nova-compute service ever stopped during the downtime of the core
> > resources. 
> 
> I think Raoul misunderstood my question above :-). Clearly there seems to be
> a lack of CI testing here and nova-compute is not being covered properly.

Could be, yes. What we are testing today is the overall behavior of the environment after a stop and a start of the core resources.
We don't stop nova-compute, it's true, but I was thinking about the error we got inside the instance test and the fact that it could be related to this issue.

> >From the logs, it staying running, trying to reconnect until
> > galera/rabbit returned. As for nova-api, I saw in the logs that requests
> > failed with a 500 until galera returned, which seems expected to me as Nova
> > (and Keystone) can't fulfill any requests while the database is down.
> > 
> > What is the behavior you expected to see?
> 
> This is expected and correct.

Uhm, not quite. The database IS NOT down while nova-api tries to contact it and while we are trying to test instance deployment, because the previous test (stop/start of the core resources) went fine. So, related or not to the main issue of this bug, this is a problem; that's why I posted the logs.

Comment 27 Fabio Massimo Di Nitto 2016-11-16 07:53:28 UTC
(In reply to Raoul Scarazzini from comment #26)
> (In reply to Fabio Massimo Di Nitto from comment #25)
> > (In reply to melanie witt from comment #24)
> > > (In reply to Raoul Scarazzini from comment #23)
> > > > I uploaded all the sosreports from this test here [2], maybe it can be
> > > > helpful while triaging this one.
> > > > 
> > > > [1] https://bugs.launchpad.net/tripleo/+bug/1637443
> > > > [2] http://file.rdu.redhat.com/~rscarazz/BZ1390632/
> > > 
> > > Thanks for uploading the logs. I had a look at the Nova logs and didn't see
> > > that the nova-compute service ever stopped during the downtime of the core
> > > resources. 
> > 
> > I think Raoul misunderstood my question above :-). Clearly there seems to be
> > a lack of CI testing here and nova-compute is not being covered properly.
> 
> Could be, yes. What we're testing today is the overall behavior of the
> environment after a stop and a start of the core resources.
> We don't stop nova-compute, it's true, but I was thinking about the error we
> got inside the instance test and the fact that it could be related to this
> issue.
> 

Let's not mix issues and errors, otherwise this BZ will become a catch-all and never get to a conclusion.

What I would like to see is nova-compute (and other compute resources) being tested in CI the same way we do for other resources in the HA NG scenario where any resource can start before the dependencies are available and survive when dependencies are going away (etc.)

> > > From the logs, it stayed running, trying to reconnect until
> > > galera/rabbit returned. As for nova-api, I saw in the logs that requests
> > > failed with a 500 until galera returned, which seems expected to me as Nova
> > > (and Keystone) can't fulfill any requests while the database is down.
> > > 
> > > What is the behavior you expected to see?
> > 
> > This is expected and correct.
> 
> Uhm, so and so. Database IS NOT down while nova-api tries to contact it and
> while we are trying to test instance deployment. This because the previous
> test (stop/start core resources) went fine. So, related or not to the main
> issue of this bug, this is a problem, that's why I posted the logs.

Let's open another bug for that and track it separately.

Comment 28 Raoul Scarazzini 2016-11-16 16:10:45 UTC
(In reply to Fabio Massimo Di Nitto from comment #27)
[...]
> Let's not mix issues and errors otherwise this BZ will become a catch-all
> and will never get to a conclusion.

Ack, sorry for creating entropy around this bug. So, while focusing on this specific issue, this is the last test I've done:

1 - Stop galera and rabbit;
2 - Stop all nova services on controllers (nova-api nova-conductor nova-consoleauth nova-novncproxy nova-scheduler) and compute (nova-compute);
3 - Start all nova services on controllers and compute;
4 - Wait some time (~20 mins);
5 - Start galera and rabbit;
6 - Deploy an instance;

On step 3, while starting nova-scheduler on each controller I get:

Job for openstack-nova-scheduler.service failed because the control process exited with error code. See "systemctl status openstack-nova-scheduler.service" and "journalctl -xe" for details.

and the command exits, so the test can move forward. This does not happen when activating nova-compute on the compute nodes: in this case the systemctl command waits forever (not just 200s, to be clear).
If you intervene manually (though at this point the automated test would obviously be declared failed) and interrupt the command (Ctrl+C, to be clear), you see that the status of the resource is "activating".
Proceeding with the other steps ends in success.
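For the record, the endless wait on step 3 can be avoided on the test side. This is only a sketch: the unit name and the 200s bound come from my observation above, and wait_active is a made-up helper, not a systemd feature.

```shell
# Start the unit without blocking, then poll its state with a bounded
# timeout instead of letting "systemctl start" hang in "activating".
# Real usage would look like:
#   systemctl start --no-block openstack-nova-compute
#   wait_active "systemctl -q is-active openstack-nova-compute" 200
wait_active() {   # wait_active <check-command> <timeout-in-seconds>
  check="$1"; timeout="$2"; i=0
  while [ "$i" -lt "$timeout" ]; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "active"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "still activating after ${timeout}s"
  return 1
}
```

With this the automated test fails cleanly after the timeout instead of blocking forever on the systemctl invocation.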

I uploaded these new sosreports and logs here [1]. Hope this can help more with the triaging.

[1] http://file.rdu.redhat.com/~rscarazz/BZ1390632/

> What I would like to see is nova-compute (and other compute resources) being
> tested in CI the same way we do for other resources in the HA NG scenario
> where any resource can start before the dependencies are available and
> survive when dependencies are going away (etc.)
> 
> > > > From the logs, it stayed running, trying to reconnect until
> > > > galera/rabbit returned. As for nova-api, I saw in the logs that requests
> > > > failed with a 500 until galera returned, which seems expected to me as Nova
> > > > (and Keystone) can't fulfill any requests while the database is down.
> > > > 
> > > > What is the behavior you expected to see?
> > > 
> > > This is expected and correct.
> > 
> > Uhm, so and so. Database IS NOT down while nova-api tries to contact it and
> > while we are trying to test instance deployment. This because the previous
> > test (stop/start core resources) went fine. So, related or not to the main
> > issue of this bug, this is a problem, that's why I posted the logs.
> 
> Let's open another bug for that and track it separately.

Will do, for sure. Sorry for the overlap.

Comment 29 Andrew Beekhof 2016-11-17 00:27:28 UTC
(In reply to Raoul Scarazzini from comment #28)
> (In reply to Fabio Massimo Di Nitto from comment #27)
> [...]
> > Let's not mix issues and errors otherwise this BZ will become a catch-all
> > and will never get to a conclusion.
> 
> Ack, sorry for creating entropy around this bug. So, while focusing on this
> specific issue, this is the last test I've done:
> 
> 1 - Stop galera and rabbit;
> 2 - Stop all nova services on controllers (nova-api nova-conductor
> nova-consoleauth nova-novncproxy nova-scheduler) and compute (nova-compute);
> 3 - Start all nova services on controllers and compute;
> 4 - Wait some time (~20 mins);
> 5 - Start galera and rabbit;
> 6 - Deploy an instance;

I think the problem is that we didn't make this sequence smart enough.

In addition to checking that we can deploy an instance (thereby validating that the stack was able to get into a functional state), we should also be checking if any services were being respawned by systemd during step '4'.

This is important because if the galera/rabbit outage is long enough:
- eventually even systemd will give up restarting services, and/or
- the logs will be full of noise making it harder to track down the root cause
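A rough way to check for respawns during step '4' would be something like the following sketch. Assumptions: the unit name, and that the journal "Started ..." message carries the unit's Description; on newer systemd the NRestarts property is the authoritative counter, the grep is a fallback for older hosts.

```shell
# On newer systemd the unit keeps a restart counter:
#   systemctl show -p NRestarts openstack-nova-compute
# On older systemd (e.g. RHEL 7), extract the journal for the test window
#   journalctl -u openstack-nova-compute --since "20 min ago" > unit.log
# and count the "Started ..." lines in it:
count_starts() {   # count_starts <journal-extract-file> <unit-description>
  grep -c "Started ${2}" "$1"
}
```

Anything above 1 during step '4' would mean systemd was respawning the service behind the test's back.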

Comment 30 melanie witt 2016-11-18 22:58:27 UTC
(In reply to Raoul Scarazzini from comment #28) 
> So, while focusing on this
> specific issue, this is the last test I've done:
> 
> 1 - Stop galera and rabbit;
> 2 - Stop all nova services on controllers (nova-api nova-conductor
> nova-consoleauth nova-novncproxy nova-scheduler) and compute (nova-compute);
> 3 - Start all nova services on controllers and compute;
> 4 - Wait some time (~20 mins);
> 5 - Start galera and rabbit;
> 6 - Deploy an instance;
> 
> On step 3, while starting nova-scheduler on each controller I get:
> 
> Job for openstack-nova-scheduler.service failed because the control process
> exited with error code. See "systemctl status
> openstack-nova-scheduler.service" and "journalctl -xe" for details.
> 
> and the command exits and test can move forward. This does not happen when
> activating of nova-compute on the compute nodes: in this case systemd
> command waits forever (not just 200s, to be clear).
> If you do manual intervention (but at this point obviously automatic test
> would be declared failed) and stop the command (Ctrl+c, to be clear), then
> you see that the status of the resource is "activating".
> Proceeding with the other steps ends in success.
> 
> I uploaded these new sosreports and logs here [1]. Hope this can help more
> on 
> 
> [1] http://file.rdu.redhat.com/~rscarazz/BZ1390632/
> 
> > What I would like to see is nova-compute (and other compute resources) being
> > tested in CI the same way we do for other resources in the HA NG scenario
> > where any resource can start before the dependencies are available and
> > survive when dependencies are going away (etc.)

I looked through the nova-scheduler and nova-compute logs and didn't find any Nova service exiting. They just kept attempting to contact the database.

I'm not sure I understand what Nova problem you are currently having. Are you ever getting into a situation where:

  1. Start galera.
  2. Start nova-conductor.
  3. Start nova-compute.
  4. Stop galera.
  5. Stop nova-compute.
  6. Start nova-compute and it exits immediately.

Because so far, that's the only way I'm aware of for nova-compute to exit immediately upon starting and I found that by doing local tests with devstack.

Comment 31 Andrew Beekhof 2016-11-20 22:55:47 UTC
(In reply to melanie witt from comment #30)
> (In reply to Raoul Scarazzini from comment #28) 
> > So, while focusing on this
> > specific issue, this is the last test I've done:
> > 
> > 1 - Stop galera and rabbit;
> > 2 - Stop all nova services on controllers (nova-api nova-conductor
> > nova-consoleauth nova-novncproxy nova-scheduler) and compute (nova-compute);
> > 3 - Start all nova services on controllers and compute;
> > 4 - Wait some time (~20 mins);
> > 5 - Start galera and rabbit;
> > 6 - Deploy an instance;
> > 
> > On step 3, while starting nova-scheduler on each controller I get:
> > 
> > Job for openstack-nova-scheduler.service failed because the control process
> > exited with error code. See "systemctl status
> > openstack-nova-scheduler.service" and "journalctl -xe" for details.
> > 
> > and the command exits and test can move forward. This does not happen when
> > activating of nova-compute on the compute nodes: in this case systemd
> > command waits forever (not just 200s, to be clear).
> > If you do manual intervention (but at this point obviously automatic test
> > would be declared failed) and stop the command (Ctrl+c, to be clear), then
> > you see that the status of the resource is "activating".
> > Proceeding with the other steps ends in success.
> > 
> > I uploaded these new sosreports and logs here [1]. Hope this can help more
> > on 
> > 
> > [1] http://file.rdu.redhat.com/~rscarazz/BZ1390632/
> > 
> > > What I would like to see is nova-compute (and other compute resources) being
> > > tested in CI the same way we do for other resources in the HA NG scenario
> > > where any resource can start before the dependencies are available and
> > > survive when dependencies are going away (etc.)
> 
> I looked through the nova-scheduler and nova-compute logs and didn't find
> any Nova service exiting. They just kept attempting to contact the database.
> 
> I'm not sure I understand what Nova problem you are currently having. Are
> you ever getting into a situation where:
> 
>   1. Start galera.
>   2. Start nova-conductor.
>   3. Start nova-compute.
>   4. Stop galera.
>   5. Stop nova-compute.
>   6. Start nova-compute and it exits immediately.
> 
> Because so far, that's the only way I'm aware of for nova-compute to exit
> immediately upon starting and I found that by doing local tests with
> devstack.

Are you saying that leaving nova-conductor running is the difference between reproducing and not?  That seems surprising.

Either way, that scenario would definitely occur when configuring instance HA.

Comment 32 Red Hat Bugzilla Rules Engine 2017-06-04 02:30:05 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 33 Artom Lifshitz 2018-01-17 20:36:38 UTC
Nova-compute needs certain other services to work properly: a database, a message queue, and nova-conductor. It's therefore reasonable for nova-compute to expect a database to be accessible when it starts.
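Whichever way the expectation goes, the behavior observed in this bug (either exiting immediately with SystemExit(1) or blocking the start job in "activating" forever) could be replaced by a bounded retry that gives up with a clear exit code. A generic sketch, where the probe command is a placeholder for a real connectivity check (e.g. against the galera VIP), not Nova's actual startup code:

```shell
# Retry a dependency probe a fixed number of times, then fail loudly
# (mirroring the SystemExit(1) from the bug summary) instead of
# blocking the systemd start job indefinitely.
wait_for_dep() {   # wait_for_dep <probe-command> <retries> <interval-seconds>
  probe="$1"; retries="$2"; interval="$3"; n=1
  while [ "$n" -le "$retries" ]; do
    if sh -c "$probe" >/dev/null 2>&1; then
      echo "dependency up after ${n} attempt(s)"
      return 0
    fi
    sleep "$interval"
    n=$((n + 1))
  done
  echo "dependency still down after ${retries} attempts" >&2
  return 1
}
# Hypothetical probe against the database (mysql client assumed available):
#   wait_for_dep "mysql -h <vip> -e 'SELECT 1'" 30 10
```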

