Bug 1814410
Summary: | [OSP13] nova_compute container unhealthy and service down because of entries in mysql nova migrations with dest_compute set to null | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | ggrimaux
Component: | openstack-tripleo-heat-templates | Assignee: | RHOS Maint <rhos-maint>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | David Rosenfeld <drosenfe>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 13.0 (Queens) | CC: | dasmith, eglynn, jhakimra, kchamart, lmiccini, mburns, mwitt, sbauza, sgordon, smooney, vromanso
Target Milestone: | --- | Keywords: | Triaged, ZStream
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-08-25 08:29:48 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
ggrimaux
2020-03-17 19:44:10 UTC
Without sos reports we can't debug this properly. I have read back through the attached case, and it looks like they had a network partition or some other incident that prevented the compute nodes from connecting to the controller, based on the initial RabbitMQ and DB errors. I am assuming it was a network partition, as the case also mentions that RabbitMQ was running correctly, so it was not a compute node failure. After that point the customer tried to evacuate VMs from the failed node; the evacuations went into an "accepted" state but had no destination, and that is what you are asking engineering to root cause. Given that the customer issue has already been fixed, I don't think this is high/high, so I have reduced it to medium/high, since there is no immediate action required to unblock the customer.

As I said, without the sos reports we can't really debug this properly, but my first guess would be: if all compute nodes were down and you tried to evacuate, this situation might happen if the scheduler returned a "no valid host" response. But that is just a guess, and we would have to look at this more closely. "Accepted" is also the first state a migration/evacuation enters, so it could be a result of the RabbitMQ issues they were having, resulting in an RPC being lost. As such, it is not clear whether this is specifically a Nova issue or the result of an infrastructure issue. Again, we would need the sos reports for all nodes involved to make that determination.

Asking for sosreports. Will get back to you when I have them.

As noted on IRC, I have reviewed the controller logs, but they provided no additional useful information. The sosreport did not contain the logs for the compute node that was evacuated, or for the other compute nodes. I can clearly see the RabbitMQ and database outage on the 15th, but by the 17th that is resolved. The stoppage of the logs on the compute node on the 15th is likely related to the RabbitMQ outage, but I can't be certain as I don't have the logs for that host. I suspect that the compute agent exited after trying to access the DB via the conductor.

Reading the description more closely, I think the evacuations are unrelated to the unhealthy state of the compute service on the compute node. Migrations with dest_compute set to NULL and status "accepted" are normal: that is the first state a migration enters before the evacuation begins, and the destination will be NULL until the scheduler selects a host.

The repeating message in the docker log output

    Mar 17 07:31:54 compute02 journal: Checking 11 migrations
    Mar 17 07:31:54 compute02 journal: Waiting for evacuations to complete or fail

is from https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/scripts/check-run-nova-compute#L61. I believe this is part of the instance HA feature, which is not supported by the Compute DFG; it is supported by PIDONE. What appears to have happened is that the check-run-nova-compute script's safe_to_start function prevented the compute agent on the compute node from starting. This was added in this change: https://opendev.org/openstack/tripleo-heat-templates/commit/9602a9bafc0d6b724aa4228411a8475e23f94efb

I am going to hand this over to PIDONE to triage. In the context of that change, it makes sense that the compute agent would not start and would be marked as unhealthy/down until the migrations were marked as failed.
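For context, here is a minimal sketch of the gating behaviour described above, written against python-novaclient. This is not the actual check-run-nova-compute code: the function names, the terminal-status set, and the 10-second polling interval are all assumptions made for illustration.

    # Minimal sketch of a safe_to_start-style gate: hold nova-compute back
    # until no pending migrations remain for this host. Assumed names and
    # intervals; only the two log messages are taken from the bug report.
    import os
    import time

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client as nova_client

    # Assumed terminal migration states; the real script may differ.
    TERMINAL = {'done', 'completed', 'failed', 'error'}


    def get_client():
        # Credentials from the usual OS_* environment variables.
        auth = v3.Password(auth_url=os.environ['OS_AUTH_URL'],
                           username=os.environ['OS_USERNAME'],
                           password=os.environ['OS_PASSWORD'],
                           project_name=os.environ['OS_PROJECT_NAME'],
                           user_domain_name='Default',
                           project_domain_name='Default')
        return nova_client.Client('2.11', session=session.Session(auth=auth))


    def wait_for_evacuations(nova, host):
        """Block until every migration involving `host` reaches a terminal state."""
        while True:
            migrations = nova.migrations.list(host=host)
            print('Checking %d migrations' % len(migrations))
            pending = [m for m in migrations if m.status not in TERMINAL]
            if not pending:
                return  # safe to start nova-compute
            print('Waiting for evacuations to complete or fail')
            time.sleep(10)


    if __name__ == '__main__':
        wait_for_evacuations(get_client(), os.environ.get('HOSTNAME', 'compute02'))

This matches the repeating journal messages above: the check re-polls indefinitely, so a migration stuck in "accepted" keeps the container's health check failing forever.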
When instance HA is involved, the nova_compute container is prevented from starting, and the respective service is explicitly marked as down via 'nova service-force-down', until all the VMs that should be migrated away from that compute have completed the migration/evacuation/rebuild. This is a safety measure to prevent the same VMs from being started twice on two different hosts. If for any reason the migrations cannot complete, the nova_compute container is unfortunately expected to never come up properly without operator intervention.

Closing because we don't have enough data to prove it is indeed a bug in the code. The current assumption, as per comment #7, is that the full recovery couldn't take place because the migrations did not complete, for reasons unknown.
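For illustration only, and reusing the assumed get_client() helper from the sketch above, an operator's inspection and recovery might look roughly like this. The host name "compute02" is taken from the log excerpt; the exact recovery steps would depend on why the migrations stalled, and services.force_down requires compute API microversion 2.11 or later.

    # Hypothetical operator check: list evacuation records stuck in
    # 'accepted' (dest_compute still NULL), then clear the forced-down
    # flag that 'nova service-force-down' set once they are resolved.
    nova = get_client()

    for m in nova.migrations.list(host='compute02', status='accepted'):
        # dest_compute is None until the scheduler picks a destination.
        print(m.id, m.instance_uuid, m.status, m.dest_compute)

    # Only after every pending migration has been marked failed or completed:
    nova.services.force_down('compute02', 'nova-compute', False)

With the forced-down flag cleared and no pending migrations left, the safe_to_start gate passes and the nova_compute container can come up healthy again.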