Bug 1814410
Summary: | [OSP13] nova_compute container unhealthy and service down because of entries in mysql nova migrations with dest_compute set to null | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | ggrimaux
Component: | openstack-tripleo-heat-templates | Assignee: | RHOS Maint <rhos-maint>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | David Rosenfeld <drosenfe>
Severity: | high | Docs Contact: |
Priority: | medium | |
Version: | 13.0 (Queens) | CC: | dasmith, eglynn, jhakimra, kchamart, lmiccini, mburns, mwitt, sbauza, sgordon, smooney, vromanso
Target Milestone: | --- | Keywords: | Triaged, ZStream
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-08-25 08:29:48 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
ggrimaux
2020-03-17 19:44:10 UTC
Without sos reports we can't debug this properly. I have read back through the attached case, and it looks like they had a network partition or some other incident that prevented the compute nodes from connecting to the controller, based on the initial RabbitMQ and DB errors. I am assuming it was a network partition, as the case also mentions that RabbitMQ was running correctly, so it was not a compute node failure. After that point the customer tried to evacuate VMs from the failed node; the evacuations went into an "accepted" state but had no destination, and that is what you are asking engineering to root cause. Given that the customer issue has already been fixed, I don't think this is high/high, so I have reduced it to medium/high, since there is no immediate action required to unblock the customer.

As I said, without the sos reports we can't really debug this properly, but my first guess would be: if all compute nodes were down and you tried to evacuate, this situation might happen if the scheduler returned a "no valid host" response. But that is just a guess, and we would have to look at this more closely. "Accepted" is also the first state a migration/evacuation enters, so it could be a result of the RabbitMQ issues they were having, resulting in an RPC being lost. As such, it is not clear whether this is specifically a Nova issue or the result of an infrastructure issue. Again, we would need the sos reports for all nodes involved to make that determination.

Asking for sosreports. Will get back to you when I have them.

As noted on IRC, I have reviewed the controller logs, but they provided no additional useful information. The sosreport did not contain the logs for the compute node that was evacuated, or for the other compute nodes. I can clearly see the RabbitMQ and database outage on the 15th, but by the 17th that is resolved. The stoppage of the logs on the compute node on the 15th is likely related to the RabbitMQ outage, but I can't be certain as I don't have the logs for that host. I suspect that the compute agent exited after trying to access the DB via the conductor.

Reading the description more closely, I think the evacuations are unrelated to the unhealthy state of the compute service on the compute node. Migrations with dest_compute set to NULL and status "accepted" are normal: that is the first state a migration enters before the evacuation begins, and the destination will be NULL until the scheduler selects a host.

The repeating message in the docker log output

    Mar 17 07:31:54 compute02 journal: Checking 11 migrations
    Mar 17 07:31:54 compute02 journal: Waiting for evacuations to complete or fail

is from https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/scripts/check-run-nova-compute#L61. I believe this is part of the instance HA feature, which is not supported by the Compute DFG; it is supported by PIDONE. What appears to have happened is that the check-run-nova-compute script's safe_to_start function prevented the compute agent on the compute node from starting. This was added in this change: https://opendev.org/openstack/tripleo-heat-templates/commit/9602a9bafc0d6b724aa4228411a8475e23f94efb

I am going to hand this over to PIDONE to triage. In the context of that change, it makes sense that the compute agent would not start and would be marked as unhealthy/down until the migrations were marked as failed.
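For context, here is a minimal sketch of the gating behaviour described above, written against python-novaclient. This is not the actual check-run-nova-compute code: the function names, the terminal-status set, and the 10-second polling interval are all assumptions made for illustration.

    # Minimal sketch of a safe_to_start-style gate: hold nova-compute back
    # until no pending migrations remain for this host. Assumed names and
    # intervals; only the two log messages are taken from the bug report.
    import os
    import time

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client as nova_client

    # Assumed terminal migration states; the real script may differ.
    TERMINAL = {'done', 'completed', 'failed', 'error'}


    def get_client():
        # Credentials from the usual OS_* environment variables.
        auth = v3.Password(auth_url=os.environ['OS_AUTH_URL'],
                           username=os.environ['OS_USERNAME'],
                           password=os.environ['OS_PASSWORD'],
                           project_name=os.environ['OS_PROJECT_NAME'],
                           user_domain_name='Default',
                           project_domain_name='Default')
        return nova_client.Client('2.11', session=session.Session(auth=auth))


    def wait_for_evacuations(nova, host):
        """Block until every migration involving `host` reaches a terminal state."""
        while True:
            migrations = nova.migrations.list(host=host)
            print('Checking %d migrations' % len(migrations))
            pending = [m for m in migrations if m.status not in TERMINAL]
            if not pending:
                return  # safe to start nova-compute
            print('Waiting for evacuations to complete or fail')
            time.sleep(10)


    if __name__ == '__main__':
        wait_for_evacuations(get_client(), os.environ.get('HOSTNAME', 'compute02'))

This matches the repeating journal messages above: the check re-polls indefinitely, so a migration stuck in "accepted" keeps the container's health check failing forever.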
When instance HA is involved, the nova_compute container is prevented from starting, and the respective service is explicitly marked as down via 'nova service-force-down', until all the VMs that should be migrated away from that compute have completed the migration/evacuation/rebuild. This is a safety measure to prevent the same VMs from being started twice on two different hosts. If for any reason the migrations cannot complete, the nova_compute container is unfortunately expected to never come up properly without operator intervention.

Closing because we don't have enough data to prove it is indeed a bug in the code. The current assumption, as per comment #7, is that the full recovery couldn't take place because the migrations did not complete, for reasons unknown.
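For illustration only, and reusing the assumed get_client() helper from the sketch above, an operator's inspection and recovery might look roughly like this. The host name "compute02" is taken from the log excerpt; the exact recovery steps would depend on why the migrations stalled, and services.force_down requires compute API microversion 2.11 or later.

    # Hypothetical operator check: list evacuation records stuck in
    # 'accepted' (dest_compute still NULL), then clear the forced-down
    # flag that 'nova service-force-down' set once they are resolved.
    nova = get_client()

    for m in nova.migrations.list(host='compute02', status='accepted'):
        # dest_compute is None until the scheduler picks a destination.
        print(m.id, m.instance_uuid, m.status, m.dest_compute)

    # Only after every pending migration has been marked failed or completed:
    nova.services.force_down('compute02', 'nova-compute', False)

With the forced-down flag cleared and no pending migrations left, the safe_to_start gate passes and the nova_compute container can come up healthy again.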