Bug 1645316 - [OSP15] Nova fails to attach both interfaces to VM after hypervisor reboot
Summary: [OSP15] Nova fails to attach both interfaces to VM after hypervisor reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 15.0 (Stein)
Assignee: smooney
QA Contact: Archit Modi
URL:
Whiteboard:
Depends On:
Blocks: 1691711 1691714 1691715
 
Reported: 2018-11-01 23:09 UTC by Brendan Shephard
Modified: 2023-09-18 00:14 UTC
CC List: 11 users

Fixed In Version: openstack-nova-19.0.1-0.20190412150344.7c915d0.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1691711 (view as bug list)
Environment:
Last Closed: 2019-09-21 11:19:23 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 591607 0 'None' MERGED Force refresh instance info_cache during heal 2021-02-18 21:02:28 UTC
OpenStack gerrit 614167 0 'None' MERGED Add fill_virtual_interface_list online_data_migration script 2021-02-18 21:02:28 UTC
Red Hat Issue Tracker OSP-5343 0 None None None 2022-05-17 13:57:12 UTC
Red Hat Knowledge Base (Solution) 4089141 0 None None None 2021-06-21 12:46:33 UTC
Red Hat Product Errata RHEA-2019:2811 0 None None None 2019-09-21 11:19:45 UTC

Description Brendan Shephard 2018-11-01 23:09:32 UTC
Description of problem:
In an OSP 10 + Contrail environment, we have an issue where a neutron port on a VM situated on a compute node goes missing when the hypervisor is rebooted. We can verify the interfaces with nova interface-list, which shows 2 interfaces, whereas on the KVM side virsh domiflist shows only one interface attached.
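
For illustration, the mismatch can be checked by comparing what nova reports with what libvirt actually attached (the instance uuid and libvirt domain name below are placeholders):

# ports nova believes are attached
nova interface-list <instance uuid>

# interfaces actually present in the libvirt domain on the compute node
virsh domiflist <libvirt domain name>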

Version-Release number of selected component (if applicable):
RHOSP10

How reproducible:


Steps to Reproduce:
1. Reboot the hypervisor
2. Observe that only one of the two ports is attached to the VM from virsh

Actual results:
Instance has 2 ports assigned, but only one port reconnects after hypervisor reboot. 

Both ports are visible from nova interface-list, but only one port is visible from virsh.

Expected results:
Both ports would reattach to the instance after reboot

Additional info:
This deployment is using Juniper Contrail. Contrail can be seen to respond to the request with both ports, but the instance appears to only attach 1 port.

Quote from customer:
From our understanding this is somewhere between nova and neutron / contrail, as neutron / contrail does answer with the right amount of ports, but nova, when updating the network cache, only uses one. This can be seen from the logs as well.

Comment 2 Brendan Shephard 2018-11-01 23:40:39 UTC
Potentially the same as this issue?
https://bugzilla.redhat.com/show_bug.cgi?id=1583976

Comment 3 smooney 2018-11-06 15:48:21 UTC
I think it is unlikely to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1583976

Can you provide the instance uuid for the instance that had 2 interfaces and now has one? That will make parsing the logs significantly simpler.

It looks like vrouter in OSP 10 is not using os-vif, so it is falling back to the legacy workflow for generating the XML and plugging the VIFs.

I'll keep looking to see if I can find an error for now.

Comment 6 Remus Buzatu 2018-11-06 22:25:52 UTC
(In reply to smooney from comment #3)
> I think it is unlikely to be related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1583976
> 
> Can you provide the instance uuid for the instance that had 2 interfaces and
> now has one? That will make parsing the logs significantly simpler.

Nova instance ID is "cdc94820-4c11-42c5-ba11-ce1107492d78".

> 
> It looks like vrouter in OSP 10 is not using os-vif, so it is falling back to
> the legacy workflow for generating the XML and plugging the VIFs.
> 
> I'll keep looking to see if I can find an error for now.

Also, the weird part is that everything was working properly for a good few weeks.

Comment 7 smooney 2018-11-09 18:34:32 UTC
From looking at the subset of logs in the x-text folder, I can see the instance uuid referenced in the neutron server log starting at "2018-10-22 15:30:20.436".
However, this is after the sosreports for the compute node (sosreport-20181021-213509) were generated, so there are no logs currently available that cover the period when the instance was created.

Can you regenerate the sosreports for the compute node (dcd01-compute-vrouter-1.localdomain) and reattach them?

It would also be useful to know at what time the host reboot occurred so that we can compare the logs before and after.

In the interim, two possible mitigations come to mind.

For the instances that have been affected, as a first step a hard reboot of the instance may be enough to force nova and neutron to get back in sync.

If that does not resolve the issue, you could also try detaching the missing interfaces from the VM and reattaching them.

In both cases this should force nova to regenerate the instance XML and rebind each of the interfaces on the neutron/Contrail side.
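
For reference, a rough sketch of those two mitigations using the nova CLI already available in OSP 10 (the uuids below are placeholders for the affected instance and port):

# mitigation 1: hard reboot the instance
nova reboot --hard <instance uuid>

# mitigation 2: detach and reattach the affected interface
nova interface-detach <instance uuid> <port uuid>
nova interface-attach --port-id <port uuid> <instance uuid>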

Comment 9 smooney 2018-11-09 19:14:03 UTC
It is possible that this is related to the following bug:
https://bugs.launchpad.net/nova/+bug/1751923
which has yet to be resolved upstream.
There is a proposed patchset to address this upstream:
https://review.openstack.org/#/c/614167/

I will need to assess whether this behavior exists in our downstream OSP 10 codebase,
and if it does we may be able to backport the fix once it is merged upstream.

Comment 10 smooney 2018-11-09 19:31:33 UTC
Note: the upstream bug is actually addressed by https://review.openstack.org/#/c/591607

Comment 11 smooney 2018-11-12 15:46:07 UTC
After reviewing the upstream patch, I believe the correct workaround for this issue is to detach and reattach the affected interfaces.

A hard reboot will not result in updating the nova db with state from neutron.

Can you ask the customer to try this mitigation and report back?

In the interim I will try to expedite review of https://review.openstack.org/#/c/591607 and investigate the backport viability.
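
(For context: the "heal" referred to in that patch is the _heal_instance_info_cache periodic task on the compute node. Its run interval is the heal_instance_info_cache_interval option in nova.conf; the upstream default is 60 seconds. A quick way to see whether a deployment overrides it, assuming the usual config path:)

grep -E '^heal_instance_info_cache_interval' /etc/nova/nova.conf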

Comment 12 smooney 2018-11-19 12:04:17 UTC
Hello, I have finished analyzing the logs for instance cdc94820-4c11-42c5-ba11-ce1107492d78.

As this instance was created before the logs start, I was not able to confirm that it actually lost interfaces; however, I did identify two points where the network info cache could have been corrupted due to failed updates.

There are two occasions within the logs where an RPC call to save an update to
the info cache failed, which could have resulted in the database and in-memory representations getting out of sync.

While analyzing cdc94820-4c11-42c5-ba11-ce1107492d78 has not proved very useful, I have identified another instance, e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f, which does appear to be in the same state. The instance creation logs exist for e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f, and more importantly I can also see the errors related to attaching and detaching the interfaces.

I am still reviewing the logs related to e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f,
but I have a working theory as to why the reattaching failed and a possible
workaround.

I will update this bug when I confirm it. Effectively, since we cannot bring nova in line with neutron, I believe we can do the reverse and bring neutron in line with nova by detaching the interface from the neutron side, which works differently to the commands that were previously tried. I will need to find the exact command to do this, as it is not documented; it is generally not used because the nova way is preferred.


regards
sean



Below is a copy of my raw notes from log inspection related to instance cdc94820-4c11-42c5-ba11-ce1107492d78:

--------------------------------------------------------------------

The first error related to cdc94820-4c11-42c5-ba11-ce1107492d78 appears in nova-compute.log-20181018 at 2018-10-17 15:39:42.564:

nova-compute.log-20181018:2018-10-17 15:39:42.564 3265 ERROR nova.compute.manager [req-ee8def0c-ba68-4da3-9c43-3726534354f5 - - - - -] [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] An error occurred while refreshing the network cache.

caused by a request timeout 
ConnectTimeout: Request to http://192.168.4.6:9696/v2.0/networks.json?id=f224f6b1-b15d-465b-89df-e216ff121351 timed out

The next error directly related to instance cdc94820-4c11-42c5-ba11-ce1107492d78 happens at 15:47:27.538:
nova-compute.log-20181018:2018-10-17 15:47:27.538 3265 ERROR nova.compute.manager [req-ee8def0c-ba68-4da3-9c43-3726534354f5 - - - - -] [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] An error occurred while refreshing the network cache.
caused by a neutron server internal error resulting in an unknown exception:

nova-compute.log-20181018:2018-10-17 15:47:27.538 3265 ERROR nova.compute.manager [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] InternalServerError: An unknown exception occurred.
nova-compute.log-20181018:2018-10-17 15:47:27.538 3265 ERROR nova.compute.manager [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] Neutron server returns request_ids: ['req-ca43edc0-404b-4f80-852b-64733681ebaf']

This is followed by connection timeout errors from 2018-10-17 15:55:51.437 until 2018-10-17 18:42:45.428.

At this point there are no other logs directly related to this instance until 13 days later, at 2018-10-30 18:50:25.705, where we receive a number of messaging timeouts.

These RPC messaging timeouts first happen in the _heal_instance_info_cache periodic task when trying to retrieve the instance.info_cache:
nova-compute.log-20181031:2018-10-30 18:50:25.705 3609 ERROR nova.compute.manager [instance: cdc94820-4c11-42c5-ba11-ce1107492d78]   File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 428, in refresh_info_cache_for_instance

Then at 2018-10-30 19:15:51.321 we fail to store an update to the info cache (File "/usr/lib/python2.7/site-packages/nova/network/base_api.py", line 50, in update_instance_cache_with_nw_info):
nova-compute.log-20181031:2018-10-30 19:15:51.321 3609 ERROR nova.network.base_api [req-9ef6151b-062d-4ded-a7de-918a3f07db9d - - - - -] [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] Failed storing info cache

This is possibly when nova's info cache got corrupted.

Later in the logs we fail to update the power state and update the instance cache again due to the same issue, a messaging timeout.

Comment 14 smooney 2018-11-20 13:25:50 UTC
After analyzing the logs of e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f,
I was able to determine the exact point at which nova and neutron became out of sync.

The summary is as follows.

At 16:16:09 the instance was spawned with 2 interfaces.
At 16:16:55 the instance was observed to be stopped as part of a reboot
request that completed at 16:18:31.

During the shutdown phase of the reboot, nova unbound the interfaces from
the host and updated the info cache to reflect that no interfaces were
attached to the instance. On start up, nova successfully bound the interfaces in
neutron back to the instance, but when it went to update the nova db a VirtualInterfaceCreateException was raised.

At this point nova and neutron had different views of the state of the instance and its ports. As a result, when nova generated the instance XML it used the stale instance network info cache values from the nova db, which had no interfaces.
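
(As an aside, a way to check whether an instance is in this state: the cached network view nova uses when regenerating the XML lives in the nova database. A minimal read-only query, assuming the database is named "nova" and is reachable from the controller, would look roughly like this:)

mysql nova -e "SELECT network_info FROM instance_info_caches WHERE instance_uuid='e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f'\G"

If the network_info JSON lists fewer ports than neutron reports for the same instance, the cache is out of sync.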

After further investigation and local testing, the method I had hoped to suggest, detaching the interfaces from neutron instead of nova to correct this inconsistency, failed.

Initially I was going to suggest running
"openstack port set --device None --device-owner None"
to clear the attachment details; however, this actually sets the
device id and device owner to the string "None" instead of the
Python value None.

The next remediation I tested locally was to use curl directly to generate the API request that was required to detach the port on the neutron side:

TEMP_TOKEN=$(openstack token issue -c id -f value)
PORT_ID=edcc370e-7919-4b04-8e92-59a74b932dd3
NEUTRON_ENDPOINT=http://192.168.100.12:9696
curl -X  PUT -H "X-Auth-Token:${TEMP_TOKEN}" -d '{ "port":{"device_id":"","device_owner":"", "binding:host_id":"" }}'  "${NEUTRON_ENDPOINT}/v2.0/ports/${PORT_ID}" | python -mjson.tool


This in fact does detach the port on the neutron side, but it did not remove it on the nova side, so when I tried to reattach the interface to the VM and hard reboot the instance, I was able to reproduce the reported failure locally.

I will try to refine the reproduction steps locally to see if I can reproduce the failure consistently.

As the neutron-based detach approach is not viable, the best recommendation
I can make is for the customer to delete the broken interfaces using neutron,
recreate the ports, and attach the new ports to the nova instance.


Process to delete and recreate (see the worked example after these steps):
1. openstack port show <port id>
   Record the fixed IP and MAC if you want to keep them the same,
   and the network id the port was attached to.
2. Delete the port:
   openstack port delete <port id>
3. Recreate the port using the previous MAC and IP:
   openstack port create --network <network> --fixed-ip <ip> --mac-address <mac>
4. Attach the new port to the instance:
   openstack server add port <server uuid> <new port uuid>
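
For illustration only, an end-to-end run of those steps might look like the following (the network id, IP, MAC, port name and uuids are placeholders copied from the "port show" output in step 1):

openstack port show <old port id>            # note fixed_ips, mac_address and network_id
openstack port delete <old port id>
openstack port create --network <network id> \
    --fixed-ip ip-address=<original ip> \
    --mac-address <original mac> <new port name>
openstack server add port <server uuid> <new port uuid>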

Comment 16 smooney 2019-01-30 14:53:17 UTC
A quick update on this.
The upstream changes have now been approved and are in the process of making their
way through the upstream gate:
https://review.openstack.org/#/q/topic:bug/1751923+(status:open+OR+status:merged)

Once they are merged upstream, we can start the process of backporting these changes
both upstream and downstream.

The cutoff for inclusion in the next z-stream release was yesterday, so these will not be
included in that release, but this should be a candidate for inclusion in the following
z-stream.

Comment 29 errata-xmlrpc 2019-09-21 11:19:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:2811

Comment 30 Red Hat Bugzilla 2023-09-18 00:14:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

