Description of problem:
In an OSP 10 + Contrail environment, we have an issue where a neutron port on a VM situated on a compute node goes missing when the hypervisor is rebooted. We can verify the interfaces with "nova interface-list", which shows 2 interfaces, whereas on the KVM side "virsh domiflist" shows only one interface attached.

Version-Release number of selected component (if applicable):
RHOSP10

How reproducible:

Steps to Reproduce:
1. Reboot the hypervisor
2. Observe from virsh that only 1 of the instance's 2 ports is attached to the VM

Actual results:
The instance has 2 ports assigned, but only one port reconnects after the hypervisor reboot. Both ports are visible from "nova interface-list", but only 1 port is visible from virsh.

Expected results:
Both ports reattach to the instance after the reboot.

Additional info:
This deployment is using Juniper Contrail. Contrail can be seen to respond to the request with both ports, but the instance appears to attach only 1 port.

Quote from customer: "From our understanding this is somewhere between nova and neutron / contrail, as neutron / contrail does answer with the right amount of ports but nova, when updating the network cache, only uses one. This can be seen from the logs as well."
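The mismatch can be observed by comparing the two views (a minimal sketch; substitute the affected instance's UUID and its libvirt domain name, which are placeholders here):

  # nova/neutron's view of the instance: shows both ports
  nova interface-list <instance-uuid>

  # libvirt's view on the rebooted compute node: shows only one interface
  virsh domiflist <libvirt-domain-name>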
Potentially the same as this issue? https://bugzilla.redhat.com/show_bug.cgi?id=1583976
I think it would be unlikely to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1583976.

Can you provide the instance UUID for the instance that had 2 interfaces and now has one? That will make parsing the logs significantly simpler.

It looks like vrouter in OSP10 is not using os-vif, so it's falling back to the legacy workflow for generating the XML and plugging the VIFs.

I'll keep looking to see if I can find an error for now.
(In reply to smooney from comment #3)
> I think it would be unlikely to be related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1583976
>
> Can you provide the instance UUID for the instance that had 2 interfaces and
> now has one? That will make parsing the logs significantly simpler.

The nova instance ID is "cdc94820-4c11-42c5-ba11-ce1107492d78".

> It looks like vrouter in OSP10 is not using os-vif, so it's falling back to
> the legacy workflow for generating the XML and plugging the VIFs.
>
> I'll keep looking to see if I can find an error for now.

Also, the weird part is that everything was working properly for a good few weeks.
From looking at the subset of logs in the x-text folder, I can see the instance UUID referenced in the neutron server log starting at "2018-10-22 15:30:20.436". However, this is after the sosreports for the compute node (sosreport-20181021-213509) were generated; as such, there are no logs currently available that cover the period when the instance was created. Can you regenerate the sosreports for the compute node (dcd01-compute-vrouter-1.localdomain) and reattach them? It would also be useful to know at what time the host reboot occurred so that we can compare the logs before and after.

In the interim, two possible mitigations come to mind for the instances that have been affected. As a first step, a hard reboot of the instance may be enough to force nova and neutron to get back in sync. If that does not resolve the issue, you could also try detaching the missing interfaces from the VM and reattaching them. In both cases this should force nova to regenerate the instance XML and rebind each of the interfaces on the neutron/Contrail side. See the sketch below.
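A sketch of the first mitigation (the UUID is a placeholder; a hard reboot destroys and recreates the libvirt domain, which is what forces nova to regenerate the instance XML):

  nova reboot --hard <instance-uuid>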
It is possible that this is related to the following bug, which has yet to be resolved upstream:

https://bugs.launchpad.net/nova/+bug/1751923

There is a proposed patchset to address this upstream:

https://review.openstack.org/#/c/614167/

I will need to assess whether this behavior exists in our downstream OSP10 codebase; if it does, we may be able to backport the fix once it is merged upstream.
Note: the upstream bug is actually addressed by https://review.openstack.org/#/c/591607
After reviewing the upstream patch, I believe the correct workaround for this issue is to detach and reattach the affected interfaces; a hard reboot will not result in updating the nova DB with state from neutron. Can you ask the customer to try this mitigation and report back? In the interim I will try to expedite review of https://review.openstack.org/#/c/591607 and investigate the backport viability.
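For clarity, the detach/reattach workaround uses the standard nova commands (a sketch; both UUIDs are placeholders):

  nova interface-detach <instance-uuid> <port-id>
  nova interface-attach --port-id <port-id> <instance-uuid>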
Hello,

I have finished analyzing the logs for instance cdc94820-4c11-42c5-ba11-ce1107492d78. As this instance was created before the logs start, I was not able to confirm that it actually lost interfaces; however, I did identify two points where the network info cache could have been corrupted due to failed updates. There are two occasions within the logs where an RPC call to save an update to the info cache failed, which could have resulted in the database and in-memory representations getting out of sync.

While analyzing cdc94820-4c11-42c5-ba11-ce1107492d78 has not proved very useful, I have identified another instance, e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f, which does appear to be in the same state. The instance creation logs exist for e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f and, more importantly, I can also see the errors related to attaching and detaching the interfaces. I am still reviewing the logs related to e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f, but I have a working theory as to why the reattaching failed and a possible workaround; I will update this bug when I confirm it. Effectively, since we cannot bring nova in line with neutron, I believe we can do the reverse and bring neutron in line with nova by detaching the interface from the neutron side, which works differently from the commands that were previously tried. I will need to find the exact command to do this, as it is not documented, since it is generally not used and the nova way is preferred.

Regards,
Sean

Below is a copy of my raw notes from log inspection related to instance cdc94820-4c11-42c5-ba11-ce1107492d78:
--------------------------------------------------------------------
The first error related to cdc94820-4c11-42c5-ba11-ce1107492d78 appears at 2018-10-17 15:39:42.564 and is caused by a request timeout:

  nova-compute.log-20181018:2018-10-17 15:39:42.564 3265 ERROR nova.compute.manager [req-ee8def0c-ba68-4da3-9c43-3726534354f5 - - - - -] [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] An error occurred while refreshing the network cache.
  ConnectTimeout: Request to http://192.168.4.6:9696/v2.0/networks.json?id=f224f6b1-b15d-465b-89df-e216ff121351 timed out

The next error directly related to instance cdc94820-4c11-42c5-ba11-ce1107492d78 happens at 15:47:27.538, caused by a neutron server internal error resulting in an unknown exception:

  nova-compute.log-20181018:2018-10-17 15:47:27.538 3265 ERROR nova.compute.manager [req-ee8def0c-ba68-4da3-9c43-3726534354f5 - - - - -] [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] An error occurred while refreshing the network cache.
  nova-compute.log-20181018:2018-10-17 15:47:27.538 3265 ERROR nova.compute.manager [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] InternalServerError: An unknown exception occurred.
  nova-compute.log-20181018:2018-10-17 15:47:27.538 3265 ERROR nova.compute.manager [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] Neutron server returns request_ids: ['req-ca43edc0-404b-4f80-852b-64733681ebaf']

This is followed by connection timeout errors from 2018-10-17 15:55:51.437 until 2018-10-17 18:42:45.428. At this point there are no other logs directly related to this instance until 13 days later at 2018-10-30 18:50:25.705, where we receive a number of messaging timeouts. These RPC messaging timeouts first happen in the _heal_instance_info_cache periodic task when trying to retrieve the instance.info_cache:

  nova-compute.log-20181031:2018-10-30 18:50:25.705 3609 ERROR nova.compute.manager [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 428, in refresh_info_cache_for_instance

Then at 2018-10-30 19:15:51.321 we fail to store an update to the info cache:

  File "/usr/lib/python2.7/site-packages/nova/network/base_api.py", line 50, in update_instance_cache_with_nw_info
  nova-compute.log-20181031:2018-10-30 19:15:51.321 3609 ERROR nova.network.base_api [req-9ef6151b-062d-4ded-a7de-918a3f07db9d - - - - -] [instance: cdc94820-4c11-42c5-ba11-ce1107492d78] Failed storing info cache

This is possibly when nova's info cache got corrupted. Later in the logs we fail to update the power state and update the instance cache again due to the same issue: a messaging timeout.
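If it helps to confirm whether the cache diverged, one way to check is to compare nova's stored network info cache against neutron's view of the same instance's ports (a sketch only, assuming direct access to the nova database on the controller; table and column names are from the nova schema, instance_info_caches.network_info):

  # nova's cached view of the instance's NICs
  mysql nova -e "SELECT network_info FROM instance_info_caches WHERE instance_uuid = 'cdc94820-4c11-42c5-ba11-ce1107492d78';"

  # neutron's authoritative view, filtered on device_id
  neutron port-list --device-id cdc94820-4c11-42c5-ba11-ce1107492d78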
After analyzing the logs of e32fa6c0-dd6f-4a0c-ba98-8bff7545cc1f, I was able to determine the exact point at which nova and neutron became out of sync. The summary is as follows. At 16:16:09 the instance was spawned with 2 interfaces. At 16:16:55 the instance was observed to be stopped as part of a reboot request that completed at 16:18:31. During the shutdown phase of the reboot, nova unbound the interfaces from the host and updated the info cache to reflect that no interfaces were attached to the instance. On start-up, nova successfully bound the interfaces in neutron back to the instance, but when it went to update the nova DB a VirtualInterfaceCreateException was raised. At this point nova and neutron had different views of the state of the instance and its ports; as a result, when nova generated the instance XML, it used the stale instance network info cache values from the nova DB, which had no interfaces.

After further investigation and local testing, the method I had hoped to suggest, detaching the interfaces from neutron instead of nova to correct this inconsistency, failed. Initially I was going to suggest running "openstack port set --device None --device-owner None" to clear the attachment details; however, this actually sets the device ID and device owner to the string "None" instead of the Python value None. The next remediation I tested locally was to use curl directly to generate the API request required to detach the port on the neutron side:

  TEMP_TOKEN=$(openstack token issue -c id -f value)
  PORT_ID=edcc370e-7919-4b04-8e92-59a74b932dd3
  NEUTRON_ENDPOINT=http://192.168.100.12:9696
  curl -X PUT -H "X-Auth-Token:${TEMP_TOKEN}" \
       -d '{"port": {"device_id": "", "device_owner": "", "binding:host_id": ""}}' \
       "${NEUTRON_ENDPOINT}/v2.0/ports/${PORT_ID}" | python -mjson.tool

This does in fact detach the port on the neutron side, but it did not remove it on the nova side, so when I tried to reattach the interface to the VM and then hard rebooted the instance, I was able to reproduce the reported failure locally. I will try to refine the reproduction steps locally to see if I can reproduce the failure consistently.

As the neutron-based detach approach is not viable, the best recommendation I can make is for the customer to delete the broken interfaces using neutron, recreate the ports, and attach the new ports to the nova instance.

Process to delete and recreate (a scripted version follows below):

1. Show the port, recording the fixed IP and MAC if you want to keep them the same, as well as the network ID the port was attached to:
   openstack port show <port-id>
2. Delete the port:
   openstack port delete <port-id>
3. Recreate the port using the previous MAC and IP:
   openstack port create --network <network> --fixed-ip ip-address=<ip> --mac-address <mac> <port-name>
4. Attach the new port to the instance:
   openstack server add port <server-uuid> <new-port-uuid>
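A condensed sketch of the same process, scripted with placeholder values (all IDs and the port name below are hypothetical, not taken from the customer environment):

  PORT_ID=<broken-port-uuid>
  SERVER_ID=<server-uuid>

  # capture the details we want to preserve before deleting
  openstack port show ${PORT_ID} -c network_id -c mac_address -c fixed_ips

  # delete the broken port, recreate it with the recorded values, then reattach
  openstack port delete ${PORT_ID}
  NEW_PORT_ID=$(openstack port create --network <network-id> \
      --fixed-ip ip-address=<recorded-ip> \
      --mac-address <recorded-mac> \
      replacement-port -c id -f value)
  openstack server add port ${SERVER_ID} ${NEW_PORT_ID}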
A quick update on this: the upstream changes have now been approved and are in the process of making their way through the upstream gate.

https://review.openstack.org/#/q/topic:bug/1751923+(status:open+OR+status:merged)

Once they are merged upstream, we can start the process of backporting these changes both upstream and downstream. The cutoff for inclusion in the next z-stream release was yesterday, so they will not be included in that release, but this should be a candidate for inclusion in the following z-stream.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:2811
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days