Created attachment 1592586 [details]
nova-compute.log

Description of problem:
=======================
nova evacuate of a single instance failed:

ERROR nova.compute.manager [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Setting instance vm_state to ERROR: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed

Version-Release number of selected component:
=============================================
RHOS_TRUNK-15.0-RHEL-8-20190701.n.0

How reproducible:
=================
100%

Steps to Reproduce:
===================
1. Deploy OSPD 15 HA (undercloud, 3*controller, 2*compute, 3*ceph)
2. Create an instance
3. Shut down the source compute host
4. Evacuate the instance: nova evacuate instance-test-evc compute-0.localdomain
   (a command sketch follows at the end of this comment)

Actual results:
===============
Evacuation failed:

2019-07-22 12:10:55.132 6 WARNING nova.virt.libvirt.driver [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Timeout waiting for [('network-vif-plugged', '2ebb1cde-45a7-498c-9a56-2a276839b710')] for instance with vm_state active and task_state rebuild_spawning.: eventlet.timeout.Timeout: 300 seconds
2019-07-22 12:10:55.991 6 INFO os_vif [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:8b:c0:bc,bridge_name='br-int',has_traffic_filtering=True,id=2ebb1cde-45a7-498c-9a56-2a276839b710,network=Network(3626ea0a-ed25-4428-9e86-836287b2bf1f),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap2ebb1cde-45')
2019-07-22 12:10:56.096 6 INFO nova.virt.libvirt.driver [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Deleting instance files /var/lib/nova/instances/a3b0ca35-1342-444a-bae2-0731d4930bcc_del
2019-07-22 12:10:56.097 6 INFO nova.virt.libvirt.driver [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Deletion of /var/lib/nova/instances/a3b0ca35-1342-444a-bae2-0731d4930bcc_del complete
2019-07-22 12:10:56.616 6 ERROR nova.compute.manager [req-f366d9c5-cc6e-41b0-8d24-edcf39d926c1 338164c6b94644649ce2e3c937a1e289 66241b1e9f054152a67cf1acb46a1ca1 - default default] [instance: a3b0ca35-1342-444a-bae2-0731d4930bcc] Setting instance vm_state to ERROR: nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed

Expected results:
=================
Evacuation succeeds.

Additional info:
================
nova-compute.log enclosed
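A minimal command-line sketch of steps 3-4 above. The instance name and target host come from the steps; the source host name, the power-off method, and the status check are illustrative assumptions, not taken from this report:

  # Ungracefully power off the source compute node, e.g. from the hypervisor
  # hosting the virtual compute node (domain name assumed):
  virsh destroy compute-1
  # Wait for nova to report the source host as down, then evacuate:
  openstack compute service list --service nova-compute
  nova evacuate instance-test-evc compute-0.localdomain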
(In reply to Ido Ovadia from comment #0)
> Steps to Reproduce:
> ===================
> 1. Deploy OSPD 15 HA (undercloud, 3*controller, 2*compute, 3*ceph)
> 2. Create an instance
> 3. Shut down the source compute host
> 4. Evacuate the instance: nova evacuate instance-test-evc
> compute-0.localdomain

When you say shut down the source compute host, what do you mean? The whole
compute node or the nova container?

Could we get full sosreports so we can see what's happening from the neutron
side also.
(In reply to Stephen Finucane from comment #1)
> (In reply to Ido Ovadia from comment #0)
> > Steps to Reproduce:
> > ===================
> > 1. Deploy OSPD 15 HA (undercloud, 3*controller, 2*compute, 3*ceph)
> > 2. Create an instance
> > 3. Shut down the source compute host
> > 4. Evacuate the instance: nova evacuate instance-test-evc
> > compute-0.localdomain
>
> When you say shut down the source compute host, what do you mean? The whole
> compute node or the nova container?

Whole compute node:
https://docs.openstack.org/nova/rocky/admin/evacuate.html

> Could we get full sosreports so we can see what's happening from the neutron
> side also.
Okay, thanks. Could we get sosreports, please.
Please check this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1720675
and https://review.opendev.org/#/c/665581

We decided to set live_migration_wait_for_vif_plug=False for osp15. Related
changes are already merged so it should solve this problem. Could you please
verify this? The flag live_migration_wait_for_vif_plug should be set to
False.
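For reference, a minimal sketch of applying that setting on a compute host. The option name and its [compute] section follow the upstream nova docs; the containerized nova.conf path, the use of crudini, and the container name/restart step are assumptions about a typical OSP 15 deployment, not taken from this bug:

  # On the compute host (config path and tooling assumed):
  sudo crudini --set \
      /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf \
      compute live_migration_wait_for_vif_plug False
  # Restart the compute service container so the option takes effect
  # (container name assumed):
  sudo podman restart nova_compute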
(In reply to Maciej Józefczyk from comment #6)
> Please check this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1720675
> and https://review.opendev.org/#/c/665581
> We decided to set live_migration_wait_for_vif_plug=False for osp15. Related
> changes are already merged so it should solve this problem. Could you please
> verify this? The flag live_migration_wait_for_vif_plug should be set to
> False.

With live_migration_wait_for_vif_plug set to False, live migration works
successfully.
The upstream gate has been passing on the check phase but not the gate phase (for the same test). I don't think it is related to the patch; I'm verifying that. But after talking with Daniel, I'm going to remove the blocker flag from this bug.

This is something that has been broken in networking-ovn for a while (it reproduces on OSP 13). Although the default is changing to OVN in OSP 15, people who upgrade won't be switched from ml2/ovs to OVN. We're going to treat this as a regular bug.
Verified on puddle 15.0-RHEL-8/RHOS_TRUNK-15.0-RHEL-8-20190924.n.2, which uses
python3-networking-ovn-6.0.1-0.20190924050427.1242c73.el8ost.noarch.

Verified that instance evacuation works as expected.

Some details regarding setup and scenario:
OSP 15 with OVN HA (3 controllers, 2 computes, 3 ceph nodes).
Link to the build: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/OSPD-Customized-Deployment-virt/12643/

1. Created external and internal networks, a router, a keypair, and a security group with pingable and loginable rules. Connected the internal and external networks to the router.
2. Launched an instance connected to the internal network. Created a floating IP for the instance on the external network. Verified that the instance is accessible via the floating IP.
3. Ungracefully turned off the compute node where the instance was running ("Force Off" or "virsh destroy"; also tried a kernel panic using 'echo c > /proc/sysrq-trigger').
4. Waited until "openstack compute service list" showed the turned-off compute node as "down".
5. Initiated instance evacuation, i.e. executed "nova evacuate vm1-net1 compute-1.redhat.local" (see the command sketch below).
6. Checked that the instance was rebuilt on the target compute node.
7. Verified that the instance is actually running on the target compute node and has network connectivity. The instance received its hostname and ssh key from the metadata service.
8. Powered the turned-off compute host back on and repeated steps 3-7, this time evacuating to compute-0.redhat.local.
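A compressed command-line version of steps 4-6 above. The instance and host names are taken from this comment; the exact status-check commands are assumptions about how the result was inspected:

  # Wait until the powered-off source host is reported as "down":
  openstack compute service list --service nova-compute
  # Evacuate and confirm the instance was rebuilt on the target host:
  nova evacuate vm1-net1 compute-1.redhat.local
  openstack server show vm1-net1 -c OS-EXT-SRV-ATTR:host -c status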
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2957
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days