Bug 1662215 - VM will be in paused state because libvirt will wait for the non-existent vhu interface forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z6
Target Release: 13.0 (Queens)
Assignee: Artom Lifshitz
QA Contact: nova-maint
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-27 06:14 UTC by Chen
Modified: 2019-09-09 17:08 UTC
CC: 24 users

Fixed In Version: openstack-nova-17.0.9-3.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, when detaching and removing an interface from the instance XML, Nova expected all interfaces to have a `target_dev` element in the instance XML. However, not all interfaces have a `target_dev` element. As a result, detaching interfaces appeared to succeed, and Neutron unwired the port, but the interface would remain in the instance XML. Some operations, such as rebooting the instance, failed because the instance remained in a paused state, waiting for the other end of the vhost socket to connect. With this update, Nova does not expect `vhost-user` interfaces to have `target_dev` elements and the interface is removed correctly from the instance XML.
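The idea behind the fix can be sketched as follows. This is illustrative only, not the actual Nova patch; the XML snippet and the helper name are invented. Instead of assuming every interface carries a `<target dev=...>` element, the interface to detach is located by a property every interface does have, such as its MAC address:

```python
# Illustrative sketch only -- not the actual Nova code. It shows the idea
# described in the fix: match the interface to detach by MAC address rather
# than assuming a <target dev=...> element, which vhost-user interfaces may
# not carry.
import xml.etree.ElementTree as ET

DOMAIN_XML = """
<domain>
  <devices>
    <interface type='vhostuser'>
      <mac address='fa:16:3e:fe:b4:57'/>
      <source type='unix' path='/var/run/openvswitch/vhu279ac567-63' mode='server'/>
    </interface>
  </devices>
</domain>
"""

def find_interface_by_mac(domain_xml, mac):
    """Return the <interface> element whose <mac address> matches, or None."""
    root = ET.fromstring(domain_xml)
    for iface in root.findall("./devices/interface"):
        mac_el = iface.find("mac")
        if mac_el is not None and mac_el.get("address") == mac:
            return iface
    return None

iface = find_interface_by_mac(DOMAIN_XML, "fa:16:3e:fe:b4:57")
print(iface.get("type"))  # vhostuser
```

Matching by MAC works regardless of interface type, so a vhost-user interface without a target device is still found and removed.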
Clone Of:
Environment:
Last Closed: 2019-04-30 17:13:59 UTC
Target Upstream Version:


Attachments (Terms of Use)
domain xml (5.58 KB, text/plain)
2019-01-07 09:46 UTC, Jing Qi
neutron log (1.44 MB, text/plain)
2019-01-08 04:45 UTC, Jing Qi
nova-compute log (835.70 KB, text/plain)
2019-01-08 04:46 UTC, Jing Qi
domain xml for instance-00000009 (5.46 KB, text/plain)
2019-02-18 05:13 UTC, Jing Qi


Links
Red Hat Product Errata RHBA-2019:0924 2019-04-30 17:14:13 UTC

Description Chen 2018-12-27 06:14:28 UTC
Description of problem:

VM will be in paused state because libvirt will wait for the non-existent vhu interface forever.

Version-Release number of selected component (if applicable):

OSP13
ovs 2.9

How reproducible:

100%

Steps to Reproduce:
1. Create an instance (VM) with a vhu interface
2. Detach the interface manually (in OpenStack we use the nova interface-detach CLI)
3. Reboot the VM (in OpenStack we use nova reboot)

Actual results:

The VM is in paused state forever

In libvirt/qemu-xxxx.log we have the following message, and it never proceeds.

2018-12-27T05:52:04.580333Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu6807df14-08,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu6807df14-08,server

# ovs-vsctl show | grep 6807 | wc -l
0
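One rough way to spot the stale state described above is to compare the vhostuser socket paths referenced by the domain XML against what exists on disk. A minimal sketch (the helper and XML snippet are illustrative; the socket path is taken from the QEMU log above and assumed to be absent from the machine running the snippet):

```python
# Rough diagnostic sketch (not part of the bug report): list the vhost-user
# socket paths referenced by a domain XML and flag any whose socket file is
# missing on disk -- the stale state that leaves QEMU waiting forever.
import os
import xml.etree.ElementTree as ET

def stale_vhostuser_sockets(domain_xml):
    """Return socket paths referenced in the XML that do not exist on disk."""
    root = ET.fromstring(domain_xml)
    stale = []
    for iface in root.findall("./devices/interface[@type='vhostuser']"):
        src = iface.find("source")
        if src is not None:
            path = src.get("path")
            if path and not os.path.exists(path):
                stale.append(path)
    return stale

# Path taken from the qemu log above; assumed not to exist on this machine.
xml_snippet = """
<domain><devices>
  <interface type='vhostuser'>
    <source type='unix' path='/var/lib/vhost_sockets/vhu6807df14-08' mode='server'/>
  </interface>
</devices></domain>
"""
print(stale_vhostuser_sockets(xml_snippet))
```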

Expected results:

The VM should either start successfully or fail immediately

Additional info:

Comment 4 Jing Qi 2019-01-07 03:52:59 UTC
Hi Chen,
Can you please give details about how the interface was detached? Was it done while the VM was active? And have you checked whether the interface was really detached from the domain's active XML, or did you only see the success message from RHOS?
Also, which libvirt / qemu-kvm versions are being used?

Thanks,
Jing Qi

Comment 5 chhu 2019-01-07 06:36:59 UTC
This bug may be caused by this one:
Bug 1612052 - Failed to detach interface to VM in rhos

Comment 6 Chen 2019-01-07 07:32:30 UTC
Hi Jing,

>Can you please give detailed about how to detach the interface? 
>Is it done with vm active? 

Yes, I detached the NIC while the VM was running/active.

>And have you checked if the interface really detached from the domain active xml, or you just saw the succeeded information from RHOS?

I didn't get a chance to check that. I believe the XML didn't get refreshed, because after we rebooted the VM it still waited for the detached NIC.

>And what the libvirt / qemu-kvm version are being used?

Sorry my testing environment is not accessible now...

I think the bug in comment #5 is related. I am assuming the following happens when we detach the NIC:

1. The neutron openvswitch agent deletes the vhu interface.
2. The qemu-level domain XML is not refreshed.

If the above is true, then when we reboot the VM, the XML cannot find the vhu interface.

Best Regards,
Chen

Comment 7 Jing Qi 2019-01-07 09:36:28 UTC
With an RHOS13 env, I tried to detach a "bridge" interface using the libvirt "virsh" command; the command returned successfully.
However, the interface is still present in the active domain XML.

# virsh detach-device vm1 interface.xml
Device detached successfully

interface.xml:
 
    <interface type='bridge'>
      <mac address='fa:16:3e:7e:b6:aa'/>
      <source bridge='qbrd3a93336-db'/>
      <target dev='tapd3a93336-db'/>
      <model type='virtio'/>
      <driver name='vhost' rx_queue_size='512'/>
      <mtu size='1450'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>

Comment 8 Jing Qi 2019-01-07 09:46:12 UTC
Created attachment 1518937 [details]
domain xml

Comment 9 Chen 2019-01-07 10:14:13 UTC
Hi Jing,

Does this result differ between a non-RHOS environment and an RHOS environment? I mean, in a pure qemu-kvm + libvirt environment, if we detach a NIC using a virsh command, will the XML be refreshed dynamically?

Best Regards,
Chen

Comment 10 Jing Qi 2019-01-08 04:42:49 UTC
(In reply to Chen from comment #9)
> Hi Jing,
> 
> Does this result differ with non-RHOS environment and RHOS environment ? I
> mean in a pure qemu-kvm + libvirt environment, if we detach a NIC using
> virsh command, will the xml file be refreshed dynamically ?

The result is different in a pure libvirt environment: detaching the NIC device succeeds and the XML is refreshed.

In the RHOS environment, I found "Port removal failed" in the neutron log, although the detach reported success -
 
2019-01-08 04:18:37.500 21863 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-ad525197-1f88-429e-b626-076941834685 - - - - -] Ports set([u'b22c3363-296c-4b47-8c48-cb13e00b237a']) removed
2019-01-08 04:18:37.711 21863 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-ad525197-1f88-429e-b626-076941834685 - - - - -] Port removal failed for set([]) treat_devices_removed /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1602

 

Comment 11 Jing Qi 2019-01-08 04:45:24 UTC
Created attachment 1519120 [details]
neutron log

Comment 12 Jing Qi 2019-01-08 04:46:06 UTC
Created attachment 1519121 [details]
nova-compute log

Comment 13 Chen 2019-01-21 02:40:53 UTC
Hi Jing,

I'll change the component to neutron. Thank you for your help.

Best Regards,
Chen

Comment 18 Pei Zhang 2019-01-28 12:37:32 UTC
Hi Yalan, Chen,

vhost-user is designed in a server-client mode, so both the server and the client socket are needed for a VM to boot into the running state. After checking Chen's description, it seems there is no vhost-user client socket (ovs-dpdk hasn't created the vhost-user client yet), which is why the VM stays in 'paused' status. I think it's designed this way.
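The server-client behaviour Pei describes can be illustrated with a toy example (plain Unix sockets, not vhost-user itself; all names here are invented): the "server" end blocks in accept() until the other end connects, just as QEMU stays paused until ovs-dpdk connects to the vhu socket.

```python
# Toy illustration of the server-client handshake: the server end (QEMU's
# role) blocks until the client end (ovs-dpdk's role) connects. If the client
# never appears, the server waits forever -- the 'paused' VM in this bug.
import os
import socket
import tempfile
import threading

sock_path = os.path.join(tempfile.mkdtemp(), "vhu-demo.sock")

# "Server" end: in the real setup this is QEMU holding the vhu socket open.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)

def connect_client():
    # "Client" end: ovs-dpdk's role. Until it connects, accept() below blocks.
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(sock_path)
    client.sendall(b"ready")
    client.close()

t = threading.Thread(target=connect_client)
t.start()

conn, _ = server.accept()  # would block forever if no client ever appeared
data = b""
while len(data) < 5:
    chunk = conn.recv(5 - len(data))
    if not chunk:
        break
    data += chunk
t.join()
conn.close()
server.close()
print(data)  # b'ready'
```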

Also, you can check more info in Bug 1413962 Comment 1. 

Thanks.

Best regards,
Pei

Comment 20 Artom Lifshitz 2019-02-15 18:15:32 UTC
I'd like a few more details to make sure I understand what's going on. Would it be possible to reproduce this with a fresh VM and provide the following information:

1. The instance XML before anything is done (by running `virsh dumpxml <instance>`).
2. The exact way in which the vhu interface is detached from the VM. For example, if this is done through `virsh` on the compute host, the full `virsh` command line used.
3. The exact way in which the VM is rebooted. If this is through the API, the full REST request. If through `virsh`, the full `virsh` command line used.
4. The instance XML after the reboot.

Essentially, what I'm trying to determine is whether it's expected that the interface remains present in the instance's XML (this would happen if the interface was only detached from the transient domain with virsh), and whether the reboot re-generated the instance XML (a hard reboot would do that, a soft reboot would not).
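The live/persistent distinction Artom is asking about can be sketched with the flag values from libvirt's `virDomainDetachDeviceFlags` API. The constants are redefined locally so the snippet runs without libvirt installed, and the helper function is hypothetical:

```python
# Sketch of the transient-vs-persistent distinction. The flag values mirror
# libvirt's VIR_DOMAIN_AFFECT_LIVE / VIR_DOMAIN_AFFECT_CONFIG; they are
# redefined here so the snippet runs without the libvirt bindings installed.
VIR_DOMAIN_AFFECT_LIVE = 1    # affect the transient (running) domain only
VIR_DOMAIN_AFFECT_CONFIG = 2  # affect the persistent definition only

def detach_flags(persistent, live):
    """Flags a management layer would pass to detachDeviceFlags().

    Detaching with only AFFECT_LIVE leaves the interface in the persistent
    XML, so restarting the domain from that definition resurrects it -- the
    situation suspected in this bug.
    """
    flags = 0
    if live:
        flags |= VIR_DOMAIN_AFFECT_LIVE
    if persistent:
        flags |= VIR_DOMAIN_AFFECT_CONFIG
    return flags

print(detach_flags(persistent=True, live=True))  # 3
```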

Thanks!

Comment 21 Jing Qi 2019-02-18 05:12:12 UTC
$ nova interface-list testvm9
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| Port State | Port ID                              | Net ID                               | IP addresses | MAC Addr          |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| ACTIVE     | 279ac567-6386-4e00-8a7d-7911be59d56b | 996c938c-3ae3-40d7-bde7-570daa19086a | 172.50.0.3   | fa:16:3e:fe:b4:57 |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+

$ nova interface-detach testvm9 279ac567-6386-4e00-8a7d-7911be59d56b
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+---------+--------+--------------+----------+
+------------+---------+--------+--------------+----------+

$ nova interface-list testvm9

Then on the compute node, running the virsh command shows the interface is still present in both the active and inactive XML.
# virsh dumpxml instance-00000009 | grep interface -a8

  <interface type='vhostuser'>
      <mac address='fa:16:3e:fe:b4:57'/>
      <source type='unix' path='/var/run/openvswitch/vhu279ac567-63' mode='server'/>
      <target dev='vhu279ac567-63'/>
      <model type='virtio'/>
      <driver rx_queue_size='512' tx_queue_size='512'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
 #virsh edit instance-00000009
 
--- the above interface is also there.

Then run nova reboot testvm9
$ openstack server list
+--------------------------------------+---------+--------+----------+---------------+---------------------+
| ID                                   | Name    | Status | Networks | Image         | Flavor              |
+--------------------------------------+---------+--------+----------+---------------+---------------------+
| ef50bdc9-d6d9-466f-82b4-852006432db7 | testvm9 | REBOOT |          | test-qcow2    | m3                  |
| 54c030f6-0644-41d4-8896-bc9842737f34 | test1   | ERROR  |          | rhel7_testpmd | m1.medium_huge_4cpu |
+--------------------------------------+---------+--------+----------+---------------+---------------------+
And
 #virsh list --all
 Id    Name                           State
----------------------------------------------------
 11    instance-00000009              paused

Comment 22 Jing Qi 2019-02-18 05:13:26 UTC
Created attachment 1535818 [details]
domain xml for instance-00000009

Comment 23 Jing Qi 2019-02-18 05:18:19 UTC
The order pasted in comment 21 should be changed from

$ nova interface-detach testvm9 279ac567-6386-4e00-8a7d-7911be59d56b
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+---------+--------+--------------+----------+
+------------+---------+--------+--------------+----------+

$ nova interface-list testvm9

to 
  
$ nova interface-detach testvm9 279ac567-6386-4e00-8a7d-7911be59d56b

$ nova interface-list testvm9
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+---------+--------+--------------+----------+
+------------+---------+--------+--------------+----------+

Comment 24 Michal Privoznik 2019-02-19 08:43:10 UTC
So far this doesn't look like a libvirt bug to me. The fact that nova failed to detach an interface is not necessarily a libvirt bug. I'd need to see libvirt debug logs to confirm it is a libvirt bug. The other issue is why the domain is in paused state after it was started. Again, libvirt debug logs are needed.

Comment 25 Jing Qi 2019-02-19 09:41:04 UTC
With libvirt log set as below:
log_outputs="1:file:/var/log/libvirtd.log"

$ openstack server list
+--------------------------------------+---------+---------+-------------------+------------+---------------+
| ID                                   | Name    | Status  | Networks          | Image      | Flavor        |
+--------------------------------------+---------+---------+-------------------+------------+---------------+
| 73e9f89e-e9fd-48a2-a468-e303f529cc43 | test1   | ACTIVE  | dpdk=10.73.33.199 | test-qcow2 | m1.dpdk_2M_4U |
| ef50bdc9-d6d9-466f-82b4-852006432db7 | testvm9 | SHUTOFF |                   | test-qcow2 | m3            |
+--------------------------------------+---------+---------+-------------------+------------+---------------+
$ nova interface-list test1
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| Port State | Port ID                              | Net ID                               | IP addresses | MAC Addr          |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| ACTIVE     | 992a8944-003b-4173-9e6a-34359db0c7ec | cbc26a90-0022-460e-993b-41f6941fdac3 | 10.73.33.199 | fa:16:3e:30:e8:f4 |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
$ nova interface-detach test1 992a8944-003b-4173-9e6a-34359db0c7ec
$ nova interface-list test1
+------------+---------+--------+--------------+----------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+---------+--------+--------------+----------+
+------------+---------+--------+--------------+----------+
$ nova reboot test1
Request to reboot server <Server: test1> has been accepted.


The whole /var/log/libvirtd.log  is as below - 

2019-02-19 09:32:28.826+0000: 1021156: info : libvirt version: 4.5.0, package: 10.el7_6.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-11-08-09:31:37, x86-020.build.eng.bos.redhat.com)
2019-02-19 09:32:28.826+0000: 1021156: info : hostname: overcloud-computeovsdpdk-0
2019-02-19 09:32:28.826+0000: 1021156: error : virCommandWait:2600 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu992a8944-00 statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhu992a8944-00" column statistics

2019-02-19 09:32:34.245+0000: 1021157: error : virCommandWait:2600 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu992a8944-00 statistics:tx_errors) unexpected exit status 1: ovs-vsctl: no key "tx_errors" in Interface record "vhu992a8944-00" column statistics
2019-02-19 09:37:24.652+0000: 1021159: error : virCommandWait:2600 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu992a8944-00 name) unexpected exit status 1: ovs-vsctl: no row "vhu992a8944-00" in table Interface

2019-02-19 09:37:24.652+0000: 1021159: error : virNetDevOpenvswitchInterfaceStats:356 : internal error: Interface not found
2019-02-19 09:37:25.258+0000: 1021155: error : virCommandWait:2600 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu992a8944-00 name) unexpected exit status 1: ovs-vsctl: no row "vhu992a8944-00" in table Interface

2019-02-19 09:37:25.258+0000: 1021155: error : virNetDevOpenvswitchInterfaceStats:356 : internal error: Interface not found
2019-02-19 09:37:25.859+0000: 1021157: error : virCommandWait:2600 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu992a8944-00 name) unexpected exit status 1: ovs-vsctl: no row "vhu992a8944-00" in table Interface

2019-02-19 09:37:25.859+0000: 1021157: error : virNetDevOpenvswitchInterfaceStats:356 : internal error: Interface not found

Comment 26 Michal Privoznik 2019-02-19 10:18:18 UTC
(In reply to Jing Qi from comment #25)

This doesn't look like the whole debug output. Did you restart libvirtd after making the config file change?
Also, since the interface is of type vhostuser and plugged into an OVS bridge, why does OVS claim it doesn't know the interface?

Comment 27 Jing Qi 2019-02-21 02:58:51 UTC
Hi Michal,

The environment was set up with Red Hat OpenStack (overcloud / undercloud), and the nova_libvirt container was restarted.
The libvirtd log is the whole log since the container was restarted.
I ran a similar command -
#ovs-vsctl --timeout=5 get Interface vhubf22bcc6-94 statistics:tx_errors
ovs-vsctl: no key "tx_errors" in Interface record "vhubf22bcc6-94" column statistics

# ovs-vsctl --timeout=5 get Interface vhubf22bcc6-94 statistics
{"rx_1024_to_1522_packets"=0, "rx_128_to_255_packets"=0, "rx_1523_to_max_packets"=0, "rx_1_to_64_packets"=0, "rx_256_to_511_packets"=0, "rx_512_to_1023_packets"=0, "rx_65_to_127_packets"=0, rx_bytes=0, rx_dropped=0, rx_errors=0, rx_packets=0, tx_bytes=0, tx_dropped=281, tx_packets=0}

We can see the interface vhubf22bcc6-94 is there, and it reports an error because there is no "tx_errors" property in the beginning.

After the "nova interface-detach" command was issued, vhu992a8944-00 was removed from the br-int bridge.

Then I restarted the server. It reported the following, since the domain XML still has the interface -

2019-02-19 09:37:24.652+0000: 1021159: error : virNetDevOpenvswitchInterfaceStats:356 : internal error: Interface not found
2019-02-19 09:37:25.258+0000: 1021155: error : virCommandWait:2600 : internal error: Child process (ovs-vsctl --timeout=5 get Interface vhu992a8944-00 name) unexpected exit status 1: ovs-vsctl: no row "vhu992a8944-00" in table Interface

That's the issue we hit: "nova interface-detach" didn't remove the interface from the domain XML.
Then the VM can't be started and hangs in the "paused" state.

Comment 28 Jing Qi 2019-02-21 03:30:57 UTC
The server restart is by issuing the command "nova reboot **".

Comment 29 Michal Privoznik 2019-02-21 07:56:27 UTC
(In reply to Jing Qi from comment #27)
> Hi Michal,
> 
> The environment was setup with redhat openstack (overcloud / undercloud),
> and the nova_libvirt container was restarted.

I suspect that doesn't restart libvirtd itself, hence the missing debug logs. It is crucial to restart libvirt after changing any config file, because libvirt reads config files only at startup.
An alternative way might be to use virt-admin to enable debug logs:

# virt-admin daemon-log-outputs "1:file:/tmp/libvirtd.log"
# virt-admin daemon-log-filters --filters "1:qemu 3:remote 4:event 3:json 3:rpc"

You can confirm that debug logs are enabled by checking for messages with "debug :" severity in the log file. The file will also grow substantially, to far more than a few lines of messages.


Well, this sounds like bug 1461270, but that has been fixed for a long time now. What's your libvirt version? You'll need at least libvirt-3.7.0-1.el7.

And as for nova failing to detach an interface I really need those libvirt debug logs.

Comment 30 Artom Lifshitz 2019-02-22 00:57:32 UTC
Could you confirm that I understand this correctly:

* Everything is done through the Nova APIs (interface detach, reboot)
* The interface detach request succeeds
  (However, this is done asynchronously, so the actual detach can fail even if the request succeeds).
* The interface disappears from the list of interfaces in the nova API.
  (This should not happen if the interface detach operation fails)

I know Michal is trying to debug this from a libvirt point of view - from Nova's perspective I'd like full debug-level logs of all nova services (API, compute, etc) that include the bug being reproduced. The easiest way to do this would be to attach sosreports to this bugzilla.

Thanks!

Comment 31 Jing Qi 2019-02-22 01:51:01 UTC
I am using the libvirt version - libvirt-4.5.0-10.el7_6.3.x86_64.

Comment 33 Artom Lifshitz 2019-02-22 15:28:04 UTC
I suspect this is the same bug as upstream bug 1807340 [1]. It has been fixed and already backported to Queens upstream [2]. How long will the 10.73.8.19 environment be available for testing? I'd like to try that upstream patch in the environment to see if it fixes the bug.

[1] https://bugs.launchpad.net/nova/+bug/1807340
[2] https://review.openstack.org/#/c/634236/2

Comment 34 Chen 2019-02-23 02:16:37 UTC
HI Artom,

Thank you for your investigation. The upstream bug does look suspicious.

I am assuming until this bug is verified/solved, the 10.73.8.19 environment should remain available. @Jing, could you please clarify ?

Best Regards,
Chen

Comment 35 Jing Qi 2019-02-23 19:22:39 UTC
Artom & Chen,
 
How long do you need to try the patch and verify it in the environment?
Please go ahead and apply the patch in the environment if you only need it for the next several days.
I'll try to keep it.

Thanks,
Jing

Comment 36 Artom Lifshitz 2019-02-23 20:10:30 UTC
(In reply to Jing Qi from comment #35)
> Artom & Chen,
>  
> How long do you need to try the patch and verify it in the environment? 

Applying the patch isn't really long or hard, it's just that I'm juggling several priorities at the moment and I want to know whether I have a de facto deadline for this BZ for the environment to be available.


Comment 37 Artom Lifshitz 2019-03-15 14:25:08 UTC
Playing around with this, I first booted an instance, then dumped its XML:

$ nova boot --flavor m3.dpdk_1M_2U --image 4015c076-b57f-4be4-a016-af0835581342 --nic net-id=8c889e63-6b01-477f-9304-e5093e4b8a34 artom-test-1

$ sudo virsh dumpxml 5
    <interface type='vhostuser'>
      <mac address='fa:16:3e:28:b6:b4'/>
      <source type='unix' path='/var/run/openvswitch/vhu0659e01c-f5' mode='server'/>
      <target dev='vhu0659e01c-f5'/>
      <model type='virtio'/>
      <driver rx_queue_size='512' tx_queue_size='512'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

I then detach the interface and it disappears from interface-list:

$ nova interface-list fb5303dc-2768-4b94-bdb7-1d68525c9f62
| ACTIVE     | 0659e01c-f53b-4ff9-b474-c98aea6c27b9 | 8c889e63-6b01-477f-9304-e5093e4b8a34 | 10.1.111.5   | fa:16:3e:28:b6:b4 |

$ nova interface-detach fb5303dc-2768-4b94-bdb7-1d68525c9f62 0659e01c-f53b-4ff9-b474-c98aea6c27b9

$ nova interface-list fb5303dc-2768-4b94-bdb7-1d68525c9f62
<empty>

However the interface remains in the XML:

$ sudo virsh dumpxml 5
    <interface type='vhostuser'>
      <mac address='fa:16:3e:28:b6:b4'/>
      <source type='unix' path='/var/run/openvswitch/vhu0659e01c-f5' mode='server'/>
      <target dev='vhu0659e01c-f5'/>
      <model type='virtio'/>
      <driver rx_queue_size='512' tx_queue_size='512'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

After applying [1] and trying the same thing:

$ nova boot --flavor m3.dpdk_1M_2U --image 4015c076-b57f-4be4-a016-af0835581342 --nic net-id=8c889e63-6b01-477f-9304-e5093e4b8a34 artom-test-2

$ sudo virsh dumpxml instance-00000034
    <interface type='vhostuser'>
      <mac address='fa:16:3e:18:b3:df'/>
      <source type='unix' path='/var/run/openvswitch/vhuc67ad230-7d' mode='server'/>
      <target dev='vhuc67ad230-7d'/>
      <model type='virtio'/>
      <driver rx_queue_size='512' tx_queue_size='512'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

$ nova interface-list artom-test-2
| ACTIVE     | c67ad230-7de0-4919-bb24-fb33d4d959ec | 8c889e63-6b01-477f-9304-e5093e4b8a34 | 10.1.111.7   | fa:16:3e:18:b3:df |

$ nova interface-detach 781a4375-b300-4707-9ff9-a731ce0ae916 c67ad230-7de0-4919-bb24-fb33d4d959ec

$ nova interface-list artom-test-2
<empty>

$ sudo virsh dumpxml instance-00000034 | grep -A 10 interface
<empty>

This confirms that backporting [1] into OSP13 should fix this. We would eventually have picked it up in a rebase, since it's already in upstream Queens, but if we want the fix sooner a manual cherry-pick can be done.

[1] https://review.openstack.org/#/c/634236/

Comment 46 errata-xmlrpc 2019-04-30 17:13:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0924

