Bug 1782528

Summary: qemu-kvm: event flood when vhost-user backed virtio netdev is unexpectedly closed while guest is transmitting
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Adrián Moreno <amorenoz>
Component: qemu-kvm
Assignee: Adrián Moreno <amorenoz>
qemu-kvm sub component: Networking
QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: unspecified
Priority: medium
CC: ailan, chayang, jasowang, jinzhao, juzhang, pezhang, virt-maint
Version: 8.0
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1790360
Environment:
Last Closed: 2020-02-04 09:42:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1790360

Description Adrián Moreno 2019-12-11 19:10:37 UTC
Description of problem:

If a guest is transmitting data through a vhost-user-backed interface when the backend is suddenly closed, qemu enters an infinite loop, potentially freezing the guest.

Version-Release number of selected component (if applicable):
I've reproduced this issue with AV 8.0 (qemu-kvm-3.1.0-20.module+el8+2888+cdc893a8), but I'd guess the problem can be reproduced on other versions as well.

The function where it gets stuck is virtqueue_drop_all().
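For reference, the loop it spins in looks roughly like the following. This is a simplified sketch based on the upstream QEMU sources (it assumes QEMU's internal VirtQueue/VirtQueueElement types), not the exact code of this build:

/* Simplified sketch of the virtqueue_drop_all() loop (modelled on the
 * upstream QEMU sources, not copied verbatim from this build).
 * Each queued tx element is popped and immediately completed with
 * length 0, so a guest that keeps refilling the avail ring and kicking
 * the queue keeps pulling qemu back into this path. */
static unsigned int virtqueue_drop_all_sketch(VirtQueue *vq)
{
    unsigned int dropped = 0;
    VirtQueueElement elem = {};

    while (!virtio_queue_empty(vq) && vq->inuse < vq->vring.num) {
        if (!virtqueue_get_head(vq, vq->last_avail_idx, &elem.index)) {
            break;
        }
        vq->inuse++;
        vq->last_avail_idx++;
        /* Nothing was mapped, so pushing with len 0 just marks the
         * element as used. */
        virtqueue_push(vq, &elem, 0);
        dropped++;
    }
    return dropped;
}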

How reproducible:
Always

Steps to Reproduce:
1. Run a guest VM with a vhost-user netdev in server mode:
An example libvirt xml section:

  <interface type='vhostuser'>
      <mac address='52:54:00:e6:da:91'/>
      <source type='unix' path='/tmp/vhost-user1' mode='server'/>
      <model type='virtio'/>
      <driver name='vhost' rx_queue_size='1024'/>
      <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
  </interface> 

2. Run testpmd on the host and set fwd to rxonly:

$ testpmd -l 0,20,21,22,23 --socket-mem=1024 -n 4  --vdev 'net_vhost0,iface=/tmp/vhost-user1,client=1'  --vdev 'net_vhost1,iface=/tmp/vhost-user2,client=1' --no-pci -- --rxq=1 --txq=1 --portmask=f -a --forward-mode=rxonly --nb-cores=4 


3. In the guest, run testpmd and set fwd to txonly:
testpmd -l 0,1 \
      --socket-mem 1024 \
      -n 2 \
      -- \
      --portmask=3 \
      -i 
testpmd> set fwd txonly
testpmd> start

4. Check that packets are flowing 
testpmd> show port stats all

5. Kill the host's testpmd

Actual results:
Qemu's IO thread freezes

Expected results:
Qemu's IO thread should not freeze

Additional info:
An example backtrace of the infinite loop is:

#0  0x000055d888ca1e69 in virtqueue_get_head ()                                                                                                                                                                                               
#1  0x000055d888ca1f82 in virtqueue_drop_all ()                                                                                                                                                                                               
#2  0x000055d888c89884 in virtio_net_drop_tx_queue_data ()                                                                                                                                                                                    
#3  0x000055d888e13669 in virtio_bus_cleanup_host_notifier ()                                                                                                                                                                                 
#4  0x000055d888ca64d6 in vhost_dev_disable_notifiers ()                                                                                                                                                                                      
#5  0x000055d888c8d5d2 in vhost_net_stop_one ()                                                                                                                                                                                              
#6  0x000055d888c8dad2 in vhost_net_stop ()                                                                                                                                                                                                   
#7  0x000055d888c8a5b4 in virtio_net_set_status ()                                                                                                                                                                                            
#8  0x000055d888e33dc6 in qmp_set_link ()                                                                                                                                                                                                     
#9  0x000055d888e394aa in chr_closed_bh ()                                                                                                                                                                                    
#10 0x000055d888f3c066 in aio_bh_poll ()                                      
#11 0x000055d888f3f394 in aio_dispatch ()                                                                       
#12 0x000055d888f3bf42 in aio_ctx_dispatch ()               
#13 0x00007f84b7c6b67d in g_main_dispatch (context=0x55d88abd1c70) at gmain.c:3176                              
#14 g_main_context_dispatch (context=0x55d88abd1c70) at gmain.c:3829
#15 0x000055d888f3e618 in main_loop_wait ()                                                                                                                                                                                                   
#16 0x000055d888d314f9 in main_loop ()                                       
#17 0x000055d888bf19c4 in main ()

Comment 1 Adrián Moreno 2019-12-11 19:12:45 UTC
This issue was detected while testing BZ 1738768

Comment 2 Adrián Moreno 2019-12-12 16:51:13 UTC
top perf yields:
Samples: 391K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 117702907563 lost: 0/0 drop: 0/0
  Children      Self  Shared Object
-   93.23%    65.50%  qemu-system-x86_64
   - 7.05% 0x2af0970
        g_main_context_dispatch
        aio_ctx_dispatch
        aio_dispatch
        aio_dispatch_handlers
        virtio_queue_host_notifier_read
        virtio_queue_notify_vq
        virtio_net_handle_tx_bh
      - virtio_net_drop_tx_queue_data
         - 13.68% virtqueue_drop_all
            - 25.91% virtqueue_push
               - 9.96% virtqueue_fill
                  + 5.31% vring_used_write
                     2.03% virtqueue_unmap_sg
                   + 1.35% trace_virtqueue_fill
                - 8.85% virtqueue_flush
                   - 7.37% vring_used_idx_set
                      - 3.09% address_space_cache_invalidate
                         + 2.54% invalidate_and_set_dirty
                      - 1.59% virtio_stw_phys_cached
                         + 2.03% stw_le_phys_cached
                        0.95% vring_get_region_caches
                  8.58% rcu_read_unlock
                  4.43% rcu_read_lock
             - 4.61% virtqueue_get_head
                - 1.85% vring_avail_ring
                   - 1.31% virtio_lduw_phys_cached
                      - 1.39% lduw_le_phys_cached
                         - 1.23% address_space_lduw_le_cached
                            - 0.95% lduw_le_p
               2.01% virtio_queue_empty
    + 4.20% qemu_thread_start


So I guess:
- it's taking a long time to drop all the packets (maybe because the guest is still writing to the queue)
- a lot of guest notifications are being received
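The stack above corresponds roughly to this dispatch path (again a simplified sketch of the upstream QEMU code, assuming its internal EventNotifier/VirtQueue types, not the literal source): with vhost stopped, the tx queue's handler ends up in the drop path, so every guest kick that makes the ioeventfd readable re-enters virtqueue_drop_all().

/* Sketch of the dispatch path visible in the perf stack above
 * (simplified; not the literal QEMU source of this build). */
static void host_notifier_read_sketch(EventNotifier *n)
{
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    if (event_notifier_test_and_clear(n)) {
        /* -> virtio_net_handle_tx_bh -> virtio_net_drop_tx_queue_data,
         * as shown in the stack above */
        virtio_queue_notify_vq(vq);
    }
}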

+Jason

Comment 4 Adrián Moreno 2019-12-17 10:13:49 UTC
Updating with some findings.
It's not an infinite loop (although it looks like one); it's more of an event flood. The issue that IMHO could be fixed is that notifications are left enabled when the backend closes; I'll look into that more deeply.
On the other hand, the question remains: why does testpmd keep sending notifications (i.e. trying to transmit) after the link has gone down?
So I guess the scope of this issue is also limited to cases with a misbehaving guest.
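To make the idea above concrete, here is a purely hypothetical sketch (not an actual patch; drop_tx_queue_quietly is an invented name) that disables guest notifications for the tx queue before dropping its data, using QEMU's existing virtio_queue_set_notification()/virtqueue_drop_all()/virtio_notify() helpers:

/* Hypothetical illustration only: ask the guest not to kick the tx
 * queue while its pending data is being thrown away, so further
 * notifications do not keep re-triggering the drop path. */
static void drop_tx_queue_quietly(VirtIODevice *vdev, VirtQueue *vq)
{
    virtio_queue_set_notification(vq, 0);   /* suppress guest kicks */
    virtqueue_drop_all(vq);                 /* complete everything with len 0 */
    virtio_notify(vdev, vq);                /* let the guest reclaim its buffers */
}

Note that the notification flag is only advisory, so a misbehaving guest could still keep kicking; this only illustrates the direction, not a complete fix.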

Comment 5 Pei Zhang 2020-01-07 03:30:07 UTC
Hi Adrián,

Both 8.1.1 AV and 8.2.0 AV hit this issue. Do you plan to fix this in 8.1.1? If yes, I'll clone a new one to RHEL 8.2. Thanks.

Best regards,

Pei

Comment 6 Adrián Moreno 2020-01-13 07:50:41 UTC
Hi Pei. Yes, we'll need a bz on 8.1.1 as well.
Thanks

Comment 7 Pei Zhang 2020-01-13 08:19:56 UTC
(In reply to Adrián Moreno from comment #6)
> Hi Pei. Yes, we'll need a bz on 8.1.1 as well.
> Thanks

Thanks Adrián.

I've cloned this BZ to RHEL8.2-AV.

Bug 1790360 - qemu-kvm: event flood when vhost-user backed virtio netdev is unexpectedly closed while guest is transmitting