Bug 1845488

Summary: [ovs-dpdk] [vhost-user] [TSO] too many destroy_connection calls on restarting ovs-vswitchd without tso
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Gowrishankar Muthukrishnan <gmuthukr>
Component: openvswitch2.13
Assignee: Maxime Coquelin <maxime.coquelin>
Status: CLOSED NOTABUG
QA Contact: qding
Severity: low
Priority: low
Version: FDP 20.D
CC: apevec, atragler, cfontain, chrisw, ctrautma, ekuris, fleitner, hakhande, jhsiao, ktraynor, maxime.coquelin, qding, ralongi, rhos-maint, vchundur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Clone Of: 1832708
Last Closed: 2021-04-08 08:08:36 UTC
Bug Depends On: 1832708

Description Gowrishankar Muthukrishnan 2020-06-09 11:26:27 UTC
+++ This bug was initially created as a clone of Bug #1832708 +++

Description of problem:
Disabling userspace_tso in the OVS-DPDK datapath (after it had previously been enabled) and then restarting the openvswitch service leads to repeated attempts to destroy the vhost-user connection (to qemu-kvm).

The following messages repeat in ovs-vswitchd.log:
2020-05-07T06:16:01.648Z|02465|netdev_dpdk|INFO|vHost Device '/var/lib/vhost_sockets/vhuc10acbd9-89' connection has been destroyed
2020-05-07T06:16:01.648Z|02466|dpdk|INFO|Dropped 4755 log messages in last 0 seconds (most recently, 0 seconds ago) due to excessive rate

Version-Release number of selected component (if applicable):
openvswitch2.13-2.13.0-25.el8fdp.1.x86_64
(from RH OSP 16.1 on RHEL 8.2)

How reproducible:
Every time, following the steps below.

Steps to Reproduce:
1. If TSO is not already enabled, enable it in OVS-DPDK.
   sudo ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
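   Optionally, read the value back to confirm it was applied:
   sudo ovs-vsctl get Open_vSwitch . other_config:userspace-tso-enable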

2. Launch a VM with vhost-user socket in server mode.
   E.g. <source type='unix' path='/var/lib/vhost_sockets/vhue8ff400b-30' mode='server'/> in the libvirt XML.

   If running in OSP, create an instance instead.

3. Ensure the guest kernel driver is able to detect TSO.
   sudo ethtool -k eth2 | egrep '(scatter|tcp|gso|csum|check|segment)'
   Check "on" for tx-checksumming, scatter-gather, tcp-segmentation-offload

4. Disable TSO in ovs-dpdk.
   sudo ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=false

5. Restart openvswitch.
   sudo systemctl restart openvswitch.service

6. Check the ovs-vswitchd log for the repeating messages quoted in the problem statement.
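   For example (log path assumed for a default RHEL install):
   sudo grep -c 'connection has been destroyed' /var/log/openvswitch/ovs-vswitchd.log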

Actual results:
Repeating calls to destroy_connection.

Expected results:
No excessive attempts to destroy the vhost-user connection.

Additional info:

--- Additional comment from Gowrishankar Muthukrishnan on 2020-05-07 07:22:17 UTC ---



--- Additional comment from Gowrishankar Muthukrishnan on 2020-05-07 07:24:01 UTC ---

Socket files used for VM vNICs:
[heat-admin@overcloud-computeovsdpdksriov-0 ~]$ sudo virsh dumpxml instance-00000053|grep vhu
      <source type='unix' path='/var/lib/vhost_sockets/vhue8ff400b-30' mode='server'/>
      <source type='unix' path='/var/lib/vhost_sockets/vhu85273654-c7' mode='server'/>
      <source type='unix' path='/var/lib/vhost_sockets/vhuc10acbd9-89' mode='server'/>

Comment 1 Maxime Coquelin 2020-06-15 10:13:34 UTC
Hi,

I think I understand what is happening here.
In OVS' lib/netdev-dpdk.c, we have:
static int
netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
{
...
        if (userspace_tso_enabled()) {
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_UDP_CKSUM;
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_SCTP_CKSUM;
            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
            vhost_unsup_flags = 1ULL << VIRTIO_NET_F_HOST_ECN
                                | 1ULL << VIRTIO_NET_F_HOST_UFO;
        } else {
            /* This disables checksum offloading and all the features
             * that depends on it (TSO, UFO, ECN) according to virtio
             * specification. */
            vhost_unsup_flags = 1ULL << VIRTIO_NET_F_CSUM;
        }

        err = rte_vhost_driver_disable_features(dev->vhost_id,
                                                vhost_unsup_flags);

It means that depending on whether the user requests TSO to be enabled
or not, the Virtio features advertised by the backend for the negotiation
will be different.

The problem here seems to be that the guest is not restarted between the OVS
stop and start, so the guest Virtio feature negotiation has already been done.
With TSO enabled, VIRTIO_NET_F_CSUM was advertised by the backend and, if
the guest driver also supported it, it was negotiated.
When TSO is disabled on the OVS side, this feature is no longer advertised by
OVS-DPDK, which causes Qemu to rightfully fail the reconnection to
avoid undefined behaviour. Indeed, as the backend reconnection is
transparent to the guest driver, the driver keeps behaving as if
VIRTIO_NET_F_CSUM had been negotiated.

To confirm that, I think you should see the below error message in the
guest's Qemu logs (see the log-location note at the end of this comment):

#ifdef CONFIG_VHOST_NET_USER
    if (net->nc->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
        features = vhost_user_get_acked_features(net->nc);
        if (~net->dev.features & features) {
            fprintf(stderr, "vhost lacks feature mask %" PRIu64
                    " for backend\n",
                    (uint64_t)(~net->dev.features & features));
            goto fail;
        }
    }
#endif

Regards,
Maxime
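
For reference: on a libvirt/OSP compute node the guest's Qemu log is usually /var/log/libvirt/qemu/<instance-name>.log (the libvirt default location, assumed here), e.g.:

    sudo grep 'vhost lacks feature' /var/log/libvirt/qemu/instance-00000053.log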

Comment 2 Maxime Coquelin 2020-06-18 05:46:15 UTC
Hi,

Did you have any chance to try to reproduce?
What are the Qemu logs saying?

Comment 4 Gowrishankar Muthukrishnan 2020-06-22 18:21:12 UTC
(In reply to Maxime Coquelin from comment #1)
...
> To confirm that, I think you should have below error message in the
> guest's Qemu logs:
> 
> #ifdef CONFIG_VHOST_NET_USER
>     if (net->nc->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
>         features = vhost_user_get_acked_features(net->nc);
>         if (~net->dev.features & features) {
>             fprintf(stderr, "vhost lacks feature mask %" PRIu64
>                     " for backend\n",
>                     (uint64_t)(~net->dev.features & features));

I could see this error (repeating many times) in qemu/instance.log:

2020-06-22T17:58:23.613353Z qemu-kvm: failed to init vhost_net for queue 0
vhost lacks feature mask 22529 for backend

So, all the features that had been negotiated before the OVS restart are now missing from the backend, as shown by the mask 22529 (0x5801); see the decoding after the quoted text below.

>             goto fail;
>         }
>     }
> #endif
> 
> Regards,
> Maxime
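
For reference, the missing feature mask 22529 (0x5801) decodes, per the standard virtio-net feature bit numbering (bit values recalled here, not taken from the report), to:

    bit 0  (0x0001)  VIRTIO_NET_F_CSUM
    bit 11 (0x0800)  VIRTIO_NET_F_HOST_TSO4
    bit 12 (0x1000)  VIRTIO_NET_F_HOST_TSO6
    bit 14 (0x4000)  VIRTIO_NET_F_HOST_UFO

i.e. the host checksum/segmentation-offload features the guest had acked but that the backend no longer advertises after TSO was disabled.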

Comment 5 Maxime Coquelin 2021-04-08 08:08:36 UTC
It is not possible to enable or disable a Virtio feature once the negotiation with the guest driver has taken place.
If this has to be done, the only solution is to either hot-remove/hot-plug the Virtio device, or reboot the guest.
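
A rough sketch of the hot-remove/hot-plug option for a libvirt-managed guest (the file name vhu-iface.xml is an assumption; the XML must be the affected <interface type='vhostuser'> element taken from virsh dumpxml of the instance):

    # copy the affected <interface type='vhostuser'> element from
    # `sudo virsh dumpxml instance-00000053` into vhu-iface.xml, then:
    sudo virsh detach-device instance-00000053 vhu-iface.xml --live
    sudo virsh attach-device instance-00000053 vhu-iface.xml --live

After re-attach (or a guest reboot), the Virtio feature negotiation is redone against the features currently advertised by OVS-DPDK.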