Bug 1618488 - vhost-user backend crash on SET_MEM_TABLE request handling while port enabled
Summary: vhost-user backend crash on SET_MEM_TABLE request handling while port enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Maxime Coquelin
QA Contact: Jean-Tsung Hsiao
URL:
Whiteboard:
Duplicates: 1633878
Depends On:
Blocks: 1618788 1618791
 
Reported: 2018-08-16 18:49 UTC by Maxime Coquelin
Modified: 2021-12-10 17:02 UTC
CC List: 14 users

Fixed In Version: openvswitch-2.9.0-56.el7fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-05 14:59:03 UTC
Target Upstream Version:
Embargoed:



Description Maxime Coquelin 2018-08-16 18:49:40 UTC
Description of problem:

The vhost-user backend may crash when it receives a VHOST_USER_SET_MEM_TABLE
request while it is processing the virtqueues.

Such a request can be issued by QEMU when binding a VF to testpmd within the
guest, as shown by the QEMU backtrace below:
#0  vhost_user_set_mem_table (dev=0x55ba6cea5c00, mem=0x55ba6d391900)
    at /usr/src/debug/qemu-2.10.0/hw/virtio/vhost-user.c:294
#1  0x000055ba69962bfc in vhost_commit (listener=0x55ba6cea5c08) at /usr/src/debug/qemu-2.10.0/hw/virtio/vhost.c:653
#2  0x000055ba6991c5ae in memory_region_transaction_commit () at /usr/src/debug/qemu-2.10.0/memory.c:1063
#3  0x000055ba69a92c86 in pci_update_mappings (d=0x55ba6d1aec00) at hw/pci/pci.c:1303
#4  0x000055ba69a93219 in pci_default_write_config (d=d@entry=0x55ba6d1aec00, addr=4, val_in=val_in@entry=1024, 
    l=l@entry=2) at hw/pci/pci.c:1363
#5  0x000055ba69956aec in vfio_pci_write_config (pdev=0x55ba6d1aec00, addr=<optimized out>, val=1024, len=2)
    at /usr/src/debug/qemu-2.10.0/hw/vfio/pci.c:1218
#6  0x000055ba6991aec3 in memory_region_write_accessor (mr=<optimized out>, addr=<optimized out>, 
    value=<optimized out>, size=<optimized out>, shift=<optimized out>, mask=<optimized out>, attrs=...)
    at /usr/src/debug/qemu-2.10.0/memory.c:530
#7  0x000055ba69918bd9 in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7f9ecbd39708, 
    size=size@entry=2, access_size_min=<optimized out>, access_size_max=<optimized out>, 
    access=access@entry=0x55ba6991ae80 <memory_region_write_accessor>, mr=mr@entry=0x55ba6bfaa400, 
    attrs=attrs@entry=...) at /usr/src/debug/qemu-2.10.0/memory.c:596
#8  0x000055ba6991cb25 in memory_region_dispatch_write (mr=<optimized out>, addr=0, data=1024, size=<optimized out>, 
    attrs=...) at /usr/src/debug/qemu-2.10.0/memory.c:1472
#9  0x000055ba698d4312 in flatview_write (fv=0x55ba6c664c80, addr=<optimized out>, attrs=..., buf=<optimized out>, 
    len=<optimized out>) at /usr/src/debug/qemu-2.10.0/exec.c:2908
#10 0x000055ba698d790f in address_space_write (as=<optimized out>, addr=<optimized out>, attrs=..., 
    buf=<optimized out>, len=<optimized out>) at /usr/src/debug/qemu-2.10.0/exec.c:3074
#11 0x000055ba698d79a5 in address_space_rw (as=<optimized out>, addr=addr@entry=3324, attrs=..., attrs@entry=..., 
    buf=buf@entry=0x7f9ef0954000 "", len=len@entry=2, is_write=is_write@entry=true)
    at /usr/src/debug/qemu-2.10.0/exec.c:3085
#12 0x000055ba6992aaaa in kvm_handle_io (count=1, size=2, direction=<optimized out>, data=<optimized out>, 
    attrs=..., port=3324) at /usr/src/debug/qemu-2.10.0/accel/kvm/kvm-all.c:1807
#13 kvm_cpu_exec (cpu=cpu@entry=0x55ba6c014000) at /usr/src/debug/qemu-2.10.0/accel/kvm/kvm-all.c:2047
#14 0x000055ba69909b52 in qemu_kvm_cpu_thread_fn (arg=0x55ba6c014000) at /usr/src/debug/qemu-2.10.0/cpus.c:1138
#15 0x00007f9ed7fdddd5 in start_thread (arg=0x7f9ecbd3c700) at pthread_create.c:308
#16 0x00007f9ed7d07b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

When this happens while OVS is processing the Virtio rings, the crash below is likely to occur:

[root@overcloud-compute-0 openvswitch]# gdb --args ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid
…/...
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffea8ff9700 (LWP 317423)]
rte_vhost_dequeue_burst (vid=<optimized out>, queue_id=<optimized out>, mbuf_pool=0x7ff27fe23d40, pkts=pkts@entry=0x7ffea8ff8770, count=count@entry=32)
    at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/lib/librte_vhost/virtio_net.c:1567
1567		free_entries = *((volatile uint16_t *)&vq->avail->idx) -
(gdb) bt
#0  rte_vhost_dequeue_burst (vid=<optimized out>, queue_id=<optimized out>, mbuf_pool=0x7ff27fe23d40, pkts=pkts@entry=0x7ffea8ff8770, 
    count=count@entry=32) at /usr/src/debug/openvswitch-2.9.0/dpdk-17.11/lib/librte_vhost/virtio_net.c:1567
#1  0x00005555558542e4 in netdev_dpdk_vhost_rxq_recv (rxq=<optimized out>, batch=0x7ffea8ff8760) at lib/netdev-dpdk.c:1849
#2  0x00005555557a0671 in netdev_rxq_recv (rx=<optimized out>, batch=batch@entry=0x7ffea8ff8760) at lib/netdev.c:701
#3  0x0000555555779c1f in dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fff88358010, rxq=0x5555560a8810, port_no=7) at lib/dpif-netdev.c:3279
#4  0x000055555577a02a in pmd_thread_main (f_=<optimized out>) at lib/dpif-netdev.c:4145
#5  0x00005555557f6cb6 in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:348
#6  0x00007ffff70eadd5 in start_thread (arg=0x7ffea8ff9700) at pthread_create.c:308
#7  0x00007ffff64e8b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) l 
1562			} else {
1563				count -= 1;
1564			}
1565		}
1566	
1567		free_entries = *((volatile uint16_t *)&vq->avail->idx) -
1568				vq->last_avail_idx;
1569		if (free_entries == 0)
1570			goto out;
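
For context, here is a minimal sketch of why the cached ring pointer goes stale
when the memory table changes (hypothetical, simplified code; names such as
struct mem_region and guest_pa_to_host_va() are illustrative, not the actual
librte_vhost implementation):

#include <stddef.h>
#include <stdint.h>

/* One guest memory region, as described by a VHOST_USER_SET_MEM_TABLE
 * request (illustrative layout, not the real vhost-user structures). */
struct mem_region {
    uint64_t guest_phys_addr;
    uint64_t size;
    void    *mmap_addr;   /* host VA where this region is mmap()ed */
};

/* Translate a guest physical address into a host virtual address by
 * walking the current memory table. */
static void *
guest_pa_to_host_va(const struct mem_region *regions, int nregions, uint64_t gpa)
{
    for (int i = 0; i < nregions; i++) {
        const struct mem_region *r = &regions[i];

        if (gpa >= r->guest_phys_addr && gpa - r->guest_phys_addr < r->size)
            return (char *)r->mmap_addr + (gpa - r->guest_phys_addr);
    }
    return NULL;
}

/* The backend translates the ring addresses once and caches the result,
 * e.g. vq->avail = guest_pa_to_host_va(regions, n, avail_gpa).
 *
 * When a new SET_MEM_TABLE arrives, the old regions are munmap()ed and the
 * new ones mmap()ed, usually at different host virtual addresses.  A cached
 * pointer such as vq->avail then refers to unmapped memory, so the dequeue
 * path's read of vq->avail->idx (line 1567 above) faults. */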

Version-Release number of selected component (if applicable):


How reproducible:
Create a guest with a Virtio-net device relying on the vhost-user backend,
and with an assigned VF. Connect the vhost-user port to OVS.
In the guest, bind the VF to testpmd; the virtio-net device remains bound to
the kernel virtio-net driver.

Steps to Reproduce:
1.
2.
3.

Actual results:
OVS crashes with the backtrace above.

Expected results:
No crash

Additional info:

The patch fixing the issue is already in upstream master, but was not
backported because, at the time, it was only known to fix a new feature.

Comment 1 Maxime Coquelin 2018-08-16 18:52:01 UTC
Patch backported to v17.11.4-rc1:

commit 96935c61631fe2095246b5dce5c6fea960e34c87
Author: Maxime Coquelin <maxime.coquelin>
Date:   Thu Aug 16 19:29:22 2018 +0200

    vhost: retranslate vring addr when memory table changes
    
    [ backported from upstream commit d5022533c20aed365d513663806a999459037015 ]
    
    When the vhost-user master sends memory updates using
    VHOST_USER_SET_MEM request, the user backends unmap and then
    mmap again the memory regions in its address space.
    
    If the ring addresses have already been translated, it needs to
    be translated again as they point to unmapped memory.
    
    Signed-off-by: Maxime Coquelin <maxime.coquelin>
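
For illustration, a minimal sketch of the approach the patch takes (hypothetical,
simplified code; names such as remap_guest_memory() and translate_ring_addresses()
are illustrative, not the literal patch): when the memory table changes, any
already-translated ring addresses are invalidated and translated again against
the new mappings.

#include <stddef.h>

/* Illustrative types and helpers; the real librte_vhost equivalents differ. */
struct vhost_virtqueue {
    void *desc, *avail, *used;   /* cached host VAs of the vring parts */
};

struct virtio_net_dev {
    int nr_vring;
    struct vhost_virtqueue *virtqueue[16];
};

void remap_guest_memory(struct virtio_net_dev *dev);   /* munmap old, mmap new regions */
int  translate_ring_addresses(struct virtio_net_dev *dev, int idx);

/* On VHOST_USER_SET_MEM_TABLE: drop stale ring translations and redo them
 * against the newly mapped guest memory. */
static int
handle_set_mem_table(struct virtio_net_dev *dev)
{
    remap_guest_memory(dev);

    for (int i = 0; i < dev->nr_vring; i++) {
        struct vhost_virtqueue *vq = dev->virtqueue[i];

        /* Were the rings already translated against the old mappings? */
        if (vq->desc || vq->avail || vq->used) {
            vq->desc = vq->avail = vq->used = NULL;    /* invalidate stale host VAs */
            if (translate_ring_addresses(dev, i) < 0)  /* retranslate */
                return -1;
        }
    }
    return 0;
}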

Comment 3 Maxime Coquelin 2018-08-17 12:52:53 UTC
Patch backported downstream and pushed to private-mcoqueli-bz1618488 branch.

Comment 13 Jean-Tsung Hsiao 2018-10-17 14:46:19 UTC
First of all, I can't reproduce the issue with OVS 2.9.0-55 FDP with NIC=X710.

But the fix passed netperf testing between guests over an OVS-DPDK/{vxlan,geneve}/i40e (X710) tunnel, so there is no regression with this fix.

Comment 14 Maxime Coquelin 2018-10-17 20:50:55 UTC
When debugging the issue, we saw that binding the VF to DPDK
causes a guest memory change, which results in QEMU sending
a memory update notification to the vhost-user backend.
In the vhost-user backend, the memory table update notification
results in all guest memory regions being unmapped and then
remapped. The crash happens because the virtqueue pointers
weren't re-translated (and so point to the unmapped memory
area), causing an invalid pointer dereference.

The OVS logs from Jean's reproduction attempts show that
the issue does not reproduce there because the guest memory regions
get remapped at the exact same virtual addresses.

I think we can consider the fix tested, because I
tested it on Franck's setup before posting it upstream.

Comment 15 Jean-Tsung Hsiao 2018-10-17 23:29:36 UTC
Per comments #13 and #14, setting the bug status to VERIFIED.

Comment 16 Jaison Raju 2018-10-29 10:56:34 UTC
*** Bug 1633878 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2018-11-05 14:59:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3500

