Bug 2155173
| Summary: | [vhost-user] unable to start vhost net: 71: falling back on userspace | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Yanghang Liu <yanghliu> |
| Component: | qemu-kvm | Assignee: | Laurent Vivier <lvivier> |
| qemu-kvm sub component: | Networking | QA Contact: | Yanghang Liu <yanghliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | chayang, coli, gkurz, hewang, jinzhao, juzhang, lvivier, maxime.coquelin, mhou, mst, tli, virt-maint, yanghliu |
| Version: | 9.2 | Keywords: | Regression, TestBlocker, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-7.2.0-8.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-05-09 07:20:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yanghang Liu
2022-12-20 10:01:25 UTC
(In reply to Yanghang Liu from comment #0)
...
> The qemu-kvm throws the following error:
> 2022-12-20T09:29:51.849295Z qemu-kvm: Received unexpected msg type. Expected
> 22 received 40

22 is VHOST_USER_IOTLB_MSG
40 is VHOST_USER_GET_STATUS

> 2022-12-20T09:29:51.849321Z qemu-kvm: Fail to update device iotlb
> 2022-12-20T09:29:51.849499Z qemu-kvm: Received unexpected msg type. Expected
> 40 received 22
> 2022-12-20T09:29:51.849598Z qemu-kvm: Received unexpected msg type. Expected
> 22 received 11

11 is VHOST_USER_GET_VRING_BASE

> 2022-12-20T09:29:51.849611Z qemu-kvm: Fail to update device iotlb
> 2022-12-20T09:29:51.849662Z qemu-kvm: Received unexpected msg type. Expected
> 11 received 22

VHOST_USER_GET_STATUS is added by:

    923b8921d210 ("vhost-user: Support vhost_dev_start")

in QEMU v7.2, but it should not be enabled if DPDK doesn't support the VHOST_USER_PROTOCOL_F_STATUS feature.

Maxime, do you have any idea why QEMU and DPDK disagree on the protocol sequence?

QE's NFV auto tests are currently blocked by this bug.

We only have a single thread on the DPDK side to handle Vhost-user requests: it reads a request, handles it and replies to it. Then it reads the next one, etc... So I don't think it is possible to mix up the order of request replies on the DPDK side.

Maybe there are two threads concurrently sending requests on the QEMU side?

What could be done is placing a breakpoint in QEMU where the unexpected reply is detected. When it breaks, dump the backtraces of the different threads to understand why multiple threads are sending requests.

I don't have a setup right now to reproduce this issue quickly; if you have one ready, I can help with getting the backtraces.

Regards,
Maxime

Yanghang, could you also provide the QEMU command line that is generated by libvirt and the boot parameters of the host kernel?

Thanks

(In reply to Maxime Coquelin from comment #9)
> We only have a single thread on DPDK side to handle Vhost-user requests,
> it will read a request, handle it and reply to it.
> Then it reads the next one, etc... So I don't think it is possible to mix
> request replies order on DPDK side.
>
> Maybe there are two threads concurrently sending requests on QEMU side?
>

Or perhaps the iotlb function (VHOST_USER_IOTLB_MSG) could be called asynchronously to the vhost_dev_start() function (VHOST_USER_GET_STATUS)?

Michael, any idea?

Thanks

(In reply to Laurent Vivier from comment #11)
> (In reply to Maxime Coquelin from comment #9)
> > We only have a single thread on DPDK side to handle Vhost-user requests,
> > it will read a request, handle it and reply to it. Then it reads the
> > next one, etc... So I don't think it is possible to mix request replies
> > order on DPDK side.
> >
> > Maybe there are two threads concurrently sending requests on QEMU side?
> >
>
> Or perhaps the iotlb function (VHOST_USER_IOTLB_MSG) could be called
> asynchronously to the vhost_dev_start() function (VHOST_USER_GET_STATUS)?
>
> Michael, any idea?
>
> Thanks

Yes, I think that's what happens.

The DPDK Vhost library sends an IOTLB miss request (VHOST_USER_SLAVE_IOTLB_MSG) to QEMU on the slave channel, which causes the thread handling it to send a VHOST_USER_IOTLB_MSG with the resulting IOTLB entry on the master channel.

In parallel, another QEMU thread performs the vhost_dev_start(), resulting in two threads concurrently reading the master socket to get their replies.

We need a way to synchronize this, likely by introducing a lock to force synchronization on the socket between the vhost_user_write() and process_message_reply() calls.

What do you think?

*** Bug 2160718 has been marked as a duplicate of this bug. ***

*** Bug 2162729 has been marked as a duplicate of this bug. ***

FYI the fix is now merged upstream as
https://gitlab.com/qemu-project/qemu/-/commit/f340a59d5a852d75ae34555723694c7e8eafbd0c

commit f340a59d5a852d75ae34555723694c7e8eafbd0c
Author: Greg Kurz <groug>
Date:   Thu Jan 19 18:24:23 2023 +0100

    Revert "vhost-user: Monitor slave channel in vhost_user_read()"

    This reverts commit db8a3772e300c1a656331a92da0785d81667dc81.

    Motivation : this is breaking vhost-user with DPDK as reported in [0].

    Received unexpected msg type. Expected 22 received 40
    Fail to update device iotlb
    Received unexpected msg type. Expected 40 received 22
    Received unexpected msg type. Expected 22 received 11
    Fail to update device iotlb
    Received unexpected msg type. Expected 11 received 22
    vhost VQ 1 ring restore failed: -71: Protocol error (71)
    Received unexpected msg type. Expected 22 received 11
    Fail to update device iotlb
    Received unexpected msg type. Expected 11 received 22
    vhost VQ 0 ring restore failed: -71: Protocol error (71)
    unable to start vhost net: 71: falling back on userspace virtio

    The failing sequence that leads to the first error is :
    - QEMU sends a VHOST_USER_GET_STATUS (40) request to DPDK on the master
      socket
    - QEMU starts a nested event loop in order to wait for the
      VHOST_USER_GET_STATUS response and to be able to process messages from
      the slave channel
    - DPDK sends a couple of legitimate IOTLB miss messages on the slave
      channel
    - QEMU processes each IOTLB request and sends VHOST_USER_IOTLB_MSG (22)
      updates on the master socket
    - QEMU assumes to receive a response for the latest VHOST_USER_IOTLB_MSG
      but it gets the response for the VHOST_USER_GET_STATUS instead

    The subsequent errors have the same root cause : the nested event loop
    breaks the order by design. It lures QEMU to expect responses to the
    latest message sent on the master socket to arrive first.

    Since this was only needed for DAX enablement which is still not merged
    upstream, just drop the code for now. A working solution will have to
    be merged later on. Likely protect the master socket with a mutex
    and service the slave channel with a separate thread, as discussed with
    Maxime in the mail thread below.

    [0] https://lore.kernel.org/qemu-devel/43145ede-89dc-280e-b953-6a2b436de395@redhat.com/

    Reported-by: Yanghang Liu <yanghliu>
    Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2155173
    Signed-off-by: Greg Kurz <groug>
    Message-Id: <20230119172424.478268-2-groug>
    Reviewed-by: Michael S. Tsirkin <mst>
    Signed-off-by: Michael S. Tsirkin <mst>
    Acked-by: Stefan Hajnoczi <stefanha>
    Acked-by: Maxime Coquelin <maxime.coquelin>

Yanghang, please set ITM.

Thanks

Test environment:
qemu-kvm-7.2.0-8.el9.x86_64
tuned-2.19.0-1.el9.noarch
libvirt-9.0.0-4.el9.x86_64
python3-libvirt-9.0.0-1.el9.x86_64
openvswitch2.17-2.17.0-63.el9fdp.x86_64
dpdk-21.11.2-1.el9_1.x86_64
edk2-ovmf-20221207gitfff6d81270b5-5.el9.noarch
seabios-bin-1.16.1-1.el9.noarch

Test result:
The domain with vhost-user interface(s) can be started without any qemu-kvm error.

QE bot (pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Move this bug to VERIFIED based on comment 27.

If new issue(s) are found during the regression tests, QE will open new bugs to track them.

*** Bug 2165278 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2162
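
For readers who want to see the ordering problem from the revert commit in isolation, here is a minimal toy model in Python. It is only a sketch, not QEMU or DPDK code: `Backend`, `read_expected`, `nested_loop_start` and `serialized_start` are hypothetical names, and the model assumes that every request on the master socket gets exactly one reply, which simplifies the real protocol. The only facts taken from this report are the message numbers (11, 22, 40), the log wording, and the observation that the backend answers strictly in the order it reads requests.

```python
from collections import deque

# Message numbers as quoted in this bug report.
VHOST_USER_GET_VRING_BASE = 11
VHOST_USER_IOTLB_MSG = 22
VHOST_USER_GET_STATUS = 40


class Backend:
    """Toy single-threaded backend: replies in the order requests arrive."""

    def __init__(self):
        self.replies = deque()

    def request(self, msg_type):
        # Assumption of this model: each request produces one reply.
        self.replies.append(msg_type)

    def read_reply(self):
        return self.replies.popleft()


def read_expected(backend, expected, log):
    """Read the next reply and record an error if it is not the expected type."""
    received = backend.read_reply()
    if received != expected:
        log.append(
            f"Received unexpected msg type. Expected {expected} received {received}"
        )
        return False
    return True


def nested_loop_start(backend, iotlb_misses):
    """Reverted behaviour: IOTLB updates are sent while a reply is still pending."""
    log = []
    backend.request(VHOST_USER_GET_STATUS)
    # The nested event loop services slave-channel IOTLB misses here,
    # before the GET_STATUS reply has been read.
    for _ in range(iotlb_misses):
        backend.request(VHOST_USER_IOTLB_MSG)
        if not read_expected(backend, VHOST_USER_IOTLB_MSG, log):
            log.append("Fail to update device iotlb")
    read_expected(backend, VHOST_USER_GET_STATUS, log)
    return log


def serialized_start(backend, iotlb_misses):
    """Each request/reply pair completes before the next request is sent."""
    log = []
    backend.request(VHOST_USER_GET_STATUS)
    read_expected(backend, VHOST_USER_GET_STATUS, log)
    for _ in range(iotlb_misses):
        backend.request(VHOST_USER_IOTLB_MSG)
        if not read_expected(backend, VHOST_USER_IOTLB_MSG, log):
            log.append("Fail to update device iotlb")
    return log


if __name__ == "__main__":
    for line in nested_loop_start(Backend(), 1):
        print(line)
```

With one pending IOTLB miss, `nested_loop_start` produces the same first three messages as the log in this report, while `serialized_start` produces none. Holding a lock across each write and its matching reply read, as suggested in the comments above, would enforce the serialized ordering in this model.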