Bug 1900770 - [libvirt] Supporting vDPA block in libvirt
Summary: [libvirt] Supporting vDPA block in libvirt
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jonathon Jongsma
QA Contact: Meina Li
URL:
Whiteboard:
Depends On: 1886123 2180076
Blocks: 2141157
 
Reported: 2020-11-23 16:54 UTC by Stefano Garzarella
Modified: 2023-09-22 16:36 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 16:36:04 UTC
Type: Story
Target Upstream Version: 9.8.0
Embargoed:


Attachments: none


Links
System: Red Hat Issue Tracker
ID: RHEL-7382
Private: No
Priority: None
Status: Migrated
Summary: None
Last Updated: 2023-09-22 16:36:09 UTC

Description Stefano Garzarella 2020-11-23 16:54:23 UTC
This bug was initially created as a copy of Bug #1886123

I am copying this bug because: 

We will add support for vDPA block devices in QEMU, and we need to manage them through libvirt.

vDPA block will support both hardware and software vDPA implementations, exposing a virtio block device to the guest, which can use the standard virtio-blk device driver to access either of them.

The host kernel should provide a management API (netlink, devlink) to configure the vDPA block devices (both hardware and software).
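
For illustration, this management API is driven from userspace by the iproute2 "vdpa" tool; a minimal hedged sketch (the management-device and instance names are just examples):

# list the available vDPA management devices (vendor PFs, vdpasim_blk, ...)
vdpa mgmtdev show
# create a vDPA block device instance and inspect it
sudo vdpa dev add mgmtdev vdpasim_blk name blk0
vdpa dev show blk0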

Comment 1 Jonathon Jongsma 2021-02-19 16:13:27 UTC
Hi Stefano, do you know the current state of the qemu support?

Comment 2 Stefano Garzarella 2021-02-24 08:10:48 UTC
(In reply to Jonathon Jongsma from comment #1)
> Hi Stefano, do you know the current state of the qemu support?

Hi Jonathon, there is a PoC from ByteDance [1] but I'm not sure it will be the final version.
I'll keep you updated.

[1] https://github.com/bytedance/qemu/tree/vduse

Comment 3 Han Han 2021-05-12 07:49:27 UTC
A possible way to simulate a vDPA blk device in the kernel:
kernel v5.13-rc1~42^2~5 vdpa_sim_blk: add support for vdpa management tool

Comment 4 John Ferlan 2021-09-08 13:31:07 UTC
Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 6 RHEL Program Management 2022-05-23 07:27:16 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 7 Klaus Heinrich Kiwi 2022-05-25 15:24:01 UTC
Jaroslav, this one auto-closed, but I think it's still needed, and we should be targeting RHEL 9.2. Care to re-assign, plan, etc.?

Comment 9 Jonathon Jongsma 2022-05-25 16:22:15 UTC
yeah, it should still be open. Moving to 9.2

Comment 17 Stefano Garzarella 2023-02-14 13:10:50 UTC
Sorry for the delay; I was working through the backlog of queries that accumulated during my PTO.

Below is a description with an example using the vDPA block simulator (available in Fedora). Jonathon, feel free to ping me via email or chat if you have any questions.

Requirements:
- libblkio (needed to build the QEMU blkio driver; available on Fedora 37, WIP on RHEL 9, see BZ 2166106)
- iproute (provides the "vdpa" tool used to set up vDPA devices)

# build QEMU with blkio
./configure ... --enable-blkio

# module used to expose vDPA devices as vhost devices (one of the vDPA bus drivers) and thus make them usable by QEMU/libblkio
sudo modprobe vhost-vdpa

# module that provides the vDPA block simulator (128 MiB ramdisk) in the kernel; used only for testing/debugging
sudo modprobe vdpa-sim-blk

# create a new vDPA blk simulator instance (e.g. named blk0)
sudo vdpa dev add mgmtdev vdpasim_blk name blk0

# Then /sys/bus/vdpa/devices/blk0 is created with useful info:
#   driver_override: can be used to choose the right vDPA bus driver (see later)
#   vhost-vdpa-X: contains info about the vhost character device (e.g.
#     /sys/bus/vdpa/devices/blk0/vhost-vdpa-X/dev holds the major/minor of
#     /dev/vhost-vdpa-X, which will be used by QEMU/libblkio). If it is not
#     present, the device is not attached to the vhost-vdpa bus. X is an
#     integer starting from 0, assigned by the vhost-vdpa module.

# if there are multiple vDPA bus drivers loaded (e.g. vhost-vdpa and virtio-vdpa), the device will be attached to the first one loaded.
# to be sure that the device is attached to the vhost-vdpa bus, we can override the driver
echo vhost_vdpa | sudo tee /sys/bus/vdpa/devices/blk0/driver_override
echo blk0 | sudo tee /sys/bus/vdpa/devices/blk0/driver/unbind
echo blk0 | sudo tee /sys/bus/vdpa/drivers_probe
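
# optionally verify the binding (a hedged check based on the sysfs layout described above)
readlink /sys/bus/vdpa/devices/blk0/driver   # should point to .../vhost_vdpa
ls /sys/bus/vdpa/devices/blk0/               # a vhost-vdpa-X entry should now be present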

# as an alternative, driverctl(8) can be used for the same purpose
# note: driverctl(8) integrates with udev so the binding is preserved
#
# sudo driverctl -b vdpa set-override dev1 vhost_vdpa

# then we can start qemu with the virtio-blk-vhost-vdpa device
./qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
  -object memory-backend-file,share=on,id=mem,size="512M",mem-path="/dev/hugepages" \
  -drive file=/path/to/f37.qcow2,format=qcow2,if=none,id=hd0 \
  -device virtio-blk-pci,drive=hd0,bootindex=1 \
  -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
  -device virtio-blk-pci,drive=vdpa0

And in the VM we have the 128 MiB disk:
  guest$ fdisk -l /dev/vdb
  Disk /dev/vdb: 128 MiB, 134217728 bytes, 262144 sectors
  ...

# Note: vhost-vdpa still requires pinning all the guest memory, so QEMU can fail to register the entire memory if `ulimit -l` is not unlimited with this message:
#  qemu-system-x86_64: -device virtio-blk-pci,drive=drive_src1: Failed to add blkio mem region 0x7f204be00000 with size 536870912: Bad address (os error 14)
#  Tuning /etc/security/limits.conf will help, but I'm working to remove this restriction.
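
# Until that restriction is lifted, a hedged sketch of the limits.conf tuning mentioned
# above (the 'qemu' user name and the 'unlimited' value are assumptions; pick values
# appropriate for your setup):
ulimit -l   # check the current locked-memory limit for the user that runs QEMU
echo 'qemu  soft  memlock  unlimited' | sudo tee -a /etc/security/limits.conf
echo 'qemu  hard  memlock  unlimited' | sudo tee -a /etc/security/limits.conf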

Comment 18 qing.wang 2023-03-03 06:27:08 UTC
Hi, Stefano Garzarella,

Regarding the usage you mentioned in comment #17:

/qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
  -object memory-backend-file,share=on,id=mem,size="512M",mem-path="/dev/hugepages" \
  -drive file=/path/to/f37.qcow2,format=qcow2,if=none,id=hd0 \
  -device virtio-blk-pci,drive=hd0,bootindex=1 \
  -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
  -device virtio-blk-pci,drive=vdpa0


1. We usually see a blockdev node chain made of three kinds of nodes: protocol/format/filter, where the filter node is optional.
It looks like the format node is omitted from your command line.

What is the format for path=/dev/vhost-vdpa-0; does it use raw?
If we specify qcow2 for it, what happens?


Our automation always adds a protocol and a format node to the blockdev definition.

Does that have any side effects?

If we have to add a format node, how should we describe it? Like the following?

-blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
-blockdev node-name=fmt-vdpa0,driver=raw,file=vdpa0 \
-device virtio-blk-pci,drive=fmt-vdpa0


2. Does it support migration? What are the requirements (or specific usage) for migration?

Comment 19 Stefano Garzarella 2023-03-03 08:08:21 UTC
(In reply to qing.wang from comment #18)
> Hi,Stefano Garzarella

Hi :-)

> 
> This usage you mentioned in comment #17
> 
> /qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
>   -object memory-backend-file,share=on,id=mem,size="512M",mem-path="/dev/hugepages" \
>   -drive file=/path/to/f37.qcow2,format=qcow2,if=none,id=hd0 \
>   -device virtio-blk-pci,drive=hd0,bootindex=1 \
>   -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
>   -device virtio-blk-pci,drive=vdpa0
> 
> 
> 1. We usually see a blockdev node chain made of three kinds of nodes:
> protocol/format/filter, where the filter node is optional.
> It looks like the format node is omitted from your command line.

vDPA devices are usually SmartNICs that implement a network protocol (e.g. Ceph RBD, iSCSI, etc.) in hardware, so the format is almost always raw, because it's often the protocol (like RBD) that provides the features a format like qcow2 would give us.

That said, formats should also be supported, but, as when we use an RBD backend, I don't think that's the primary use case.

> 
> What is the format for path=/dev/vhost-vdpa-0; does it use raw?
> If we specify qcow2 for it, what happens?

It is like a raw block device; which network protocol is used depends on how the SmartNIC is configured. As mentioned, it should be similar to the use case of an RBD device (in most cases for QEMU it is a raw device, but a format can also be used in special cases).

> 
> Our automation always adds a protocol and a format node to the blockdev
> definition.
> 
> Does that have any side effects?

It should not; also, I think it is good to test with formats as well.

> 
> If we have to add a format node, how should we describe it? Like the following?
> 
> -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
> -blockdev node-name=fmt-vdpa0,driver=raw,file=vdpa0 \
> -device virtio-blk-pci,drive=fmt-vdpa0
> 

Yep, this is exactly the way we should use it!
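
For completeness, a hedged variation of the same layering with a non-raw format (this assumes the vDPA device actually contains a qcow2 image, which, as noted above, is not the primary use case):

-blockdev node-name=prot-vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
-blockdev node-name=fmt-vdpa0,driver=qcow2,file=prot-vdpa0 \
-device virtio-blk-pci,drive=fmt-vdpa0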

> 
> 2. Does it support migration? What are the requirements (or specific usage)
> for migration?

This depends on the various HWs, but from QEMU's perspective it is supported.
For QEMU, the source and destination should have a SmartNIC configured with the same backend (e.g. the same RBD disk).

For now, we have no HW to test it with, and the only way is the simulator in the kernel. In this case, I don't think it's easy to make it have the same content on 2 different servers, so for now the simulator can only be used to test the migration on the same host.

Comment 20 Jonathon Jongsma 2023-03-14 19:06:30 UTC
(In reply to Stefano Garzarella from comment #17)

> # then we can start qemu with the virtio-blk-vhost-vdpa device
> ./qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
>   -object memory-backend-file,share=on,id=mem,size="512M",mem-path="/dev/hugepages" \
>   -drive file=/path/to/f37.qcow2,format=qcow2,if=none,id=hd0 \
>   -device virtio-blk-pci,drive=hd0,bootindex=1 \
>   -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
>   -device virtio-blk-pci,drive=vdpa0

Are hugepages required for this to work? Do we need to enforce this in libvirt when a vdpa-block device is configured? Is there anything else we need to enforce? I notice you have specified 'share=on' for the memory backing and 'cache.direct=on' for the blockdev. Are these required as well?


> # Note: vhost-vdpa still requires pinning all the guest memory, so QEMU can
> fail to register the entire memory if `ulimit -l` is not unlimited with this
> message:
> #  qemu-system-x86_64: -device virtio-blk-pci,drive=drive_src1: Failed to
> add blkio mem region 0x7f204be00000 with size 536870912: Bad address (os
> error 14)
> #  Tuning /etc/security/limits.conf will help, but I'm working to remove
> this restriction.

Do you know how this interacts with other devices? For example, if there are multiple vdpa-block devices configured, will the memlock limit be affected? What about a vdpa block device and a vdpa net device? Is the memlock limit still "the entire guest memory"?  

I ask because for vfio devices with an iommu, we currently need to lock an amount of memory equal to the total guest memory for each vfio device assigned to a guest. In other words, if there are 3 vfio devices we would need to set the memlock limit to (3 * TOTAL_GUEST_MEMORY).

Also, I'm working on the implementation in libvirt and I get the following failure:
virsh # start vdpablock-test 
error: Failed to start domain 'vdpablock-test'
error: internal error: qemu unexpectedly closed the monitor: 2023-03-14T16:25:02.736535Z qemu-system-x86_64: -blockdev {"driver":"virtio-blk-vhost-vdpa","path":"/dev/vhost-vdpa-0","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}: blkio_connect failed: Failed to connect to vDPA device: Input/output error

I can also reproduce the same failure if I run your example qemu command above as the 'qemu' user. It only works if I run as root. Is there a way to get this to work with non-root accounts? The vDPA network device didn't seem to have this restriction.

Comment 21 Stefano Garzarella 2023-03-15 08:57:50 UTC
(In reply to Jonathon Jongsma from comment #20)
> Are hugepages required for this to work? Do we need to enforce this in
> libvirt when a vdpa-block device is configured? 

Nope, the important thing is that it is not anonymous memory.
It should be the same requirements as vDPA net because we use the same interface.

> Is there anything else we
> need to enforce? I notice you have specified 'share=on' for the memory
> backing and 'cache.direct=on' for the blockdev. Are these required as well?

Yep, 'share=on' is required when using VDUSE as backend.
Since a vDPA device is always exposed as /dev/vhost-vdpa-X in all cases (HW, VDUSE, simulator), it is better to keep 'share=on' if it is not a problem.

Yep, 'cache.direct=on' is the only mode supported by virtio-blk-vhost-vdpa.
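
For what it's worth, a hedged alternative to hugepages (an assumption based on the "not anonymous memory" requirement above, not something verified in this thread) is a shared memfd backend; boot disk omitted for brevity:

./qemu-system-x86_64 -m 512M -smp 2 -M q35,accel=kvm,memory-backend=mem \
  -object memory-backend-memfd,id=mem,share=on,size=512M \
  -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
  -device virtio-blk-pci,drive=vdpa0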

> 
> 
> > # Note: vhost-vdpa still requires pinning all the guest memory, so QEMU can
> > fail to register the entire memory if `ulimit -l` is not unlimited with this
> > message:
> > #  qemu-system-x86_64: -device virtio-blk-pci,drive=drive_src1: Failed to
> > add blkio mem region 0x7f204be00000 with size 536870912: Bad address (os
> > error 14)
> > #  Tuning /etc/security/limits.conf will help, but I'm working to remove
> > this restriction.
> 
> Do you know how this interacts with other devices? For example, if there are
> multiple vdpa-block devices configured, will the memlock limit be affected?
> What about a vdpa block device and a vdpa net device? Is the memlock limit
> still "the entire guest memory"?  
> 
> I ask because for vfio devices with an iommu, we currently need to lock an
> amount of memory equal to the total guest memory for each vfio device
> assigned to a guest. In other words, if there are 3 vfio devices we would
> need to set the memlock limit to (3 * TOTAL_GUEST_MEMORY).

I suspect there is the same requirement here, because each device pins all guest memory, at least until we implement support for page faults for vDPA devices (in progress for software devices, e.g. the simulator).
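
As a rough worked example (an assumption that parallels the VFIO rule above rather than anything confirmed in this thread): a guest with 4 GiB of RAM, one vDPA block device and one vDPA net device would then need a memlock limit of about 2 * 4 GiB = 8 GiB.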

> 
> Also, I'm working on the implementation in libvirt and I get the following
> failure:
> virsh # start vdpablock-test 
> error: Failed to start domain 'vdpablock-test'
> error: internal error: qemu unexpectedly closed the monitor:
> 2023-03-14T16:25:02.736535Z qemu-system-x86_64: -blockdev
> {"driver":"virtio-blk-vhost-vdpa","path":"/dev/vhost-vdpa-0","node-name":
> "libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-
> only":true,"discard":"unmap"}: blkio_connect failed: Failed to connect to
> vDPA device: Input/output error
> 
> I can also reproduce the same failure if I run your example qemu command
> above as the 'qemu' user. It only works if I run as root. Is there a way to
> get this to work with non-root accounts? The vDPA network device didn't seem
> to have this restriction.

I think it depends on the vhost char device permissions; in my case I see this:

$ ls -l /dev/vhost-vdpa-*
crw-------. 1 root root 510, 0 Mar 15 09:45 /dev/vhost-vdpa-0
crw-------. 1 root root 510, 1 Mar 15 09:50 /dev/vhost-vdpa-1

Where /dev/vhost-vdpa-0 is a blk device and /dev/vhost-vdpa-1 is a net device, so I thought the behavior was the same.
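
A hedged workaround (an assumption, not something verified in this thread): a udev rule could relax the char-device permissions so an unprivileged QEMU can open the device, e.g.

# assumed group name 'qemu'; adjust to whatever user/group actually runs QEMU
echo 'KERNEL=="vhost-vdpa-*", GROUP="qemu", MODE="0660"' | \
  sudo tee /etc/udev/rules.d/90-vhost-vdpa.rules
sudo udevadm control --reload-rules
sudo udevadm trigger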

About vDPA net, does qemu open the char device or is the fd passed by libvirt which has more permissions?
For now the fd passing is not supported for virtio-blk-vhost-vdpa, but I could implement it.

Comment 22 Jonathon Jongsma 2023-03-15 14:03:10 UTC
(In reply to Stefano Garzarella from comment #21)
> About vDPA net, does qemu open the char device or is the fd passed by
> libvirt which has more permissions?
> For now the fd passing is not supported for virtio-blk-vhost-vdpa, but I
> could implement it.

Oh yes, of course that must be the difference. I definitely used fd passing for vdpa net and hadn't yet implemented it for vdpa block. But I'm not sure that you need to actually do anything in qemu to enable this. For vdpa net, we simply pass /dev/fdset/N as the path to the vhostdev. Presumably that should also work here if you used the qemu_open() api to open the passed device path.

Comment 23 Stefano Garzarella 2023-03-15 14:19:33 UTC
(In reply to Jonathon Jongsma from comment #22)
> (In reply to Stefano Garzarella from comment #21)
> > About vDPA net, does qemu open the char device or is the fd passed by
> > libvirt which has more permissions?
> > For now the fd passing is not supported for virtio-blk-vhost-vdpa, but I
> > could implement it.
> 
> Oh yes, of course that must be the difference. I definitely used fd passing
> for vdpa net and hadn't yet implemented it for vdpa block. But I'm not
> sure that you need to actually do anything in qemu to enable this. For vdpa
> net, we simply pass /dev/fdset/N as the path to the vhostdev. Presumably
> that should also work here if you used the qemu_open() api to open the
> passed device path.

Unfortunately the `path` is opened by the library (i.e. libblkio) that doesn't use qemu_open() api, so I'm not sure if it will work.
But I can extend the library to support fd passing in some way.

Comment 24 Jonathon Jongsma 2023-03-17 20:25:20 UTC
(In reply to Stefano Garzarella from comment #23)
> Unfortunately the `path` is opened by the library (i.e. libblkio) that
> doesn't use qemu_open() api, so I'm not sure if it will work.
> But I can extend the library to support fd passing in some way.

Ah, I see. It looks like I'll need to wait for that to be implemented in qemu before we can effectively support this in libvirt, then.

Comment 27 Stefano Garzarella 2023-05-02 14:54:50 UTC
@jjongsma I sent the QEMU patch here: https://lore.kernel.org/qemu-devel/20230502145050.224615-1-sgarzare@redhat.com/T/#u

I added a new "fd" option, and the fd will be forwarded to the library without using qemu_open().
Is that okay from libvirt's PoV?

Comment 28 Jonathon Jongsma 2023-05-02 15:46:05 UTC
From libvirt's point of view, we could accommodate either the fd=N syntax or the path=/dev/fdset/N syntax. So I guess it just depends on what qemu reviewers prefer.

Comment 29 Jonathon Jongsma 2023-05-05 20:58:47 UTC
I guess I should mention that although we can accommodate either, the fdset approach (using qemu_open() internally) does make things significantly nicer from a libvirt implementation point of view.

Comment 30 Stefano Garzarella 2023-05-08 07:30:10 UTC
(In reply to Jonathon Jongsma from comment #29)
> I guess I should mention that although we can accommodate either, the fdset
> approach (using qemu_open() internally) does make things significantly nicer
> from a libvirt implementation point of view.

Thanks for pointing that out, but the QEMU reviewers already agreed on the new `fd` parameter (I also sent v2 addressing some comments).
This should better reflect the properties of the libblkio driver.

Comment 31 Stefano Garzarella 2023-06-05 10:35:14 UTC
Update: the QEMU patches were merged upstream. The last version (thanks to Jonathon and Markus for the help!) supports the `path=/dev/fdset/N` syntax and adds an '@fdset' feature to BlockdevOptionsVirtioBlkVhostVdpa so that libvirt can discover when fd passing is supported.

QEMU downstream MR: https://gitlab.com/redhat/centos-stream/src/qemu-kvm/-/merge_requests/169
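
For reference, a hedged sketch (an assumed invocation, not taken from the patches themselves) of how the fdset syntax could be exercised by hand, with the fd opened by the caller and inherited by QEMU:

# open /dev/vhost-vdpa-0 read-write on fd 7 and hand it to QEMU as fdset 1
./qemu-system-x86_64 ... \
  -add-fd fd=7,set=1,opaque="rdwr:/dev/vhost-vdpa-0" \
  -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/fdset/1,cache.direct=on \
  -device virtio-blk-pci,drive=vdpa0 \
  7<>/dev/vhost-vdpa-0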

Comment 32 Jonathon Jongsma 2023-07-21 14:57:50 UTC
The libvirt patches were posted upstream here: https://listman.redhat.com/archives/libvir-list/2023-June/240213.html

I'm still waiting on some additional feedback before it can be pushed upstream.

Comment 35 John Ferlan 2023-09-12 19:47:41 UTC
NB: I see v2 posted/acked: https://listman.redhat.com/archives/libvir-list/2023-September/242054.html

Comment 36 Jonathon Jongsma 2023-09-13 19:24:17 UTC
Patches merged upstream (an illustrative disk XML sketch follows the commit list):
85205784e6296c74b42bf20d343c8f7d229f3632 virStorageSourceClear: Clear 'vdpadev' field
4ef2bcfd3fd8d40ac2671f2bb9c1784d20d71a65 qemu: Implement support for vDPA block devices
2efa9ba66a248a1a7ecc1bcc2decfcdbbf2c6b5d qemu: consider vdpa block devices for memlock limits
0ebb416d7ef38febb5690db8676e7bb52edc4980 qemu: make vdpa connect function more generic
6cf7dbeff8ab9bd5c18bd88d59bf3133d3679a6e qemu: add virtio-blk-vhost-vdpa capability
1df106cc20a4cc6417cfbaf01860f465ec3dd915 conf: add ability to configure a vdpa block disk device
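
For illustration, a hedged sketch of the disk XML this enables (schema details are assumed from this thread: a vhost-vdpa source device path, raw format, and cache='none' to get cache.direct=on; check the libvirt documentation for the exact syntax), attached to the example domain from comment 20:

# assumed schema; verify the disk type and attributes against the libvirt docs
cat > vdpa-disk.xml <<'EOF'
<disk type='vhostvdpa' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/vhost-vdpa-0'/>
  <target dev='vdb' bus='virtio'/>
</disk>
EOF
virsh attach-device vdpablock-test vdpa-disk.xml --config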

Comment 38 RHEL Program Management 2023-09-22 16:34:58 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 39 RHEL Program Management 2023-09-22 16:36:04 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues@redhat.com. You can also visit https://access.redhat.com/articles/7032570 for general account information.

