Bug 1797058
| Summary: | AMD/SEV: vhost-user support | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Dr. David Alan Gilbert <dgilbert> |
| Component: | qemu-kvm | Assignee: | Virtualization Maintenance <virt-maint> |
| qemu-kvm sub component: | Devices | QA Contact: | Pei Zhang <pezhang> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | ailan, chayang, coli, jinzhao, juzhang, maxime.coquelin, virt-maint, yanghliu |
| Version: | 8.2 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-31 07:27:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Dr. David Alan Gilbert
2020-01-31 20:00:57 UTC
QEMU has recently been split into sub-components and, as a one-time operation to avoid breaking tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Testing update: the SEV guest can boot with vhost-user. Next I'll test whether vhost-user can receive MoonGen packets. (The MoonGen tests need a back-to-back NIC connection, and the NUMA node where the NICs are located currently has no memory, so I have asked IT to re-cable the machines to satisfy these specific hardware requirements. I'll update the final results soon.)

Testing summary: SEV does not support vhost-user well.
== Testing result:
1. Sometimes the guest boots successfully, but starting DPDK's testpmd in the guest makes the guest reboot. With the testing steps below, the guest always reboots after step 8 in this case.
2. Sometimes the guest fails to boot with the error below. With the testing steps below, the guest fails to boot at step 7.
qemu-kvm: -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages,share=yes,size=4G,host-nodes=5,policy=bind: cannot bind memory to host NUMA nodes: Input/output error
== Testing highlight:
1. Hugepages are a must for vhost-user testing, so we have to reserve hugepages for this testing.
2. We need to use memory and cores from the NUMA node where the NICs are located. In our setup, NUMA node 5 satisfies this requirement, so we use NUMA node 5 (a quick way to check is sketched after this list).
3. Without SEV, vhost-user works very well: there are no errors and MoonGen receives packets from the guest's DPDK testpmd without problems. So we can be confident that SEV is the key factor behind the two results above.
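Besides hwloc-ls (used in step 3 of the testing steps), a quick way to confirm which NUMA node a NIC sits on is a generic sysfs query; the PCI address here is the one from this setup and should report node 5:
# cat /sys/bus/pci/devices/0000:a3:00.0/numa_node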
== Testing steps:
1. Add "amd_iommu=on iommu=pt default_hugepagesz=1G" in host kernel and reboot host.
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-184.el8.x86_64 root=/dev/mapper/rhel_hp--dl385g10--10-root ro crashkernel=auto resume=/dev/mapper/rhel_hp--dl385g10--10-swap rd.lvm.lv=rhel_hp-dl385g10-10/root rd.lvm.lv=rhel_hp-dl385g10-10/swap console=ttyS0,115200n81 amd_iommu=on iommu=pt default_hugepagesz=1G
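One way to add these arguments persistently on RHEL 8 is via grubby; this is a sketch and assumes grubby manages your boot entries:
# grubby --update-kernel=ALL --args="amd_iommu=on iommu=pt default_hugepagesz=1G"
# reboot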
2. Enable SEV in kvm_amd
# modprobe -r kvm_amd
# modprobe kvm_amd sev=1
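To confirm SEV really is enabled after reloading the module, the parameter can be read back; depending on the kernel the value is reported as 1 or Y:
# cat /sys/module/kvm_amd/parameters/sev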
3. Check which NUMA node the NICs are in; in this setup, they are in NUMA node 5.
# hwloc-ls
Machine (126GB total)
Package L#0
NUMANode L#0 (P#0 31GB)
...
NUMANode L#5 (P#5 31GB)
L3 L#10 (4096KB) + L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#26)
L3 L#11 (4096KB) + L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#27)
HostBridge L#5
PCIBridge
2 x { PCI 8086:1528 }
...
4. Reserve hugepages on NUMA node 5 (and node 0)
# echo 20 > /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages
# echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
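It is worth reading the counters back to make sure the pages were really allocated (this is the same check David suggests later in this bug); allocation can take a moment or fail when memory is fragmented, so the values should come back as 20 and 10:
# cat /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages
# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages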
5. Bind NICs to VFIO
# modprobe vfio
# modprobe vfio-pci
# dpdk-devbind --bind=vfio-pci 0000:a3:00.0
# dpdk-devbind --bind=vfio-pci 0000:a3:00.1
# dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0000:a3:00.0 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=vfio-pci unused=ixgbe
0000:a3:00.1 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=vfio-pci unused=ixgbe
6. Boot OVS and reserve hugepages from NUMA node 5. The full boot script is attached in the next comment.
...
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,0,0,0,0,1024"
...
# ovs-vsctl show
ee576f79-4200-49cc-8a47-3548d489ffa8
Bridge ovsbr0
datapath_type: netdev
Port dpdk0
Interface dpdk0
type: dpdk
options: {dpdk-devargs="0000:a3:00.0", n_rxq="1"}
Port ovsbr0
Interface ovsbr0
type: internal
Port vhost-user0
Interface vhost-user0
type: dpdkvhostuserclient
options: {vhost-server-path="/tmp/vhostuser0.sock"}
Bridge ovsbr1
datapath_type: netdev
Port ovsbr1
Interface ovsbr1
type: internal
Port dpdk1
Interface dpdk1
type: dpdk
options: {dpdk-devargs="0000:a3:00.1", n_rxq="1"}
Port vhost-user1
Interface vhost-user1
type: dpdkvhostuserclient
options: {vhost-server-path="/tmp/vhostuser1.sock"}
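Note on the socket direction: the ports above are of type dpdkvhostuserclient, so OVS is the vhost-user client and connects to /tmp/vhostuser0.sock and /tmp/vhostuser1.sock, while QEMU creates those sockets as the server (hence the "server" option on the chardevs in step 7). After QEMU starts, a simple listing confirms both server sockets exist:
# ls -l /tmp/vhostuser0.sock /tmp/vhostuser1.sock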
7. Boot QEMU with 2 vhost-user ports and SEV enabled
/usr/libexec/qemu-kvm \
-enable-kvm \
-cpu EPYC \
-smp 4 \
-m 4G \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages,share=yes,size=4G,host-nodes=5,policy=bind \
-numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \
-object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 \
-machine q35,memory-encryption=sev0 \
-drive if=pflash,format=raw,unit=0,file=/usr/share/edk2/ovmf/sev/OVMF_CODE.secboot.fd,readonly \
-drive if=pflash,format=raw,unit=1,file=/usr/share/edk2/ovmf/sev/OVMF_VARS.fd \
-device pcie-root-port,id=root.1,chassis=1 \
-device pcie-root-port,id=root.2,chassis=2 \
-device pcie-root-port,id=root.3,chassis=3 \
-device pcie-root-port,id=root.4,chassis=4 \
-device pcie-root-port,id=root.5,chassis=5 \
-device virtio-scsi-pci,iommu_platform=on,id=scsi0,bus=root.1,addr=0x0 \
-drive file=/home/sev_guest.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0 \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scssi0-0-0-0,bootindex=1 \
-netdev tap,id=hostnet0,vhost=off \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=18:66:da:57:dd:12,bus=root.2,iommu_platform=on \
-vnc :0 \
-monitor stdio \
-chardev socket,id=charnet1,path=/tmp/vhostuser0.sock,server \
-netdev vhost-user,chardev=charnet1,id=hostnet1 \
-device virtio-net-pci,netdev=hostnet1,id=net1,mac=18:66:da:5f:dd:02,bus=root.3,iommu_platform=on \
-chardev socket,id=charnet2,path=/tmp/vhostuser1.sock,server \
-netdev vhost-user,chardev=charnet2,id=hostnet2 \
-device virtio-net-pci,netdev=hostnet2,id=net2,mac=18:66:da:5f:dd:03,bus=root.4,iommu_platform=on
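If the guest does come up, one way to confirm that SEV is actually active inside it is to look for the memory-encryption message in the guest kernel log; the exact wording varies between kernel versions:
# dmesg | grep -i "memory encryption"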
8. In guest, start testpmd.
# modprobe vfio enable_unsafe_noiommu_mode=Y
# modprobe vfio-pci
# cat /sys/module/vfio/parameters/enable_unsafe_noiommu_mode
Y
# dpdk-devbind --bind=vfio-pci 0000:03:00.0
# dpdk-devbind --bind=vfio-pci 0000:04:00.0
# echo 1 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# /usr/bin/testpmd \
-l 1,2,3 \
-n 4 \
-d /usr/lib64/librte_pmd_virtio.so \
-w 0000:03:00.0 -w 0000:04:00.0 \
-- \
--nb-cores=2 \
-i \
--disable-rss \
--rxd=512 --txd=512 \
--rxq=1 --txq=1
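For reference, once testpmd reaches its interactive prompt, the usual follow-up is to start forwarding and read the port counters; this is standard testpmd usage rather than anything specific to this bug:
testpmd> start
testpmd> show port stats all
testpmd> stop
testpmd> quit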
Created attachment 1670725
ovs boot script
More info:
Testing versions:
4.18.0-184.el8.x86_64
qemu-kvm-4.2.0-12.module+el8.2.0+5858+afd073bc.x86_64
openvswitch2.13-2.13.0-0.20200117git8ae6a5f.el8fdp.1.x86_64
dpdk-19.11-4.el8.x86_64
Testing servers:
hp-dl385g10-10.lab.eng.pek2.redhat.com
Testing NICs:
X540-AT2 (10G, ixgbe)
Hi David,

Could you check the testing in Comment 5 and Comment 6, please? Do you have any comments?

Also, if we plan to support vhost-user with SEV in the future, I think we need to file new bugs to track the issues in Comment 5. Please let me know if we need to file them and I can do so.

Thank you.

Best regards,
Pei

(In reply to Pei Zhang from comment #7)
> Hi David,
>
> Could you check the testing in Comment 5 and Comment 6, please? Do you have
> any comments?
>
> Also, if we plan to support vhost-user with SEV in the future, I think we
> need to file new bugs to track the issues in Comment 5. Please let me know
> if we need to file them and I can do so.
>
> Thank you.
>
> Best regards,
>
> Pei

This test case combines lots of different things; it has vhost-user, VFIO and IOMMUs. I think we probably need to try something simpler, with just vhost-user in one test and VFIO/IOMMU in another test, to see which ones cause a problem. However, it would be good to file a bug for the case where the guest reboots.

For the error about 'cannot bind memory to host NUMA nodes', after you do:

echo 20 > /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages

please do:

cat /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages

and check it says 20; sometimes it takes some time, and sometimes it can't move pages to do it. If that doesn't help, please file a separate bug on that (please check first whether it is really SEV-only).

I suggest you try your current host configuration and qemu+ovs, but in the guest just run normal Linux network tests to send packets over the vhost-user network rather than using testpmd. That will tell us if normal vhost-user networking works without worrying about VFIO.

Dave

(In reply to Dr. David Alan Gilbert from comment #8)
...
> This test case combines lots of different things; it has vhost-user, VFIO
> and IOMMUs. I think we probably need to try something simpler, with just
> vhost-user in one test and VFIO/IOMMU in another test, to see which ones
> cause a problem. However, it would be good to file a bug for the case where
> the guest reboots.
>
> For the error about 'cannot bind memory to host NUMA nodes', after you do:
>
> echo 20 > /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages
>
> please do:
>
> cat /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages
>
> and check it says 20; sometimes it takes some time, and sometimes it can't
> move pages to do it. If that doesn't help, please file a separate bug on
> that (please check first whether it is really SEV-only).

I've filed Bug 1814502 to track the hugepage issue. It is not related to vhost-user; SEV plus hugepages alone can cause it.

> I suggest you try your current host configuration and qemu+ovs, but in the
> guest just run normal Linux network tests to send packets over the
> vhost-user network rather than using testpmd. That will tell us if normal
> vhost-user networking works without worrying about VFIO.

I've filed Bug 1814509 to track the guest reboot issue. Next, I'll run normal Linux network tests over vhost-user. The testing results will be updated.

Verified with qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64:
With the kernel virtio-net driver, the SEV vhost-user network works well. The steps below were tested through libvirt, as this is the way QE testing will cover it.
1. Add "amd_iommu=on iommu=pt default_hugepagesz=1G" in host kernel and reboot host.
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-191.el8.x86_64 root=/dev/mapper/rhel_hp--dl385g10--10-root ro crashkernel=auto resume=/dev/mapper/rhel_hp--dl385g10--10-swap rd.lvm.lv=rhel_hp-dl385g10-10/root rd.lvm.lv=rhel_hp-dl385g10-10/swap console=ttyS0,115200n81 iommu=pt amd_iommu=on default_hugepagesz=1G
2. Enable SEV in kvm_amd
# modprobe -r kvm_amd
# modprobe kvm_amd sev=1
3. Check which NUMA node the NICs are in; in this setup, they are in NUMA node 5.
# hwloc-ls
Machine (126GB total)
Package L#0
NUMANode L#0 (P#0 31GB)
...
NUMANode L#5 (P#5 31GB)
L3 L#10 (4096KB) + L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#26)
L3 L#11 (4096KB) + L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#27)
HostBridge L#5
PCIBridge
2 x { PCI 8086:1528 }
...
4. Reserve hugepages on NUMA node 5 (and node 0)
# echo 20 > /sys/devices/system/node/node5/hugepages/hugepages-1048576kB/nr_hugepages
# echo 10 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
5. Bind NICs to VFIO
# modprobe vfio
# modprobe vfio-pci
# dpdk-devbind --bind=vfio-pci 0000:a3:00.0
# dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0000:a3:00.0 'Ethernet Controller 10-Gigabit X540-AT2 1528' drv=vfio-pci unused=ixgbe
...
6. Boot OVS and reserve hugepages from NUMA node 5.
# cat boot_ovs_client.sh
#!/bin/bash
set -e
echo "killing old ovs process"
pkill -f ovs-vswitchd || true
sleep 5
pkill -f ovsdb-server || true
echo "probing ovs kernel module"
modprobe -r openvswitch || true
modprobe openvswitch
echo "clean env"
DB_FILE=/etc/openvswitch/conf.db
rm -rf /var/run/openvswitch
mkdir /var/run/openvswitch
rm -f $DB_FILE
echo "init ovs db and boot db server"
export DB_SOCK=/var/run/openvswitch/db.sock
ovsdb-tool create /etc/openvswitch/conf.db /usr/share/openvswitch/vswitch.ovsschema
ovsdb-server --remote=punix:$DB_SOCK --remote=db:Open_vSwitch,Open_vSwitch,manager_options --pidfile --detach --log-file
ovs-vsctl --no-wait init
echo "start ovs vswitch daemon"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,0,0,0,0,1024"
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask="0x1"
ovs-vsctl --no-wait set Open_vSwitch . other_config:vhost-iommu-support=true
ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file=/var/log/openvswitch/ovs-vswitchd.log
echo "creating bridge and ports"
ovs-vsctl --if-exists del-br ovsbr0
ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:a3:00.0
ovs-vsctl add-port ovsbr0 vhost-user0 -- set Interface vhost-user0 type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhostuser0.sock
ovs-ofctl del-flows ovsbr0
ovs-ofctl add-flow ovsbr0 "in_port=1,idle_timeout=0 actions=output:2"
ovs-ofctl add-flow ovsbr0 "in_port=2,idle_timeout=0 actions=output:1"
ovs-vsctl set Open_vSwitch . other_config={}
ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x1
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xC00
ovs-vsctl set Interface dpdk0 options:n_rxq=1
echo "all done"
# ovs-vsctl show
4051b115-51b7-44fc-a1ea-5d54296876f9
Bridge ovsbr0
datapath_type: netdev
Port ovsbr0
Interface ovsbr0
type: internal
Port vhost-user0
Interface vhost-user0
type: dpdkvhostuserclient
options: {vhost-server-path="/tmp/vhostuser0.sock"}
Port dpdk0
Interface dpdk0
type: dpdk
options: {dpdk-devargs="0000:a3:00.0", n_rxq="1"}
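Before booting the guest it can be useful to confirm that the PMD threads picked up the dpdk0 receive queue, and, once QEMU is running, that the vhost-user interface connected; both are standard OVS-DPDK commands, shown here as a sketch:
# ovs-appctl dpif-netdev/pmd-rxq-show
# ovs-vsctl get Interface vhost-user0 status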
7. Boot the QEMU SEV guest with 1 vhost-user port. The full XML will be attached in the next comment.
8. In the guest, set a temporary IP on the vhost-user NIC (the iproute2 equivalent is sketched below).
# ifconfig enp6s0 192.168.1.1/24
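ifconfig comes from the legacy net-tools package; on a minimal RHEL 8 guest the iproute2 equivalent may be needed instead (a sketch, assuming the same interface name):
# ip link set enp6s0 up
# ip addr add 192.168.1.1/24 dev enp6s0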
9. Do ping testing with another host (in this setup, the two hosts are connected back-to-back). Ping works; however, there is 30% packet loss. This issue also exists without SEV, so it is not an SEV issue. I'll confirm and possibly file another new bug to track this packet loss issue.
# ping 192.168.1.2 -c 20 -i 0.1
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=0.118 ms
64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=0.149 ms
64 bytes from 192.168.1.2: icmp_seq=5 ttl=64 time=0.141 ms
64 bytes from 192.168.1.2: icmp_seq=6 ttl=64 time=0.147 ms
64 bytes from 192.168.1.2: icmp_seq=7 ttl=64 time=0.107 ms
64 bytes from 192.168.1.2: icmp_seq=9 ttl=64 time=0.140 ms
64 bytes from 192.168.1.2: icmp_seq=10 ttl=64 time=0.145 ms
64 bytes from 192.168.1.2: icmp_seq=11 ttl=64 time=0.150 ms
64 bytes from 192.168.1.2: icmp_seq=13 ttl=64 time=0.137 ms
64 bytes from 192.168.1.2: icmp_seq=14 ttl=64 time=0.147 ms
64 bytes from 192.168.1.2: icmp_seq=16 ttl=64 time=0.124 ms
64 bytes from 192.168.1.2: icmp_seq=17 ttl=64 time=0.088 ms
64 bytes from 192.168.1.2: icmp_seq=18 ttl=64 time=0.108 ms
64 bytes from 192.168.1.2: icmp_seq=20 ttl=64 time=0.134 ms
--- 192.168.1.2 ping statistics ---
20 packets transmitted, 14 received, 30% packet loss, time 979ms
rtt min/avg/max/mdev = 0.088/0.131/0.150/0.019 ms
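Beyond ping, a simple throughput check over the same link can be done with iperf3, in the spirit of the "normal Linux network tests" suggested earlier; this is a generic illustration rather than part of the recorded verification, and 192.168.1.2 is the peer host from the ping test above.
On the peer host:
# iperf3 -s
In the guest:
# iperf3 -c 192.168.1.2 -t 30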
Other versions info:
4.18.0-191.el8.x86_64
qemu-kvm-4.2.0-15.module+el8.2.0+6029+618ef2ec.x86_64
python3-libvirt-6.0.0-1.module+el8.2.0+5453+31b2b136.x86_64
openvswitch2.13-2.13.0-9.el8fdp.x86_64
Created attachment 1673074
SEV guest with vhost-user
As in Comment 10, I would like to move this bug to 'VERIFIED'. David, please let me know if you have any concerns about the verification in Comment 10.

Yep, that looks good to me - thanks! We've got a couple of good test failures out of this; just not the ones I was expecting!

(In reply to Pei Zhang from comment #10)
[...]
> 9. Do ping testing with another host (in this setup, the two hosts are
> connected back-to-back). Ping works; however, there is 30% packet loss.
> This issue also exists without SEV, so it is not an SEV issue. I'll confirm
> and possibly file another new bug to track this packet loss issue.

Update: This ping loss is expected with the scenario "vhost-user + vIOMMU + kernel virtio-net driver in guest". This was explained by Maxime in Bug 1572879#c13:

"This issue happens when using vhost-user with vIOMMU enabled and with the kernel virtio-net driver in the guest. This combination is not recommended for performance reasons[0], but the problem is real and should be fixed in the future. I think we can put a low priority on this bug.

[0]: This setup is not recommended because, when using the kernel driver in the guest, performance with vIOMMU enabled is very bad: the kernel driver does dynamic mapping, which creates huge overhead on the vhost-user backend side, as every packet results in an IOTLB cache miss. The use of vIOMMU with a vhost-user backend is recommended when using the DPDK virtio PMD in the guest; in that case it provides the guest with user-application and kernel isolation, while the overhead is close to nil."

After a mail discussion with David, since there is packet loss during the testing in Comment 10, we should fail ON_QA. Moving back to Assigned status.

OK, given that SEV has some problems with vhost-user due to the IOMMU behaviour, I'm turning this from a test-only bug into a full bug. However, I'm not sure we need it to work with vhost-user imminently, so let's leave this bug on the backlog.

Based on comment 16, adding Triaged so that it's really in the backlog.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.