Bug 1710357
| Summary: | cx5: poor ovs hw offload performance | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Amit Supugade <asupugad> |
| Component: | openvswitch | Assignee: | Alaa Hleihel (NVIDIA Mellanox) <ahleihel> |
| openvswitch sub component: | ovs-hw-offload | QA Contact: | Amit Supugade <asupugad> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | ahleihel, atragler, ctrautma, fbaudin, mleitner, qding, rkhan |
| Version: | FDP 19.C | Keywords: | Regression, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-28 17:11:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Amit Supugade
2019-05-15 12:07:19 UTC
Hi,
Performance on 19.B, for 100G CX5 - 30628852
Job link - https://beaker.engineering.redhat.com/jobs/3448569
Issue is also present in 2.11.
2.11 job link - https://beaker.engineering.redhat.com/jobs/3535660
8456670 fps

Interesting that the numbers are still higher than the sw datapath, and the skip_sw tests prove that. What about the kernel version, was it the same on the tests with 19.B and 19.C?

Kernel Info -
19.B on 3.10.0-957.10.1.el7.x86_64
19.C on 3.10.0-957.el7.x86_64

I am trying out a few more combinations of kernel and ovs to see what results I get.

(In reply to Amit Supugade from comment #6)
> Kernel Info -
> 19.B on 3.10.0-957.10.1.el7.x86_64
> 19.C on 3.10.0-957.el7.x86_64

Or the other way around? Or was 19.C really tested with an older kernel?

> I am trying out a few more combinations of kernel and ovs to see what
> results I get.

Cool, thanks.

3.10.0-957.el7.x86_64 is the 7.6 GA kernel, so I ran tests on 19C with this kernel.
Additional results:
19B on 3.10.0-957.el7.x86_64 = Pass
19C on 3.10.0-957.10.1.el7.x86_64 = Fail
So basically, 19B passed and 19C failed on both of the above kernels.

So using either kernel, OVS 19B always passes and OVS 19C always fails? Meaning the issue happens only when switching between openvswitch versions, regardless of the kernel version?
Just trying to understand on which component we should focus the debug.
Also, please provide the exact RPM version numbers (for 19B and 19C), I am not sure how to map between them :)
Thanks,
Alaa

(In reply to Alaa Hleihel from comment #9)
> So using either kernel; OVS 19B always passes and OVS 19C always fails ?
> meaning the issue happens only when switching between openvswitch versions
> regardless of the kernel version ?

correct

> Just trying to understand on which component we should focus the debug.
> Also, please provide the exact RPM version numbers (for 19B and 19C), I am
> not sure how to map between them :)

19B - openvswitch-2.9.0-101.el7fdp.x86_64.rpm
19C - openvswitch-2.9.0-106.el7fdp.x86_64.rpm

> Thanks
> Alaa

Hi Alaa, any updates?

Hi Amit, there are only 5 versions in there. Can you please try to bisect it to a specific release?

%changelog
+* Thu Apr 18 2019 Lorenzo Bianconi <lorenzo.bianconi> - 2.9.0-106
+- Backport "OVN: fix DVR Floating IP support" (#1671776)
+
+* Tue Apr 09 2019 Timothy Redaelli <tredaelli> - 2.9.0-105
+- Fix missing dependencies for ovs-tcpdump (#1651232)
+
+* Fri Apr 05 2019 Timothy Redaelli <tredaelli> - 2.9.0-104
+- Add "Obsoletes: python-openvswitch < 2.9.0-57" to avoid yum to fail
+  on openstack during upgrade from .noarch to .arch (#1696340)
+
+* Tue Mar 26 2019 Numan Siddique <nusiddiq> - 2.9.0-103
+- Backport fixes for #1677616 (pinctrl thread) and fixes related to IPv6 RA.
+
+* Tue Mar 26 2019 Jakub Libosvar <libosvar> - 2.9.0-102
+- Backport "Add unixctl option for ovn-northd" (#1687480)
+
* Thu Mar 14 2019 Timothy Redaelli <tredaelli> - 2.9.0-101

(In reply to Marcelo Ricardo Leitner from comment #12)
> Hi Amit, there are only 5 versions in there. Can you please try to bisect it
> to a specific release?

yes, please, that will be great.

I wanted to check the diff between the versions but had to jump back to another issue.. From a quick look I am not sure how these patches can affect mlx5 only; are other vendors not affected?

$ diff -ru 19B 19C | diffstat
 BUILD/openvswitch-2.9.0/lib/automake.mk | 3
 BUILD/openvswitch-2.9.0/lib/automake.mk.orig |only
 BUILD/openvswitch-2.9.0/lib/packets.h | 3
 BUILD/openvswitch-2.9.0/lib/unixctl.xml |only
 BUILD/openvswitch-2.9.0/ovn/controller/pinctrl.c | 731 +++++++---
 BUILD/openvswitch-2.9.0/ovn/lib/actions.c | 20
 BUILD/openvswitch-2.9.0/ovn/lib/ovn-l7.h | 3
 BUILD/openvswitch-2.9.0/ovn/northd/ovn-northd.8.xml | 58
 BUILD/openvswitch-2.9.0/ovn/northd/ovn-northd.c | 125 +
 BUILD/openvswitch-2.9.0/ovn/northd/ovn-northd.c.orig | 199 ++
 BUILD/openvswitch-2.9.0/tests/ovn-northd.at | 39
 BUILD/openvswitch-2.9.0/tests/ovn.at | 10
 SOURCES/.0001-Add-unixctl-option-for-ovn-northd.patch.swp |only
 SOURCES/0001-Add-unixctl-option-for-ovn-northd.patch |only
 SOURCES/0001-OVN-fix-DVR-Floating-IP-support.patch |only
 SOURCES/0001-ovn-pinctrl-Pass-struct-rconn-swconn-to-all-the-func.patch |only
 SOURCES/0002-ovn-controller-Add-a-new-thread-in-pinctrl-module-to.patch |only
 SOURCES/0003-OVN-Use-offset-instead-of-pointer-into-ofpbuf.patch |only
 SOURCES/0004-OVN-Always-send-prefix-option-in-RAs.patch |only
 SOURCES/0005-OVN-Make-periodic-RAs-consistent-with-RA-responder.patch |only
 SPECS/openvswitch.spec | 35
 21 files changed, 1016 insertions(+), 210 deletions(-)

Hi,
Based on the results of the tests I ran, it looks like we are getting low performance only if I attach the VF to the VM using 'virsh attach-device <VM_NAME> <VF.xml>'. If I define the VM using xml, the performance is as expected.
19C with VM defined from xml - https://beaker.engineering.redhat.com/jobs/3551606
If I use virsh attach-device, the test also fails on netronome. I am running more tests on netronome. Will update the results here. Thanks!

(In reply to Amit Supugade from comment #15)
> Hi,
> Based on results of tests I ran, it looks like we are getting low
> performance only if I attach VF to VM using 'virsh attach-device <VM_NAME>
> <VF.xml>'. If I define VM using xml, the performance is as expected.
> 19C with VM defined from xml-
> https://beaker.engineering.redhat.com/jobs/3551606
> If I use virsh attach-device, test also fails on netronome. I am running
> more tests on netronome. Will update the results here. Thanks!

Thanks a lot for the update Amit! This is interesting..
Is there also a difference between the outputs of "virsh dumpxml <VM_NAME>" after attaching the VF using the 2 methods?

Hi,
Below is the difference between the VM xmls. mlx_master.xml => gives the expected performance.

[root@netqe28 ~]# diff mlx_master.xml attach.xml
1c1
< <domain type='kvm' id='3'>
---
> <domain type='kvm' id='1'>
3,11c3,5
< <uuid>5fcb42cc-eb08-475a-b574-6f8aaa10bdd0</uuid>
< <memory unit='KiB'>4194304</memory>
< <currentMemory unit='KiB'>4194304</currentMemory>
< <memoryBacking>
< <hugepages>
< <page size='1048576' unit='KiB' nodeset='0'/>
< </hugepages>
< <access mode='shared'/>
< </memoryBacking>
---
> <uuid>5318f61f-863a-4345-89d1-0f5d543a0042</uuid>
> <memory unit='KiB'>8388608</memory>
> <currentMemory unit='KiB'>8388608</currentMemory>
13,18d6
< <cputune>
< <vcpupin vcpu='0' cpuset='4'/>
< <vcpupin vcpu='1' cpuset='6'/>
< <vcpupin vcpu='2' cpuset='14'/>
< <emulatorpin cpuset='4'/>
< </cputune>
30,34c18,28
< <cpu mode='host-passthrough'>
< <feature policy='require' name='tsc-deadline'/>
< <numa>
< <cell id='0' cpus='0-2' memory='4194304' unit='KiB' memAccess='shared'/>
< </numa>
---
> <cpu mode='custom' match='exact' check='full'>
> <model fallback='forbid'>Skylake-Client-IBRS</model>
> <feature policy='require' name='avx512f'/>
> <feature policy='require' name='avx512dq'/>
> <feature policy='require' name='clwb'/>
> <feature policy='require' name='avx512cd'/>
> <feature policy='require' name='avx512bw'/>
> <feature policy='require' name='avx512vl'/>
> <feature policy='require' name='pdpe1gb'/>
> <feature policy='require' name='hypervisor'/>
> <feature policy='disable' name='arat'/>
85c79
< <mac address='52:54:00:8a:78:09'/>
---
> <mac address='52:54:00:3d:36:df'/>
90c84
< <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
---
> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
99c93
< <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
---
> <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0'/>
108c102
< <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
---
> <address type='pci' domain='0x0000' bus='0x00' slot='0x11' function='0x0'/>
130c124
< <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-3-master/org.qemu.guest_agent.0'/>
---
> <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-1-master/org.qemu.guest_agent.0'/>
143c137
< <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
---
> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
144a139,143
> <rng model='virtio'>
> <backend model='random'>/dev/urandom</backend>
> <alias name='rng0'/>
> <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
> </rng>
147,148c146,147
< <label>system_u:system_r:svirt_t:s0:c718,c898</label>
< <imagelabel>system_u:object_r:svirt_image_t:s0:c718,c898</imagelabel>
---
> <label>system_u:system_r:svirt_t:s0:c569,c1017</label>
> <imagelabel>system_u:object_r:svirt_image_t:s0:c569,c1017</imagelabel>
[root@netqe28 ~]#

If I'm reading this correctly, the attach.xml doesn't have the correct tunings in the xml which the mlx_master.xml does.
< <cputune>
< <vcpupin vcpu='0' cpuset='4'/>
< <vcpupin vcpu='1' cpuset='6'/>
< <vcpupin vcpu='2' cpuset='14'/>
< <emulatorpin cpuset='4'/>
< </cputune>

If this is correct, it would explain the performance drop: without the CPU pinning, the VM isn't properly isolated and would be taking context switches all over its virtual CPUs. There are also other issues, such as hugepages and the CPU mode, that would cause problems as well. Honestly, I don't think this is a bug; it is more a matter of needing to perform additional tuning steps after using the virsh attach method.

Performance is as expected with the correct tuning in the xml. Closing the bug.
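As an aside, the bisection Marcelo asked for (finding which of the five builds between 2.9.0-101 and 2.9.0-106 introduced a regression) is a plain binary search over the version list. A minimal sketch, where `is_good` is a stand-in for an actual test run (e.g. install the RPM and measure fps) and the "-104" cutoff below is purely illustrative, not the real culprit:

```python
def first_bad(versions, is_good):
    """Binary-search for the first failing version.

    Assumes versions[0] is known-good, versions[-1] is known-bad, and the
    regression is monotonic (once a build fails, every later build fails).
    """
    lo, hi = 0, len(versions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid   # regression landed after mid
        else:
            hi = mid   # regression landed at or before mid
    return versions[hi]

builds = ["2.9.0-101", "2.9.0-102", "2.9.0-103",
          "2.9.0-104", "2.9.0-105", "2.9.0-106"]

# Pretend, for illustration only, that the regression landed in -104:
print(first_bad(builds, lambda v: int(v.rsplit("-", 1)[1]) < 104))  # → 2.9.0-104
```

With five candidate builds this converges in at most three test runs instead of five.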
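The root-cause check above (comparing `virsh dumpxml` output for the missing tunings) can be automated. A hypothetical sketch that parses a domain XML and flags the three tunings mlx_master.xml carried but attach.xml lacked; the sample `attached` XML here is a made-up minimal domain, not the real attach.xml:

```python
import xml.etree.ElementTree as ET

def missing_tunings(domain_xml):
    """Return a list of performance tunings absent from a libvirt domain XML."""
    root = ET.fromstring(domain_xml)
    problems = []
    if root.find("cputune/vcpupin") is None:
        problems.append("no <cputune>/<vcpupin>: vCPUs not pinned")
    if root.find("memoryBacking/hugepages") is None:
        problems.append("no <memoryBacking>/<hugepages>: guest RAM not backed by hugepages")
    cpu = root.find("cpu")
    if cpu is None or cpu.get("mode") != "host-passthrough":
        problems.append("cpu mode is not host-passthrough")
    return problems

# Minimal stand-in for the hot-plugged guest's XML:
attached = "<domain type='kvm'><cpu mode='custom'/></domain>"
for p in missing_tunings(attached):
    print(p)
```

Run against the real `virsh dumpxml <VM_NAME>` output, an empty result would indicate the guest has the isolation tunings this bug turned out to hinge on.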