Description of problem:
Due to a new kernel patch [1], the PF and VF representors are now linked to their parent PCI device.

Old structure:
<vf-pci-address>/physfn/net contains only the PF of that VF.

$ ls /sys/bus/pci/devices/<vf-pci-address>/physfn/net/
enp2s0f0

$ ls -l /sys/class/net
...
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_0 -> ../../devices/virtual/net/enp2s0f0_0
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_1 -> ../../devices/virtual/net/enp2s0f0_1
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_2 -> ../../devices/virtual/net/enp2s0f0_2
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_3 -> ../../devices/virtual/net/enp2s0f0_3
...

New structure:
<vf-pci-address>/physfn/net contains the PF of that VF and the VF representors.

$ ls /sys/bus/pci/devices/<vf-pci-address>/physfn/net/
enp3s0f0 enp3s0f0_0 enp3s0f0_1 enp3s0f0_2 enp3s0f0_3

$ ls -l /sys/class/net
...
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_0
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_1 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_1
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_2 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_2
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_3 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_3
...

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1

Version-Release number of selected component (if applicable):
We need to update os-vif to support it as well.

How reproducible:

Steps to Reproduce:
1. Create a direct port with switchdev capabilities
2. Boot a VM with that port

Actual results:
os-vif will select a VF representor instead of the PF.

Expected results:
The VM should boot successfully.

Additional info:
We need to merge [1] and [2]; they are already merged to master.
[1] - https://review.opendev.org/c/openstack/os-vif/+/765912
[2] - https://review.opendev.org/c/openstack/os-vif/+/765970
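For illustration only, the selection problem described above can be sketched in Python. This is not the os-vif implementation. The helper name `pick_pf_netdev`, the `sysfs_root` parameter, and the naming convention are assumptions made for this example: it assumes VF representors report a `phys_port_name` such as `pf0vf3` (or a bare VF index such as `3`), while the PF/uplink reports something like `p0` or nothing at all.

```python
import os
import re

# Assumption: VF representors expose a phys_port_name such as "pf0vf3" or a
# bare VF index such as "3"; the PF/uplink exposes "p0" or no name at all.
_VF_REP_RE = re.compile(r"^(pf\d+vf\d+|\d+)$")


def _read_phys_port_name(netdev_dir):
    """Return the netdev's phys_port_name, or None if it cannot be read."""
    try:
        with open(os.path.join(netdev_dir, "phys_port_name")) as f:
            return f.read().strip() or None
    except OSError:
        return None


def pick_pf_netdev(vf_pci_addr, sysfs_root="/sys"):
    """Hypothetical helper: name of the PF netdev behind a VF, or None."""
    net_dir = os.path.join(
        sysfs_root, "bus/pci/devices", vf_pci_addr, "physfn/net")
    try:
        candidates = sorted(os.listdir(net_dir))
    except OSError:
        return None
    for name in candidates:
        port = _read_phys_port_name(os.path.join(net_dir, name))
        # Skip anything that looks like a VF representor.
        if port is None or not _VF_REP_RE.match(port):
            return name
    return None
```

With the old structure the directory holds a single entry, so the filter is a no-op; with the new structure it skips the `_0` to `_3` representor entries instead of returning whichever one happens to be listed first.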
The OSP 16.x releases are pinned to specific RHEL versions: 16.1 is pinned to RHEL 8.2 and 16.2 will be pinned to 8.4. This will only be relevant to backport to 16.2 if the corresponding kernel patches are backported to 8.4.

My understanding is that we are going to backport this upstream to Train. I have no issue with importing that backport into 16.2 when it's completed upstream, but I'm not sure this will be required. OSP 17 will be based on Wallaby and RHEL 9 and will have this change by default.

Moshe, can you confirm whether the kernel 5.8 patch has been targeted for inclusion in RHEL 8.4 and whether there is a bug tracking it? I'll quickly check to see if I can find one and update this if I do.

For 16.0 and 16.1 this should not be required, so this would be for 16.2 only.

Setting dev conditional NACK (design) until we determine whether this is needed and whether it will be included in RHEL 8.4.
OK, clearing needinfo and devnack. The kernel backport will break the userspace ABI and require us to change our layered product to avoid a regression. I'm kind of surprised this was accepted into RHEL given the API breakage, but since it has been, we need the backport.
Thanks Sean,

But in general, can we backport these patches to older OpenStack versions so that customers who use other deployment tools won't break as well? I am looping in Saravanan KR <skramaja> just to make sure that we can get the RH folks to review the backport.
I'll bring that question up at our bug triage call later today.

From a Red Hat product perspective I don't think it makes sense to backport it in OSP to older versions. 16.1.x is only supported on RHEL 8.2 and will not have the kernel change. From OSP 13 onward we require that all deployments are done using ooo/OSP director; otherwise they are unsupported.

Backporting it to 16.1 in addition to 16.2 is not much more work, but 15 and 14 are both EOL and won't have new releases or security updates. 13 is RHEL 7 based and won't have the kernel backports either, so I'm not sure it really makes sense. What release/tool/OS version combination did you have in mind?

Personally I'm surprised we backported the kernel changes to 8.4 at all, given it's a userspace API break. I severely doubt this is something that should be backported to RHEL 7, as it will break hardware-offloaded OVS in layered products based on RHEL, so it's a certification issue for vendors that support it too.

This is not a flat no, but before we look at bringing this back before 16.2 I would like to know the business justification/use case that it is enabling. We won't be supporting vDPA before OSP 17, and we won't be supporting subfunctions before 18 at the earliest, since that is not on our roadmap at all yet in OpenStack. So the main reason for the backport, I assume, is to support the ConnectX-6 Dx and Lx NICs. They will only be usable in older releases for basic hardware offload with VLAN/flat networks, and the newer functionality introduced in that line will largely be unused until a newer version of OVS and OpenStack are deployed on them.

In any case I'll take a look at the backport upstream and do a pre-emptive backport downstream, probably next week, as I don't think I'll get to it today. If you can let me know here or on IRC where you think this is useful beyond 16.2, I can take a look at that too.
We have customers that use old OpenStack but install Mellanox OFED, and without this change it will break. I understand it makes less sense to backport a fix for a new kernel into an old OpenStack distro. If it is possible to backport them to the Train/Stein releases from an upstream perspective, that would be great.
Hi,

> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1

libvirt in RHEL-8 was fixed/updated to support this kernel change. So, we're adding this driver change to RHEL-8.5 in BZ #1959367.

Please confirm that the requested patches in this BZ will be in OSP 16.2.

Thanks,
Alaa
Hi Alaa, Moshe,

I have an environment with the latest compose of 16.2 (RHEL 8.4) with the versions below on the compute nodes.

[root@overcloud-r640compute-0 heat-admin]# rpm -q kernel
kernel-4.18.0-305.el8.x86_64

[root@overcloud-r640compute-0 /]# rpm -q qemu-kvm    <<<From nova_libvirt container
qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.x86_64

[root@overcloud-r640compute-0 /]# rpm -qa | grep virt    <<<From nova_libvirt container
libvirt-bash-completion-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-client-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-libs-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-driver-interface-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-kvm-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
python3-libvirt-7.0.0-1.module+el8.4.0+9469+2eaf72bc.x86_64
libvirt-admin-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-driver-qemu-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-driver-network-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
[root@overcloud-r640compute-0 /]#

From BZ#1959367:
"In RHEL-8.4 we skipped an mlx5 patch [1] that linked the VF representors to PF PCI device since user-space packages were not ready for that change."

So, the kernel change regarding the VF structure is not present in the kernel I am using.

[root@overcloud-r640compute-0 devices]# ls /sys/bus/pci/devices/0000\:5e\:00.5/physfn/net
ens2f0

[root@overcloud-r640compute-0 devices]# ls -l /sys/class/net | grep ens2f0
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0 -> ../../devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/ens2f0
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_0 -> ../../devices/virtual/net/ens2f0_0
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_1 -> ../../devices/virtual/net/ens2f0_1
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_2 -> ../../devices/virtual/net/ens2f0_2
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_3 -> ../../devices/virtual/net/ens2f0_3
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_4 -> ../../devices/virtual/net/ens2f0_4
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_5 -> ../../devices/virtual/net/ens2f0_5
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_6 -> ../../devices/virtual/net/ens2f0_6
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_7 -> ../../devices/virtual/net/ens2f0_7
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_8 -> ../../devices/virtual/net/ens2f0_8
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_9 -> ../../devices/virtual/net/ens2f0_9
[root@overcloud-r640compute-0 devices]#

libvirt works successfully when creating an instance, as the libvirt changes are present in the compose.

<interface type='hostdev' managed='yes'>
  <mac address='fa:16:3e:36:6c:13'/>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x5e' slot='0x00' function='0x7'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</interface>

[root@localhost ~]# ethtool -i eth1    <<<<<<<<<<<<<<<Instance having switchdev VF
driver: mlx5_core
version: 5.0-0
firmware-version: 16.27.6106 (DEL0000000015)
expansion-rom-version:
bus-info: 0000:00:05.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@localhost ~]#

So, since the kernel change is not present in 16.2, do you think os-vif (the python-os-vif-1.17.0-2.20210602134810.3a08cc4.el8ost rpm, yet to be included in 16.2) is mandatory, given that the VM is able to spawn successfully?

One more thing: in my current 16.2 environment (os-vif patch is not present), I see that all traffic for the switchdev VF takes the OVS kernel datapath. It is not even handled in tc software. The management traffic is steered via tc software.

This behavior is observed with VLAN and Geneve, with and without bond configurations.

Looking at the OVS logs:

2021-06-11T06:47:58.287Z|00155|dpif_netlink(handler2)|ERR|failed to offload flow: Invalid argument: ens1f1    <<<<<<<<<<<<<<Management traffic
2021-06-11T06:47:58.286Z|00154|dpif_netlink(handler2)|ERR|failed to offload flow: Invalid argument: ovn-f29-h1-1
2021-06-11T06:47:58.270Z|00153|dpif_netlink(handler2)|ERR|failed to offload flow: Invalid argument: ens2f0_5    <<<<<<<<<<<<<<<<<<switchdev VF

ens1f1:

ufid:7ab5f22e-6289-43a2-aff8-a26c9040d3a2, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,id=0/0),eth(src=40:a6:b7:2b:a6:e1,dst=ac:1f:6b:7d:14:b1),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:6429, bytes:632347, used:0.000s, dp:tc, actions:ens1f1

ufid:c258b88d-59ef-4b2a-847a-4f69f9c512c9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens1f1),packet_type(ns=0/0,id=0/0),eth(src=ac:1f:6b:7d:14:b1,dst=40:a6:b7:2b:a6:e1),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:7539, bytes:441865, used:0.000s, dp:tc, actions:br-ex

ens2f0_5:

ufid:ea3e3ff5-5d41-4d41-8eef-8c0db898e86d, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x4,src=172.17.2.57,dst=172.17.2.46,ttl=0/0,geneve({class=0x102,type=0x80,len=4,0x30002/0x7fffffff}),flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:22:f5:57,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:558, bytes:54684, used:0.314s, dp:ovs, actions:ens2f0_5

ufid:694fa8d6-37f8-489b-b44f-1e05eeb1c3b7, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(ens2f0_5),skb_mark(0/0),ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:36:6c:13,dst=fa:16:3e:22:f5:57),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=192.168.2.16/255.255.255.240,proto=0/0,tos=0/0x3,ttl=0/0,frag=no), packets:558, bytes:54684, used:0.314s, dp:ovs, actions:set(tunnel(tun_id=0x4,dst=172.17.2.57,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003}),flags(df|csum|key))),genev_sys_6081

BTW, shouldn't the PF be using the mlx5_core driver? I see the representor driver for the PF. I remember it being mlx5_core earlier, unless something changed recently.

[root@overcloud-r640compute-0 devices]# ethtool -i ens2f0
driver: mlx5e_rep    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
version: 4.18.0-305.el8.x86_64
firmware-version: 16.27.6106 (DEL0000000015)
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@overcloud-r640compute-0 devices]#

I believe things are not in order in 16.2 right now. While I dig further, would you please let us know whether any missing patches are breaking the feature? We have not yet tried conntrack offload (supposed to work in RHEL 8.4 barring a few flags) since we are not able to make basic offload work.

Do you think we need the kernel-modules-extra rpm as well? On the overcloud image, this rpm won't be present unless required; we would need to rebuild the image in that case.

Since this is a GA feature and a break is considered a regression, we need your support.

Thanks
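To make the datapath observation above easier to check, here is a minimal Python sketch that counts flows per input port and datapath (`dp:tc` versus `dp:ovs`). It assumes the input is datapath flow dump text with one flow per line, in the same shape as the flows pasted above (for example as printed by `ovs-appctl dpctl/dump-flows -m`). The function name `summarize_datapaths` is hypothetical.

```python
import re
from collections import Counter

# "dp:tc" / "dp:ovs" marks which datapath holds the flow in the dump text.
_DP_RE = re.compile(r"\bdp:(\w+)")
_IN_PORT_RE = re.compile(r"\bin_port\(([^)]+)\)")


def summarize_datapaths(flow_dump):
    """Count flows per (in_port, dp) pair in datapath flow dump text."""
    counts = Counter()
    for line in flow_dump.splitlines():
        dp = _DP_RE.search(line)
        if dp is None:
            continue  # not a flow line (blank line, port header, ...)
        in_port = _IN_PORT_RE.search(line)
        port = in_port.group(1) if in_port else "?"
        counts[(port, dp.group(1))] += 1
    return counts
```

Run against the four flows above, this would report the `br-ex` and `ens1f1` flows under `tc` and the `genev_sys_6081` and `ens2f0_5` flows under `ovs`, which is the split described in this comment.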
Also, I tried using the brew RPM of os-vif and updating nova_compute (and eventually all nova containers) on the compute node. The problem still persists. So I'm not sure that having the os-vif fix makes any positive difference to the functionality.
(I'm leaving all other questions to Nvidia)

I am not sure why CT is being involved here, btw. It doesn't seem to be very related to the original bug, but let's go:

(In reply to Haresh Khandelwal from comment #12)
> One more thing: in my current 16.2 environment (os-vif patch is not
> present), I see that all traffic for the switchdev VF takes the OVS kernel
> datapath. It is not even handled in tc software.
> The management traffic is steered via tc software.
>
> This behavior is observed with VLAN and Geneve, with and without bond
> configurations.
>
> Looking at the OVS logs:
>
> 2021-06-11T06:47:58.287Z|00155|dpif_netlink(handler2)|ERR|failed to offload
> flow: Invalid argument: ens1f1 <<<<<<<<<<<<<<Management traffic
> 2021-06-11T06:47:58.286Z|00154|dpif_netlink(handler2)|ERR|failed to offload
> flow: Invalid argument: ovn-f29-h1-1
> 2021-06-11T06:47:58.270Z|00153|dpif_netlink(handler2)|ERR|failed to offload
> flow: Invalid argument: ens2f0_5 <<<<<<<<<<<<<<<<<<switchdev VF

These should be because:

> ens2f0_5:
>
> ufid:ea3e3ff5-5d41-4d41-8eef-8c0db898e86d,
> recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x4,src=172.17.2.
> 57,dst=172.17.2.46,ttl=0/0,geneve({class=0x102,type=0x80,len=4,0x30002/
> 0x7fffffff}),flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),
> ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:
           ^^^^^^
> 22:f5:57,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),
> ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,
> frag=no), packets:558, bytes:54684, used:0.314s, dp:ovs, actions:ens2f0_5
>
> ufid:694fa8d6-37f8-489b-b44f-1e05eeb1c3b7,
> recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(ens2f0_5),skb_mark(0/0),
> ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:
           ^^^^^^
> 36:6c:13,dst=fa:16:3e:22:f5:57),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,
> dst=192.168.2.16/255.255.255.240,proto=0/0,tos=0/0x3,ttl=0/0,frag=no),
> packets:558, bytes:54684, used:0.314s, dp:ovs,
> actions:set(tunnel(tun_id=0x4,dst=172.17.2.57,ttl=64,tp_dst=6081,
> geneve({class=0x102,type=0x80,len=4,0x20003}),flags(df|csum|key))),
> genev_sys_6081

In -305.el8 we have the downstream commit
97fdc396c46d ("net/sched: cls_flower: Reject invalid ct_state flags rules")
but not its fix yet:
afa536d8405a ("net/sched: cls_flower: fix only mask bit check in the validate_ct_state")
which is landing via https://bugzilla.redhat.com/show_bug.cgi?id=1965457#c1

Unfortunately it's not available in an official build yet, only in the scratch build:
https://bugzilla.redhat.com/show_bug.cgi?id=1965457#c6

There should be an 8.4.z build next week with all the CT HWOL fixes collected so far, btw, and we can use it for testing.
@Haresh, this backport fixes an issue when spawning a VM with SR-IOV in switchdev mode when the kernel has this fix [1]. If your kernel doesn't have this fix, you won't encounter the issue. To support kernels that have fix [1], we need this change and the libvirt change [2]. We want to backport it to support kernels with fix [1].

Regarding the offload issue, I believe it is tracked in a different BZ and, as @Marcelo said, it is not related to this BZ.

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1
[2] - https://bugzilla.redhat.com/show_bug.cgi?id=1959367
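One way to tell whether a given host has the kernel behaviour from [1] is to look at where a representor's `/sys/class/net` entry resolves, as in the listings in the description. The following Python sketch does that. The function name `representor_layout`, the `sysfs_root` parameter, and the three return labels are assumptions made for this example, not part of os-vif or any other tool.

```python
import os


def representor_layout(netdev, sysfs_root="/sys"):
    """Classify where <sysfs_root>/class/net/<netdev> resolves.

    Returns "virtual" when the netdev lives under devices/virtual (the old
    layout for representors), "pci" when it lives under a PCI root such as
    devices/pci0000:00 (the new layout), "other" for anything else, and
    None when the netdev does not exist.
    """
    link = os.path.join(sysfs_root, "class/net", netdev)
    if not os.path.exists(link):
        return None
    real = os.path.realpath(link)
    devices = os.path.realpath(os.path.join(sysfs_root, "devices"))
    top = os.path.relpath(real, devices).split(os.sep)[0]
    if top == "virtual":
        return "virtual"
    if top.startswith("pci"):
        return "pci"
    return "other"
```

On the 16.2 compute node shown earlier in this bug, `ens2f0_0` resolves under `devices/virtual/net`, so a check like this would return "virtual" there.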
Hi, Haresh.

> So, since the kernel change is not present in 16.2, do you think os-vif (the python-os-vif-1.17.0-2.20210602134810.3a08cc4.el8ost rpm, yet to be included in 16.2) is mandatory, given that the VM is able to spawn successfully?

As I understand, yes, the team asked to add the fix in 16.2. This issue also blocked us when using MLNX_OFED, where we wanted to verify things in preparation for eventually using inbox drivers. So please add the fix to 16.2.

> BTW, shouldn't the PF be using the mlx5_core driver? I see the representor driver for the PF. I remember it being mlx5_core earlier, unless something changed recently.
> [root@overcloud-r640compute-0 devices]# ethtool -i ens2f0
> driver: mlx5e_rep <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Yes, this is expected; it has been like this for a while now. In switchdev mode the PF netdev is replaced with an "Uplink Representor"; that's why you see the driver name change.

> Do you think we need the kernel-modules-extra rpm as well? On the overcloud image, this rpm won't be present unless required; we would need to rebuild the image in that case.

I'm not sure about other cases, but for Connection Tracking at least, kernel-modules-extra is 100% required since it provides mandatory kernel modules.

Finally, the offload issue you see might be related to BZ #1946162.
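The driver check discussed above can be scripted. Below is a minimal Python sketch that parses `ethtool -i` style "key: value" output and tests for the representor driver. Both function names are hypothetical, and the `mlx5e_rep` string is simply taken from the output quoted in this bug; other kernel versions may report a different driver name for representors.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style "key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address in "bus-info: 0000:5e:00.0" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_mlx5_representor(info):
    """True when the reported driver is mlx5e_rep (as in the output above)."""
    return info.get("driver") == "mlx5e_rep"
```

Applied to the two outputs in comment #12, this would flag `ens2f0` on the compute node (driver `mlx5e_rep`) but not `eth1` inside the instance (driver `mlx5_core`).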
Thanks Marcelo, Moshe and Alaa,

I appreciate your responses. I think this fix will be available in the next compose even though the kernel (kernel-4.18.0-305.el8.x86_64) doesn't have [1]. But I agree it is good to have, as a future RHEL 8.4.z may ship a kernel with [1].

Regarding kernel-modules-extra, I will see how to get it into the supplied overcloud image.

Regarding the issue I am facing, I suspect a firmware issue.
firmware-version: 16.27.6106 (DEL0000000015)    <<<<<This is a Dell PSID.
Since Dell is the source of the firmware here, I will check with them and upgrade. I will also try to work out how the Dell-supplied firmware feature list/versions compare to mlx5_core versioning. This is important to know in case escalations are reported on Dell servers; this matrix would help in troubleshooting.

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1

Thanks
Great, thanks a lot, Haresh!
nlevinki, can we move this to verified based on comment 24? CI has now passed based on comment 22; I think that was all we were waiting for.
(In reply to smooney from comment #25)
> nlevinki, can we move this to verified based on comment 24? CI has now
> passed based on comment 22; I think that was all we were waiting for.

The NFV team is validating this; moving the needinfo to @supadhya.
With RHOS-16.2-RHEL-8-20210728.n.2 we have already verified the HW offload job which NFV runs:
https://bugzilla.redhat.com/show_bug.cgi?id=1918703#c24
Haresh has also verified it:
https://bugzilla.redhat.com/show_bug.cgi?id=1918703#c22

Moving this to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483