1918703 – support kernel 5.8 - Link non uplink representors to PCI device

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	python-os-vif
Sub Component:
Version:	16.2 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ga
Target Release:	16.2 (Train on RHEL 8.4)
Assignee:	smooney
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:	1858583 1908649
Blocks:

Reported:	2021-01-21 12:23 UTC by Moshe Levi
Modified:	2021-09-15 07:11 UTC
CC List:	18 users
Fixed In Version:	python-os-vif-1.17.0-2.20210602134810.3a08cc4.el8ost
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-09-15 07:11:11 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	765912	None	NEW	Refactor code of linux_net to more cleaner and increase performace	2021-05-27 17:48:58 UTC
OpenStack gerrit	765970	None	NEW	Fix - os-vif fails to get the correct UpLink Representor	2021-05-27 17:48:58 UTC
Red Hat Issue Tracker	NFV-1984	None	None	None	2021-08-11 14:31:55 UTC
Red Hat Product Errata	RHEA-2021:3483	None	None	None	2021-09-15 07:11:33 UTC

Description Moshe Levi 2021-01-21 12:23:54 UTC

Description of problem: 

Due to a new kernel patch [1], the PF and VF representors are now linked to their parent PCI device.

Old Structure:
The <VF PCI address>/physfn/net directory contains only the PF of that VF.

$ ls /sys/bus/pci/devices/<vf-pci-address>/physfn/net/
enp2s0f0

$ ls -l /sys/class/net
...
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_0 -> ../../devices/virtual/net/enp2s0f0_0
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_1 -> ../../devices/virtual/net/enp2s0f0_1
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_2 -> ../../devices/virtual/net/enp2s0f0_2
lrwxrwxrwx 1 root root 0 Aug 17 11:11 enp2s0f0_3 -> ../../devices/virtual/net/enp2s0f0_3
...

New Structure:
The <VF PCI address>/physfn/net directory contains the PF of that VF and the VF representors.

$ ls /sys/bus/pci/devices/<vf-pci-address>/physfn/net/
enp3s0f0 enp3s0f0_0 enp3s0f0_1 enp3s0f0_2 enp3s0f0_3

$ ls -l /sys/class/net
...
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_0 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_0
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_1 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_1
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_2 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_2
lrwxrwxrwx. 1 root root 0 Aug 17 08:43 enp3s0f0_3 -> ../../devices/pci0000:00/0000:00:02.0/0000:03:00.0/net/enp3s0f0_3
...

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1
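
To make the consequence for user space concrete, here is a minimal Python sketch (not the os-vif patch itself) of picking the PF out of a physfn/net directory in a way that works with both layouts. It assumes that VF representors expose a phys_port_name attribute of the form pf<N>vf<M>, while the PF/uplink exposes something else or nothing readable; that naming convention and the helper names find_pf_netdev and _read_phys_port_name are assumptions made for illustration only.

```python
import os
import re

# Assumed naming convention: VF representors report "pf<N>vf<M>".
VF_REP_RE = re.compile(r"^pf\d+vf\d+$")


def _read_phys_port_name(net_dir, ifname):
    """Return the phys_port_name of ifname, or None if it cannot be read."""
    path = os.path.join(net_dir, ifname, "phys_port_name")
    try:
        with open(path) as f:
            return f.read().strip() or None
    except OSError:
        return None


def find_pf_netdev(net_dir):
    """Pick the PF (uplink) netdev from a <vf>/physfn/net directory.

    Entries whose phys_port_name looks like a VF representor are skipped,
    so the result is the same for the old and the new sysfs layout.
    """
    candidates = []
    for ifname in sorted(os.listdir(net_dir)):
        name = _read_phys_port_name(net_dir, ifname)
        if name is not None and VF_REP_RE.match(name):
            continue  # VF representor, not the PF
        candidates.append(ifname)
    if len(candidates) != 1:
        raise RuntimeError("expected exactly one PF netdev, got %r" % candidates)
    return candidates[0]
```

Code that simply takes the first entry of the directory listing would still work on the old layout but could return enp3s0f0_0 on the new one, which matches the failure described under "Actual results" below.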

Version-Release number of selected component (if applicable):
we need to update os-vif to support it as well

How reproducible:
 

Steps to Reproduce:
1. Create a direct port with switchdev capabilities.
2. Boot a VM with that port.


Actual results:
os-vif selects a VF representor instead of the PF.

Expected results:
The VM should boot successfully.

Additional info:

We need to merge [1] and [2]; they are already merged to master.
[1] - https://review.opendev.org/c/openstack/os-vif/+/765912
[2] - https://review.opendev.org/c/openstack/os-vif/+/765970

Comment 1 smooney 2021-01-28 18:21:50 UTC

The OSP 16.x releases are pinned to specific RHEL versions:
16.1 is pinned to RHEL 8.2 and 16.2 will be pinned to 8.4.

This will only be relevant to backport to 16.2 if the corresponding kernel patches are backported to 8.4.

My understanding is that we are going to backport this upstream to Train.
I have no issue with importing that backport into 16.2 when it's completed upstream, but I'm not sure this will be required.

OSP 17 will be based on Wallaby and RHEL 9 and will have this change by default.

Moshe, can you confirm whether the kernel 5.8 patch has been targeted for inclusion in RHEL 8.4 and whether there is a bug tracking it?

I'll quickly check and see if I can find one, and update this if I do.

For 16.0 and 16.1 this should not be required, so this would be for 16.2 only.

Setting dev conditional nack design until we determine whether this is needed and whether it will be included in RHEL 8.4.

Comment 2 smooney 2021-01-28 18:32:23 UTC

OK, clearing needinfo and devnack.

The kernel backport will break the userspace ABI and require us to change our layered product to avoid a regression.
I'm kind of surprised this was accepted into RHEL given the API breakage, but since it has been, we need the backport.

Comment 4 Moshe Levi 2021-01-29 08:00:39 UTC

Thanks Sean, 
But in general, can we backport these patches to older OpenStack versions so that customers who use other deployment tools won't break as well?
I sent this to Saravanan KR <skramaja> just to make sure that we can get the RH folks to review the backport.

Comment 5 smooney 2021-01-29 12:39:44 UTC

I'll bring that question up at our bug triage call later today.
From a Red Hat product perspective I don't think it makes sense to backport it in OSP to older versions.
16.1.x is only supported on 8.2 and will not have the kernel change.
From OSP 13 onward we require that all deployments are done using ooo/OSP director, otherwise they are unsupported.
Backporting it to 16.1 in addition to 16.2 is not much more work, but 15 and 14 are both EOL and won't have new releases
or security updates. 13 is RHEL 7 based and won't have the kernel backports either, so I'm not sure it really makes sense.

What release/tool/OS version combination did you have in mind?

Personally I'm surprised we backported the kernel changes at all to 8.4 given it's a userspace API break. I severely doubt this
is something that should be backported to RHEL 7, as it will break hardware-offloaded OVS in layered products
based on RHEL, so it's a certification issue for vendors that support it too.

This is not a flat no, but before we look at bringing this back before 16.2 I would like to know the business justification/use case
that it is enabling. We won't be supporting vDPA before OSP 17 and we won't be supporting subfunctions before 18 at the earliest, since that is not on our
roadmap at all yet in OpenStack. So the main reason for the backport, I assume, is to support the ConnectX-6 Dx and Lx NICs.

They will only be usable in older releases for basic hardware offload with VLAN/flat networks, and the newer functionality introduced in that line
will largely be unused until a newer version of OVS and OpenStack are deployed on them.

In any case I'll take a look at the backport upstream and do a pre-emptive backport downstream, probably next week, as I don't think I'll get to it today.

If you can let me know here or on IRC where you think this is useful beyond 16.2, I can take a look at that too.

Comment 6 Moshe Levi 2021-03-18 19:29:17 UTC

We have customers that use old OpenStack releases but install Mellanox OFED, and without this change it will break.
I understand it makes less sense to backport a fix for a new kernel to an old OpenStack distro.
If it is possible to backport them to the Train/Stein releases from an upstream perspective, that would be great.

Comment 11 Alaa Hleihel (NVIDIA Mellanox) 2021-06-06 08:24:51 UTC

Hi,

> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1

libvirt in RHEL-8 was fixed/updated to support this kernel change.
So, we're adding this driver change to RHEL-8.5 in BZ #1959367.

Please confirm that the requested patches in this BZ will be in OSP 16.2

Thanks,
Alaa

Comment 12 Haresh Khandelwal 2021-06-11 08:50:11 UTC

Hi Alaa, Moshe,

I have an environment with latest compose of 16.2 (RHEL8.4) with below version on compute nodes. 

[root@overcloud-r640compute-0 heat-admin]# rpm -q kernel
kernel-4.18.0-305.el8.x86_64


[root@overcloud-r640compute-0 /]# rpm -q qemu-kvm   <<<From nova_libvirt container
qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.x86_64

[root@overcloud-r640compute-0 /]# rpm -qa | grep virt  <<<From nova_libvirt container
libvirt-bash-completion-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-client-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-libs-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-driver-interface-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-kvm-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
python3-libvirt-7.0.0-1.module+el8.4.0+9469+2eaf72bc.x86_64
libvirt-admin-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-driver-qemu-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
libvirt-daemon-driver-network-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
[root@overcloud-r640compute-0 /]# 

From  BZ#1959367,
"In RHEL-8.4 we skipped an mlx5 patch [1] that linked the VF representors to PF PCI device since user-space packages were not ready for that change."

So the kernel change to the VF structure is not present in the kernel I am using.

[root@overcloud-r640compute-0 devices]# ls /sys/bus/pci/devices/0000\:5e\:00.5/physfn/net
ens2f0
[root@overcloud-r640compute-0 devices]# ls -l /sys/class/net | grep ens2f0
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0 -> ../../devices/pci0000:5d/0000:5d:00.0/0000:5e:00.0/net/ens2f0
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_0 -> ../../devices/virtual/net/ens2f0_0
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_1 -> ../../devices/virtual/net/ens2f0_1
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_2 -> ../../devices/virtual/net/ens2f0_2
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_3 -> ../../devices/virtual/net/ens2f0_3
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_4 -> ../../devices/virtual/net/ens2f0_4
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_5 -> ../../devices/virtual/net/ens2f0_5
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_6 -> ../../devices/virtual/net/ens2f0_6
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_7 -> ../../devices/virtual/net/ens2f0_7
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_8 -> ../../devices/virtual/net/ens2f0_8
lrwxrwxrwx. 1 root root 0 Jun 10 17:40 ens2f0_9 -> ../../devices/virtual/net/ens2f0_9
[root@overcloud-r640compute-0 devices]# 
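
The manual check above (looking at where the representor symlinks in /sys/class/net point) can be sketched in Python as follows. This is only an illustration of the comparison between the two layouts shown in the description; the function names representor_layout and layout_of are hypothetical, and the classification is meant for representor names such as ens2f0_0, since the PF itself sits under a PCI device in both layouts.

```python
import os


def representor_layout(link_target):
    """Classify a /sys/class/net/<ifname> symlink target.

    Returns "virtual" for the old layout (under devices/virtual/net) and
    "pci" when the netdev hangs off a PCI device, as in the new layout.
    """
    parts = link_target.split("/")
    if "virtual" in parts:
        return "virtual"
    if any(p.startswith("pci") for p in parts):
        return "pci"
    return "unknown"


def layout_of(ifname, class_net="/sys/class/net"):
    """Read the symlink for ifname and classify it."""
    return representor_layout(os.readlink(os.path.join(class_net, ifname)))
```

On the compute node shown above, layout_of("ens2f0_0") would be expected to report the old "virtual" layout, consistent with the kernel patch being absent.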

libvirt works successfully when creating an instance, as the libvirt changes are present in the compose.

    <interface type='hostdev' managed='yes'>
      <mac address='fa:16:3e:36:6c:13'/>
      <driver name='vfio'/>
      <source>
        <address type='pci' domain='0x0000' bus='0x5e' slot='0x00' function='0x7'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>

[root@localhost ~]# ethtool -i eth1    <<<<<<<<<<<<<<<Instance having switchdev VF
driver: mlx5_core
version: 5.0-0
firmware-version: 16.27.6106 (DEL0000000015)
expansion-rom-version: 
bus-info: 0000:00:05.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@localhost ~]# 

So, since the kernel change is not present in 16.2, do you think the os-vif fix (the python-os-vif-1.17.0-2.20210602134810.3a08cc4.el8ost rpm, yet to be included in 16.2) is mandatory? The VM is able to spawn successfully.

One more thing: in my current 16.2 environment (os-vif patch is not present), I see that all traffic pertaining to the switchdev VF takes the OVS kernel datapath. It is not even handled in tc sw.
The management traffic is steered via tc sw.

This behavior is observed with vlan, geneve, with bond and without bond configurations.

Looking at ovs logs, 

2021-06-11T06:47:58.287Z|00155|dpif_netlink(handler2)|ERR|failed to offload flow: Invalid argument: ens1f1   <<<<<<<<<<<<<<Management traffic
2021-06-11T06:47:58.286Z|00154|dpif_netlink(handler2)|ERR|failed to offload flow: Invalid argument: ovn-f29-h1-1
2021-06-11T06:47:58.270Z|00153|dpif_netlink(handler2)|ERR|failed to offload flow: Invalid argument: ens2f0_5   <<<<<<<<<<<<<<<<<<switchdev VF
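
For a longer log it can help to collect the devices named in these errors. A small Python sketch follows; it assumes only the message shape visible in the three lines above ("failed to offload flow: <reason>: <device>"), and the function name failed_offload_devices is hypothetical.

```python
import re

# Matches the message shape seen in the log lines above.
OFFLOAD_ERR_RE = re.compile(
    r"failed to offload flow: (?P<reason>[^:]+): (?P<dev>\S+)"
)


def failed_offload_devices(log_lines):
    """Return device names from offload errors, in first-seen order."""
    devices = []
    for line in log_lines:
        m = OFFLOAD_ERR_RE.search(line)
        if m and m.group("dev") not in devices:
            devices.append(m.group("dev"))
    return devices
```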

ens1f1:

ufid:7ab5f22e-6289-43a2-aff8-a26c9040d3a2, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(br-ex),packet_type(ns=0/0,id=0/0),eth(src=40:a6:b7:2b:a6:e1,dst=ac:1f:6b:7d:14:b1),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:6429, bytes:632347, used:0.000s, dp:tc, actions:ens1f1

ufid:c258b88d-59ef-4b2a-847a-4f69f9c512c9, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(ens1f1),packet_type(ns=0/0,id=0/0),eth(src=ac:1f:6b:7d:14:b1,dst=40:a6:b7:2b:a6:e1),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:7539, bytes:441865, used:0.000s, dp:tc, actions:br-ex

ens2f0_5:

ufid:ea3e3ff5-5d41-4d41-8eef-8c0db898e86d, recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x4,src=172.17.2.57,dst=172.17.2.46,ttl=0/0,geneve({class=0x102,type=0x80,len=4,0x30002/0x7fffffff}),flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:22:f5:57,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,frag=no), packets:558, bytes:54684, used:0.314s, dp:ovs, actions:ens2f0_5

ufid:694fa8d6-37f8-489b-b44f-1e05eeb1c3b7, recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(ens2f0_5),skb_mark(0/0),ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:36:6c:13,dst=fa:16:3e:22:f5:57),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,dst=192.168.2.16/255.255.255.240,proto=0/0,tos=0/0x3,ttl=0/0,frag=no), packets:558, bytes:54684, used:0.314s, dp:ovs, actions:set(tunnel(tun_id=0x4,dst=172.17.2.57,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003}),flags(df|csum|key))),genev_sys_6081
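
The observation here rests on the dp: field of each flow (dp:tc for the management flows, dp:ovs for the switchdev VF flows). A minimal Python sketch for tallying that field over a dump like the one above; it assumes one flow per line with a "dp:<name>" token, and the function name flows_by_datapath is hypothetical.

```python
import re
from collections import Counter

# "dp:tc" / "dp:ovs" as printed in the flow dump; "dp_hash(...)" has no
# colon after "dp" and therefore does not match.
DP_RE = re.compile(r"\bdp:(\w+)")


def flows_by_datapath(dump_lines):
    """Count flows per datapath type based on the dp: field."""
    counts = Counter()
    for line in dump_lines:
        m = DP_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

For the four flows quoted above this would give two tc flows and two ovs flows.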

BTW, shouldn't the PF be using the mlx5_core driver? I see the representor driver for the PF. I remember it being mlx5_core earlier, unless something changed recently.
[root@overcloud-r640compute-0 devices]# ethtool -i ens2f0
driver: mlx5e_rep    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
version: 4.18.0-305.el8.x86_64
firmware-version: 16.27.6106 (DEL0000000015)
expansion-rom-version: 
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@overcloud-r640compute-0 devices]# 

I believe things are not in order in 16.2 right now. While I dig further, would you please let us know if there are any missing patches that break the feature?
We have not yet tried conntrack offload (supposed to work in RHEL 8.4, barring a few flags) since we are not able to make basic offload work.

Do you think we need the kernel-module-extra rpm as well? On the overcloud image, this rpm won't be present unless required; we would need to rebuild the image in that case.

Since this is a GA feature and a break is considered a regression, we need your support.

Thanks

Comment 13 Haresh Khandelwal 2021-06-11 08:53:36 UTC

Also, I tried using the brew RPM of os-vif and updating nova_compute (and eventually all nova containers) on the compute node.
The problem still persists, so I'm not sure the os-vif fix makes any positive difference to the functionality.

Comment 14 Marcelo Ricardo Leitner 2021-06-11 17:38:02 UTC

(I'm leaving all other questions to Nvidia)

I am not sure why CT is being involved here, btw. It doesn't seem to be very related to the original bug, but let's go:

(In reply to Haresh Khandelwal from comment #12)
> One more thing, In my current 16.2 environment (os-vif patch is not
> present), I see all traffic pertaining to switchdev VF takes ovs kernel data
> path. Not even at tc sw. 
> The management traffic steered via tc sw. 
> 
> This behavior is observed with vlan, geneve, with bond and without bond
> configurations.
> 
> Looking at ovs logs, 
> 
> 2021-06-11T06:47:58.287Z|00155|dpif_netlink(handler2)|ERR|failed to offload
> flow: Invalid argument: ens1f1   <<<<<<<<<<<<<<Management traffic
> 2021-06-11T06:47:58.286Z|00154|dpif_netlink(handler2)|ERR|failed to offload
> flow: Invalid argument: ovn-f29-h1-1
> 2021-06-11T06:47:58.270Z|00153|dpif_netlink(handler2)|ERR|failed to offload
> flow: Invalid argument: ens2f0_5   <<<<<<<<<<<<<<<<<<switchdev VF

These should be because

> ens2f0_5:
> 
> ufid:ea3e3ff5-5d41-4d41-8eef-8c0db898e86d,
> recirc_id(0),dp_hash(0/0),skb_priority(0/0),tunnel(tun_id=0x4,src=172.17.2.
> 57,dst=172.17.2.46,ttl=0/0,geneve({class=0x102,type=0x80,len=4,0x30002/
> 0x7fffffff}),flags(-df+csum+key)),in_port(genev_sys_6081),skb_mark(0/0),
> ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:
           ^^^^^^
> 22:f5:57,dst=00:00:00:00:00:00/01:00:00:00:00:00),eth_type(0x0800),
> ipv4(src=0.0.0.0/0.0.0.0,dst=0.0.0.0/0.0.0.0,proto=0/0,tos=0/0,ttl=0/0,
> frag=no), packets:558, bytes:54684, used:0.314s, dp:ovs, actions:ens2f0_5
> 
> ufid:694fa8d6-37f8-489b-b44f-1e05eeb1c3b7,
> recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(ens2f0_5),skb_mark(0/0),
> ct_state(0/0x3f),ct_zone(0/0),ct_mark(0/0),ct_label(0/0x1),eth(src=fa:16:3e:
           ^^^^^^
> 36:6c:13,dst=fa:16:3e:22:f5:57),eth_type(0x0800),ipv4(src=0.0.0.0/0.0.0.0,
> dst=192.168.2.16/255.255.255.240,proto=0/0,tos=0/0x3,ttl=0/0,frag=no),
> packets:558, bytes:54684, used:0.314s, dp:ovs,
> actions:set(tunnel(tun_id=0x4,dst=172.17.2.57,ttl=64,tp_dst=6081,
> geneve({class=0x102,type=0x80,len=4,0x20003}),flags(df|csum|key))),
> genev_sys_6081

In -305.el8 we have the downstream commit
97fdc396c46d ("net/sched: cls_flower: Reject invalid ct_state flags rules")
but not its fix yet:
afa536d8405a ("net/sched: cls_flower: fix only mask bit check in the validate_ct_state")
which is landing via
https://bugzilla.redhat.com/show_bug.cgi?id=1965457#c1

Unfortunately it's not available in an official build yet. Just on the scratch build:
https://bugzilla.redhat.com/show_bug.cgi?id=1965457#c6

There should be an 8.4.z build next week with all CT HWOL fixes collected so far, btw, and we can use it for testing.

Comment 15 Moshe Levi 2021-06-11 18:54:57 UTC

@Haresh, 
This backport fixes an issue when spawning a VM with SR-IOV with switchdev when the kernel has this fix [1]. If your kernel doesn't have this fix you won't encounter the issue. To support kernels that have fix [1] we need this change and the libvirt change [2]. We want to backport it to support kernels with fix [1].
Regarding the offload issue, I believe it is tracked in a different BZ and, as @Marcelo said, it is not related to this BZ.



[1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1
[2] - https://bugzilla.redhat.com/show_bug.cgi?id=1959367

Comment 16 Alaa Hleihel (NVIDIA Mellanox) 2021-06-13 08:19:49 UTC

Hi, Haresh.

> So, Since kernel change is not present in 16.2, Do you think os-vif (rpm of python-os-vif-1.17.0-2.20210602134810.3a08cc4.el8ost yet to be included in 16.2) is mandatory? As VM is able to spawn successfully. 

As I understand, yes, the team asked to add the fix in 16.2.
This issue blocked us also when using MLNX_OFED where we wanted to verify things as preparation for using inbox eventually.
So please add the fix to 16.2.


> BTW, isn't PF should be having mlx5_core driver? i see representor driver for PF. I do remember having it mlx5_core earlier unless something changed recently. 
> [root@overcloud-r640compute-0 devices]# ethtool -i ens2f0
> driver: mlx5e_rep    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Yes, this is expected, it has been like this for a while now. In switchdev mode the PF netdev is replaced with an "Uplink Representor"; that's why you see the driver name change.
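
As a quick way to check this on a node, the `ethtool -i` output can be parsed and the driver field compared against the representor driver name seen in comment 12 (mlx5e_rep). The Python sketch below is only a heuristic for this mlx5 output; the helper names parse_ethtool_info and looks_like_mlx5_representor are hypothetical.

```python
def parse_ethtool_info(output):
    """Parse 'key: value' lines as printed by `ethtool -i <dev>`."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        info[key.strip()] = value.strip()
    return info


def looks_like_mlx5_representor(info):
    """True if the driver field is the representor driver from comment 12."""
    return info.get("driver") == "mlx5e_rep"
```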


> Do you see we need kernel-module-extra rpm as well needed? On overcloud image, this rpm wont be present unless required, we need to rebuild the image in that case.

I'm not sure about other cases, but for Connection Tracking at least, kernel-module-extra is 100% required since it provides mandatory kernel modules.

Finally, the offload issue you see might be related to BZ #1946162

Comment 20 Haresh Khandelwal 2021-06-14 11:30:35 UTC

Thanks Marcelo, Moshe and Alaa,  Appreciate your responses. 

This fix will be available in the next compose, I think, even though the kernel (kernel-4.18.0-305.el8.x86_64) doesn't have [1]. But agreed, it is good to have, as a future RHEL 8.4.z may have a kernel with [1].

Regarding "kernel-module-extra", I will see how to get this into the supplied overcloud image.

Regarding the issue I am facing, I suspect a firmware issue.

firmware-version: 16.27.6106 (DEL0000000015) <<<<<This is Dell PSID. 
Since Dell is the source of the firmware here, I will check with them and upgrade.
Also, I will try to figure out how the feature list/versions of Dell-supplied firmware compare to mlx5_core versioning. This is important to know in case escalations are reported on Dell servers; such a matrix would help with troubleshooting.

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1

Thanks

Comment 21 Alaa Hleihel (NVIDIA Mellanox) 2021-06-15 07:22:41 UTC

Great, thanks a lot, Haresh!

Comment 25 smooney 2021-08-11 12:00:05 UTC

nlevinki, can we move this to verified based on comment 24?
CI has now passed based on comment 22; I think that was all we were waiting for.

Comment 26 Archit Modi 2021-08-11 12:40:45 UTC

(In reply to smooney from comment #25)
> nlevinki can we move this to verified based on comment 24
> ci has now passed based on comment 22 i think that was all we were waiting
> for.

NFV team is validating this, moving the needinfo to @supadhya

Comment 27 Sanjay Upadhyay 2021-08-11 12:48:56 UTC

With RHOS-16.2-RHEL-8-20210728.n.2 we have already verified this with the HW offload job that NFV runs (https://bugzilla.redhat.com/show_bug.cgi?id=1918703#c24),
and Haresh has also verified it (https://bugzilla.redhat.com/show_bug.cgi?id=1918703#c22).

Moving this to verified

Comment 29 errata-xmlrpc 2021-09-15 07:11:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483
