Bug 1891614 - [mlx] testpmd fails inside OpenShift pod using DevX version 19.11
Summary: [mlx] testpmd fails inside OpenShift pod using DevX version 19.11
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sebastian Scheinkman
QA Contact: Nikita
URL:
Whiteboard:
Depends On: 1846560
Blocks: 1771572
Reported: 2020-10-26 20:36 UTC by Sebastian Scheinkman
Modified: 2021-02-24 15:29 UTC
CC List: 28 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1846560
Environment:
Last Closed: 2021-02-24 15:28:28 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-docs pull 27624 | 0 | None | closed | Bug 1891614: Update mlx dpdk base on cve-2020-14386 | 2021-02-01 08:10:59 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:29:04 UTC

Description Sebastian Scheinkman 2020-10-26 20:36:25 UTC
I am not able to run a DPDK testpmd application; this validation was working before.

In summary, this is the error I see when starting the testpmd application:

net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
EAL: Error: Invalid memory
net_mlx5: probe of PCI device 0000:3b:00.6 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:3b:00.6 cannot be used


Full details:

testpmd -l 8,10,48,50 -w 0000:3b:00.6 --iova-mode=va --socket-mem=1024 --socket-limit=1024 -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: PCI device 0000:3b:00.6 on NUMA socket 0
EAL:   probe driver: 15b3:1018 net_mlx5
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
EAL: Error: Invalid memory
net_mlx5: probe of PCI device 0000:3b:00.6 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:3b:00.6 cannot be used
testpmd: No probed ethernet devices
Interactive-mode selected
Set mac packet forwarding mode
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc


sh-4.4# lspci -v -nn -mm -k -s 0000:3b:00.6

Slot:	3b:00.6
Class:	Ethernet controller [0200]
Vendor:	Mellanox Technologies [15b3]
Device:	MT27800 Family [ConnectX-5 Virtual Function] [1018]
SVendor:	Mellanox Technologies [15b3]
SDevice:	Device [0091]
Driver:	mlx5_core
Module:	mlx5_core
NUMANode:	0

sh-4.4# lspci -v -nn -mm -k -s 0000:3b:00.0

Slot:	3b:00.0
Class:	Ethernet controller [0200]
Vendor:	Mellanox Technologies [15b3]
Device:	MT27800 Family [ConnectX-5] [1017]
SVendor:	Mellanox Technologies [15b3]
SDevice:	Device [0091]
Driver:	mlx5_core
Module:	mlx5_core
NUMANode:	0


ethtool -i ens1f0

driver: mlx5_core
version: 5.0-0
firmware-version: 16.26.6000 (DEL0000000015)
expansion-rom-version: 
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

dpdk version:

dpdk.x86_64                        19.11-4.el8                     @rhel-8-for-x86_64-appstream-rpms
dpdk-devel.x86_64                  19.11-4.el8                     @rhel-8-for-x86_64-appstream-rpms
dpdk-tools.x86_64                  19.11-4.el8                     @rhel-8-for-x86_64-appstream-rpms


kernel
4.18.0-193.24.1.el8_2.dt1.x86_64


With the debug flag:

testpmd -l 8,10,48,50 -w 0000:3b:00.5 --iova-mode=va --log-level="*:debug" -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bnxt.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_e1000.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_enic.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_failsafe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_i40e.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ixgbe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx4.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx5.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_netvsc.so.20.0
EAL: Registered [vmbus] bus.
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_nfp.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_qede.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ring.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_tap.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vdev_netvsc.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vhost.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_virtio.so.20.0
EAL: Ask a virtual area of 0x5000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x5000)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO PCI modules not loaded
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Ask a virtual area of 0x2e000 bytes
EAL: Virtual area found at 0x100005000 (size = 0x2e000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 1048576
EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
EAL: Detected memory type: socket_id:1 hugepage_sz:1073741824
EAL: Creating 4 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x100033000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x940000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x980000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x1180000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x19c0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
EAL: Creating 4 segment lists: n_segs:32 socket_id:1 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2200000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2240000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2a40000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2a80000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3280000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x32c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3ac0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x3b00000000 (size = 0x800000000)
EAL: TSC frequency is ~2490000 KHz
EAL: Master lcore 8 is ready (tid=7ff303020900;cpuset=[8])
EAL: lcore 48 is ready (tid=7ff2f9e00700;cpuset=[48])
EAL: lcore 50 is ready (tid=7ff2f95ff700;cpuset=[50])
EAL: lcore 10 is ready (tid=7ff2fa601700;cpuset=[10])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 1024MB
EAL: PCI device 0000:3b:00.5 on NUMA socket 0
EAL:   probe driver: 15b3:1018 net_mlx5
EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
Segmentation fault (core dumped)

Comment 1 Sebastian Scheinkman 2020-10-26 20:37:30 UTC
Update: this is unrelated to the SELinux issue we found in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1846560

Comment 2 David Marchand 2020-10-27 08:18:23 UTC
- If it was working before, what changed in your setup?
Do you have new SELinux logs?


- I don't yet know the reason for the allocation failure, but the logs indicate that something is wrong in the mlx5 cleanup routine:

* first try
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
EAL: Error: Invalid memory
net_mlx5: probe of PCI device 0000:3b:00.6 aborted after encountering an error: Cannot allocate memory

Here, "Error: Invalid memory" means that the mlx5 driver tried to free an invalid pointer.


* second try
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
Segmentation fault (core dumped)

Here, I guess rte_free() did not have the chance to complain and just crashed.
Do you have the coredump to confirm this?
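
The two failure modes described above can be told apart mechanically in a captured log. The following is a minimal sketch, not part of the original report: the function name and the four labels it prints are hypothetical, and it only matches the exact message strings quoted in this bug.

```shell
#!/bin/sh
# Classify how an mlx5 probe failed in a captured testpmd log (read on stdin).
# Prints one of these hypothetical labels:
#   invalid-free  "EAL: Error: Invalid memory" was logged after the TIS failure
#   segfault      the process crashed after the TIS failure
#   tis-failure   TIS failure only, no cleanup symptom logged
#   none          no TIS allocation failure found
classify_mlx5_probe_failure() {
    awk '
        /net_mlx5: TIS allocation failure/  { tis = 1; next }
        tis && /EAL: Error: Invalid memory/ { kind = "invalid-free" }
        tis && /Segmentation fault/         { kind = "segfault" }
        END {
            if (!tis)      print "none"
            else if (kind) print kind
            else           print "tis-failure"
        }
    '
}
```

Usage would look like `classify_mlx5_probe_failure < testpmd.log`, where `testpmd.log` is an assumed file name for the saved output.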

Comment 3 Alaa Hleihel (NVIDIA Mellanox) 2020-10-27 08:27:42 UTC
This does seem like the other BZ.
Did you try to run with the WA from comment https://bugzilla.redhat.com/show_bug.cgi?id=1846560#c43 ?

Comment 4 Sebastian Scheinkman 2020-10-27 08:32:00 UTC
Hi David,

I am sorry, I just validated with our QE team: this is an OpenShift cluster running on new bare-metal servers.

System Information
	Manufacturer: Dell Inc.
	Product Name: PowerEdge R740

The coredump is related to the other BZ. I disabled SELinux and there is no core dump anymore, but I can see there are no NICs for testpmd:

testpmd: No probed ethernet devices


I can give you access to the env if you want to debug it there.

Full output of the run with the debug flag:

EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bnxt.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_e1000.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_enic.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_failsafe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_i40e.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ixgbe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx4.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx5.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_netvsc.so.20.0
EAL: Registered [vmbus] bus.
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_nfp.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_qede.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ring.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_tap.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vdev_netvsc.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vhost.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_virtio.so.20.0
EAL: Ask a virtual area of 0x5000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x5000)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO PCI modules not loaded
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Ask a virtual area of 0x2e000 bytes
EAL: Virtual area found at 0x100005000 (size = 0x2e000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 1048576
EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
EAL: Detected memory type: socket_id:1 hugepage_sz:1073741824
EAL: Creating 4 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x100033000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x940000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x980000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x1180000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x19c0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
EAL: Creating 4 segment lists: n_segs:32 socket_id:1 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2200000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2240000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2a40000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2a80000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3280000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x32c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3ac0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x3b00000000 (size = 0x800000000)
EAL: TSC frequency is ~2490000 KHz
EAL: Master lcore 8 is ready (tid=7fcb35aca900;cpuset=[8])
EAL: lcore 48 is ready (tid=7fcb2c8aa700;cpuset=[48])
EAL: lcore 50 is ready (tid=7fcb2c0a9700;cpuset=[50])
EAL: lcore 10 is ready (tid=7fcb2d0ab700;cpuset=[10])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 1024MB
EAL: PCI device 0000:3b:00.5 on NUMA socket 0
EAL:   probe driver: 15b3:1018 net_mlx5
EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
net_mlx5: probe of PCI device 0000:3b:00.5 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:3b:00.5 cannot be used
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
testpmd: No probed ethernet devices
Interactive-mode selected
Set mac packet forwarding mode
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Done
testpmd>

Comment 5 David Marchand 2020-10-27 10:27:09 UTC
- The hugepage files can be removed, and so this issue is different from bz1846560.

Inside the pod:
sh-4.4# uname -a
Linux dpdk-h4h9c 4.18.0-193.24.1.el8_2.dt1.x86_64 #1 SMP Thu Sep 24 14:57:05 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux                                                                                              
sh-4.4# rpm -q dpdk
dpdk-19.11-4.el8.x86_64
sh-4.4# rpm -q rdma-core
rdma-core-26.0-8.el8.x86_64
sh-4.4# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)


- We still have the issue with debug logs not working for dynamic dpdk plugins in this version.
I used gdb to force it (safer than my previous black magic that rewrote the default value in the binary).


sh-4.4# gdb testpmd # -l 8,10,48,50 -w 0000:3b:00.5 --iova-mode=va --log-level="*:debug" -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall
[...]
(gdb) b eal_plugins_init
Function "eal_plugins_init" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (eal_plugins_init) pending.
[...]
(gdb) run -l 8,10,48,50 -w 0000:3b:00.5 --iova-mode=va --log-level="*:debug" -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall                                              
Starting program: /usr/bin/testpmd -l 8,10,48,50 -w 0000:3b:00.5 --iova-mode=va --log-level="*:debug" -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall                     
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 1
EAL: Detected lcore 2 as core 4 on socket 0
EAL: Detected lcore 3 as core 4 on socket 1
EAL: Detected lcore 4 as core 1 on socket 0
[...]
Breakpoint 1, eal_plugins_init () at /usr/src/debug/dpdk-19.11-4.el8.x86_64/lib/librte_eal/common/eal_common_options.c:280                                                                                        
280     {
Missing separate debuginfos, use: yum debuginfo-install numactl-libs-2.0.12-9.el8.x86_64
(gdb) finish
Run till exit from #0  eal_plugins_init () at /usr/src/debug/dpdk-19.11-4.el8.x86_64/lib/librte_eal/common/eal_common_options.c:280                                                                               
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bnxt.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_e1000.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_enic.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_failsafe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_i40e.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ixgbe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx4.so.20.0
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx5.so.20.0
warning: Loadable section ".note.gnu.property" outside of ELF segments
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_netvsc.so.20.0
EAL: Registered [vmbus] bus.
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_nfp.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_qede.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ring.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_tap.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vdev_netvsc.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vhost.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_virtio.so.20.0
0x00007ffff47bdf60 in rte_eal_init (argc=argc@entry=14, argv=argv@entry=0x7fffffffe958) at /usr/src/debug/dpdk-19.11-4.el8.x86_64/lib/librte_eal/linux/eal/eal.c:1007                                             
1007            if (eal_plugins_init() < 0) {
Value returned is $1 = 0
Missing separate debuginfos, use: yum debuginfo-install libibverbs-26.0-8.el8.x86_64
(gdb) set rte_logs.dynamic_types[mlx5_logtype].loglevel = 8
(gdb) c
Continuing.


And so, after this, we can see the following mlx5 driver logs:

EAL: Heap on socket 0 was expanded by 1024MB
EAL: PCI device 0000:3b:00.5 on NUMA socket 0
EAL:   probe driver: 15b3:1018 net_mlx5
EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
net_mlx5: checking device "mlx5_6"
net_mlx5: checking device "mlx5_5"
net_mlx5: PCI information matches for device "mlx5_5"
net_mlx5: checking device "mlx5_4"
net_mlx5: checking device "mlx5_3"
net_mlx5: checking device "mlx5_2"
net_mlx5: checking device "mlx5_1"
net_mlx5: checking device "mlx5_0"
net_mlx5: no E-Switch support detected
net_mlx5: naming Ethernet device "0000:3b:00.5"
net_mlx5: DevX is supported
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
net_mlx5: probe of PCI device 0000:3b:00.5 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:3b:00.5 cannot be used
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
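
The manual gdb steps in this comment could be captured in a gdb command file so the run is repeatable. This is only a sketch under assumptions: the helper name and the output path are hypothetical, it relies on the dpdk debuginfo being installed so that `rte_logs` and `mlx5_logtype` resolve (as they did in the session above), and 8 is the value of `RTE_LOG_DEBUG`.

```shell
#!/bin/sh
# Write a gdb command file that repeats the manual session:
# break after the PMD plugins are loaded, raise the mlx5 log level, continue.
# $1 is the path of the command file to create (hypothetical helper).
write_mlx5_debug_gdb_script() {
    cat > "$1" <<'EOF'
set breakpoint pending on
break eal_plugins_init
run
finish
set var rte_logs.dynamic_types[mlx5_logtype].loglevel = 8
continue
EOF
}

# Example invocation (not executed here):
#   write_mlx5_debug_gdb_script /tmp/mlx5-debug.gdb
#   gdb -x /tmp/mlx5-debug.gdb --args testpmd -l 8,10,48,50 -w 0000:3b:00.5 \
#       --iova-mode=va --log-level="*:debug" -- -i --portmask=0x1 --nb-cores=2 \
#       --forward-mode=mac --port-topology=loop --no-mlockall
```

`set breakpoint pending on` stands in for answering `y` to the pending-breakpoint prompt, since `eal_plugins_init` lives in a shared library that is not yet loaded when the breakpoint is set.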

Comment 6 David Marchand 2020-10-27 10:39:03 UTC
# From the pod

sh-4.4# dmesg |grep -i mlx.*0000:3b:00.5
[  184.025855] mlx5_core 0000:3b:00.5: enabling device (0000 -> 0002)
[  184.032240] mlx5_core 0000:3b:00.5: firmware version: 16.26.6000
[  184.320704] mlx5_core 0000:3b:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[  184.338062] mlx5_core 0000:3b:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[  184.346323] mlx5_core 0000:3b:00.5: Assigned random MAC address b2:a7:bd:88:e3:dc
[  184.482922] mlx5_core 0000:3b:00.5 ens1f0v3: renamed from eth0
[  189.683721] mlx5_core 0000:3b:00.5: enabling device (0000 -> 0002)
[  189.690085] mlx5_core 0000:3b:00.5: firmware version: 16.26.6000
[  189.972358] mlx5_core 0000:3b:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[  189.989391] mlx5_core 0000:3b:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[  190.131982] mlx5_core 0000:3b:00.5 ens1f0v3: renamed from eth0
[  913.855598] mlx5_core 0000:3b:00.5: enabling device (0000 -> 0002)
[  913.861974] mlx5_core 0000:3b:00.5: firmware version: 16.26.6000
[  914.148914] mlx5_core 0000:3b:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[  914.165601] mlx5_core 0000:3b:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[  914.173871] mlx5_core 0000:3b:00.5: Assigned random MAC address 4a:12:b9:19:07:2d
[  914.309407] mlx5_core 0000:3b:00.5 ens1f0v3: renamed from eth0
[  919.588496] mlx5_core 0000:3b:00.5: enabling device (0000 -> 0002)
[  919.596135] mlx5_core 0000:3b:00.5: firmware version: 16.26.6000
[  919.877791] mlx5_core 0000:3b:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[  919.895938] mlx5_core 0000:3b:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[  920.051348] mlx5_core 0000:3b:00.5 ens1f0v3: renamed from eth0
[ 4060.967387] mlx5_core 0000:3b:00.5: enabling device (0000 -> 0002)
[ 4060.973769] mlx5_core 0000:3b:00.5: firmware version: 16.26.6000
[ 4061.262376] mlx5_core 0000:3b:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 4061.279366] mlx5_core 0000:3b:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 4061.287642] mlx5_core 0000:3b:00.5: Assigned random MAC address 86:3a:c6:d5:2a:b9
[ 4061.421220] mlx5_core 0000:3b:00.5 ens1f0v3: renamed from eth0
[ 4066.700811] mlx5_core 0000:3b:00.5: enabling device (0000 -> 0002)
[ 4066.708610] mlx5_core 0000:3b:00.5: firmware version: 16.26.6000
[ 4066.993835] mlx5_core 0000:3b:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[ 4067.011190] mlx5_core 0000:3b:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[ 4067.165708] mlx5_core 0000:3b:00.5 ens1f0v3: renamed from eth0
[ 4117.444330] mlx5_core 0000:3b:00.5 temp_58: renamed from ens1f0v3
[ 4117.465870] mlx5_core 0000:3b:00.5 net1: renamed from temp_58
[ 4117.563422] mlx5_core 0000:3b:00.5 net1: Link up


sh-4.4# ethtool -i net1
driver: mlx5_core
version: 5.0-0
firmware-version: 16.26.6000 (DEL0000000015)
expansion-rom-version:
bus-info: 0000:3b:00.5
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes


# From the node

sh-4.4# modinfo mlx5_core
filename:       /lib/modules/4.18.0-193.24.1.el8_2.dt1.x86_64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko.xz                                                                                      
[...]

sh-4.4# modinfo mlx5_ib
filename:       /lib/modules/4.18.0-193.24.1.el8_2.dt1.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz                                                                                                     
[...]


Alaa, can you confirm those versions and firmwares look ok?

Comment 7 Alaa Hleihel (NVIDIA Mellanox) 2020-10-27 11:01:27 UTC
Hi, David.

Yes, this FW version worked for us in BZ 1846560.
Back then, the root cause was the issue with hugepages; after adding the W/A to clear the memory, everything worked fine.
Can you test with that W/A just to double-check it's not the same issue?

What is the difference between the hosts and/or configurations in this BZ and that one?

Comment 8 David Marchand 2020-10-27 11:05:35 UTC
I removed the hugepage files and started from scratch.
There is no memory issue, no reason to test with this hack.
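
The thread does not say which files were removed or how. As a rough sketch only, leftover DPDK hugepage backing files could be cleared like this, assuming DPDK runs with its default file prefix (`rte`, so the files are named `rtemap_<n>`) and assuming the hugetlbfs mount is `/dev/hugepages`; both the helper name and the default path are hypothetical for this environment.

```shell
#!/bin/sh
# Remove leftover DPDK hugepage backing files from a hugetlbfs mount.
# $1: mount point to clean (assumed default: /dev/hugepages).
clear_dpdk_hugepage_files() {
    dir=${1:-/dev/hugepages}
    # Only touch regular files directly inside the mount point that match
    # the default DPDK naming scheme; everything else is left alone.
    find "$dir" -maxdepth 1 -type f -name 'rtemap_*' -exec rm -f {} +
}
```

This should only be run while no DPDK process is using the mount.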

Comment 9 Sebastian Scheinkman 2020-11-09 12:27:14 UTC
Hi Alaa,

Any update on this issue? I changed it to Urgent as it's blocking us.

Comment 10 Alaa Hleihel (NVIDIA Mellanox) 2020-11-09 13:21:42 UTC
(In reply to Sebastian Scheinkman from comment #9)
> Hi Alaa,
> 
> Any update on this issue I change it to Urgent as it's blocking us.

Hi, Sebastian.

Can you summarize the difference here from the working setup in BZ 1846560?
Different setup/card, different configuration, etc...?

Also, can you get a dump of firmware configurations from both systems?
To get it, run:
# yum install -y mstflint
# mstconfig -e -d <ConnectX card pci BDF> q
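
To compare dumps from two systems, it may help to reduce each one to the parameters that mstconfig flags with a leading `*`. In the dump posted in comment 11, the flagged lines are the ones whose Current value differs from Default; treating the marker that way in general is an assumption. The function name below is hypothetical.

```shell
#!/bin/sh
# Read "mstconfig -e -d <BDF> q" output on stdin and print each parameter
# flagged with a leading "*" as: NAME DEFAULT CURRENT NEXT_BOOT
mstconfig_flagged_params() {
    awk '$1 == "*" && NF >= 5 { print $2, $3, $4, $5 }'
}
```

For example, `mstconfig -e -d 0000:19:00.0 q | mstconfig_flagged_params` on each host, followed by a `diff` of the two saved results.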

Comment 11 Sebastian Scheinkman 2020-11-11 11:23:05 UTC
Hi Alaa,

Yes, this is a different cluster (I don't have access to the old one anymore).

Here you go:
sh-4.4# mstconfig -e -d 0000:19:00.0 q

Device #1:
----------

Device type:    ConnectX4LX     
Name:           0R887V          
Description:    MCX422A-ACAA ConnectX-4 Lx EN Dual Port SFP28; 25GbE for Dell rack NDC
Device:         0000:19:00.0    

Configurations:                              Default         Current         Next Boot

         MEMIC_BAR_SIZE                      0               0               0               
         MEMIC_SIZE_LIMIT                    _256KB(1)       _256KB(1)       _256KB(1)       
         FLEX_PARSER_PROFILE_ENABLE          0               0               0               
         FLEX_IPV4_OVER_VXLAN_PORT           0               0               0               
         ROCE_NEXT_PROTOCOL                  254             254             254             
         NON_PREFETCHABLE_PF_BAR             False(0)        False(0)        False(0)        
         VF_VPD_ENABLE                       False(0)        False(0)        False(0)        
         STRICT_VF_MSIX_NUM                  False(0)        False(0)        False(0)        
         VF_NODNIC_ENABLE                    False(0)        False(0)        False(0)        
*        NUM_OF_VFS                          8               5               5               
*        SRIOV_EN                            False(0)        True(1)         True(1)         
         PF_LOG_BAR_SIZE                     5               5               5               
         VF_LOG_BAR_SIZE                     0               0               0               
         NUM_PF_MSIX                         63              63              63              
         NUM_VF_MSIX                         11              11              11              
         INT_LOG_MAX_PAYLOAD_SIZE            AUTOMATIC(0)    AUTOMATIC(0)    AUTOMATIC(0)    
         PARTIAL_RESET_EN                    False(0)        False(0)        False(0)        
         SW_RECOVERY_ON_ERRORS               False(0)        False(0)        False(0)        
         RESET_WITH_HOST_ON_ERRORS           False(0)        False(0)        False(0)        
         CQE_COMPRESSION                     BALANCED(0)     BALANCED(0)     BALANCED(0)     
         IP_OVER_VXLAN_EN                    False(0)        False(0)        False(0)        
         UCTX_EN                             True(1)         True(1)         True(1)         
         PCI_ATOMIC_MODE                     PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0) PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0) PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
         LRO_LOG_TIMEOUT0                    6               6               6               
         LRO_LOG_TIMEOUT1                    7               7               7               
         LRO_LOG_TIMEOUT2                    8               8               8               
         LRO_LOG_TIMEOUT3                    13              13              13              
         LOG_DCR_HASH_TABLE_SIZE             14              14              14              
         DCR_LIFO_SIZE                       16384           16384           16384           
         ROCE_CC_PRIO_MASK_P1                255             255             255             
         ROCE_CC_ALGORITHM_P1                ECN(0)          ECN(0)          ECN(0)          
         ROCE_CC_PRIO_MASK_P2                255             255             255             
         ROCE_CC_ALGORITHM_P2                ECN(0)          ECN(0)          ECN(0)          
         CLAMP_TGT_RATE_AFTER_TIME_INC_P1    True(1)         True(1)         True(1)         
         CLAMP_TGT_RATE_P1                   False(0)        False(0)        False(0)        
         RPG_TIME_RESET_P1                   300             300             300             
         RPG_BYTE_RESET_P1                   32767           32767           32767           
         RPG_THRESHOLD_P1                    1               1               1               
         RPG_MAX_RATE_P1                     0               0               0               
         RPG_AI_RATE_P1                      5               5               5               
         RPG_HAI_RATE_P1                     50              50              50              
         RPG_GD_P1                           11              11              11              
         RPG_MIN_DEC_FAC_P1                  50              50              50              
         RPG_MIN_RATE_P1                     1               1               1               
         RATE_TO_SET_ON_FIRST_CNP_P1         0               0               0               
         DCE_TCP_G_P1                        1019            1019            1019            
         DCE_TCP_RTT_P1                      1               1               1               
         RATE_REDUCE_MONITOR_PERIOD_P1       4               4               4               
         INITIAL_ALPHA_VALUE_P1              1023            1023            1023            
         MIN_TIME_BETWEEN_CNPS_P1            0               0               0               
         CNP_802P_PRIO_P1                    6               6               6               
         CNP_DSCP_P1                         48              48              48              
         CLAMP_TGT_RATE_AFTER_TIME_INC_P2    True(1)         True(1)         True(1)         
         CLAMP_TGT_RATE_P2                   False(0)        False(0)        False(0)        
         RPG_TIME_RESET_P2                   300             300             300             
         RPG_BYTE_RESET_P2                   32767           32767           32767           
         RPG_THRESHOLD_P2                    1               1               1               
         RPG_MAX_RATE_P2                     0               0               0               
         RPG_AI_RATE_P2                      5               5               5               
         RPG_HAI_RATE_P2                     50              50              50              
         RPG_GD_P2                           11              11              11              
         RPG_MIN_DEC_FAC_P2                  50              50              50              
         RPG_MIN_RATE_P2                     1               1               1               
         RATE_TO_SET_ON_FIRST_CNP_P2         0               0               0               
         DCE_TCP_G_P2                        1019            1019            1019            
         DCE_TCP_RTT_P2                      1               1               1               
         RATE_REDUCE_MONITOR_PERIOD_P2       4               4               4               
         INITIAL_ALPHA_VALUE_P2              1023            1023            1023            
         MIN_TIME_BETWEEN_CNPS_P2            0               0               0               
         CNP_802P_PRIO_P2                    6               6               6               
         CNP_DSCP_P2                         48              48              48              
         LLDP_NB_DCBX_P1                     False(0)        False(0)        False(0)        
         LLDP_NB_RX_MODE_P1                  ALL(2)          ALL(2)          ALL(2)          
         LLDP_NB_TX_MODE_P1                  ALL(2)          ALL(2)          ALL(2)          
         LLDP_NB_DCBX_P2                     False(0)        False(0)        False(0)        
         LLDP_NB_RX_MODE_P2                  ALL(2)          ALL(2)          ALL(2)          
         LLDP_NB_TX_MODE_P2                  ALL(2)          ALL(2)          ALL(2)          
         DCBX_IEEE_P1                        True(1)         True(1)         True(1)         
         DCBX_CEE_P1                         True(1)         True(1)         True(1)         
         DCBX_WILLING_P1                     True(1)         True(1)         True(1)         
         DCBX_IEEE_P2                        True(1)         True(1)         True(1)         
         DCBX_CEE_P2                         True(1)         True(1)         True(1)         
         DCBX_WILLING_P2                     True(1)         True(1)         True(1)         
         KEEP_ETH_LINK_UP_P1                 True(1)         True(1)         True(1)         
         KEEP_IB_LINK_UP_P1                  False(0)        False(0)        False(0)        
         KEEP_LINK_UP_ON_BOOT_P1             False(0)        False(0)        False(0)        
         KEEP_LINK_UP_ON_STANDBY_P1          False(0)        False(0)        False(0)        
         KEEP_ETH_LINK_UP_P2                 True(1)         True(1)         True(1)         
         KEEP_IB_LINK_UP_P2                  False(0)        False(0)        False(0)        
         KEEP_LINK_UP_ON_BOOT_P2             False(0)        False(0)        False(0)        
         KEEP_LINK_UP_ON_STANDBY_P2          False(0)        False(0)        False(0)        
         NUM_OF_VL_P1                        _4_VLs(3)       _4_VLs(3)       _4_VLs(3)       
         NUM_OF_TC_P1                        _8_TCs(0)       _8_TCs(0)       _8_TCs(0)       
         NUM_OF_PFC_P1                       8               8               8               
         NUM_OF_VL_P2                        _4_VLs(3)       _4_VLs(3)       _4_VLs(3)       
         NUM_OF_TC_P2                        _8_TCs(0)       _8_TCs(0)       _8_TCs(0)       
         NUM_OF_PFC_P2                       8               8               8               
         DUP_MAC_ACTION_P1                   LAST_CFG(0)     LAST_CFG(0)     LAST_CFG(0)     
         SRIOV_IB_ROUTING_MODE_P1            LID(1)          LID(1)          LID(1)          
         IB_ROUTING_MODE_P1                  LID(1)          LID(1)          LID(1)          
         DUP_MAC_ACTION_P2                   LAST_CFG(0)     LAST_CFG(0)     LAST_CFG(0)     
         SRIOV_IB_ROUTING_MODE_P2            LID(1)          LID(1)          LID(1)          
         IB_ROUTING_MODE_P2                  LID(1)          LID(1)          LID(1)          
         WOL_MAGIC_EN                        False(0)        False(0)        False(0)        
         PCI_WR_ORDERING                     per_mkey(0)     per_mkey(0)     per_mkey(0)     
         MULTI_PORT_VHCA_EN                  False(0)        False(0)        False(0)        
         PORT_OWNER                          True(1)         True(1)         True(1)         
         ALLOW_RD_COUNTERS                   True(1)         True(1)         True(1)         
         RENEG_ON_CHANGE                     True(1)         True(1)         True(1)         
         TRACER_ENABLE                       True(1)         True(1)         True(1)         
         BOOT_UNDI_NETWORK_WAIT              0               0               0               
         UEFI_HII_EN                         True(1)         True(1)         True(1)         
         BOOT_DBG_LOG                        False(0)        False(0)        False(0)        
         UEFI_LOGS                           DISABLED(0)     DISABLED(0)     DISABLED(0)     
         BOOT_VLAN                           1               1               1               
*        LEGACY_BOOT_PROTOCOL                PXE(1)          PXE(1)          NONE(0)         
         BOOT_RETRY_CNT                      NONE(0)         NONE(0)         NONE(0)         
         BOOT_LACP_DIS                       True(1)         True(1)         True(1)         
         BOOT_VLAN_EN                        False(0)        False(0)        False(0)        
         BOOT_PKEY                           0               0               0               
         DYNAMIC_VF_MSIX_TABLE               False(0)        False(0)        False(0)        
         ADVANCED_PCI_SETTINGS               False(0)        False(0)        False(0)        
         SAFE_MODE_THRESHOLD                 10              10              10              
         SAFE_MODE_ENABLE                    True(1)         True(1)         True(1)         
The '*' shows parameters with next value different from default/current value.


PF:
lspci -v -nn -mm -k -s 0000:19:00.0
Slot:	19:00.0
Class:	Ethernet controller [0200]
Vendor:	Mellanox Technologies [15b3]
Device:	MT27710 Family [ConnectX-4 Lx] [1015]
SVendor:	Mellanox Technologies [15b3]
SDevice:	ConnectX-4 Lx 25 GbE Dual Port SFP28 rNDC [0025]
Driver:	mlx5_core
Module:	mlx5_core
NUMANode:	0

ethtool -i eno1
driver: mlx5_core
version: 5.0-0
firmware-version: 14.26.6000 (DEL2810000034)
expansion-rom-version: 
bus-info: 0000:19:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes



VF:
lspci -v -nn -mm -k -s 0000:19:00.2
Slot:	19:00.2
Class:	Ethernet controller [0200]
Vendor:	Mellanox Technologies [15b3]
Device:	MT27710 Family [ConnectX-4 Lx Virtual Function] [1016]
SVendor:	Mellanox Technologies [15b3]
SDevice:	Device [0025]
Driver:	mlx5_core
Module:	mlx5_core
NUMANode:	0

I tried dpdk-19.11-5.el8_2; this is the output:

testpmd -l ${CPU} -w ${PCIDEVICE_OPENSHIFT_IO_DPDKNIC} --log-level="*:debug" --iova-mode=va -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 1
EAL: Detected lcore 2 as core 6 on socket 0
EAL: Detected lcore 3 as core 6 on socket 1
EAL: Detected lcore 4 as core 1 on socket 0
EAL: Detected lcore 5 as core 1 on socket 1
EAL: Detected lcore 6 as core 5 on socket 0
EAL: Detected lcore 7 as core 5 on socket 1
EAL: Detected lcore 8 as core 2 on socket 0
EAL: Detected lcore 9 as core 2 on socket 1
EAL: Detected lcore 10 as core 4 on socket 0
EAL: Detected lcore 11 as core 4 on socket 1
EAL: Detected lcore 12 as core 3 on socket 0
EAL: Detected lcore 13 as core 3 on socket 1
EAL: Detected lcore 14 as core 13 on socket 0
EAL: Detected lcore 15 as core 13 on socket 1
EAL: Detected lcore 16 as core 8 on socket 0
EAL: Detected lcore 17 as core 8 on socket 1
EAL: Detected lcore 18 as core 12 on socket 0
EAL: Detected lcore 19 as core 12 on socket 1
EAL: Detected lcore 20 as core 9 on socket 0
EAL: Detected lcore 21 as core 9 on socket 1
EAL: Detected lcore 22 as core 11 on socket 0
EAL: Detected lcore 23 as core 11 on socket 1
EAL: Detected lcore 24 as core 10 on socket 0
EAL: Detected lcore 25 as core 10 on socket 1
EAL: Detected lcore 26 as core 22 on socket 0
EAL: Detected lcore 27 as core 22 on socket 1
EAL: Detected lcore 28 as core 16 on socket 0
EAL: Detected lcore 29 as core 16 on socket 1
EAL: Detected lcore 30 as core 21 on socket 0
EAL: Detected lcore 31 as core 21 on socket 1
EAL: Detected lcore 32 as core 17 on socket 0
EAL: Detected lcore 33 as core 17 on socket 1
EAL: Detected lcore 34 as core 20 on socket 0
EAL: Detected lcore 35 as core 20 on socket 1
EAL: Detected lcore 36 as core 18 on socket 0
EAL: Detected lcore 37 as core 18 on socket 1
EAL: Detected lcore 38 as core 19 on socket 0
EAL: Detected lcore 39 as core 19 on socket 1
EAL: Detected lcore 40 as core 24 on socket 0
EAL: Detected lcore 41 as core 24 on socket 1
EAL: Detected lcore 42 as core 29 on socket 0
EAL: Detected lcore 43 as core 29 on socket 1
EAL: Detected lcore 44 as core 25 on socket 0
EAL: Detected lcore 45 as core 25 on socket 1
EAL: Detected lcore 46 as core 28 on socket 0
EAL: Detected lcore 47 as core 28 on socket 1
EAL: Detected lcore 48 as core 26 on socket 0
EAL: Detected lcore 49 as core 26 on socket 1
EAL: Detected lcore 50 as core 27 on socket 0
EAL: Detected lcore 51 as core 27 on socket 1
EAL: Detected lcore 52 as core 0 on socket 0
EAL: Detected lcore 53 as core 0 on socket 1
EAL: Detected lcore 54 as core 6 on socket 0
EAL: Detected lcore 55 as core 6 on socket 1
EAL: Detected lcore 56 as core 1 on socket 0
EAL: Detected lcore 57 as core 1 on socket 1
EAL: Detected lcore 58 as core 5 on socket 0
EAL: Detected lcore 59 as core 5 on socket 1
EAL: Detected lcore 60 as core 2 on socket 0
EAL: Detected lcore 61 as core 2 on socket 1
EAL: Detected lcore 62 as core 4 on socket 0
EAL: Detected lcore 63 as core 4 on socket 1
EAL: Detected lcore 64 as core 3 on socket 0
EAL: Detected lcore 65 as core 3 on socket 1
EAL: Detected lcore 66 as core 13 on socket 0
EAL: Detected lcore 67 as core 13 on socket 1
EAL: Detected lcore 68 as core 8 on socket 0
EAL: Detected lcore 69 as core 8 on socket 1
EAL: Detected lcore 70 as core 12 on socket 0
EAL: Detected lcore 71 as core 12 on socket 1
EAL: Detected lcore 72 as core 9 on socket 0
EAL: Detected lcore 73 as core 9 on socket 1
EAL: Detected lcore 74 as core 11 on socket 0
EAL: Detected lcore 75 as core 11 on socket 1
EAL: Detected lcore 76 as core 10 on socket 0
EAL: Detected lcore 77 as core 10 on socket 1
EAL: Detected lcore 78 as core 22 on socket 0
EAL: Detected lcore 79 as core 22 on socket 1
EAL: Detected lcore 80 as core 16 on socket 0
EAL: Detected lcore 81 as core 16 on socket 1
EAL: Detected lcore 82 as core 21 on socket 0
EAL: Detected lcore 83 as core 21 on socket 1
EAL: Detected lcore 84 as core 17 on socket 0
EAL: Detected lcore 85 as core 17 on socket 1
EAL: Detected lcore 86 as core 20 on socket 0
EAL: Detected lcore 87 as core 20 on socket 1
EAL: Detected lcore 88 as core 18 on socket 0
EAL: Detected lcore 89 as core 18 on socket 1
EAL: Detected lcore 90 as core 19 on socket 0
EAL: Detected lcore 91 as core 19 on socket 1
EAL: Detected lcore 92 as core 24 on socket 0
EAL: Detected lcore 93 as core 24 on socket 1
EAL: Detected lcore 94 as core 29 on socket 0
EAL: Detected lcore 95 as core 29 on socket 1
EAL: Detected lcore 96 as core 25 on socket 0
EAL: Detected lcore 97 as core 25 on socket 1
EAL: Detected lcore 98 as core 28 on socket 0
EAL: Detected lcore 99 as core 28 on socket 1
EAL: Detected lcore 100 as core 26 on socket 0
EAL: Detected lcore 101 as core 26 on socket 1
EAL: Detected lcore 102 as core 27 on socket 0
EAL: Detected lcore 103 as core 27 on socket 1
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 104 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_bnxt.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_e1000.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_enic.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_failsafe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_i40e.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ixgbe.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx4.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_mlx5.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_netvsc.so.20.0
EAL: Registered [vmbus] bus.
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_nfp.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_qede.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_ring.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_tap.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vdev_netvsc.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_vhost.so.20.0
EAL: open shared lib /usr/lib64/dpdk-pmds/librte_pmd_virtio.so.20.0
EAL: Ask a virtual area of 0x5000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x5000)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO PCI modules not loaded
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Ask a virtual area of 0x2e000 bytes
EAL: Virtual area found at 0x100005000 (size = 0x2e000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 1048576
EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
EAL: Detected memory type: socket_id:1 hugepage_sz:1073741824
EAL: Creating 4 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x100033000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x940000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x980000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x1180000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x19c0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
EAL: Creating 4 segment lists: n_segs:32 socket_id:1 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2200000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2240000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2a40000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2a80000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3280000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x32c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3ac0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x3b00000000 (size = 0x800000000)
EAL: TSC frequency is ~2100000 KHz
EAL: Master lcore 12 is ready (tid=7f3200db5900;cpuset=[12])
EAL: lcore 64 is ready (tid=7f31f7b90700;cpuset=[64])
EAL: lcore 66 is ready (tid=7f31f738f700;cpuset=[66])
EAL: lcore 14 is ready (tid=7f31f8391700;cpuset=[14])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 1024MB
EAL: PCI device 0000:19:00.6 on NUMA socket 0
EAL:   probe driver: 15b3:1016 net_mlx5
EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
net_mlx5: unable to recognize master/representors on the multiple IB devices
EAL: Requested device 0000:19:00.6 cannot be used
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
testpmd: No probed ethernet devices
Interactive-mode selected
Set mac packet forwarding mode
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Done


I also tried upstream 19.11, and there I get the DevX issue. I will try updating to 20.11 rc.

testpmd -l ${CPU} -w ${PCIDEVICE_OPENSHIFT_IO_DPDKNIC} --iova-mode=va --log-level="*:debug" -- -i --portmask=0x1 --nb-cores=2 --forward-mode=mac --port-topology=loop --no-mlockall
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 0 on socket 1
EAL: Detected lcore 2 as core 6 on socket 0
EAL: Detected lcore 3 as core 6 on socket 1
EAL: Detected lcore 4 as core 1 on socket 0
EAL: Detected lcore 5 as core 1 on socket 1
EAL: Detected lcore 6 as core 5 on socket 0
EAL: Detected lcore 7 as core 5 on socket 1
EAL: Detected lcore 8 as core 2 on socket 0
EAL: Detected lcore 9 as core 2 on socket 1
EAL: Detected lcore 10 as core 4 on socket 0
EAL: Detected lcore 11 as core 4 on socket 1
EAL: Detected lcore 12 as core 3 on socket 0
EAL: Detected lcore 13 as core 3 on socket 1
EAL: Detected lcore 14 as core 13 on socket 0
EAL: Detected lcore 15 as core 13 on socket 1
EAL: Detected lcore 16 as core 8 on socket 0
EAL: Detected lcore 17 as core 8 on socket 1
EAL: Detected lcore 18 as core 12 on socket 0
EAL: Detected lcore 19 as core 12 on socket 1
EAL: Detected lcore 20 as core 9 on socket 0
EAL: Detected lcore 21 as core 9 on socket 1
EAL: Detected lcore 22 as core 11 on socket 0
EAL: Detected lcore 23 as core 11 on socket 1
EAL: Detected lcore 24 as core 10 on socket 0
EAL: Detected lcore 25 as core 10 on socket 1
EAL: Detected lcore 26 as core 22 on socket 0
EAL: Detected lcore 27 as core 22 on socket 1
EAL: Detected lcore 28 as core 16 on socket 0
EAL: Detected lcore 29 as core 16 on socket 1
EAL: Detected lcore 30 as core 21 on socket 0
EAL: Detected lcore 31 as core 21 on socket 1
EAL: Detected lcore 32 as core 17 on socket 0
EAL: Detected lcore 33 as core 17 on socket 1
EAL: Detected lcore 34 as core 20 on socket 0
EAL: Detected lcore 35 as core 20 on socket 1
EAL: Detected lcore 36 as core 18 on socket 0
EAL: Detected lcore 37 as core 18 on socket 1
EAL: Detected lcore 38 as core 19 on socket 0
EAL: Detected lcore 39 as core 19 on socket 1
EAL: Detected lcore 40 as core 24 on socket 0
EAL: Detected lcore 41 as core 24 on socket 1
EAL: Detected lcore 42 as core 29 on socket 0
EAL: Detected lcore 43 as core 29 on socket 1
EAL: Detected lcore 44 as core 25 on socket 0
EAL: Detected lcore 45 as core 25 on socket 1
EAL: Detected lcore 46 as core 28 on socket 0
EAL: Detected lcore 47 as core 28 on socket 1
EAL: Detected lcore 48 as core 26 on socket 0
EAL: Detected lcore 49 as core 26 on socket 1
EAL: Detected lcore 50 as core 27 on socket 0
EAL: Detected lcore 51 as core 27 on socket 1
EAL: Detected lcore 52 as core 0 on socket 0
EAL: Detected lcore 53 as core 0 on socket 1
EAL: Detected lcore 54 as core 6 on socket 0
EAL: Detected lcore 55 as core 6 on socket 1
EAL: Detected lcore 56 as core 1 on socket 0
EAL: Detected lcore 57 as core 1 on socket 1
EAL: Detected lcore 58 as core 5 on socket 0
EAL: Detected lcore 59 as core 5 on socket 1
EAL: Detected lcore 60 as core 2 on socket 0
EAL: Detected lcore 61 as core 2 on socket 1
EAL: Detected lcore 62 as core 4 on socket 0
EAL: Detected lcore 63 as core 4 on socket 1
EAL: Detected lcore 64 as core 3 on socket 0
EAL: Detected lcore 65 as core 3 on socket 1
EAL: Detected lcore 66 as core 13 on socket 0
EAL: Detected lcore 67 as core 13 on socket 1
EAL: Detected lcore 68 as core 8 on socket 0
EAL: Detected lcore 69 as core 8 on socket 1
EAL: Detected lcore 70 as core 12 on socket 0
EAL: Detected lcore 71 as core 12 on socket 1
EAL: Detected lcore 72 as core 9 on socket 0
EAL: Detected lcore 73 as core 9 on socket 1
EAL: Detected lcore 74 as core 11 on socket 0
EAL: Detected lcore 75 as core 11 on socket 1
EAL: Detected lcore 76 as core 10 on socket 0
EAL: Detected lcore 77 as core 10 on socket 1
EAL: Detected lcore 78 as core 22 on socket 0
EAL: Detected lcore 79 as core 22 on socket 1
EAL: Detected lcore 80 as core 16 on socket 0
EAL: Detected lcore 81 as core 16 on socket 1
EAL: Detected lcore 82 as core 21 on socket 0
EAL: Detected lcore 83 as core 21 on socket 1
EAL: Detected lcore 84 as core 17 on socket 0
EAL: Detected lcore 85 as core 17 on socket 1
EAL: Detected lcore 86 as core 20 on socket 0
EAL: Detected lcore 87 as core 20 on socket 1
EAL: Detected lcore 88 as core 18 on socket 0
EAL: Detected lcore 89 as core 18 on socket 1
EAL: Detected lcore 90 as core 19 on socket 0
EAL: Detected lcore 91 as core 19 on socket 1
EAL: Detected lcore 92 as core 24 on socket 0
EAL: Detected lcore 93 as core 24 on socket 1
EAL: Detected lcore 94 as core 29 on socket 0
EAL: Detected lcore 95 as core 29 on socket 1
EAL: Detected lcore 96 as core 25 on socket 0
EAL: Detected lcore 97 as core 25 on socket 1
EAL: Detected lcore 98 as core 28 on socket 0
EAL: Detected lcore 99 as core 28 on socket 1
EAL: Detected lcore 100 as core 26 on socket 0
EAL: Detected lcore 101 as core 26 on socket 1
EAL: Detected lcore 102 as core 27 on socket 0
EAL: Detected lcore 103 as core 27 on socket 1
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 104 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Ask a virtual area of 0x5000 bytes
EAL: Virtual area found at 0x100000000 (size = 0x5000)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Module /sys/module/vfio_pci not found! error 2 (No such file or directory)
EAL: VFIO PCI modules not loaded
dpaa: rte_dpaa_bus_scan():  >>
EAL: DPAA Bus not present. Skipping.
fslmc: fslmc_get_container_group(): DPAA2: DPRC not available
fslmc: rte_fslmc_scan(): FSLMC Bus Not Available. Skipping (-22)
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
EAL: VFIO modules not loaded, skipping VFIO support...
EAL: Ask a virtual area of 0x2e000 bytes
EAL: Virtual area found at 0x100005000 (size = 0x2e000)
EAL: Setting up physically contiguous memory...
EAL: Setting maximum number of open files to 1048576
EAL: Detected memory type: socket_id:0 hugepage_sz:1073741824
EAL: Detected memory type: socket_id:1 hugepage_sz:1073741824
EAL: Creating 4 segment lists: n_segs:32 socket_id:0 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x100033000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x940000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x980000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x1180000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x11c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x19c0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 0
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x1a00000000 (size = 0x800000000)
EAL: Creating 4 segment lists: n_segs:32 socket_id:1 hugepage_sz:1073741824
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2200000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2240000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x2a40000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x2a80000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3280000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x32c0000000 (size = 0x800000000)
EAL: Ask a virtual area of 0x1000 bytes
EAL: Virtual area found at 0x3ac0000000 (size = 0x1000)
EAL: Memseg list allocated: 0x100000kB at socket 1
EAL: Ask a virtual area of 0x800000000 bytes
EAL: Virtual area found at 0x3b00000000 (size = 0x800000000)
EAL: TSC frequency is ~2100000 KHz
EAL: Master lcore 8 is ready (tid=7f2faa1e8c00;cpuset=[8])
EAL: lcore 60 is ready (tid=7f2fa6f45400;cpuset=[60])
EAL: lcore 62 is ready (tid=7f2fa6744400;cpuset=[62])
EAL: lcore 10 is ready (tid=7f2fa7746400;cpuset=[10])
EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: Restoring previous memory policy: 0
EAL: request: mp_malloc_sync
EAL: Heap on socket 0 was expanded by 1024MB
EAL: PCI device 0000:19:00.2 on NUMA socket 0
EAL:   probe driver: 15b3:1016 net_mlx5
EAL: Mem event callback 'MLX5_MEM_EVENT_CB:(nil)' registered
net_mlx5: checking device "mlx5_6"
net_mlx5: checking device "mlx5_5"
net_mlx5: checking device "mlx5_4"
net_mlx5: checking device "mlx5_3"
net_mlx5: checking device "mlx5_2"
net_mlx5: PCI information matches for device "mlx5_2"
net_mlx5: checking device "mlx5_1"
net_mlx5: checking device "mlx5_0"
net_mlx5: no E-Switch support detected
net_mlx5: naming Ethernet device "0000:19:00.2"
net_mlx5: DevX is supported
net_mlx5: Failed to create TIS using DevX
net_mlx5: TIS allocation failure
net_mlx5: probe of PCI device 0000:19:00.2 aborted after encountering an error: Cannot allocate memory
EAL: Requested device 0000:19:00.2 cannot be used
EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
testpmd: No probed ethernet devices
Interactive-mode selected
Set mac packet forwarding mode
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=171456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Done

Comment 12 Alaa Hleihel (NVIDIA Mellanox) 2020-11-11 11:44:22 UTC
OK, so you don't know what the difference between the clusters is...

Can you provide me access with reproduction steps?

Comment 13 Sebastian Scheinkman 2020-11-11 12:55:32 UTC
sure

contacted you offline

Comment 14 Alaa Hleihel (NVIDIA Mellanox) 2020-11-12 13:52:14 UTC
> net_mlx5: Failed to create TIS using DevX
> net_mlx5: TIS allocation failure

In the worker node's dmesg, I can see this error right after the testpmd failure:
[98942.666462] mlx5_core 0000:19:00.2: mlx5_cmd_check:760:(pid 1581338): CREATE_TIS(0x912) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x6a6678)

The syndrome meaning is:
0x6A6678 | create_tis: cannot create while log_max_tis = 0
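
For anyone triaging similar failures, here is a minimal sketch of pulling the failing command, status and syndrome out of such a dmesg line. The helper name, the regular expression and the one-entry lookup table are illustrative assumptions, not part of any mlx5 tooling; the only syndrome text included is the one quoted above.

```python
import re

# Hypothetical helper: matches the mlx5_core "mlx5_cmd_check" failure format
# seen in the dmesg line quoted in this comment. Other kernels may log differently.
MLX5_CMD_FAIL = re.compile(
    r"mlx5_core (?P<pci>[0-9a-fA-F:.]+): mlx5_cmd_check:\d+:\(pid (?P<pid>\d+)\): "
    r"(?P<cmd>\w+)\((?P<opcode>0x[0-9a-fA-F]+)\) "
    r"op_mod\((?P<op_mod>0x[0-9a-fA-F]+)\) failed, "
    r"status (?P<status>.+?)\((?P<status_code>0x[0-9a-fA-F]+)\), "
    r"syndrome \((?P<syndrome>0x[0-9a-fA-F]+)\)"
)

# Only the syndrome decoded in this thread; any other value is reported as unknown.
KNOWN_SYNDROMES = {
    0x6A6678: "create_tis: cannot create while log_max_tis = 0",
}


def parse_mlx5_cmd_failure(line):
    """Return a dict describing the failed command, or None if the line does not match."""
    match = MLX5_CMD_FAIL.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    info["syndrome"] = int(info["syndrome"], 16)
    info["meaning"] = KNOWN_SYNDROMES.get(info["syndrome"], "unknown")
    return info


SAMPLE = (
    "[98942.666462] mlx5_core 0000:19:00.2: mlx5_cmd_check:760:(pid 1581338): "
    "CREATE_TIS(0x912) op_mod(0x0) failed, status bad parameter(0x3), "
    "syndrome (0x6a6678)"
)

if __name__ == "__main__":
    print(parse_mlx5_cmd_failure(SAMPLE))
```

Feeding it the line above yields the CREATE_TIS command on 0000:19:00.2 together with the log_max_tis text; lines in any other format return None rather than a guess.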

> net_mlx5: probe of PCI device 0000:19:00.2 aborted after encountering an error: Cannot allocate memory
> EAL: Requested device 0000:19:00.2 cannot be used


Will continue checking...

Comment 15 Alaa Hleihel (NVIDIA Mellanox) 2020-11-16 10:41:02 UTC
Hi, Sebastian.

Can you send me the yaml file used to build this POD?

Thanks,
Alaa

Comment 16 Sebastian Scheinkman 2020-11-16 13:36:57 UTC
Hi Alaa,

sure

ENV BUILDER_VERSION 0.1
ENV DPDK_VER 19.11-4
ENV DPDK_DIR /usr/share/dpdk
ENV RTE_TARGET=x86_64-default-linux-gcc
ENV RTE_EXEC_ENV=linux
ENV RTE_SDK=${DPDK_DIR}

RUN INSTALL_PKGS="bsdtar \
findutils \
groff-base \
glibc-locale-source \
glibc-langpack-en \
gettext \
rsync \
scl-utils \
tar \
unzip \
xz \
yum \
dpdk \
dpdk-devel \
dpdk-tools \
make \
rdma-core \
libibverbs \
git \
gcc \
expect" && \
mkdir -p ${HOME}/.pki/nssdb && \
microdnf install -y --setopt=tsflags=nodocs $INSTALL_PKGS && \
rpm -V $INSTALL_PKGS && \
microdnf -y clean all --enablerepo='*'

Comment 17 Alaa Hleihel (NVIDIA Mellanox) 2020-11-16 13:47:01 UTC
(In reply to Sebastian Scheinkman from comment #16)
> Hi Alaa,
> 
> sure
> 
> ENV BUILDER_VERSION 0.1
> ENV DPDK_VER 19.11-4
> ENV DPDK_DIR /usr/share/dpdk
> ENV RTE_TARGET=x86_64-default-linux-gcc
> ENV RTE_EXEC_ENV=linux
> ENV RTE_SDK=${DPDK_DIR}
> 
> RUN INSTALL_PKGS="bsdtar \
> findutils \
> groff-base \
> glibc-locale-source \
> glibc-langpack-en \
> gettext \
> rsync \
> scl-utils \
> tar \
> unzip \
> xz \
> yum \
> dpdk \
> dpdk-devel \
> dpdk-tools \
> make \
> rdma-core \
> libibverbs \
> git \
> gcc \
> expect" && \
> mkdir -p ${HOME}/.pki/nssdb && \
> microdnf install -y --setopt=tsflags=nodocs $INSTALL_PKGS && \
> rpm -V $INSTALL_PKGS && \
> microdnf -y clean all --enablerepo='*'

I don't see the enabled capabilities.
Did you enable any?

like cap_sys_admin, cap_net_admin, cap_net_raw, cap_ipc_lock+ep, etc...
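
One way to answer this from inside the running container is to decode the CapEff mask in /proc/self/status. The sketch below is an assumption about how one might check it, not something that was run in this thread; the function names are made up for illustration, and the bit numbers are the standard ones from linux/capability.h.

```python
# Capability bit numbers as defined in linux/capability.h.
CAP_BITS = {
    "NET_ADMIN": 12,
    "NET_RAW": 13,
    "IPC_LOCK": 14,
    "SYS_ADMIN": 21,
    "SYS_RESOURCE": 24,
}


def effective_caps(status_text):
    """Return the names from CAP_BITS that are set in the CapEff mask.

    `status_text` is the content of a /proc/<pid>/status file.
    """
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}
    raise ValueError("no CapEff line found")


def read_own_status():
    """Read the status file of the current process (Linux only)."""
    with open("/proc/self/status") as status_file:
        return status_file.read()


if __name__ == "__main__":
    caps = effective_caps(read_own_status())
    for name in sorted(CAP_BITS):
        print(f"{name}: {'yes' if name in caps else 'no'}")
```

Run as the same user as testpmd, this prints yes/no for each of the capabilities named above, which makes a missing one easy to spot.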

Comment 18 Sebastian Scheinkman 2020-11-16 14:23:08 UTC
This is in the pod spec and has been the same for over a year (it was working before and still works on Intel-based NICs)

apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: "dpdk-testing/test-dpdk-network"
    openshift.io/scc: privileged
  labels:
    app: dpdk
  name: dpdk-pod
  namespace: dpdk-testing
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep INF
    env:
    - name: RUN_TYPE
      value: testpmd
    image: registry-proxy.engineering.redhat.com/rh-osbs/dpdk-base:v4.6.0-8
    imagePullPolicy: Always
    name: dpdk
    resources:
      limits:
        cpu: "4"
        hugepages-1Gi: 2Gi
        memory: 6Gi
      requests:
        cpu: "4"
        hugepages-1Gi: 2Gi
        memory: 6Gi
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
        - SYS_RESOURCE
      runAsUser: 0
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepages
  restartPolicy: Always
  volumes:
  - emptyDir:
      medium: HugePages
    name: hugepages

Comment 19 Sebastian Scheinkman 2020-11-19 13:18:23 UTC
Hi Alaa,

do you have any update on this issue?

Thanks!
Sebastian

Comment 20 Alaa Hleihel (NVIDIA Mellanox) 2020-11-19 15:06:31 UTC
(In reply to Sebastian Scheinkman from comment #19)
> Hi Alaa,
> 
> do you have any update on this issue?
> 
> Thanks!
> Sebastian

hi,

The team was trying to get a similar system in our lab, but there isn't much progress on that.
So I'll need your system again.
Please make sure it will be available for debug next week.

Comment 21 Alaa Hleihel (NVIDIA Mellanox) 2020-11-22 14:58:22 UTC
The system seems to be down...

Can you check?

Comment 22 Sebastian Scheinkman 2020-11-23 10:13:15 UTC
I shared the new env offline

I also upgraded the firmware version of the NICs as requested.

Comment 23 Alaa Hleihel (NVIDIA Mellanox) 2020-11-24 08:08:45 UTC
(In reply to Sebastian Scheinkman from comment #18)
> This is in the pod spec and is the same for over 1 year (it was working
> before and still working on intel base nics)
> 
> apiVersion: v1
>   kind: Pod
>   metadata:
>     annotations:
>       k8s.v1.cni.cncf.io/networks: "dpdk-testing/test-dpdk-network"
>       openshift.io/scc: privileged
>     labels:
>       app: dpdk
>     name: dpdk-pod
>     namespace: dpdk-testing
>   spec:
>     containers:
>     - command:
>       - /bin/bash
>       - -c
>       - sleep INF
>       env:
>       - name: RUN_TYPE
>         value: testpmd
>       image: registry-proxy.engineering.redhat.com/rh-osbs/dpdk-base:v4.6.0-8
>       imagePullPolicy: Always
>       name: dpdk
>       resources:
>         limits:
>           cpu: "4"
>           hugepages-1Gi: 2Gi
>           memory: 6Gi
>         requests:
>           cpu: "4"
>           hugepages-1Gi: 2Gi
>           memory: 6Gi
>       securityContext:
>         capabilities:
>           add:
>           - IPC_LOCK
>           - SYS_RESOURCE

This is a configuration issue.
The "NET_RAW" capability also needs to be added for DevX commands to work properly.

>         runAsUser: 0
>       volumeMounts:
>       - mountPath: /mnt/huge
>         name: hugepages
>     restartPolicy: Always
>     volumes:
>     - emptyDir:
>         medium: HugePages
>       name: hugepages
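
For reference, a minimal sketch of the securityContext from the quoted pod spec with the change suggested above applied. Only the NET_RAW entry is new; the other fields are copied unchanged from the spec, and whether NET_RAW alone is sufficient on a given cluster is an assumption that should be confirmed by rerunning testpmd.

```yaml
# Fragment of the container definition from the pod spec quoted above.
securityContext:
  capabilities:
    add:
    - IPC_LOCK
    - SYS_RESOURCE
    - NET_RAW   # added per this comment so that DevX commands can work
  runAsUser: 0
```

The fragment replaces the existing securityContext block of the dpdk container; the rest of the pod spec stays as it is.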

Comment 24 Sebastian Scheinkman 2020-11-24 12:09:25 UTC
Hi Alaa,

Thanks for the comment!

The NET_RAW capability was enabled by default until this CVE: https://access.redhat.com/security/cve/cve-2020-14386

I will update the OpenShift documentation:

https://docs.openshift.com/container-platform/4.6/networking/hardware_networks/using-dpdk-and-rdma.html

Comment 25 David Marchand 2020-11-24 12:20:03 UTC
Should I assign this bz to you, Sebastian, and/or change the component?

Comment 26 Sebastian Scheinkman 2020-11-24 12:21:43 UTC
Hi David,

I will do it

Comment 29 elevin 2020-12-22 14:34:54 UTC
OCP: 4.7.0-fc.0
CNF_IMAGE_VERSION: openshift4-cnf-tests:v4.7.0-10
DPDK_IMAGE_VERSION: dpdk-base:v4.7.0-4

=========================================================

• [SLOW TEST:54.483 seconds]
dpdk
/remote-source/app/functests/dpdk/dpdk.go:87
  Validate a DPDK workload running inside a pod
  /remote-source/app/functests/dpdk/dpdk.go:168
    Should forward and receive packets
    /remote-source/app/functests/dpdk/dpdk.go:169

Comment 31 errata-xmlrpc 2021-02-24 15:28:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

