Bug 1785933

Summary: Unable to run a dpdk workload without privileged=true

Product: OpenShift Container Platform
Component: Networking
Networking sub component: SR-IOV
Version: 4.3.0
Target Release: 4.4.0
Reporter: Sebastian Scheinkman <sscheink>
Assignee: zenghui.shi <zshi>
QA Contact: zhaozhanqi <zzhao>
CC: ailan, augol, bbennett, dmarchan, eparis, nhorman, zshi
Status: CLOSED DUPLICATE
Last Closed: 2020-02-05 14:13:36 UTC
Type: Bug
Clones: 1789352 (view as bug list)
Bug Blocks: 1771572, 1789352, 1791410, 1791411

Description Sebastian Scheinkman 2019-12-22 14:59:44 UTC
Description of problem:
I am unable to run a dpdk workload without privileged=true


Version-Release number of selected component (if applicable):
openshift 4.3

How reproducible:
100%

Steps to Reproduce:
1. deploy the SR-IOV operator
2. configure the SR-IOV interface and policy
3. patch the nodes' kernel arguments to add "intel_iommu=on iommu=pt"
4. configure hugepages on the nodes (example manifests for steps 2-4 are sketched after the testpmd command below)
5. deploy a testpmd pod using the YAML in Additional info below
6. exec into the pod and start the testpmd application

testpmd -l <cpu-on-the-same-numa> -w <vf-pci-address> -- -i
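
For reference, steps 2-4 can be expressed with manifests like the sketch below. Names, node selectors, NIC names and hugepage counts are placeholders; the field names follow the SR-IOV operator and Machine Config Operator APIs, but should be verified against the versions in use.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-dpdk-kargs
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  # enables the IOMMU and reserves 1Gi hugepages on worker nodes
  kernelArguments:
  - intel_iommu=on
  - iommu=pt
  - default_hugepagesz=1G
  - hugepagesz=1G
  - hugepages=16
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-dpdk
  namespace: openshift-sriov-network-operator
spec:
  resourceName: dpdknic
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    pfNames: ["ens2f0"]
  deviceType: netdevice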

Actual results:
testpmd -l 22,24,26 -w 0000:19:01.2 -- -i
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
error allocating rte services array
EAL: FATAL: rte_service_init() failed
EAL: rte_service_init() failed
PANIC in main():
Cannot init EAL
5: [testpmd() [0x42e657]]
4: [/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f044e1363d5]]
3: [testpmd(main+0xbd7) [0x42e5b7]]
2: [/lib64/librte_eal.so.9(__rte_panic+0xbd) [0x7f044f3396bd]]
1: [/lib64/librte_eal.so.9(rte_dump_stack+0x2d) [0x7f044f34503d]]
Aborted (core dumped)



Expected results:
testpmd -l 22,24,26 -w 0000:19:01.0 -- -i
EAL: Detected 80 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:19:01.0 on NUMA socket 0
EAL:   probe driver: 15b3:1016 net_mlx5
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=163456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Configuring Port 0 (socket 0)
Port 0: 42:E4:4B:F5:1E:9B
Checking link statuses...
Done


Additional info:

testpmd pod yaml with privileged=true:
apiVersion: v1
kind: Pod
metadata:
  name: testpod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: appcntr1
    image: quay.io/mmirecki/ds-testpmd
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
      privileged: true
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        hugepages-1Gi: 4Gi
        cpu: '6'
        memory: 1000Mi
      limits:
        hugepages-1Gi: 4Gi
        cpu: '6'
        memory: 1000Mi
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepage
      readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages


testpmd pod yaml without privileged=true:
apiVersion: v1
kind: Pod
metadata:
  name: testpod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: appcntr1
    image: quay.io/mmirecki/ds-testpmd
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        hugepages-1Gi: 4Gi
        cpu: '6'
        memory: 1000Mi
      limits:
        hugepages-1Gi: 4Gi
        cpu: '6'
        memory: 1000Mi
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepage
      readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
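
Both pod specs reference the secondary network through the k8s.v1.cni.cncf.io/networks: sriov-network annotation. A minimal SriovNetwork object of the kind that annotation assumes could look like the sketch below (name, namespace and IPAM config are placeholders; resourceName must match the SriovNetworkNodePolicy in use):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: dpdknic
  networkNamespace: default
  ipam: |
    { "type": "host-local", "subnet": "10.56.217.0/24" }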

Comment 1 Neil Horman 2019-12-23 14:06:41 UTC
I presume you are using openvswitch 2.9.0 here (which builds with dpdk 17.11).  If that's the case, it looks like upstream commit c2361bab70c56f64e50f07946b1b20bf688d782a may need to be backported, though I'm not sure it can be without additional backports.  It may be that openvswitch needs to update to a later copy of dpdk (19.11 if possible).

Comment 2 zenghui.shi 2019-12-24 02:01:08 UTC
Hi Neil, 

We are testing a Mellanox VF in bifurcated driver mode; the VF is attached directly to the container without going through openvswitch.

To add more details on the test setup:

1) This is using a Mellanox VF in bifurcated driver mode inside the container. VFs are provisioned on the host and attached to the pod container directly via the SR-IOV CNI (which moves the VF interface into the container namespace).
2) DPDK packages are installed inside the container image; we have tried downstream dpdk packages at versions 18.11.3 and 19.11, and both tests showed the same failure.
3) The Mellanox libibverbs package is installed inside the container image.
4) The vfio-pci module is loaded on the host and intel_iommu=on iommu=pt are enabled on the kernel cmdline.
5) The container is created with the 'IPC_LOCK' capability, NOT privileged=true (a few quick sanity checks are sketched after this list).
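
A few quick sanity checks for items 4) and 5) above (a sketch; the pod name and paths are examples only):

# confirm vfio-pci is loaded and the IOMMU kernel args took effect (host shell)
lsmod | grep vfio_pci
grep -o 'intel_iommu=on\|iommu=pt' /proc/cmdline

# confirm the pod really runs without privileged mode
oc get pod testpod -o jsonpath='{.spec.containers[0].securityContext}'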

What works:

Run testpmd inside container with virtual memory address mode:

# testpmd -l 22,24,26 -w 0000:19:01.0 --iova-mode=va -- -i

What fails:

# testpmd -l 22,24,26 -w 0000:19:01.0 (--iova-mode=pa) -- -i

Interesting findings:

In a replicated environment, if an Intel VF is bound to the vfio-pci driver with the commands below, then testpmd in the above container runs successfully with the default iova-mode=pa.

sh-4.4# echo 0000:87:0a.0 > /sys/bus/pci/drivers/iavf/unbind
sh-4.4# echo vfio-pci > /sys/bus/pci/devices/0000:87:0a.0/driver_override
sh-4.4# echo 0000:87:0a.0 > /sys/bus/pci/drivers/vfio-pci/bind            <===  with the Mellanox pod still running, re-running the testpmd cmd worked after this step.

Note: 0000:87:0a.0 is an Intel VF on the same host as the Mellanox card under test.
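
A quick way to confirm which driver a given VF is bound to before and after the override (host shell; the PCI address is the one from the commands above):

# prints the path of the currently bound driver, e.g. .../drivers/iavf or .../drivers/vfio-pci
readlink /sys/bus/pci/devices/0000:87:0a.0/driver
# shows the device, its kernel driver in use, and available modules
lspci -nnk -s 0000:87:0a.0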

Comment 3 Neil Horman 2019-12-24 11:36:22 UTC
19.11 should not have had the same failure; in fact it could not have, given that the first error message below:
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
doesn't exist in that version.  If you are using 19.11, please post the error log from that version of the library.

Comment 4 zenghui.shi 2019-12-24 13:21:24 UTC
Dpdk pkg version:

# rpm -qa | grep dpdk
dpdk-tools-19.11-1.el8.x86_64
dpdk-19.11-1.el8.x86_64


Testpmd log with iova-mode=pa: 

# testpmd -l 54,55,56 -w 0000:af:01.0 --iova-mode=pa -- -i
EAL: Detected 72 lcore(s)
EAL: Detected 2 NUMA nodes
net_mlx5: cannot load glue library: /lib64/libmlx5.so.1: version `MLX5_1.10' not found (required by /usr/lib64/dpdk-pmds-glue/librte_pmd_mlx5_glue.so.19.08.0)
net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: FATAL: Cannot use IOVA as 'PA' since physical addresses are not available
EAL: Cannot use IOVA as 'PA' since physical addresses are not available
EAL: Error - exiting with code: 1
  Cause: Cannot init EAL: Invalid argument

Testpmd log without specifying iova-mode:

# testpmd -l 54,55,56 -w 0000:af:01.0 -- -i
EAL: Detected 72 lcore(s)
EAL: Detected 2 NUMA nodes
net_mlx5: cannot load glue library: /lib64/libmlx5.so.1: version `MLX5_1.10' not found (required by /usr/lib64/dpdk-pmds-glue/librte_pmd_mlx5_glue.so.19.08.0)
net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL:   cannot open VFIO container, error 2 (No such file or directory)
EAL: VFIO support could not be initialized
testpmd: No probed ethernet devices
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=163456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
Done
testpmd> quit

Bye...

Comment 5 Neil Horman 2020-01-02 15:18:03 UTC
That's a very different error, one that requires the mlx5 rdma libraries to be installed independently.  Please install the libibverbs package and attempt to run the command again.

Comment 6 David Marchand 2020-01-03 09:42:57 UTC
(In reply to Sebastian Scheinkman from comment #0)
> testpmd -l <cpu-on-the-same-numa> -w <vf-pci-address> -- -i
> 
> Actual results:
> testpmd -l 22,24,26 -w 0000:19:01.2 -- -i
> EAL: Detected 80 lcore(s)
> EAL: Detected 2 NUMA nodes
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> EAL: Probing VFIO support...
> EAL: Cannot obtain physical addresses: No such file or directory. Only vfio
> will function.
> error allocating rte services array
> EAL: FATAL: rte_service_init() failed
> EAL: rte_service_init() failed
> PANIC in main():
> Cannot init EAL
> 5: [testpmd() [0x42e657]]
> 4: [/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f044e1363d5]]
> 3: [testpmd(main+0xbd7) [0x42e5b7]]
> 2: [/lib64/librte_eal.so.9(__rte_panic+0xbd) [0x7f044f3396bd]]
> 1: [/lib64/librte_eal.so.9(rte_dump_stack+0x2d) [0x7f044f34503d]]
> Aborted (core dumped)

I would suspect a permission issue on the hugepages.
Could you try with debug logs?

testpmd -l 22,24,26 -w 0000:19:01.2 --log-level=*:debug -- -i

I expect something like:

EAL: Trying to obtain current memory policy.
EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: get_seg_fd(): open failed: Permission denied
EAL: Couldn't get fd on hugepage file
EAL: attempted to allocate 1 segments, but only 0 were allocated
EAL: Restoring previous memory policy: 0
error allocating rte services array
EAL: FATAL: rte_service_init() failed
EAL: rte_service_init() failed
PANIC in main():
Cannot init EAL
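
While gathering the debug logs, it may also be worth sanity-checking hugepage visibility and permissions from inside the pod, for example (a sketch; /mnt/huge matches the pod spec in the description, and /dev/hugepages may not exist in the container):

grep -i huge /proc/meminfo
mount | grep -i huge
ls -ld /mnt/huge /dev/hugepages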

Comment 7 David Marchand 2020-01-03 09:45:04 UTC
(In reply to Neil Horman from comment #1)
> I presume you are using openvswitch 2.9.0 here (which builds with dpdk
> 17.11).  If thats the case It looks like upstream commit
> c2361bab70c56f64e50f07946b1b20bf688d782a may need to be backported, though
> I'm not sure it can be without additional backports.  It may be openvswitch
> needs to update to a later copy of dpdk (19.11 if possible)

17.11 should not be affected.
18.11 would be (I did the backports in ovs 2.11 https://bugzilla.redhat.com/show_bug.cgi?id=1711739) but first, let's figure out what is wrong with the rte_service_init failure.

Comment 8 David Marchand 2020-01-03 14:50:00 UTC
OK, I thought again about the trace, and a permission error should have popped up there if that were the cause.

I reproduced Sebastian's issue.
The debug logs should look like:

EAL: Setting policy MPOL_PREFERRED for socket 0
EAL: alloc_seg(): can't get IOVA addr          <=========
EAL: Ask a virtual area of 0x40000000 bytes
EAL: Virtual area found at 0x140000000 (size = 0x40000000)
EAL: attempted to allocate 1 segments, but only 0 were allocated
EAL: Restoring previous memory policy: 0
error allocating rte services array
EAL: FATAL: rte_service_init() failed
EAL: rte_service_init() failed


It is indeed the IOVA backports that Neil pointed to that are missing from the dpdk 18.11 downstream package.

Prepared a scratch build for the 18.11 downstream package for rhel8: 
http://brew-task-repos.usersys.redhat.com/repos/scratch/dmarchan/dpdk/18.11.2/4.bz1785933.el8/x86_64/


Just to clarify on the dpdk packages:
- dpdk 17.11 should be fine,
- dpdk 18.11 will have the reported issue, which should be worked around by setting the --iova-mode=va option, and requires the backports,
- dpdk 19.11 should be fine,


For OVS packages:
- ovs 2.9 (dpdk 17.11) should be fine,
- ovs 2.11 and ovs 2.12 (dpdk 18.11) should be fine, since the fixes have been backported
- ovs master which will become 2.13 (dpdk 19.11) should be fine,

Comment 11 David Marchand 2020-01-06 14:00:05 UTC
(In reply to David Marchand from comment #8)
> Prepared a scratch build for the 18.11 downstream package for rhel8: 
> http://brew-task-repos.usersys.redhat.com/repos/scratch/dmarchan/dpdk/18.11.2/4.bz1785933.el8/x86_64/
> 

Sebastian, can you have a try with this scratch build?
Thanks.

Comment 13 zenghui.shi 2020-01-09 09:00:50 UTC
(In reply to David Marchand from comment #11)
> (In reply to David Marchand from comment #8)
> > Prepared a scratch build for the 18.11 downstream package for rhel8: 
> > http://brew-task-repos.usersys.redhat.com/repos/scratch/dmarchan/dpdk/18.11.2/4.bz1785933.el8/x86_64/
> > 
> 
> Sebastian, can you have a try with this scratch build?
> Thanks.

I ran the test with this scratch build; it worked without specifying the iova mode.
From the log messages, it detects the iova mode automatically.

[root@testpod1 /]# testpmd -l 18,19 -w 0000:af:00.6 -- -i
EAL: Detected 72 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: Probing VFIO support...
EAL: PCI device 0000:af:00.6 on NUMA socket 1
EAL:   probe driver: 15b3:1016 net_mlx5
net_mlx5: flow rules relying on switch offloads will not be supported: netlink: failed to remove ingress qdisc: Operation not permitted
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_1>: n=155456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Configuring Port 0 (socket 1)
Port 0: CA:FE:C0:FF:EE:01
Checking link statuses...
Done
testpmd> show port info all

********************* Infos for port 0  *********************
MAC address: CA:FE:C0:FF:EE:01
Device name: 0000:af:00.6
Driver name: net_mlx5
Devargs: 
Connect to socket: 1
memory allocation on the socket: 1
Link status: up
Link speed: 25000 Mbps
Link duplex: full-duplex
MTU: 1500
Promiscuous mode: enabled
Allmulticast mode: disabled
Maximum number of MAC addresses: 128
Maximum number of MAC addresses of hash filtering: 0
VLAN offload: 
  strip off 
  filter off 
  qinq(extend) off 
Hash key size in bytes: 40
Redirection table size: 1
Supported RSS offload flow types:
  ipv4
  ipv4-frag
  ipv4-tcp
  ipv4-udp
  ipv4-other
  ipv6
  ipv6-frag
  ipv6-tcp
  ipv6-udp
  ipv6-other
  user defined 15
  user defined 16
  user defined 17
Minimum size of RX buffer: 32
Maximum configurable length of RX packet: 65536
Current number of RX queues: 1
Max possible RX queues: 65535
Max possible number of RXDs per queue: 65535
Min possible number of RXDs per queue: 0
RXDs number alignment: 1
Current number of TX queues: 1
Max possible TX queues: 65535
Max possible number of TXDs per queue: 65535
Min possible number of TXDs per queue: 0
TXDs number alignment: 1
Switch name: 0000:af:00.6
Switch domain Id: 1
Switch Port Id: 65535
testpmd> quit

Stopping port 0...
Stopping ports...
Done

Shutting down port 0...
Closing ports...
Done

Bye...
[root@testpod1 /]# rpm -qa | grep -i dpdk
dpdk-tools-18.11.2-4.bz1785933.el8.x86_64
dpdk-18.11.2-4.bz1785933.el8.x86_64
dpdk-devel-18.11.2-4.bz1785933.el8.x86_64

Comment 15 David Marchand 2020-01-09 09:15:38 UTC
Thanks for the test.

Just want to be sure.
You did this test as the root user inside an unprivileged container. Is that correct?

Comment 16 zenghui.shi 2020-01-09 09:42:07 UTC
(In reply to David Marchand from comment #15)
> Thanks for the test.
> 
> Just want to be sure.
> You did this test as the root user inside an unprivileged container. Is that
> correct?

Yes, the root user inside a container with the IPC_LOCK capability assigned.

User:

# oc exec -it testpod1 bash
[root@testpod1 /]# whoami
root


Container spec:

spec:
  containers:
  - name: appcntr1
    image: quay.io/zshi/ubi8-dpdk
    imagePullPolicy: IfNotPresent
    securityContext:
     capabilities:
       add: ["IPC_LOCK"]
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        hugepages-1Gi: 4Gi
        cpu: '4'
        memory: 1000Mi
      limits:
        hugepages-1Gi: 4Gi
        cpu: '4'
        memory: 1000Mi

Comment 18 David Marchand 2020-01-10 16:53:48 UTC
(In reply to zenghui.shi from comment #4)
> Dpdk pkg version:
> 
> # rpm -qa | grep dpdk
> dpdk-tools-19.11-1.el8.x86_64
> dpdk-19.11-1.el8.x86_64
> 
> 
> Testpmd log with iova-mode=pa: 
> 
> # testpmd -l 54,55,56 -w 0000:af:01.0 --iova-mode=pa -- -i
> EAL: Detected 72 lcore(s)
> EAL: Detected 2 NUMA nodes
> net_mlx5: cannot load glue library: /lib64/libmlx5.so.1: version `MLX5_1.10'
> not found (required by
> /usr/lib64/dpdk-pmds-glue/librte_pmd_mlx5_glue.so.19.08.0)
> net_mlx5: cannot initialize PMD due to missing run-time dependency on
> rdma-core libraries (libibverbs, libmlx5)

Zenghui, this is not related to the IOVA issue, but coming back to this warning:

This warning is because you installed the 19.11 dpdk package on an 8.1 system.
The 19.11 package was built on RHEL 8.2 and needs newer rdma-core/libibverbs packages than the ones in 8.1.
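
A quick way to confirm the rdma-core/libibverbs mismatch on such a system (a sketch):

cat /etc/redhat-release
rpm -q rdma-core libibverbs
rpm -q dpdk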

Comment 19 Sebastian Scheinkman 2020-01-12 14:55:09 UTC
(In reply to David Marchand from comment #11)
> (In reply to David Marchand from comment #8)
> > Prepared a scratch build for the 18.11 downstream package for rhel8: 
> > http://brew-task-repos.usersys.redhat.com/repos/scratch/dmarchan/dpdk/18.11.2/4.bz1785933.el8/x86_64/
> > 
> 
> Sebastian, can you have a try with this scratch build?
> Thanks.


Hi David,

I tried this and it works.

Thanks!

Comment 20 Jens Freimann 2020-01-13 12:54:51 UTC
*** Bug 1783763 has been marked as a duplicate of this bug. ***