Bug 1654142 - [sriov-cni] Cannot allocate more than 1 vf to pod when the pod requested it via resource limit
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: zenghui.shi
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-28 06:51 UTC by Meng Bo
Modified: 2019-10-16 06:27 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:27:40 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:27:56 UTC)

Description Meng Bo 2018-11-28 06:51:15 UTC
Description of problem:
Set up an OCP cluster with Multus enabled, then deploy the sriov-device-plugin and sriov-cni to the cluster.

Enable SR-IOV for the NIC on the node:
echo 6 > /sys/class/net/eno1/device/sriov_numvfs
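
A quick sanity check that the VFs were actually created (assuming eno1 is the PF, as above):

cat /sys/class/net/eno1/device/sriov_numvfs   # should print 6
ip link show eno1                             # the PF entry lists vf 0 through vf 5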

Create a pod with an SR-IOV interface that requests more than 1 VF. The pod gets only one interface, while the full requested number of VFs is consumed on the node.


Version-Release number of selected component (if applicable):
v4.0

How reproducible:
always

Steps to Reproduce:
1. Create a pod that requests the node's maximum number of VFs (the sriovdp log is included in Additional info below):
apiVersion: v1
kind: Pod
metadata:
  generateName: testpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        intel.com/sriov: 6
      limits:
        intel.com/sriov: 6

2. Check the interfaces in the first pod

3. Create one more pod that requests 1 VF:
apiVersion: v1
kind: Pod
metadata:
  generateName: testpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        intel.com/sriov: 1
      limits:
        intel.com/sriov: 1



Actual results:
2. Only one SR-IOV VF is allocated in the pod:
sh-4.2# ip -d link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
3: eth0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 0a:58:0a:80:00:08 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 
    veth addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
205: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 9e:b9:04:4c:9e:20 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535 

sh-4.2# ethtool -i net1
driver: i40evf
version: 3.2.2-k
firmware-version: N/A
expansion-rom-version: 
bus-info: 0000:3d:02.4
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

3. The 2nd pod cannot be created due to "Insufficient intel.com/sriov"

Expected results:
A pod that requests more than 1 VF via resource limits should have the requested number of VFs attached to it.

Additional info:
sriov-device-plugin log for the first pod creation:

I1128 06:37:07.577333   11279 sriov-device-plugin.go:279] SRIOV Network Device Plugin server started serving
I1128 06:37:07.579007   11279 sriov-device-plugin.go:290] SRIOV Network Device Plugin registered with the Kubelet
I1128 06:37:07.580036   11279 sriov-device-plugin.go:391] ListAndWatch: send initial devices &ListAndWatchResponse{Devices:[&Device{ID:0000:3d:02.1,Health:Healthy,} &Device{ID:0000:3d:02.2,Health:Healthy,} &Device{ID:0000:3d:02.3,Health:Healthy,} &Device{ID:0000:3d:02.4,Health:Healthy,} &Device{ID:0000:3d:02.5,Health:Healthy,} &Device{ID:0000:3d:02.0,Health:Healthy,}],}
I1128 06:38:18.966389   11279 sriov-device-plugin.go:442] DeviceID in Allocate: 0000:3d:02.3
I1128 06:38:18.966460   11279 sriov-device-plugin.go:442] DeviceID in Allocate: 0000:3d:02.4
I1128 06:38:18.966475   11279 sriov-device-plugin.go:442] DeviceID in Allocate: 0000:3d:02.5
I1128 06:38:18.966487   11279 sriov-device-plugin.go:442] DeviceID in Allocate: 0000:3d:02.0
I1128 06:38:18.966499   11279 sriov-device-plugin.go:442] DeviceID in Allocate: 0000:3d:02.1
I1128 06:38:18.966510   11279 sriov-device-plugin.go:442] DeviceID in Allocate: 0000:3d:02.2
I1128 06:38:18.966522   11279 sriov-device-plugin.go:456] PCI Addrs allocated: 0000:3d:02.3,0000:3d:02.4,0000:3d:02.5,0000:3d:02.0,0000:3d:02.1,0000:3d:02.2,

Comment 1 zenghui.shi 2018-11-28 09:26:47 UTC
Thanks for testing! 

Good coverage on multiple resource requests!

This might be the expected behavior when the pod has only one network custom resource specified in the pod spec annotation but multiple devices are requested. In other words, to configure networks on multiple devices, or to allocate multiple devices to one pod or container, we need to add as many network custom resources to the pod spec annotation field as there are requested devices (the number of network custom resources must equal the number of requested devices), separated by commas.

for example:

1) Pod spec requesting one device

apiVersion: v1
kind: Pod
metadata:
  generateName: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: test-pod
    image: <image>
    resources:
      requests:
        intel.com/sriov: 1
      limits:
        intel.com/sriov: 1

2) Pod spec requesting multiple devices


apiVersion: v1
kind: Pod
metadata:
  generateName: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network, sriov-network, sriov-network
spec:
  containers:
  - name: test-pod
    image: <image>
    resources:
      requests:
        intel.com/sriov: 3
      limits:
        intel.com/sriov: 3

The network resource names above can differ depending on which network each device is expected to connect to. For example, one can specify the annotation as: sriov-network-a, sriov-network-b, sriov-network-c, and the 3 devices requested in 2) will be connected to sriov-network-a/b/c respectively.
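
As a concrete sketch of that variant (assuming net-attach-defs with these names exist):

apiVersion: v1
kind: Pod
metadata:
  generateName: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network-a, sriov-network-b, sriov-network-c
spec:
  containers:
  - name: test-pod
    image: <image>
    resources:
      requests:
        intel.com/sriov: 3
      limits:
        intel.com/sriov: 3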

Comment 2 Meng Bo 2018-11-28 10:47:03 UTC
Yes, the behavior you describe in comment #1 is how it should work.

But it is still a problem when the number of annotations does not match spec.resources.requests.

For example, if I give only one annotation but set spec.resources.requests to 3, all 3 requested VFs are consumed, but only one is attached to the pod.

That means some of the VFs are wasted.

Comment 3 zenghui.shi 2018-11-29 04:41:57 UTC
(In reply to Meng Bo from comment #2)
> For example, if I give only one annotation but set spec.resources.requests
> to 3, all 3 requested VFs are consumed, but only one is attached to the pod.
> 
> That means some of the VFs are wasted.

Agreed; the user might be confused about where the unconfigured devices are when they do not show up in the container namespace.

To capture the thoughts we discussed in the nfvpe-container meeting on how we expect this to be solved:

1) Ideally, Multus could catch this potential configuration issue and print a warning message so the user can find out what is wrong with their configuration, while still configuring the device the way it does now.

2) Add an admission webhook that blocks pod creation whenever there is a mismatch between the number of network custom resources (those carrying a device-plugin resourceName annotation) and the number of requested devices.

Comment 4 zenghui.shi 2018-12-03 01:12:27 UTC
(In reply to zenghui.shi from comment #3)
> 2) Add an admission webhook that blocks pod creation whenever there is a
> mismatch between the number of network custom resources (those carrying a
> device-plugin resourceName annotation) and the number of requested devices.

Option 2) is the way to go: the admission controller mutates the resource limit/request fields in the pod spec based on the number of network custom resources (those carrying a device resourceName annotation) found in the pod annotation.
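
To illustrate the intended mutation (a hypothetical sketch, not the webhook's actual output): a pod submitted with three network annotations and no SR-IOV resources would get the matching resource fields injected:

# Pod spec fragment as submitted by the user:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network, sriov-network, sriov-network

# Fragment after the webhook mutation (3 annotations -> 3 devices):
spec:
  containers:
  - name: test-pod
    resources:
      requests:
        intel.com/sriov: 3
      limits:
        intel.com/sriov: 3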

Comment 7 zenghui.shi 2019-08-02 02:06:57 UTC
The SR-IOV admission controller is available for testing in 4.2 now.
It fills in the resource limits and requests according to the resourceName specified in the net-attach-def (see the sketch below).
There is no need for the user to specify resource limits and requests manually, which should resolve this mismatch problem.
Moving to ON_QA.
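
For reference, a minimal net-attach-def sketch carrying the resourceName annotation; the name and the abbreviated CNI config are illustrative, not taken from this cluster:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-network
  annotations:
    # The admission controller reads this annotation to know which
    # device-plugin resource (here intel.com/sriov) to inject into the pod spec.
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov" }'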

Comment 8 zhaozhanqi 2019-08-06 10:19:26 UTC
Verification of this bug was blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1732598

Comment 9 zhaozhanqi 2019-09-20 10:53:17 UTC
Verified this bug on 4.2.0-0.nightly-2019-09-19-153821.

When using the YAML below, the pod consumes 1 VF (the admission controller reconciles the request with the single network annotation):
apiVersion: v1
kind: Pod
metadata:
  generateName: testpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        intel.com/sriov: 6
      limits:
        intel.com/sriov: 6

Comment 11 errata-xmlrpc 2019-10-16 06:27:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

