Bug 1654174

Summary: [sriov-cni] Non-existent VF is assigned to a pod if the SR-IOV feature is disabled on the PF while the sriovdp is running
Product: OpenShift Container Platform
Component: Networking
Sub Component: openshift-sdn
Reporter: Meng Bo <bmeng>
Assignee: zenghui.shi <zshi>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: low
Priority: medium
CC: aos-bugs, bbennett, cdc, fpan, zshi
Version: 4.1.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-10-16 06:27:40 UTC

Description Meng Bo 2018-11-28 07:57:34 UTC
Description of problem:
If the VFs are removed from the PF after the sriovdp is already running, the sriovdp does not pick up the VF deletion and still tries to assign the removed PCI device ID to a pod that requests an SR-IOV device.

Version-Release number of selected component (if applicable):
v4.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster with Multus and SR-IOV enabled

2. Set the number of VFs to 2 on the node
# echo 2 > /sys/class/net/eno1/device/sriov_numvfs

3. Deploy the sriovdp and make sure the VFs on the node are discovered
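The discovery can be confirmed from the node's allocatable resources, for example (node name taken from the events below; output illustrative):
# oc describe node nfvpe-node | grep intel.com/sriov
  intel.com/sriov:  2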

4. Remove the VFs on the node
# echo 0 > /sys/class/net/eno1/device/sriov_numvfs

5. Try to create a pod that requests the SR-IOV resource (a sketch of the referenced sriov-network attachment follows the pod spec)
apiVersion: v1
kind: Pod
metadata:
  generateName: testpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        intel.com/sriov: 1
      limits:
        intel.com/sriov: 1
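
For reference, the sriov-network annotation above points to a NetworkAttachmentDefinition roughly of the following shape (a sketch; the CNI config and IPAM values are assumptions, not the exact definition used in this test):
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-network
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "10.56.217.0/24"
      }
    }'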

6. Check the pod status
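For example (pod name as generated in the events below):
# oc get pod -l env=test
# oc describe pod testpod-6blf8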



Actual results:
Pod creation fails because the assigned PCI device no longer exists.
Events:
  Type     Reason                  Age              From                                          Message
  ----     ------                  ----             ----                                          -------
  Normal   Scheduled               1m               default-scheduler                             Successfully assigned default/testpod-6blf8 to nfvpe-node
  Warning  FailedCreatePodSandBox  1m               kubelet, nfvpe-node  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "52c481ea43fafa247704781a71853a46de98e760a440cc21c234ad40ce1fb227" network for pod "testpod-6blf8": NetworkPlugin cni failed to set up pod "testpod-6blf8_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: lstat /sys/bus/pci/devices/0000:3d:02.0/physfn/net: no such file or directory
  Warning  FailedCreatePodSandBox  1m               kubelet, nfvpe-node  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dd9017e38ab3ff33b1e2bd1c7c1d7a60fb8f5bb7d821dd0ded6408c024084d35" network for pod "testpod-6blf8": NetworkPlugin cni failed to set up pod "testpod-6blf8_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: lstat /sys/bus/pci/devices/0000:3d:02.0/physfn/net: no such file or directory


Expected results:
The removed PCI device ID should not be assigned to the pod.
The pod should fail with an error such as "Insufficient resource".

Additional info:
The device ID 3d:02.0 was recorded the first time the sriovdp discovered the system, and it is not removed after SR-IOV is disabled on the PF.
# cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint | python -mjson.tool
{
    "Checksum": 828840705,
    "Data": {
        "PodDeviceEntries": [
            {
                "AllocResp": "CiIKEVNSSU9WLVZGLVBDSS1BRERSEg0wMDAwOjNkOjAyLjAs",
                "ContainerName": "test-pod",
                "DeviceIDs": [
                    "0000:3d:02.0"
                ],
                "PodUID": "0443ffd5-f2e1-11e8-9c8e-0242ef7e06e8",
                "ResourceName": "intel.com/sriov"
            }
        ],
        "RegisteredDevices": {
            "intel.com/sriov": [
                "0000:3d:02.0",
                "0000:3d:02.1"
            ]
        }
    }
}
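
The AllocResp value is a base64-encoded allocation response; decoding it shows the stale PCI address that gets handed to the pod (illustrative):
# echo 'CiIKEVNSSU9WLVZGLVBDSS1BRERSEg0wMDAwOjNkOjAyLjAs' | base64 -d | strings
SRIOV-VF-PCI-ADDR
0000:3d:02.0,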

The full device list on the node:
# ls /sys/class/net/ -l
total 0
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 br0 -> ../../devices/virtual/net/br0
lrwxrwxrwx. 1 root root 0 Nov  5 02:04 docker0 -> ../../devices/virtual/net/docker0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 eno1 -> ../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/0000:3c:03.0/0000:3d:00.0/net/eno1
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 eno2 -> ../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/0000:3c:03.0/0000:3d:00.1/net/eno2
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp134s0f0 -> ../../devices/pci0000:85/0000:85:00.0/0000:86:00.0/net/enp134s0f0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp134s0f1 -> ../../devices/pci0000:85/0000:85:00.0/0000:86:00.1/net/enp134s0f1
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp24s0f0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/enp24s0f0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp24s0f1 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/enp24s0f1
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp26s0f0 -> ../../devices/pci0000:17/0000:17:02.0/0000:1a:00.0/net/enp26s0f0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp26s0f1 -> ../../devices/pci0000:17/0000:17:02.0/0000:1a:00.1/net/enp26s0f1
lrwxrwxrwx. 1 root root 0 Nov  4 21:17 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 ovs-system -> ../../devices/virtual/net/ovs-system
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 tun0 -> ../../devices/virtual/net/tun0
lrwxrwxrwx. 1 root root 0 Nov 28 02:56 veth75415c65 -> ../../devices/virtual/net/veth75415c65
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 vxlan_sys_4789 -> ../../devices/virtual/net/vxlan_sys_4789

Comment 1 zenghui.shi 2018-11-28 09:58:53 UTC
Thanks for testing and reporting the bug!

I think this is a valid bug in the sriov device plugin: it doesn't probe the actual VF state to update device health, instead it probes the PF operstate and reports that back to kubelet as the VF health state (illustrated after the list below). To solve this, we would need to auto-discover device changes (number of devices, newly added/deleted devices, etc.) when probing device state periodically and report the newly discovered devices to kubelet. But this also has some potential problems:

1) At least with Intel NICs, changing the number of VFs requires setting it to 0 first, which may interrupt running workloads in containers that already have a device allocated. I'm not sure it is a normal action for a cluster admin to re-create those VFs without restarting those workloads. Also, there is currently no garbage collection for devices in this situation, which means those devices will still be considered allocated (a device is only released by kubelet when the pod is in a terminated state).

2) When a VF is allocated to a pod, it is likely moved into the pod's network namespace, which the sriov device plugin is not in. This may make it hard for the device plugin to access the state of the actual VF device.
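
To illustrate the gap: after the VFs are removed, the PF operstate that the plugin probes still reports up, while the virtfn links representing the actual VFs are gone (paths from the reproduction above; output illustrative):
# cat /sys/class/net/eno1/operstate
up
# ls -d /sys/class/net/eno1/device/virtfn*
ls: cannot access /sys/class/net/eno1/device/virtfn*: No such file or directory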

We will talk with Intel and see how we can address this issue.

Comment 2 zenghui.shi 2018-11-29 04:28:42 UTC
update:

We had some internal discussion about how we expect this to work in ocp-4.0: we assume the number of VFs will not be changed on the fly after the sriov device plugin is launched. This excludes cases such as:
1) the number of VFs is changed to 0 and then to a new desired number
2) a new PF device is added or deleted on the host after the sriov device plugin is launched

However, we are exploring and planning to support changing the VF configuration as part of the machine config process, which happens before the sriov device plugin is launched. For example, a network admin can specify the number of VFs to create for each PF on the host via machine configuration; the MCD (machine config daemon) running on each host applies the VF configuration and reboots the node whenever the machine configuration changes. This lets the sriov device plugin restart and re-discover all the changed devices, rather than having to track changes on the fly.

Comment 7 zenghui.shi 2019-10-09 14:28:24 UTC
With the SR-IOV Operator introduced in 4.2, all configuration and provisioning of SR-IOV devices is done via the SR-IOV Operator.
The SR-IOV device plugin daemon is restarted by the Operator whenever the number of SR-IOV devices changes.

Comment 8 zhaozhanqi 2019-10-10 06:41:41 UTC
Verified this bug on quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.2.0-201910070933-ose-sriov-network-operator

When the SriovNetworkNodePolicy is created with numVfs: 4, the pod can request the specified VF (a sketch of such a policy follows).
When the policy is deleted and a pod still requests the resource, the pod stays Pending.
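
For reference, such a policy is roughly of the following shape (a sketch; resourceName, nicSelector, and the other values are assumptions, not the exact policy used in verification):
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-intel
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  numVfs: 4
  nicSelector:
    pfNames:
    - eno1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  deviceType: netdevice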

Comment 10 errata-xmlrpc 2019-10-16 06:27:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922