Description of problem:
If the VFs are removed from the PF after sriovdp is already running, sriovdp does not pick up the VF deletion and tries to assign the removed PCI device ID to a pod that requests SR-IOV.

Version-Release number of selected component (if applicable):
v4.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster with multus and sriov enabled.
2. Set the VF count to 2 on the node:
# echo 2 > /sys/class/net/eno1/device/sriov_numvfs
3. Deploy sriovdp and make sure the VFs on the node are discovered.
4. Remove the VFs on the node:
# echo 0 > /sys/class/net/eno1/device/sriov_numvfs
5. Try to create a pod that requests the sriov resource:

apiVersion: v1
kind: Pod
metadata:
  generateName: testpod-
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: test-pod
    image: bmeng/centos-network
    resources:
      requests:
        intel.com/sriov: 1
      limits:
        intel.com/sriov: 1

6. Check the pod status.

Actual results:
Pod creation fails because the assigned PCI device does not exist.

Events:
  Type     Reason                  Age  From                 Message
  ----     ------                  ---  ----                 -------
  Normal   Scheduled               1m   default-scheduler    Successfully assigned default/testpod-6blf8 to nfvpe-node
  Warning  FailedCreatePodSandBox  1m   kubelet, nfvpe-node  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "52c481ea43fafa247704781a71853a46de98e760a440cc21c234ad40ce1fb227" network for pod "testpod-6blf8": NetworkPlugin cni failed to set up pod "testpod-6blf8_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: lstat /sys/bus/pci/devices/0000:3d:02.0/physfn/net: no such file or directory
  Warning  FailedCreatePodSandBox  1m   kubelet, nfvpe-node  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "dd9017e38ab3ff33b1e2bd1c7c1d7a60fb8f5bb7d821dd0ded6408c024084d35" network for pod "testpod-6blf8": NetworkPlugin cni failed to set up pod "testpod-6blf8_default" network: Multus: Err in tearing down failed plugins: Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: lstat /sys/bus/pci/devices/0000:3d:02.0/physfn/net: no such file or directory

Expected results:
sriovdp should not try to assign the removed PCI device ID to the pod. The pod should fail with an error like "Insufficient resource".

Additional info:
The device ID 3d:02.0 was recorded the first time sriovdp discovered the system, and it is not removed after SR-IOV is disabled on the PF:

# cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint | python -mjson.tool
{
    "Checksum": 828840705,
    "Data": {
        "PodDeviceEntries": [
            {
                "AllocResp": "CiIKEVNSSU9WLVZGLVBDSS1BRERSEg0wMDAwOjNkOjAyLjAs",
                "ContainerName": "test-pod",
                "DeviceIDs": [
                    "0000:3d:02.0"
                ],
                "PodUID": "0443ffd5-f2e1-11e8-9c8e-0242ef7e06e8",
                "ResourceName": "intel.com/sriov"
            }
        ],
        "RegisteredDevices": {
            "intel.com/sriov": [
                "0000:3d:02.0",
                "0000:3d:02.1"
            ]
        }
    }
}

The full device list on the node:

# ls /sys/class/net/ -l
total 0
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 br0 -> ../../devices/virtual/net/br0
lrwxrwxrwx. 1 root root 0 Nov  5 02:04 docker0 -> ../../devices/virtual/net/docker0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 eno1 -> ../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/0000:3c:03.0/0000:3d:00.0/net/eno1
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 eno2 -> ../../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/0000:3c:03.0/0000:3d:00.1/net/eno2
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp134s0f0 -> ../../devices/pci0000:85/0000:85:00.0/0000:86:00.0/net/enp134s0f0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp134s0f1 -> ../../devices/pci0000:85/0000:85:00.0/0000:86:00.1/net/enp134s0f1
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp24s0f0 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.0/net/enp24s0f0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp24s0f1 -> ../../devices/pci0000:17/0000:17:00.0/0000:18:00.1/net/enp24s0f1
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp26s0f0 -> ../../devices/pci0000:17/0000:17:02.0/0000:1a:00.0/net/enp26s0f0
lrwxrwxrwx. 1 root root 0 Nov 13 04:22 enp26s0f1 -> ../../devices/pci0000:17/0000:17:02.0/0000:1a:00.1/net/enp26s0f1
lrwxrwxrwx. 1 root root 0 Nov  4 21:17 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 ovs-system -> ../../devices/virtual/net/ovs-system
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 tun0 -> ../../devices/virtual/net/tun0
lrwxrwxrwx. 1 root root 0 Nov 28 02:56 veth75415c65 -> ../../devices/virtual/net/veth75415c65
lrwxrwxrwx. 1 root root 0 Nov  5 06:13 vxlan_sys_4789 -> ../../devices/virtual/net/vxlan_sys_4789
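For reference, the stale registration can be spotted by checking each device ID in the checkpoint's RegisteredDevices list against sysfs (a minimal shell sketch, assuming python is available and the checkpoint path and resource name shown above):

# cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint \
  | python -c 'import json,sys; print("\n".join(json.load(sys.stdin)["Data"]["RegisteredDevices"]["intel.com/sriov"]))' \
  | while read dev; do
      # /sys/bus/pci/devices/<addr> disappears when the VF is removed from the PF
      [ -e "/sys/bus/pci/devices/$dev" ] && echo "$dev: present" || echo "$dev: missing from sysfs (stale)"
    done

After step 4 above, 0000:3d:02.0 and 0000:3d:02.1 show as missing from sysfs while still being registered with kubelet.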
Thanks for testing and reporting the bug! I think this is a valid bug in the SR-IOV device plugin: it does not probe the actual VF state to update the device health state; instead it probes the PF operstate and reports that back to kubelet as the VF health state. To solve this, we would need to automatically discover device changes (number of devices, newly added/deleted devices, etc.) while probing device state periodically, and report the newly discovered devices to kubelet (a sketch of the sysfs-level check follows this comment). But this also has some potential problems:

1) At least with Intel NICs, changing the VF count requires setting it to 0 first, which may interrupt running workloads in containers that already have a device allocated. I'm not sure it is a normal action for a cluster admin to re-create those VFs without restarting those workloads. Also, there is currently no garbage collection for devices in this circumstance, which means those devices will still be considered allocated (a device is only released by kubelet when the pod is in a terminated state).

2) When a VF is allocated to a pod, it is likely moved into the pod's network namespace, which the SR-IOV device plugin is not in. This may prevent the device plugin from accessing the state of the actual VF device.

Will talk with Intel and see how we can address this issue.
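To illustrate what probing the actual VF state means at the sysfs level: the PF exposes a virtfn<N> link per VF, and both that link and the VF's own PCI directory disappear when sriov_numvfs is set to 0. A periodic health probe would need to check something like the following for every registered device, instead of the PF operstate (a conceptual shell sketch only, not the device plugin's actual code; the plugin itself is written in Go):

# check whether a registered VF PCI address still exists (hypothetical probe)
vf=0000:3d:02.0
if [ -e "/sys/bus/pci/devices/$vf" ] && [ -e "/sys/bus/pci/devices/$vf/physfn" ]; then
    echo "$vf healthy"
else
    echo "$vf unhealthy - report it as such in the next ListAndWatch update"
fi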
Update: we had some internal discussion about how we expect this to work in OCP 4.0. We assume the number of VFs will not be changed on the fly after the SR-IOV device plugin is launched. This excludes cases such as:

1) The VF count is changed to 0 and then to a new desired count.
2) A new PF device is added to or removed from the host after the SR-IOV device plugin is launched.

However, we are exploring and planning to support changing the VF configuration as part of the machine config process, which happens before the SR-IOV device plugin is launched. For example, a network admin can specify the number of VFs to create for each PF on the host via a machine configuration; the MCD (machine config daemon) running on each host applies the VF configuration and reboots the node whenever the machine configuration changes. This restarts the SR-IOV device plugin and lets it re-discover all the changed devices, without having to re-discover devices on the fly.
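To make the machine-config idea concrete, the VF configuration would be applied once per boot, before kubelet and the device plugin come up. A hypothetical sketch of what such a configuration might lay down (the unit name and mechanism here are illustrative, not the actual design):

# cat /etc/systemd/system/sriov-numvfs.service   (hypothetical unit)
[Unit]
Description=Create SR-IOV VFs before kubelet starts
Before=kubelet.service

[Service]
Type=oneshot
# same sysfs knob as in the reproduction steps, applied on every boot
ExecStart=/bin/sh -c 'echo 2 > /sys/class/net/eno1/device/sriov_numvfs'

[Install]
WantedBy=multi-user.target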
With the SR-IOV Operator introduced in 4.2, all configuration and provisioning of SR-IOV devices shall be done via the SR-IOV Operator. The SR-IOV device plugin daemon will be restarted by the Operator once the number of SR-IOV devices is changed.
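For reference, with the operator the desired VF count is expressed declaratively in a SriovNetworkNodePolicy instead of being echoed into sysfs by hand; a sketch (names and selector values here are illustrative):

# oc create -f - <<EOF
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-sriov
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriov
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    pfNames:
    - eno1
  deviceType: netdevice
EOF

Changing numVfs (or deleting the policy) triggers the operator to reconfigure the VFs and restart the device plugin daemon, which is what makes the re-discovery work.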
Verified this bug on quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.2.0-201910070933-ose-sriov-network-operator.

When a SriovNetworkNodePolicy is created with numVfs: 4, a pod can request the specified VF. When the policy is deleted and a pod is created that still requests the resource, the pod stays Pending.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922