Description of problem:
Pods using secondary interfaces do not start.

Version-Release number of selected component (if applicable):
oc version
Client Version: 4.8.0-0.nightly-2021-06-07-023220
Server Version: 4.8.0-0.nightly-2021-06-07-023220
Kubernetes Version: v1.21.0-rc.0+2dfc46b

How reproducible:

Steps to Reproduce:
1. Run a pod configured with an SR-IOV NIC
2. oc describe pod

Actual results:
The pod remains in ContainerCreating status.

Expected results:
The pod starts.

Additional info:
This is where Multus is failing:
https://github.com/openshift/multus-cni/blob/118cc629cfa0eecf3fb85e683d82758213ac8c92/pkg/checkpoint/checkpoint.go#L81

I believe this is due to a change in the format of /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint. Its content is:

sh-4.4# more kubelet_internal_checkpoint
{"Data":{"PodDeviceEntries":[{"PodUID":"acb8c0e6-9243-494d-b1d3-bb78d175f449","ContainerName":"testsctp-server","ResourceName":"openshift.io/sctptestres","DeviceIDs":{"0":["0000:19:00.6"]},"AllocResp":"CjIKIlBDSURFVklDRV9PUEVOU0hJRlRfSU9fU0NUUFRFU1RSRVMSDDAwMDA6MTk6MDAuNg=="}],"RegisteredDevices":{"openshift.io/sctptestres":["0000:19:00.3","0000:19:00.4","0000:19:00.5","0000:19:00.6","0000:19:00.2"]}},"Checksum":3803801153}

Here DeviceIDs is an object keyed by NUMA node, whereas the Go struct expects DeviceIDs to be a []string.
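For illustration only, here is a minimal standalone sketch of why the unmarshal fails. The struct and variable names (podDevicesEntryOld, podDevicesEntryNew) are hypothetical and not the actual multus-cni types; it just contrasts the flat []string layout the failing code expects with the per-NUMA map layout the 1.21 kubelet writes.

package main

import (
	"encoding/json"
	"fmt"
)

// Old layout the failing checkpoint code expects: DeviceIDs is a flat list.
type podDevicesEntryOld struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     []string
}

// Layout written by the 1.21 kubelet: DeviceIDs is keyed by NUMA node.
type podDevicesEntryNew struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     map[int64][]string
}

func main() {
	// Trimmed-down excerpt of the entry seen in kubelet_internal_checkpoint above.
	data := []byte(`{"PodUID":"acb8c0e6-9243-494d-b1d3-bb78d175f449","ContainerName":"testsctp-server","ResourceName":"openshift.io/sctptestres","DeviceIDs":{"0":["0000:19:00.6"]}}`)

	var old podDevicesEntryOld
	// Fails with "cannot unmarshal object into Go struct field ... of type []string".
	fmt.Println("old struct:", json.Unmarshal(data, &old))

	var cur podDevicesEntryNew
	// Succeeds once DeviceIDs is modeled as a map of NUMA node to device IDs.
	fmt.Println("new struct:", json.Unmarshal(data, &cur), cur.DeviceIDs)
}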
This sure smells a lot like this upstream issue regarding the checkpoint file in k8s 1.21+: https://github.com/k8snetworkplumbingwg/multus-cni/issues/665 This k8s version may have just hit the nightlies.
Turns out this was caused by a regression after other changes were merged for Multus; Federico pointed out the offending commit. Big thanks to Federico and Peng for identifying it, and to Peng for posting the fix. Upstream PR posted at https://github.com/k8snetworkplumbingwg/multus-cni/pull/683/files. Tomo is working to get it merged upstream, and we'll backport it downstream.
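For reference, a minimal sketch of the general approach such a fix takes (hypothetical names, not the exact upstream code): accept the per-NUMA map from the new checkpoint format and flatten it back into the flat device-ID list the rest of the code consumes.

package main

import "fmt"

// devicesPerNUMA mirrors the 1.21 kubelet checkpoint layout: NUMA node -> device IDs.
type devicesPerNUMA map[int64][]string

// devices flattens the per-NUMA map into a single list of device IDs.
func (d devicesPerNUMA) devices() []string {
	var ids []string
	for _, numaDevs := range d {
		ids = append(ids, numaDevs...)
	}
	return ids
}

func main() {
	entry := devicesPerNUMA{0: {"0000:19:00.6"}}
	fmt.Println(entry.devices()) // [0000:19:00.6]
}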
Verified this bug on 4.8.0-0.nightly-2021-06-10-014052

# oc exec -n l044j testpod1sgzn2 -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
    link/ether 0a:58:0a:83:00:10 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.16/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd01:0:0:1::10/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:10/64 scope link
       valid_lft forever preferred_lft forever
29: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ca:fe:c0:ff:ee:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.206/24 brd 192.168.2.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 2001::2/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::c8fe:c0ff:feff:ee01/64 scope link
       valid_lft forever preferred_lft forever
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438