Bug 1968625

Summary: Pods using SR-IOV interfaces failing to start with "Failed to create pod sandbox"
Product: OpenShift Container Platform Reporter: Federico Paolinelli <fpaoline>
Component: NetworkingAssignee: Douglas Smith <dosmith>
Networking sub component: multus QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact: jfrye
Severity: urgent    
Priority: urgent CC: dosmith, jfrye, yjoseph, ykashtan
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
RN content: Previously, refactoring for a shadowed variable caused a regression related to the use of the checkpoint file, and SR-IOV pod sandboxes would not start up. A check for the path of the kubelet socket was not properly accounted for during the refactor. The fix properly restores the check for the kubelet socket path, and now the SR-IOV pod sandboxes are properly created.
------
Cause: Refactoring for a shadowed variable caused a regression related to the use of the checkpoint file.
Consequence: SR-IOV pod sandboxes would not come up.
Fix: Posted upstream @ https://github.com/k8snetworkplumbingwg/multus-cni/pull/683/files
Result: SR-IOV pod sandboxes are properly created.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:11:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Federico Paolinelli 2021-06-07 17:07:41 UTC
Description of problem:

Pods using secondary interfaces do not start

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.8.0-0.nightly-2021-06-07-023220
Server Version: 4.8.0-0.nightly-2021-06-07-023220
Kubernetes Version: v1.21.0-rc.0+2dfc46b


How reproducible:


Steps to Reproduce:
1. Run a pod configured with a SR-IOV nic
2. oc describe pod
3.
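A minimal sketch of step 1, assuming the SriovNetwork attachment has already been set up by the SR-IOV operator; the network name `sriov-net`, the pod name, and the image are hypothetical placeholders, not taken from this report:

```shell
# Create a pod that requests a secondary SR-IOV interface via a Multus
# network annotation (the network name "sriov-net" is hypothetical).
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: sriov-test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net
spec:
  containers:
  - name: test
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
EOF

# With this bug, the pod never leaves ContainerCreating; the events show
# "Failed to create pod sandbox".
oc describe pod sriov-test
```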

Actual results:

The pod remains in ContainerCreating status


Expected results:

The pod starts

Additional info:

This is where Multus is failing: https://github.com/openshift/multus-cni/blob/118cc629cfa0eecf3fb85e683d82758213ac8c92/pkg/checkpoint/checkpoint.go#L81

I believe that is due to a change in the format of /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.

The content is:

sh-4.4# more kubelet_internal_checkpoint 
{"Data":{"PodDeviceEntries":[{"PodUID":"acb8c0e6-9243-494d-b1d3-bb78d175f449","ContainerName":"testsctp-server","ResourceName":"openshift.io/sctptestres","DeviceIDs":{"0":["0000:19:00.6"]},"AllocResp":"CjIKIlBDSURFVklDRV9PUEVOU0hJRlRfSU9fU0NUUFRFU1RSRVMSDDAwMDA6MTk6MDAuNg=="}],"RegisteredDevices":{"openshift.io/sctptestres":["0000:19:00.3","0000:19:00.4","0000:19:00.5","0000:19:00.6","0000:19:00.2"]}},"Checksum":3803801153}

whereas the Go struct expects DeviceIDs to be a []string.

Comment 1 Douglas Smith 2021-06-07 17:10:23 UTC
This sure smells a lot like this upstream issue regarding the checkpoint file in k8s 1.21+: https://github.com/k8snetworkplumbingwg/multus-cni/issues/665

This k8s version may have just hit the nightlies.

Comment 2 Douglas Smith 2021-06-08 14:38:41 UTC
It turns out this was caused by a regression after other changes were merged for Multus; Federico pointed out this commit in particular:

Big thanks to Federico and Peng for identifying it, and Peng for posting the fix.

upstream PR posted @ https://github.com/k8snetworkplumbingwg/multus-cni/pull/683/files

Tomo is working to get it merged upstream, and we'll backport it downstream.

Comment 4 zhaozhanqi 2021-06-10 06:24:04 UTC
Verified this bug on 4.8.0-0.nightly-2021-06-10-014052



# oc exec -n l044j testpod1sgzn2 -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:83:00:10 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.16/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd01:0:0:1::10/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:10/64 scope link 
       valid_lft forever preferred_lft forever
29: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ca:fe:c0:ff:ee:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.206/24 brd 192.168.2.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 2001::2/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::c8fe:c0ff:feff:ee01/64 scope link 
       valid_lft forever preferred_lft forever

Comment 7 errata-xmlrpc 2021-07-27 23:11:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438