Bug 1968625 - Pods using SR-IOV interfaces failing to start with "Failed to create pod sandbox"
Summary: Pods using SR-IOV interfaces failing to start with "Failed to create pod sandbox"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Douglas Smith
QA Contact: zhaozhanqi
Docs Contact: jfrye
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-07 17:07 UTC by Federico Paolinelli
Modified: 2021-07-27 23:12 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
RN content: Previously, refactoring for a shadowed variable caused a regression related to the use of the checkpoint file, and SR-IOV pod sandboxes would not start up. A check for the path of the kubelet socket was not properly accounted for during the refactor. The fix properly restores the check for the kubelet socket path, and now the SR-IOV pod sandboxes are properly created.
------
Cause: Refactoring for a shadowed variable caused a regression related to the use of the checkpoint file.
Consequence: SR-IOV pod sandboxes wouldn't come up.
Fix: Posted upstream @ https://github.com/k8snetworkplumbingwg/multus-cni/pull/683/files
Result: SR-IOV pod sandboxes are properly created.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:11:53 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift multus-cni pull 102 0 None closed Bug 1968625: Use the default socket path in GetResourceClient when unspecified 2021-06-09 14:09:25 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:12:11 UTC

Description Federico Paolinelli 2021-06-07 17:07:41 UTC
Description of problem:

Pods using secondary interfaces do not start

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.8.0-0.nightly-2021-06-07-023220
Server Version: 4.8.0-0.nightly-2021-06-07-023220
Kubernetes Version: v1.21.0-rc.0+2dfc46b


How reproducible:


Steps to Reproduce:
1. Run a pod configured with a SR-IOV nic
2. oc describe pod
3.

Actual results:

The pod remains in ContainerCreating status


Expected results:

The pod starts

Additional info:

This is where Multus is failing: https://github.com/openshift/multus-cni/blob/118cc629cfa0eecf3fb85e683d82758213ac8c92/pkg/checkpoint/checkpoint.go#L81

I believe that is due to a change in the format of /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.

The content is:

sh-4.4# more kubelet_internal_checkpoint 
{"Data":{"PodDeviceEntries":[{"PodUID":"acb8c0e6-9243-494d-b1d3-bb78d175f449","ContainerName":"testsctp-server","ResourceName":"openshift.io/sctptestres","DeviceIDs":{"0":["0000:19:00.6"]},"AllocResp":"CjIKIlBDSURFVklDRV9PUEVOU0hJRlRfSU9fU0NUUFRFU1RSRVMSDDAwMDA6MTk6MDAuNg=="}],"RegisteredDevices":{"openshift.io/sctptestres":["0000:19:00.3","0000:19:00.4","0000:19:00.5","0000:19:00.6","0000:19:00.2"]}},"Checksum":3803801153}

whereas the Go struct expects DeviceIDs to be a []string.
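The mismatch can be illustrated with a short Go sketch. The struct below is simplified for illustration and modeled on the checkpoint entry shown above; it is not the exact multus-cni type. In the newer checkpoint format, DeviceIDs is a map keyed by NUMA node, so a struct that still declares it as []string fails to unmarshal:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PodDevicesEntry is a simplified illustration of one checkpoint entry.
// DeviceIDs is now a map (NUMA node -> device IDs); it was a flat
// []string in the older format, which is what the failing code expected.
type PodDevicesEntry struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     map[string][]string // was []string before the format change
}

// parseEntry unmarshals one checkpoint entry from its JSON representation.
func parseEntry(data []byte) (PodDevicesEntry, error) {
	var e PodDevicesEntry
	err := json.Unmarshal(data, &e)
	return e, err
}

func main() {
	data := []byte(`{"PodUID":"acb8c0e6-9243-494d-b1d3-bb78d175f449",` +
		`"ContainerName":"testsctp-server",` +
		`"ResourceName":"openshift.io/sctptestres",` +
		`"DeviceIDs":{"0":["0000:19:00.6"]}}`)
	e, err := parseEntry(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.DeviceIDs["0"]) // device IDs allocated from NUMA node 0
}
```

With DeviceIDs declared as []string instead, json.Unmarshal returns an error on the same input, since a JSON object cannot be decoded into a slice.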

Comment 1 Douglas Smith 2021-06-07 17:10:23 UTC
This sure smells a lot like this upstream issue regarding the checkpoint file in k8s 1.21+: https://github.com/k8snetworkplumbingwg/multus-cni/issues/665

This k8s version may have just hit the nightlies.

Comment 2 Douglas Smith 2021-06-08 14:38:41 UTC
It turns out this was caused by a regression after other changes were merged for Multus; Federico pointed out this commit in particular:

Big thanks to Federico and Peng for identifying it, and Peng for posting the fix.

upstream PR posted @ https://github.com/k8snetworkplumbingwg/multus-cni/pull/683/files

Tomo is working to get it merged upstream, and we'll backport it downstream.
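The idea behind the upstream fix (per the linked PR title, "Use the default socket path in GetResourceClient when unspecified") can be sketched in Go. The constant value and function name here are assumptions for illustration, not the exact code from PR 683:

```go
package main

import "fmt"

// Assumed default for illustration; the real path used by multus-cni may differ.
const defaultKubeletSocketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

// getKubeletSocketPath mirrors the intent of the fix: when no socket path
// is specified, fall back to the default instead of passing an empty
// string on, which was the check lost in the shadowed-variable refactor.
func getKubeletSocketPath(socketPath string) string {
	if socketPath == "" {
		return defaultKubeletSocketPath
	}
	return socketPath
}

func main() {
	fmt.Println(getKubeletSocketPath(""))                  // default applied
	fmt.Println(getKubeletSocketPath("/tmp/kubelet.sock")) // explicit path kept
}
```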

Comment 4 zhaozhanqi 2021-06-10 06:24:04 UTC
Verified this bug on 4.8.0-0.nightly-2021-06-10-014052



# oc exec -n l044j testpod1sgzn2 -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if44: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:83:00:10 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.131.0.16/23 brd 10.131.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd01:0:0:1::10/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe83:10/64 scope link 
       valid_lft forever preferred_lft forever
29: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ca:fe:c0:ff:ee:01 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.206/24 brd 192.168.2.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 2001::2/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::c8fe:c0ff:feff:ee01/64 scope link 
       valid_lft forever preferred_lft forever

Comment 7 errata-xmlrpc 2021-07-27 23:11:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

