Bug 1732598 - With multus configured pod can not get secondary interface (Regression bug)
Summary: With multus configured pod can not get secondary interface (Regression bug)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Dan Williams
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-23 20:35 UTC by Weibin Liang
Modified: 2019-10-16 06:31 UTC
CC: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:30:48 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github cri-o cri-o pull 2800 'None' closed Update vendor code for cni and ocicni and libpod 2020-02-14 05:24:06 UTC
Github openshift cluster-network-operator pull 311 'None' closed Bug 1732598: Alternate CNI config dir 2020-02-14 05:24:06 UTC
Github openshift multus-cni pull 21 'None' closed Bug 1732598: Entrypoint changes for watch loop & CRIO restart 2020-02-14 05:24:06 UTC
Github openshift multus-cni pull 23 'None' closed Bug 1732598: Adds one-shot CRIO restart 2020-02-14 05:24:06 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:31:00 UTC

Description Weibin Liang 2019-07-23 20:35:13 UTC
Description of problem:
With multus configured, pod can not get secondary interface and IP, this is a regression bug, same configuration worked before.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-22-160516

How reproducible:
Reproduced in both AWS and Azure

Steps to Reproduce:
1. oc project openshift-multus
2. oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/Features/multus/ipam-static.yaml
3. oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/Features/multus/multus-pod-1.yaml
4. [root@dhcp-41-193 FILE]# oc rsh multuspod-1
sh-4.2# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP 
    link/ether 0a:58:0a:80:02:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.13/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c9a:cbff:feaf:998f/64 scope link 
       valid_lft forever preferred_lft forever
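
For context, the ipam-static.yaml referenced in step 2 defines the net-attach-def with static IPAM. Its exact contents are not reproduced in this bug; the sketch below is an illustrative assumption (the macvlan/master choice and names are not confirmed), shaped to match the 192.168.1.200/24 address seen in the later verification output:

```yaml
# Hypothetical sketch only -- not the actual ipam-static.yaml from the repo.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ipam-static
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "ipam": {
        "type": "static",
        "addresses": [
          { "address": "192.168.1.200/24" }
        ]
      }
    }'
```

The test pod would then request this attachment through the standard `k8s.v1.cni.cncf.io/networks: ipam-static` pod annotation, which Multus resolves into the secondary `net1` interface.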


Actual results:
The secondary interface and IP are not present in the pod.

Expected results:
The secondary interface and IP should be present in the pod.


Additional info:

Comment 1 Douglas Smith 2019-07-24 18:34:47 UTC
I've been able to replicate the issue. I've been keeping a log of my investigation in this gist: https://gist.github.com/dougbtv/b2dc698cbe7b75d4cea3e7c148e599d3

Currently, I don't believe that Multus is being executed. I believe this because I updated the configuration to add more logging, yet no logs are created.

I have verified that CRI-O is pointing at the directory where the CNI configurations are located, and that the CNI configurations are correctly generated.

Comment 2 Douglas Smith 2019-07-24 19:51:40 UTC
Some further investigation shows that this may be related to this open issue for CRI-O: https://github.com/cri-o/ocicni/issues/46

It appears that the 80-openshift-network.conf CNI configuration is created first and CRI-O picks it up; when 00-multus.conf is created later, CRI-O has already cached 80-openshift-network.conf and never reads 00-multus.conf, so only openshift-sdn is executed per the configuration in 80-openshift-network.conf.

A snippet from debug level CRI-O logs shows: 

```
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.873557757Z" level=info msg="Got pod network &{Name:samplepod Namespace:doug ID:02fc2cea9a1aafea6ba30caed2c743dc973b2bd1ecca8e7b1c46c08b99ae39a9 NetNS:/proc/9158/ns/net PortMappings:[] Networks:[] NetworkConfig:map[]}"
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.873579553Z" level=info msg="About to add CNI network cni-loopback (type=loopback)"
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.876138171Z" level=info msg="Got pod network &{Name:samplepod Namespace:doug ID:02fc2cea9a1aafea6ba30caed2c743dc973b2bd1ecca8e7b1c46c08b99ae39a9 NetNS:/proc/9158/ns/net PortMappings:[] Networks:[] NetworkConfig:map[]}"
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.876157773Z" level=info msg="About to add CNI network openshift-sdn (type=openshift-sdn)"
```

This shows that CRI-O is executing loopback and openshift-sdn but not multus, and therefore may not have loaded the 00-multus.conf.
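
For reference, ocicni picks the CNI configuration file that sorts first in the directory, which is why 00-multus.conf should win over 80-openshift-network.conf once it exists -- but only if CRI-O re-reads the directory. A delegating Multus configuration of this general shape is what 00-multus.conf would contain (a hedged sketch; the kubeconfig path and delegate list are illustrative, not copied from this cluster):

```json
{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.3.1",
      "name": "openshift-sdn",
      "type": "openshift-sdn"
    }
  ]
}
```

Because CRI-O cached 80-openshift-network.conf before this file appeared, the delegation through Multus never happens and only openshift-sdn runs.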

Comment 5 Weibin Liang 2019-07-31 12:26:11 UTC
This bug blocks all multus testing in v4.2; changing priority to high.

Comment 10 Douglas Smith 2019-08-29 15:10:54 UTC
The proposed fix for this BZ is coming in the form of three pull requests, two to Multus CNI, and one to the cluster-network-operator to implement those changes.

The approach requires restarting CRIO when the Multus CNI configuration is generated, so that ocicni properly reads the Multus CNI configuration. The changes are primarily to the entrypoint script that Multus CNI uses to generate its configuration. There is also a PR to the cluster-network-operator to use the new entrypoint script parameters.

Those PRs are:

* https://github.com/openshift/multus-cni/pull/21 -- merged
* https://github.com/openshift/multus-cni/pull/23 -- pending
* https://github.com/openshift/cluster-network-operator/pull/299 -- pending

The cluster-network-operator PR #299 depends on Multus CNI PR #23 being merged before it can be retested. One modification was requested after Multus PR #21: restart CRIO once and only once during an instantiation of the Multus entrypoint script, to mitigate the impact of a CRIO restart (i.e. so there are not multiple restarts of CRIO).
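
The one-shot restart guard described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual entrypoint script; the function and marker names are hypothetical, and the real script would restart the CRIO service (e.g. via systemd) where this sketch only records the request:

```shell
#!/bin/sh
# Rough sketch of a "restart CRIO once and only once" guard, as requested
# after Multus PR #21. All names here are hypothetical.

CRIO_RESTART_DONE="/tmp/crio-restart-done"
rm -f "$CRIO_RESTART_DONE" /tmp/crio-restarts.log

restart_crio() {
  # The real entrypoint would restart the CRIO service here; this stand-in
  # just records that a restart was requested.
  echo "restart" >> /tmp/crio-restarts.log
}

write_multus_config() {
  # Placeholder for the entrypoint's config-generation step.
  :
}

# Watch-loop body: regenerate the config, then restart CRIO at most once
# per instantiation of the script, however many times the loop runs.
for i in 1 2 3; do
  write_multus_config
  if [ ! -e "$CRIO_RESTART_DONE" ]; then
    restart_crio
    touch "$CRIO_RESTART_DONE"
  fi
done
```

Running the loop body three times still produces exactly one restart request, which is the behavior the reviewers asked for.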

Comment 11 Douglas Smith 2019-08-30 22:21:29 UTC
We encountered a downstream issue with the upstream changes required for https://github.com/openshift/multus-cni/pull/23, however, we've isolated the cause and have a pending fix upstream in https://github.com/intel/multus-cni/pull/369

Comment 13 zhaozhanqi 2019-09-02 03:11:40 UTC
Moving this bug back to ASSIGNED since https://github.com/openshift/cluster-network-operator/pull/299 has still not been merged.

Comment 14 Nikita 2019-09-03 12:39:14 UTC
The CNV network team encountered the same issue. We cannot create VMs with multiple interfaces due to this bug.

Comment 15 Douglas Smith 2019-09-03 13:40:02 UTC
All dependencies for https://github.com/openshift/cluster-network-operator/pull/299 have been merged, so only that PR needs to be merged (or superseded by changes to OCICNI itself).

Comment 16 Ihar Hrachyshka 2019-09-05 06:16:25 UTC
The operator PR merged. I've just rebuilt the installer from master, deployed a cluster in AWS, checked that the multus patches are present (by checking that the new command-line arguments are present in the container manifest for the multus daemonset), and then applied the example net-attach-def and pod manifests (the former required a minor adjustment due to a syntax error, and the interface name is ens3, not eth0, in this cluster).

Both network interfaces are present in the pod now, so I believe this issue is fixed.

Comment 18 Weibin Liang 2019-09-05 13:11:39 UTC
Tested and verified on 4.2.0-0.ci-2019-09-05-093822

$ oc exec multuspod-1 -- ip address  
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP 
    link/ether 0a:58:0a:81:02:0b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.129.2.11/23 brd 10.129.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::682e:14ff:fe97:1fc2/64 scope link 
       valid_lft forever preferred_lft forever
4: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP 
    link/ether 0e:7c:0d:a6:f8:61 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.200/24 brd 192.168.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::c7c:dff:fea6:f861/64 scope link 
       valid_lft forever preferred_lft forever

Comment 21 zhaozhanqi 2019-09-09 02:05:12 UTC
I tried this on 4.2.0-0.nightly-2019-09-08-180038, this issue was fixed

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:80:02:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.13/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d4d3:69ff:fe82:75af/64 scope link 
       valid_lft forever preferred_lft forever
4: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default 
    link/ether 6a:b5:a2:f5:9e:8f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.1.100/24 brd 10.1.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::68b5:a2ff:fef5:9e8f/64 scope link 
       valid_lft forever preferred_lft forever

Perhaps 4.2.0-0.nightly-2019-09-06-102435 did not yet contain the fix PR.

Comment 22 Douglas Smith 2019-09-09 16:01:59 UTC
Our patch for the CNO apparently has caused some other collateral damage, documented in: https://bugzilla.redhat.com/show_bug.cgi?id=1749446

This is going to cause us to have to revert the changes made in https://github.com/openshift/cluster-network-operator/pull/299 (revert happening in: https://github.com/openshift/cluster-network-operator/pull/310)

Generally, it appears that the alternate configuration directory we use in the /run directory isn't always accessible by openshift-sdn and kuryr, and this causes kuryr to break.

Thanks to the Kuryr team, we have identified some SELinux failures; these have been posted here: https://pastebin.com/iGvBec6u

Comment 23 Weibin Liang 2019-09-09 18:48:40 UTC
Per comment 22, QE needs to reopen this bug and will verify it again after the new fix PR.

Comment 24 Douglas Smith 2019-09-11 13:34:45 UTC
We're moving forward with the changes to OCICNI as previously mentioned; this will be the proper solution (as opposed to the workaround in CNO PR #299, which caused a CRIO restart).

In the meantime, we have a PR open to re-implement the changes for using an alternate CNI configuration directory and the Multus entrypoint options (as was done in CNO PR #299), minus the CRIO restart. This is available in https://github.com/openshift/cluster-network-operator/pull/310

Until the companion OCICNI changes land, CNO PR #311 is expected NOT to pass upgrade tests, as it requires those changes to pass that scenario.
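
For background, CRI-O's `network_dir` setting controls which directory ocicni scans and watches for CNI configuration, so the alternate-directory approach amounts to pointing both Multus's generated config and this setting at the same place. A hedged sketch of the relevant crio.conf stanza (the path shown is CRI-O's conventional default, not the actual alternate /run directory used by the PRs, which is not quoted in this bug):

```toml
# /etc/crio/crio.conf -- illustrative excerpt only
[crio.network]
# Directory ocicni scans/watches for CNI configuration files.
network_dir = "/etc/kubernetes/cni/net.d/"
```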

Comment 28 gsharma 2019-09-18 15:22:20 UTC
This issue was observed on bare metal as well, as part of the RHHI.Next E2E demo integration.

Comment 33 Weibin Liang 2019-09-19 20:28:46 UTC
Tested and verified on 4.2.0-0.nightly-2019-09-19-153821

[root@dhcp-41-193 FILE]# oc exec multuspod-1 -- ip address 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP 
    link/ether 0a:58:0a:82:00:0c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.130.0.12/23 brd 10.130.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::581f:1ff:fe90:3eac/64 scope link 
       valid_lft forever preferred_lft forever
4: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP 
    link/ether 9a:60:f6:d0:8a:6d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.200/24 brd 192.168.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::9860:f6ff:fed0:8a6d/64 scope link 
       valid_lft forever preferred_lft forever

Comment 34 errata-xmlrpc 2019-10-16 06:30:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

