Description of problem:
With Multus configured, the pod does not get its secondary interface and IP. This is a regression; the same configuration worked before.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-22-160516

How reproducible:
Reproduced on both AWS and Azure

Steps to Reproduce:
1. oc project openshift-multus
2. oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/Features/multus/ipam-static.yaml
3. oc create -f https://raw.githubusercontent.com/weliang1/Openshift_Networking/master/Features/multus/multus-pod-1.yaml
4. Check the interfaces inside the pod:

[root@dhcp-41-193 FILE]# oc rsh multuspod-1
sh-4.2# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
    link/ether 0a:58:0a:80:02:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.13/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::c9a:cbff:feaf:998f/64 scope link
       valid_lft forever preferred_lft forever

Actual results:
Only eth0 is present; the secondary interface/IP does not appear.

Expected results:
The secondary interface/IP should be present.

Additional info:
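For reference, a minimal sketch of the kind of manifests those two URLs point to: a macvlan net-attach-def using static IPAM, plus a pod that requests it via annotation. The exact upstream contents may differ; the master interface name, static address, and image here are assumptions.

```
# sketch of ipam-static.yaml (hypothetical contents)
cat <<'EOF' | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ipam-static
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth0",
    "mode": "bridge",
    "ipam": {
      "type": "static",
      "addresses": [
        { "address": "192.168.1.200/24" }
      ]
    }
  }'
EOF

# sketch of multus-pod-1.yaml (hypothetical contents)
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: multuspod-1
  annotations:
    k8s.v1.cni.cncf.io/networks: ipam-static
spec:
  containers:
  - name: multuspod-1
    image: centos:7   # any image with iproute installed will do
    command: ["sleep", "infinity"]
EOF
```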
I've been able to replicate the issue, and I've been keeping a log of my investigation in this gist: https://gist.github.com/dougbtv/b2dc698cbe7b75d4cea3e7c148e599d3

Currently, I don't believe Multus is being executed at all: I updated its configuration to add more logging, yet no logs are created. I have verified that CRI-O is pointing at the directory where the CNI configurations are located, and that the CNI configurations are generated correctly.
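For anyone following along, those checks look roughly like this on the node. The paths are assumptions based on default locations and may differ on a given cluster.

```
# which directory does CRI-O read CNI configurations from?
grep -A 3 '\[crio.network\]' /etc/crio/crio.conf

# are the generated CNI configurations actually in that directory?
ls -la /etc/kubernetes/cni/net.d/

# if 00-multus.conf sets a log file, it should show activity when pods start;
# silence here suggests Multus is never invoked
tail -f /var/log/multus.log
```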
Some further investigation shows that this may be related to this open CRI-O issue: https://github.com/cri-o/ocicni/issues/46

What appears to happen is that the 80-openshift-network.conf CNI configuration is created first and CRI-O picks it up; when 00-multus.conf is created later, the cached 80-openshift-network.conf is kept and 00-multus.conf is never read by CRI-O, making it so that only openshift-sdn is executed, per the configuration in 80-openshift-network.conf.

A snippet from debug-level CRI-O logs shows:

```
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.873557757Z" level=info msg="Got pod network &{Name:samplepod Namespace:doug ID:02fc2cea9a1aafea6ba30caed2c743dc973b2bd1ecca8e7b1c46c08b99ae39a9 NetNS:/proc/9158/ns/net PortMappings:[] Networks:[] NetworkConfig:map[]}"
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.873579553Z" level=info msg="About to add CNI network cni-loopback (type=loopback)"
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.876138171Z" level=info msg="Got pod network &{Name:samplepod Namespace:doug ID:02fc2cea9a1aafea6ba30caed2c743dc973b2bd1ecca8e7b1c46c08b99ae39a9 NetNS:/proc/9158/ns/net PortMappings:[] Networks:[] NetworkConfig:map[]}"
Jul 24 18:46:09 ip-10-0-151-105 crio[1048]: time="2019-07-24 18:46:09.876157773Z" level=info msg="About to add CNI network openshift-sdn (type=openshift-sdn)"
```

This shows that CRI-O is executing loopback and openshift-sdn but not Multus, and therefore may never have loaded 00-multus.conf.
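To make the suspected ordering concrete, here is a hedged sketch of what to look for; the directory path and the expected Multus network type are assumptions.

```
# Both configurations land in the CNI config directory. 00-multus.conf sorts
# first lexicographically, so once ocicni re-reads the directory Multus should
# become the default network:
ls /etc/kubernetes/cni/net.d/
# 00-multus.conf  80-openshift-network.conf

# But if CRI-O cached 80-openshift-network.conf before 00-multus.conf existed,
# the logs keep naming openshift-sdn as the network being added:
journalctl -u crio | grep "About to add CNI network"
# a healthy setup would instead show a network of type=multus
```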
This bug blocks all Multus testing in v4.2; raising priority to high.
The proposed fix for this BZ comes in the form of three pull requests: two to Multus CNI, and one to the cluster-network-operator to consume those changes. The approach requires restarting CRI-O when the Multus CNI configuration is generated, so that ocicni properly reads the Multus CNI configuration. The changes are primarily to the entrypoint script Multus CNI uses to generate its configuration; there is also a PR to the cluster-network-operator to use the new entrypoint script parameters. The PRs are:

* https://github.com/openshift/multus-cni/pull/21 -- merged
* https://github.com/openshift/multus-cni/pull/23 -- pending
* https://github.com/openshift/cluster-network-operator/pull/299 -- pending

The cluster-network-operator PR #299 depends on Multus CNI PR #23 merging before it can be retested. One modification was requested after Multus PR #21: restart CRI-O once and only once per instantiation of the Multus entrypoint script, to mitigate the impact of a restart (i.e., to avoid multiple restarts of CRI-O). A sketch of that once-only logic follows.
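This is not the actual entrypoint code, just a minimal sketch of the once-only restart behavior described above; the variable and function names are hypothetical.

```
#!/bin/bash
# Track whether CRI-O has already been restarted during this run of the
# entrypoint script, so that repeated regeneration of 00-multus.conf does
# not trigger additional restarts.
CRIO_RESTARTED=false

restart_crio_once() {
  if [ "$CRIO_RESTARTED" = false ]; then
    systemctl restart crio
    CRIO_RESTARTED=true
  fi
}

# ... after writing or updating 00-multus.conf:
restart_crio_once
```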
We encountered a downstream issue with the upstream changes required for https://github.com/openshift/multus-cni/pull/23; however, we've isolated the cause and have a pending fix upstream in https://github.com/intel/multus-cni/pull/369
Moving this bug back to ASSIGNED since https://github.com/openshift/cluster-network-operator/pull/299 has still not been merged.
The CNV network team encountered the same issue: we cannot create a VM with multiple interfaces due to this bug.
All dependencies for https://github.com/openshift/cluster-network-operator/pull/299 have been merged, so only that PR itself still needs to merge (unless it is superseded by changes to OCICNI itself).
The operator PR merged. I rebuilt the installer from master, deployed a cluster on AWS, checked that the Multus patches are present (by confirming the new command-line arguments appear in the container spec of the multus daemonset), and then applied the example net-attach-def and pod manifests (the former required a minor adjustment due to a syntax error, and the master interface is ens3, not eth0, in this cluster). Both network interfaces are present in the pod now, so I believe this issue is fixed. The checks are sketched below.
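For reference, those checks can be done roughly like this; the daemonset name and the grep patterns are assumptions.

```
# confirm the new entrypoint arguments are present in the multus daemonset
oc -n openshift-multus get daemonset multus -o yaml | grep -B 2 -A 10 'args:'

# after pointing the net-attach-def master at the node NIC (ens3 here),
# confirm both interfaces are present in the pod
oc exec multuspod-1 -- ip -o addr show
```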
Tested and verified on 4.2.0-0.ci-2019-09-05-093822

$ oc exec multuspod-1 -- ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP
    link/ether 0a:58:0a:81:02:0b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.129.2.11/23 brd 10.129.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::682e:14ff:fe97:1fc2/64 scope link
       valid_lft forever preferred_lft forever
4: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP
    link/ether 0e:7c:0d:a6:f8:61 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.200/24 brd 192.168.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::c7c:dff:fea6:f861/64 scope link
       valid_lft forever preferred_lft forever
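Beyond listing interfaces, a quick sanity check of connectivity over the secondary network can be done like this; the second pod and its static address 192.168.1.201 are hypothetical.

```
# from multuspod-1, ping a second pod attached to the same net-attach-def
# but configured with a different static address
oc exec multuspod-1 -- ping -c 3 192.168.1.201
```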
I tried this on 4.2.0-0.nightly-2019-09-08-180038, and the issue is fixed:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default
    link/ether 0a:58:0a:80:02:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.128.2.13/23 brd 10.128.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d4d3:69ff:fe82:75af/64 scope link
       valid_lft forever preferred_lft forever
4: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default
    link/ether 6a:b5:a2:f5:9e:8f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.1.1.100/24 brd 10.1.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::68b5:a2ff:fef5:9e8f/64 scope link
       valid_lft forever preferred_lft forever

Perhaps 4.2.0-0.nightly-2019-09-06-102435 did not yet include the fix PRs.
Our patch for the CNO apparently caused some other collateral damage, documented in: https://bugzilla.redhat.com/show_bug.cgi?id=1749446

This is going to force us to revert the changes made in https://github.com/openshift/cluster-network-operator/pull/299 (revert happening in: https://github.com/openshift/cluster-network-operator/pull/310).

Generally, it appears that the alternate configuration directory we use under /run isn't always accessible by openshift-sdn and Kuryr, and this causes Kuryr to break. Thanks to the Kuryr team, we have identified some SELinux failures, posted here: https://pastebin.com/iGvBec6u
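For anyone reproducing the Kuryr breakage, SELinux denials can typically be gathered on the affected node like this (assuming auditd is running; the grep pattern is only a guess at the relevant entries).

```
# recent AVC denials, filtered for CNI-related paths
ausearch -m avc -ts recent | grep -i cni

# fallback if auditd is not running
dmesg | grep -i 'avc.*denied'
```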
Per comment 22, QE needs to reopen this bug and will verify it again once the new fix PR lands.
We're moving forward with changes to OCICNI as previously mentioned; this will be the proper solution (as opposed to the workaround in CNO PR #299, which caused a CRI-O restart).

In the meantime, we have a PR open to re-implement the use of an alternate CNI configuration directory and the Multus entrypoint options (as was done in CNO PR #299), minus the CRI-O restart. This is available in https://github.com/openshift/cluster-network-operator/pull/310

Until the companion OCICNI changes land, CNO PR #311 is expected NOT to pass upgrade tests, as it requires those changes to pass that scenario.
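A hedged way to observe whether a CNI configuration that appears after CRI-O startup is actually picked up without a restart (the directory path is an assumption, and inotify-tools must be installed on the node):

```
# watch the CNI config directory for files being created or rewritten
inotifywait -m -e create -e modify /etc/kubernetes/cni/net.d/

# in a second terminal, confirm CRI-O switches its default network once
# 00-multus.conf appears
journalctl -u crio -f | grep "About to add CNI network"
```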
This issue was also observed on bare metal, as part of the RHHI.Next E2E demo integration.
Tested and verified on 4.2.0-0.nightly-2019-09-19-153821

[root@dhcp-41-193 FILE]# oc exec multuspod-1 -- ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP
    link/ether 0a:58:0a:82:00:0c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.130.0.12/23 brd 10.130.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::581f:1ff:fe90:3eac/64 scope link
       valid_lft forever preferred_lft forever
4: net1@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP
    link/ether 9a:60:f6:d0:8a:6d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.1.200/24 brd 192.168.1.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::9860:f6ff:fed0:8a6d/64 scope link
       valid_lft forever preferred_lft forever
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922