Description of problem:
Version-Release number of selected component (if applicable):
Multus using Whereabouts on a bonded interface over a VLAN.
Reproducible every time on the customer platform; difficult to reproduce internally (not reproduced yet).
Steps to Reproduce:
**SEE FIRST COMMENT BELOW PROBLEM DESCRIPTION FOR INTERNAL REPRODUCER STEPS**
1. Deploy a VLAN on the target host node.
2. Deploy a NetworkAttachmentDefinition specifying SR-IOV interfaces net1 and net2, and a bond0 interface linking both.
3. Create pod1 on the target host (nodeA); observe Multus create the bonded interface with IP 10.0.0.16.
4. Create pod2 in a secondary namespace on nodeA; observe Multus create the bonded interface with IP 10.0.0.16 (duplicate entry).
We are seeing that the reconciler runs every 15 minutes and logs that the IPs are cleaned, at which point we believe the IP is released back into the pool as a viable assignment.
We also observed that the bond0 interface does not report its IP addresses in the network-status annotation, which may be part of the problem with the reconciler: it may not detect that this IP is allocated and therefore clears it for reassignment each time it runs.
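The suspected reconciler behaviour can be sketched as follows. This is a simplified model only, not the Whereabouts implementation: the function name `find_orphaned_allocations`, the shape of the `allocations` and `pods` arguments, and the pod references are all hypothetical, chosen to show how an IP missing from network-status would look orphaned.

```python
from ipaddress import ip_interface


def find_orphaned_allocations(allocations, pods):
    """Return the IPs a reconciler following this simplified model would release.

    allocations: dict mapping an IP string to the "namespace/name" of its owner pod.
    pods: dict mapping "namespace/name" to that pod's parsed network-status list.
    (Both shapes are assumptions made for this sketch.)
    """
    orphaned = []
    for ip, pod_ref in allocations.items():
        status = pods.get(pod_ref)
        if status is None:
            # The owning pod no longer exists, so releasing is correct.
            orphaned.append(ip)
            continue
        # Collect every address the pod reports, with or without a prefix length.
        reported = {
            str(ip_interface(addr).ip)
            for entry in status
            for addr in entry.get("ips", [])
        }
        if ip not in reported:
            # The pod is alive but does not report the IP: it looks orphaned.
            orphaned.append(ip)
    return sorted(orphaned)
```

With a live pod whose bond0 entry carries no `ips` field, this model reports the pod's own address as orphaned, which matches the behaviour described above.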
Both pods deploy with the same IP on the secondary interface bond0 (duplicated IP). They share the same host node, the same NetworkAttachmentDefinition, and the same VLAN.
The NetworkAttachmentDefinition defines a very large Whereabouts range (.0 to .254) and is not narrowed by any restriction annotation, so there should be no conflict or reason for reuse on these pods.
Each pod should deploy with a new IP; an IP should be reused only after the pod holding it is removed and its allocation entry is cleared.
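To illustrate why a premature release leads to a duplicate, here is a toy address pool. It assumes a lowest-free-address allocation policy and uses a hypothetical `SimplePool` class; it is not the Whereabouts allocator, and the CIDR and owner names are illustrative only.

```python
from ipaddress import ip_network


class SimplePool:
    """Toy IPAM pool that hands out the lowest free host address (an assumption)."""

    def __init__(self, cidr):
        self.network = ip_network(cidr)
        self.allocations = {}  # IP string -> owner reference

    def allocate(self, owner):
        # Walk the usable host addresses in order and take the first free one.
        for host in self.network.hosts():
            ip = str(host)
            if ip not in self.allocations:
                self.allocations[ip] = owner
                return ip
        raise RuntimeError("address pool exhausted")

    def release(self, ip):
        # Releasing an address makes it immediately available again.
        self.allocations.pop(ip, None)
```

In this model, if pod1's address is released while pod1 is still running, the next `allocate` call returns that same address to pod2, producing the duplicate seen in the steps above. Without the release, pod2 gets the next address.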
Please see linked case for specific deployment information, including:
- network attachment definition yamls
- pod deployment describe outs
- pod creation yamls for reproduction
Available to gather any additional information required here.
Have been working with Doug in Forum multus here:
see ongoing discussion:
It looks like we've also found that bond-cni isn't returning an IP address in its result, judging by the Multus CNI cache:
sh-4.4# cat /var/lib/cni/multus/results/bond-net1-823c50be23bc812099fe3adb88f1eb8e45c33cf70e2026925442795a3d34275d-net3
Even though the interface shows it's been assigned when exec'ing into the pod:
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 36:d7:d2:7f:13:4a brd ff:ff:ff:ff:ff:ff
    inet 192.0.2.1/24 brd 192.0.2.255 scope global bond0
(documentation IP address used)
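A quick way to check a cached result for addresses is sketched below. It assumes the cache file is JSON that either is a CNI result itself or wraps one under a `result` key, and that addresses appear under the CNI result's `ips` list as `address` fields. The helper name `ips_in_cached_result` is hypothetical.

```python
import json


def ips_in_cached_result(cache_text):
    """Return the addresses listed in a cached CNI result, or [] if none.

    Assumption: the document is either a bare CNI result or a wrapper
    that stores the result under a "result" key.
    """
    doc = json.loads(cache_text)
    result = doc.get("result") or doc
    # A result with no "ips" key (or a null one) yields an empty list.
    return [entry["address"] for entry in result.get("ips") or []]
```

An empty list for the bond0 attachment, while `ip addr` inside the pod shows an address on bond0, would be consistent with the plugin not returning its IP in the result.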
Have a proposed upstream fix: https://github.com/k8snetworkplumbingwg/bond-cni/pull/34
* The Whereabouts IP reconciler deletes the IP address allocation because the address is not shown in the */network-status annotation.
* The bond CNI is not returning a proper IP address result.
* The bond CNI is using CNI 1.0.0.
* There appears to be a CNI version incompatibility between bond-cni at CNI v1.0.0 and Multus at 0.3.2.
In brief: It appears this problem occurs because of a CNI version incompatibility between Multus CNI and Bond-CNI.
Just an additional update to note that it looks like we can solve for this inconsistency between CNI versions with Multus CNI. We've got a change merged upstream, and we're looking into bumping up a dependency.
Additionally, we're planning on backporting this fix to 4.10.
QE can install a cluster with bonding interfaces on vSphere only.
The vSphere cluster installation is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=2092129
Tested and verified in 4.11.0-0.nightly-2022-06-01-200905 by following steps in https://gist.github.com/dougbtv/bea4214382f1bbb1e820457e14a46eca
[weliang@weliang ~]$ oc get pod
NAME READY STATUS RESTARTS AGE
singlepod 1/1 Running 0 7s
[weliang@weliang ~]$ oc describe pod singlepod | grep network-status -A35
[weliang@weliang ~]$ oc describe pod singlepod | grep network-status -A35 | grep -P "network.status|192.0"
[weliang@weliang ~]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-01-200905 True False 16m Cluster version is 4.11.0-0.nightly-2022-06-01-200905
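The grep-based check above can also be expressed as a small script. This sketch assumes the network-status annotation value is a JSON list of attachment objects with `interface` and `ips` fields. The helper name `interface_ips` is hypothetical, and the sample annotation in the usage note is illustrative rather than captured from the test cluster.

```python
import json


def interface_ips(network_status_json, interface):
    """Return the IPs that the network-status annotation reports for one interface."""
    return [
        ip
        for entry in json.loads(network_status_json)
        if entry.get("interface") == interface
        for ip in entry.get("ips", [])
    ]
```

For example, given an annotation whose bond0 entry contains `"ips": ["192.0.2.1"]`, `interface_ips(annotation, "bond0")` returns `["192.0.2.1"]`. An empty list would indicate the pre-fix behaviour.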
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.