Description of problem:
Failed to create a pod with a balance-alb bond:

  Warning  FailedCreatePodSandBox  2m7s (x3 over 2m43s)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_bondpod-2_sriov-operator-tests_6286be7c-c427-4eb1-8f85-faf4f29a5089_0(9ff0650477c6a590bbbbaa1d25ad953c20d786d9e9f6bf57355d62f2e95ffecd): error adding pod sriov-operator-tests_bondpod-2 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [sriov-operator-tests/bondpod-2/6286be7c-c427-4eb1-8f85-faf4f29a5089:bond-net2]: error adding container to network "bond-net2": Failed to attached links to bond, error: Failed to set link: net2 MASTER, master index used: 4, error: bad address

Version-Release number of selected component (if applicable):
4.11.0-rc.0

How reproducible:
50%

Steps to Reproduce:
1. Create 2 NAD bonds with VFs - bond mode balance-alb
2. Create 2 pods with bonds - they are created successfully
3. Remove the 2 pods
4. Create them again

Actual results:
One or two pods get stuck in ContainerCreating.

Expected results:
Both pods are Running.

Additional info:
*** CMD ADD ***
##### TIME: 2022-07-14 12:08:49.789003493 +0000 UTC m=+0.000813897
========= bond.go attachLinksToBond &{LinkAttrs:{Index:4 MTU:9000 TxQLen:-1 Name:bond0 HardwareAddr: Flags:0 RawFlags:0 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:<nil> Promisc:0 Xdp:<nil> EncapType: Protinfo:<nil> OperState:unknown NetNsID:-1 NumTxQueues:0 NumRxQueues:0 GSOMaxSize:0 GSOMaxSegs:0 Vfs:[] Group:0 Slave:<nil>} Mode:balance-alb ActiveSlave:-1 Miimon:100 UpDelay:-1 DownDelay:-1 UseCarrier:-1 ArpInterval:-1 ArpIpTargets:[] ArpValidate:BondArpValidate(-1) ArpAllTargets:BondArpAllTargets(-1) Primary:-1 PrimaryReselect:BondPrimaryReselect(-1) FailOverMac:active XmitHashPolicy:XmitHashPolicy(-1) ResendIgmp:-1 NumPeerNotif:-1 AllSlavesActive:-1 MinLinks:-1 LpInterval:-1 PacketsPerSlave:-1 LacpRate:LacpRate(-1) AdSelect:BondAdSelect(-1) AdInfo:<nil> AdActorSysPrio:-1 AdUserPortKey:-1 AdActorSystem: TlbDynamicLb:-1} [0xc0001ee5a0 0xc0001ee6c0] &{sockets:map[0:0xc000190b70 6:0xc000190b90 12:0xc000190bb0] lookupByDump:false}
---------- LIST EXISTING INTERFACES IN NAMESPACE
NS LINK: Name: lo , Index: 1 , TYPE: device, MAC:  ATRS: {Index:1 MTU:65536 TxQLen:1000 Name:lo HardwareAddr: Flags:up|loopback RawFlags:65609 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f380 Promisc:0 Xdp:0xc0001cd290 EncapType:loopback Protinfo:<nil> OperState:unknown NetNsID:-1 NumTxQueues:1 NumRxQueues:1 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}
----------------
NS LINK: Name: eth0 , Index: 3 , TYPE: veth, MAC: 0a:58:0a:80:02:62 ATRS: {Index:3 MTU:1400 TxQLen:0 Name:eth0 HardwareAddr:0a:58:0a:80:02:62 Flags:up|broadcast|multicast RawFlags:69699 ParentIndex:1537 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f500 Promisc:0 Xdp:0xc0001cd2a8 EncapType:ether Protinfo:<nil> OperState:up NetNsID:0 NumTxQueues:80 NumRxQueues:80 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}
----------------
NS LINK: Name: bond0 , Index: 4 , TYPE: bond, MAC: 4e:c9:c0:a6:00:8f ATRS: {Index:4 MTU:9000 TxQLen:1000 Name:bond0 HardwareAddr:4e:c9:c0:a6:00:8f Flags:broadcast|multicast RawFlags:5122 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f680 Promisc:0 Xdp:0xc0001cd2c0 EncapType:ether Protinfo:<nil> OperState:down NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}
----------------
NS LINK: Name: net1 , Index: 563 , TYPE: device, MAC: 96:cb:1b:59:47:6d ATRS: {Index:563 MTU:9000 TxQLen:1000 Name:net1 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:69699 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f800 Promisc:0 Xdp:0xc0001cd2d8 EncapType:ether Protinfo:<nil> OperState:up NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}
----------------
NS LINK: Name: net2 , Index: 693 , TYPE: device, MAC: 96:cb:1b:59:47:6d ATRS: {Index:693 MTU:9000 TxQLen:1000 Name:net2 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:4099 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f980 Promisc:0 Xdp:0xc0001cd2f0 EncapType:ether Protinfo:<nil> OperState:down NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}
----------------
------------ DONE LISTING INTERFACES ---------
==== bond.go.attachLinksToBond ADDING LINK: linkObject: &{LinkAttrs:{Index:563 MTU:9000 TxQLen:1000 Name:net1 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:69699 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc000220000 Promisc:0 Xdp:0xc0001cc0d8 EncapType:ether Protinfo:<nil> OperState:up NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}}
==== bond.go.attachLinksToBond BOND bondLinkIndex: 4
==== netNsHandle.LinkSetMasterByIndex Execute &{RtAttr:{Len:0 Type:10} Data:[4 0 0 0] children:[]} &{NlMsghdr:{Len:16 Type:19 Flags:5 Seq:0 Pid:0} Data:[0xc0003505a0 0xc00030d640] RawData:[] Sockets:map[0:0xc000190b70 6:0xc000190b90 12:0xc000190bb0]} 0
==== bond.go.attachLinksToBond ADDING LINK: linkObject: &{LinkAttrs:{Index:693 MTU:9000 TxQLen:1000 Name:net2 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:4099 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc000220180 Promisc:0 Xdp:0xc0001cc138 EncapType:ether Protinfo:<nil> OperState:down NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}}
==== bond.go.attachLinksToBond BOND bondLinkIndex: 4
==== netNsHandle.LinkSetMasterByIndex Execute &{RtAttr:{Len:0 Type:10} Data:[4 0 0 0] children:[]} &{NlMsghdr:{Len:16 Type:19 Flags:5 Seq:0 Pid:0} Data:[0xc000350860 0xc00030d6c0] RawData:[] Sockets:map[0:0xc000190b70 6:0xc000190b90 12:0xc000190bb0]} 0
========================= ERROR !!!!!!!!!! ==============================
ERROR MESSAGE: bad address
===============================
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  creationTimestamp: "2022-07-14T13:18:32Z"
  generation: 1
  name: bond-net2
  namespace: sriov-operator-tests
  resourceVersion: "2548996"
  uid: 88a22472-6ee6-4cb9-8fa9-6920ede76e14
spec:
  config: |-
    {"type": "bond", "cniVersion": "0.3.1", "name": "bond-net2", "mode": "balance-alb", "failOverMac": 1, "linksInContainer": true, "miimon": "100", "mtu": 9000, "links": [{"name": "net1"},{"name": "net2"}], "capabilities": {"ips": true}, "ipam": {"type": "static"}}
===================================
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.224/23"],"mac_address":"0a:58:0a:80:02:e0","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.224/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/networks: '[ { "name": "test-sriov-static-bond" }, { "name": "test-sriov-static-bond-diff" }, { "name": "bond-net2", "interface": "bond0", "ips": ["192.168.100.2/24"] } ]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[ { \"name\": \"test-sriov-static-bond\" }, { \"name\": \"test-sriov-static-bond-diff\" }, { \"name\": \"bond-net2\", \"interface\": \"bond0\", \"ips\": [\"192.168.100.2/24\"] } ]"},"name":"bondpod-2","namespace":"sriov-operator-tests"},"spec":{"containers":[{"command":["/bin/bash","-c","sleep 2000000000000"],"image":"quay.io/ocp-edge-qe/cnf-gotests-client:v4.10","name":"bondpod-1","securityContext":{"privileged":true}}],"nodeSelector":{"kubernetes.io/hostname":"helix11.lab.eng.tlv2.redhat.com"}}}
    openshift.io/scc: privileged
  creationTimestamp: "2022-07-14T13:19:07Z"
  name: bondpod-2
  namespace: sriov-operator-tests
  resourceVersion: "2549356"
  uid: 6286be7c-c427-4eb1-8f85-faf4f29a5089
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 2000000000000
    image: quay.io/ocp-edge-qe/cnf-gotests-client:v4.10
    imagePullPolicy: IfNotPresent
    name: bondpod-1
    resources:
      limits:
        openshift.io/testresourcejumbo: "1"
        openshift.io/testresourcejumbodiff: "1"
      requests:
        openshift.io/testresourcejumbo: "1"
        openshift.io/testresourcejumbodiff: "1"
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-45ldd
      readOnly: true
    - mountPath: /etc/podnetinfo
      name: podnetinfo
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-6v6c2
  nodeName: helix11.lab.eng.tlv2.redhat.com
  nodeSelector:
    kubernetes.io/hostname: helix11.lab.eng.tlv2.redhat.com
NOTE: the logs provided above are NOT the logs produced by the system. They come from additional logging added to the CNI binary to debug the issue. The extra logging runs on each CNI ADD: it first lists all the network interfaces in the pod namespace, then prints the bond interface along with the bond slaves about to be attached. Please ping me if you need more info on this.

This bug could be fixed in either of two places:
- sriov-cni: "reset" the VF MAC every time the CNI delete function is invoked. This prevents interfaces with duplicate MACs from being handed back to the VF pool.
- bond-cni: check the slaves for duplicate MACs and emit a clearer log message when they exist, or even update one interface with a random MAC and proceed with creating the bond.
Marcin's proposed fix takes the second approach: the bond CNI detects duplicate MACs, generates a random MAC, and sets it on one of the interfaces. PR: https://github.com/k8snetworkplumbingwg/bond-cni/pull/41
This fixes the problem on the bond-cni side. We should also update sriov-cni to address it there, but with the bond fix in place that is less urgent.
Fixed in ose-network-interface-bond-cni-container-v4.12.0-202207281636.p0.ga88d72f.assembly.stream
Server Version: 4.12.0-rc.0
====================================
NAME            READY   STATUS    RESTARTS   AGE
testpod-ks7qm   1/1     Running   0          62s
testpod-scb7d   1/1     Running   0          65s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399