Bug 2107178 - Bond CNI: Failed to recreate pod with active-active bond: Failed to attached links to bond: Failed to set link: net2 MASTER, master index used: 4, error: bad address
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Marcin Mirecki
QA Contact: elevin
URL:
Whiteboard:
Depends On:
Blocks: 2112297
 
Reported: 2022-07-14 13:31 UTC by elevin
Modified: 2023-01-17 19:52 UTC (History)

Fixed In Version: ose-network-interface-bond-cni-container-v4.12.0-202207281636.p0.ga88d72f.assembly.stream
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:52:40 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | k8snetworkplumbingwg bond-cni pull 41 | 0 | None | open | Validate bond slaves have no mac duplicates | 2022-07-26 08:07:05 UTC
Github | openshift bond-cni pull 40 | 0 | None | Merged | ds merges: mac duplicates | 2022-07-29 09:28:05 UTC
Red Hat Product Errata | RHSA-2022:7399 | 0 | None | None | None | 2023-01-17 19:52:57 UTC

Description elevin 2022-07-14 13:31:37 UTC
Description of problem:
Failed to create a pod with a balance-alb bond

  Warning  FailedCreatePodSandBox  2m7s (x3 over 2m43s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_bondpod-2_sriov-operator-tests_6286be7c-c427-4eb1-8f85-faf4f29a5089_0(9ff0650477c6a590bbbbaa1d25ad953c20d786d9e9f6bf57355d62f2e95ffecd): error adding pod sriov-operator-tests_bondpod-2 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [sriov-operator-tests/bondpod-2/6286be7c-c427-4eb1-8f85-faf4f29a5089:bond-net2]: error adding container to network "bond-net2": Failed to attached links to bond, error: Failed to set link: net2 MASTER, master index used: 4, error: bad address
 

Version-Release number of selected component (if applicable):
4.11.0-rc.0

How reproducible:
50% 

Steps to Reproduce:
1. Create 2 bond NADs over VFs, with bond mode balance-alb
2. Create 2 pods that use the bonds; they are created successfully
3. Delete the 2 pods
4. Create the pods again

Actual results:
One or both pods get stuck in ContainerCreating

Expected results:
2 pods are running

Additional info:

!!!!!!!!!!!!!!!!!!!!
  *** CMD ADD ***  
  ##### TIME: 2022-07-14 12:08:49.789003493 +0000 UTC m=+0.000813897   
=========
 bond.go  attachLinksToBond  &{LinkAttrs:{Index:4 MTU:9000 TxQLen:-1 Name:bond0 HardwareAddr: Flags:0 RawFlags:0 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:<nil> Promisc:0 Xdp:<nil> EncapType: Protinfo:<nil> OperState:unknown NetNsID:-1 NumTxQueues:0 NumRxQueues:0 GSOMaxSize:0 GSOMaxSegs:0 Vfs:[] Group:0 Slave:<nil>} Mode:balance-alb ActiveSlave:-1 Miimon:100 UpDelay:-1 DownDelay:-1 UseCarrier:-1 ArpInterval:-1 ArpIpTargets:[] ArpValidate:BondArpValidate(-1) ArpAllTargets:BondArpAllTargets(-1) Primary:-1 PrimaryReselect:BondPrimaryReselect(-1) FailOverMac:active XmitHashPolicy:XmitHashPolicy(-1) ResendIgmp:-1 NumPeerNotif:-1 AllSlavesActive:-1 MinLinks:-1 LpInterval:-1 PacketsPerSlave:-1 LacpRate:LacpRate(-1) AdSelect:BondAdSelect(-1) AdInfo:<nil> AdActorSysPrio:-1 AdUserPortKey:-1 AdActorSystem: TlbDynamicLb:-1}    [0xc0001ee5a0 0xc0001ee6c0]   &{sockets:map[0:0xc000190b70 6:0xc000190b90 12:0xc000190bb0] lookupByDump:false}
 ---------- LIST EXISTING INTERFACES IN NAMESPACE
    NS LINK:    Name:  lo  , Index:  1 ,   TYPE:  device,   MAC:  
           ATRS:  {Index:1 MTU:65536 TxQLen:1000 Name:lo HardwareAddr: Flags:up|loopback RawFlags:65609 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f380 Promisc:0 Xdp:0xc0001cd290 EncapType:loopback Protinfo:<nil> OperState:unknown NetNsID:-1 NumTxQueues:1 NumRxQueues:1 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}  
    ----------------
    NS LINK:    Name:  eth0  , Index:  3 ,   TYPE:  veth,   MAC:  0a:58:0a:80:02:62
           ATRS:  {Index:3 MTU:1400 TxQLen:0 Name:eth0 HardwareAddr:0a:58:0a:80:02:62 Flags:up|broadcast|multicast RawFlags:69699 ParentIndex:1537 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f500 Promisc:0 Xdp:0xc0001cd2a8 EncapType:ether Protinfo:<nil> OperState:up NetNsID:0 NumTxQueues:80 NumRxQueues:80 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}  
    ----------------
    NS LINK:    Name:  bond0  , Index:  4 ,   TYPE:  bond,   MAC:  4e:c9:c0:a6:00:8f
           ATRS:  {Index:4 MTU:9000 TxQLen:1000 Name:bond0 HardwareAddr:4e:c9:c0:a6:00:8f Flags:broadcast|multicast RawFlags:5122 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f680 Promisc:0 Xdp:0xc0001cd2c0 EncapType:ether Protinfo:<nil> OperState:down NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}  
    ----------------
    NS LINK:    Name:  net1  , Index:  563 ,   TYPE:  device,   MAC:  96:cb:1b:59:47:6d
           ATRS:  {Index:563 MTU:9000 TxQLen:1000 Name:net1 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:69699 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f800 Promisc:0 Xdp:0xc0001cd2d8 EncapType:ether Protinfo:<nil> OperState:up NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}  
    ----------------
    NS LINK:    Name:  net2  , Index:  693 ,   TYPE:  device,   MAC:  96:cb:1b:59:47:6d
           ATRS:  {Index:693 MTU:9000 TxQLen:1000 Name:net2 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:4099 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc00032f980 Promisc:0 Xdp:0xc0001cd2f0 EncapType:ether Protinfo:<nil> OperState:down NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}  
    ----------------
------------ DONE LISTING INTERFACES ---------


====  bond.go.attachLinksToBond  ADDING LINK: linkObject:    &{LinkAttrs:{Index:563 MTU:9000 TxQLen:1000 Name:net1 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:69699 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc000220000 Promisc:0 Xdp:0xc0001cc0d8 EncapType:ether Protinfo:<nil> OperState:up NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}}     
====  bond.go.attachLinksToBond  BOND    bondLinkIndex:     4      
==== netNsHandle.LinkSetMasterByIndex  Execute &{RtAttr:{Len:0 Type:10} Data:[4 0 0 0] children:[]}    &{NlMsghdr:{Len:16 Type:19 Flags:5 Seq:0 Pid:0} Data:[0xc0003505a0 0xc00030d640] RawData:[] Sockets:map[0:0xc000190b70 6:0xc000190b90 12:0xc000190bb0]}   0  
====  bond.go.attachLinksToBond  ADDING LINK: linkObject:    &{LinkAttrs:{Index:693 MTU:9000 TxQLen:1000 Name:net2 HardwareAddr:96:cb:1b:59:47:6d Flags:up|broadcast|multicast RawFlags:4099 ParentIndex:0 MasterIndex:0 Namespace:<nil> Alias: Statistics:0xc000220180 Promisc:0 Xdp:0xc0001cc138 EncapType:ether Protinfo:<nil> OperState:down NetNsID:-1 NumTxQueues:16 NumRxQueues:16 GSOMaxSize:65536 GSOMaxSegs:65535 Vfs:[] Group:0 Slave:<nil>}}     
====  bond.go.attachLinksToBond  BOND    bondLinkIndex:     4      
==== netNsHandle.LinkSetMasterByIndex  Execute &{RtAttr:{Len:0 Type:10} Data:[4 0 0 0] children:[]}    &{NlMsghdr:{Len:16 Type:19 Flags:5 Seq:0 Pid:0} Data:[0xc000350860 0xc00030d6c0] RawData:[] Sockets:map[0:0xc000190b70 6:0xc000190b90 12:0xc000190bb0]}   0  

 =========================    ERROR !!!!!!!!!!          ==============================
 ERROR MESSAGE:    bad address     

===============================

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  creationTimestamp: "2022-07-14T13:18:32Z"
  generation: 1
  name: bond-net2
  namespace: sriov-operator-tests
  resourceVersion: "2548996"
  uid: 88a22472-6ee6-4cb9-8fa9-6920ede76e14
spec:
  config: |-
    {"type": "bond", "cniVersion": "0.3.1", "name": "bond-net2",
    "mode": "balance-alb", "failOverMac": 1, "linksInContainer": true, "miimon": "100", "mtu": 9000,
    "links": [{"name": "net1"},{"name": "net2"}], "capabilities": {"ips": true}, "ipam": {"type": "static"}}

===================================

apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.224/23"],"mac_address":"0a:58:0a:80:02:e0","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.224/23","gateway_ip":"10.128.2.1"}}'
    k8s.v1.cni.cncf.io/networks: '[ { "name": "test-sriov-static-bond" }, { "name":
      "test-sriov-static-bond-diff" }, { "name": "bond-net2", "interface": "bond0",
      "ips":  ["192.168.100.2/24"] } ]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[ { \"name\": \"test-sriov-static-bond\" }, { \"name\": \"test-sriov-static-bond-diff\" }, { \"name\": \"bond-net2\", \"interface\": \"bond0\", \"ips\":  [\"192.168.100.2/24\"] } ]"},"name":"bondpod-2","namespace":"sriov-operator-tests"},"spec":{"containers":[{"command":["/bin/bash","-c","sleep 2000000000000"],"image":"quay.io/ocp-edge-qe/cnf-gotests-client:v4.10","name":"bondpod-1","securityContext":{"privileged":true}}],"nodeSelector":{"kubernetes.io/hostname":"helix11.lab.eng.tlv2.redhat.com"}}}
    openshift.io/scc: privileged
  creationTimestamp: "2022-07-14T13:19:07Z"
  name: bondpod-2
  namespace: sriov-operator-tests
  resourceVersion: "2549356"
  uid: 6286be7c-c427-4eb1-8f85-faf4f29a5089
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 2000000000000
    image: quay.io/ocp-edge-qe/cnf-gotests-client:v4.10
    imagePullPolicy: IfNotPresent
    name: bondpod-1
    resources:
      limits:
        openshift.io/testresourcejumbo: "1"
        openshift.io/testresourcejumbodiff: "1"
      requests:
        openshift.io/testresourcejumbo: "1"
        openshift.io/testresourcejumbodiff: "1"
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-45ldd
      readOnly: true
    - mountPath: /etc/podnetinfo
      name: podnetinfo
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-6v6c2
  nodeName: helix11.lab.eng.tlv2.redhat.com
  nodeSelector:
    kubernetes.io/hostname: helix11.lab.eng.tlv2.redhat.com

Comment 1 Marcin Mirecki 2022-07-19 13:55:03 UTC
NOTE: the logs above are NOT produced by the system. They come from additional logging that was added to the CNI binary to debug this issue. The extra logging ran on each CNI ADD: it first listed all the network interfaces in the pod namespace, and then the bond interface along with the bond slaves to be added. Please ping me if you need more info on this.
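
For readers who want a similar interface listing, here is a minimal sketch using only the Go standard library. It is not the debug patch used for the logs above (that patch is not part of this report, and the logged structs suggest it went through a netlink handle). The function name describeLinks and the line format are made up for illustration. It also assumes the process already runs inside the network namespace of interest, which a CNI plugin would have to arrange separately.

```go
package main

import (
	"fmt"
	"net"
)

// describeLinks returns one line per interface visible in the current
// network namespace, loosely mirroring the "NS LINK" lines in the debug log.
// The name and the output format are illustrative assumptions.
func describeLinks() ([]string, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, fmt.Errorf("listing interfaces: %w", err)
	}
	lines := make([]string, 0, len(ifaces))
	for _, iface := range ifaces {
		lines = append(lines, fmt.Sprintf(
			"NS LINK: Name: %s, Index: %d, MTU: %d, MAC: %s, Flags: %s",
			iface.Name, iface.Index, iface.MTU, iface.HardwareAddr, iface.Flags))
	}
	return lines, nil
}
```

Printing the MAC next to each name is what makes a duplicate easy to spot when reading the output.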

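In the interface listing above, net1 and net2 carry the same MAC address (96:cb:1b:59:47:6d), and attaching the second of them to the bond is the step that fails. This bug could be fixed in two places: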
- sriov cni - the sriov cni could "reset" the VF MAC every time the CNI delete function is invoked. This would prevent interfaces with duplicate MACs from being given back to the VF pool.
- bond cni - we should check the slaves for MAC duplicates and provide a better log message if there are any, or maybe even update one interface with a random MAC and proceed with creating the bond.
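
The duplicate check suggested for the bond CNI could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the actual fix: the slave type and the newSlave and duplicateMACs helpers are hypothetical, and a real plugin would read names and addresses from the links in the pod namespace instead of building them by hand.

```go
package main

import (
	"fmt"
	"net"
)

// slave is a hypothetical stand-in for a bond member interface.
type slave struct {
	Name string
	MAC  net.HardwareAddr
}

// newSlave parses mac and panics on malformed input; for examples only.
func newSlave(name, mac string) slave {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		panic(fmt.Sprintf("invalid MAC %q for %s: %v", mac, name, err))
	}
	return slave{Name: name, MAC: hw}
}

// duplicateMACs maps every MAC address shared by two or more slaves to the
// names of the slaves that carry it, in input order. An empty result means
// all addresses are unique.
func duplicateMACs(slaves []slave) map[string][]string {
	byMAC := make(map[string][]string)
	for _, s := range slaves {
		key := s.MAC.String()
		byMAC[key] = append(byMAC[key], s.Name)
	}
	// Drop addresses used by a single slave; deleting during range is safe in Go.
	for mac, names := range byMAC {
		if len(names) < 2 {
			delete(byMAC, mac)
		}
	}
	return byMAC
}
```

Called with net1 and net2 as they appear in the debug log (both 96:cb:1b:59:47:6d), duplicateMACs returns a single entry that names both interfaces, which is enough to produce a clearer error than "bad address".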

Comment 2 Carlos Goncalves 2022-07-26 08:07:06 UTC
Marcin's proposed fix takes the second approach: the bond CNI detects MAC duplicates, generates a random MAC, and sets it on one of the interfaces.

PR: https://github.com/k8snetworkplumbingwg/bond-cni/pull/41
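
As a rough illustration of the "generate a random MAC" step, the sketch below produces a random unicast, locally administered address using the Go standard library. It is an assumption about how such an address might be generated, not a copy of the code in the PR; the function name randomMAC is hypothetical, and applying the address to an interface is left out.

```go
package main

import (
	"crypto/rand"
	"net"
)

// randomMAC returns a random 6-byte MAC address that is unicast and
// locally administered, so it cannot collide with vendor-assigned addresses.
func randomMAC() (net.HardwareAddr, error) {
	mac := make(net.HardwareAddr, 6)
	if _, err := rand.Read(mac); err != nil {
		return nil, err
	}
	mac[0] &^= 0x01 // clear the multicast bit: unicast address
	mac[0] |= 0x02  // set the locally administered bit
	return mac, nil
}
```

Clearing the lowest bit of the first octet keeps the address unicast, and setting the second-lowest bit marks it as locally administered.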

Comment 3 Marcin Mirecki 2022-07-28 10:56:16 UTC
This will fix the problem on the bond-cni side.
We should also update the sriov-cni to fix it there, but with the bond fix in place that is less urgent.

Comment 4 Carlos Goncalves 2022-07-29 09:28:06 UTC
Fixed in ose-network-interface-bond-cni-container-v4.12.0-202207281636.p0.ga88d72f.assembly.stream

Comment 6 elevin 2022-11-16 10:25:44 UTC
Server Version: 4.12.0-rc.0
====================================

NAME            READY   STATUS    RESTARTS   AGE
testpod-ks7qm   1/1     Running   0          62s
testpod-scb7d   1/1     Running   0          65s

Comment 8 errata-xmlrpc 2023-01-17 19:52:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

