Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1997205

Summary: ovn-kube test flake "operator conditions network" "DaemonSet "..." is not available (awaiting 1 nodes)
Product: OpenShift Container Platform Reporter: Dan Winship <danw>
Component: NetworkingAssignee: Surya Seetharaman <surya>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: medium    
Priority: low CC: qili, rravaiol, trozet, zzhao
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-23 10:03:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Winship 2021-08-24 15:56:22 UTC
Seen sporadically, maybe only in ipv6 runs, but I'm not sure.

eg, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6/1429865915159154688

The test failed with:

  operator conditions network
    Operator progressing (Deploying): DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)

Digging around eventually uncovered that kubelet was reporting:

Aug 23 19:49:14.892179 worker-0.ostest.test.metalkube.org hyperkube[2079]: E0823 19:49:14.892136    2079 remote_runtime.go:116] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-metrics-daemon-8nn5z_openshift-multus_b463fd6d-b041-478e-867a-c3c92244f8bf_0(9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e): error adding pod openshift-multus_network-metrics-daemon-8nn5z to CNI network \"multus-cni-network\": [openshift-multus/network-metrics-daemon-8nn5z:ovn-kubernetes]: error adding container to network \"ovn-kubernetes\": CNI request failed with status 400: '[openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e] [openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:50:2d:e9:e7 [fd01:0:0:4::4/64]\n'"

(in particular, "timed out waiting for OVS port binding (ovn-installed) for 0a:58:50:2d:e9:e7 [fd01:0:0:4::4/64]")

ovnkube-node-v4b8r logs show the same:

I0823 19:48:14.203104    5209 cni.go:236] [openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e] ADD starting CNI request [openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e]
W0823 19:49:14.366222    5209 helper_linux.go:318] [openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e] pod uid b463fd6d-b041-478e-867a-c3c92244f8bf: timed out waiting for OVS port binding (ovn-installed) for 0a:58:50:2d:e9:e7 [fd01:0:0:4::4/64]
I0823 19:49:14.366272    5209 cni.go:246] [openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e] ADD finished CNI request [openshift-multus/network-metrics-daemon-8nn5z 9912e227bcf966521307fdc497b4dced280e030370ef7e019643250a60a2828e], result "", err failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:50:2d:e9:e7 [fd01:0:0:4::4/64]

corresponding ovn-controller logs show:

2021-08-23T19:48:14Z|02797|binding|INFO|Claiming lport openshift-multus_network-metrics-daemon-8nn5z for this chassis.
2021-08-23T19:48:14Z|02798|binding|INFO|openshift-multus_network-metrics-daemon-8nn5z: Claiming 0a:58:92:02:16:1f fd01:0:0:4::3
2021-08-23T19:48:14Z|02799|binding|INFO|Setting lport openshift-multus_network-metrics-daemon-8nn5z ovn-installed in OVS
2021-08-23T19:48:14Z|02800|binding|INFO|Setting lport openshift-multus_network-metrics-daemon-8nn5z up in Southbound

which... wait, that's the wrong IP!


scrolling back, the very first attempt in this ovnkube-node log was:

I0823 19:02:39.219719    5209 cni.go:236] [openshift-multus/network-metrics-daemon-8nn5z 76b8bc5030bebe3a6c0e4390600e44c1da45e86898264e99a1693cd7b3a1fa30] ADD starting CNI request [openshift-multus/network-metrics-daemon-8nn5z 76b8bc5030bebe3a6c0e4390600e44c1da45e86898264e99a1693cd7b3a1fa30]
W0823 19:03:39.414174    5209 helper_linux.go:318] [openshift-multus/network-metrics-daemon-8nn5z 76b8bc5030bebe3a6c0e4390600e44c1da45e86898264e99a1693cd7b3a1fa30] pod uid b463fd6d-b041-478e-867a-c3c92244f8bf: timed out waiting for OVS port binding (ovn-installed) for 0a:58:50:2d:e9:e7 [fd01:0:0:4::4/64]

vs

2021-08-23T19:02:39Z|00032|binding|INFO|Claiming lport openshift-multus_network-metrics-daemon-8nn5z for this chassis.
2021-08-23T19:02:39Z|00033|binding|INFO|openshift-multus_network-metrics-daemon-8nn5z: Claiming 0a:58:92:02:16:1f fd01:0:0:4::3

so it's confused from the beginning...

Comment 1 zhaozhanqi 2021-09-02 03:27:01 UTC
we also hit this issue on ipv4 cluster. 


$ oc get pod -n openshift-network-diagnostics -o wide
NAME                                    READY   STATUS              RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
network-check-source-79dbb9d7bf-mc6qh   1/1     Running             0          10h   10.128.2.13   ip-10-0-219-200.us-east-2.compute.internal   <none>           <none>
network-check-target-4wk49              1/1     Running             0          10h   10.129.2.4    ip-10-0-163-213.us-east-2.compute.internal   <none>           <none>
network-check-target-68bgx              1/1     Running             0          10h   10.128.0.4    ip-10-0-175-19.us-east-2.compute.internal    <none>           <none>
network-check-target-9d8s4              1/1     Running             0          10h   10.130.0.4    ip-10-0-153-43.us-east-2.compute.internal    <none>           <none>
network-check-target-jsnpx              0/1     ContainerCreating   0          10h   <none>        ip-10-0-146-178.us-east-2.compute.internal   <none>           <none>
network-check-target-pwsfs              1/1     Running             0          10h   10.128.2.4    ip-10-0-219-200.us-east-2.compute.internal   <none>           <none>
network-check-target-zpnfg              1/1     Running             0          10h   10.129.0.4    ip-10-0-210-23.us-east-2.compute.internal    <none>           <none>

$ oc describe pod network-check-target-jsnpx -n openshift-network-diagnostics
Name:                 network-check-target-jsnpx
Namespace:            openshift-network-diagnostics
Priority:             1000000000
Priority Class Name:  openshift-user-critical
Node:                 ip-10-0-146-178.us-east-2.compute.internal/10.0.146.178
Start Time:           Thu, 02 Sep 2021 00:49:29 +0800
Labels:               app=network-check-target
                      controller-revision-hash=55cd96fd4f
                      kubernetes.io/os=linux
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.131.0.4/23"],"mac_address":"0a:58:0a:83:00:04","gateway_ips":["10.131.0.1"],"ip_address":"10.131.0.4/23","...
                      openshift.io/scc: restricted
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        DaemonSet/network-check-target
Containers:
  network-check-target-container:
    Container ID:  
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:159bd605c612c8eee47fa1538420d3161e5b26baae02e63b10750f30da758818
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      cluster-network-check-target
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      10m
      memory:   15Mi
    Readiness:  http-get http://:8080/ delay=30s timeout=10s period=10s #success=1 #failure=3
    Environment:
      K8S_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rkzsh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-rkzsh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              beta.kubernetes.io/os=linux
Tolerations:                 op=Exists
Events:
  Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  4m17s (x501 over 10h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_network-check-target-jsnpx_openshift-network-diagnostics_7edf4ac7-239f-44c8-8761-7e77ad1ab0f4_0(c3533e05e600600ea0ca59ca5df4f30452d9f3f0f7e91186287cfea2b202fb13): error adding pod openshift-network-diagnostics_network-check-target-jsnpx to CNI network "multus-cni-network": [openshift-network-diagnostics/network-check-target-jsnpx:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-network-diagnostics/network-check-target-jsnpx c3533e05e600600ea0ca59ca5df4f30452d9f3f0f7e91186287cfea2b202fb13] [openshift-network-diagnostics/network-check-target-jsnpx c3533e05e600600ea0ca59ca5df4f30452d9f3f0f7e91186287cfea2b202fb13] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:00:04 [10.131.0.4/23]


from above the pod is binding to 0a:58:0a:83:00:04 [10.131.0.4/23].  However from the ovn-controller logs. it is binding Claiming 0a:58:0a:83:00:03 10.131.0.3

ovnkube-node-nd844-ovn-controller.log:2021-09-02T00:37:19Z|02417|binding|INFO|openshift-network-diagnostics_network-check-target-jsnpx: Claiming 0a:58:0a:83:00:03 10.131.0.3

Comment 2 zhaozhanqi 2021-09-02 03:32:39 UTC
must-gather logs in : http://file.apac.redhat.com/~zzhao/must-gather.local.203549028165079256.tar.gz

Comment 3 Martin Kennelly 2021-09-22 10:06:06 UTC
*** Bug 2003558 has been marked as a duplicate of this bug. ***

Comment 4 Tim Rozet 2021-09-22 13:29:57 UTC
If we look at the logs on ovnkube-master:

2021-08-23T19:02:07.677909398Z I0823 19:02:07.677878       1 kube.go:67] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["fd01:0:0:4::3/64"],"mac_address":"0a:58:92:02:16:1f","gateway_ips":["fd01:0:0:4::1"],"ip_address":"fd01:0:0:4::3/64","gateway_ip":"fd01:0:0:4::1"}}] on pod openshift-multus/network-metrics-daemon-8nn5z
2021-08-23T19:02:07.719510291Z I0823 19:02:07.719472       1 ovn.go:458] [b463fd6d-b041-478e-867a-c3c92244f8bf/openshift-multus/network-metrics-daemon-8nn5z] retry pod setup
2021-08-23T19:02:07.720064280Z I0823 19:02:07.720044       1 pods.go:251] [openshift-multus/network-metrics-daemon-8nn5z] addLogicalPort took 66.78163ms
2021-08-23T19:02:07.775485278Z I0823 19:02:07.775415       1 kube.go:67] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["fd01:0:0:4::4/64"],"mac_address":"0a:58:50:2d:e9:e7","gateway_ips":["fd01:0:0:4::1"],"ip_address":"fd01:0:0:4::4/64","gateway_ip":"fd01:0:0:4::1"}}] on pod openshift-multus/network-metrics-daemon-8nn5z
2021-08-23T19:02:07.815623813Z I0823 19:02:07.810948       1 pods.go:251] [openshift-multus/network-metrics-daemon-8nn5z] addLogicalPort took 91.446109ms
2021-08-23T19:02:07.815623813Z E0823 19:02:07.810975       1 ovn.go:555] error while creating logical port openshift-multus_network-metrics-daemon-8nn5z error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (openshift-multus_network-metrics-daemon-8nn5z) for index on column "name".  First row, with UUID 356acf71-787c-4b49-a048-2c69b9a1ff63, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, with UUID 1d285325-fbeb-47ba-8338-bafbb2a49276, was inserted by this transaction. in committing transaction
2021-08-23T19:02:07.815623813Z I0823 19:02:07.811029       1 ovn.go:464] [b463fd6d-b041-478e-867a-c3c92244f8bf/openshift-multus/network-metrics-daemon-8nn5z] setup retry failed; will try again later
2021-08-23T19:02:07.815623813Z I0823 19:02:07.811249       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-multus", Name:"network-metrics-daemon-8nn5z", UID:"b463fd6d-b041-478e-867a-c3c92244f8bf", APIVersion:"v1", ResourceVersion:"26503", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' error while creating logical port openshift-multus_network-metrics-daemon-8nn5z error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (openshift-multus_network-metrics-daemon-8nn5z) for index on column "name".  First row, with UUID 356acf71-787c-4b49-a048-2c69b9a1ff63, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, with UUID 1d285325-fbeb-47ba-8338-bafbb2a49276, was inserted by this transaction. in committing transaction

You are most likely hitting either/both of:
https://bugzilla.redhat.com/show_bug.cgi?id=2004340
https://bugzilla.redhat.com/show_bug.cgi?id=1998614

Comment 5 Tim Rozet 2021-09-22 13:41:45 UTC
Looks like this job has been mostly passing now for the last 4-5 days:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-ovn-ipv6

Dan, are we OK to close this bug?

Comment 6 Riccardo Ravaioli 2021-09-22 14:44:52 UTC
*** Bug 2006751 has been marked as a duplicate of this bug. ***

Comment 7 Dan Winship 2021-09-22 15:44:35 UTC
Yeah, if you think it's fixed, then go ahead and close it. (This is just a recurring failure I noticed in the CI logs. I don't have any extra insight about it, or any way of reproducing it.)

Comment 8 zhaozhanqi 2021-09-23 01:27:29 UTC
@Tim @Dan

I reproduced this issue on the latest 4.9 build 4.9.0-0.nightly-2021-09-21-215600

see https://bugzilla.redhat.com/show_bug.cgi?id=2006751

Comment 9 Surya Seetharaman 2021-09-23 10:03:29 UTC
@zhaozhanqi: https://bugzilla.redhat.com/show_bug.cgi?id=2006751 is not a bug actually. When you see the message "Reconnecting...Transaction Failed due to an error: constraint violation details:", and then subsequent logs show the pod creations happening successfully, its all good. https://bugzilla.redhat.com/show_bug.cgi?id=1998614 actually fixes the transaction error bug and the fix forces a reconnect from ovn-k to nbdb to solve it. If the transaction error persists that's when its a bug. Basically its what Tim said above.

Closing this as dupe of 1998614 and 2004340.

Please reopen in case you see pods are stuck and not getting created or something along those lines.

Cheers,
Surya.

*** This bug has been marked as a duplicate of bug 1998614 ***

Comment 10 Riccardo Ravaioli 2022-01-24 17:33:15 UTC
*** Bug 2006751 has been marked as a duplicate of this bug. ***