Bug 2002748 - [ICNI2] 'ErrorAddingLogicalPort' failed to handle external GW check: timeout waiting for namespace event [NEEDINFO]
Summary: [ICNI2] 'ErrorAddingLogicalPort' failed to handle external GW check: timeout ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Tim Rozet
QA Contact: Yurii Prokulevych
URL:
Whiteboard:
Depends On:
Blocks: 2005476
 
Reported: 2021-09-09 15:38 UTC by Jose Castillo Lema
Modified: 2022-12-21 22:02 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2005476
Environment:
Last Closed: 2022-03-12 04:38:07 UTC
Target Upstream Version:
Embargoed:
trozet: needinfo?


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:38:28 UTC

Description Jose Castillo Lema 2021-09-09 15:38:11 UTC
Description of problem:
After deleting SPK pods and namespaces, the cluster is left in a state where it is not possible to create new pods.

Version-Release number of selected component (if applicable):
OCP 4.7.28

How reproducible:
100%

Steps to Reproduce:
1. Create SPK pods and namespaces (in our case, 35 namespaces with 3 SPK pods each)
2. Delete SPK namespaces and wait for them to complete termination
3. Try to create new pods
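A rough shell sketch of steps 2 and 3 (the namespace names here are hypothetical, and any simple test pod will do):

  # delete the SPK namespaces and wait until termination finishes
  oc delete ns spk-ns-{1..35}
  watch 'oc get ns | grep spk-ns'

  # once they are gone, try creating a new test pod
  oc run app --image=registry.access.redhat.com/ubi8/ubi -- sleep infinity
  oc get pod app -w    # with the bug, the pod stays in ContainerCreating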

Actual results:
New pods will remain in ContainerCreating forever

Expected results:
New pods get to Running state

Additional info:
Lots of these messages in the logs:
I0909 14:15:26.542039       1 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"serving-ns-3", Name:"pod-serving-3-2-serving-job", UID:"49e0c8bc-9ee1-4ac6-8195-38797bc98bf7", APIVersion:"v1", ResourceVersion:"15681423", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' failed to handle external GW check: timeout waiting for namespace event

It was not possible to run must-gather in the cluster as the must-gather pods never get to Running state.

Example of a pod left in ContainerCreating forever:
[kni@e24-h01-000-r640 icni2]$ oc get po
NAME   READY   STATUS              RESTARTS   AGE
app    0/1     ContainerCreating   0          79m

Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               79m   default-scheduler  Successfully assigned spk-app/app to worker000-fc640
  Warning  FailedCreatePodSandBox  78m   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(4733a3d2b5e2e0edef4dc800e6090209d836c42d3fe56f92e5e47d8e61c0f9e8): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app 4733a3d2b5e2e0edef4dc800e6090209d836c42d3fe56f92e5e47d8e61c0f9e8] [spk-app/app 4733a3d2b5e2e0edef4dc800e6090209d836c42d3fe56f92e5e47d8e61c0f9e8] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  77m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(ed13c9ce8fbeaf23c7b91088e66479b6debeceb1b605e355a74acf1d0b15f604): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app ed13c9ce8fbeaf23c7b91088e66479b6debeceb1b605e355a74acf1d0b15f604] [spk-app/app ed13c9ce8fbeaf23c7b91088e66479b6debeceb1b605e355a74acf1d0b15f604] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  77m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(e93d516d3e7ffbec4ef5ec28fd69cda585e7c6014a0890a56775109db0ed8cc7): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app e93d516d3e7ffbec4ef5ec28fd69cda585e7c6014a0890a56775109db0ed8cc7] [spk-app/app e93d516d3e7ffbec4ef5ec28fd69cda585e7c6014a0890a56775109db0ed8cc7] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  76m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(d4a2aee05462aa415a09f4515b50cc780d493c467c8b5c9ccc6a8c1fac0edaf6): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app d4a2aee05462aa415a09f4515b50cc780d493c467c8b5c9ccc6a8c1fac0edaf6] [spk-app/app d4a2aee05462aa415a09f4515b50cc780d493c467c8b5c9ccc6a8c1fac0edaf6] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  75m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(858ecbbeb041bf189b1de54eece6fb29c875a7628e7213878f27ee4fe1685b7b): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app 858ecbbeb041bf189b1de54eece6fb29c875a7628e7213878f27ee4fe1685b7b] [spk-app/app 858ecbbeb041bf189b1de54eece6fb29c875a7628e7213878f27ee4fe1685b7b] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  75m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(822631c393bea069044c7e7fae972618fe1f3f2a9dc4405a981e417bf2a6839e): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app 822631c393bea069044c7e7fae972618fe1f3f2a9dc4405a981e417bf2a6839e] [spk-app/app 822631c393bea069044c7e7fae972618fe1f3f2a9dc4405a981e417bf2a6839e] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  74m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(2c49ce1afe3c938b070cf3eb741c6d09802b06cf8f2d72e70b59ca11e0828391): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app 2c49ce1afe3c938b070cf3eb741c6d09802b06cf8f2d72e70b59ca11e0828391] [spk-app/app 2c49ce1afe3c938b070cf3eb741c6d09802b06cf8f2d72e70b59ca11e0828391] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  73m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(24fa726977d4008c0f19a0943df385ca292fb32fcfc2162567fea4e8d19ec1d5): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app 24fa726977d4008c0f19a0943df385ca292fb32fcfc2162567fea4e8d19ec1d5] [spk-app/app 24fa726977d4008c0f19a0943df385ca292fb32fcfc2162567fea4e8d19ec1d5] timed out waiting for annotations
'
  Warning  FailedCreatePodSandBox  72m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(71f93bd49e0eb03fe3ccd74dd513481aaf2e780c40241659dda688430d183405): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app 71f93bd49e0eb03fe3ccd74dd513481aaf2e780c40241659dda688430d183405] [spk-app/app 71f93bd49e0eb03fe3ccd74dd513481aaf2e780c40241659dda688430d183405] timed out waiting for annotations
'
  Warning  ErrorAddingLogicalPort  25m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID e8905f64-57a9-4fb9-bbf3-3715317ed564, was inserted by this transaction.  Second row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
  Warning  ErrorAddingLogicalPort  25m                    controlplane  failed to get the logical switch port: spk-app_app from the ovn client, error: object not found
  Warning  ErrorAddingLogicalPort  24m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, with UUID 0f14a4e2-026d-42d5-a398-d3086daccfce, was inserted by this transaction. in committing transaction
  Warning  ErrorAddingLogicalPort  24m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, with UUID 0b58770b-5b92-4800-a7bf-7c4b44117c3c, was inserted by this transaction. in committing transaction
  Warning  ErrorAddingLogicalPort  23m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 5e2ba284-05ee-42ff-b2ef-02012b9a6701, was inserted by this transaction.  Second row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
  Warning  ErrorAddingLogicalPort  22m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 4c2f1e02-c7e4-4ea1-af3d-62bd4078d79b, was inserted by this transaction.  Second row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
  Warning  ErrorAddingLogicalPort  21m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 92cc692d-6abd-4cac-b9ec-7e1d52246304, was inserted by this transaction.  Second row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
  Warning  ErrorAddingLogicalPort  20m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID b5f4f6e0-c269-4f0f-8f1d-38d7cdf43df3, was inserted by this transaction.  Second row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
  Warning  ErrorAddingLogicalPort  19m                    controlplane  error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index.  Second row, with UUID 15a0ad0b-c96c-4196-b79b-a3a909982fa3, was inserted by this transaction. in committing transaction
  Warning  FailedCreatePodSandBox  3m20s (x124 over 72m)  kubelet       (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_app_spk-app_ec5b560f-ceee-4745-9899-6b49e139178c_0(ded4476c3c712e80172aa157b3b432f941ba34aa4ae21e3d0dd45dfe0417322b): error adding pod spk-app_app to CNI network "multus-cni-network": [spk-app/app:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[spk-app/app ded4476c3c712e80172aa157b3b432f941ba34aa4ae21e3d0dd45dfe0417322b] [spk-app/app ded4476c3c712e80172aa157b3b432f941ba34aa4ae21e3d0dd45dfe0417322b] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
'
  Warning  ErrorAddingLogicalPort  35s (x19 over 18m)  controlplane  (combined from similar events): error while creating logical port spk-app_app error: Transaction Failed due to an error: constraint violation details: Transaction causes multiple rows in "Logical_Switch_Port" table to have identical values (spk-app_app) for index on column "name".  First row, with UUID 365c1fa0-3320-4b2e-9162-be237a3149b0, was inserted by this transaction.  Second row, with UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852, existed in the database before this transaction, which modified some of the row's columns but not any columns in this index. in committing transaction
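
The stale row that the transaction errors keep pointing at (UUID 1988e6a9-f757-4aff-9fcb-fe5c856bc852) can be checked directly in the NB database, for example (a diagnostic sketch only; the ovnkube-master pod name is a placeholder):

  oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c nbdb -- \
    ovn-nbctl --columns=_uuid,name find Logical_Switch_Port name=spk-app_app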

Comment 4 Tim Rozet 2021-09-13 20:18:59 UTC
The "timed out waiting for annotations" and "timed out waiting for namespace locked" errors will both be fixed by: https://bugzilla.redhat.com/show_bug.cgi?id=1959352

There is also another error:
E0908 20:59:44.673215       1 ovn.go:635] failed to get the logical switch port: openshift-marketplace_redhat-operators-4skg2 from the ovn client, error: object not found

This indicates that the cache is broken on the ovnkube-node side and you may be hitting: https://bugzilla.redhat.com/show_bug.cgi?id=2001364

We can check the monitors to be sure on a debug cluster.

Comment 5 Tim Rozet 2021-09-15 14:48:17 UTC
Further investigation shows the root cause of this is a backup in both the pod handlers and the pod retry loop due to waiting for the namespace. I believe the fixes from:
https://bugzilla.redhat.com/show_bug.cgi?id=1959352

which include getting rid of the wait for namespace and using an RWMutex on the namespace:
https://github.com/openshift/ovn-kubernetes/pull/609/commits/ea5a757d3eb4d4ea1ec70716abbf22fe6841456c
https://github.com/openshift/ovn-kubernetes/pull/667/commits/98e02b20432e40a661a1fb2c787c118e8ce611e8

and the fix from:
https://bugzilla.redhat.com/show_bug.cgi?id=2002010
https://github.com/openshift/ovn-kubernetes/pull/739/commits/f5ec5669f072559972127cb7c89034f423852eac

will address this. Let's keep this BZ open until those 3 are in 4.7z and then re-test this issue.

Comment 6 Tim Rozet 2021-09-17 13:46:22 UTC
Anurag/Jose, the relevant fixes are now all present in 4.10. Can you please verify this issue doesn't happen in 4.10? I'll start backports to other versions in the meantime.

Comment 9 Tim Rozet 2021-09-20 14:21:08 UTC
I have an easy way to reproduce this without a large-scale setup. Create a deployment with 100 external gateway instances referring to a routing-namespace that does not exist. Example:
https://gist.githubusercontent.com/trozet/c86eb4e149c270a8b4aa9d7a4d19ccb2/raw/9bfd7ead57ebb7a193aa5f38b40d0644ea6f1141/exgw_deployment.yaml
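
The gist itself is not reproduced here; a minimal sketch of what such a deployment can look like (assuming the standard k8s.ovn.org/routing-namespaces external gateway pod annotation) is:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: exgw-deployment
spec:
  replicas: 100
  selector:
    matchLabels:
      app: exgw
  template:
    metadata:
      labels:
        app: exgw
      annotations:
        k8s.ovn.org/routing-namespaces: doesnotexist   # namespace intentionally does not exist
    spec:
      containers:
      - name: exgw
        image: docker.io/centos/tools:latest
        command:
          - /sbin/init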

Right after creating this, create any regular pod:
---
apiVersion: v1
kind: Pod
metadata:
  name: trozet1
  labels:
    pod-name: client
    role: webserver
spec:
  #hostNetwork: true
  containers:
  - name: pod1
    image: docker.io/centos/tools:latest
    command:
      - /sbin/init
    imagePullPolicy: IfNotPresent


Now, with the bug, in the ovnkube-master logs you will see each of the first 100 pods wait 10 seconds for the namespace "doesnotexist" and then time out:
I0920 14:10:37.186435      55 egressgw.go:38] External gateway pod: exgw-deployment-7ffb784899-b2jb5, detected for namespace(s) doesnotexist
I0920 14:10:47.189752      55 pods.go:302] [default/exgw-deployment-7ffb784899-b2jb5] addLogicalPort took 10.145347085s
E0920 14:10:47.189829      55 ovn.go:656] failed to handle external GW check: timeout waiting for namespace event


The regular pod (trozet1), created after the other 100 pods, will be severely delayed in coming up due to the timeouts of all the previous pods:
kubectl get pod trozet1 -o yaml
...

  - lastProbeTime: null
    lastTransitionTime: "2021-09-20T14:18:11Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-20T14:05:07Z"
    status: "True"
    type: PodScheduled


^13 minutes of delay. With the fix there should be no "timed out waiting for namespace" messages in the logs, and the 101st pod should come up quickly. Also, the 101st pod (trozet1) should have a log line in ovnkube-master like:
I0920 14:13:10.942832      55 pods.go:302] [default/trozet1] addLogicalPort took 1.132425375s

^This should show up within seconds of the pod being scheduled.
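
A quick way to check both from the logs (the ovnkube-master pod name is a placeholder):

  # should return no hits with the fix
  oc -n openshift-ovn-kubernetes logs <ovnkube-master-pod> -c ovnkube-master | grep 'timeout waiting for namespace event'

  # the 101st pod's addLogicalPort should complete within a few seconds
  oc -n openshift-ovn-kubernetes logs <ovnkube-master-pod> -c ovnkube-master | grep 'default/trozet1.*addLogicalPort took'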

Comment 14 errata-xmlrpc 2022-03-12 04:38:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

