Bug 2077082 - ovnkube cleanup: Stale ports left over in nbdb after pods are deleted leading to MAC/IP conflicts
Status: CLOSED DUPLICATE of bug 2052017
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Surya Seetharaman
QA Contact: Anurag saxena
Whiteboard: ovn-perfscale
Depends On:
Reported: 2022-04-20 16:24 UTC by Sai Sindhur Malleni
Modified: 2022-05-01 12:04 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-05-01 12:04:24 UTC
Target Upstream Version:

Links:
System: Github
ID: openshift ovn-kubernetes pull 1025
Private: 0
Priority: None
Status: Merged
Summary: [release-4.10] Bug 2052017: Retry Pod Deletions on Failure
Last Updated: 2022-04-27 17:47:06 UTC

Description Sai Sindhur Malleni 2022-04-20 16:24:25 UTC
Description of problem:
I'm trying to run a node-density-cni test using the e2e-benchmarking scripts (https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/run.sh#L35).
The test tries to create 245 pods per node on a cluster with 27 worker nodes.
The test creates a single namespace with a number of applications equal to job_iterations. Each application consists of two deployments (a node.js webserver and a simple client that curls the webserver) and a service that the client uses to reach the webserver. Each iteration of this workload creates the following objects:

1 deployment holding a node.js webserver

1 deployment holding a client application for curling the webserver

1 service pointing to the webserver

The startupProbe of the client pod depends on being able to reach the webserver so that the PodReady latencies collected by kube-burner reflect network connectivity.

On every attempt, most pods become ready fairly quickly but the test doesn't complete because 1-2 curl pods never have their startupProbe succeed.

On debugging, it looks like the pods that never became ready have the same MAC/IP as another pod that was deleted a long time ago but still has a port entry on that logical switch in the nbdb.

Version-Release number of selected component (if applicable):

How reproducible:
90%; reproduced it at least 9 times out of 10

Steps to Reproduce:
1. Run a node-density or node-density-cni test and clean up the namespace
2. Run the node-density-cni workload
3. Test doesn't complete since 1-2 pods never go into ready.

Actual results:

The test doesn't complete as 1-2 pods never go into ready

Expected results:
All pods must go into ready even if it takes a long time

Additional info:
The ports are present only in the nbdb; OVS on the node doesn't have the deleted pod's interface.
[smalleni@localhost dittybopper]$ oc rsh -c nbdb ovnkube-master-bgtc4
sh-4.4# ovn-nbctl --no-leader-only show 76a44337-777b-47bd-93f2-0ffbfd9f8ed9 | grep
        addresses: ["0a:58:0a:80:0f:bb"]
        addresses: ["0a:58:0a:80:0f:bb"]
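
A check like the one above could be scripted so that every MAC claimed by more than one port on a switch is listed. The following is a minimal sketch, assuming the text has the general shape of `ovn-nbctl show` output ("port" lines followed by "addresses:" lines). The switch name and port names in the sample are made up for illustration; only the switch UUID and the duplicated MAC are taken from this report.

```python
import re
from collections import defaultdict

# Hypothetical sample in the general shape of `ovn-nbctl show` output.
# The switch name and port names are invented for illustration.
SAMPLE = """\
switch 76a44337-777b-47bd-93f2-0ffbfd9f8ed9 (worker-example)
    port ns-old_curl-old
        addresses: ["0a:58:0a:80:0f:bb"]
    port ns-new_curl-new
        addresses: ["0a:58:0a:80:0f:bb"]
    port ns-new_webserver
        addresses: ["0a:58:0a:80:0f:bc"]
"""

# Six colon-separated hex octets.
MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def find_duplicate_macs(show_output):
    """Return {mac: [port names]} for every MAC used by more than one port."""
    ports_by_mac = defaultdict(list)
    current_port = None
    for line in show_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("port "):
            # Remember which port the following "addresses:" line belongs to.
            current_port = stripped.split(None, 1)[1]
        elif stripped.startswith("addresses:") and current_port is not None:
            for mac in MAC_RE.findall(stripped):
                ports_by_mac[mac.lower()].append(current_port)
    return {mac: ports for mac, ports in ports_by_mac.items() if len(ports) > 1}


duplicates = find_duplicate_macs(SAMPLE)
for mac, ports in duplicates.items():
    print(mac, "is claimed by:", ", ".join(ports))
```

On a real cluster the text would come from the nbdb container rather than a hard-coded string, and a stale port would still need to be confirmed by checking that its pod no longer exists.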

Comment 1 Sai Sindhur Malleni 2022-04-20 16:27:58 UTC
Created attachment 1873844

Comment 3 Sai Sindhur Malleni 2022-04-27 17:46:24 UTC
Tried to reproduce this and finally found the log message for the LSP delete. There appears to be no retry:

56493520+00:00 stderr F I0426 19:33:44.856458       1 pods.go:147] Deleting pod: 9831b83e-6e53-478f-b72f-76c853953b36/curl-1-1569-5f76fcc585-d48cl
ovnkube-master-9rnmt_openshift-ovn-kubernetes_ovnkube-master-a8e7594cace281082482ceb9c193c8891d1e476d2b1e54a060b59363fbd5a4db.log:101937:2022-04-26T19:34:04.654764228+00:00 stderr F E0426 19:34:04.654745       1 pods.go:192] Cannot delete logical switch port 9831b83e-6e53-478f-b72f-76c853953b36_curl-1-1569-5f76fcc585-d48cl, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[]}}] Timeout:<nil> Where:[where column _uuid == {c6eeb9cb-61a7-42c6-9df8-2bb3012ea669}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:236781ff-3229-4515-aabb-c00c91a20d3e}]}}] Timeout:<nil> Where:[where column _uuid == {b4869398-1b4f-4446-bd29-40d998281c1b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {236781ff-3229-4515-aabb-c00c91a20d3e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded: while awaiting reconnection
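
The general idea of retrying a failed delete instead of dropping it can be sketched as follows. This is an illustration only, not the ovn-kubernetes code or the change in the linked pull request: `TransientError`, `retry_failed_deletes`, `flaky_delete`, and the port name are all hypothetical, and the fake backend simply fails twice before succeeding.

```python
import time


class TransientError(Exception):
    """Stand-in for a failure such as 'context deadline exceeded'."""


def retry_failed_deletes(pending, delete_port, max_attempts=5, base_delay=0.01):
    """Try to delete every port in `pending`, retrying failures with backoff.

    Returns the set of ports that still could not be deleted.
    """
    remaining = set(pending)
    for attempt in range(max_attempts):
        failed = set()
        for port in sorted(remaining):
            try:
                delete_port(port)
            except TransientError:
                failed.add(port)  # keep it queued rather than dropping it
        remaining = failed
        if not remaining:
            break
        # Exponential backoff before the next pass.
        time.sleep(base_delay * (2 ** attempt))
    return remaining


# Hypothetical fake backend: the first two calls fail, later calls succeed.
calls = {"count": 0}
deleted = []


def flaky_delete(port):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise TransientError("context deadline exceeded")
    deleted.append(port)


left_over = retry_failed_deletes(["ns_curl-a"], flaky_delete)
print("deleted:", deleted, "left over:", left_over)
```

Without the retry pass, the single failed transaction would leave the logical switch port in the nbdb, which matches the stale-port symptom described in this bug.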

Comment 5 Surya Seetharaman 2022-05-01 12:04:24 UTC

*** This bug has been marked as a duplicate of bug 2052017 ***
