2077082 – ovnkube cleanup: Stale ports left over in nbdb after pods are deleted leading to MAC/IP conflicts

Keywords:
Status:	CLOSED DUPLICATE of bug 2052017
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.10
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Surya Seetharaman
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:	ovn-perfscale
Depends On:
Blocks:

Reported:	2022-04-20 16:24 UTC by Sai Sindhur Malleni
Modified:	2022-05-01 12:04 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-05-01 12:04:24 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ovn-kubernetes pull 1025	0	None	Merged	[release-4.10] Bug 2052017: Retry Pod Deletions on Failure	2022-04-27 17:47:06 UTC

Description Sai Sindhur Malleni 2022-04-20 16:24:25 UTC

Description of problem:
I'm trying to run a node-density-cni test using the e2e-benchmarking scripts (https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner/run.sh#L35)
The test tries to create 245 pods/node on a 27 worker node cluster.
The test creates a single namespace containing a number of applications equal to job_iterations. Each application consists of two deployments (a node.js webserver and a simple client that curls the webserver) and a service that the client uses to reach the webserver. Each iteration of this workload creates the following objects:

1 deployment holding a node.js webserver

1 deployment holding a client application for curling the webserver

1 service pointing to the webserver

The startupProbe of the client pod depends on being able to reach the webserver so that the PodReady latencies collected by kube-burner reflect network connectivity.

On every attempt, most pods become ready fairly quickly but the test doesn't complete because 1-2 curl pods never have their startupProbe succeed.

On debugging, it looks like the pods that never became ready have the same MAC/IP as another pod that was deleted a long time ago but still has a port entry for that logical switch in the nbdb.
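
For readers checking such a conflict: the MAC reported under Additional info, 0a:58:0a:80:0f:bb, looks like the pod IP 10.128.15.187 written in hex behind a 0a:58 prefix, so a reused IP would imply a reused MAC. A minimal sketch of that mapping, assuming this prefix convention holds for IPv4 pod addresses (the function name is illustrative, not an ovn-kubernetes API):

```python
import ipaddress


def mac_from_ipv4(ip: str) -> str:
    """Return 0a:58 followed by the four IPv4 octets in hex (assumed convention)."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{b:02x}" for b in b"\x0a\x58" + octets)


print(mac_from_ipv4("10.128.15.187"))  # 0a:58:0a:80:0f:bb
```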

Version-Release number of selected component (if applicable):
4.10.6

How reproducible:
90%; reproduced it at least 9 times out of 10

Steps to Reproduce:
1. Run a node-density or node-density-cni test and clean up the namespace
2. Run the node-density-cni workload
3. The test doesn't complete because 1-2 pods never become ready.

Actual results:

The test doesn't complete because 1-2 pods never become ready.

Expected results:
All pods must go into ready even if it takes a long time

Additional info:
The stale ports are present only in the nbdb; OVS on the node doesn't have the deleted pod's interface.
[smalleni@localhost dittybopper]$ oc rsh -c nbdb ovnkube-master-bgtc4
sh-4.4# ovn-nbctl --no-leader-only show 76a44337-777b-47bd-93f2-0ffbfd9f8ed9 | grep 10.128.15.187
addresses: ["0a:58:0a:80:0f:bb 10.128.15.187"]
addresses: ["0a:58:0a:80:0f:bb 10.128.15.187"]
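
The grep above checks one known IP. To look for every conflicting address on a switch, the full `ovn-nbctl show` output could be parsed instead. This is a rough sketch, assuming the usual layout of a `port <name>` line followed by an `addresses: [...]` line; the switch name and port names in the sample are made up for illustration:

```python
import re
from collections import defaultdict
from typing import Dict, List

PORT_RE = re.compile(r"^\s*port\s+(\S+)")
ADDR_RE = re.compile(r"^\s*addresses:\s*\[(.*)\]")


def find_duplicate_addresses(show_output: str) -> Dict[str, List[str]]:
    """Map each address string to its ports, keeping only addresses on 2+ ports."""
    ports_by_addr = defaultdict(list)
    current_port = None
    for line in show_output.splitlines():
        m = PORT_RE.match(line)
        if m:
            current_port = m.group(1)
            continue
        m = ADDR_RE.match(line)
        if m and current_port is not None:
            # Each quoted entry is one "MAC IP" string.
            for addr in re.findall(r'"([^"]+)"', m.group(1)):
                ports_by_addr[addr].append(current_port)
    return {a: p for a, p in ports_by_addr.items() if len(p) > 1}


# Hypothetical sample: only the switch UUID and the duplicated address come from this report.
SAMPLE = """\
switch 76a44337-777b-47bd-93f2-0ffbfd9f8ed9 (example-node)
    port ns_stale-pod
        addresses: ["0a:58:0a:80:0f:bb 10.128.15.187"]
    port ns_new-pod
        addresses: ["0a:58:0a:80:0f:bb 10.128.15.187"]
    port ns_other-pod
        addresses: ["0a:58:0a:80:0f:bc 10.128.15.188"]
"""

for addr, ports in find_duplicate_addresses(SAMPLE).items():
    print(addr, ports)
```

In practice the input would be the saved output of `ovn-nbctl --no-leader-only show <switch>` rather than the inline sample.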

Comment 1 Sai Sindhur Malleni 2022-04-20 16:27:58 UTC

Created attachment 1873844 [details]
nbdb

Comment 3 Sai Sindhur Malleni 2022-04-27 17:46:24 UTC

Tried to reproduce this and finally found the log message for the LSP delete. There appears to be no retry:

56493520+00:00 stderr F I0426 19:33:44.856458       1 pods.go:147] Deleting pod: 9831b83e-6e53-478f-b72f-76c853953b36/curl-1-1569-5f76fcc585-d48cl
ovnkube-master-9rnmt_openshift-ovn-kubernetes_ovnkube-master-a8e7594cace281082482ceb9c193c8891d1e476d2b1e54a060b59363fbd5a4db.log:101937:2022-04-26T19:34:04.654764228+00:00 stderr F E0426 19:34:04.654745       1 pods.go:192] Cannot delete logical switch port 9831b83e-6e53-478f-b72f-76c853953b36_curl-1-1569-5f76fcc585-d48cl, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.128.2.133]}}] Timeout:<nil> Where:[where column _uuid == {c6eeb9cb-61a7-42c6-9df8-2bb3012ea669}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:236781ff-3229-4515-aabb-c00c91a20d3e}]}}] Timeout:<nil> Where:[where column _uuid == {b4869398-1b4f-4446-bd29-40d998281c1b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {236781ff-3229-4515-aabb-c00c91a20d3e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded: while awaiting reconnection
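
The log above shows one failed transaction and no later attempt, which would leave the port behind in the nbdb. Below is a minimal sketch of the kind of retry that would avoid that, written in Python for brevity. It is not the ovn-kubernetes implementation (which is in Go); `delete_port`, the attempt count, and the delays are hypothetical:

```python
import time
from typing import Callable


def delete_with_retry(
    delete_port: Callable[[], None],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call delete_port until it succeeds or attempts run out; return True on success."""
    for attempt in range(attempts):
        try:
            delete_port()
            return True
        except TimeoutError as err:
            print(f"attempt {attempt + 1} failed: {err}")
            if attempt + 1 < attempts:
                # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** attempt)
    return False


# Demo with a stand-in delete that times out twice before succeeding.
calls = {"n": 0}


def flaky_delete() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("context deadline exceeded")


print(delete_with_retry(flaky_delete, sleep=lambda s: None))
```

The `sleep` parameter is injectable only so the demo and tests run without waiting; a real caller would leave it at the default.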

Comment 5 Surya Seetharaman 2022-05-01 12:04:24 UTC


*** This bug has been marked as a duplicate of bug 2052017 ***
