Bug 1753649 - 3.11: Recreating recently deleted NodePort service results in 'port is already allocated' error
Summary: 3.11: Recreating recently deleted NodePort service results in 'port is already allocated' error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 3.11.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On: 1848379
Blocks:
 
Reported: 2019-09-19 14:10 UTC by Matthew Robson
Modified: 2023-12-15 16:46 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1848379 (view as bug list)
Environment:
Last Closed: 2020-07-27 13:49:10 UTC
Target Upstream Version:
Embargoed:




Links:
  Github openshift/origin pull 25151 (closed): Bug 1753649: UPSTREAM: 89937: portAllocator sync local data before allocate (last updated 2020-11-17 08:32:01 UTC)
  Red Hat Product Errata RHBA-2020:2990 (last updated 2020-07-27 13:49:23 UTC)

Description Matthew Robson 2019-09-19 14:10:32 UTC
Description of problem:

Typically through automation, when a project is deleted and recreated, or a specific service is deleted and recreated, we see an error that the NodePort is already in use even though the service has been deleted and the port no longer exists in iptables.

Thu 19 Sep 2019 09:24:27 EDT
service "nodeport-test" deleted
The Service "nodeport-test" is invalid: spec.ports[0].nodePort: Invalid value: 30006: provided port is already allocated
Thu 19 Sep 2019 09:24:30 EDT
The Service "nodeport-test" is invalid: spec.ports[0].nodePort: Invalid value: 30006: provided port is already allocated
Thu 19 Sep 2019 09:24:32 EDT
service/nodeport-test created

Depending on the environment (possibly load related), you may see no delay at all, or a significant delay of several minutes.

It looks like cleaning up the service's allocated resources is handled asynchronously via rs.releaseAllocatedResources(svc). This is probably fine for ClusterIPs, but for something that is usually assigned statically, like a NodePort, it leads to issues like this.

https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/service/storage/rest.go#L267

How long it takes for the Release to be committed to etcd determines how long after service deletion the port can actually be reused:

https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/service/allocator/storage/storage.go#L143
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/service/allocator/storage/storage.go#L163
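
For context, the upstream fix eventually linked from this bug (PR 89937, "portAllocator sync local data before allocate") targets exactly this window: when the allocator's in-memory data says a port is taken, it refreshes that data from the backing store before rejecting the request. The Go sketch below only illustrates that idea; the store and portAllocator types, their methods, and errAllocated are simplified stand-ins, not the actual Kubernetes implementation.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// store stands in for the authoritative allocation state kept in etcd.
// In this sketch it is just a thread-safe set of allocated ports.
type store struct {
	mu    sync.Mutex
	ports map[int]bool
}

func (s *store) snapshot() map[int]bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[int]bool, len(s.ports))
	for p := range s.ports {
		out[p] = true
	}
	return out
}

func (s *store) allocate(port int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ports[port] = true
}

func (s *store) release(port int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.ports, port)
}

// portAllocator keeps a local cache of allocated ports that can lag behind
// the store, which is what produces the spurious "already allocated" error.
// Not concurrency-safe; this is only a sketch.
type portAllocator struct {
	backing *store
	local   map[int]bool
}

var errAllocated = errors.New("provided port is already allocated")

// Allocate first consults the local cache; on an apparent conflict it
// re-syncs from the backing store before giving up (the gist of the fix).
func (a *portAllocator) Allocate(port int) error {
	if !a.local[port] {
		a.local[port] = true
		a.backing.allocate(port)
		return nil
	}
	// Local data says the port is taken; refresh from the store in case an
	// asynchronous release has not reached this cache yet.
	a.local = a.backing.snapshot()
	if a.local[port] {
		return errAllocated
	}
	a.local[port] = true
	a.backing.allocate(port)
	return nil
}

func main() {
	st := &store{ports: map[int]bool{30006: true}}
	alloc := &portAllocator{backing: st, local: st.snapshot()}

	// The service is deleted: the release lands in the store, but the
	// allocator's local cache still marks 30006 as in use.
	st.release(30006)

	// Without the re-sync this would fail with errAllocated; with it, the
	// stale cache is refreshed and the port can be reused immediately.
	if err := alloc.Allocate(30006); err != nil {
		fmt.Println("allocate failed:", err)
	} else {
		fmt.Println("port 30006 reallocated after refresh")
	}
}

The essential change is the re-snapshot on an apparent conflict: a release that has already been committed to the store but not yet reflected in the local cache no longer produces a spurious "provided port is already allocated" error.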


Version-Release number of selected component (if applicable):
Customer 3.10

Internal 3.11
Internal 4.2 nightly

How reproducible:

Pretty reproducible against the Sept 16th 4.2 nightly

Steps to Reproduce:

Example NodePort service (nodeport.yml):

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nodeport-admin
  name: nodeport-test
spec:
  type: NodePort
  externalTrafficPolicy: Cluster
  ports:
  - name: 8080-tcp
    port: 8080
    protocol: TCP
    nodePort: 30006
  selector:
    deploymentconfig: nodeport-app

Example script:

#!/bin/bash
date
oc delete svc nodeport-test
#while [ -n "$(oc create -f nodeport.yml 2>&1 > /dev/null)" ]; do
while ! oc create -f nodeport.yml; do
  date
done

Actual results:

The Service "nodeport-test" is invalid: spec.ports[0].nodePort: Invalid value: 30006: provided port is already allocated

Expected results:

Deleting the service should release its NodePort and sync that state right away. Right now it appears the service object is deleted, but the release of the NodePort happens asynchronously and, depending on the environment, can take a long time.


Additional info:

This has been discussed and reported upstream:

https://bugzilla.redhat.com/show_bug.cgi?id=1571752
https://github.com/kubernetes/kubernetes/issues/32987
https://github.com/kubernetes/kubernetes/issues/73140

Comment 5 Michal Fojtik 2020-05-19 13:18:11 UTC
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 6 Matthew Robson 2020-05-19 13:27:56 UTC
The bug is still relevant. There is a PR open looking to address it.

https://github.com/kubernetes/kubernetes/pull/89937

Comment 7 Michal Fojtik 2020-05-26 11:04:34 UTC
This bug hasn't had any activity 7 days after it was marked as LifecycleStale, so we are closing this bug as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 8 Stefan Schimanski 2020-06-18 09:20:02 UTC
https://github.com/kubernetes/kubernetes/pull/89937 is finally in the upstream merge queue.

Comment 9 Stefan Schimanski 2020-06-18 09:39:18 UTC
Created a pick for 3.11. If it works without huge effort, we are fine. If not, this is probably not going to happen.

Comment 13 Xingxing Xia 2020-07-22 13:26:27 UTC
Per https://github.com/kubernetes/kubernetes/pull/89937, verification needs a 3.11 cluster with 3 masters. The 3.11 environments currently on hand are single-master clusters. Will launch an HA cluster tomorrow to verify.

Comment 14 Xingxing Xia 2020-07-23 09:01:49 UTC
Verified in a 3.11.248 HA cluster:
$ cat test.sh
echo "`date` begins"
./oc delete svc nodeport-test
while ! ./oc create -f nodeport.yml; do
  echo "`date` failed"
done

$ i=0
$ while true; do
  bash test.sh
  let i+=1
  echo "time: $i"
  echo
done |& tee test.log

# After many iterations of the above loop, search test.log for "failed": no matches means the bug is not reproduced.
$ vi test.log

Comment 17 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990

