Bug 1753649

Summary: 3.11: Recreating recently deleted NodePort service results in 'port is already allocated' error
Product: OpenShift Container Platform
Reporter: Matthew Robson <mrobson>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: high
Priority: low
Version: 3.11.0
CC: aos-bugs, mfojtik, sttts
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.z
Hardware: All
OS: Linux
Type: Bug
Last Closed: 2020-07-27 13:49:10 UTC
Bug Depends On: 1848379

Description Matthew Robson 2019-09-19 14:10:32 UTC
Description of problem:

Typically through automation, when a project is deleted and recreated, or a specific service is deleted and recreated, we see an error that the NodePort is already in use, even though the service has been deleted and the port no longer exists in iptables.

Thu 19 Sep 2019 09:24:27 EDT
service "nodeport-test" deleted
The Service "nodeport-test" is invalid: spec.ports[0].nodePort: Invalid value: 30006: provided port is already allocated
Thu 19 Sep 2019 09:24:30 EDT
The Service "nodeport-test" is invalid: spec.ports[0].nodePort: Invalid value: 30006: provided port is already allocated
Thu 19 Sep 2019 09:24:32 EDT
service/nodeport-test created

Depending on the environment (possibly load related), you may see no delay, or a significant delay running into the minutes.

It looks like cleanup of the service's allocated resources is handled asynchronously via rs.releaseAllocatedResources(svc). This is probably fine for ClusterIPs, but for something usually assigned statically, like a NodePort, you see issues like this.

https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/service/storage/rest.go#L267

How long it takes for the Release to be committed to etcd then determines how long it is between the service deletion and the ability to actually reuse that port:

https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/service/allocator/storage/storage.go#L143
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/core/service/allocator/storage/storage.go#L163
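To make the window concrete, below is a minimal, self-contained Go sketch of the pattern (not the actual kube-apiserver code): the Service object and the node-port allocator state are stored and updated separately, so a create that arrives before the release is committed is rejected with the same "provided port is already allocated" error. The portAllocator type, the delayed goroutine, and the retry loop are stand-ins for the persisted allocation bitmap, the delayed etcd commit, and the reproducer below.

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errAllocated = errors.New("provided port is already allocated")

// portAllocator stands in for the persisted node-port allocation bitmap.
type portAllocator struct {
	mu    sync.Mutex
	inUse map[int]bool
}

func (a *portAllocator) Allocate(port int) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.inUse[port] {
		return errAllocated
	}
	a.inUse[port] = true
	return nil
}

func (a *portAllocator) Release(port int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.inUse, port)
}

func main() {
	alloc := &portAllocator{inUse: map[int]bool{30006: true}}

	// "Delete" the service: the Service object is gone immediately, but the
	// allocator release is committed later (simulated by a delayed goroutine).
	fmt.Println("service \"nodeport-test\" deleted")
	go func() {
		time.Sleep(2 * time.Second) // stand-in for the delayed etcd commit
		alloc.Release(30006)
	}()

	// Recreating the service right away hits the stale allocation until the
	// release lands, which is the error seen in the bug.
	for {
		if err := alloc.Allocate(30006); err != nil {
			fmt.Println("create failed:", err)
			time.Sleep(500 * time.Millisecond)
			continue
		}
		fmt.Println("service/nodeport-test created")
		return
	}
}

Once the simulated release lands, the create succeeds, which matches the pattern of timestamps in the log above.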


Version-Release number of selected component (if applicable):
Customer: 3.10

Internal: 3.11
Internal: 4.2 nightly

How reproducible:

Pretty reproducible against the Sept 16th 4.2 nightly

Steps to Reproduce:

Example Node Port:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nodeport-admin
  name: nodeport-test
spec:
  type: NodePort
  externalTrafficPolicy: Cluster
  ports:
  - name: 8080-tcp
    port: 8080
    protocol: TCP
    nodePort: 30006
  selector:
    deploymentconfig: nodeport-app

Example script:

#!/bin/bash
# Print a timestamp, delete the NodePort service, then retry the create until
# the node port is accepted again, printing a timestamp after each failed attempt.
date
oc delete svc nodeport-test
while ! oc create -f nodeport.yml; do
  date
done

Actual results:

The Service "nodeport-test" is invalid: spec.ports[0].nodePort: Invalid value: 30006: provided port is already allocated

Expected results:

Deleting the service should release the node port and sync. Right now the service object is deleted, but the release of the node port happens asynchronously and, depending on the environment, can take a long time.
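For automation hitting this window, retrying the create until the release has been committed (which is what the reproducer script above does with oc) works around the error. Below is a rough client-go sketch of the same retry; it assumes a recent client-go, and createWithRetry, the 2-second backoff, and the 5-minute timeout are illustrative choices, not part of any fix.

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// createWithRetry keeps retrying the Service create while the apiserver still
// rejects it as invalid (the "port is already allocated" case), and gives up
// when ctx expires.
func createWithRetry(ctx context.Context, client kubernetes.Interface, svc *corev1.Service) error {
	for {
		_, err := client.CoreV1().Services(svc.Namespace).Create(ctx, svc, metav1.CreateOptions{})
		if err == nil {
			return nil
		}
		// "provided port is already allocated" surfaces as an Invalid error.
		if !apierrors.IsInvalid(err) {
			return err
		}
		fmt.Println("node port still allocated, retrying:", err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Same service shape as the nodeport.yml used in the reproducer.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "nodeport-test",
			Namespace: "default",
			Labels:    map[string]string{"app": "nodeport-admin"},
		},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeNodePort,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeCluster,
			Selector:              map[string]string{"deploymentconfig": "nodeport-app"},
			Ports: []corev1.ServicePort{{
				Name:     "8080-tcp",
				Port:     8080,
				Protocol: corev1.ProtocolTCP,
				NodePort: 30006,
			}},
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	if err := createWithRetry(ctx, client, svc); err != nil {
		panic(err)
	}
	fmt.Println("service/nodeport-test created")
}

Treating only Invalid errors as retryable keeps genuine configuration mistakes from looping forever.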


Additional info:

This has been discussed and reported upstream:

https://bugzilla.redhat.com/show_bug.cgi?id=1571752
https://github.com/kubernetes/kubernetes/issues/32987
https://github.com/kubernetes/kubernetes/issues/73140

Comment 5 Michal Fojtik 2020-05-19 13:18:11 UTC
This bug hasn't had any engineering activity in the last ~30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale".

If you have further information on the current state of the bug, please update it and remove the "LifecycleStale" keyword, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 6 Matthew Robson 2020-05-19 13:27:56 UTC
The bug is still relevant. There is a PR open looking to address it.

https://github.com/kubernetes/kubernetes/pull/89937

Comment 7 Michal Fojtik 2020-05-26 11:04:34 UTC
This bug hasn't had any activity in the 7 days since it was marked as LifecycleStale, so we are closing it as WONTFIX. If you consider this bug still valuable, please reopen it or create a new bug.

Comment 8 Stefan Schimanski 2020-06-18 09:20:02 UTC
https://github.com/kubernetes/kubernetes/pull/89937 is finally in the upstream merge queue.

Comment 9 Stefan Schimanski 2020-06-18 09:39:18 UTC
Created a pick for 3.11. If it works without huge effort, we are fine. If not, this is probably not going to happen.

Comment 13 Xingxing Xia 2020-07-22 13:26:27 UTC
Per https://github.com/kubernetes/kubernetes/pull/89937, a 3.11 cluster with 3 masters is needed to verify. The 3.11 envs currently on hand are single-master clusters. Will launch an HA cluster tomorrow to verify.

Comment 14 Xingxing Xia 2020-07-23 09:01:49 UTC
Verified in a 3.11.248 HA cluster:
$ cat test.sh
echo "`date` begins"
./oc delete svc nodeport-test
while ! ./oc create -f nodeport.yml; do
  echo "`date` failed"
done

$ i=0
$ while true; do
  bash test.sh
  let i+=1
  echo "time: $i"
  echo
done |& tee test.log

# After many iterations of the above loop, searching test.log for "failed" yields no matches, which means the bug is not reproduced.
$ vi test.log

Comment 17 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990