Bug 1571752

Summary: Adding an already deleted nodePort to a service shows "already allocated" error
Product: OpenShift Container Platform
Reporter: Gonzalo Marcote <gmarcote>
Component: Networking
Assignee: Ravi Sankar <rpenta>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Meng Bo <bmeng>
Severity: high
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, bbennett, gmarcote, rpenta, weliang
Target Release: 3.11.0
Hardware: All
OS: All
Last Closed: 2018-06-15 18:30:07 UTC
Type: Bug

Description Gonzalo Marcote 2018-04-25 11:07:23 UTC
Description of problem:
Deleting a service with a nodePort does not release the port; the port cannot be allocated again until 5-10 minutes later.

Checking whether the port is still in use on the node shows no evidence of it:
$ netstat -puntal | grep 31414
$ iptables-save | grep 31414
$ ss -tulpn | grep 31414

This Kubernetes issue is related -> https://github.com/kubernetes/kubernetes/issues/32987
As noted by users on that issue, this still happens in k8s v1.7.4, v1.8.7 and v1.9.4

Version-Release number of selected component (if applicable):
OCP 3.6

How reproducible:
It does not always happen, and not in all clusters; the time to wait until the port is released also varies.
It seems as if the Kubernetes API server / proxy is somehow holding the nodePort for a grace period even though it is no longer allocated on the node.
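
To quantify the grace period, one can time how long recreation keeps failing (a sketch; gateway.yaml is a hypothetical manifest that pins a fixed nodePort, and oc is assumed to be logged in with sufficient privileges):

$ oc delete svc gateway
$ start=$(date +%s)
$ until oc create -f gateway.yaml >/dev/null 2>&1; do sleep 5; done   # retry until the port is released
$ echo "port released after $(( $(date +%s) - start ))s"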

Steps to Reproduce:
1. Create a service specifying a nodePort.
2. Delete the service.
3. Check with iptables or netstat that the port is no longer in use on the node.
4. Immediately try to recreate the same service with the same nodePort.
5. Creation keeps failing for several minutes with the following error:
error: Service "gateway" is invalid: spec.ports[0].nodePort: Invalid value: 30120: provided port is already allocated
    serviceaccount "gateway" created
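
Condensed into commands (a sketch, reusing the hypothetical gateway.yaml with nodePort 30120, matching the error above):

$ oc create -f gateway.yaml        # service "gateway" created
$ oc delete svc gateway            # service "gateway" deleted
$ iptables-save | grep 30120       # no output: port not in use on the node
$ oc create -f gateway.yaml        # fails: "provided port is already allocated"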

Actual results:
The service cannot be recreated right away. For automated deployments that need to remove and recreate a service for different customers, this breaks the automation.
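
Until the root cause is fixed, automation can work around this by retrying with a bounded timeout (a sketch, again using the hypothetical gateway.yaml):

$ timeout 600 bash -c 'until oc create -f gateway.yaml; do sleep 10; done'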

Expected results:
The service can be created with the same nodePort immediately after the old one is deleted.

Additional info:
This Kubernetes issue is related -> https://github.com/kubernetes/kubernetes/issues/32987

Comment 1 Ben Bennett 2018-05-04 16:10:39 UTC
@Ravi: Please comment with what you have found so far.

Comment 2 Weibin Liang 2018-05-10 17:42:27 UTC
@Gonzalo, I tested on an OCP 3.6 HA env with 3 masters and 2 nodes and cannot reproduce the problem. Below are my testing steps and logs. Is my testing env too small to see the problem?

[root@ip-172-18-7-208 ~]# oc version
oc v3.6.173.0.118
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ec2-54-243-2-232.compute-1.amazonaws.com
openshift v3.6.173.0.118
kubernetes v1.6.1+5115d708d7
[root@ip-172-18-7-208 ~]# oc get nodes
NAME                            STATUS                     AGE       VERSION
ip-172-18-10-218.ec2.internal   Ready,SchedulingDisabled   2h        v1.6.1+5115d708d7
ip-172-18-12-193.ec2.internal   Ready                      2h        v1.6.1+5115d708d7
ip-172-18-4-101.ec2.internal    Ready                      2h        v1.6.1+5115d708d7
ip-172-18-5-165.ec2.internal    Ready,SchedulingDisabled   2h        v1.6.1+5115d708d7
ip-172-18-7-208.ec2.internal    Ready,SchedulingDisabled   2h        v1.6.1+5115d708d7
[root@ip-172-18-7-208 ~]# cat svc.yaml 
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  type: NodePort
  ports:
    - port: 80
      nodePort: 30080
      name: http
    - port: 443
      nodePort: 31414
      name: https
  selector:
    name: nginx
[root@ip-172-18-7-208 ~]# for i in {1..100}; do oc create -f svc.yaml ; sleep 1; netstat -puntal | grep 31414; iptables-save | grep 31414; ss -tulpn | grep 31414; oc delete svc nginx; sleep 1; netstat -puntal | grep 31414; iptables-save | grep 31414; ss -tulpn | grep 31414; done
service "nginx" created
tcp6       0      0 :::31414                :::*                    LISTEN      14733/openshift     
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-SVC-N3YB2VSZWW2B76BJ
-A KUBE-SERVICES -p tcp -m comment --comment "default/nginx:https has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31414 -j REJECT --reject-with icmp-port-unreachable
tcp    LISTEN     0      128      :::31414                :::*                   users:(("openshift",pid=14733,fd=21))
service "nginx" deleted
service "nginx" created
tcp6       0      0 :::31414                :::*                    LISTEN      14733/openshift     
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-SVC-N3YB2VSZWW2B76BJ
-A KUBE-SERVICES -p tcp -m comment --comment "default/nginx:https has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31414 -j REJECT --reject-with icmp-port-unreachable
tcp    LISTEN     0      128      :::31414                :::*                   users:(("openshift",pid=14733,fd=21))
service "nginx" deleted
service "nginx" created
tcp6       0      0 :::31414                :::*                    LISTEN      14733/openshift     
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-SVC-N3YB2VSZWW2B76BJ
-A KUBE-SERVICES -p tcp -m comment --comment "default/nginx:https has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31414 -j REJECT --reject-with icmp-port-unreachable
tcp    LISTEN     0      128      :::31414                :::*                   users:(("openshift",pid=14733,fd=21))
service "nginx" deleted
service "nginx" created
tcp6       0      0 :::31414                :::*                    LISTEN      14733/openshift     
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:https" -m tcp --dport 31414 -j KUBE-SVC-N3YB2VSZWW2B76BJ
-A KUBE-SERVICES -p tcp -m comment --comment "default/nginx:https has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 31414 -j REJECT --reject-with icmp-port-unreachable
tcp    LISTEN     0      128      :::31414                :::*                   users:(("openshift",pid=14733,fd=21))
service "nginx" deleted
^C
[root@ip-172-18-7-208 ~]#

Comment 3 Ravi Sankar 2018-05-10 18:30:31 UTC
@gmarcote
I tried to reproduce this on a 3.10 HA setup (3 API servers, 3-member etcd cluster) with the given reproduction steps, but I was not successful. Weibin tried on a 3.6 HA setup (as mentioned in the previous comment) and couldn't reproduce it either.

I'm fairly sure there is an issue here, and my hunch is that it is related to the nodePort allocator mishandling a mismatch between its in-memory map and the data in etcd. If that is the case, I have a patch ready: https://github.com/openshift/origin/commit/f465e8ceff12d4c58a76a480c4d34461eaf4cdbe
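
If that hunch is right, the allocator state persisted in etcd should disagree with the API server's in-memory copy right after the delete. One way to peek at the persisted side (a sketch; the endpoint and cert paths are assumptions that vary per install, and the value is a binary-encoded RangeAllocation, so only the port range string and the allocation bitmap are loosely readable):

# ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key \
    get /registry/ranges/servicenodeports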

If you could provide more details about the HA setup and the exact reproduction steps, we could test our patch.

Comment 4 Ben Bennett 2018-06-15 18:30:07 UTC
We are unable to reproduce this. If you can provide more information, please re-open the bug.

Comment 5 Red Hat Bugzilla 2023-09-14 04:27:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days