Bug 2047299 - nodeport not reachable port connection timeout
Summary: nodeport not reachable port connection timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.13.0
Assignee: Nadia Pinaeva
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-27 14:38 UTC by Erik Lalancette
Modified: 2023-05-17 22:46 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-17 22:46:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ovn-kubernetes pull 1474 0 None open Bug 2047299, OCPBUGS-2337: [DownstreamMerge] 13 Jan 2023 2023-01-16 09:21:01 UTC
Red Hat Product Errata RHSA-2023:1326 0 None None None 2023-05-17 22:46:44 UTC

Description Erik Lalancette 2022-01-27 14:38:19 UTC
Description of problem:

NodePort service is not accessible; connections to the node port time out.

Version-Release number of selected component (if applicable):

OCP 4.8.20


How reproducible:

 $oc -n ui-nprd get services -o wide
NAME                 TYPE           CLUSTER-IP       EXTERNAL-IP                                                                           PORT(S)          AGE     SELECTOR
docker-registry      ClusterIP      10.201.219.240   <none>                                                                                5000/TCP         24d     app=registry
docker-registry-lb   LoadBalancer   10.201.252.253   internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com   5000:30779/TCP   3d22h   app=registry
docker-registry-np   NodePort       10.201.216.26    <none>                                                                                5000:32428/TCP   3d16h   app=registry

$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.


In a newly created namespace, the same deployment works:

[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000

[1]+  Stopped                 oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7:  ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7:  ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000

[RHEL7:  ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+  Terminated              oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7:  ~/tmp]> oc get services
NAME                 TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
docker-registry-np   NodePort   10.201.224.174   <none>        5000:31793/TCP   68s

[RHEL7:  ~/tmp]> oc get pods -o wide
NAME                        READY   STATUS    RESTARTS   AGE    IP            NODE                                           NOMINATED NODE   READINESS GATES
registry-75b7c7fd94-rx29j   1/1     Running   0          7m5s   10.201.1.29   ip-xxx.ca-central-1.compute.internal   <none>           <none>
[RHEL7:  ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.

Actual results:

- Works in a newly created namespace
- Does not work in namespaces that already existed

Expected results:

- It should work in all namespaces.

Additional info:

- This cluster was upgraded from 4.7.x to 4.8, and OVN-Kubernetes was then enabled manually.
- The issue was initially affecting all namespaces, but after the ovnkube-master-xxxx pods were restarted, only newly created namespaces work (see the sketch below).
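
For reference, a minimal sketch of how the ovnkube-master pods can be restarted (this is an assumption about how the restart was performed; it presumes the default openshift-ovn-kubernetes namespace and the app=ovnkube-master label, with the pods being recreated automatically by their controller):

$ oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master
$ oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-master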

Comment 6 Erik Lalancette 2022-02-03 21:39:57 UTC
Hi @npinaeva, exactly: they have an EgressFirewall present on the namespaces. When the customer removes this EgressFirewall, the NodePort connection works fine. Can you help us better understand why this happened?
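
For context, a minimal sketch of checking for and removing the EgressFirewall in an affected namespace (the ui-nprd namespace is taken from the description; the resource name "default" is assumed, as OVN-Kubernetes expects the per-namespace EgressFirewall to use that name):

$ oc -n ui-nprd get egressfirewall
$ oc -n ui-nprd delete egressfirewall default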

thanks

Comment 8 Erik Lalancette 2022-02-09 16:28:58 UTC
Hi @npinaeva, I was able to reproduce this scenario on my IPI 4.8.18 cluster.

The cluster is using Gateway Mode: local.

Here are the steps to reproduce it.

1- oc new-project hello

2- oc new-app --docker-image=docker.io/openshift/hello-openshift --labels='app=hello-openshift' -n hello

3- cat <<EOF | oc create  -f -
apiVersion: v1
kind: Service
metadata:
  name: lb
spec:
  ports:
  - name: lb
    port: 8080
  loadBalancerIP:
  type: LoadBalancer
  selector:
    app: hello-openshift
EOF

4- cat <<EOF | oc create  -f -
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.test.com
  - type: Allow
    to:
      cidrSelector: 172.30.0.0/16
  - type: Allow
    to:
      cidrSelector: 10.128.0.0/14
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF

5- The NodePort is unreachable from external clients and also from all nodes in the cluster. Therefore the AWS ELB status is OutOfService.
  
6- If I add the 100.64.0.0/16 CIDR (the OVN gateway local switch IP range) to the EgressFirewall like this, the AWS ELB status becomes InService, and the NodePort is reachable from external clients and from all nodes in the cluster.

cat <<EOF | oc create  -f -
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      dnsName: www.test.com
  - type: Allow
    to:
      cidrSelector: 172.30.0.0/16
  - type: Allow
    to:
      cidrSelector: 10.128.0.0/14
  - type: Allow
    to:
      cidrSelector: 100.64.0.0/16
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0
EOF
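
As a side note, 100.64.0.0/16 is the default OVN-Kubernetes internal (join) subnet. A minimal sketch, assuming the default operator configuration, of checking whether the cluster overrides it (the v4InternalSubnet field is normally unset when the default is in use):

$ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.v4InternalSubnet}'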

Comment 9 Erik Lalancette 2022-02-09 16:50:50 UTC
Here's the output showing how to test the NodePort from the node directly.

---> Before the egressfirewall

oc get svc
NAME              TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)             AGE
hello-openshift   ClusterIP      172.30.175.164   <none>                                                                   8080/TCP,8888/TCP   16s
lb                LoadBalancer   172.30.73.43     a04a3409e7cec408eb8746845f87bfdc-642313472.us-east-2.elb.amazonaws.com   8080:32064/TCP      3s

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-130-205.us-east-2.compute.internal   Ready    master   48m   v1.21.6+bb8d50a
ip-10-0-140-74.us-east-2.compute.internal    Ready    worker   38m   v1.21.6+bb8d50a
ip-10-0-188-79.us-east-2.compute.internal    Ready    master   48m   v1.21.6+bb8d50a
ip-10-0-190-185.us-east-2.compute.internal   Ready    worker   38m   v1.21.6+bb8d50a
ip-10-0-194-5.us-east-2.compute.internal     Ready    worker   39m   v1.21.6+bb8d50a
ip-10-0-215-32.us-east-2.compute.internal    Ready    master   47m   v1.21.6+bb8d50a

$oc debug node/ip-10-0-140-74.us-east-2.compute.internal
Starting pod/ip-10-0-140-74us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
chroot /host
Pod IP: 10.0.140.74
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# nc -v 10.0.140.74 32064
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.140.74:32064.

---> After applying  the egressfirewall 

oc debug node/ip-10-0-140-74.us-east-2.compute.internal
Starting pod/ip-10-0-140-74us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.140.74
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# nc -v 10.0.140.74 32064
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

---> After applying the egressfirewall with an Allow rule for the 100.64.0.0/16 OVN logical CIDR

oc debug node/ip-10-0-140-74.us-east-2.compute.internal
Starting pod/ip-10-0-140-74us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.140.74
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# nc -v 10.0.140.74 32064
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.140.74:32064.

Comment 10 Erik Lalancette 2022-02-09 18:33:53 UTC
I just tried on OCP 4.9.18 with Gateway Mode: shared; the result is exactly the same.

Comment 11 Nadia Pinaeva 2022-02-23 10:45:19 UTC
This is a bug in our EgressFirewall implementation and we need some time to fix it (it actually requires fixes in two different components, so it may not be quick).

While we're working on a bug fix, can we suggest adding

  - type: Allow
    to:
      cidrSelector: 100.64.0.0/16

as a workaround for this customer?
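
A minimal sketch of applying that workaround to an existing EgressFirewall without recreating it (the hello namespace comes from the reproducer in comment 8 and the name "default" is assumed; EgressFirewall rules are evaluated in order, so the Allow rule is inserted at the top, before the final Deny 0.0.0.0/0 rule):

$ oc -n hello patch egressfirewall default --type=json \
  -p '[{"op":"add","path":"/spec/egress/0","value":{"type":"Allow","to":{"cidrSelector":"100.64.0.0/16"}}}]'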

Comment 14 Tim Rozet 2022-12-15 15:13:55 UTC
Related OVN bug: https://bugzilla.redhat.com/show_bug.cgi?id=2057426

Comment 18 jechen 2023-01-20 13:41:07 UTC
Verified the fix in 4.13.0-0.nightly-2023-01-17-152326

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2023-01-17-152326   True        False         75m     Cluster version is 4.13.0-0.nightly-2023-01-17-152326

 

1. Create the test namespace, pod, and NodePort service

$ oc new-project test

$ oc label ns test security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged --overwrite
namespace/test labeled

$ cat pod_httpserver.yaml 
apiVersion: v1
kind: Pod 
metadata: 
 name: hello-pod
 labels: 
  name: hello-pod
spec: 
 containers:
 - name: hello-world
   image: gcr.io/google-samples/node-hello:1.0
   ports:
   - containerPort: 8080
     protocol: TCP

$ oc apply -f  pod_httpserver.yaml 
pod/hello-pod created


$ cat svc_nodeport.yaml 
kind: Service 
apiVersion: v1 
metadata:
  name: hello-pod 
  labels:
    name: hello-pod
spec:
  ports:
    - name: http
      port: 27017
      protocol: TCP
      nodePort: 30000
      targetPort: 8080
  selector:
    name: hello-pod
  type: NodePort

$ oc apply -f  svc_nodeport.yaml 
service/hello-pod created

$ oc get all -owide
NAME            READY   STATUS              RESTARTS   AGE   IP       NODE                                         NOMINATED NODE   READINESS GATES
pod/hello-pod   0/1     ContainerCreating   0          14s   <none>   ip-10-0-128-209.us-east-2.compute.internal   <none>           <none>

NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE   SELECTOR
service/hello-pod   NodePort   172.30.10.132   <none>        27017:30000/TCP   6s    name=hello-pod

 

2. Before applying the EgressFirewall rule, curling the NodePort service from the external bootstrap node gets a reply

[core@ip-10-0-31-153 ~]$ curl 10.0.128.209:30000
Hello Kubernetes!

 

3. After applying the EgressFirewall rule, curling the NodePort service from the external bootstrap node still gets a reply

$ cat egressfirewall_denyall.yaml 
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0

$ oc apply -f  egressfirewall_denyall.yaml 
egressfirewall.k8s.ovn.org/default created

 

[core@ip-10-0-31-153 ~]$ curl 10.0.128.209:30000
Hello Kubernetes!


==> verified the fix

Comment 19 jechen 2023-01-20 13:53:20 UTC
Correction for an error in comment #18: step 3 should read: after applying the EgressFirewall rule, curling the NodePort service from the external bootstrap node got a reply.

Comment 22 errata-xmlrpc 2023-05-17 22:46:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1326

