Bug 1935691

Summary: ovnkube ExternalIP for services that listen on port 80/443 will break IngressControllers after node reboot or scale in / scale out

Field | Value
---|---
Product | OpenShift Container Platform
Component | Networking
Networking sub component | ovn-kubernetes
Version | 4.6
Target Milestone | ---
Target Release | 4.6.z
Status | CLOSED ERRATA
Severity | high
Priority | high
Reporter | Andreas Karis <akaris>
Assignee | Alexander Constantinescu <aconstan>
QA Contact | Weibin Liang <weliang>
CC | aconstan, astoycos, zzhao
Hardware | Unspecified
OS | Unspecified
Type | Bug
Cloned to | 1937727 (view as bug list)
Bug Depends On | 1937727
Last Closed | 2021-03-30 17:03:16 UTC
Description (Andreas Karis, 2021-03-05 11:51:24 UTC)
Prerequisites: build a python image and push it to the registry:

~~~
IMAGE=registry.example.com:5000/python:latest
mkdir python
cat <<'EOF' > python/Dockerfile
FROM registry.access.redhat.com/ubi8/ubi
RUN yum install iproute iputils tcpdump python38 -y
EOF
cd python
buildah bud -t $IMAGE .
podman push $IMAGE
cd -
~~~

Scale in the ingress controller:

~~~
[root@openshift-jumpserver-0 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.18    True        False         9m38s   Error while reconciling 4.6.18: the cluster operator monitoring is degraded
[root@openshift-jumpserver-0 ~]# oc scale ingresscontrollers -n openshift-ingress-operator --replicas=1 default
ingresscontroller.operator.openshift.io/default scaled
[root@openshift-jumpserver-0 ~]# oc get pods -A -o wide | grep -i ingress
openshift-ingress-operator   ingress-operator-6cfd945dfb-qc8bd   2/2   Running   0   19m   172.26.0.36       openshift-master-2   <none>   <none>
openshift-ingress            router-default-6d6d869656-sqj4h     1/1   Running   0   23m   192.168.123.221   openshift-worker-1   <none>   <none>
~~~

Now deploy the service:

~~~
oc new-project test
oc project test
oc adm policy add-scc-to-user privileged -z default
cat <<'EOF' > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fedora-deployment
  labels:
    app: fedora-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fedora-pod
  template:
    metadata:
      labels:
        app: fedora-pod
    spec:
      containers:
      - name: fedora
        image: registry.example.com:5000/python:latest
        command:
        - python3
        args:
        - "-m"
        - "http.server"
        - "80"
        imagePullPolicy: IfNotPresent
        securityContext:
          runAsUser: 0
          capabilities:
            add:
            - "SETFCAP"
---
apiVersion: v1
kind: Service
metadata:
  name: shell-demo
spec:
  selector:
    app: fedora-pod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer
EOF
oc apply -f deployment.yaml
~~~

Test the service:

~~~
[root@openshift-jumpserver-0 ~]# ip r a 10.1.1.67 via 192.168.123.220
[root@openshift-jumpserver-0 ~]# ip r | grep 10.1.1.67
10.1.1.67 via 192.168.123.220 dev eth0
[root@openshift-jumpserver-0 ~]# curl 10.1.1.67:80
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
<li><a href="bin/">bin@</a></li>
<li><a href="boot/">boot/</a></li>
<li><a href="dev/">dev/</a></li>
<li><a href="etc/">etc/</a></li>
<li><a href="home/">home/</a></li>
<li><a href="lib/">lib@</a></li>
<li><a href="lib64/">lib64@</a></li>
<li><a href="lost%2Bfound/">lost+found/</a></li>
<li><a href="media/">media/</a></li>
<li><a href="mnt/">mnt/</a></li>
<li><a href="opt/">opt/</a></li>
<li><a href="proc/">proc/</a></li>
<li><a href="root/">root/</a></li>
<li><a href="run/">run/</a></li>
<li><a href="sbin/">sbin@</a></li>
<li><a href="srv/">srv/</a></li>
<li><a href="sys/">sys/</a></li>
<li><a href="tmp/">tmp/</a></li>
<li><a href="usr/">usr/</a></li>
<li><a href="var/">var/</a></li>
</ul>
<hr>
</body>
</html>
~~~

Check ovnkube logs for both workers:

~~~
[root@openshift-jumpserver-0 ~]# oc logs -n openshift-ovn-kubernetes ovnkube-node-7n5qd -c ovnkube-node | grep test/shell-demo
W0305 12:47:29.195368    4214 port_claim.go:191] PortClaim for svc: test/shell-demo on port: 80, err: listen tcp :80: bind: address already in use
E0305 12:47:29.195384    4214 port_claim.go:60] Error updating port claim for service: test/shell-demo: listen tcp :80: bind: address already in use
I0305 12:47:29.195445    4214 event.go:278] Event(v1.ObjectReference{Kind:"Service", Namespace:"test", Name:"shell-demo", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'PortClaim' Service: test/shell-demo requires port: 80 to be opened on node, but port cannot be opened, err: listen tcp :80: bind: address already in use
[root@openshift-jumpserver-0 ~]# oc logs -n openshift-ovn-kubernetes ovnkube-node-jf8zs -c ovnkube-node | grep test/shell-demo
[root@openshift-jumpserver-0 ~]#
~~~

The problem here is that ovnkube binds the port on all addresses (*:80), not just the external IP:

~~~
[root@openshift-jumpserver-0 ~]# oc exec -it -n openshift-ovn-kubernetes ovnkube-node-jf8zs -- ss -lntp | grep :80
Defaulting container name to ovn-controller.
Use 'oc describe pod/ovnkube-node-jf8zs -n openshift-ovn-kubernetes' to see all of the containers in this pod.
LISTEN   0   128   *:80   *:*   users:(("ovnkube",pid=4231,fd=8))
~~~

Now, scale out the ingress controller:

~~~
oc scale ingresscontrollers -n openshift-ingress-operator --replicas=2 default
~~~

The IngressController will not be able to spawn:

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -A -o wide | grep ingress
openshift-ingress-operator   ingress-operator-6cfd945dfb-qc8bd   2/2   Running   0   26m    172.26.0.36       openshift-master-2   <none>   <none>
openshift-ingress            router-default-6d6d869656-bd7zl     0/1   Running   1   103s   192.168.123.220   openshift-worker-0   <none>   <none>
openshift-ingress            router-default-6d6d869656-sqj4h     1/1   Running   0   31m    192.168.123.221   openshift-worker-1   <none>   <none>
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc logs -n openshift-ingress router-default-6d6d869656-bd7zl | tail -n 20
I0305 12:53:52.058358       1 template.go:403] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 0ced824c9667a259b75e963a16f3dda4b5d781f6\nversionFromGit: 4.0.0-232-g0ced824\ngitTreeState: clean\nbuildDate: 2021-02-13T02:16:38Z\n"
I0305 12:53:52.059973       1 metrics.go:154] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"="0.0.0.0:1936"
I0305 12:53:52.065260       1 router.go:185] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I0305 12:53:52.065329       1 router.go:263] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I0305 12:53:52.065720       1 router.go:325] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I0305 12:53:52.065779       1 router.go:262] router "msg"="router is including routes in all namespaces"
E0305 12:53:52.173533       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
E0305 12:53:52.193445       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 063/125352 (36) : Starting frontend public: cannot bind socket [0.0.0.0:80]
E0305 12:53:57.173618       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
E0305 12:53:57.191593       1 limiter.go:165] error reloading router: exit status 1
[ALERT] 063/125357 (39) : Starting frontend public: cannot bind socket [0.0.0.0:80]
I0305 12:54:25.803756       1 template.go:657] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
[root@openshift-jumpserver-0 ~]#
~~~

This can also be reproduced by ACPI shutting down the entire cluster and bringing it back up. Or, in a cluster with 2 workers and 2 ingress routers, you can simply reboot a worker node. When the node comes up, ovnkube will open the port before the ingress haproxy can.
The problem is that if we treat this as a configuration issue, then this configuration issue can go unnoticed for weeks or months. At some point, when a customer has to restart the Ingress router and it tries to come up on the same node, it will fail.

Why is it required that ovnkube bind to 0.0.0.0:80 for an ExternalIP service, anyway? It seems that all of this is implemented as an OVN load balancer internally, so binding to :80 seems to be unnecessary?

I also tested with OCP 4.7.0 and the behavior is exactly the same.

Hi Andreas,

Could you provide me with a reproducer for this? I've tried reproducing on AWS, but I don't see the same behavior. OVN-Kubernetes will perform port claims for NodePort and LoadBalancer type services, but will bind the port to the nodePort defined for those services, so it should not bind to port 80 in your case. OVN-Kubernetes will also bind ports for ExternalIP type services, but for those services it will specifically bind to $EXTERNAL_IP:$PORT, not 0.0.0.0:$PORT. See my reproducer below:

~~~
$ oc project
Using project "test" on server "https://api.ci-ln-hb93yw2-d5d6b.origin-ci-int-aws.dev.rhcloud.com:6443".
aconstan@localhost ~ $ oc get svc
NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP                                                               PORT(S)        AGE
cluster-ip-service   LoadBalancer   172.30.82.243   aa255dab2e6e54cbdb633287010e861c-835744214.us-west-2.elb.amazonaws.com   80:30132/TCP   9m43s
aconstan@localhost ~ $ oc get ep
NAME                 ENDPOINTS           AGE
cluster-ip-service   10.128.10.14:8080   9m48s
aconstan@localhost ~ $ oc get pod
NAME                        READY   STATUS    RESTARTS   AGE
netserver-66c995678-fdpkn   1/1     Running   0          12m
aconstan@localhost ~ $ oc get svc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2021-03-10T08:57:00Z"
    name: cluster-ip-service
    namespace: test
    resourceVersion: "30486"
    uid: a255dab2-e6e5-4cbd-b633-287010e861c9
  spec:
    clusterIP: 172.30.82.243
    clusterIPs:
    - 172.30.82.243
    externalTrafficPolicy: Cluster
    ports:
    - name: tcp
      nodePort: 30132
      port: 80
      protocol: TCP
      targetPort: 8080
    selector:
      deployment: "true"
    sessionAffinity: None
    type: LoadBalancer
  status:
    loadBalancer:
      ingress:
      - hostname: aa255dab2e6e54cbdb633287010e861c-835744214.us-west-2.elb.amazonaws.com
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
aconstan@localhost ~ $ oc exec -tic ovnkube-node ovnkube-node-n557f -n openshift-ovn-kubernetes -- bash
[root@ip-10-0-193-68 ~]# ss -lntp | grep :30132
LISTEN   0   128   *:30132   *:*   users:(("ovnkube",pid=2549,fd=9))
[root@ip-10-0-193-68 ~]# ss -lntp | grep :80
[root@ip-10-0-193-68 ~]#
~~~

If you could provide me with a kubeconfig to your setup, I'll have a look at it. I might have overlooked something when trying to reproduce.

/Alex
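To make the distinction above concrete, below is a rough, self-contained Go sketch of the expected port-claim decision (an approximation for illustration only, not the actual port_claim.go code; the `service`/`servicePort` types and `claimAddrs` function are simplified stand-ins for the Kubernetes types): NodePort/LoadBalancer services claim the nodePort on all addresses, while external IPs are claimed as $EXTERNAL_IP:$PORT, which leaves 0.0.0.0:80 free for the router's haproxy.

~~~
// claimaddrs.go: hypothetical sketch of which host addresses a node agent
// would claim for a given service shape.
package main

import "fmt"

// servicePort and service are simplified stand-ins for the Kubernetes types.
type servicePort struct {
	Port     int32
	NodePort int32
}

type service struct {
	Type        string // "ClusterIP", "NodePort", "LoadBalancer"
	ExternalIPs []string
	Ports       []servicePort
}

// claimAddrs returns the host addresses the agent would try to bind for svc.
func claimAddrs(svc service) []string {
	var addrs []string
	for _, p := range svc.Ports {
		// NodePort and LoadBalancer services reserve the nodePort on all addresses.
		if (svc.Type == "NodePort" || svc.Type == "LoadBalancer") && p.NodePort != 0 {
			addrs = append(addrs, fmt.Sprintf(":%d", p.NodePort))
		}
		// External IPs are claimed per IP, leaving 0.0.0.0:<port> free for haproxy.
		for _, ip := range svc.ExternalIPs {
			addrs = append(addrs, fmt.Sprintf("%s:%d", ip, p.Port))
		}
	}
	return addrs
}

func main() {
	svc := service{
		Type:        "LoadBalancer",
		ExternalIPs: []string{"10.1.1.72"}, // auto-assigned from 10.1.1.64/27
		Ports:       []servicePort{{Port: 80, NodePort: 32536}},
	}
	fmt.Println(claimAddrs(svc)) // [:32536 10.1.1.72:80]
}
~~~

With claims shaped like this, a node reboot cannot leave a stray *:80 listener behind, which is consistent with the empty `ss -lntp | grep :80` output in the 4.7.1 reproduction below.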
Hi, I'll work on a reproducer. Did you use the baremetal ExternalIP feature for this? The problem IMO lies there, and not in LoadBalancer type services per se:

* Configure ExternalIP feature: https://docs.openshift.com/container-platform/4.6/networking/configuring_ingress_cluster_traffic/configuring-externalip.html

~~~
oc get networks.config cluster -o jsonpath-as-json='{.spec.externalIP}'
[
    {
        "autoAssignCIDRs": [
            "10.1.1.64/27"
        ]
    }
]
~~~

- Andreas

Hi,

I understand, and I agree that the problem most likely lies in those subtleties. However, OVN-Kubernetes makes no distinction of that for what concerns the service specification, i.e. if there is an external IP defined for a service then it will bind the port to $EXTERNAL_IP:$PORT (at least on 4.7; on 4.6 this is not the case and does not work, see the 4.7 code: https://github.com/openshift/ovn-kubernetes/blob/release-4.7/go-controller/pkg/node/port_claim.go#L189). I am thus wondering if there's another component modifying the service specification on your baremetal cluster with this loadBalancer type service, to something we didn't expect.

/Alex

Fresh install of OCP 4.7.1 with UPI:

~~~
[root@openshift-jumpserver-0 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        False         3h49m   Cluster version is 4.7.1
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc edit networks.config cluster
network.config.openshift.io/cluster edited
[root@openshift-jumpserver-0 ~]# oc get networks.config cluster -o jsonpath-as-json='{.spec.externalIP}'
[
    {
        "autoAssignCIDRs": [
            "10.1.1.64/27"
        ]
    }
]
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc scale -n openshift-ingress-operator ingresscontroller default --replicas=1
ingresscontroller.operator.openshift.io/default scaled
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc new-project test
Error from server (AlreadyExists): project.project.openshift.io "test" already exists
[root@openshift-jumpserver-0 ~]# oc project test
Already on project "test" on server "https://api.cluster.example.com:6443".
[root@openshift-jumpserver-0 ~]# oc adm policy add-scc-to-user privileged -z default
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "default"
[root@openshift-jumpserver-0 ~]# cat <<'EOF' > deployment.yaml
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: fedora-deployment
>   labels:
>     app: fedora-deployment
> spec:
>   replicas: 1
>   selector:
>     matchLabels:
>       app: fedora-pod
>   template:
>     metadata:
>       labels:
>         app: fedora-pod
>     spec:
>       containers:
>       - name: fedora
>         image: registry.example.com:5000/python:latest
>         command:
>         - python3
>         args:
>         - "-m"
>         - "http.server"
>         - "80"
>         imagePullPolicy: IfNotPresent
>         securityContext:
>           runAsUser: 0
>           capabilities:
>             add:
>             - "SETFCAP"
> ---
> apiVersion: v1
> kind: Service
> metadata:
>   name: shell-demo
> spec:
>   selector:
>     app: fedora-pod
>   ports:
>   - protocol: TCP
>     port: 80
>     targetPort: 80
>   type: LoadBalancer
> EOF
[root@openshift-jumpserver-0 ~]# oc apply -f deployment.yaml
deployment.apps/fedora-deployment created
service/shell-demo created
[root@openshift-jumpserver-0 ~]# oc get pods
NAME                                 READY   STATUS    RESTARTS   AGE
fedora-deployment-68c46ccfd6-jdb42   1/1     Running   0          4s
[root@openshift-jumpserver-0 ~]# oc get svc
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
shell-demo   LoadBalancer   172.30.94.243   10.1.1.72     80:32536/TCP   8s
[root@openshift-jumpserver-0 ~]# ip r a 10.1.1.72 via 192.168.123.220
[root@openshift-jumpserver-0 ~]# curl 10.1.1.72:80
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
<li><a href="bin/">bin@</a></li>
<li><a href="boot/">boot/</a></li>
<li><a href="dev/">dev/</a></li>
<li><a href="etc/">etc/</a></li>
<li><a href="home/">home/</a></li>
<li><a href="lib/">lib@</a></li>
<li><a href="lib64/">lib64@</a></li>
<li><a href="lost%2Bfound/">lost+found/</a></li>
<li><a href="media/">media/</a></li>
<li><a href="mnt/">mnt/</a></li>
<li><a href="opt/">opt/</a></li>
<li><a href="proc/">proc/</a></li>
<li><a href="root/">root/</a></li>
<li><a href="run/">run/</a></li>
<li><a href="sbin/">sbin@</a></li>
<li><a href="srv/">srv/</a></li>
<li><a href="sys/">sys/</a></li>
<li><a href="tmp/">tmp/</a></li>
<li><a href="usr/">usr/</a></li>
<li><a href="var/">var/</a></li>
</ul>
<hr>
</body>
</html>
~~~

Wait until the Ingress pods are done terminating and until the new pod is up:

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -A | grep ingress
openshift-ingress-canary     ingress-canary-5t64b                1/1   Running   0   4h5m
openshift-ingress-canary     ingress-canary-qn6vl                1/1   Running   0   4h5m
openshift-ingress-operator   ingress-operator-67f8fdc58f-hj2n2   2/2   Running   3   4h30m
openshift-ingress            router-default-766754d55f-mj6rf     1/1   Running   0   99s
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -A -o wide | grep ingress
openshift-ingress-canary     ingress-canary-5t64b                1/1   Running   0   4h10m   172.24.2.7        openshift-worker-0   <none>   <none>
openshift-ingress-canary     ingress-canary-qn6vl                1/1   Running   0   4h10m   172.27.0.7        openshift-worker-1   <none>   <none>
openshift-ingress-operator   ingress-operator-67f8fdc58f-hj2n2   2/2   Running   3   4h35m   172.26.0.13       openshift-master-2   <none>   <none>
openshift-ingress            router-default-766754d55f-mj6rf     1/1   Running   0   6m42s   192.168.123.221   openshift-worker-1   <none>   <none>
[root@openshift-jumpserver-0 ~]# oc get pods -n openshift-ovn-kubernetes -o wide | grep ovnkube | grep worker
ovnkube-node-bdz5v   3/3   Running   0   4h11m   192.168.123.220   openshift-worker-0   <none>   <none>
ovnkube-node-rhwzb   3/3   Running   0   4h11m   192.168.123.221   openshift-worker-1   <none>   <none>
~~~

Interesting:

~~~
[root@openshift-jumpserver-0 ~]# oc exec -it -n openshift-ovn-kubernetes ovnkube-node-bdz5v -c ovn-controller -- ss -lntp | grep :80
[root@openshift-jumpserver-0 ~]# oc exec -it -n openshift-ovn-kubernetes ovnkube-node-bdz5v -c ovnkube-node -- ss -lntp | grep :80
[root@openshift-jumpserver-0 ~]#
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc scale -n openshift-ingress-operator ingresscontroller default --replicas=2
ingresscontroller.operator.openshift.io/default scaled
[root@openshift-jumpserver-0 ~]# oc get pods -A -o wide | grep ingress
openshift-ingress-canary     ingress-canary-5t64b                1/1   Running   0   4h14m   172.24.2.7        openshift-worker-0   <none>   <none>
openshift-ingress-canary     ingress-canary-qn6vl                1/1   Running   0   4h14m   172.27.0.7        openshift-worker-1   <none>   <none>
openshift-ingress-operator   ingress-operator-67f8fdc58f-hj2n2   2/2   Running   3   4h40m   172.26.0.13       openshift-master-2   <none>   <none>
openshift-ingress            router-default-766754d55f-mj6rf     1/1   Running   0   11m     192.168.123.221   openshift-worker-1   <none>   <none>
openshift-ingress            router-default-766754d55f-mwzcb     1/1   Running   0   14s     192.168.123.220   openshift-worker-0   <none>   <none>
~~~

So either my earlier test with OCP 4.7.0 was off, or this was fixed with OCP 4.7.1 (?). I could definitely reproduce this with 4.6.18. Did https://github.com/ovn-org/ovn-kubernetes/commit/4efbb59969223c4090c572be2c99d7280a871c8e only recently make it downstream? Is it possible that this only affects OCP 4.6 and not 4.7?

So, given that it's been verified as working on 4.7 (see comment 8), we can safely assume that we're missing the commit mentioned in comment 9 on 4.6. I am thus using this bug as a backport bug for it. Given this somewhat strange situation, I will need to file master bugs against 4.7 and 4.8 and set them directly to CLOSED ERRATA (as this is not a problem on those versions).
Tested and verified in 4.6.0-0.nightly-2021-03-21-131139:

~~~
[weliang@weliang Config]$ oc exec -it ovnkube-node-hffqt -c ovn-controller -- ss -lntp | grep :80
[weliang@weliang Config]$ oc exec -it ovnkube-node-hffqt -c ovnkube-node -- ss -lntp | grep :80
[weliang@weliang Config]$ oc exec -it ovnkube-node-h5bcl -c ovn-controller -- ss -lntp | grep :80
[weliang@weliang Config]$ oc exec -it ovnkube-node-h5bcl -c ovnkube-node -- ss -lntp | grep :80
[weliang@weliang Config]$ oc exec -it ovnkube-node-55ps7 -c ovn-controller -- ss -lntp | grep :80
[weliang@weliang Config]$ oc exec -it ovnkube-node-55ps7 -c ovnkube-node -- ss -lntp | grep :80
[weliang@weliang Config]$ oc get svc
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP                                                                PORT(S)        AGE
externalip-svc   ClusterIP      172.30.94.49    3.139.113.65                                                               80/TCP         6m25s
hello-service1   LoadBalancer   172.30.121.21   aa630cad0f2e64b5da8abe56fb3a0830-1691815349.us-east-2.elb.amazonaws.com   80:31261/TCP   11m
[weliang@weliang Config]$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE    IP            NODE                          NOMINATED NODE   READINESS GATES
router-default-576fdf88d6-852cj   1/1     Running   0          24h    10.0.99.130   weliang-224-w5tzb-compute-1   <none>           <none>
router-default-576fdf88d6-jlzwf   0/1     Running   43         130m   10.0.97.15    weliang-224-w5tzb-compute-0   <none>           <none>
router-default-576fdf88d6-qrtnj   1/1     Running   0          130m   10.0.98.203   weliang-224-w5tzb-compute-2   <none>           <none>
[weliang@weliang Config]$
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.23 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0952