1945948 – SNO: pods can't reach ingress when the ingress uses a different IPv6.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Tim Rozet
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-04-02 20:47 UTC by Alexander Chuzhoy
Modified:	2021-07-27 22:57 UTC
CC List:	7 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:57:19 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift ovn-kubernetes pull 498	None	open	Bug 1945948: Fixes local node IP reachability in shared gateway mode	2021-04-14 02:32:48 UTC
Github	ovn-org ovn-kubernetes pull 2159	None	open	Fix routes for node ips	2021-04-08 18:43:49 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:57:46 UTC

Description Alexander Chuzhoy 2021-04-02 20:47:39 UTC

Version:
4.8.0-0.nightly-2021-04-01-213116
4.8.0-0.nightly-2021-04-01-072432

The issue started a few days ago and always reproduces.

When attempting to deploy SNO, the deployment does not complete.

[kni@r640-u09 ~]$ oc get co|grep -v "True.*False.*False";
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-04-01-213116   False       False         True       14h
console                                    4.8.0-0.nightly-2021-04-01-213116   False       True          True       14h
ingress                                    4.8.0-0.nightly-2021-04-01-213116   True        False         True       14h

[kni@r640-u09 ~]$ oc get pod -n openshift-console
NAME                         READY   STATUS    RESTARTS   AGE
console-5787485c6d-4srlh     0/1     Running   46         4h16m
console-5f6c5d669b-4pl6w     0/1     Running   46         4h15m
downloads-7f8d988d97-pg7db   1/1     Running   0          14h
[kni@r640-u09 ~]$ 





[kni@r640-u09 ~]$ oc logs -n openshift-console console-5787485c6d-4srlh
W0402 20:40:20.414454       1 main.go:203] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
I0402 20:40:20.414539       1 main.go:272] cookies are secure!
E0402 20:40:25.444163       1 auth.go:231] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0402 20:40:40.450173       1 auth.go:231] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0402 20:40:55.454198       1 auth.go:231] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0402 20:41:10.460107       1 auth.go:231] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[kni@r640-u09 ~]$ 



[kni@r640-u09 ~]$ oc exec -n openshift-console console-5787485c6d-4srlh -- timeout 10 curl https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com/oauth/token -kI
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:08 --:--:--     0command terminated with exit code 124
[kni@r640-u09 ~]$ 


In this setup the API, the ingress, and the node itself have different IPs:

api.qe3.kni.lab.eng.bos.redhat.com has IPv6 address 2620:52:0:1386::97
openshift-master-0.qe3.kni.lab.eng.bos.redhat.com has IPv6 address 2620:52:0:1386::91
wildcard.apps.qe3.kni.lab.eng.bos.redhat.com has IPv6 address 2620:52:0:1386::96


This worked fine until a few days ago.


[kni@r640-u09 ~]$ oc exec -n openshift-ovn-kubernetes                           ovs-node-qkv7z -- ip -6 address show dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 2620:52:0:1386::97/121 scope global 
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1386::96/121 scope global 
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1386::91/121 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9bff:fe61:7179/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
[kni@r640-u09 ~]$
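
For reference, the three global addresses shown on br-ex above all carry a /121 prefix length. The following is a minimal sketch (not part of the original report) that uses Python's standard `ipaddress` module to confirm they fall in the same network. The dictionary keys `api`, `ingress`, and `node` are labels chosen here for illustration, based on the DNS names listed above.

```python
import ipaddress

# Addresses and prefix lengths copied from the `ip -6 address show dev br-ex`
# output above. The keys are illustrative labels, not names used by OpenShift.
BR_EX_ADDRESSES = {
    "api": "2620:52:0:1386::97/121",
    "ingress": "2620:52:0:1386::96/121",
    "node": "2620:52:0:1386::91/121",
}


def networks(addresses):
    """Map each label to the network its interface address belongs to."""
    return {
        name: ipaddress.ip_interface(addr).network
        for name, addr in addresses.items()
    }


nets = networks(BR_EX_ADDRESSES)
for name, net in nets.items():
    print(name, net)

# True when every address sits in one and the same network.
same_network = len(set(nets.values())) == 1
print("same network:", same_network)
```

If `same_network` is true, the API, ingress, and node addresses differ only in their host part, which is consistent with all three being configured on the same bridge.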

Comment 1 Alexander Chuzhoy 2021-04-02 20:49:10 UTC

I have a setup where api/ingress/node resolve to the same IP, and everything works there.
It seems this issue doesn't happen on an HA (non-SNO) cluster.

Comment 2 Alexander Chuzhoy 2021-04-03 23:20:53 UTC

The issue doesn't reproduce with IPv4.

Comment 3 Alexander Chuzhoy 2021-04-03 23:26:00 UTC

Actually, the IPv4 setup used OpenShiftSDN and the IPv6 setup used OVNKubernetes.

Comment 4 Antonio Ojea 2021-04-06 16:25:08 UTC

So the problem is that pods cannot reach the newly added IPs on the node:


[kni@r640-u09 ~]$ oc exec -n openshift-ovn-kubernetes                           ovs-node-qkv7z -- ip -6 address show dev br-ex
5: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 2620:52:0:1386::97/121 scope global 
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1386::96/121 scope global 
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1386::91/121 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::9a03:9bff:fe61:7179/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever


The pod fails with:
E0406 16:13:53.912561       1 auth.go:231] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe3.kni.lab.eng.bos.redhat.com": dial tcp [2620:52:0:1386::96]:443: i/o timeout (Client.Timeout exceeded while awaiting headers)



However, those IPs are reachable from outside:

[kni@r640-u09 ~]$ curl -k -v https://[2620:52:0:1386::96]:443
* Rebuilt URL to: https://[2620:52:0:1386::96]:443/
*   Trying 2620:52:0:1386::96...
* TCP_NODELAY set
* Connected to 2620:52:0:1386::96 (2620:52:0:1386::96) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1

Comment 5 Antonio Ojea 2021-04-06 16:45:15 UTC

The pod tries to reach the router-internal-default pod, which is a host-network pod, but it fails to reach it on the new IP:

openshift-ingress                                  router-default-7c5ff5965d-mfbb4                                              1/1     Running     0          4d      2620:52:0:1386::91   openshift-master-0.qe3.kni.lab.eng.bos.redhat.com   <none>           <none>

However, it can reach that pod on the original NodeIP:


[root@openshift-master-0 ~]# crictl ps | grep console
2feb662e438a5       a0a41f9beddd6c92945501b55f31abe2bf301c7faa7178178066f3e80ee79dde                                                         5 hours ago         Running             console-operator                              54                  7cb0c0b903a62
[root@openshift-master-0 ~]# crictl exec -it 2feb662e438a5 bash
bash-4.4$ curl -k https://2620:52:0:1386::90

bash-4.4$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if77: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default 
    link/ether 0a:58:68:27:b6:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fd01:0:0:1::22/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::858:68ff:fe27:b6f3/64 scope link 
       valid_lft forever preferred_lft forever
bash-4.4$ curl -k https://[2620:52:0:1386::91]:443
<html>
  <head>

Comment 7 Tim Rozet 2021-04-07 13:26:14 UTC

To fix this, we need to watch for new IPs added to the host and then update policy routes to redirect traffic into mp0.
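
The "watch, then update" idea can be pictured as a reconcile step: compare the addresses currently on the host with the addresses that already have a route, then add or remove the difference. The sketch below is only an illustration of that set arithmetic under stated assumptions. It is not the ovn-kubernetes implementation. The function name `reconcile` and its parameters are hypothetical, and both the watching of address changes and the actual programming of policy routes are left out.

```python
import ipaddress


def reconcile(routed, host_addrs):
    """Return (to_add, to_remove) so that routed IPs match the host's IPs.

    routed     -- IP strings that already have a (hypothetical) policy route
    host_addrs -- IP strings currently assigned to the host
    Link-local and loopback addresses are skipped. This filter is an
    assumption made for the sketch.
    """
    current = {str(ipaddress.ip_address(r)) for r in routed}
    wanted = set()
    for addr in host_addrs:
        ip = ipaddress.ip_address(addr)
        if ip.is_link_local or ip.is_loopback:
            continue
        wanted.add(str(ip))
    return sorted(wanted - current), sorted(current - wanted)


# Example using the addresses from the br-ex output in this bug:
# only the original node IP is routed, then two more IPs show up on the host.
to_add, to_remove = reconcile(
    routed={"2620:52:0:1386::91"},
    host_addrs=[
        "2620:52:0:1386::97",
        "2620:52:0:1386::96",
        "2620:52:0:1386::91",
        "fe80::9a03:9bff:fe61:7179",
    ],
)
print("add:", to_add)
print("remove:", to_remove)
```

In this example the two newly added addresses (::96 and ::97) end up in `to_add`, and nothing needs removing.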

Comment 10 zhaozhanqi 2021-04-15 12:05:06 UTC

@sasha

Could you help verify this bug?

Comment 11 Alexander Chuzhoy 2021-04-15 14:59:47 UTC

Version: 4.8.0-0.nightly-2021-04-15-074503
The reported issue doesn't reproduce.


oc get pod -n openshift-console
NAME                         READY   STATUS    RESTARTS   AGE
console-5bc4546fd8-q4dvr     1/1     Running   1          30m
downloads-7bc5989474-qnscs   1/1     Running   0          34m



oc rsh -n openshift-console console-5bc4546fd8-q4dvr
sh-4.4$ curl [2620:52:0:1386::91]:443 -kv
* Rebuilt URL to: [2620:52:0:1386::91]:443/
*   Trying 2620:52:0:1386::91...
* TCP_NODELAY set
* Connected to 2620:52:0:1386::91 (2620:52:0:1386::91) port 443 (#0)
> GET / HTTP/1.1
> Host: [2620:52:0:1386::91]:443
> User-Agent: curl/7.61.1
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host 2620:52:0:1386::91 left intact
curl: (52) Empty reply from server
sh-4.4$ curl [2620:52:0:1386::96]:443 -kv
* Rebuilt URL to: [2620:52:0:1386::96]:443/
*   Trying 2620:52:0:1386::96...
* TCP_NODELAY set
* Connected to 2620:52:0:1386::96 (2620:52:0:1386::96) port 443 (#0)
> GET / HTTP/1.1
> Host: [2620:52:0:1386::96]:443
> User-Agent: curl/7.61.1
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host 2620:52:0:1386::96 left intact
curl: (52) Empty reply from server
sh-4.4$ curl [2620:52:0:1386::97]:443 -kv
* Rebuilt URL to: [2620:52:0:1386::97]:443/
*   Trying 2620:52:0:1386::97...
* TCP_NODELAY set
* Connected to 2620:52:0:1386::97 (2620:52:0:1386::97) port 443 (#0)
> GET / HTTP/1.1
> Host: [2620:52:0:1386::97]:443
> User-Agent: curl/7.61.1
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host 2620:52:0:1386::97 left intact
curl: (52) Empty reply from server
sh-4.4$
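
The three curl checks above only need to show that a TCP connection is established ("Connected to ..."). The same check could be scripted. The following is a minimal sketch using Python's standard `socket` module, assuming a Python 3 interpreter is available wherever it runs (the console image may not ship one). The helper name `can_connect` is hypothetical.

```python
import socket

# The node, ingress, and API addresses from this bug, all on port 443.
TARGETS = [
    ("2620:52:0:1386::91", 443),
    ("2620:52:0:1386::96", 443),
    ("2620:52:0:1386::97", 443),
]


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

Calling `can_connect(host, port)` for each entry in `TARGETS` from inside a pod would mirror the manual verification. A `False` result for ::96 or ::97 would correspond to the i/o timeout seen before the fix.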

Comment 14 errata-xmlrpc 2021-07-27 22:57:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
