Bug 1711127
Summary: | Failing Install due to Authentication Operator not deploying | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Eric Rich <erich>
Component: | Networking | Assignee: | Casey Callendrello <cdc>
Status: | CLOSED CANTFIX | QA Contact: | Meng Bo <bmeng>
Severity: | urgent | Docs Contact: |
Priority: | high | |
Version: | 4.1.0 | CC: | aos-bugs, bbennett, grajaiya, jokerman, mmccomas, nagrawal, slaznick, smunilla
Target Milestone: | --- | |
Target Release: | 4.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-06-18 13:40:39 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | tcpdump from master, iptables rules from hosts | |
Description
Eric Rich
2019-05-17 03:18:08 UTC
It should be noted that this is an install of http://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/4.1.0-rc.3/; it is not clear why the clusterversion shows rc.0.

While it appears that there are more problems in your cluster than the authentication operator not working ("failed to GET route: net/http: TLS handshake timeout" and "Some cluster operators are still updating: authentication, console, image-registry", as examples), I can say with 100% certainty that the logs for our operator don't match the expectations for the bits that should be part of the "4.1.0-rc.3" release. I am going to hand this to the release team to investigate whether the repository you installed from actually supplies the "4.1.0-rc.3" bits.

*** Bug 1711126 has been marked as a duplicate of this bug. ***

This seems to be an issue with the router?

$ curl -kv https://oauth-openshift.apps.thoran.dwarf.mine
* Rebuilt URL to: https://oauth-openshift.apps.thoran.dwarf.mine/
*   Trying 192.168.100.1...
* TCP_NODELAY set
* Connected to oauth-openshift.apps.thoran.dwarf.mine (192.168.100.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.thoran.dwarf.mine:443
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.thoran.dwarf.mine:443

It's not an issue with the front end LB!

$ timeout 5 curl -kv --resolve oauth-openshift.apps.thoran.dwarf.mine:443:192.168.100.20 https://oauth-openshift.apps.thoran.dwarf.mine
* Added oauth-openshift.apps.thoran.dwarf.mine:443:192.168.100.20 to DNS cache
* Rebuilt URL to: https://oauth-openshift.apps.thoran.dwarf.mine/
* Hostname oauth-openshift.apps.thoran.dwarf.mine was found in DNS cache
*   Trying 192.168.100.20...
* TCP_NODELAY set
* Connected to oauth-openshift.apps.thoran.dwarf.mine (192.168.100.20) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):

Hitting the router directly produces the same issue! Do we need to send this over to the ingress team?

This turned out to be a worker (host network) -> master (pod) communication problem. But it is running on a libvirt setup, so it is not a 4.1 blocker.

A lot of movement has happened on this BZ in the SDN space. While I will likely not capture everything, I am going to try to capture what I can.

> All data from this point on is from the master unless denoted.

We have been able to deduce that between the master and the node we have an MTU issue. This is seen by the following:
$ sudo ip a s ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:ac:18 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.10/24 brd 192.168.100.255 scope global dynamic noprefixroute ens3
       valid_lft 3134sec preferred_lft 3134sec
    inet6 fe80::2a52:38ee:a82d:94db/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

### From the Worker (below)

$ sudo ip a s ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:64:13:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.20/24 brd 192.168.100.255 scope global dynamic noprefixroute ens3
       valid_lft 2260sec preferred_lft 2260sec
    inet6 fe80::321c:df3c:35e7:342f/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

> Shows we have an MTU of 1500 on the main interface of both hosts.

However, if we try to reach the worker from the master we see the following:

$ ping -c 3 -M do -s 1472 -I ens3 192.168.100.20
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1472(1500) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

This suggests that the effective MTU is 1450 and not the 1500 it should be. This is confirmed with the following:

$ for x in 1423 1423; do ping -c 3 -M do -s $x -I ens3 192.168.100.20 -c 3; done
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1423(1451) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

--- 192.168.100.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 8ms

PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1423(1451) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

--- 192.168.100.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 54ms

It's not clear what causes the MTU to be set to 1450, but it seems to be set by the route cache:

$ ip route get 192.168.100.20
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
    cache expires 562sec mtu 1450

Created attachment 1571386 [details]
tcpdump from master
We can see in the attached pcap that the worker is stating the MTU is too large!
$ sudo tcpdump -i ens3 -tttt -w /tmp/$(hostname)-$(date +"%m-%d-%Y").pcap -s 0 & date; sudo ip route flush cache; echo "Cache Flushed"; date; ip route get 192.168.100.20; sleep 10; date; ip route get 192.168.100.20
[1] 33891
Mon May 20 19:26:50 UTC 2019
Cache Flushed
Mon May 20 19:26:50 UTC 2019
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
cache
tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
Mon May 20 19:27:00 UTC 2019
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
cache expires 592sec mtu 1450
$ tshark -r master-0-05-20-2019.pcap | grep -i icmp
1083 2.052417 192.168.100.20 → 192.168.100.10 ICMP 590 Destination unreachable (Fragmentation needed)
1085 2.052505 192.168.100.20 → 192.168.100.10 ICMP 590 Destination unreachable (Fragmentation needed)
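Not part of the original report, but as a minimal reproduction sketch the same diagnosis can be run without capturing a full pcap: filter directly for ICMP "fragmentation needed" messages and sweep DF-bit ping payload sizes to find the effective path MTU. The interface name (ens3) and worker address (192.168.100.20) are taken from the outputs above; the size range 1472-1400 assumes the real MTU is near 1500.

# In one terminal: capture only ICMP "fragmentation needed" (type 3, code 4) replies
$ sudo tcpdump -nn -i ens3 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'

# In another terminal: sweep payload sizes downward with the DF bit set.
# Payload + 28 bytes (20 IP header + 8 ICMP header) = on-wire packet size.
$ for size in $(seq 1472 -1 1400); do
    if ping -c 1 -W 1 -M do -s "$size" -I ens3 192.168.100.20 >/dev/null 2>&1; then
      echo "largest passing payload: $size (path MTU ~ $((size + 28)))"
      break
    fi
  done

With the cached MTU of 1450 seen above, this should report a largest passing payload of 1422.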
Created attachment 1571387 [details]
iptables rules from hosts
Uploading iptables rules for review.
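For reference, a minimal way to collect the same kind of rule dump on each node (these commands are not quoted from the bug itself, just the usual approach assuming shell access to the hosts):

$ sudo iptables-save > /tmp/$(hostname)-iptables-$(date +"%m-%d-%Y").rules
$ sudo iptables -L -n -v --line-numbers > /tmp/$(hostname)-iptables-verbose.txt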
This turned out to be a virtual NIC driver issue. Moving to the "virtio" NIC driver fixed it. In short, when I installed these systems they picked up the default NIC driver, "rtl8139", which apparently has an issue; this bug should be moved to libvirt networking to fix that. My scripts are in https://gitlab.cee.redhat.com/erich/ocp-4-libvirt-lab, and https://gitlab.cee.redhat.com/erich/ocp-4-libvirt-lab/commit/b2fb2fe9987ad2efb735050f4f1e4d15c1388b35#b4408aed2189118f01caf6c4da2f2d20ad53966f_107_106 shows the changes that worked around the issue.
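For anyone hitting the same symptom, a minimal sketch of checking and changing the NIC model on a libvirt guest follows. The domain name "master-0" is only an illustration and is not taken from this bug; virsh must be available on the hypervisor.

# Inspect the current NIC model on the guest (look for the <model type='...'/> element)
$ sudo virsh dumpxml master-0 | grep -A5 "<interface"

# Edit the domain XML and change the interface model, e.g.
#   <model type='rtl8139'/>  ->  <model type='virtio'/>
$ sudo virsh edit master-0

# Restart the guest so the new NIC model takes effect
$ sudo virsh shutdown master-0
$ sudo virsh start master-0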