Description of problem:
The authentication operator fails to deploy or start up.

Version-Release number of selected component (if applicable):
$ oc --config test_cluster/auth/kubeconfig get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          92m     Unable to apply 4.1.0-rc.0: some cluster operators have not yet rolled out

How reproducible:
Only seen this once.

Steps to Reproduce:
1. https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-approve-csrs_installing-bare-metal

Actual results:
> E0517 03:07:03.918897 1 controller.go:129] {๐ผ ๐ผ} failed with: error checking current version: unable to check route health: failed to GET route: net/http: TLS handshake timeout
> I0517 03:07:03.920730 1 status_controller.go:159] clusteroperator/authentication diff {"status":{"conditions":[{"lastTransitionTime":"2019-05-17T03:07:03Z","message":"Failing: error checking current version: unable to check route health: failed to GET route: net/http: TLS handshake timeout","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-05-17T01:48:36Z","reason":"NoData","status":"Unknown","type":"Progressing"},{"lastTransitionTime":"2019-05-17T01:48:36Z","reason":"NoData","status":"Unknown","type":"Available"},{"lastTransitionTime":"2019-05-17T01:48:36Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
> I0517 03:07:03.927443 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-authentication-operator", Name:"openshift-authentication-operator", UID:"cb20bece-7844-11e9-9275-525400fb04b2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator authentication changed: Failing changed from True to False ("Failing: error checking current version: unable to check route health: failed to GET route: net/http: TLS handshake timeout")
> E0517 03:07:13.990209 1 controller.go:129] {๐ผ ๐ผ} failed with: error checking current version: unable to check route health: failed to GET route: net/http: TLS handshake timeout

Expected results:
The authentication operator should report which endpoint it is failing to reach.

Additional info:
This operator's failed deployment seems to be holding up the console deployment:

I0517 03:15:26.663474 1 option.go:62] Console: handling update openshift-console/console: /apis/route.openshift.io/v1/namespaces/openshift-console/routes/console
time="2019-05-17T03:15:26Z" level=info msg="started syncing operator \"cluster\" (2019-05-17 03:15:26.671787061 +0000 UTC m=+1242.374271175)"
time="2019-05-17T03:15:26Z" level=info msg="console is in a managed state."
time="2019-05-17T03:15:26Z" level=info msg="running sync loop 4.0.0"
time="2019-05-17T03:15:26Z" level=info msg="validating console route..."
time="2019-05-17T03:15:26Z" level=info msg="route ingress 'default' found and admitted, host: console-openshift-console.apps.thoran.dwarf.mine \n"
time="2019-05-17T03:15:26Z" level=info msg="route exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg="validating console service..."
time="2019-05-17T03:15:26Z" level=info msg="service exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg="validating console configmap..."
time="2019-05-17T03:15:26Z" level=info msg="generated console config yaml:"
time="2019-05-17T03:15:26Z" level=info msg="apiVersion: console.openshift.io/v1\nauth:\n clientID: console\n clientSecretFile: /var/oauth-config/clientSecret\n logoutRedirect: \"\"\n oauthEndpointCAFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt\nclusterInfo:\n consoleBaseAddress: https://console-openshift-console.apps.thoran.dwarf.mine\n consoleBasePath: \"\"\n masterPublicURL: https://api.thoran.dwarf.mine:6443\ncustomization:\n branding: ocp\n documentationBaseURL: https://docs.openshift.com/container-platform/4.1/\nkind: ConsoleConfig\nservingInfo:\n bindAddress: https://0.0.0.0:8443\n certFile: /var/serving-cert/tls.crt\n keyFile: /var/serving-cert/tls.key\n \n"
time="2019-05-17T03:15:26Z" level=info msg="configmap exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg="validating service-ca configmap..."
time="2019-05-17T03:15:26Z" level=info msg="service-ca configmap exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg="validating oauth secret..."
time="2019-05-17T03:15:26Z" level=info msg="secret exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg="validating oauthclient..."
time="2019-05-17T03:15:26Z" level=info msg="route ingress 'default' found and admitted, host: console-openshift-console.apps.thoran.dwarf.mine \n"
time="2019-05-17T03:15:26Z" level=info msg="oauthclient exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg="validating console deployment..."
time="2019-05-17T03:15:26Z" level=info msg="deployment exists and is in the correct state"
time="2019-05-17T03:15:26Z" level=info msg=-----------------------
time="2019-05-17T03:15:26Z" level=info msg="sync loop 4.0.0 resources updated: false \n"
time="2019-05-17T03:15:26Z" level=info msg=-----------------------
time="2019-05-17T03:15:26Z" level=info msg="deployment is available: false \n"
time="2019-05-17T03:15:26Z" level=info msg="sync_v400: updating console status"
time="2019-05-17T03:15:26Z" level=info msg="route ingress 'default' found and admitted, host: console-openshift-console.apps.thoran.dwarf.mine \n"
time="2019-05-17T03:15:26Z" level=info msg="updating console.config.openshift.io with hostname: console-openshift-console.apps.thoran.dwarf.mine \n"
time="2019-05-17T03:15:26Z" level=info msg="sync loop 4.0.0 complete:"
time="2019-05-17T03:15:26Z" level=info msg="\t service changed: false"
time="2019-05-17T03:15:26Z" level=info msg="\t route changed: false"
time="2019-05-17T03:15:26Z" level=info msg="\t configMap changed: false"
time="2019-05-17T03:15:26Z" level=info msg="\t secret changed: false"
time="2019-05-17T03:15:26Z" level=info msg="\t oauth changed: false"
time="2019-05-17T03:15:26Z" level=info msg="\t deployment changed: false"
time="2019-05-17T03:15:26Z" level=info msg=Operator.Status.Conditions
time="2019-05-17T03:15:26Z" level=info msg="Status.Condition.UnsupportedConfigOverridesUpgradeable: True"
time="2019-05-17T03:15:26Z" level=info msg="Status.Condition.Available: False | (NoPodsAvailable) No pods available for console deployment."
time="2019-05-17T03:15:26Z" level=info msg="Status.Condition.Progressing: True | (SyncLoopProgressing) Moving to version 4.1.0"
time="2019-05-17T03:15:26Z" level=info msg="Status.Condition.Failing: False"
time="2019-05-17T03:15:26Z" level=info msg="finished syncing operator \"cluster\" (341.775µs) \n\n"

> tail -f test_cluster/.openshift_install.log
> time="2019-05-16T22:42:54-04:00" level=info msg="Waiting up to 30m0s for the cluster at https://api.thoran.dwarf.mine:6443 to initialize..."
> time="2019-05-16T22:42:54-04:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry"
> time="2019-05-16T22:43:14-04:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 98% complete"
> time="2019-05-16T22:46:29-04:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console"
> time="2019-05-16T23:12:54-04:00" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"
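On the "Expected results" point: a minimal sketch of the error-wrapping pattern being asked for, i.e. including the endpoint being probed in the failure message. This is Python for brevity (the operator itself is written in Go), and all names and the URL are hypothetical:

```python
# Sketch only: hypothetical names, illustrating "wrap the error with the
# endpoint" rather than the operator's actual implementation.
def check_route_health(route_url, fetch):
    try:
        fetch(route_url)
    except OSError as exc:
        # Include the URL that was being probed so the failure is actionable.
        raise RuntimeError(f"failed to GET route {route_url}: {exc}") from exc

def failing_fetch(url):
    # Stand-in for an HTTP GET that times out during the TLS handshake.
    raise OSError("net/http: TLS handshake timeout")

try:
    check_route_health("https://oauth-openshift.apps.example.test/healthz", failing_fetch)
except RuntimeError as err:
    print(err)
# prints: failed to GET route https://oauth-openshift.apps.example.test/healthz: net/http: TLS handshake timeout
```

With the URL in the message, a reader can immediately run the same probe by hand (as is done with curl further down in this bug).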
It should be noted that this is an install of http://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/4.1.0-rc.3/; I am not sure why the clusterversion shows rc.0.
While it appears that there are more problems in your cluster than just the authentication operator not working ("failed to GET route: net/http: TLS handshake timeout" and "Some cluster operators are still updating: authentication, console, image-registry", for example), I can say with certainty that the logs from our operator do not match what should be part of the "4.1.0-rc.3" release. I am going to hand this to the release team to investigate whether the repository you installed from actually serves the "4.1.0-rc.3" bits.
*** Bug 1711126 has been marked as a duplicate of this bug. ***
This seems to be an issue with the router?

$ curl -kv https://oauth-openshift.apps.thoran.dwarf.mine
* Rebuilt URL to: https://oauth-openshift.apps.thoran.dwarf.mine/
*   Trying 192.168.100.1...
* TCP_NODELAY set
* Connected to oauth-openshift.apps.thoran.dwarf.mine (192.168.100.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.thoran.dwarf.mine:443
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.thoran.dwarf.mine:443
It's not an issue with the front-end LB!

$ timeout 5 curl -kv --resolve oauth-openshift.apps.thoran.dwarf.mine:443:192.168.100.20 https://oauth-openshift.apps.thoran.dwarf.mine
* Added oauth-openshift.apps.thoran.dwarf.mine:443:192.168.100.20 to DNS cache
* Rebuilt URL to: https://oauth-openshift.apps.thoran.dwarf.mine/
* Hostname oauth-openshift.apps.thoran.dwarf.mine was found in DNS cache
*   Trying 192.168.100.20...
* TCP_NODELAY set
* Connected to oauth-openshift.apps.thoran.dwarf.mine (192.168.100.20) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):

Hitting the router directly produces the same issue! Do we need to send this over to the ingress team?
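The two curl runs above draw the key distinction: the TCP connection succeeds in both cases, but the TLS handshake never completes. A small Python sketch of the same probe, where `connect_ip` plays the role of curl's `--resolve` and the hostnames are placeholders:

```python
# Sketch: distinguish "TCP connect failed" from "TCP connected but the TLS
# handshake did not complete" (the symptom seen in both curl runs above).
import socket
import ssl

def probe_tls(host, port, connect_ip=None, timeout=5.0):
    """Return 'tcp-failed', 'tls-failed', or 'ok'."""
    try:
        # connect_ip overrides DNS, like curl --resolve.
        sock = socket.create_connection((connect_ip or host, port), timeout=timeout)
    except OSError:
        return "tcp-failed"
    ctx = ssl.create_default_context()
    ctx.check_hostname = False      # mimic curl -k (no peer verification)
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "ok"
    except (ssl.SSLError, OSError):
        # Handshake timeout or protocol error lands here.
        return "tls-failed"
```

A "tls-failed" result against both the LB address and the router node directly is what points the suspicion past the load balancer.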
This turned out to be a worker (host network) -> master (pod) communication problem. But since it occurs on a libvirt setup, it is not a 4.1 blocker.
A lot of movement has happened on this BZ in the SDN space. I will likely not capture everything, but I am going to try to capture what I can.

> All data from this point on is from the master unless noted otherwise.

We have been able to deduce that there is an MTU issue between the master and the node. This is seen by the following:

$ sudo ip a s ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:ac:18 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.10/24 brd 192.168.100.255 scope global dynamic noprefixroute ens3
       valid_lft 3134sec preferred_lft 3134sec
    inet6 fe80::2a52:38ee:a82d:94db/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

### From the worker (below)
$ sudo ip a s ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:64:13:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.20/24 brd 192.168.100.255 scope global dynamic noprefixroute ens3
       valid_lft 2260sec preferred_lft 2260sec
    inet6 fe80::321c:df3c:35e7:342f/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

> This shows an MTU of 1500 on the main interface of both hosts.

However, if we try to reach the worker from the master we see the following:

$ ping -c 3 -M do -s 1472 -I ens3 192.168.100.20
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1472(1500) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

This suggests that the effective MTU is 1450, not the 1500 it should be. This is confirmed with the following:

$ for x in 1423 1423; do ping -c 3 -M do -s $x -I ens3 192.168.100.20 -c 3; done
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1423(1451) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

--- 192.168.100.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 8ms
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1423(1451) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

--- 192.168.100.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 54ms

It's not clear what causes the MTU to be set to 1450, but it seems to be set by the route cache:

$ ip route get 192.168.100.20
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
    cache expires 562sec mtu 1450
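The packet sizes in the ping runs above follow directly from header overhead. A quick sanity check of the arithmetic, assuming a 20-byte IPv4 header (no options) and an 8-byte ICMP header:

```python
# ping's "-s N" payload travels in an IP packet of N + 8 (ICMP) + 20 (IPv4) bytes.
IP_HDR = 20
ICMP_HDR = 8

def wire_size(payload):
    """IP packet size for a given `ping -s` payload, as ping prints it."""
    return payload + ICMP_HDR + IP_HDR

def max_payload(path_mtu):
    """Largest `ping -s` payload that fits the path MTU unfragmented."""
    return path_mtu - ICMP_HDR - IP_HDR

print(wire_size(1472))    # 1500: exactly fills the nominal interface MTU
print(wire_size(1423))    # 1451: one byte over the cached 1450 path MTU, so it fails
print(max_payload(1450))  # 1422: the largest payload that should go through
```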
Created attachment 1571386 [details]
tcpdump from master

We can see in the attached pcap that the worker is stating the MTU is too large:

$ sudo tcpdump -i ens3 -tttt -w /tmp/$(hostname)-$(date +"%m-%d-%Y").pcap -s 0 &
$ date; sudo ip route flush cache; echo "Cache Flushed"; date; ip route get 192.168.100.20; sleep 10; date; ip route get 192.168.100.20
[1] 33891
Mon May 20 19:26:50 UTC 2019
Cache Flushed
Mon May 20 19:26:50 UTC 2019
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
    cache
tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
Mon May 20 19:27:00 UTC 2019
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
    cache expires 592sec mtu 1450

$ tshark -r master-0-05-20-2019.pcap | grep -i icmp
 1083 2.052417 192.168.100.20 → 192.168.100.10 ICMP 590 Destination unreachable (Fragmentation needed)
 1085 2.052505 192.168.100.20 → 192.168.100.10 ICMP 590 Destination unreachable (Fragmentation needed)
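This also explains why `ip route get` shows `mtu 1450` only after traffic has flowed: the kernel clamps its cached per-destination MTU downward when it receives an ICMP "fragmentation needed" message like the two in the capture. A toy model of that clamping behavior (illustrative only, not kernel code):

```python
# Toy model of per-destination path-MTU learning; not actual kernel code.
class RouteCacheEntry:
    def __init__(self, link_mtu=1500):
        # A fresh entry has no learned path MTU and falls back to the link MTU,
        # which is why "ip route get" shows no mtu right after a cache flush.
        self.mtu = link_mtu

    def on_frag_needed(self, advertised_mtu):
        # ICMP "fragmentation needed" only ever lowers the cached value.
        self.mtu = min(self.mtu, advertised_mtu)

entry = RouteCacheEntry()
entry.on_frag_needed(1450)  # worker's "Destination unreachable (Fragmentation needed)"
print(entry.mtu)            # 1450, matching `ip route get 192.168.100.20` above
```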
Created attachment 1571387 [details]
iptables rules from hosts

Uploading the iptables rules from the hosts for review.
This turned out to be a virtual NIC driver issue: moving to the "virtio" NIC driver fixed it. In short, when I installed these systems they picked up the default NIC driver, "rtl8139", which apparently has this issue. This bug should be moved to libvirt networking to fix that issue. My scripts are in https://gitlab.cee.redhat.com/erich/ocp-4-libvirt-lab, and https://gitlab.cee.redhat.com/erich/ocp-4-libvirt-lab/commit/b2fb2fe9987ad2efb735050f4f1e4d15c1388b35#b4408aed2189118f01caf6c4da2f2d20ad53966f_107_106 shows the changes with which I worked around the issue.
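For reference, the relevant setting is the NIC model on the guest's interface definition in the libvirt domain XML. A hypothetical fragment along the lines of the linked commit (the source network name here is a placeholder):

```xml
<!-- Guest NIC definition: without an explicit <model>, libvirt can fall back
     to an emulated NIC such as rtl8139, which showed the MTU problem above.
     Pinning the model to virtio avoids it. -->
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
</interface>
```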