Bug 1711127
Summary: | Failing Install due to Authentication Operator not deploying | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Eric Rich <erich>
Component: | Networking | Assignee: | Casey Callendrello <cdc>
Status: | CLOSED CANTFIX | QA Contact: | Meng Bo <bmeng>
Severity: | urgent | Docs Contact: |
Priority: | high | |
Version: | 4.1.0 | CC: | aos-bugs, bbennett, grajaiya, jokerman, mmccomas, nagrawal, slaznick, smunilla
Target Milestone: | --- | |
Target Release: | 4.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-06-18 13:40:39 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | tcpdump from master, iptables rules from hosts | |
Description
Eric Rich
2019-05-17 03:18:08 UTC
It should be noted that this is an install of http://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.1/4.1.0-rc.3/; it is not clear why the clusterversion shows rc.0.

While it appears that there are more problems in your cluster than the authentication operator not working ("failed to GET route: net/http: TLS handshake timeout" and "Some cluster operators are still updating: authentication, console, image-registry", as examples), I can say with 100% certainty that the logs for our operator don't match the expectations for the bits that should be part of the "4.1.0-rc.3" release. I am going to hand this to the release team to investigate whether the repository you installed from actually supplies the "4.1.0-rc.3" bits.

*** Bug 1711126 has been marked as a duplicate of this bug. ***

This seems to be an issue with the router?

$ curl -kv https://oauth-openshift.apps.thoran.dwarf.mine
* Rebuilt URL to: https://oauth-openshift.apps.thoran.dwarf.mine/
*   Trying 192.168.100.1...
* TCP_NODELAY set
* Connected to oauth-openshift.apps.thoran.dwarf.mine (192.168.100.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.thoran.dwarf.mine:443
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.thoran.dwarf.mine:443

It's not an issue with the front end LB!

$ timeout 5 curl -kv --resolve oauth-openshift.apps.thoran.dwarf.mine:443:192.168.100.20 https://oauth-openshift.apps.thoran.dwarf.mine
* Added oauth-openshift.apps.thoran.dwarf.mine:443:192.168.100.20 to DNS cache
* Rebuilt URL to: https://oauth-openshift.apps.thoran.dwarf.mine/
* Hostname oauth-openshift.apps.thoran.dwarf.mine was found in DNS cache
*   Trying 192.168.100.20...
* TCP_NODELAY set
* Connected to oauth-openshift.apps.thoran.dwarf.mine (192.168.100.20) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):

Hitting the router directly produces the same issue! Do we need to send this over to the ingress team?

This turned out to be a worker (host network) -> master (pod) communication problem. But it is running on a libvirt setup, so it is not a 4.1 blocker.

A lot of movement has happened on this BZ in the SDN space. While I will likely not capture everything, I am going to try to capture what I can.

> All data from this point on is from the master unless denoted.

We have been able to deduce that between the master and the node we have an MTU issue. This is seen by the following:
$ sudo ip a s ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:ac:18 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.10/24 brd 192.168.100.255 scope global dynamic noprefixroute ens3
       valid_lft 3134sec preferred_lft 3134sec
    inet6 fe80::2a52:38ee:a82d:94db/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

### From the Worker (below)

$ sudo ip a s ens3
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:64:13:1c brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.20/24 brd 192.168.100.255 scope global dynamic noprefixroute ens3
       valid_lft 2260sec preferred_lft 2260sec
    inet6 fe80::321c:df3c:35e7:342f/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

> Shows we have an MTU of 1500 on the main interface of both hosts.

However, if we try to reach the worker from the master we see the following:

$ ping -c 3 -M do -s 1472 -I ens3 192.168.100.20
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1472(1500) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

This suggests that the effective MTU is 1450 and not the 1500 it should be. This is confirmed with the following:

$ for x in 1423 1423; do ping -c 3 -M do -s $x -I ens3 192.168.100.20 -c 3; done
PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1423(1451) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

--- 192.168.100.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 8ms

PING 192.168.100.20 (192.168.100.20) from 192.168.100.10 ens3: 1423(1451) bytes of data.
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450
ping: local error: Message too long, mtu=1450

--- 192.168.100.20 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 54ms

It's not clear what causes the MTU to be set to 1450, but it seems to be set by the route cache:

$ ip route get 192.168.100.20
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
    cache expires 562sec mtu 1450

Created attachment 1571386 [details]
tcpdump from master
We can see in the attached pcap that the worker is stating the MTU is too large!
$ sudo tcpdump -i ens3 -tttt -w /tmp/$(hostname)-$(date +"%m-%d-%Y").pcap -s 0 & date; sudo ip route flush cache; echo "Cache Flushed"; date; ip route get 192.168.100.20; sleep 10; date; ip route get 192.168.100.20
[1] 33891
Mon May 20 19:26:50 UTC 2019
Cache Flushed
Mon May 20 19:26:50 UTC 2019
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
cache
tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
Mon May 20 19:27:00 UTC 2019
192.168.100.20 dev ens3 src 192.168.100.10 uid 1000
cache expires 592sec mtu 1450
$ tshark -r master-0-05-20-2019.pcap | grep -i icmp
1083 2.052417 192.168.100.20 → 192.168.100.10 ICMP 590 Destination unreachable (Fragmentation needed)
1085 2.052505 192.168.100.20 → 192.168.100.10 ICMP 590 Destination unreachable (Fragmentation needed)
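Not part of the original report, but as a minimal reproduction sketch the same diagnosis can be run without capturing a full pcap: filter directly for ICMP "fragmentation needed" messages and sweep DF-bit ping payload sizes to find the effective path MTU. The interface name (ens3) and worker address (192.168.100.20) are taken from the outputs above; the size range 1472-1400 assumes the real MTU is near 1500.

# In one terminal: capture only ICMP "fragmentation needed" (type 3, code 4) replies
$ sudo tcpdump -nn -i ens3 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'

# In another terminal: sweep payload sizes downward with the DF bit set.
# Payload + 28 bytes (20 IP header + 8 ICMP header) = on-wire packet size.
$ for size in $(seq 1472 -1 1400); do
    if ping -c 1 -W 1 -M do -s "$size" -I ens3 192.168.100.20 >/dev/null 2>&1; then
      echo "largest passing payload: $size (path MTU ~ $((size + 28)))"
      break
    fi
  done

With the cached MTU of 1450 seen above, this should report a largest passing payload of 1422.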
Created attachment 1571387 [details]
iptables rules from hosts
Uploading iptables rules for review.
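For reference, a minimal way to collect the same kind of rule dump on each node (these commands are not quoted from the bug itself, just the usual approach assuming shell access to the hosts):

$ sudo iptables-save > /tmp/$(hostname)-iptables-$(date +"%m-%d-%Y").rules
$ sudo iptables -L -n -v --line-numbers > /tmp/$(hostname)-iptables-verbose.txt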
This turned out to be a virtual NIC driver issue. Moving to the "virtio" NIC driver fixed it. In short, when I installed these systems they picked up the default NIC driver, "rtl8139", which apparently has an issue; this bug should be moved to libvirt networking to fix that. My scripts are in https://gitlab.cee.redhat.com/erich/ocp-4-libvirt-lab, and https://gitlab.cee.redhat.com/erich/ocp-4-libvirt-lab/commit/b2fb2fe9987ad2efb735050f4f1e4d15c1388b35#b4408aed2189118f01caf6c4da2f2d20ad53966f_107_106 shows the changes that worked around the issue.
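For anyone hitting the same symptom, a minimal sketch of checking and changing the NIC model on a libvirt guest follows. The domain name "master-0" is only an illustration and is not taken from this bug; virsh must be available on the hypervisor.

# Inspect the current NIC model on the guest (look for the <model type='...'/> element)
$ sudo virsh dumpxml master-0 | grep -A5 "<interface"

# Edit the domain XML and change the interface model, e.g.
#   <model type='rtl8139'/>  ->  <model type='virtio'/>
$ sudo virsh edit master-0

# Restart the guest so the new NIC model takes effect
$ sudo virsh shutdown master-0
$ sudo virsh start master-0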