Bug 1926345 - authentication operator stuck in degraded state during assisted-service installation
Summary: authentication operator stuck in degraded state during assisted-service installation
Keywords:
Status: CLOSED DUPLICATE of bug 1935539
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Duplicates: 1928426
Depends On:
Blocks:
 
Reported: 2021-02-08 16:57 UTC by Igal Tsoiref
Modified: 2022-06-16 12:09 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-22 14:54:00 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather OCP 4.7.0 Assisted Install (13.16 MB, application/gzip), 2021-03-08 19:53 UTC, JP Jung
Assisted Bare Metal Cluster 4.7.0 Must-Gather (16.59 MB, application/gzip), 2021-03-16 21:52 UTC, JP Jung

Comment 1 Standa Laznicka 2021-02-09 08:37:25 UTC
Please retry with 4.7.0-rc.0; 4.7.0-fc.0 is old.

Also, this BZ is missing __ALL__ of the information required in the template. Provide that information after you've tried an installation with the newer version and it still fails; close this BZ if it passes.

Comment 2 Igal Tsoiref 2021-02-09 17:10:00 UTC
We used 4.7.0-fc.2, I think, in this run. I tried to understand what went wrong and checked the logs, but I didn't find anything strange. The relevant auth pods were running. I connected to the auth operator container and tried to reach this API, and it failed on DNS resolution. On the other hand, oauth-openshift.apps.test.ivanlab.org was reachable from the host. I tried to find errors in the OCP DNS and mDNS services and didn't find any. Mainly, I had no clue how to proceed from this point.
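For reference, the in-pod check described here can be run roughly like this (a sketch; the namespace/deployment names are the standard ones for the authentication operator, and curl/getent may not both be present in the operator image):
```
# open a shell in the authentication-operator pod
oc -n openshift-authentication-operator rsh deployment/authentication-operator

# from inside the container: does the OAuth route resolve and answer?
getent hosts oauth-openshift.apps.test.ivanlab.org
curl -kI https://oauth-openshift.apps.test.ivanlab.org/healthz
```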

Comment 3 Standa Laznicka 2021-02-11 08:18:24 UTC
clusterversion/version from your must-gather:
```
  desired:
    image: quay.io/openshift-release-dev/ocp-release@sha256:2419f9cd3ea9bd114764855653012e305ade2527210d332bfdd6dbdae538bd66
    version: 4.7.0-fc
```

Either you uploaded the wrong must-gather, or the version is different from what you say it is. 4.7.0-fc.2 would still be old, though.

Please check the installation with the latest test version and report whether the problem still exists.

Comment 4 Igal Tsoiref 2021-02-16 19:14:28 UTC
*** Bug 1928426 has been marked as a duplicate of this bug. ***

Comment 5 Osher De Paz 2021-02-16 19:26:59 UTC
From the logs:

2021-02-12T00:38:13.931900859Z I0212 00:38:13.931825       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-console-operator", Name:"console-operator", UID:"873b2553-5ece-4fb6-988f-2bea92674c4e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/console changed: Degraded message changed from "" to "RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.lab3.epmlab.ca/health): Get \"https://console-openshift-console.apps.lab3.epmlab.ca/health\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

@jjung can you take a second look at the DNS configuration?
https://access.redhat.com/solutions/4985361
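As a sketch of the kind of records to verify (hostnames taken from the log line above; the *.apps wildcard names should resolve to the Ingress VIP and the api record to the API VIP):
```
dig +short console-openshift-console.apps.lab3.epmlab.ca
dig +short oauth-openshift.apps.lab3.epmlab.ca
dig +short api.lab3.epmlab.ca
```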

Comment 6 JP Jung 2021-02-16 22:06:22 UTC
Below is some information from my workstation and from one of the master nodes, @odepaz.

I have to point out that the SAME infrastructure (same VMs, same DNS config, same use of assisted install) deploys OCP 4.6.8 properly. Only the 4.7 pre-releases consistently fail. My understanding is that whatever should respond at https://console-openshift-console.apps.lab3.epmlab.ca/health never comes up (and the console never starts), and the installer dies after the one-hour timeout.
From the logs below, I think my DNS is correct.

For information, masters 1 to 3 are 192.168.50.101-103, workers 1 and 2 are 192.168.50.111 and .112, the API VIP is .105, and the Ingress VIP is .107

From my workstation:

$ export KUBECONFIG=/Users/jp/git/OCP/lab2/kubeconfig-noingress
$ ./oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                          False       True          True       3d
baremetal                                  4.7.0-fc.0   True        False         False      3d
cloud-credential                           4.7.0-fc.0   True        False         False      3d1h
cluster-autoscaler                         4.7.0-fc.0   True        False         False      3d
config-operator                            4.7.0-fc.0   True        False         False      3d
console                                    4.7.0-fc.0   False       False         True       2d
csi-snapshot-controller                    4.7.0-fc.0   True        False         False      3d
dns                                        4.7.0-fc.0   True        False         False      3d
etcd                                       4.7.0-fc.0   True        False         False      3d
image-registry                             4.7.0-fc.0   True        False         True       3d
ingress                                    4.7.0-fc.0   True        False         False      2d23h
insights                                   4.7.0-fc.0   True        False         False      3d
kube-apiserver                             4.7.0-fc.0   True        False         False      3d
kube-controller-manager                    4.7.0-fc.0   True        False         False      3d
kube-scheduler                             4.7.0-fc.0   True        False         False      3d
kube-storage-version-migrator              4.7.0-fc.0   True        False         False      2d23h
machine-api                                4.7.0-fc.0   True        False         True       3d
machine-approver                           4.7.0-fc.0   True        False         False      3d
machine-config                             4.7.0-fc.0   True        False         False      3d
marketplace                                4.7.0-fc.0   True        False         False      3d
monitoring                                              False       True          True       3d
network                                    4.7.0-fc.0   True        False         False      2d23h
node-tuning                                4.7.0-fc.0   True        False         False      3d
openshift-apiserver                        4.7.0-fc.0   False       False         False      2d7h
openshift-controller-manager               4.7.0-fc.0   True        False         False      2d1h
openshift-samples                          4.7.0-fc.0   False       False         False      3d
operator-lifecycle-manager                 4.7.0-fc.0   True        False         False      3d
operator-lifecycle-manager-catalog         4.7.0-fc.0   True        False         False      3d
operator-lifecycle-manager-packageserver   4.7.0-fc.0   False       False         False      5s
service-ca                                 4.7.0-fc.0   True        False         False      3d
storage                                    4.7.0-fc.0   True        False         False      3d


$ ping console-openshift-console.apps.lab2.epmlab.ca
PING console-openshift-console.apps.lab2.epmlab.ca (192.168.50.107): 56 data bytes
64 bytes from 192.168.50.107: icmp_seq=0 ttl=63 time=1.417 ms
64 bytes from 192.168.50.107: icmp_seq=1 ttl=63 time=1.439 ms
^C
--- console-openshift-console.apps.lab2.epmlab.ca ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.417/1.428/1.439/0.011 ms

$ curl -L https://console-openshift-console.apps.lab2.epmlab.ca/health -k
<html>
  <body>
    <div>
      <h1>Application is not available</h1>
      <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>

      <div class="alert alert-info">
        <p class="info">
          Possible reasons you are seeing this page:
        </p>
        <ul>
          <li>
            <strong>The host doesn't exist.</strong>
            Make sure the hostname was typed correctly and that a route matching this hostname exists.
          </li>
          <li>
            <strong>The host exists, but doesn't have a matching path.</strong>
            Check if the URL path was typed correctly and that the route was created using the desired path.
          </li>
          <li>
            <strong>Route and path matches, but all pods are down.</strong>
            Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
          </li>
        </ul>
      </div>
    </div>
  </body>
</html>



From one of the Masters:

$ ssh core@master1.lab2.epmlab.ca
The authenticity of host 'master1.lab2.epmlab.ca (192.168.50.101)' can't be established.
ECDSA key fingerprint is SHA256:RpfaJ5r3LUEKyOY6tdGA/O4a5KTQj9BIf6ZMISh6Jjc.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'master1.lab2.epmlab.ca,192.168.50.101' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 47.83.202012171901-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
Last login: Tue Feb 16 21:48:38 2021 from 192.168.193.11
[core@master1 ~]$ ping console-openshift-console.apps.lab2.epmlab.ca
PING console-openshift-console.apps.lab2.epmlab.ca (192.168.50.107) 56(84) bytes of data.
64 bytes from 192.168.50.107 (192.168.50.107): icmp_seq=1 ttl=64 time=0.219 ms
64 bytes from 192.168.50.107 (192.168.50.107): icmp_seq=2 ttl=64 time=0.200 ms
64 bytes from 192.168.50.107 (192.168.50.107): icmp_seq=3 ttl=64 time=0.225 ms
^C
--- console-openshift-console.apps.lab2.epmlab.ca ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 26ms
rtt min/avg/max/mdev = 0.200/0.214/0.225/0.020 ms




[core@master1 ~]$ dig oauth-openshift.apps.lab2.epmlab.ca

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8 <<>> oauth-openshift.apps.lab2.epmlab.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40452
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: dab1a3401301fc90 (echoed)
;; QUESTION SECTION:
;oauth-openshift.apps.lab2.epmlab.ca. IN	A

;; ANSWER SECTION:
oauth-openshift.apps.lab2.epmlab.ca. 30	IN A	192.168.50.107

;; Query time: 0 msec
;; SERVER: 192.168.50.101#53(192.168.50.101)
;; WHEN: Tue Feb 16 21:56:47 UTC 2021
;; MSG SIZE  rcvd: 127



[core@master1 ~]$ curl -L https://console-openshift-console.apps.lab2.epmlab.ca/health -k
<html>
  <body>
    <div>
      <h1>Application is not available</h1>
      <p>The application is currently not serving requests at this endpoint. It may not have been started or is still starting.</p>

      <div class="alert alert-info">
        <p class="info">
          Possible reasons you are seeing this page:
        </p>
        <ul>
          <li>
            <strong>The host doesn't exist.</strong>
            Make sure the hostname was typed correctly and that a route matching this hostname exists.
          </li>
          <li>
            <strong>The host exists, but doesn't have a matching path.</strong>
            Check if the URL path was typed correctly and that the route was created using the desired path.
          </li>
          <li>
            <strong>Route and path matches, but all pods are down.</strong>
            Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
          </li>
        </ul>
      </div>
    </div>
  </body>
</html>

Comment 7 Standa Laznicka 2021-02-17 11:08:47 UTC
Reminder: I'm still waiting for the results with the latest RC version. If these are not provided, I'm going to close this BZ next week. Or, if you can't use the latest RC, say why.

Some other notes:
- `Get \"https://console-openshift-console.apps.lab3.epmlab.ca/health\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"` does not appear to be a DNS problem.
- comment 6 is unnecessarily long and makes this BZ hard to read :-(

Comment 8 Igal Tsoiref 2021-03-03 19:04:13 UTC
We will go straight to GA; we waited until we could move to the new version, but that has been decided. If we see it in GA, we will reopen. Thanks.

Comment 9 JP Jung 2021-03-04 21:56:10 UTC
It fails with the 4.7.0 GA bits. Same error. The console never shows up, and the setup fails after the one-hour timeout waiting for the console.

2021-03-04, 4:24:33 p.m.	
critical Failed installing cluster lab2. Reason: Timeout while waiting for console pod to be running

Comment 10 Standa Laznicka 2021-03-08 10:20:20 UTC
Please post a fresh must-gather, then. Just so you know, this is 100% platform-specific; we're not seeing these issues anywhere else.
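As a reminder, the collection is simply (a sketch; the second form gathers extra network data and depends on the must-gather image providing the gather_network_logs script):
```
oc adm must-gather
oc adm must-gather -- gather_network_logs
```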

Comment 11 JP Jung 2021-03-08 19:53:44 UTC
Created attachment 1761820 [details]
must-gather OCP 4.7.0 Assisted Install

Comment 12 Igal Tsoiref 2021-03-10 18:43:06 UTC
@slaznick we have quite a few failures like this. The main problem is that it is not always possible for us to get must-gather logs from the failures. What is the status? Did you manage to check the latest attached logs? This issue is becoming very urgent on our side.

Comment 14 Standa Laznicka 2021-03-11 08:45:28 UTC
Looking at the must-gather, the cluster appears to be in ruins. There are:
- oauth/apiserver pods getting connection refused from the etcd pods for a long time (~8-10 min). They eventually figure out how to connect, but etcd should have been available for a long time by the time these apiservers start. This looks like an issue with the installer itself.
- the etcd pods all log `embed: rejected connection from "192.168.50.101:*" (error "EOF", ServerName "")`. This could be a networking/installer issue.
- etcd-operator logging connection refused from some of its etcd pods for 25 minutes. Looks like an issue with the installer.
- etcd-operator eventually getting proper connections to the etcd pods but still appearing to log `"FSyncController" controller failed to sync "key", err: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp: i/o timeout`. Sometimes the connection is refused. Possibly a networking issue.
- kube-apiserver pods logging "i/o timeouts" when they attempt to reach the oauth/openshift API servers; looks like a networking issue.

Based on the above observations, I'm moving this to openshift-sdn; I do not know how to read their logs, unfortunately.
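For anyone re-checking this triage, the log lines above can be located with a plain recursive grep over the extracted must-gather (a sketch; the exact directory layout varies slightly between must-gather images):
```
# run from the root of the extracted must-gather
grep -rn "rejected connection" . | grep openshift-etcd | head
grep -rn "dial tcp" . | grep openshift-etcd-operator | head
grep -rn "i/o timeout" . | grep openshift-kube-apiserver | head
```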

Comment 16 Dan Winship 2021-03-15 18:02:15 UTC
Magic 8 Ball says "Check your MTU". I'm guessing your VMware cluster is on top of a real network with a 1500 MTU but is adding its own VXLAN tunnel between the VMware hosts, resulting in a 1450 MTU for the OCP nodes themselves. You are not configuring the nodes to be aware of this fact[*], so they come up with the default ethernet MTU (1500), and eventually connections fail because they try to send too-large TCP packets over the openshift-sdn network, trip over the MTU mismatch, and get dropped, resulting in

      OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.136.0.57:6443/healthz": dial tcp 10.136.0.57:6443: i/o timeout (Client.Timeout exceeded while awaiting headers)

i.e. the SYN/ACK packets are transmitted correctly, because they are small, so the connection is established, but the TLS handshake response is a large packet (containing the server TLS cert), so it gets dropped, which from the client's perspective looks like "timed out because the server isn't responding".


[*] I don't know how you do that. It's *not* the MTU field in network.operator.openshift.io. You need to change the ignition config or something like that.
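A quick way to test that theory from one of the nodes (a sketch; the target address is one of the lab's master IPs, and 1472/1422 are 1500/1450 minus the 28 bytes of IP+ICMP headers):
```
# large, don't-fragment ping: fails if the real path MTU is below 1500
ping -c 3 -M do -s 1472 192.168.50.102

# should still succeed if the path MTU is at least 1450
ping -c 3 -M do -s 1422 192.168.50.102

# compare the MTUs the node interfaces actually came up with
ip link show | grep mtu
```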

Comment 17 JP Jung 2021-03-15 19:01:09 UTC
Dan, I will check my environment. The part that puzzles me is that OCP 4.6 deploys with Assisted Install and works perfectly (same environment, same VMs, same DNS, same networking...). The issues started with the 4.7 pre-releases. So feel free to point the finger at the environment, but something changed between 4.6 and 4.7.

Comment 18 JP Jung 2021-03-16 12:08:01 UTC
Take 2. My virtual switch MTU is set to the default 1500, I run all the VMs out of a single ESXi server, and there is no VXLAN tunnel. Traffic between cluster members does not leave the vSwitch (MTU 1500), as all the VMs are on a single host connected to the same vSwitch. And (again) the Assisted Installer works with OCP 4.6 and started to fail with OCP 4.7 in the same environment. If I am having this issue, I guess many folks running on VMware VMs will face similar problems.

Comment 20 Dan Winship 2021-03-16 14:42:49 UTC
I mean... that was just a guess before, but it _really_ looks MTU-y...

I'd say try tcpdumping one of the connections that's timing out, from both ends, to see if the problem is that small packets are making it through, but larger ones aren't.

I'm not sure what would have changed between 4.6 and 4.7. OpenShift SDN itself has barely changed at all, so it would have to be something somewhere else. Presumably VMware-related if the bug only affects VMware...
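Something along these lines, for example (a sketch; the pod IP/port come from the degraded condition quoted in comment 16, ens192 is just the typical VMware uplink name, and the exact filters depend on which connection you chase):
```
# on the node hosting the failing endpoint: watch the actual connection
tcpdump -ni any host 10.136.0.57 and port 6443

# on the node uplink: look for large VXLAN-encapsulated frames going missing
tcpdump -ni ens192 udp port 4789 and greater 1400
```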

Comment 21 JP Jung 2021-03-16 21:52:22 UTC
Created attachment 1763835 [details]
Assisted Bare Metal Cluster 4.7.0 Must-Gather

This is a new installation (single ESXi server, all VMs on that single host, using the same virtual switch, switch MTU 1500). I went and manually forced the NIC MTU of each cluster member to 1400 (on the live ISO upon boot, then again right after the reboot from disk). I monitored the installation and re-applied MTU 1400 whenever it changed. The veth interfaces all ended up with MTU 1350.
The installation failed with the usual:
"Cluster installation failed
Timeout while waiting for console to become available."
I can provide ssh access to the cluster if required.
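For reference, the manual change described above boils down to something like this on each node (a sketch; ens192 is just the typical interface name on VMware guests, and presumably openshift-sdn subtracting its ~50-byte VXLAN overhead is what turns the 1400 host MTU into the 1350 seen on the veths):
```
# find the primary interface and force its MTU down
ip link show
ip link set dev ens192 mtu 1400

# verify; this does not persist across reboots, hence the re-applying
ip link show ens192 | grep mtu
```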

Comment 22 Aniket Bhat 2021-03-18 21:55:11 UTC
@jjung Can you please provide ssh access? Do you need a public key?

Thanks,
Aniket.

