Please retry with 4.7.0-rc.0; 4.7.0-fc.0 is old. Also, this BZ is missing __ALL__ of the information required in the template. Provide that information after you've tried an installation with the newer version and it still fails; close this BZ if it passes.
We used 4.7.0-fc.2, I think, in this run. I tried to understand what went wrong and checked the logs, but didn't find anything strange. The relevant auth pods were running. I connected to the auth operator container and tried to reach this API, and it failed on DNS resolution. On the other hand, oauth-openshift.apps.test.ivanlab.org was reachable from the host. I tried to find errors in the OCP DNS and mDNS services, but didn't find any. Mainly I had no clue how to proceed from this point.
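A rough sketch of the check I mean, comparing resolution inside the operator pod vs. on the host (the route host is from my cluster; the pod/deployment names and the availability of `getent` in the image are assumptions and may need adjusting):

```shell
# Hypothetical diagnostic: compare DNS resolution for the oauth route
# from inside the authentication-operator pod vs. from the host.
ROUTE=oauth-openshift.apps.test.ivanlab.org

# From inside the operator pod (adjust deployment/namespace to your cluster):
oc -n openshift-authentication-operator exec deploy/authentication-operator -- \
  getent hosts "$ROUTE" || echo "in-pod resolution FAILED"

# From the host, for comparison:
getent hosts "$ROUTE" || echo "host resolution FAILED"
```

In my case the in-pod lookup failed while the host lookup succeeded.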
clusterversion/version from your must-gather:
```
desired:
  image: quay.io/openshift-release-dev/ocp-release@sha256:2419f9cd3ea9bd114764855653012e305ade2527210d332bfdd6dbdae538bd66
  version: 4.7.0-fc
```
Either you uploaded the wrong must-gather, or the version is different from what you say it is. 4.7.0-fc.2 would still be old, though. Please check the installation with the latest test version and report whether the problem still exists.
*** Bug 1928426 has been marked as a duplicate of this bug. ***
From the logs:

```
2021-02-12T00:38:13.931900859Z I0212 00:38:13.931825 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-console-operator", Name:"console-operator", UID:"873b2553-5ece-4fb6-988f-2bea92674c4e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/console changed: Degraded message changed from "" to "RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.lab3.epmlab.ca/health): Get \"https://console-openshift-console.apps.lab3.epmlab.ca/health\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
```

@jjung can you take a second look at the DNS configuration? https://access.redhat.com/solutions/4985361
Below is some information from my workstation and one of the master nodes. @odepaz I have to outline that the SAME infrastructure (same VMs, same DNS config, same use of Assisted Install) deploys OCP 4.6.8 properly; only 4.7 pre-releases consistently fail. My understanding is that whatever should respond at https://console-openshift-console.apps.lab3.epmlab.ca/health never makes it (and the console never starts), and the installer dies after the one-hour timeout. From the output below I think my DNS is proper.

For reference: masters 1 to 3 are 192.168.50.101-103, workers 1 & 2 are 192.168.50.111 & .112, the API VIP is .105, and the Ingress VIP is .107.

From my workstation:

```
$ export KUBECONFIG=/Users/jp/git/OCP/lab2/kubeconfig-noingress
$ ./oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                          False       True          True       3d
baremetal                                  4.7.0-fc.0   True        False         False      3d
cloud-credential                           4.7.0-fc.0   True        False         False      3d1h
cluster-autoscaler                         4.7.0-fc.0   True        False         False      3d
config-operator                            4.7.0-fc.0   True        False         False      3d
console                                    4.7.0-fc.0   False       False         True       2d
csi-snapshot-controller                    4.7.0-fc.0   True        False         False      3d
dns                                        4.7.0-fc.0   True        False         False      3d
etcd                                       4.7.0-fc.0   True        False         False      3d
image-registry                             4.7.0-fc.0   True        False         True       3d
ingress                                    4.7.0-fc.0   True        False         False      2d23h
insights                                   4.7.0-fc.0   True        False         False      3d
kube-apiserver                             4.7.0-fc.0   True        False         False      3d
kube-controller-manager                    4.7.0-fc.0   True        False         False      3d
kube-scheduler                             4.7.0-fc.0   True        False         False      3d
kube-storage-version-migrator              4.7.0-fc.0   True        False         False      2d23h
machine-api                                4.7.0-fc.0   True        False         True       3d
machine-approver                           4.7.0-fc.0   True        False         False      3d
machine-config                             4.7.0-fc.0   True        False         False      3d
marketplace                                4.7.0-fc.0   True        False         False      3d
monitoring                                              False       True          True       3d
network                                    4.7.0-fc.0   True        False         False      2d23h
node-tuning                                4.7.0-fc.0   True        False         False      3d
openshift-apiserver                        4.7.0-fc.0   False       False         False      2d7h
openshift-controller-manager               4.7.0-fc.0   True        False         False      2d1h
openshift-samples                          4.7.0-fc.0   False       False         False      3d
operator-lifecycle-manager                 4.7.0-fc.0   True        False         False      3d
operator-lifecycle-manager-catalog         4.7.0-fc.0   True        False         False      3d
operator-lifecycle-manager-packageserver   4.7.0-fc.0   False       False         False      5s
service-ca                                 4.7.0-fc.0   True        False         False      3d
storage                                    4.7.0-fc.0   True        False         False      3d

$ ping console-openshift-console.apps.lab2.epmlab.ca
PING console-openshift-console.apps.lab2.epmlab.ca (192.168.50.107): 56 data bytes
64 bytes from 192.168.50.107: icmp_seq=0 ttl=63 time=1.417 ms
64 bytes from 192.168.50.107: icmp_seq=1 ttl=63 time=1.439 ms
^C
--- console-openshift-console.apps.lab2.epmlab.ca ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.417/1.428/1.439/0.011 ms

$ curl -L https://console-openshift-console.apps.lab2.epmlab.ca/health -k
```

The curl returns the router's default error page: "Application is not available. The application is currently not serving requests at this endpoint. It may not have been started or is still starting." (Possible reasons listed on the page: the host doesn't exist; the host exists but doesn't have a matching path; route and path match but all pods are down.)

From one of the masters:

```
$ ssh core.epmlab.ca
The authenticity of host 'master1.lab2.epmlab.ca (192.168.50.101)' can't be established.
ECDSA key fingerprint is SHA256:RpfaJ5r3LUEKyOY6tdGA/O4a5KTQj9BIf6ZMISh6Jjc.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'master1.lab2.epmlab.ca,192.168.50.101' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 47.83.202012171901-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html
---
Last login: Tue Feb 16 21:48:38 2021 from 192.168.193.11

[core@master1 ~]$ ping console-openshift-console.apps.lab2.epmlab.ca
PING console-openshift-console.apps.lab2.epmlab.ca (192.168.50.107) 56(84) bytes of data.
64 bytes from 192.168.50.107 (192.168.50.107): icmp_seq=1 ttl=64 time=0.219 ms
64 bytes from 192.168.50.107 (192.168.50.107): icmp_seq=2 ttl=64 time=0.200 ms
64 bytes from 192.168.50.107 (192.168.50.107): icmp_seq=3 ttl=64 time=0.225 ms
^C
--- console-openshift-console.apps.lab2.epmlab.ca ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 26ms
rtt min/avg/max/mdev = 0.200/0.214/0.225/0.020 ms

[core@master1 ~]$ dig oauth-openshift.apps.lab2.epmlab.ca

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8 <<>> oauth-openshift.apps.lab2.epmlab.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40452
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: dab1a3401301fc90 (echoed)
;; QUESTION SECTION:
;oauth-openshift.apps.lab2.epmlab.ca. IN A

;; ANSWER SECTION:
oauth-openshift.apps.lab2.epmlab.ca. 30 IN A 192.168.50.107

;; Query time: 0 msec
;; SERVER: 192.168.50.101#53(192.168.50.101)
;; WHEN: Tue Feb 16 21:56:47 UTC 2021
;; MSG SIZE  rcvd: 127

[core@master1 ~]$ curl -L https://console-openshift-console.apps.lab2.epmlab.ca/health -k
```

The curl from the master returns the same "Application is not available" page as above.
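When the router serves that "Application is not available" page, a quick way to tell whether the route itself is at fault or the backing pods never came up is to check the route's endpoints. A sketch, assuming the standard `openshift-console` namespace and resource names:

```shell
# Hedged sketch: does the console route have any live endpoints behind it?
# If the endpoints list is empty, the router is fine and the console pods
# are the problem (matching the "all pods are down" case on the error page).
oc -n openshift-console get route console -o wide
oc -n openshift-console get pods -o wide
oc -n openshift-console get endpoints console
```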
Reminder: I'm still waiting for the results with the latest RC version. If these are not provided, I'm going to close this BZ next week. Or, if you can't use the latest RC, say why.

Some other notes:
- `Get \"https://console-openshift-console.apps.lab3.epmlab.ca/health\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"` does not appear to be a DNS problem.
- comment 6 is unnecessarily long and makes this BZ hard to read :-(
We will go straight to GA; we waited until we moved to the new version, but that was decided. If we see it in GA, we will reopen. Thanks
Fails with 4.7.0 GA bits. Same error: the console never shows up, and setup fails after the one-hour timeout waiting for the console.

```
2021-03-04, 4:24:33 p.m.  critical  Failed installing cluster lab2. Reason: Timeout while waiting for console pod to be running
```
Please post a fresh must-gather, then. Just so you know, this is 100% platform-specific; we're not seeing these issues anywhere else.
Created attachment 1761820 [details] must-gather OCP 4.7.0 Assisted Install
@slaznick we have relatively many failures like this. The main problem is that we are not always able to get must-gather logs; this one is from one of the failures. What is the status? Did you manage to check the latest attached logs? This issue is becoming very urgent on our side.
Looking at the must-gather, the cluster appears to be in ruins. There are:

- oauth/apiserver pods getting connection refused from etcd pods for a long time (~8-10 minutes). They eventually manage to connect, but etcd should have been available long before these apiservers started. This looks like an issue with the installer itself.
- etcd pods all logging `embed: rejected connection from "192.168.50.101:*" (error "EOF", ServerName "")`. This could be a networking/installer issue.
- etcd-operator logging connection refused from some of its etcd pods for 25 minutes. Looks like an installer issue.
- etcd-operator eventually getting proper connections to its etcd pods but still logging `"FSyncController" controller failed to sync "key", err: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp: i/o timeout`. Sometimes the connection is refused instead. Possibly a networking issue.
- kube-apiserver pods logging "i/o timeout" when they attempt to reach the oauth/openshift API servers; looks like a networking issue.

Based on the above observations, I'm moving this to openshift-sdn; unfortunately, I do not know how to read their logs.
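For anyone repeating this triage, a hedged sketch of how the symptoms above can be grepped out of a must-gather (the `namespaces/` layout is the usual must-gather structure; the `MG` path is an assumption, adjust to your extracted archive):

```shell
# Hypothetical triage: pull the connectivity symptoms out of a must-gather.
MG=./must-gather.local

# etcd clients being refused / etcd rejecting connections
grep -rn 'connection refused'  "$MG"/namespaces/openshift-etcd* 2>/dev/null | head
grep -rn 'rejected connection' "$MG"/namespaces/openshift-etcd* 2>/dev/null | head

# apiservers timing out on the pod network
grep -rn 'i/o timeout' "$MG"/namespaces/openshift-kube-apiserver 2>/dev/null | head
```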
Magic 8 Ball says "Check your MTU".

I'm guessing your VMware cluster is on top of a real network with a 1500 MTU, but is adding its own VXLAN tunnel between the VMware hosts, resulting in a 1450 MTU for the OCP nodes themselves. You are not configuring the nodes to be aware of this fact[*], so they come up with the default ethernet MTU (1500), and then eventually connections fail because they are trying to send too-large TCP packets over the openshift-sdn network, trip over the MTU mismatch, and get dropped, resulting in:

```
OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.136.0.57:6443/healthz": dial tcp 10.136.0.57:6443: i/o timeout (Client.Timeout exceeded while awaiting headers)
```

i.e., the SYN/ACK packets are transmitted correctly because they are small, so the connection is established, but the TLS handshake response is a large packet (it contains the server TLS cert), so it gets dropped, which from the client's perspective looks like "timed out because the server isn't responding".

[*] I don't know how you do that. It's *not* the MTU field in network.operator.openshift.io. You need to change the ignition config or something like that.
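One rough way to test this theory from a node is don't-fragment pings between nodes (`-M do` is Linux ping's "prohibit fragmentation" flag; the target address is taken from this report). The payload sizes account for the 28 bytes of IP + ICMP headers, so 1472 probes a 1500 path and 1422 probes a 1450 path:

```shell
# Sketch: if the 1472-byte probe fails while the 1422-byte one succeeds,
# the path MTU between nodes is below 1500, matching the theory above.
# 1472 = 1500 - 20 (IP header) - 8 (ICMP header); 1422 = 1450 - 28.
TARGET=192.168.50.102   # another cluster node

ping -c 3 -M do -s 1472 "$TARGET" && echo "full 1500-byte frames pass"
ping -c 3 -M do -s 1422 "$TARGET" && echo "1450-byte frames pass"
```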
Dan, I will check my environment. The part that puzzles me is that OCP 4.6 deploys with Assisted Install and works perfectly (same environment, same VMs, same DNS, same networking...). Issues started with 4.7 pre-releases. So feel free to point the finger at the environment, but something changed between 4.6 and 4.7.
Take 2. My virtual switch MTU is set to the default 1500, and I run all the VMs on a single ESXi server, so there is no VXLAN tunnel. Traffic between cluster members never leaves the vSwitch (MTU 1500), as all the VMs are on a single host connected to the same vSwitch. And (again) the Assisted Installer works with OCP 4.6 and started to fail with OCP 4.7 on the same environment. If I am having this issue, I guess many folks running on VMware VMs will face similar problems.
I mean... that was just a guess before, but it _really_ looks MTU-y... I'd say try tcpdumping one of the connections that's timing out, from both ends, to see if the problem is that small packets are making it through, but larger ones aren't. I'm not sure what would have changed between 4.6 and 4.7. OpenShift SDN itself has barely changed at all, so it would have to be something somewhere else. Presumably VMware-related if the bug only affects VMware...
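A hedged sketch of the suggested capture (the interface name is a common VMware guest NIC name and the peer IP is taken from the error message earlier in this BZ; both are assumptions, and this needs to run on both endpoints simultaneously):

```shell
# Sketch: capture one of the timing-out connections from both ends.
# With an MTU problem you typically see the 3-way handshake complete
# (small packets), then the large TLS packets retransmitted repeatedly
# on the sending side and never arriving on the receiving side.
IFACE=ens192        # node NIC; adjust to your environment
PEER=10.136.0.57    # the endpoint that times out

tcpdump -ni "$IFACE" -v "host $PEER and tcp port 6443"
```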
Created attachment 1763835 [details] Assisted Bare Metal Cluster 4.7.0 Must-Gather

This is a new installation (single ESXi server, all VMs on that single host, using the same virtual switch, switch MTU 1500). I manually forced the NIC of each cluster member to MTU 1400 (on the live ISO upon boot, then again right after the reboot from disk). I monitored the installation and re-applied MTU 1400 whenever it changed. The veth interfaces all ended up with MTU 1350. The installation failed with the usual "Cluster installation failed: Timeout while waiting for console to become available." I can provide ssh access to the cluster if required.
@jjung Can you please provide ssh access? Do you need a public key? Thanks, Aniket.