+++ This bug was initially created as a clone of Bug #1970459 +++

Description of problem:

While installing many SNO clusters with WAN emulation/network impairments, some clusters fail to show up as installed, although upon closer inspection they did complete their install, just too late for the timeout, which had already judged the cluster as failed.

The current WAN emulation setup is:
* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss

Version-Release number of selected component (if applicable):
HUB OCP cluster 4.8.0-fc.8
SNO OCP cluster 4.8.0-fc.8
ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02

How reproducible:
With the latest bits:
* 1/60 on the first run; however, 5 other clusters were still reconciling operators when checked, and I did not permit them more time to see how many of those would eventually complete.
* 2/60 on the second run, in which more time was allowed (overnight) to see the state of the SNO clusters.

Steps to Reproduce:
1.
2.
3.

Actual results:

# oc get cd -A | egrep "NAME|sno00059|sno00057"
NAMESPACE   NAME       PLATFORM          REGION   CLUSTERTYPE   INSTALLED   INFRAID   VERSION   POWERSTATE    AGE
sno00057    sno00057   agent-baremetal                          false                           Initialized   14h
sno00059    sno00059   agent-baremetal                          false                           Initialized   14h

* Both these clusters do not show as installed.

However, when on the cluster:

[root@sno00057 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig get no,clusterversion
NAME            STATUS   ROLES           AGE   VERSION
node/sno00057   Ready    master,worker   14h   v1.21.0-rc.0+2dfc46b

NAME                                         VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.8.0-fc.8   True        False         13h     Cluster version is 4.8.0-fc.8

[root@sno00057 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig describe clusterversion | egrep "Completion Time:|Started Time:"
    Completion Time:  2021-06-10T00:38:33Z
    Started Time:     2021-06-09T22:52:53Z

and cluster 00059:

[root@sno00059 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig get no,clusterversion
NAME            STATUS   ROLES           AGE   VERSION
node/sno00059   Ready    master,worker   14h   v1.21.0-rc.0+2dfc46b

NAME                                         VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.8.0-fc.8   True        False         13h     Cluster version is 4.8.0-fc.8

[root@sno00059 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig describe clusterversion | egrep "Completion Time:|Started Time:"
    Completion Time:  2021-06-10T00:40:44Z
    Started Time:     2021-06-09T22:53:55Z

Expected results:

Potentially, allow an adjustment that permits clusters to take longer to install/reconcile when being installed over WAN-like conditions.
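(For reference, the egress/ingress impairment described above can be approximated on the hypervisor with tc netem. This is a minimal sketch, assuming eth0 is the impaired interface and that an ifb device is available for ingress shaping; the actual interfaces used in this test are not recorded here:)

{code}
# Egress: add 75ms latency and 0.02% packet loss on the hypervisor interface
tc qdisc add dev eth0 root netem delay 75ms loss 0.02%

# Ingress: redirect inbound traffic through an ifb device, then shape it there
modprobe ifb numifbs=1
ip link set dev ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all u32 match u32 0 0 action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root netem delay 75ms loss 0.02%
{code}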
Additional info:

--- Additional comment from akrzos on 20210610T14:41:25

Created attachment 1789883 [details]
all agent cluster installs yaml

--- Additional comment from akrzos on 20210610T20:14:26

From the events of sno00057:

{
  "cluster_id": "753142ea-a399-4e7e-90d3-38a99aadd92a",
  "event_time": "2021-06-10T00:10:07.480Z",
  "host_id": "5ca45edf-5423-4a69-ba4b-f270e25f2142",
  "message": "Host sno00057: updated status from \"installing-in-progress\" to \"error\" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)",
  "severity": "error"
},

Based off the clusterversion object, it took another ~28m to finish installing.

From the events of sno00059:

{
  "cluster_id": "57276ba5-dea5-4b77-b23d-7e26c39877bc",
  "event_time": "2021-06-10T00:10:07.474Z",
  "host_id": "716fc151-79dd-4843-9463-8a9690f4ed49",
  "message": "Host sno00059: updated status from \"installing-in-progress\" to \"error\" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)",
  "severity": "error"
},

~30m after the assisted-installer said the cluster failed, it ended up finishing its install.

--- Additional comment from mfilanov on 20210613T07:56:06

@akrzos can you please add the logs from the service? You can download them from the agent cluster install status; under debug info you will see a logsURL. Without them we don't really know what happens during the installation and why the controller doesn't report installed status.

--- Additional comment from akrzos on 20210614T13:20:22

(In reply to Michael Filanov from comment #3)
> @akrzos can you please add the logs from the service?
> You can download them from agent cluster install status under debug info you
> will see logsURL.
> Without it we don't really know what happens during the installation and why
> the controller doesn't report installed status.

I am going to have to re-run this, since the current cluster I have that shows up as not completing an install (sno00036) just shows an error after following the logs URL:

{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

I did grab the clusterversion object and the assisted-installer controller pod logs off the SNO cluster, though.
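(For reference, the logsURL mentioned in comment #3 can be read straight from the AgentClusterInstall status. A minimal sketch, where the .status.debugInfo.logsURL field path and the sno00036 namespace/name are assumptions based on that comment's description and this bug's naming convention:)

{code}
# Read the service log bundle URL from the AgentClusterInstall debug info
# (field path assumed from comment #3's description of "debug info" / logsURL)
LOGS_URL=$(oc get agentclusterinstall -n sno00036 sno00036 -o jsonpath='{.status.debugInfo.logsURL}')

# Fetch the bundle; -k only if the hub route certificate is not trusted locally
curl -k -o sno00036-logs.tar "$LOGS_URL"
{code}

The captured ClusterVersion object from the SNO cluster: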
Name:         version
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-06-10T21:18:45Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:channel:
        f:clusterID:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-06-10T21:18:45Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:conditions:
        f:desired:
          .:
          f:image:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Time:            2021-06-10T21:18:45Z
  Resource Version:  73727
  UID:               400e1854-0744-4f03-90a5-ecfbf3259af3
Spec:
  Channel:     stable-4.8
  Cluster ID:  28453b8f-4900-455c-a24c-43b22f4b3360
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-06-11T02:21:03Z
    Message:               Done applying 4.8.0-fc.8
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-06-11T02:21:03Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2021-06-11T02:21:03Z
    Message:               Cluster version is 4.8.0-fc.8
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-06-10T21:18:45Z
    Message:               Unable to retrieve available updates: currently reconciling cluster version 4.8.0-fc.8 not found in the "stable-4.8" channel
    Reason:                VersionNotFound
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:9d0d47c9ac6539183133fac9b0c5f6ff389ea2cc2fef115a9f77967f57d5d420
    Version:  4.8.0-fc.8
  History:
    Completion Time:    2021-06-11T02:21:03Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:9d0d47c9ac6539183133fac9b0c5f6ff389ea2cc2fef115a9f77967f57d5d420
    Started Time:       2021-06-10T21:18:45Z
    State:              Completed
    Verified:           false
    Version:            4.8.0-fc.8
  Observed Generation:  1
  Version Hash:         4gafDaNwq8U=
Events:                 <none>

--- Additional comment from akrzos on 20210614T13:21:03

Created attachment 1790989 [details]
assisted-installer controller logs off of a cluster that completed install "too late"

--- Additional comment from mfilanov on 20210614T17:00:39

{code}
time="2021-06-10T22:11:05Z" level=info msg="Start waiting to be ready"
time="2021-06-10T22:17:26Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:23:34Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:29:37Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:36:00Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:42:02Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:47:50Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout" time="2021-06-10T22:52:48Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:60243->172.30.0.10:53: read: connection refused" time="2021-06-10T22:53:27Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:35314->172.30.0.10:53: read: connection refused" time="2021-06-10T22:54:02Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:34980->172.30.0.10:53: read: connection refused" time="2021-06-10T22:54:46Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:56940->172.30.0.10:53: read: connection refused" time="2021-06-10T22:55:28Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:45049->172.30.0.10:53: read: connection refused" time="2021-06-10T22:56:13Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:48085->172.30.0.10:53: read: connection refused" time="2021-06-10T22:56:42Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:55393->172.30.0.10:53: read: connection refused" time="2021-06-10T22:57:13Z" level=warning msg="Failed to connect to assisted service" error="Get 
\"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:55474->172.30.0.10:53: read: connection refused" time="2021-06-10T22:57:57Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:54752->172.30.0.10:53: read: connection refused" time="2021-06-10T22:58:39Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:55156->172.30.0.10:53: read: connection refused" time="2021-06-10T22:59:18Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:36431->172.30.0.10:53: read: connection refused" time="2021-06-10T23:00:01Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:38740->172.30.0.10:53: read: connection refused" time="2021-06-10T23:00:33Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:54814->172.30.0.10:53: read: connection refused" time="2021-06-10T23:01:20Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:42379->172.30.0.10:53: read: connection refused" time="2021-06-10T23:01:59Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:45992->172.30.0.10:53: read: connection refused" time="2021-06-10T23:02:43Z" level=warning msg="Failed to connect to assisted service" error="Get 
\"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:35010->172.30.0.10:53: read: connection refused" time="2021-06-10T23:03:25Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:50132->172.30.0.10:53: read: connection refused" time="2021-06-10T23:04:19Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:49977->172.30.0.10:53: read: connection refused" time="2021-06-10T23:05:06Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:56558->172.30.0.10:53: read: connection refused" {code} Looks like for an hour controller couldn't connect to the service, i can't tell if it's a network issue or some problem with the service @itsoiref @ercohen what do you think? --- Additional comment from akrzos on 20210614T21:01:07 In an effort to gather a bit more data I reran the a very similar test with some newer bits and still witnessed several clusters that succeeded install after they were declared failed: Test parameters: * egress hypervisor latency 75ms, 0.02% packet loss * ingress hypervisor latency 75ms, 0.02% packet loss * Hub OCP version 4.8.0-fc.8 * SNO OCP version 4.8.0-fc.9 * ACM 2.3.0-DOWNSTREAM-2021-06-13-16-46-23 Out of 60 clusters - 7/60 failed to show up as installed: * 2 failed because of https://bugzilla.redhat.com/show_bug.cgi?id=1964688 * 3 failed to apply all cluster operators: * sno00025 - Unable to apply 4.8.0-fc.9: the cluster operator monitoring has not yet successfully rolled out * sno00028 - Unable to apply 4.8.0-fc.9: some cluster operators have not yet rolled out * sno00035 - Unable to apply 4.8.0-fc.9: an unknown error has occurred: MultipleErrors * 2 came up but too late: * sno00035 * sno00039 I captured some additional data from sno00035 and sno00039 and will attach as a tar file. Unfortunately it seems both examined clusters did not contain any data in their url for logs from the assisted service most likely due to this bug - https://bugzilla.redhat.com/show_bug.cgi?id=1970015 Looking specifically at cluster sno00039 however it seems that the assisted-installer controller pod came up "too late": # oc get agentclusterinstall -n sno00039 -o yaml ... 
  - lastProbeTime: "2021-06-14T19:15:15Z"
    lastTransitionTime: "2021-06-14T19:15:15Z"
    message: 'The installation failed: cluster has hosts in error'
    reason: InstallationFailed
    status: "True"
    type: Failed
  - lastProbeTime: "2021-06-14T19:15:15Z"
    lastTransitionTime: "2021-06-14T19:15:15Z"
    message: The installation has stopped due to error
    reason: InstallationFailed
    status: "True"
    type: Stopped
...

And from the events:

# cat sno00039.events.log | jq
...
{
  "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
  "event_time": "2021-06-14T18:00:02.445Z",
  "host_id": "65fbdf8a-56ac-4bc3-afeb-3402439e80ec",
  "message": "Host sno00039: reached installation stage Writing image to disk: 100%",
  "severity": "info"
},
{
  "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
  "event_time": "2021-06-14T18:00:10.773Z",
  "host_id": "65fbdf8a-56ac-4bc3-afeb-3402439e80ec",
  "message": "Host sno00039: reached installation stage Rebooting",
  "severity": "info"
},
{
  "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
  "event_time": "2021-06-14T19:15:07.701Z",
  "host_id": "65fbdf8a-56ac-4bc3-afeb-3402439e80ec",
  "message": "Host sno00039: updated status from \"installing-in-progress\" to \"error\" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)",
  "severity": "error"
},
{
  "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
  "event_time": "2021-06-14T19:15:15.698Z",
  "message": "Updated status of cluster sno00039 to error",
  "severity": "info"
},

Thus this cluster was marked as error at 2021-06-14T19:15:07.701Z

When we examine the controller logs on the sno cluster itself we see:

# cat sno00039.assisted-installer-controller-6tkgg.log
time="2021-06-14T19:26:05Z" level=info msg="Start running Assisted-Controller. Configuration is:\n struct ControllerConfig {\n\tClusterID: \"0edee69b-8d7f-4fa6-b672-5dfb42b6afc9\",\n\tURL: \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com\",\n\tPullSecretToken: \"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMGVkZWU2OWItOGQ3Zi00ZmE2LWI2NzItNWRmYjQyYjZhZmM5In0.ahWLULAAg91eoL0qjCIhhnAVOD8Ba2bkrDSjQK1Hb1VTPgqYxVVIt2S1B66YUyjDqkdG8Qnak5VHdqqZE1esFw\",\n\tSkipCertVerification: false,\n\tCACertPath: \"/etc/assisted-service/service-ca-cert.crt\",\n\tNamespace: \"assisted-installer\",\n\tOpenshiftVersion: \"4.8.0-fc.9\",\n\tHighAvailabilityMode: \"None\",\n\tWaitForClusterVersion: false,\n\tMustGatherImage: \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b16a8a682178e39acbaec8d6c46b0aa34af5bb2aa9b3ab2e384eca917c2290d0\",\n}"
W0614 19:26:05.500912       1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0614 19:26:06.557929       1 request.go:645] Throttling request took 1.032179165s, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
time="2021-06-14T19:26:08Z" level=info msg="Using proxy {HTTPProxy: HTTPSProxy: NoProxy:} to set env-vars for installer-controller pod"
time="2021-06-14T19:26:08Z" level=info msg="Using custom CA certificate: /etc/assisted-service/service-ca-cert.crt"
time="2021-06-14T19:26:08Z" level=info msg="Start waiting to be ready"
time="2021-06-14T19:27:17Z" level=info msg="assisted-service is available"

The controller did not even start until 2021-06-14T19:26:05Z (~11m after the assisted-service marked this cluster as failed)

Looking at an oc describe of the controller pod we see:

# cat sno00039.describe.assisted-installer-controller-6tkgg.log
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  135m                 default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  134m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   Scheduled         69m                  default-scheduler  Successfully assigned assisted-installer/assisted-installer-controller-6tkgg to sno00039
  Warning  FailedScheduling  158m (x6 over 158m)  default-scheduler  no nodes available to schedule pods
  Normal   AddedInterface    69m                  multus             Add eth0 [10.128.0.17/23] from ovn-kubernetes
  Normal   Pulling           69m                  kubelet            Pulling image "quay.io:443/acm-d/assisted-installer-reporter-rhel8@sha256:59e29d660e233eeef1f712d0592718ebfe9ea5f175bab916f29d6a009d92967f"
  Normal   Pulled            69m                  kubelet            Successfully pulled image "quay.io:443/acm-d/assisted-installer-reporter-rhel8@sha256:59e29d660e233eeef1f712d0592718ebfe9ea5f175bab916f29d6a009d92967f" in 34.413941705s
  Normal   Created           69m                  kubelet            Created container assisted-installer-controller
  Normal   Started           69m                  kubelet            Started container assisted-installer-controller

It seems it took a long time for the cluster to become ready enough to run this pod - 65m or so of the pod sitting in pending state.

When examining sno00035 we see the following:

# oc get agentclusterinstall -n sno00035 -o yaml
...
  - lastProbeTime: "2021-06-14T19:10:05Z"
    lastTransitionTime: "2021-06-14T19:10:05Z"
    message: 'The installation has failed: cluster has hosts in error'
    reason: InstallationFailed
    status: "False"
    type: Completed
  - lastProbeTime: "2021-06-14T19:10:05Z"
    lastTransitionTime: "2021-06-14T19:10:05Z"
    message: 'The installation failed: cluster has hosts in error'
    reason: InstallationFailed
    status: "True"
    type: Failed
  - lastProbeTime: "2021-06-14T19:10:05Z"
    lastTransitionTime: "2021-06-14T19:10:05Z"
    message: The installation has stopped due to error
    reason: InstallationFailed
    status: "True"
    type: Stopped

Thus the install effectively failed at 2021-06-14T19:10:05Z

However the controller logs show:

# cat sno00035.assisted-installer-controller-c98j9.log
time="2021-06-14T18:46:29Z" level=info msg="Start running Assisted-Controller. Configuration is:\n struct ControllerConfig {\n\tClusterID: \"7b63776b-1ce7-4288-afa1-91f7eb65f8f5\",\n\tURL: \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com\",\n\tPullSecretToken: \"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiN2I2Mzc3NmItMWNlNy00Mjg4LWFmYTEtOTFmN2ViNjVmOGY1In0.TMdFRHDaKZFejxyPFFK4UESLFsJd3nu_sKjV1R6n7snGJu6TfrUTMbvTJCgyu7YGIvHXoc0_1eowzLIwKz1sOg\",\n\tSkipCertVerification: false,\n\tCACertPath: \"/etc/assisted-service/service-ca-cert.crt\",\n\tNamespace: \"assisted-installer\",\n\tOpenshiftVersion: \"4.8.0-fc.9\",\n\tHighAvailabilityMode: \"None\",\n\tWaitForClusterVersion: false,\n\tMustGatherImage: \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b16a8a682178e39acbaec8d6c46b0aa34af5bb2aa9b3ab2e384eca917c2290d0\",\n}"
W0614 18:46:29.114085       1 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0614 18:46:30.168884       1 request.go:645] Throttling request took 1.038354671s, request: GET:https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
time="2021-06-14T18:46:30Z" level=info msg="Using proxy {HTTPProxy: HTTPSProxy: NoProxy:} to set env-vars for installer-controller pod"
time="2021-06-14T18:46:30Z" level=info msg="Using custom CA certificate: /etc/assisted-service/service-ca-cert.crt"
time="2021-06-14T18:46:30Z" level=info msg="Start waiting to be ready"
time="2021-06-14T18:52:26Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/7b63776b-1ce7-4288-afa1-91f7eb65f8f5\": dial tcp: i/o timeout"
time="2021-06-14T18:53:26Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/7b63776b-1ce7-4288-afa1-91f7eb65f8f5\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.15:44643->172.30.0.10:53: read: connection refused"
...
time="2021-06-14T19:21:15Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/7b63776b-1ce7-4288-afa1-91f7eb65f8f5\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.15:48787->172.30.0.10:53: read: connection refused"
time="2021-06-14T19:21:41Z" level=info msg="assisted-service is available"
time="2021-06-14T19:21:41Z" level=info msg="kube-apiserver is available"
time="2021-06-14T19:21:41Z" level=info msg="Sending ready event"
time="2021-06-14T19:21:41Z" level=info msg="monitor cluster installation status"
time="2021-06-14T19:21:41Z" level=info msg="Start sending logs"
time="2021-06-14T19:21:41Z" level=info msg="Waiting till all nodes will join and update status to assisted installer"
time="2021-06-14T19:21:41Z" level=info msg="Start approving CSRs"
time="2021-06-14T19:21:41Z" level=info msg="Making sure service dns-default can reserve the .10 address"
time="2021-06-14T19:21:41Z" level=info msg="Service dns-default has successfully taken IP 172.30.0.10"
time="2021-06-14T19:21:41Z" level=info msg="HackDNSAddressConflict finished"
time="2021-06-14T19:22:11Z" level=info msg="Number of BMHs is 0"
time="2021-06-14T19:22:11Z" level=info msg="Hosts status: map[sno00035:[error Rebooting ]]" request_id=b593727d-ec8a-4b1d-99fb-483b1c113fed
time="2021-06-14T19:22:11Z" level=info msg="Done waiting for all the nodes. Nodes in error status: 1\n"
time="2021-06-14T19:22:11Z" level=info msg="WaitAndUpdateNodesStatus finished"
time="2021-06-14T19:22:11Z" level=info msg="Finish approving CSRs"
I0614 19:22:12.344230       1 request.go:645] Throttling request took 1.045614527s, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
time="2021-06-14T19:22:17Z" level=error msg="failed to list Machines, error no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:17Z" level=error msg="Failed to find unallocated machines" error="no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:41Z" level=info msg="Number of BMHs is 0"
I0614 19:22:42.341279       1 request.go:645] Throttling request took 1.046542033s, request: GET:https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
time="2021-06-14T19:22:42Z" level=info msg="Cluster installation failed."
time="2021-06-14T19:22:42Z" level=info msg="Waiting for all go routines to finish"
time="2021-06-14T19:22:42Z" level=info msg="Finished PostInstallConfigs"
time="2021-06-14T19:22:42Z" level=info msg="Finished UploadLogs"
time="2021-06-14T19:22:47Z" level=error msg="failed to list Machines, error no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:47Z" level=error msg="Failed to find unallocated machines" error="no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:47Z" level=info msg="Finished UpdateBMHs"
time="2021-06-14T19:22:47Z" level=info msg="Finished all"

In this case it seems that the controller pod comes up before the cluster is marked as an error, but cannot resolve the hub cluster address until it is too late. I am curious whether this is another NetworkManager-related issue, in that DNS is not working until too late in the install process, similar to https://bugzilla.redhat.com/show_bug.cgi?id=1968634

--- Additional comment from akrzos on 20210614T21:02:50

Created attachment 1791081 [details]
Data from the fc.9 test

--- Additional comment from itsoiref on 20210615T09:05:16

@akrzos it will be very helpful to debug this live or to have must-gather logs too.
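(For reference, the must-gather logs requested above can be collected from the affected SNO cluster with the standard oc adm must-gather command. A minimal sketch, reusing the node-local kubeconfig path shown earlier in this bug; the destination directory name is arbitrary:)

{code}
# Collect a must-gather from the affected SNO cluster using the node-local
# kubeconfig referenced earlier in this bug (destination directory is arbitrary)
oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig \
  adm must-gather --dest-dir=./must-gather-sno00035
{code}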
Verified by @cchun on 2.3.0-DOWNSTREAM-2021-06-24-02-38-49
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438