Bug 1972140 - [4.8.0] ACM/ZTP with Wan emulation, SNO cluster installs do not show as installed although they are
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Igal Tsoiref
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Depends On: 1970459
Blocks:
 
Reported: 2021-06-15 10:28 UTC by Michael Filanov
Modified: 2021-07-27 23:13 UTC
CC List: 7 users

Fixed In Version: OCP-Metal-v1.0.22.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1970459
Environment:
Last Closed: 2021-07-27 23:12:54 UTC
Target Upstream Version:
Embargoed:




Links:
* GitHub: openshift/assisted-installer pull 308 (open) - "Bug 1970459: ACM/ZTP with Wan emulation, SNO cluster installs do not show as installed although they are" - last updated 2021-06-16 16:59:21 UTC
* Red Hat Bugzilla: bug 1970459 (high, CLOSED) - "[master] ACM/ZTP with Wan emulation, SNO cluster installs do not show as installed although they are" - last updated 2022-08-28 08:47:34 UTC
* Red Hat Product Errata: RHSA-2021:2438 - last updated 2021-07-27 23:13:35 UTC

Description Michael Filanov 2021-06-15 10:28:36 UTC
+++ This bug was initially created as a clone of Bug #1970459 +++

Description of problem:
While installing many SNO clusters with wan emulation/network impairments, some clusters do not show up as installed. On closer inspection they did complete their install, just apparently too late for a timeout, which judges the cluster to have failed even though it has not.

The current wan emulation setup is:
* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
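For reference, this kind of impairment is usually applied on the hypervisor uplink with tc/netem; a minimal sketch of how it could be set up (the interface name ens3f0 and the ifb approach for ingress are illustrative assumptions, not the exact tooling used in this test):

# tc qdisc add dev ens3f0 root netem delay 75ms loss 0.02%    # egress: 75ms delay, 0.02% loss
# modprobe ifb numifbs=1 && ip link set dev ifb0 up           # ifb device used to shape ingress
# tc qdisc add dev ens3f0 handle ffff: ingress
# tc filter add dev ens3f0 parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0
# tc qdisc add dev ifb0 root netem delay 75ms loss 0.02%      # ingress: same delay/loss applied on ifb0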

Version-Release number of selected component (if applicable):
HUB OCP cluster 4.8.0-fc.8
SNO OCP cluster 4.8.0-fc.8
ACM 2.3.0-DOWNSTREAM-2021-06-07-19-53-02


How reproducible:
With the latest bits:
1/60 on the first run; however, 5 other clusters were still reconciling operators when checked, and I did not give them more time to see how many of those would eventually complete.

2/60 on the second run, in which the SNO clusters were given more time (overnight) before their state was checked.

Steps to Reproduce:
1.
2.
3.

Actual results:
# oc get cd -A | egrep "NAME|sno00059|sno00057"
NAMESPACE   NAME       PLATFORM          REGION   CLUSTERTYPE   INSTALLED   INFRAID                                VERSION   POWERSTATE    AGE
sno00057    sno00057   agent-baremetal                          false                                                        Initialized   14h
sno00059    sno00059   agent-baremetal                          false                                                        Initialized   14h

* Neither of these clusters shows as installed (INSTALLED=false)

However, when checking on the cluster itself:

[root@sno00057 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig get no,clusterversion
NAME            STATUS   ROLES           AGE   VERSION
node/sno00057   Ready    master,worker   14h   v1.21.0-rc.0+2dfc46b

NAME                                         VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.8.0-fc.8   True        False         13h     Cluster version is 4.8.0-fc.8
[root@sno00057 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig describe clusterversion | egrep "Completion Time:|Started Time:"
    Completion Time:    2021-06-10T00:38:33Z
    Started Time:       2021-06-09T22:52:53Z

and cluster 00059:
[root@sno00059 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig get no,clusterversion
NAME            STATUS   ROLES           AGE   VERSION
node/sno00059   Ready    master,worker   14h   v1.21.0-rc.0+2dfc46b

NAME                                         VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
clusterversion.config.openshift.io/version   4.8.0-fc.8   True        False         13h     Cluster version is 4.8.0-fc.8
[root@sno00059 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig describe clusterversion | egrep "Completion Time:|Started Time:"
    Completion Time:    2021-06-10T00:40:44Z
    Started Time:       2021-06-09T22:53:55Z
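For what it's worth, the end-to-end durations can be computed straight from those timestamps with GNU date, e.g. for sno00059:

# echo $(( ( $(date -ud 2021-06-10T00:40:44Z +%s) - $(date -ud 2021-06-09T22:53:55Z +%s) ) / 60 )) minutes
106 minutes

i.e. roughly 1h47m for sno00059 and about 1h46m for sno00057.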




Expected results:
Ideally, allow an adjustment so that clusters are permitted to take longer to install/reconcile when they are being installed over wan-like conditions.

Additional info:

--- Additional comment from akrzos on 20210610T14:41:25

Created attachment 1789883 [details]
all agent cluster installs yaml

--- Additional comment from akrzos on 20210610T20:14:26

From the events of sno00057:
  {                                                                                                                                                                                             
    "cluster_id": "753142ea-a399-4e7e-90d3-38a99aadd92a",                                                                                                                                       
    "event_time": "2021-06-10T00:10:07.480Z",                                                                                                                                                   
    "host_id": "5ca45edf-5423-4a69-ba4b-f270e25f2142",                                                                                                                                          
    "message": "Host sno00057: updated status from \"installing-in-progress\" to \"error\" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)",
    "severity": "error"
  },

Based on the clusterversion object, the cluster took another ~28m to finish installing.

From the events of sno00059:
  {                   
    "cluster_id": "57276ba5-dea5-4b77-b23d-7e26c39877bc",
    "event_time": "2021-06-10T00:10:07.474Z",
    "host_id": "716fc151-79dd-4843-9463-8a9690f4ed49",   
    "message": "Host sno00059: updated status from \"installing-in-progress\" to \"error\" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)",
    "severity": "error"                               
  },       

The cluster ended up finishing its install ~30m after the assisted-installer declared it failed.

--- Additional comment from mfilanov on 20210613T07:56:06

@akrzos can you please add the logs from the service?
You can download them from the agent cluster install status: under debug info you will see logsURL.
Without them we don't really know what happens during the installation or why the controller doesn't report installed status.
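A quick way to pull that URL (field path per the debugInfo section mentioned above; the namespace and output filename here are just illustrative):

# oc get agentclusterinstall -n sno00057 -o jsonpath='{.items[0].status.debugInfo.logsURL}'
# curl -ko sno00057-logs.tar "$(oc get agentclusterinstall -n sno00057 -o jsonpath='{.items[0].status.debugInfo.logsURL}')"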

--- Additional comment from akrzos on 20210614T13:20:22

(In reply to Michael Filanov from comment #3)
> @akrzos can you please add the logs from the service? 
> You can download them from agent cluster install status under debug info you
> will see logsURL.
> Without it we don't really know what happens during the installation and why
> the controller doesn't report installed status.

I am going to have to re-run this, since the one cluster I currently have that shows up as not completing its install (sno00036) just returns an error when following the logs URLs.

{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

I did grab the clusterversion object and the assisted-installer controller pod logs off the sno cluster though.

Name:         version
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-06-10T21:18:45Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:channel:
        f:clusterID:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-06-10T21:18:45Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:conditions:
        f:desired:
          .:
          f:image:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Time:            2021-06-10T21:18:45Z
  Resource Version:  73727
  UID:               400e1854-0744-4f03-90a5-ecfbf3259af3
Spec:
  Channel:     stable-4.8
  Cluster ID:  28453b8f-4900-455c-a24c-43b22f4b3360
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-06-11T02:21:03Z
    Message:               Done applying 4.8.0-fc.8
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-06-11T02:21:03Z
    Status:                False
    Type:                  Failing
    Last Transition Time:  2021-06-11T02:21:03Z
    Message:               Cluster version is 4.8.0-fc.8
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-06-10T21:18:45Z
    Message:               Unable to retrieve available updates: currently reconciling cluster version 4.8.0-fc.8 not found in the "stable-4.8" channel
    Reason:                VersionNotFound
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:9d0d47c9ac6539183133fac9b0c5f6ff389ea2cc2fef115a9f77967f57d5d420
    Version:  4.8.0-fc.8
  History:
    Completion Time:    2021-06-11T02:21:03Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:9d0d47c9ac6539183133fac9b0c5f6ff389ea2cc2fef115a9f77967f57d5d420
    Started Time:       2021-06-10T21:18:45Z
    State:              Completed
    Verified:           false
    Version:            4.8.0-fc.8
  Observed Generation:  1
  Version Hash:         4gafDaNwq8U=
Events:                 <none>

--- Additional comment from akrzos on 20210614T13:21:03

Created attachment 1790989 [details]
assisted-installer controller logs off of a cluster and completed install "too late"

--- Additional comment from mfilanov on 20210614T17:00:39

{code}
time="2021-06-10T22:11:05Z" level=info msg="Start waiting to be ready"
time="2021-06-10T22:17:26Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:23:34Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:29:37Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:36:00Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:42:02Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:47:50Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: i/o timeout"
time="2021-06-10T22:52:48Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:60243->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:53:27Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:35314->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:54:02Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:34980->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:54:46Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:56940->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:55:28Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:45049->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:56:13Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:48085->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:56:42Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:55393->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:57:13Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:55474->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:57:57Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:54752->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:58:39Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:55156->172.30.0.10:53: read: connection refused"
time="2021-06-10T22:59:18Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:36431->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:00:01Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:38740->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:00:33Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:54814->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:01:20Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:42379->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:01:59Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:45992->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:02:43Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:35010->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:03:25Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:50132->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:04:19Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:49977->172.30.0.10:53: read: connection refused"
time="2021-06-10T23:05:06Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/78e954df-a0bf-452f-be1f-7560d7464dba\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.13:56558->172.30.0.10:53: read: connection refused"
{code}

Looks like the controller couldn't connect to the service for about an hour; I can't tell whether it's a network issue or some problem with the service.
@itsoiref @ercohen what do you think?
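(For what it's worth, the two failure modes above are easy to count from a saved copy of the controller log; the filename here is illustrative:)

# grep -c 'dial tcp: i/o timeout' assisted-installer-controller.log       # plain TCP timeouts reaching the hub route
# grep -c 'read: connection refused' assisted-installer-controller.log    # in-cluster DNS (172.30.0.10) refusing queries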

--- Additional comment from akrzos on 20210614T21:01:07

In an effort to gather a bit more data, I reran a very similar test with some newer bits and still witnessed several clusters that completed their install after they were declared failed:

Test parameters:
* egress hypervisor latency 75ms, 0.02% packet loss
* ingress hypervisor latency 75ms, 0.02% packet loss
* Hub OCP version 4.8.0-fc.8
* SNO OCP version 4.8.0-fc.9
* ACM 2.3.0-DOWNSTREAM-2021-06-13-16-46-23

Out of 60 clusters, 7 failed to show up as installed:
* 2 failed because of https://bugzilla.redhat.com/show_bug.cgi?id=1964688
* 3 failed to apply all cluster operators:
  * sno00025 - Unable to apply 4.8.0-fc.9: the cluster operator monitoring has not yet successfully rolled out
  * sno00028 - Unable to apply 4.8.0-fc.9: some cluster operators have not yet rolled out
  * sno00035 - Unable to apply 4.8.0-fc.9: an unknown error has occurred: MultipleErrors
* 2 came up but too late:
  * sno00035
  * sno00039

I captured some additional data from sno00035 and sno00039 and will attach it as a tar file. Unfortunately, neither of the examined clusters had any data at its logs URL from the assisted service, most likely due to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1970015

Looking specifically at cluster sno00039, however, it seems that the assisted-installer controller pod came up "too late":

# oc get agentclusterinstall -n sno00039 -o yaml
...
    - lastProbeTime: "2021-06-14T19:15:15Z"
      lastTransitionTime: "2021-06-14T19:15:15Z"
      message: 'The installation failed: cluster has hosts in error'
      reason: InstallationFailed
      status: "True"
      type: Failed
    - lastProbeTime: "2021-06-14T19:15:15Z"
      lastTransitionTime: "2021-06-14T19:15:15Z"
      message: The installation has stopped due to error
      reason: InstallationFailed
      status: "True"
      type: Stopped
...

And from the events:
# cat sno00039.events.log | jq
...
  {
    "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
    "event_time": "2021-06-14T18:00:02.445Z",
    "host_id": "65fbdf8a-56ac-4bc3-afeb-3402439e80ec",
    "message": "Host sno00039: reached installation stage Writing image to disk: 100%",
    "severity": "info"
  },
  {
    "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
    "event_time": "2021-06-14T18:00:10.773Z",
    "host_id": "65fbdf8a-56ac-4bc3-afeb-3402439e80ec",
    "message": "Host sno00039: reached installation stage Rebooting",
    "severity": "info"
  },
  {
    "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
    "event_time": "2021-06-14T19:15:07.701Z",
    "host_id": "65fbdf8a-56ac-4bc3-afeb-3402439e80ec",
    "message": "Host sno00039: updated status from \"installing-in-progress\" to \"error\" (Host failed to install because its installation stage Rebooting took longer than expected 1h10m0s)",
    "severity": "error"
  },
  {
    "cluster_id": "0edee69b-8d7f-4fa6-b672-5dfb42b6afc9",
    "event_time": "2021-06-14T19:15:15.698Z",
    "message": "Updated status of cluster sno00039 to error",
    "severity": "info"
  },

Thus this cluster was marked as error at 2021-06-14T19:15:07.701Z
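(Side note: the error events can be filtered out of the dump with a small jq expression, assuming the dump is a single JSON array as the formatting above suggests:)

# jq '.[] | select(.severity == "error") | {event_time, message}' sno00039.events.log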

When we examine the controller logs on the sno cluster itself we see:

# cat sno00039.assisted-installer-controller-6tkgg.log
time="2021-06-14T19:26:05Z" level=info msg="Start running Assisted-Controller. Configuration is:\n struct ControllerConfig {\n\tClusterID: \"0edee69b-8d7f-4fa6-b672-5dfb42b6afc9\",\n\tURL: \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com\",\n\tPullSecretToken: \"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiMGVkZWU2OWItOGQ3Zi00ZmE2LWI2NzItNWRmYjQyYjZhZmM5In0.ahWLULAAg91eoL0qjCIhhnAVOD8Ba2bkrDSjQK1Hb1VTPgqYxVVIt2S1B66YUyjDqkdG8Qnak5VHdqqZE1esFw\",\n\tSkipCertVerification: false,\n\tCACertPath: \"/etc/assisted-service/service-ca-cert.crt\",\n\tNamespace: \"assisted-installer\",\n\tOpenshiftVersion: \"4.8.0-fc.9\",\n\tHighAvailabilityMode: \"None\",\n\tWaitForClusterVersion: false,\n\tMustGatherImage: \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b16a8a682178e39acbaec8d6c46b0aa34af5bb2aa9b3ab2e384eca917c2290d0\",\n}"
W0614 19:26:05.500912       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0614 19:26:06.557929       1 request.go:645] Throttling request took 1.032179165s, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
time="2021-06-14T19:26:08Z" level=info msg="Using proxy {HTTPProxy: HTTPSProxy: NoProxy:} to set env-vars for installer-controller pod"
time="2021-06-14T19:26:08Z" level=info msg="Using custom CA certificate: /etc/assisted-service/service-ca-cert.crt"
time="2021-06-14T19:26:08Z" level=info msg="Start waiting to be ready"
time="2021-06-14T19:27:17Z" level=info msg="assisted-service is available"

The controller did not even start until 2021-06-14T19:26:05Z (~11m after the assisted-service marked this cluster as failed)
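(If it helps, that can also be read from the pod itself; status.startTime is when the kubelet first acknowledged the pod, so it should line up with the late start seen in the log:)

# oc get pod -n assisted-installer assisted-installer-controller-6tkgg -o jsonpath='{.status.startTime}'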

Looking at an oc describe of the controller pod we see:

# cat sno00039.describe.assisted-installer-controller-6tkgg.log
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  135m                 default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  134m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
  Normal   Scheduled         69m                  default-scheduler  Successfully assigned assisted-installer/assisted-installer-controller-6tkgg to sno00039
  Warning  FailedScheduling  158m (x6 over 158m)  default-scheduler  no nodes available to schedule pods
  Normal   AddedInterface    69m                  multus             Add eth0 [10.128.0.17/23] from ovn-kubernetes
  Normal   Pulling           69m                  kubelet            Pulling image "quay.io:443/acm-d/assisted-installer-reporter-rhel8@sha256:59e29d660e233eeef1f712d0592718ebfe9ea5f175bab916f29d6a009d92967f"
  Normal   Pulled            69m                  kubelet            Successfully pulled image "quay.io:443/acm-d/assisted-installer-reporter-rhel8@sha256:59e29d660e233eeef1f712d0592718ebfe9ea5f175bab916f29d6a009d92967f" in 34.413941705s
  Normal   Created           69m                  kubelet            Created container assisted-installer-controller
  Normal   Started           69m                  kubelet            Started container assisted-installer-controller

It seems it took a long time for the cluster to become ready enough to run this pod: roughly 65 minutes of the pod sitting in a pending state.
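(The same gap is a bit easier to see from time-sorted events than from describe, e.g.:)

# oc get events -n assisted-installer --field-selector involvedObject.name=assisted-installer-controller-6tkgg --sort-by=.lastTimestamp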

When examining sno00035 we see the following:
# oc get agentclusterinstall -n sno00035 -o yaml
...
    - lastProbeTime: "2021-06-14T19:10:05Z"
      lastTransitionTime: "2021-06-14T19:10:05Z"
      message: 'The installation has failed: cluster has hosts in error'
      reason: InstallationFailed
      status: "False"
      type: Completed
    - lastProbeTime: "2021-06-14T19:10:05Z"
      lastTransitionTime: "2021-06-14T19:10:05Z"
      message: 'The installation failed: cluster has hosts in error'
      reason: InstallationFailed
      status: "True"
      type: Failed
    - lastProbeTime: "2021-06-14T19:10:05Z"
      lastTransitionTime: "2021-06-14T19:10:05Z"
      message: The installation has stopped due to error
      reason: InstallationFailed
      status: "True"
      type: Stopped

Thus the install effectively failed at 2021-06-14T19:10:05Z

However the controller logs show:

# cat sno00035.assisted-installer-controller-c98j9.log
time="2021-06-14T18:46:29Z" level=info msg="Start running Assisted-Controller. Configuration is:\n struct ControllerConfig {\n\tClusterID: \"7b63776b-1ce7-4288-afa1-91f7eb65f8f5\",\n\tURL: \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com\",\n\tPullSecretToken: \"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiN2I2Mzc3NmItMWNlNy00Mjg4LWFmYTEtOTFmN2ViNjVmOGY1In0.TMdFRHDaKZFejxyPFFK4UESLFsJd3nu_sKjV1R6n7snGJu6TfrUTMbvTJCgyu7YGIvHXoc0_1eowzLIwKz1sOg\",\n\tSkipCertVerification: false,\n\tCACertPath: \"/etc/assisted-service/service-ca-cert.crt\",\n\tNamespace: \"assisted-installer\",\n\tOpenshiftVersion: \"4.8.0-fc.9\",\n\tHighAvailabilityMode: \"None\",\n\tWaitForClusterVersion: false,\n\tMustGatherImage: \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b16a8a682178e39acbaec8d6c46b0aa34af5bb2aa9b3ab2e384eca917c2290d0\",\n}"
W0614 18:46:29.114085       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0614 18:46:30.168884       1 request.go:645] Throttling request took 1.038354671s, request: GET:https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
time="2021-06-14T18:46:30Z" level=info msg="Using proxy {HTTPProxy: HTTPSProxy: NoProxy:} to set env-vars for installer-controller pod"
time="2021-06-14T18:46:30Z" level=info msg="Using custom CA certificate: /etc/assisted-service/service-ca-cert.crt"
time="2021-06-14T18:46:30Z" level=info msg="Start waiting to be ready"
time="2021-06-14T18:52:26Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/7b63776b-1ce7-4288-afa1-91f7eb65f8f5\": dial tcp: i/o timeout"
time="2021-06-14T18:53:26Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/7b63776b-1ce7-4288-afa1-91f7eb65f8f5\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.15:44643->172.30.0.10:53: read: connection refused"
...
time="2021-06-14T19:21:15Z" level=warning msg="Failed to connect to assisted service" error="Get \"https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/7b63776b-1ce7-4288-afa1-91f7eb65f8f5\": dial tcp: lookup assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com on 172.30.0.10:53: read udp 10.128.0.15:48787->172.30.0.10:53: read: connection refused"
time="2021-06-14T19:21:41Z" level=info msg="assisted-service is available"
time="2021-06-14T19:21:41Z" level=info msg="kube-apiserver is available"
time="2021-06-14T19:21:41Z" level=info msg="Sending ready event"
time="2021-06-14T19:21:41Z" level=info msg="monitor cluster installation status"
time="2021-06-14T19:21:41Z" level=info msg="Start sending logs"
time="2021-06-14T19:21:41Z" level=info msg="Waiting till all nodes will join and update status to assisted installer"
time="2021-06-14T19:21:41Z" level=info msg="Start approving CSRs"
time="2021-06-14T19:21:41Z" level=info msg="Making sure service dns-default can reserve the .10 address"
time="2021-06-14T19:21:41Z" level=info msg="Service dns-default has successfully taken IP 172.30.0.10"
time="2021-06-14T19:21:41Z" level=info msg="HackDNSAddressConflict finished"
time="2021-06-14T19:22:11Z" level=info msg="Number of BMHs is 0"
time="2021-06-14T19:22:11Z" level=info msg="Hosts status: map[sno00035:[error Rebooting ]]" request_id=b593727d-ec8a-4b1d-99fb-483b1c113fed
time="2021-06-14T19:22:11Z" level=info msg="Done waiting for all the nodes. Nodes in error status: 1\n"
time="2021-06-14T19:22:11Z" level=info msg="WaitAndUpdateNodesStatus finished"
time="2021-06-14T19:22:11Z" level=info msg="Finish approving CSRs"
I0614 19:22:12.344230       1 request.go:645] Throttling request took 1.045614527s, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
time="2021-06-14T19:22:17Z" level=error msg="failed to list Machines, error no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:17Z" level=error msg="Failed to find unallocated machines" error="no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:41Z" level=info msg="Number of BMHs is 0"
I0614 19:22:42.341279       1 request.go:645] Throttling request took 1.046542033s, request: GET:https://172.30.0.1:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
time="2021-06-14T19:22:42Z" level=info msg="Cluster installation failed."
time="2021-06-14T19:22:42Z" level=info msg="Waiting for all go routines to finish"
time="2021-06-14T19:22:42Z" level=info msg="Finished PostInstallConfigs"
time="2021-06-14T19:22:42Z" level=info msg="Finished UploadLogs"
time="2021-06-14T19:22:47Z" level=error msg="failed to list Machines, error no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:47Z" level=error msg="Failed to find unallocated machines" error="no matches for kind \"Machine\" in version \"machine.openshift.io/v1beta1\""
time="2021-06-14T19:22:47Z" level=info msg="Finished UpdateBMHs"
time="2021-06-14T19:22:47Z" level=info msg="Finished all"

In this case the controller pod comes up before the cluster is marked as in error, but cannot resolve the hub cluster address until it is too late. I am curious whether this is another NetworkManager-related issue, where DNS is not working until too late in the install process, similar to https://bugzilla.redhat.com/show_bug.cgi?id=1968634
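If this reproduces, one way to narrow it down would be to watch the cluster DNS from the node itself while the controller is still retrying, reusing the node kubeconfig from the earlier comments (these commands are a suggestion, not something captured in this run):

[root@sno00035 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig get co dns
[root@sno00035 ~]# oc --kubeconfig=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig get pods -n openshift-dns -o wide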

--- Additional comment from akrzos on 20210614T21:02:50

Created attachment 1791081 [details]
Data from the fc.9 test

--- Additional comment from itsoiref on 20210615T09:05:16

@akrzos it would be very helpful to debug this live, or to have must-gather logs as well.

Comment 3 Trey West 2021-06-30 17:55:50 UTC
Verified by @cchun on 2.3.0-DOWNSTREAM-2021-06-24-02-38-49

Comment 5 errata-xmlrpc 2021-07-27 23:12:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

