Bug 1880450

Summary: Host failed to install because its installation stage joined took longer than expected 20m0s.
Product: OpenShift Container Platform
Component: assisted-installer
Sub component: assisted-service
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: Alexander Chuzhoy <sasha>
Assignee: Eran Cohen <ercohen>
QA Contact: Yuri Obshansky <yobshans>
CC: aos-bugs, ncarboni
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Fixed In Version: OCP-Metal-v1.0.10.2
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-02-24 15:18:37 UTC
Attachments: cluster_logs

Description Alexander Chuzhoy 2020-09-18 15:14:45 UTC
A cluster deployment fails with the nodes going into Error status:

master-0   Error 5/7
master-1   Error 6/7
master-2   Error 6/7
worker-0   Error 5/7
worker-1   Error 5/7


The error on the hosts showing 5/7 says:
Error

This host failed its installation.
Host is part of a cluster that failed to install.


The error on the hosts showing 6/7 says:
Error

This host failed its installation.
Host failed to install because its installation stage joined took longer than expected 20m0s.

This was attempted on a real bare-metal (BM) setup.
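
For context, a minimal sketch in Go (using made-up names, not the actual assisted-service code) of how a per-stage timeout check could produce the 6/7 message quoted above: the time a host has spent in its current installation stage is compared against a fixed per-stage limit, and "20m0s" is simply that limit rendered as a Go duration.

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical per-stage limits; the real values were calibrated on VM
    // installations (see comment 7 below).
    var stageTimeouts = map[string]time.Duration{
        "Joined":                20 * time.Minute,
        "Writing image to disk": 30 * time.Minute,
    }

    // checkStageTimeout reports whether a host has been in its current
    // installation stage longer than the allowed limit, and builds a
    // message similar to the one shown in the UI.
    func checkStageTimeout(stage string, stageStartedAt, now time.Time) (string, bool) {
        limit, ok := stageTimeouts[stage]
        if !ok || now.Sub(stageStartedAt) <= limit {
            return "", false
        }
        msg := fmt.Sprintf(
            "Host failed to install because its installation stage %s took longer than expected %s",
            stage, limit)
        return msg, true
    }

    func main() {
        // Pretend the host entered the "Joined" stage 25 minutes ago.
        started := time.Now().Add(-25 * time.Minute)
        if msg, timedOut := checkStageTimeout("Joined", started, time.Now()); timedOut {
            fmt.Println(msg) // "... stage Joined took longer than expected 20m0s"
        }
    }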

Version:
Current Application:
    Clusters
Application Path:
    /openshift/assisted-installer/clusters/713d7654-c7c5-4666-97e4-e646d53c415e
Chrome Version:
    3dab1e9314c641c709d758c876d8e8bdc8c74186
Dashboard Version:
    da0936242d9771fa2c4694c7d450b45ce044dff4
Inventory Version:
    a45b358582697a8d98f00e52fd775645896c3301
Remediations Version:
    72bd90315e3b6791eab10490383efa9394d96087
Vulnerability Version:
    a4bc02ec3347e4417749dca9d8c5bce77083e40c
Compliance Version:
    1a44d154c505eafdad729d8a6f2cf89856800714
Cost Management Version:
    ca1ea0c4739995f5938071865de59fa1b532996b
Advisor Version:
    3c17a1782cbc473ca58d1dc967cafc4b5ad9f2cd
Drift Version:
    9a4e02d2bdf857ab201145a3f8467efa3bd75f15
Migration Analytics Version:
    9e3cd1f8499e2aaf3a484c789de93204e108a0df
Automation Hub Version:
    5884daee9a75b899e0788324ab9cf61ede96af08
Automation Analytics Version:
    89f9692730b63bf77849b61bfd25950e68a21678
Policies Version:
    1dafea47caae35d2bb1780892279192606371347
Patch Version:
    3b5a0ab7bf535a25a5dca820ef2a1b6fa5236299
Automation Services Catalog Version:
    98fb9e1bcca8681282b2e2f1c84391a4e454c1b9
Approval Version:
    4147fbf57b1cf6c68058acef8e93803d42a3806d
Sources Version:
    7524e34d2a0a0459fc8ccaaa6475ef9469ded496

Comment 1 Alexander Chuzhoy 2020-09-18 15:35:50 UTC
The following errors were spotted in journalctl:
Sep 18 15:33:54 master-1 hyperkube[2861]: W0918 14:20:51.527638       1 builder.go:97] graceful termination failed, controllers failed with error: stopped
Sep 18 15:33:55 master-1 hyperkube[2861]: I0918 15:33:55.354503    2861 handler.go:295] error while reading "/proc/2861/fd/39" link: readlink /proc/2861/fd/39: no such file or directory
Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978287    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/gce-pd
Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978298    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/azure-disk
Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978303    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/cinder
Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978308    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/aws-ebs
Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987943    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/aws-ebs
Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987956    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/gce-pd
Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987965    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/azure-disk
Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987970    2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/cinder

Comment 3 Alexander Chuzhoy 2020-09-18 15:38:47 UTC
Created attachment 1715400 [details]
cluster_logs

Comment 4 Nick Carboni 2020-09-18 15:52:40 UTC
In the agent logs for host c0e6fab6-4c8c-f516-2fa2-e922d53b7d32 I see:
Sep 18 14:15:34 openshift-master-1 agent[2839]: time="18-09-2020 14:15:34" level=warning msg="Could not query next steps" file="step_processor.go:102" error="response status code does not match any response statuses defined for this endpoint in the swagger spec (status 504): {}" request_id=707e9caa-8519-4abb-bf32-1d319f3f5246

This request ID corresponds to the following logs from the service:

time="2020-09-18T14:15:32Z" level=info msg="GetNextSteps cluster: ,<713d7654-c7c5-4666-97e4-e646d53c415e> host: <c0e6fab6-4c8c-f516-2fa2-e922d53b7d32>, host status: <installing-in-progress>" func="github.com/openshift/assisted-service/internal/host.(*InstructionManager).GetNextSteps" file="/go/src/github.com/openshift/origin/internal/host/instructionmanager.go:89" go-id=277222 pkg=instructions request_id=707e9caa-8519-4abb-bf32-1d319f3f5246
time="2020-09-18T14:15:32Z" level=info msg="No steps required for cluster <713d7654-c7c5-4666-97e4-e646d53c415e> host <c0e6fab6-4c8c-f516-2fa2-e922d53b7d32>" func=github.com/openshift/assisted-service/internal/host.logSteps file="/go/src/github.com/openshift/origin/internal/host/instructionmanager.go:122" go-id=277222 pkg=instructions request_id=707e9caa-8519-4abb-bf32-1d319f3f5246

In the agent logs for host a2e1088f-4834-5688-b105-750ffc3297f5 I see connectivity check failures:

Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.3 on interface eno1: exit status 1" file="connectivity_check.go:68"
Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.6 on interface eno1: exit status 1" file="connectivity_check.go:68"
Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.5 on interface eno1: exit status 1" file="connectivity_check.go:68"
Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.2 on interface eno1: exit status 1" file="connectivity_check.go:68"

I'm not sure whether either of these could cause the timeout error, but both look suspicious.

Comment 5 Nick Carboni 2020-09-18 16:02:16 UTC
I also see the following message in Kibana for every host:

time="2020-09-18T14:01:33Z" level=error msg="failed to refresh host f7ea8d4b-73f2-a68d-2480-1ebed5a7345e state" func="github.com/openshift/assisted-service/internal/host.(*Manager).HostMonitoring" file="/go/src/github.com/openshift/origin/internal/host/monitor.go:46" error="failed to update host f7ea8d4b-73f2-a68d-2480-1ebed5a7345e from cluster 713d7654-c7c5-4666-97e4-e646d53c415e state from pending-for-input to known: failed to update host f7ea8d4b-73f2-a68d-2480-1ebed5a7345e from cluster 713d7654-c7c5-4666-97e4-e646d53c415e. nothing have changed" pkg=host-state request_id=efec2aec-da43-4f31-92f0-62de8fc35dc4

Comment 6 Alexander Chuzhoy 2020-09-18 17:56:20 UTC
The issue didn't reproduce on another attempt to reinstall.

Comment 7 Eran Cohen 2020-09-22 10:33:06 UTC
From the cluster logs it seems the installation itself went fine:
Bootkube completed with a working control plane, and all hosts (the bootstrap included) reached the reboot stage.
Any chance we can get the assisted-controller logs?
Also, is c0e6fab6-4c8c-f516-2fa2-e922d53b7d32 the ID of the host that hit the timeout?
We chose the timeout values based on how long each step takes in VM installations; we should collect similar data for bare-metal installations and adjust the limits accordingly.
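
As a purely illustrative sketch of that adjustment (the environment variable and default below are hypothetical, not the actual service configuration), the stage timeout could be made overridable so bare-metal deployments can use a longer limit than the VM-calibrated default:

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    // defaultJoinedTimeout mirrors the 20m limit seen in this report.
    const defaultJoinedTimeout = 20 * time.Minute

    // joinedStageTimeout falls back to the VM-calibrated default unless a
    // hypothetical override (e.g. a longer value for bare metal) is set.
    func joinedStageTimeout() time.Duration {
        if v := os.Getenv("JOINED_STAGE_TIMEOUT"); v != "" {
            if d, err := time.ParseDuration(v); err == nil {
                return d
            }
        }
        return defaultJoinedTimeout
    }

    func main() {
        // e.g. JOINED_STAGE_TIMEOUT=45m ./timeout-demo
        fmt.Println("joined stage timeout:", joinedStageTimeout())
    }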

Comment 8 Alexander Chuzhoy 2020-10-01 20:10:21 UTC
Verified:

Version:
Application Path:
    /openshift/assisted-installer/clusters
Chrome Version:
    3dab1e9314c641c709d758c876d8e8bdc8c74186
Dashboard Version:
    70db21162e2e50482d151f1ae4a420fa1c6fbd99
Inventory Version:
    a45b358582697a8d98f00e52fd775645896c3301
Remediations Version:
    72bd90315e3b6791eab10490383efa9394d96087
Vulnerability Version:
    a4bc02ec3347e4417749dca9d8c5bce77083e40c
Compliance Version:
    1f9bd9e0b1aefad86b6498c5b3154e6c8a41dad1
Cost Management Version:
Advisor Version:
    056986a484643f663676146162bf273554760f79
Drift Version:
    c12ba359bb698c567e72278f22ec2fe0b004f3bb
Migration Analytics Version:
    0d3e6ed385c4607ecd9991f4dfe5d47e1ea5937e
Automation Hub Version:
    83dfeaa7c76bfa93bfed6bde91f266571ab77b72
Automation Analytics Version:
    89f9692730b63bf77849b61bfd25950e68a21678
Policies Version:
    18c0ae43195be0149b1b8f08a80d4858f2c39e75
Patch Version:
    3b5a0ab7bf535a25a5dca820ef2a1b6fa5236299
Automation Services Catalog Version:
    1903e5512af71010886c5a7444f28f728ee04839
Approval Version:
    4147fbf57b1cf6c68058acef8e93803d42a3806d
Sources Version:
    ee7e5e9a5d0f1152fe8a4fb1c972251055aa4770

The reported issue didn't reproduce.

Comment 11 errata-xmlrpc 2021-02-24 15:18:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633