Bug 1880450
Summary: Host failed to install because its installation stage joined took longer than expected 20m0s.

Product: OpenShift Container Platform
Component: assisted-installer
Sub component: assisted-service
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Type: Bug
Reporter: Alexander Chuzhoy <sasha>
Assignee: Eran Cohen <ercohen>
QA Contact: Yuri Obshansky <yobshans>
CC: aos-bugs, ncarboni
Fixed In Version: OCP-Metal-v1.0.10.2
Doc Type: No Doc Update
Last Closed: 2021-02-24 15:18:37 UTC
Attachments: cluster_logs (attachment 1715400)
Description
Alexander Chuzhoy, 2020-09-18 15:14:45 UTC
The following errors are spotted in journalctl:

    Sep 18 15:33:54 master-1 hyperkube[2861]: W0918 14:20:51.527638 1 builder.go:97] graceful termination failed, controllers failed with error: stopped
    Sep 18 15:33:55 master-1 hyperkube[2861]: I0918 15:33:55.354503 2861 handler.go:295] error while reading "/proc/2861/fd/39" link: readlink /proc/2861/fd/39: no such file or directory
    Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978287 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/gce-pd
    Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978298 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/azure-disk
    Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978303 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/cinder
    Sep 18 15:33:59 master-1 hyperkube[2861]: I0918 15:33:59.978308 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/aws-ebs
    Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987943 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/aws-ebs
    Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987956 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/gce-pd
    Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987965 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/azure-disk
    Sep 18 15:34:09 master-1 hyperkube[2861]: I0918 15:34:09.987970 2861 setters.go:773] Error getting volume limit for plugin kubernetes.io/cinder

Created attachment 1715400 [details]: cluster_logs
In the agent logs for host c0e6fab6-4c8c-f516-2fa2-e922d53b7d32 I see:

    Sep 18 14:15:34 openshift-master-1 agent[2839]: time="18-09-2020 14:15:34" level=warning msg="Could not query next steps" file="step_processor.go:102" error="response status code does not match any response statuses defined for this endpoint in the swagger spec (status 504): {}" request_id=707e9caa-8519-4abb-bf32-1d319f3f5246

This request ID corresponds with the following logs from the service:

    time="2020-09-18T14:15:32Z" level=info msg="GetNextSteps cluster: <713d7654-c7c5-4666-97e4-e646d53c415e> host: <c0e6fab6-4c8c-f516-2fa2-e922d53b7d32>, host status: <installing-in-progress>" func="github.com/openshift/assisted-service/internal/host.(*InstructionManager).GetNextSteps" file="/go/src/github.com/openshift/origin/internal/host/instructionmanager.go:89" go-id=277222 pkg=instructions request_id=707e9caa-8519-4abb-bf32-1d319f3f5246
    time="2020-09-18T14:15:32Z" level=info msg="No steps required for cluster <713d7654-c7c5-4666-97e4-e646d53c415e> host <c0e6fab6-4c8c-f516-2fa2-e922d53b7d32>" func=github.com/openshift/assisted-service/internal/host.logSteps file="/go/src/github.com/openshift/origin/internal/host/instructionmanager.go:122" go-id=277222 pkg=instructions request_id=707e9caa-8519-4abb-bf32-1d319f3f5246

In the agent logs for host a2e1088f-4834-5688-b105-750ffc3297f5 I see connectivity check failures:

    Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.3 on interface eno1: exit status 1" file="connectivity_check.go:68"
    Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.6 on interface eno1: exit status 1" file="connectivity_check.go:68"
    Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.5 on interface eno1: exit status 1" file="connectivity_check.go:68"
    Sep 18 14:00:41 openshift-master-2 connectivity_ch[8844]: time="18-09-2020 14:00:41" level=info msg="Error running ping to 10.19.134.2 on interface eno1: exit status 1" file="connectivity_check.go:68"

I'm not sure whether either of these could cause the timeout error, but both seem conspicuous. I also see the following message in kibana for every host:

    time="2020-09-18T14:01:33Z" level=error msg="failed to refresh host f7ea8d4b-73f2-a68d-2480-1ebed5a7345e state" func="github.com/openshift/assisted-service/internal/host.(*Manager).HostMonitoring" file="/go/src/github.com/openshift/origin/internal/host/monitor.go:46" error="failed to update host f7ea8d4b-73f2-a68d-2480-1ebed5a7345e from cluster 713d7654-c7c5-4666-97e4-e646d53c415e state from pending-for-input to known: failed to update host f7ea8d4b-73f2-a68d-2480-1ebed5a7345e from cluster 713d7654-c7c5-4666-97e4-e646d53c415e. nothing have changed" pkg=host-state request_id=efec2aec-da43-4f31-92f0-62de8fc35dc4

The issue didn't reproduce on another attempt to reinstall.

From the cluster logs it seems that the installation went just fine: bootkube completed with a working control plane, and all hosts (the bootstrap as well) got to the reboot stage. Any chance we can get the assisted-controller logs? Also, is c0e6fab6-4c8c-f516-2fa2-e922d53b7d32 the ID of the host that got the timeout?

We decided on the timeouts based on the time each step takes on VM installations; we should collect some data on bare-metal installations and make adjustments accordingly. A sketch of how such per-stage timeouts work is given below.
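For context on the error in the summary: the service tracks how long each host sits in its current installation stage and fails the host when a per-stage limit is exceeded; the 20m0s limit on the "joined" stage is what fired here. The following is a minimal Go sketch of that idea, not the actual assisted-service code: the names (stageTimeouts, stageTimedOut) and all durations except the 20m0s "Joined" limit quoted in this bug are illustrative assumptions.

    package main

    import (
    	"fmt"
    	"time"
    )

    // Hypothetical per-stage timeout table. Only the 20m0s "Joined" value
    // is taken from this bug report; the rest are made-up placeholders.
    // Per the discussion above, such values were tuned on VM installations
    // and may be too tight for bare metal.
    var stageTimeouts = map[string]time.Duration{
    	"Writing image to disk": 30 * time.Minute,
    	"Rebooting":             40 * time.Minute,
    	"Joined":                20 * time.Minute, // the timeout hit in this bug
    	"Done":                  10 * time.Minute,
    }

    // stageTimedOut reports whether a host has been in its current stage
    // longer than that stage's allowed duration.
    func stageTimedOut(stage string, enteredAt time.Time) bool {
    	limit, ok := stageTimeouts[stage]
    	if !ok {
    		return false // unknown stage: no timeout enforced in this sketch
    	}
    	return time.Since(enteredAt) > limit
    }

    func main() {
    	// Simulate a host that entered the "Joined" stage 25 minutes ago.
    	enteredAt := time.Now().Add(-25 * time.Minute)
    	if stageTimedOut("Joined", enteredAt) {
    		fmt.Println("Host failed to install because its installation stage joined took longer than expected 20m0s.")
    	}
    }

A monitor loop that runs a check like this on a timer would explain the behavior seen here: a host that is actually progressing, but slowly (as bare-metal hosts often do), still gets marked failed once the stage clock runs out.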
Verified:

Version:
Application Path: /openshift/assisted-installer/clusters
Chrome Version: 3dab1e9314c641c709d758c876d8e8bdc8c74186
Dashboard Version: 70db21162e2e50482d151f1ae4a420fa1c6fbd99
Inventory Version: a45b358582697a8d98f00e52fd775645896c3301
Remediations Version: 72bd90315e3b6791eab10490383efa9394d96087
Vulnerability Version: a4bc02ec3347e4417749dca9d8c5bce77083e40c
Compliance Version: 1f9bd9e0b1aefad86b6498c5b3154e6c8a41dad1
Cost Management Version:
Advisor Version: 056986a484643f663676146162bf273554760f79
Drift Version: c12ba359bb698c567e72278f22ec2fe0b004f3bb
Migration Analytics Version: 0d3e6ed385c4607ecd9991f4dfe5d47e1ea5937e
Automation Hub Version: 83dfeaa7c76bfa93bfed6bde91f266571ab77b72
Automation Analytics Version: 89f9692730b63bf77849b61bfd25950e68a21678
Policies Version: 18c0ae43195be0149b1b8f08a80d4858f2c39e75
Patch Version: 3b5a0ab7bf535a25a5dca820ef2a1b6fa5236299
Automation Services Catalog Version: 1903e5512af71010886c5a7444f28f728ee04839
Approval Version: 4147fbf57b1cf6c68058acef8e93803d42a3806d
Sources Version: ee7e5e9a5d0f1152fe8a4fb1c972251055aa4770

The reported issue didn't reproduce.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633