Summary: | Metal Day 1 - Delayed and Vague Ironic Failure on Day-1 Syntax Errors. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Yoav Porag <yporagpa> |
Component: | Installer | Assignee: | Dmitry Tantsur <dtantsur> |
Installer sub component: | OpenShift on Bare Metal IPI | QA Contact: | Yoav Porag <yporagpa> |
Status: | CLOSED CANTFIX | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | augol, awolff, dtantsur, eglottma, rpittau, zbitter |
Version: | 4.10 | Keywords: | Triaged |
Target Milestone: | --- | ||
Target Release: | 4.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-06-01 08:34:34 UTC | Type: | Bug |
Description
Yoav Porag
2022-01-26 10:53:27 UTC
This is a good point. For workers we now report failures to build the image in the PreprovisioningImage resource, so that errors appear in the BaremetalHost resource where you'd expect them. There's no equivalent path for error reporting in the installer, so the experience on the control plane is not good.

At a minimum, we should monitor the image customization in the installer to see if it exits, and fail the installation as soon as it does. Better error reporting would be good too though.

[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.11.0-0.nightly-2022-02-05-211325
built from commit c10437e20dc5d7ecd7a53eb02f353397abab59a5
release image registry.ci.openshift.org/ocp/release@sha256:1cf7259c0cb1ae73c6b7af3176e22f7078fdb6f92bf62fe795b774325a1d2066

Added to install config:

networkConfig:
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.168.123.1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - 192.168.123.1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv4:
      address:
      - ip: 192.168.123.160
        prefix-length: 24
-------------> Line missing on purpose.

Installation attempt failed, but only after the full 60 min wait time.

DEBUG ironic_node_v1.openshift-master-host[2]: Still creating... [59m50s elapsed]
DEBUG ironic_node_v1.openshift-master-host[0]: Still creating... [59m50s elapsed]
DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [59m50s elapsed]
ERROR
ERROR Error: could not contact Ironic API: timeout reached
ERROR
ERROR   on ../../tmp/openshift-install-masters-3029058312/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
ERROR   13: resource "ironic_node_v1" "openshift-master-host" {
ERROR
ERROR
ERROR
ERROR Error: could not contact Ironic API: timeout reached
ERROR
ERROR   on ../../tmp/openshift-install-masters-3029058312/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
ERROR   13: resource "ironic_node_v1" "openshift-master-host" {
ERROR
ERROR
ERROR
ERROR Error: could not contact Ironic API: context deadline exceeded
ERROR
ERROR   on ../../tmp/openshift-install-masters-3029058312/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
ERROR   13: resource "ironic_node_v1" "openshift-master-host" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: error(BaremetalIronicAPITimeout) from Infrastructure Provider: Unable to the reach provisioning service. This failure can be caused by incorrect network/proxy settings, inability to download the machine operating system images, or other misconfiguration. Please check access to the bootstrap host, and for any failing services.

[core@localhost ~]$ sudo podman logs image-customization
{"level":"info","ts":1644139477.7491918,"logger":"static-server","msg":"Go Version: go1.17.2"}
{"level":"info","ts":1644139477.7493894,"logger":"static-server","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1644139477.7494211,"logger":"static-server","msg":"Git commit: unknown"}
{"level":"info","ts":1644139477.7494524,"logger":"static-server","msg":"Build time: unknown"}
{"level":"info","ts":1644139477.749478,"logger":"static-server","msg":"Component: openshift/image-customization-controller was not built with version info"}
{"level":"error","ts":1644139478.2035537,"logger":"static-server","msg":"problem loading static ignitions","error":"failed to convert nmstate data: exit status 1","errorVerbose":"exit status 1\nfailed to convert nmstate data"}

Original issue remains unchanged. Moving BZ back to assigned.

I will double-check, but that's probably the best I can do given the installer architecture.

We might have a couple of options in the future (not short term):
* Improve the installer to be able to signal an unrecoverable failure from the bootstrap
* Integrate the Rust version of NMState (doesn't currently support gc command) into the installer so we can validate the data in the install config

Options for this bug:
1. Declare victory so we can backport the patch to 4.10.z (I think it makes debugging a little better at least?); open a new bug or Jira story for future improvements
2. Close as CANTFIX; open a story in Jira for future improvements
3. Leave open for future improvements

Yoav, I guess it's your call (keeping in mind that none of us will probably have cycles for a significant investment in the installer).

(I personally think that "could not contact Ironic API" is much better than an inspection timeout in terms of debuggability)

I agree that "could not contact Ironic API" is better, and I think a Jira story will be better suited for the state of this bug.
However, the state the bug is in now, where it takes over an hour to fail on a configuration error, seems really unreasonable to me. The state it was in before, where it took around 15 minutes from installation launch to failure, was preferable. Is there any way to reduce this significantly while preserving (or improving) the output? @dtantsur @zbitter

> Is there any way to reduce this significantly while preserving (or improving) the output?

Per Zane's comment 6: not without a significant rework of the installer.

> The state it was before where it took around 15 minutes from installation launch to failure was preferable.

Which failure are you comparing with? Network data did not exist before.

When I originally posted the bug, the failure happened during the Ironic inspection phase, usually around 15 min after the installation started. Unfortunately, I did not record what version this was, my bad.

DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [6m0s elapsed]
DEBUG ironic_node_v1.openshift-master-host[0]: Still creating... [6m0s elapsed]
DEBUG ironic_node_v1.openshift-master-host[2]: Still creating... [6m0s elapsed]
ERROR
ERROR Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://[fd00:1101:0:1::2]:8084/openshift-master-0-2.initramfs failed, reason: HTTPConnectionPool(host='fd00:1101:0:1::2', port=8084): Max retries exceeded with url: /openshift-master-0-2.initramfs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb5cca24710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))'
.....

In its current state, the installer forces the customer to wait over an hour until it fails the installation, which seems unreasonable.
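
The concern above is that the wait runs to its full timeout even when the failure is already certain. A minimal Python sketch of the alternative, a wait loop that also watches a dependency and aborts as soon as it exits, could look like this. It is not the installer's or the Terraform provider's code: `wait_for_service`, `service_ready`, `dependency_alive` and `DependencyFailed` are hypothetical names, and the two callables stand in for checks such as "does the Ironic API answer" and "is the image customization still running".

```python
import time
from typing import Callable


class DependencyFailed(RuntimeError):
    """Raised when the watched dependency has exited, so further waiting is pointless."""


def wait_for_service(
    service_ready: Callable[[], bool],
    dependency_alive: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until service_ready() returns True and return the elapsed seconds.

    Raises DependencyFailed as soon as dependency_alive() returns False,
    and TimeoutError once timeout_s has elapsed without success.
    """
    start = clock()
    while True:
        if service_ready():
            return clock() - start
        # Fail fast: if the dependency is gone, the service will never come up.
        if not dependency_alive():
            raise DependencyFailed("dependency exited before the service became ready")
        if clock() - start >= timeout_s:
            raise TimeoutError(f"service not ready after {timeout_s} seconds")
        sleep(poll_interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays; a caller would normally leave them at their defaults and choose `timeout_s` and `poll_interval_s` to suit the service being awaited.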

@dtantsur It sounds like the terraform provider has no timeout when waiting for ironic to come up (whereas ironic *does* have a timeout waiting for inspection to start, which then reports an error to terraform).

I think I've found a way to reduce the timeout: https://github.com/openshift/installer/pull/5639. I haven't tested it myself, help welcome.

[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.11.0-0.nightly-2022-02-27-122819
built from commit 3d19350885d593ee2b1d9ecd7612c2d697dab2a3
release image registry.ci.openshift.org/ocp/release@sha256:e474d91b23fdcfd2de7732442c9c985b0000a6560f7f5db14461f1c92105e8da
release architecture amd64

Added to install config:

networkConfig:
  routes:
    config:
    - destination: 0.0.0.0/0
      next-hop-address: 192.168.123.1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - 192.168.123.1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv4:
      address:
      - ip: 192.168.123.160
        prefix-length: 24
-------------> Line missing on purpose.

Installation failed on ironic inspection

DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [4m40s elapsed]
DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [4m50s elapsed]
DEBUG ironic_node_v1.openshift-master-host[0]: Still creating... [4m50s elapsed]
DEBUG ironic_node_v1.openshift-master-host[2]: Still creating... [4m50s elapsed]
ERROR
ERROR Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://192.168.123.5:8084/openshift-master-0-1.initramfs failed, reason: HTTPConnectionPool(host='192.168.123.5', port=8084): Max retries exceeded with url: /openshift-master-0-1.initramfs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe8fb63438>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))'
ERROR
ERROR   on ../../tmp/openshift-install-masters-673864748/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
ERROR   13: resource "ironic_node_v1" "openshift-master-host" {
ERROR
ERROR
ERROR bootstrap is not accessible to extract logs from.

The time to failure is shorter than in the initial state of the bug, but the error is still vague, and bootstrap logs aren't available for analysis, so the problem can't be verified.

As per our discussion in comments #6 to #9, I think that the bug should be closed and a Jira story for this issue should be opened. Please attach a link to the story here, and then the bug can be closed. For now, I'm setting it back to assigned.
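
Since the only specific error in this report appears in the image-customization container log on the bootstrap host, a small filter for its JSON log lines may help when triaging similar failures. The Python sketch below assumes one JSON object per line with a `level` field, as in the log quoted in this report; `error_entries` and `SAMPLE_LOG` are hypothetical names, and the two sample lines are copied from that log.

```python
import json
from typing import Any, Dict, Iterable, List

# Two lines copied from the image-customization log quoted in this report.
SAMPLE_LOG = [
    r'{"level":"info","ts":1644139477.7491918,"logger":"static-server","msg":"Go Version: go1.17.2"}',
    r'{"level":"error","ts":1644139478.2035537,"logger":"static-server","msg":"problem loading static ignitions","error":"failed to convert nmstate data: exit status 1","errorVerbose":"exit status 1\nfailed to convert nmstate data"}',
]


def error_entries(log_lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Return the parsed log entries whose "level" field is "error".

    Blank lines, lines that are not valid JSON, and JSON values that are
    not objects are skipped rather than treated as failures.
    """
    errors: List[Dict[str, Any]] = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("level") == "error":
            errors.append(entry)
    return errors


if __name__ == "__main__":
    for entry in error_entries(SAMPLE_LOG):
        # Print the message and the underlying error, if the entry has one.
        print(entry.get("msg"), "-", entry.get("error"))
```

On a bootstrap host the input could instead come from the output of `sudo podman logs image-customization`, for example by passing `sys.stdin` to `error_entries`; which output stream carries the log lines is not shown in this report, so that part may need adjusting.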