Bug 2046181

Summary:	Metal Day 1 - Delayed and Vague Ironic Failure on Day-1 Syntax Errors.
Product:	OpenShift Container Platform	Reporter:	Yoav Porag <yporagpa>
Component:	Installer	Assignee:	Dmitry Tantsur <dtantsur>
Installer sub component:	OpenShift on Bare Metal IPI	QA Contact:	Yoav Porag <yporagpa>
Status:	CLOSED CANTFIX	Docs Contact:
Severity:	medium
Priority:	medium	CC:	augol, awolff, dtantsur, eglottma, rpittau, zbitter
Version:	4.10	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2022-06-01 08:34:34 UTC	Type:	Bug

Description Yoav Porag 2022-01-26 10:53:27 UTC

Version:

Platform:

Baremetal IPI

What happened?

Some syntax errors in the install-config.yaml regarding metal-day-1 lead to an Ironic failure, and ultimately to a failure to convert nmstate data.

ERROR Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://[fd00:1101:0:1::2]:8084/openshift-master-0-2.initramfs failed, reason: HTTPConnectionPool(host='fd00:1101:0:1::2', port=8084): Max retries exceeded with url: /openshift-master-0-2.initramfs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb5cca24710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))'

[kni@provisionhost-0-0 ~]$ ssh core@fd00:1101:0:1::2 -t sudo podman logs image-customization
{"level":"info","ts":1643112802.9780934,"logger":"static-server","msg":"Go Version: go1.17.2"}
{"level":"info","ts":1643112802.9781542,"logger":"static-server","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1643112802.9781632,"logger":"static-server","msg":"Git commit: unknown"}
{"level":"info","ts":1643112802.9781702,"logger":"static-server","msg":"Build time: unknown"}
{"level":"info","ts":1643112802.978177,"logger":"static-server","msg":"Component: openshift/image-customization-controller was not built with version info"}
{"level":"error","ts":1643112803.462108,"logger":"static-server","msg":"problem loading static ignitions","error":"failed to convert nmstate data: exit status 1","errorVerbose":"exit status 1\nfailed to convert nmstate data"}

This seems like a very abstract failure that will lead to a bad user experience.
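As a debugging aid, the error entry can be pulled out of the image-customization output with a few lines of Python. This is only a sketch: the function name `extract_errors` is made up for illustration, and it assumes nothing beyond the JSON fields (`level`, `ts`, `msg`, `error`) visible in the log lines quoted above.

```python
import json


def extract_errors(log_text):
    """Return (ts, msg, error) for every JSON log entry at level 'error'."""
    errors = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip shell prompts and other non-JSON lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that only look like JSON
        if entry.get("level") == "error":
            errors.append((entry.get("ts"), entry.get("msg"), entry.get("error")))
    return errors


# Two lines copied from the podman output above. The raw string keeps the
# literal "\n" escape inside the JSON intact.
SAMPLE = r"""
{"level":"info","ts":1643112802.9780934,"logger":"static-server","msg":"Go Version: go1.17.2"}
{"level":"error","ts":1643112803.462108,"logger":"static-server","msg":"problem loading static ignitions","error":"failed to convert nmstate data: exit status 1","errorVerbose":"exit status 1\nfailed to convert nmstate data"}
"""

for ts, msg, err in extract_errors(SAMPLE):
    print(f"{ts}: {msg}: {err}")
```

Note that even the extracted entry only says "exit status 1", which is the core of this report: the log itself does not name the offending field.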

What did you expect to happen?

These failures should be detected faster (it currently takes around 10 minutes for the installation to fail) and should supply more information about the error, at the very least which line failed to convert.

How to reproduce it (as minimally and precisely as possible)?

Example: configure a static IP in install-config.yaml and remove a relevant line. Unmarshalling still succeeds.

        networkConfig: 
          routes:
            config:
            - destination: ::/0
              next-hop-address: fd2e:6f44:5dd8::1
              next-hop-interface: enp0s4
          dns-resolver:
            config:
              server:
              - fd2e:6f44:5dd8::1
          interfaces:
          - name: enp0s4
            type: ethernet
            state: up
            ipv6:
              address:
              - ip: fd2e:6f44:5dd8::face
                prefix-length: 64
#             enabled: true <----- remove this line to reproduce error.
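For illustration, the kind of early check being asked for could look like the sketch below. It is hypothetical and is not part of the installer or of nmstate: the function name `find_missing_enabled` is invented, it works on the dict a YAML parser would produce from `networkConfig`, and it only covers the single omission shown in the reproducer (an `ipv4`/`ipv6` block with addresses but no `enabled` key). Real validation would have to come from nmstate itself.

```python
def find_missing_enabled(network_config):
    """List 'interface/family' entries that set addresses without 'enabled'."""
    problems = []
    for iface in network_config.get("interfaces", []):
        for family in ("ipv4", "ipv6"):
            ip_block = iface.get(family)
            if (
                isinstance(ip_block, dict)
                and ip_block.get("address")
                and "enabled" not in ip_block
            ):
                problems.append(f"{iface.get('name', '<unnamed>')}/{family}")
    return problems


# The interfaces part of the reproducer above, with 'enabled: true' removed.
network_config = {
    "interfaces": [
        {
            "name": "enp0s4",
            "type": "ethernet",
            "state": "up",
            "ipv6": {
                "address": [{"ip": "fd2e:6f44:5dd8::face", "prefix-length": 64}],
            },
        }
    ],
}

print(find_missing_enabled(network_config))  # ['enp0s4/ipv6']
```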

Comment 1 Zane Bitter 2022-01-26 15:33:57 UTC

This is a good point. For workers we now report failures to build the image in the PreprovisioningImage resource, so that errors appear in the BaremetalHost resource where you'd expect them. There's no equivalent path for error reporting in the installer, so the experience on the control plane is not good.

At a minimum, we should monitor the image customization in the installer to see if it exits, and fail the installation as soon as it does. Better error reporting would be good too though.
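To make the "fail as soon as it exits" idea concrete, here is a rough sketch of the control flow in Python. It is not how the installer is built (that is Go plus Terraform); `wait_for_service`, `is_ready`, `has_exited` and the timeout values are all hypothetical names and numbers chosen for illustration.

```python
import time


class ContainerExited(Exception):
    """Raised when the watched container exits before the service is ready."""


def wait_for_service(is_ready, has_exited, timeout_s=300.0, poll_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until is_ready() is true, failing fast if has_exited() is true.

    is_ready and has_exited are caller-supplied callables returning bool,
    e.g. an HTTP probe and a container status check.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if has_exited():
            # No point waiting for the full timeout: the image
            # customization is gone and will never serve the images.
            raise ContainerExited(
                "image customization exited before the service became ready"
            )
        if is_ready():
            return
        sleep(poll_s)
    raise TimeoutError(f"service not ready after {timeout_s} seconds")
```

The point of the design is that an exited container is reported on the next poll rather than after the whole timeout has elapsed, and the error names the container instead of a downstream symptom.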

Comment 4 Yoav Porag 2022-02-06 10:48:23 UTC

[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.11.0-0.nightly-2022-02-05-211325
built from commit c10437e20dc5d7ecd7a53eb02f353397abab59a5
release image registry.ci.openshift.org/ocp/release@sha256:1cf7259c0cb1ae73c6b7af3176e22f7078fdb6f92bf62fe795b774325a1d2066

Added to install config:
        networkConfig:
          routes:
            config:
            - destination: 0.0.0.0/0
              next-hop-address: 192.168.123.1
              next-hop-interface: enp0s4
          dns-resolver:
            config:
              server:
              - 192.168.123.1
          interfaces:
          - name: enp0s4
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: 192.168.123.160
                prefix-length: 24
-------------> Line missing on purpose.

Installation attempt failed, but only after the full 60 min wait time.

DEBUG ironic_node_v1.openshift-master-host[2]: Still creating... [59m50s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[0]: Still creating... [59m50s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [59m50s elapsed] 
ERROR                                              
ERROR Error: could not contact Ironic API: timeout reached 
ERROR                                              
ERROR   on ../../tmp/openshift-install-masters-3029058312/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host": 
ERROR   13: resource "ironic_node_v1" "openshift-master-host" { 
ERROR                                              
ERROR                                              
ERROR                                              
ERROR Error: could not contact Ironic API: timeout reached 
ERROR                                              
ERROR   on ../../tmp/openshift-install-masters-3029058312/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host": 
ERROR   13: resource "ironic_node_v1" "openshift-master-host" { 
ERROR                                              
ERROR                                              
ERROR                                              
ERROR Error: could not contact Ironic API: context deadline exceeded 
ERROR                                              
ERROR   on ../../tmp/openshift-install-masters-3029058312/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host": 
ERROR   13: resource "ironic_node_v1" "openshift-master-host" { 
ERROR                                              
ERROR                                              
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: error(BaremetalIronicAPITimeout) from Infrastructure Provider: Unable to the reach provisioning service. This failure can be caused by incorrect network/proxy settings, inability to download the machine operating system images, or other misconfiguration. Please check access to the bootstrap host, and for any failing services. 

[core@localhost ~]$ sudo podman logs image-customization
{"level":"info","ts":1644139477.7491918,"logger":"static-server","msg":"Go Version: go1.17.2"}
{"level":"info","ts":1644139477.7493894,"logger":"static-server","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1644139477.7494211,"logger":"static-server","msg":"Git commit: unknown"}
{"level":"info","ts":1644139477.7494524,"logger":"static-server","msg":"Build time: unknown"}
{"level":"info","ts":1644139477.749478,"logger":"static-server","msg":"Component: openshift/image-customization-controller was not built with version info"}
{"level":"error","ts":1644139478.2035537,"logger":"static-server","msg":"problem loading static ignitions","error":"failed to convert nmstate data: exit status 1","errorVerbose":"exit status 1\nfailed to convert nmstate data"}

The original issue remains unchanged. Moving the BZ back to ASSIGNED.

Comment 5 Dmitry Tantsur 2022-02-07 16:30:51 UTC

I will double-check, but that's probably the best I can do given the installer architecture.

Comment 6 Zane Bitter 2022-02-07 17:09:36 UTC

We might have a couple of options in the future (not short term):

* Improve the installer to be able to signal an unrecoverable failure from the bootstrap
* Integrate the Rust version of NMState (doesn't currently support gc command) into the installer so we can validate the data in the install config

Options for this bug:
1. Declare victory so we can backport the patch to 4.10.z (I think it makes debugging a little better at least?); open a new bug or Jira story for future improvements
2. Close as CANTFIX; open a story in Jira for future improvements
3. Leave open for future improvements

Comment 7 Dmitry Tantsur 2022-02-07 17:27:54 UTC

Yoav, I guess it's your call (keeping in mind that none of us will probably have cycles for a significant investment in the installer).

Comment 8 Dmitry Tantsur 2022-02-07 17:29:12 UTC

(I personally think that "could not contact Ironic API" is much better than an inspection timeout in terms of debuggability)

Comment 9 Yoav Porag 2022-02-08 06:26:29 UTC

I agree that "could not contact Ironic API" is better, and I think a Jira story will be better suited for the state of this bug.

However, the current state of the bug, where it takes over an hour to fail on a configuration error, seems really unreasonable to me.
The previous state, where it took around 15 minutes from installation launch to failure, was preferable.
Is there any way to reduce this significantly while preserving (or improving) the output?
@dtantsur 
@zbitter

Comment 10 Dmitry Tantsur 2022-02-08 08:19:31 UTC

> Is there any way to reduce this significantly while preserving (or improving) the output?

Per Zane's comment 6: not without a significant rework of the installer.

> The state it was before where it took around 15 minutes from installation launch to failure was preferable.

Which failure are you comparing with? Network data did not exist before.

Comment 11 Yoav Porag 2022-02-08 08:52:30 UTC

When I originally posted the bug, the failure happened during the Ironic inspection phase, and usually occurred around 15 minutes after the installation started.
Unfortunately, I did not record which version this was, my bad.

DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [6m0s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[0]: Still creating... [6m0s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[2]: Still creating... [6m0s elapsed] 
ERROR                                              
ERROR Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://[fd00:1101:0:1::2]:8084/openshift-master-0-2.initramfs failed, reason: HTTPConnectionPool(host='fd00:1101:0:1::2', port=8084): Max retries exceeded with url: /openshift-master-0-2.initramfs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb5cca24710>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))' 
.....

In its current state, the installer forces the customer to wait over an hour until it fails the installation, which seems unreasonable.

@dtantsur

Comment 12 Zane Bitter 2022-02-08 20:23:17 UTC

It sounds like the terraform provider has no timeout when waiting for ironic to come up (whereas ironic *does* have a timeout waiting for inspection to start, which then reports an error to terraform).

Comment 13 Dmitry Tantsur 2022-02-10 17:32:42 UTC

I think I've found a way to reduce the timeout: https://github.com/openshift/installer/pull/5639. I haven't tested it myself, help welcome.

Comment 16 Yoav Porag 2022-03-01 09:59:07 UTC

[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.11.0-0.nightly-2022-02-27-122819
built from commit 3d19350885d593ee2b1d9ecd7612c2d697dab2a3
release image registry.ci.openshift.org/ocp/release@sha256:e474d91b23fdcfd2de7732442c9c985b0000a6560f7f5db14461f1c92105e8da
release architecture amd64

Added to install config:
        networkConfig:
          routes:
            config:
            - destination: 0.0.0.0/0
              next-hop-address: 192.168.123.1
              next-hop-interface: enp0s4
          dns-resolver:
            config:
              server:
              - 192.168.123.1
          interfaces:
          - name: enp0s4
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: 192.168.123.160
                prefix-length: 24
-------------> Line missing on purpose.

Installation failed on ironic inspection

DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [4m40s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[1]: Still creating... [4m50s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[0]: Still creating... [4m50s elapsed] 
DEBUG ironic_node_v1.openshift-master-host[2]: Still creating... [4m50s elapsed] 
ERROR                                              
ERROR Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Failed to inspect hardware. Reason: unable to start inspection: Validation of image href http://192.168.123.5:8084/openshift-master-0-1.initramfs failed, reason: HTTPConnectionPool(host='192.168.123.5', port=8084): Max retries exceeded with url: /openshift-master-0-1.initramfs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe8fb63438>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))' 
ERROR                                              
ERROR   on ../../tmp/openshift-install-masters-673864748/main.tf line 13, in resource "ironic_node_v1" "openshift-master-host": 
ERROR   13: resource "ironic_node_v1" "openshift-master-host" { 
ERROR                                              
ERROR                                              
ERROR

The bootstrap is not accessible, so logs cannot be extracted from it.

The time to failure is shorter than in the initial state of the bug, but the error is still vague, and bootstrap logs aren't available for analysis, so the problem can't be verified.

As per our discussion in comments #6 to #9, I think that the bug should be closed and a Jira story should be opened for this issue.

Please attach a link to the story here, and then the bug can be closed.
For now, I'm setting it back to ASSIGNED.