1962592 – Worker nodes restarting during OS installation

Summary: Worker nodes restarting during OS installation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Derek Higgins
QA Contact:	Ori Michaeli
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1972291

Reported:	2021-05-20 10:39 UTC by Derek Higgins
Modified:	2024-10-01 18:16 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When bare metal IPI was deployed with a proxy configured, an internal machine-os image download was directed through the proxy. Consequence: The image could not be downloaded correctly, and the resulting file was corrupted. Fix: Internal image traffic is now added to no_proxy. Result: The image download no longer uses a proxy.
Clone Of:
Environment:
Last Closed:	2021-07-27 23:09:30 UTC
Target Upstream Version:
Embargoed:

Attachments
IPA log (461.78 KB, text/plain) 2021-05-20 10:42 UTC, Derek Higgins	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-baremetal-operator pull 147	0	None	open	Bug 1962592: Use a cache URL with the .svc.cluster.local suffix	2021-05-25 21:35:36 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:09:55 UTC

Description Derek Higgins 2021-05-20 10:39:56 UTC

In some deployments we're seeing worker nodes fail to get provisioned and restart. The most relevant error in the ironic logs appears right after the image is written to disk:

2021-05-17 06:09:53.078 1 ERROR ironic.conductor.utils [-] Agent returned error for deploy step {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'} on node 33c5edf0-569e-4895-9105-137bb676b177 : Error performing deploy_step write_image: Command execution failed: Unable to find a valid partition table on the disk after writing the image. Error Unexpected error while running command.
Command: parted -s -m /dev/sda unit MiB print
Exit code: 1
Stdout: 'BYT;\n/dev/sda:915715MiB:scsi:512:4096:unknown:ATA INTEL SSDSC2KB96:;\n'
Stderr: 'Error: /dev/sda: unrecognised disk label\n'.

Comment 1 Derek Higgins 2021-05-20 10:42:17 UTC

Created attachment 1785113 [details]
IPA log

Comment 2 Derek Higgins 2021-05-21 08:18:06 UTC

It looks like this is caused by the image-cache container trying to download the image through a proxy.

Contents of rhcos-48.84.202104271417-0-openstack.x86_64.qcow2 (a proxy error message instead of a qcow image):
<p>The following error was encountered while trying to retrieve the URL: <a href="http://metal3-state.openshift-machine-api:6180/images/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2">http://metal3-state.openshift-machine-api:6180/images/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2</a></p>
<pre>Name Error: The domain name does not exist.</pre>
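A quick way to spot this kind of corruption is to look at the first bytes of the cached file. The following is a minimal sketch, not part of the actual fix: it relies on the qcow2 format starting with the 4-byte magic `QFI\xfb`, and the function names and the HTML heuristic are hypothetical helpers introduced here for illustration.

```python
# Sketch: tell a real qcow2 image apart from a saved proxy error page.
# Assumption: qcow2 files begin with the magic bytes b"QFI\xfb".
# The function names and the HTML heuristic are illustrative only.

QCOW2_MAGIC = b"QFI\xfb"


def looks_like_qcow2(data: bytes) -> bool:
    """Return True if the data starts with the qcow2 magic bytes."""
    return data[:4] == QCOW2_MAGIC


def looks_like_html(data: bytes) -> bool:
    """Rough heuristic: the payload starts with an HTML-style tag."""
    return data[:512].lstrip().startswith(b"<")


def check_image_file(path: str) -> str:
    """Classify a cached image file by reading only its first bytes."""
    with open(path, "rb") as f:
        head = f.read(512)
    if looks_like_qcow2(head):
        return "qcow2"
    if looks_like_html(head):
        return "html (possibly a proxy error page)"
    return "unknown"
```

Running `check_image_file` against the cached file in the image-cache pod would flag a file like the one quoted above before it is written to a node's disk.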

From the squid logs:
1621432495.708     29 fd00:1101::6ef0:c42d:33f4:c2f TCP_MISS/503 4587 GET http://metal3-state.openshift-machine-api:6180/images/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2/rhcos-47.83.202103251640-0-openstack.x86_64.qcow2 - HIER_NONE/- text/html

And the environment of the metal3-machine-os-downloader container in the image-cache pod:
    env:                                                      
    - name: RHCOS_IMAGE_URL                       
      value: http://metal3-state.openshift-machine-api:6180/images/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2/rhcos-48.84.202104271417-0-openstack.x86_64.qcow2
    - name: HTTP_PROXY                        
      value: http://[fd00:1101::1]:3128
    - name: HTTPS_PROXY               
      value: http://[fd00:1101::1]:3128         
    - name: NO_PROXY  
      value: .cluster.local,.svc,127.0.0.1,9999,api-int.ostest.test.metalkube.org,fd00:1101::/64,fd01::/48,fd02::/112,fd2e:6f44:5dd8:c956::/120,localhost
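The NO_PROXY value above contains `.svc` and `.cluster.local`, but the short host name `metal3-state.openshift-machine-api` ends in neither, so the request still goes to the proxy. The sketch below models this with a simplified suffix match. It is an assumption about how typical HTTP clients treat host name entries, not the exact logic of the downloader; the function name is hypothetical, and CIDR entries are deliberately skipped because only host names matter here.

```python
# Simplified model of NO_PROXY host name matching (illustrative only;
# real clients differ in details such as CIDR and port handling).

NO_PROXY = (
    ".cluster.local,.svc,127.0.0.1,9999,api-int.ostest.test.metalkube.org,"
    "fd00:1101::/64,fd01::/48,fd02::/112,fd2e:6f44:5dd8:c956::/120,localhost"
)


def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if host matches a host name entry in no_proxy."""
    host = host.lower().rstrip(".")
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry or "/" in entry:
            # Skip empty entries and CIDR ranges; this sketch only
            # covers host names.
            continue
        suffix = entry.lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return True
    return False


# Short service name from RHCOS_IMAGE_URL: no entry matches.
print(bypasses_proxy("metal3-state.openshift-machine-api", NO_PROXY))
# Fully qualified service name: matches ".cluster.local".
print(bypasses_proxy(
    "metal3-state.openshift-machine-api.svc.cluster.local", NO_PROXY))
```

Under this model the short name is proxied while the name with the `.svc.cluster.local` suffix is not, which is consistent with the URL change made in the linked cluster-baremetal-operator pull request.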

Comment 3 Derek Higgins 2021-05-21 08:20:28 UTC

To work around this, add "metal3-state.openshift-machine-api" to the noProxy setting in install-config.yaml.

Comment 5 Ori Michaeli 2021-06-03 14:09:05 UTC

Verified with 4.8.0-0.nightly-2021-06-02-025513

[kni@provisionhost-0-0 ~]$ oc get pod/metal3-image-cache-c4pft -o yaml | grep qcow2
      value: http://metal3-state.openshift-machine-api.svc.cluster.local:6180/images/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2/rhcos-48.84.202105190318-0-openstack.x86_64.qcow2

Comment 8 errata-xmlrpc 2021-07-27 23:09:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
