Bug 1994122 - openshift-install fails on AWS ARM64 nodes
Summary: openshift-install fails on AWS ARM64 nodes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: aos-install
QA Contact: aleskandro
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-16 19:30 UTC by Jed Lejosne
Modified: 2021-08-19 13:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-19 13:23:15 UTC
Target Upstream Version:
Embargoed:


Attachments
.openshift_install.log (171.51 KB, text/plain), 2021-08-16 19:30 UTC, Jed Lejosne

Description Jed Lejosne 2021-08-16 19:30:47 UTC
Created attachment 1814560 [details]
.openshift_install.log

Version:
$ openshift-install version
openshift-install 4.9.0-0.nightly-2021-08-16-082143
built from commit c7d810f497d0c6c3ad22e5c14f873a70b0586231
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265

Platform: aws ARM64

What happened?
openshift-install failed when trying to install on AWS ARM64 nodes.
The AWS console shows all 3 master nodes running but no workers.
Looking at the debug output, the master nodes got created but were unreachable on port 6443.

What did you expect to happen?
Install to create all nodes and finish successfully

How to reproduce it (as minimally and precisely as possible)?
$ mkdir ocp
$ cat > ocp/install-config.yaml <<EOF
apiVersion: v1
baseDomain: devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      type: m6g.xlarge
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      type: m6g.xlarge
  replicas: 3
metadata:
  creationTimestamp: null
  name: jed
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-1
publish: External
pullSecret: <REDACTED>
EOF
$ openshift-install --log-level debug --dir ocp create cluster

Comment 1 aleskandro 2021-08-17 08:00:08 UTC
Reproducing the error with the same installer version gives the following in the release-image systemd service log:

Pulling quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265...
Aug 17 05:37:26 ip-10-0-10-205 release-image-download.sh[1533]: 90e82a591baa01bf736d167f9e18f39c0148aa736f8e618ddb7be478131de674
Aug 17 05:37:27 ip-10-0-10-205 release-image-download.sh[1533]: ERROR: release image arch amd64 does not match host arch arm64
Aug 17 05:37:27 ip-10-0-10-205 systemd[1]: release-image.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 05:37:27 ip-10-0-10-205 systemd[1]: release-image.service: Failed with result 'exit-code'.
Aug 17 05:37:27 ip-10-0-10-205 systemd[1]: Failed to start Download the OpenShift Release Image.


So the image you are deploying, quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265, is for amd64 platforms.
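
To find this failure in a saved copy of the bootstrap journal, a small filter can pull out the two architectures. This is only a sketch: the function name arch_mismatch is hypothetical, and the pattern assumes the exact error wording shown in the log above.

```shell
# Hypothetical helper: read log text on stdin and print
# "<image-arch> <host-arch>" for each release-image mismatch line.
arch_mismatch() {
  sed -n 's/.*release image arch \([^ ]*\) does not match host arch \([^ ]*\).*/\1 \2/p'
}
```

On the bootstrap node it could be fed from the unit's journal, for example:

$ journalctl -b -u release-image.service | arch_mismatch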


$ podman image inspect quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265 | grep Architecture
        "Architecture": "amd64",


To deploy on ARM64 with an installer that is not linked to an arm64 payload, set OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE to the arm64 release image you want to install.

As an example:

$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release-nightly:4.9.0-0.nightly-arm64-2021-08-16-154214
$ podman image inspect quay.io/openshift-release-dev/ocp-release-nightly:4.9.0-0.nightly-arm64-2021-08-16-154214 | grep Architecture
        "Architecture": "arm64",
$ ./openshift-install create cluster --dir ocp

Reproducing your steps and environment with the arm64 image set, the installation works fine.
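
The mismatch can also be caught on the workstation before the installer runs. The following is a minimal sketch, not part of openshift-install: the function names normalize_arch, control_plane_arch and check_arch are hypothetical, and control_plane_arch assumes the install-config.yaml layout shown in the description (architecture indented directly under controlPlane:).

```shell
# Hypothetical pre-flight helpers; not part of openshift-install.

# Map common architecture spellings to the names used in install-config.yaml.
normalize_arch() {
  case "$1" in
    aarch64|arm64) echo arm64 ;;
    x86_64|amd64)  echo amd64 ;;
    *)             echo "$1" ;;
  esac
}

# Print the controlPlane architecture from the given install-config.yaml.
# Assumes the indented layout shown in this report.
control_plane_arch() {
  awk '/^controlPlane:/ {in_cp = 1; next}
       /^[^ ]/          {in_cp = 0}
       in_cp && $1 == "architecture:" {print $2; exit}' "$1"
}

# Return 0 if the two architectures agree, otherwise print an error and return 1.
# $1: architecture from install-config.yaml, $2: architecture of the release image.
check_arch() {
  want=$(normalize_arch "$1")
  have=$(normalize_arch "$2")
  if [ "$want" != "$have" ]; then
    echo "release image arch $have does not match install-config arch $want" >&2
    return 1
  fi
}
```

With the release image already pulled locally, the check could be run before create cluster (the installer consumes install-config.yaml), for example:

$ check_arch "$(control_plane_arch ocp/install-config.yaml)" "$(podman image inspect --format '{{.Architecture}}' "$OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE")"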

Finally, in order to install on ARM64, one can use installer binaries from https://mirror.openshift.com/pub/openshift-v4/aarch64/clients/ocp-dev-preview/

If you're on amd64 and want to install for arm64, you can download one of the openshift-install-linux-amd64-4.9.0-0.nightly-arm64-.*.tar.gz archives.

Your installer will then be (1) built for the amd64 platform and (2) linked to arm64 images by default:

$ ./openshift-install version
./openshift-install 4.9.0-0.nightly-arm64-2021-08-16-154214
built from commit c7d810f497d0c6c3ad22e5c14f873a70b0586231
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:f5507d0e00a653c4e4a1333ca8649d52609d85c47270ace3dbc824f6a4d6de1b


$ podman image inspect quay.io/openshift-release-dev/ocp-release-nightly@sha256:f5507d0e00a653c4e4a1333ca8649d52609d85c47270ace3dbc824f6a4d6de1b | grep Architecture
        "Architecture": "arm64"


However:

1. openshift-install could also validate, before the deployment starts, that the target architecture is compatible with the images being used (this should be tracked in a separate issue).
2. We could provide a single installer that picks the correct image based on the controlPlane.architecture and compute.architecture fields in install-config.yaml. This might require changes to the current CI architecture and to the registry artifacts, image streams and imagestreamtags in use.

@psundararaman I found some of your code for the installer on GitHub and tasks related to libvirt in Jira. Do you have any information about point 2?

Comment 2 Jed Lejosne 2021-08-17 12:53:37 UTC
Ah, user error! Thanks a lot for the detailed explanation.
I agree with point 1, getting a meaningful error would be nice. Point 2 would be amazing!

Comment 3 Prashanth Sundararaman 2021-08-17 14:24:42 UTC
Yes, the openshift-install binary with the ARM payload can be downloaded here: https://console.redhat.com/openshift/install/aws/arm

The payload does not vary dynamically based on the architecture specified in install-config.yaml. Instead, each openshift-install build is linked to a specific payload, and builds are provided for all host architectures. So you can use the x86 openshift-install binary from the above link and it will give you an ARM payload. This is useful if you want to provision an ARM AWS cluster from an x86 laptop.

On point 1: openshift-install cannot validate that the architecture specified in install-config.yaml matches the architecture of the payload without downloading and inspecting the payload. That comparison already happens during the bootstrap phase, and it produces the error you see in the systemd unit, which clearly indicates the problem. This route was investigated and not pursued because people deploying OCP typically do not build their own installer; they download it from the tile page.

