Bug 1994122

Summary: openshift-install fails on AWS ARM64 nodes
Product: OpenShift Container Platform
Reporter: Jed Lejosne <jlejosne>
Component: Installer
Assignee: aos-install
Installer sub component: openshift-installer
QA Contact: aleskandro <adistefa>
Status: CLOSED NOTABUG
Docs Contact:
Severity: low
Priority: low
CC: jhixson, psundara, yunjiang
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-19 13:23:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Description Flags
.openshift_install.log none

Description Jed Lejosne 2021-08-16 19:30:47 UTC
Created attachment 1814560 [details]

$ openshift-install version
openshift-install 4.9.0-0.nightly-2021-08-16-082143
built from commit c7d810f497d0c6c3ad22e5c14f873a70b0586231
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265

Platform: aws ARM64

What happened?
openshift-install failed when trying to install on AWS ARM64 nodes.
The AWS interface shows all 3 master nodes running but no worker.
Looking at the debug output, the master nodes got created but were unreachable on port 6443.

What did you expect to happen?
Install to create all nodes and finish successfully

How to reproduce it (as minimally and precisely as possible)?
$ mkdir ocp
$ cat > ocp/install-config.yaml <<EOF
apiVersion: v1
baseDomain: devcluster.openshift.com
compute:
- architecture: arm64
  hyperthreading: Enabled
  name: worker
  platform:
    aws:
      type: m6g.xlarge
  replicas: 3
controlPlane:
  architecture: arm64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      type: m6g.xlarge
  replicas: 3
metadata:
  creationTimestamp: null
  name: jed
networking:
  clusterNetwork:
  - cidr:
    hostPrefix: 23
  machineNetwork:
  - cidr:
  networkType: OpenShiftSDN
platform:
  aws:
    region: us-east-1
publish: External
pullSecret: <REDACTED>
EOF
$ openshift-install --log-level debug --dir ocp create cluster

Comment 1 aleskandro 2021-08-17 08:00:08 UTC
Reproducing the error with the same installer version gives the following in the release-image systemd service log:

Pulling quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265...
Aug 17 05:37:26 ip-10-0-10-205 release-image-download.sh[1533]: 90e82a591baa01bf736d167f9e18f39c0148aa736f8e618ddb7be478131de674
Aug 17 05:37:27 ip-10-0-10-205 release-image-download.sh[1533]: ERROR: release image arch amd64 does not match host arch arm64
Aug 17 05:37:27 ip-10-0-10-205 systemd[1]: release-image.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 05:37:27 ip-10-0-10-205 systemd[1]: release-image.service: Failed with result 'exit-code'.
Aug 17 05:37:27 ip-10-0-10-205 systemd[1]: Failed to start Download the OpenShift Release Image.

So the image you are deploying, quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265, is for amd64 platforms.

$ podman image inspect quay.io/openshift-release-dev/ocp-release-nightly@sha256:89aa37cfa85591440b3b099fed4cde52329308425cc5803c864eefbf7ce9e265 | grep Architecture
        "Architecture": "amd64",

To deploy on ARM64 with an installer that is not pinned to an arm64 payload, you can set OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE to the arm64 release image you want to install.

As an example:

$ export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release-nightly:4.9.0-0.nightly-arm64-2021-08-16-154214
$ podman image inspect quay.io/openshift-release-dev/ocp-release-nightly:4.9.0-0.nightly-arm64-2021-08-16-154214 | grep Architecture
        "Architecture": "arm64",
$ ./openshift-install create cluster --dir ocp

After reproducing your steps and environment, but with the arm64 image set, everything works fine.
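
The mismatch that release-image-download.sh reports at bootstrap can also be checked on the workstation before the installer runs. The following is only a minimal sketch and not part of openshift-install: normalize_arch, check_release_arch and image_arch_of are hypothetical helper names, and image_arch_of assumes the release image has already been pulled so that podman image inspect can read it.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of openshift-install.

# Map common architecture spellings onto the names used by container images.
normalize_arch() {
    case "$1" in
        aarch64|arm64) echo arm64 ;;
        x86_64|amd64)  echo amd64 ;;
        *)             echo "$1" ;;
    esac
}

# Succeed only if the image architecture matches the target architecture.
# $1 = architecture reported for the release image
# $2 = architecture of the nodes to be installed
check_release_arch() {
    image_arch=$(normalize_arch "$1")
    target_arch=$(normalize_arch "$2")
    if [ "$image_arch" != "$target_arch" ]; then
        echo "ERROR: release image arch $image_arch does not match target arch $target_arch" >&2
        return 1
    fi
    echo "OK: release image arch $image_arch matches target arch $target_arch"
}

# Print the architecture of a locally available image.
# Assumption: the image was pulled beforehand.
image_arch_of() {
    podman image inspect --format '{{.Architecture}}' "$1"
}
```

With the override exported as in the example above, a check before creating the cluster might look like this:

$ check_release_arch "$(image_arch_of "$OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE")" arm64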

Finally, in order to install on ARM64, one can use installer binaries from https://mirror.openshift.com/pub/openshift-v4/aarch64/clients/ocp-dev-preview/

If you're on amd64 and want to install for arm64, you can download one of the openshift-install-linux-amd64-4.9.0-0.nightly-arm64-.*.tar.gz tarballs.

Then, your installer will be (1) built for amd64 platform and (2) linked to arm64 images by default:

$ ./openshift-install version
./openshift-install 4.9.0-0.nightly-arm64-2021-08-16-154214
built from commit c7d810f497d0c6c3ad22e5c14f873a70b0586231
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:f5507d0e00a653c4e4a1333ca8649d52609d85c47270ace3dbc824f6a4d6de1b
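
To check which payload a given installer binary is pinned to without copying the digest by hand, the "release image" line can be pulled out of the version output. This is a small sketch only; release_image_from_version is a hypothetical helper name, and it assumes the output keeps the three-line layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read the text printed by "openshift-install version"
# on standard input and print only the release image pullspec.
release_image_from_version() {
    # The pullspec is the third field of the line starting with "release image".
    awk '$1 == "release" && $2 == "image" { print $3; exit }'
}
```

For example, the pullspec could be captured in a variable and then inspected:

$ image=$(./openshift-install version | release_image_from_version)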

$ podman image inspect quay.io/openshift-release-dev/ocp-release-nightly@sha256:f5507d0e00a653c4e4a1333ca8649d52609d85c47270ace3dbc824f6a4d6de1b | grep Architecture
        "Architecture": "arm64"


1. openshift-install could also validate, before the deployment, that the target platform is compatible with the images being used (this has to be tracked in another issue).
2. We could provide a single installer that picks the correct image based on the controlPlane.architecture and compute.architecture fields in the install-config.yaml. This might need changes to the current CI architecture and to the registry artifacts, image streams and imagestreamtags that are being used.
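
Until something like point 1 exists in the installer, the comparison can be approximated outside of it. The sketch below is an external wrapper idea, not installer behaviour: config_architectures and config_matches_payload are hypothetical names, the scan is a plain-text match on "architecture:" lines rather than a real YAML parser, and the payload architecture has to be supplied by the caller (for instance from podman image inspect).

```shell
#!/bin/sh
# Hypothetical helpers; a text scan, not a YAML parser.

# Print the distinct "architecture:" values found in an install-config.yaml.
# $1 = path to install-config.yaml
config_architectures() {
    sed -n 's/^[[:space:]-]*architecture:[[:space:]]*\([A-Za-z0-9_]*\).*$/\1/p' "$1" | sort -u
}

# Succeed only if every architecture in the config equals the payload architecture.
# $1 = path to install-config.yaml
# $2 = payload architecture, e.g. arm64
config_matches_payload() {
    for arch in $(config_architectures "$1"); do
        if [ "$arch" != "$2" ]; then
            echo "ERROR: install-config requests $arch but the payload is $2" >&2
            return 1
        fi
    done
}
```

A call such as the following would fail for the configuration in this report if the payload were amd64:

$ config_matches_payload ocp/install-config.yaml amd64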

@psundararaman I found some of your code for the installer on GitHub and tasks related to libvirt on Jira. Do you have any information about point 2?

Comment 2 Jed Lejosne 2021-08-17 12:53:37 UTC
Ah, user error! Thanks a lot for the detailed explanation.
I agree with point 1, getting a meaningful error would be nice. Point 2 would be amazing!

Comment 3 Prashanth Sundararaman 2021-08-17 14:24:42 UTC
Yes, the openshift-installer with ARM payload can be downloaded here: https://console.redhat.com/openshift/install/aws/arm

The payload does not dynamically vary based on the architecture specified in the install-config.yaml. Instead, openshift-install is built for specific payloads, for all arches. So you can use the x86 openshift-installer binary from the above link and it will give you an ARM payload. This is useful if you want to provision an ARM AWS cluster from an x86 laptop.

To point 1: the openshift-installer cannot validate that the arch specified in the install-config matches the arch of the payload. For that to happen, the payload would need to be downloaded and inspected, and the arch then compared. That is what happens at the bootstrap phase, which produces the error you see in the systemd unit and clearly indicates the problem. This route was already investigated and was not pursued, because people deploying OCP typically do not build their own installer; they download it from the tile page.