On Jan 27, Jerry alerted us to a problem with the RHCOS metadata pointing to a GCP image that did not exist. This was related to the `coreos-assembler` refactor of `cosalib`:

https://github.com/coreos/coreos-assembler/commit/1dc3cddb393fcab1000ee62908aaadb8d41d8ccd
https://coreos.slack.com/archives/C999USB0D/p1580162777292800

The symptom was that `openshift-install` failed to find the image to use from the metadata provided. The metadata problem was fixed with https://github.com/coreos/coreos-assembler/pull/1079

However, problems with the GCP image persist. In a recent PR to the installer repo (https://github.com/openshift/installer/pull/3016), the `e2e-gcp` job is observed to fail with errors like:

```
time="2020-01-30T15:02:49Z" level=error msg="Error: Error waiting to create Image: Error waiting for Creating Image: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 4m0s)"
time="2020-01-30T15:02:49Z" level=error
time="2020-01-30T15:02:49Z" level=error msg=" on ../tmp/openshift-install-184888010/main.tf line 95, in resource \"google_compute_image\" \"cluster\":"
time="2020-01-30T15:02:49Z" level=error msg=" 95: resource \"google_compute_image\" \"cluster\" {"
time="2020-01-30T15:02:49Z" level=error
time="2020-01-30T15:02:49Z" level=error
time="2020-01-30T15:02:49Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
```

While I was able to boot the RHCOS image individually (see https://github.com/openshift/installer/pull/3016#issuecomment-580286007), the installer fails to get an RHCOS node up and running during the install of OCP.

The last successful build was 44.81.202001241431.0. The subsequent build, 44.81.202001241932.0, used the refactored `cosalib`.
Even after massaging the `rhcos.json` for the installer, the install fails:

```
$ jq .gcp < data/data/rhcos.json
{
  "image": "rhcos-44-81-202001241932-0-gcp-x86-64",
  "url": "https://storage.googleapis.com/rhcos/rhcos/rhcos-44-81-202001250231-0-gcp-x86-64.tar.gz"
}
```

```
DEBUG module.network.google_compute_router_nat.master_nat[0]: Still creating... [10s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m10s elapsed]
DEBUG module.network.google_compute_router_nat.worker_nat[0]: Creation complete after 16s [id=us-central1/miabbo-sjf9h-router/miabbo-sjf9h-nat-worker]
DEBUG module.network.google_compute_router_nat.master_nat[0]: Still creating... [20s elapsed]
DEBUG module.network.google_compute_router_nat.master_nat[0]: Creation complete after 21s [id=us-central1/miabbo-sjf9h-router/miabbo-sjf9h-nat-master]
DEBUG google_compute_image.cluster: Still creating... [1m20s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m30s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m40s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m50s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m0s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m10s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m20s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m30s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m40s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m50s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m0s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m10s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m20s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m30s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m40s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m50s elapsed]
DEBUG google_compute_image.cluster: Still creating... [4m0s elapsed]
ERROR
ERROR Error: Error waiting to create Image: Error waiting for Creating Image: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 4m0s)
ERROR
ERROR on ../../../../tmp/openshift-install-946154729/main.tf line 95, in resource "google_compute_image" "cluster":
ERROR 95: resource "google_compute_image" "cluster" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform
```
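When hand-editing `rhcos.json` it is easy to leave the `image` and `url` fields pointing at different builds, so a quick consistency check may be worth running before an install. The following is only a sketch: the helper name `gcp_build_ids` and the regular expression are my own, and they assume the artifact naming pattern seen in the `jq` output above. Whether a mismatch between the two fields contributes to the timeout is not established in this bug.

```python
import re

# Assumed naming pattern: rhcos-<major>-<minor>-<timestamp>-<rev>-gcp-...
_BUILD_ID = re.compile(r"rhcos-(\d+(?:-\d+){3})-gcp")


def gcp_build_ids(gcp_stanza):
    """Return (image_build_id, url_build_id) from the `gcp` stanza of rhcos.json.

    Hypothetical helper; an entry is None when no build ID can be found.
    """
    ids = []
    for key in ("image", "url"):
        match = _BUILD_ID.search(gcp_stanza.get(key, ""))
        ids.append(match.group(1) if match else None)
    return tuple(ids)


# Values copied from the jq output in this bug.
stanza = {
    "image": "rhcos-44-81-202001241932-0-gcp-x86-64",
    "url": "https://storage.googleapis.com/rhcos/rhcos/rhcos-44-81-202001250231-0-gcp-x86-64.tar.gz",
}

image_id, url_id = gcp_build_ids(stanza)
if image_id != url_id:
    print(f"build ID mismatch: image={image_id} url={url_id}")
```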
Build metadata for the two versions:

last known good - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241431.0/x86_64/meta.json
first bad - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241932.0/x86_64/meta.json

Package diffs between the two versions:

```
$ ./differ.py -fe art -fv 44.81.202001241431.0 -se art -sv 44.81.202001241932.0
{
  "sources": {
    "44.81.202001241431.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241431.0/x86_64/commitmeta.json",
    "44.81.202001241932.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241932.0/x86_64/commitmeta.json"
  },
  "diff": {
    "machine-config-daemon": {
      "44.81.202001241431.0": "machine-config-daemon-4.4.0-202001241232.git.1.189a2ca.el8.x86_64",
      "44.81.202001241932.0": "machine-config-daemon-4.4.0-202001241616.git.1.3a866d3.el8.x86_64"
    }
  }
}
```
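For anyone without `differ.py` at hand, the comparison itself is simple. The sketch below is not the tool's actual implementation. It assumes each build's package list has already been reduced to a `{name: NVRA}` mapping (how that is extracted from `commitmeta.json` is left out), and the function name `diff_packages` is made up for illustration. The output mirrors the shape of the `diff` stanza above.

```python
def diff_packages(first, second, first_label, second_label):
    """Return {name: {label: nvra}} for every package that differs.

    `first` and `second` map package name -> NVRA string. A package that is
    present on only one side is reported with None for the other side.
    """
    changed = {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        if a != b:
            changed[name] = {first_label: a, second_label: b}
    return changed


# machine-config-daemon values are from this bug; "unchanged-pkg" is a
# placeholder added only to show that identical entries are skipped.
good = {
    "machine-config-daemon": "machine-config-daemon-4.4.0-202001241232.git.1.189a2ca.el8.x86_64",
    "unchanged-pkg": "unchanged-pkg-1.0-1.el8.x86_64",
}
bad = {
    "machine-config-daemon": "machine-config-daemon-4.4.0-202001241616.git.1.3a866d3.el8.x86_64",
    "unchanged-pkg": "unchanged-pkg-1.0-1.el8.x86_64",
}

result = diff_packages(good, bad, "44.81.202001241431.0", "44.81.202001241932.0")
print(sorted(result))
```

A diff this small (only `machine-config-daemon` changed) is what points away from the OS content and toward the build tooling.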
The refactor resulted in tarballs that do not comply with the format GCP requires. A fix is pending in https://github.com/coreos/coreos-assembler/pull/1088
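This bug does not spell out which requirement the tarballs violated. As I understand Google's image import documentation, the archive has to be gzip-compressed, created with `tar --format=oldgnu`, and contain the disk as a single file named `disk.raw`. A rough pre-upload check along those lines might look like the sketch below. The function name `check_gcp_tarball` is hypothetical, and the magic-byte test can only tell GNU-style headers from POSIX ustar/pax ones, not `oldgnu` from `gnu`.

```python
import gzip
import tarfile

# Bytes 257..264 of a tar header hold magic + version. GNU tar's gnu/oldgnu
# formats write "ustar" followed by two spaces and a NUL.
GNU_MAGIC = b"ustar  \x00"


def check_gcp_tarball(path):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    try:
        with gzip.open(path, "rb") as fh:
            header = fh.read(512)  # first tar header block
    except gzip.BadGzipFile:
        return ["not gzip-compressed"]
    if header[257:265] != GNU_MAGIC:
        problems.append("first header is not GNU tar format")
    with tarfile.open(path, "r:gz") as tar:
        names = tar.getnames()
    if names != ["disk.raw"]:
        problems.append(f"expected a single member 'disk.raw', found {names}")
    return problems
```

Calling `check_gcp_tarball("rhcos-gcp.tar.gz")` (file name illustrative) on a freshly built artifact would return an empty list if these assumed requirements are met.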
The linked PR hasn't been merged yet, so we technically can't put this into MODIFIED. Reverting to POST. Once the PR is merged and we have a successful RHCOS build, we can move to MODIFIED.
Linked PR is merged and new RHCOS builds have landed since then with no reports of additional problems. Moving to MODIFIED.
Verified on 4.4.0-0.nightly-2020-03-06-141620, which is running RHCOS 44.81.202003052104-0. I installed an OCP 4.4 cluster on GCP with no issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581