Bug 1796632 - GCP image creation times out after building RHCOS with cosa refactor
Summary: GCP image creation times out after building RHCOS with cosa refactor
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ben Howard
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-01-30 19:42 UTC by Micah Abbott
Modified: 2020-05-04 11:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:28:24 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:28:48 UTC

Description Micah Abbott 2020-01-30 19:42:44 UTC
On Jan 27, Jerry alerted us to a problem with the RHCOS metadata pointing to a GCP image that did not exist. This was related to the `coreos-assembler` refactor of `cosalib`:

https://github.com/coreos/coreos-assembler/commit/1dc3cddb393fcab1000ee62908aaadb8d41d8ccd

https://coreos.slack.com/archives/C999USB0D/p1580162777292800


The symptom was that `openshift-install` failed to find the image to use from the metadata provided.

The metadata problem was fixed with https://github.com/coreos/coreos-assembler/pull/1079


However, problems with the GCP image persist. In a recent PR to the installer repo (https://github.com/openshift/installer/pull/3016), the `e2e-gcp` job fails with errors like:

```
time="2020-01-30T15:02:49Z" level=error msg="Error: Error waiting to create Image: Error waiting for Creating Image: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 4m0s)"
time="2020-01-30T15:02:49Z" level=error
time="2020-01-30T15:02:49Z" level=error msg="  on ../tmp/openshift-install-184888010/main.tf line 95, in resource \"google_compute_image\" \"cluster\":"
time="2020-01-30T15:02:49Z" level=error msg="  95: resource \"google_compute_image\" \"cluster\" {"
time="2020-01-30T15:02:49Z" level=error
time="2020-01-30T15:02:49Z" level=error
time="2020-01-30T15:02:49Z" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
```


While I was able to boot the RHCOS image individually (see https://github.com/openshift/installer/pull/3016#issuecomment-580286007), the installer fails to get an RHCOS node up and running during the install of OCP.



The last successful build was 44.81.202001241431.0

The subsequent build 44.81.202001241932.0 used the refactored `cosalib`.  Even after massaging the `rhcos.json` for the installer, the install fails:

```
$ jq .gcp < data/data/rhcos.json 
{
  "image": "rhcos-44-81-202001241932-0-gcp-x86-64",
  "url": "https://storage.googleapis.com/rhcos/rhcos/rhcos-44-81-202001250231-0-gcp-x86-64.tar.gz"
}
```

```
DEBUG module.network.google_compute_router_nat.master_nat[0]: Still creating... [10s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m10s elapsed]
DEBUG module.network.google_compute_router_nat.worker_nat[0]: Creation complete after 16s [id=us-central1/miabbo-sjf9h-router/miabbo-sjf9h-nat-worker]
DEBUG module.network.google_compute_router_nat.master_nat[0]: Still creating... [20s elapsed]
DEBUG module.network.google_compute_router_nat.master_nat[0]: Creation complete after 21s [id=us-central1/miabbo-sjf9h-router/miabbo-sjf9h-nat-master]
DEBUG google_compute_image.cluster: Still creating... [1m20s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m30s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m40s elapsed]
DEBUG google_compute_image.cluster: Still creating... [1m50s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m0s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m10s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m20s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m30s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m40s elapsed]
DEBUG google_compute_image.cluster: Still creating... [2m50s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m0s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m10s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m20s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m30s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m40s elapsed]
DEBUG google_compute_image.cluster: Still creating... [3m50s elapsed]
DEBUG google_compute_image.cluster: Still creating... [4m0s elapsed]
ERROR
ERROR Error: Error waiting to create Image: Error waiting for Creating Image: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 4m0s)
ERROR
ERROR   on ../../../../tmp/openshift-install-946154729/main.tf line 95, in resource "google_compute_image" "cluster":
ERROR   95: resource "google_compute_image" "cluster" {
ERROR
ERROR
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform
```
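When hand-editing the `gcp` stanza of `rhcos.json` as above, it can help to compare the build ID embedded in the image name with the one in the tarball URL. The following is only a minimal sketch: the `gcp_build_ids` helper and the regular expression are hypothetical and not part of the installer or `coreos-assembler`, and the pattern assumes build IDs are written dash-separated as in the stanza shown here. The stanza itself is copied from this report. A difference between the two IDs may well be intentional, so the check only reports it.

```python
import re

# Assumed pattern: dash-separated build IDs such as 44-81-202001241932-0.
BUILD_ID = re.compile(r"\d+-\d+-\d{12}-\d+")


def gcp_build_ids(stanza):
    """Return the build IDs found in the image name and the tarball URL."""
    image = BUILD_ID.search(stanza["image"])
    url = BUILD_ID.search(stanza["url"])
    return (
        image.group(0) if image else None,
        url.group(0) if url else None,
    )


# The gcp stanza exactly as printed by `jq .gcp` in this report.
stanza = {
    "image": "rhcos-44-81-202001241932-0-gcp-x86-64",
    "url": "https://storage.googleapis.com/rhcos/rhcos/"
           "rhcos-44-81-202001250231-0-gcp-x86-64.tar.gz",
}

image_id, url_id = gcp_build_ids(stanza)
status = "match" if image_id == url_id else "differ"
print(f"image build: {image_id}, url build: {url_id} ({status})")
```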

Comment 1 Micah Abbott 2020-01-30 19:49:16 UTC
Build metadata for the two versions:

last known good - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241431.0/x86_64/meta.json

first bad - https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241932.0/x86_64/meta.json


Package diffs between the two versions:

```
$ ./differ.py -fe art -fv 44.81.202001241431.0 -se art -sv 44.81.202001241932.0
{
    "sources": {
        "44.81.202001241431.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241431.0/x86_64/commitmeta.json",
        "44.81.202001241932.0": "https://releases-rhcos-art.cloud.privileged.psi.redhat.com/storage/releases/rhcos-4.4/44.81.202001241932.0/x86_64/commitmeta.json"
    },
    "diff": {
        "machine-config-daemon": {
            "44.81.202001241431.0": "machine-config-daemon-4.4.0-202001241232.git.1.189a2ca.el8.x86_64",
            "44.81.202001241932.0": "machine-config-daemon-4.4.0-202001241616.git.1.3a866d3.el8.x86_64"
        }
    }
}
```
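For readers without access to `differ.py`, the comparison it reports can be approximated as below. This is a sketch, not the actual `differ.py` implementation: `diff_packages` is a hypothetical helper, the inputs are assumed to be plain name-to-NVRA mappings already extracted from each build's metadata, and `example-unchanged-pkg` is an invented entry that only demonstrates that identical packages are dropped from the result.

```python
def diff_packages(first, second):
    """Return {name: (first_nvra, second_nvra)} for packages that differ.

    A package missing from one side is reported with None on that side.
    """
    changed = {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        if a != b:
            changed[name] = (a, b)
    return changed


# machine-config-daemon values are taken from the diff output above;
# example-unchanged-pkg is hypothetical.
last_good = {
    "machine-config-daemon":
        "machine-config-daemon-4.4.0-202001241232.git.1.189a2ca.el8.x86_64",
    "example-unchanged-pkg": "example-unchanged-pkg-1.0-1.el8.x86_64",
}
first_bad = {
    "machine-config-daemon":
        "machine-config-daemon-4.4.0-202001241616.git.1.3a866d3.el8.x86_64",
    "example-unchanged-pkg": "example-unchanged-pkg-1.0-1.el8.x86_64",
}

for name, (old, new) in diff_packages(last_good, first_bad).items():
    print(f"{name}: {old} -> {new}")
```

With only `machine-config-daemon` differing between the two builds, the package set alone gives little reason to suspect the OS content, which is consistent with looking at the build tooling instead.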

Comment 2 Ben Howard 2020-01-30 23:09:29 UTC
The refactor produced tarballs that do not comply with the format GCP requires.
PR pending https://github.com/coreos/coreos-assembler/pull/1088
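The comment does not say which requirement the tarballs violated. As background, the Compute Engine documentation for manually imported images asks for a gzip-compressed tar archive created with `--format=oldgnu` and containing a single file named `disk.raw`. Assuming that is the requirement in question, a rough pre-upload check could look like the sketch below. `check_gcp_tarball` and `make_tarball` are hypothetical helpers, not code from `coreos-assembler` or the linked PR, and the check inspects only the first tar header.

```python
import gzip
import io
import tarfile

# Magic written by GNU tar for both its "gnu" and "oldgnu" formats.
# POSIX ustar/pax archives use b"ustar\x00" followed by b"00" instead.
GNU_STYLE_MAGIC = b"ustar  \x00"


def check_gcp_tarball(fileobj):
    """Return a list of problems found in a gzip-compressed image tarball."""
    problems = []
    with gzip.open(fileobj, "rb") as raw:
        header = raw.read(512)  # first 512-byte tar header block
    name = header[0:100].rstrip(b"\x00").decode("utf-8", "replace")
    magic = header[257:265]
    if name != "disk.raw":
        problems.append(f"first member is {name!r}, expected 'disk.raw'")
    if magic != GNU_STYLE_MAGIC:
        problems.append(f"header magic {magic!r} is not GNU-style")
    return problems


def make_tarball(fmt, member="disk.raw"):
    """Build a small in-memory .tar.gz in the given tarfile format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", format=fmt) as tar:
        data = b"\x00" * 1024  # stand-in for a real disk image
        info = tarfile.TarInfo(member)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


print("GNU format:  ", check_gcp_tarball(make_tarball(tarfile.GNU_FORMAT)))
print("ustar format:", check_gcp_tarball(make_tarball(tarfile.USTAR_FORMAT)))
```

Note that the header magic cannot distinguish tar's `gnu` format from `oldgnu`, so this catches only ustar or pax archives and wrongly named members. It is relevant when building archives with Python because `tarfile` defaults to pax format on Python 3.8 and later, so `format=tarfile.GNU_FORMAT` has to be requested explicitly.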

Comment 4 Micah Abbott 2020-01-31 15:36:47 UTC
The linked PR hasn't been merged yet, so we technically can't put this into MODIFIED.  Reverting to POST.

Once the PR is merged and we have a successful RHCOS build, we can move to MODIFIED.

Comment 6 Micah Abbott 2020-03-05 20:54:04 UTC
Linked PR is merged and new RHCOS builds have landed since then with no reports of additional problems.  Moving to MODIFIED.

Comment 9 Michael Nguyen 2020-03-06 20:46:36 UTC
Verified on 4.4.0-0.nightly-2020-03-06-141620, which is running RHCOS 44.81.202003052104-0. I installed an OCP 4.4 cluster on GCP with no issues.

Comment 11 errata-xmlrpc 2020-05-04 11:28:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

