Bug 1721343 - Install is failing for 4.2 nightly
Summary: Install is failing for 4.2 nightly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1721897
Depends On:
Blocks:
 
Reported: 2019-06-18 02:00 UTC by Vikas Laad
Modified: 2019-06-20 14:26 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1721650
Environment:
Last Closed: 2019-06-20 14:26:29 UTC
Target Upstream Version:
Embargoed:



Description Vikas Laad 2019-06-18 02:00:45 UTC
Please see 4.2 nightlies at https://openshift-release.svc.ci.openshift.org/

Install is failing with the following error; more details can be found at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/1


level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-8qphyj32-e9160.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.14.0+da2e2c0 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Pulling debug logs from the bootstrap machine"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20190617233221.tar.gz\""
level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"
2019/06/17 23:33:09 Container setup in pod e2e-aws failed, exit code 1, reason Error
Another process exited
2019/06/17 23:33:09 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/06/17 23:44:13 Copied 3.21Mi of artifacts from e2e-aws to /logs/artifacts/e2e-aws
2019/06/17 23:44:19 Ran for 57m12s
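For anyone triaging a bootstrap timeout like this, the log bundle the installer captured (referenced above) usually contains the actual failure. A rough way to dig in, with the caveat that the exact layout inside the bundle varies by installer version:

```
# Unpack the bundle gathered from the bootstrap machine
tar xzf log-bundle-20190617233221.tar.gz

# The bootkube journal on the bootstrap node typically shows which manifest
# or operator is blocking bootstrap completion (path is approximate)
grep -i error log-bundle-*/bootstrap/journals/bootkube.log | tail -n 50
```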

Comment 1 Maciej Szulik 2019-06-18 10:17:11 UTC
From looking at the logs it looks like the release is broken:

F0618 06:05:24.136906       1 start.go:22] error: the config map openshift-config-managed/release-verification has an invalid key "verifier-public-key-redhat" that must be a GPG public key: openpgp: invalid data: tag byte does not have MSB set: openpgp: invalid data: tag byte does not have MSB set
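To confirm this on a live cluster, something like the following should expose the offending key (illustrative commands; they assume cluster-admin access):

```
# Dump the config map referenced in the error above
oc -n openshift-config-managed get configmap release-verification -o yaml

# Check whether the value is actually valid OpenPGP data
oc -n openshift-config-managed get configmap release-verification \
  -o jsonpath='{.data.verifier-public-key-redhat}' | gpg --dry-run --import
```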

Comment 2 Sudha Ponnaganti 2019-06-18 13:20:05 UTC
@luke.meyer - Can you take a look at this?

Comment 3 Eric Paris 2019-06-18 13:31:30 UTC
Wild guess, I think this is likely: https://github.com/openshift/cluster-update-keys/pull/15
not an ART problem....

Comment 5 Vikas Laad 2019-06-18 19:07:28 UTC
Failure after the above PR was merged: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/4

level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-gwygc2fy-e9160.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.14.0+6cdbace up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-gwygc2fy-e9160.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Could not update operatorgroup \"openshift-monitoring/openshift-cluster-monitoring\" (170 of 337): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"
2019/06/18 17:05:58 Container setup in pod e2e-aws failed, exit code 1, reason Error
Another process exited
2019/06/18 17:06:03 Container test in pod e2e-aws failed, exit code 1, reason Error
2019/06/18 17:13:09 Copied 89.16Mi of artifacts from e2e-aws to /logs/artifacts/e2e-aws
2019/06/18 17:13:13 Ran for 1h4m8s

Comment 6 Abhinav Dahiya 2019-06-18 19:17:48 UTC
logs from cluster-version-operator:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/4/artifacts/e2e-aws/pods/openshift-cluster-version_cluster-version-operator-7cf97c974d-p6jj8_cluster-version-operator.log

```
E0618 17:06:19.549601       1 task.go:77] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (170 of 337): failed to get resource type: no matches for kind "OperatorGroup" in version "operators.coreos.com/v1alpha2"
```

The manifest that's failing is `https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/06-operatorgroup.yaml`
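A quick way to check whether the OperatorGroup API is being served at all (illustrative commands; openshift-operator-lifecycle-manager is the standard OLM namespace on 4.x):

```
# Is the OperatorGroup CRD installed, and which API versions does it declare?
oc get crd operatorgroups.operators.coreos.com -o yaml | grep -A 5 versions

# What does the cluster actually serve for the operators.coreos.com group?
oc api-resources --api-group=operators.coreos.com

# Are the OLM pods running at all? (they won't be if OLM is missing from the payload)
oc get pods -n openshift-operator-lifecycle-manager
```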

Comment 7 Eric Paris 2019-06-18 19:35:58 UTC
Please do not report different unrelated problems in the same Bugzilla bug. Next time let's open a new BZ. This bug changed topic at https://bugzilla.redhat.com/show_bug.cgi?id=1721343#c5

That should be seen as the beginning of the useful information in this BZ. (I have cloned the bug to track the original, now-solved problem.)

Comment 8 Pawel Krupa 2019-06-19 08:53:58 UTC
*** Bug 1721897 has been marked as a duplicate of this bug. ***

Comment 10 Stefan Schimanski 2019-06-19 09:25:01 UTC
In the CVO logs there is nothing about OLM; the rollout eventually times out with: failed to get resource type: no matches for kind "OperatorGroup" in version "operators.coreos.com/v1alpha2"

Comment 11 Eric Paris 2019-06-19 11:21:13 UTC
OLM is not building nightly:

--> COPY --from=builder /go/src/github.com/operator-framework/operator-lifecycle-manager/bin/olm /bin/olm
API error (404): lstat /var/lib/docker/overlay2/ef13ef327a4b883af8b9d00b950e1a3b6dc7658707a8dd16a02ea1b76701debf/merged/go/src/github.com/operator-framework/operator-lifecycle-manager/bin: no such file or directory

Comment 12 Evan Cordell 2019-06-19 12:05:25 UTC
We tracked the original issue down and fixed it yesterday; here's the scoop:

- On May 9th we added the helm install to an early stage of our `src` image. Our manifests are templated with helm, and this check was added to prevent us from merging any `/manifests` that were inadvertently out of sync (roughly the kind of check sketched after this list). Helm is not used in the later stages of the Dockerfile/build.
- Yesterday we noticed that OCP builds were failing via automated emails. 
- Despite the OLM OCP builds failing, OCP was still being published, just without OLM at all.
- Other projects that depend on OLM (monitoring) started getting pinged that their components were failing.
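A sketch of that manifest sync check (chart and output paths here are illustrative, not OLM's exact layout):

```
# Re-render the helm chart and ensure the committed manifests match it
# (helm template writes into <output-dir>/<chart-name>/templates/)
helm template deploy/chart --output-dir /tmp/rendered
diff -ru /tmp/rendered/*/templates manifests \
  || { echo "committed manifests are out of sync with the chart"; exit 1; }
```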

We've fixed the first issue here: https://github.com/operator-framework/operator-lifecycle-manager/pull/912, which has already merged.

But the build failed again, with the error Eric mentioned:

--> COPY --from=builder /go/src/github.com/operator-framework/operator-lifecycle-manager/bin/olm /bin/olm
API error (404): lstat /var/lib/docker/overlay2/d0be810d7f78f992ed88c81b042eba85aabcb26b1a965b5a6de7818e356785ac/merged/go/src/github.com/operator-framework/operator-lifecycle-manager/bin: no such file or directory
2019-06-19 01:39:00,616 - atomic_reactor.plugin - ERROR - Build step plugin imagebuilder failed: image build failed (rc=1): API error (404): lstat /var/lib/docker/overlay2/d0be810d7f78f992ed88c81b042eba85aabcb26b1a965b5a6de7818e356785ac/merged/go/src/github.com/operator-framework/operator-lifecycle-manager/bin: no such file or directory

At first glance, it appears that the version of Go in the base image we use (openshift/origin-release:golang-1.10) has changed. We do different things based on the Go version (because we use modules, which behave differently between Go versions).
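For context on the version sensitivity (a conceptual sketch, not our actual build commands; ./cmd/olm is illustrative):

```
# Go 1.10: modules don't exist, go.mod is just an ordinary file, and the
# build resolves dependencies from GOPATH/vendor as before.
go build ./cmd/olm

# Go 1.11: module mode can kick in (outside GOPATH, or with GO111MODULE=on).
# Now go.mod must be present in the build context, and vendored dependencies
# are only used when explicitly requested.
GO111MODULE=on go build -mod=vendor ./cmd/olm
```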

We may have some additional issues to work out in CI (in particular I think the kube codegen tools don't like modules yet, so we may have to disable another one of our CI checks), but a potential fix to this issue is now up here: https://github.com/operator-framework/operator-lifecycle-manager/pull/914


But as followups we should figure out:

- Why it took from May 9th to June 18th to fail a build
- Why we're publishing OCP nightlies that are missing required components
- Why such different things are happening to OLM in CI vs ART

Comment 15 Evan Cordell 2019-06-19 15:06:03 UTC
We also believe we have identified where some of our confusion around the order of events comes from:

- When we added the helm tool for validating manifests on May 9th, ART allowed network access.
- Helm was then in a cached build layer for our subsequent builds
- At some point, external network access was disabled. This didn't affect our helm validation because the layer containing the tool was cached.
- Yesterday, ART swapped the base image from go 1.10 to go 1.11. This broke our build cache, so we no longer had helm, and our docker build attempted to fetch it over the network, which failed (see the sketch after this list).
- Removing helm from our build fixed the network-access issue, but our Dockerfile was still assuming it was building in a go 1.10 context (and therefore lacked our go.mod file).
- Adding in the go.mod should fix our build when that merges.
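A conceptual sketch of the caching effect described above (base image tags, helm version, and URL are illustrative):

```
# Dockerfile fragment built back when the environment still had network access:
#   FROM openshift/origin-release:golang-1.10 AS builder
#   RUN curl -sL https://get.helm.sh/helm-v2.x.y-linux-amd64.tar.gz | tar xz
#
# As long as the FROM line and everything before the RUN stay identical,
# docker reuses the cached layer and never re-runs curl, so the loss of
# network access goes unnoticed. Changing the base image (golang-1.10 ->
# golang-1.11) invalidates the cache, curl runs again, and the build fails
# in the now network-less environment. Building without the cache would
# have surfaced the problem much earlier:
docker build --no-cache .
```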

Comment 16 Evan Cordell 2019-06-19 21:59:14 UTC
All mentioned PRs have now merged. I would expect the next ART build to succeed.

Comment 17 Evan Cordell 2019-06-20 13:15:46 UTC
Vikas, can you verify that the build is no longer failing, and close this out if so?

