
Bug 1886131

Summary: [upgrade] 4.4 -> 4.5 GCP upgrades failing
Product: OpenShift Container Platform
Reporter: slowrie
Component: Test Infrastructure
Assignee: W. Trevor King <wking>
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: low
Version: 4.5
CC: skuznets, wking
Keywords: Upgrades
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Environment: operator install service-catalog-apiserver
Last Closed: 2020-11-02 16:31:04 UTC
Type: Bug

Description slowrie 2020-10-07 17:34:37 UTC
test:
operator install service-catalog-apiserver 

test:
operator install service-catalog-controller-manager

are failing frequently in CI; see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=operator+install+service-catalog-apiserver

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424

All 4.4 -> 4.5 GCP upgrade jobs are failing; these two tests appear to share a common cause, since both are failing at very high rates.

Test failure line:
Operator unavailable (null): operator is not reporting conditions
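
(One way to see what these operators actually reported during the run, assuming the gather-extra artifacts also include a clusteroperators.json:)

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/artifacts/e2e-gcp-upgrade/gather-extra/clusteroperators.json | jq '.items[] | select(.metadata.name | test("service-catalog")) | {name: .metadata.name, conditions: .status.conditions}'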

Other maybe relevant failure lines:

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ci-op-z3ln5zkl/release@sha256:c7626733f02057d8b5088417a20eacfff0c185051683be213b5061a8883a60a5
Unexpected error:
    <*errors.errorString | 0xc000b3bdf0>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator etcd is reporting a failure: StaticPodsDegraded: nodes/ci-op-z3ln5zkl-6bd15-zsxgs-master-1 pods/etcd-ci-op-z3ln5zkl-6bd15-zsxgs-master-1 container=\"etcd\" is not ready\nStaticPodsDegraded: nodes/ci-op-z3ln5zkl-6bd15-zsxgs-master-1 pods/etcd-ci-op-z3ln5zkl-6bd15-zsxgs-master-1 container=\"etcd\" is waiting: \"CrashLoopBackOff\" - \"back-off 5m0s restarting failed container=etcd pod=etcd-ci-op-z3ln5zkl-6bd15-zsxgs-master-1_openshift-etcd(58a6323b70902209edd1e223a48a21e8)\"\nEtcdMembersDegraded: 2 of 3 members are available, ci-op-z3ln5zkl-6bd15-zsxgs-master-1 is unhealthy",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator etcd is reporting a failure: StaticPodsDegraded: nodes/ci-op-z3ln5zkl-6bd15-zsxgs-master-1 pods/etcd-ci-op-z3ln5zkl-6bd15-zsxgs-master-1 container="etcd" is not ready
    StaticPodsDegraded: nodes/ci-op-z3ln5zkl-6bd15-zsxgs-master-1 pods/etcd-ci-op-z3ln5zkl-6bd15-zsxgs-master-1 container="etcd" is waiting: "CrashLoopBackOff" - "back-off 5m0s restarting failed container=etcd pod=etcd-ci-op-z3ln5zkl-6bd15-zsxgs-master-1_openshift-etcd(58a6323b70902209edd1e223a48a21e8)"
    EtcdMembersDegraded: 2 of 3 members are available, ci-op-z3ln5zkl-6bd15-zsxgs-master-1 is unhealthy
occurred
failed: (2h38m50s) 2020-10-02T01:44:36 "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]"
[155 log lines skipped]
Oct 01 23:06:35.627 I ns/openshift-machine-api machine/ci-op-z3ln5zkl-6bd15-zsxgs-master-2 reason/Update Updated Machine ci-op-z3ln5zkl-6bd15-zsxgs-master-2 (7 times)
Oct 01 23:06:36.141 I ns/openshift-machine-api machine/ci-op-z3ln5zkl-6bd15-zsxgs-worker-b-pmm9v reason/Update Updated Machine ci-op-z3ln5zkl-6bd15-zsxgs-worker-b-pmm9v (12 times)
Oct 01 23:06:36.779 I ns/openshift-machine-api machine/ci-op-z3ln5zkl-6bd15-zsxgs-worker-c-4ffwl reason/Update Updated Machine ci-op-z3ln5zkl-6bd15-zsxgs-worker-c-4ffwl (12 times)
Oct 01 23:06:37.317 I ns/openshift-machine-api machine/ci-op-z3ln5zkl-6bd15-zsxgs-worker-d-w2vgb reason/Update Updated Machine ci-op-z3ln5zkl-6bd15-zsxgs-worker-d-w2vgb (11 times)
Oct 01 23:06:40.185 I ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-z3ln5zkl-6bd15-zsxgs-master-1 node/ci-op-z3ln5zkl-6bd15-zsxgs-master-1 container/kube-controller-manager reason/Pulled image/quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c011bbecf7609480bbf4979ebf249cbfd721025a9bfbb7cbe919c06ffcd12e6
Oct 01 23:06:40.186 E ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-z3ln5zkl-6bd15-zsxgs-master-1 node/ci-op-z3ln5zkl-6bd15-zsxgs-master-1 container/kube-controller-manager container exited with code 255 (Error): =true: dial tcp [::1]:6443: connect: connection refused\nE1001 23:06:39.762160       1 reflector.go:307] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to watch *v1.PartialObjectMetadata: Get https://localhost:6443/apis/operators.coreos.com/v1alpha1/catalogsources?allowWatchBookmarks=true&resourceVersion=24960&timeout=6m25s&timeoutSeconds=385&watch=true: dial tcp [::1]:6443: connect: connection refused\nE1001 23:06:39.763714       1 reflector.go:307] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to watch *v1.PartialObjectMetadata: Get https://localhost:6443/apis/operators.coreos.com/v1/operatorsources?allowWatchBookmarks=true&resourceVersion=18788&timeout=7m46s&timeoutSeconds=466&watch=true: dial tcp [::1]:6443: connect: connection refused\nE1001 23:06:39.764736       1 reflector.go:307] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to watch *v1.PartialObjectMetadata: Get https://localhost:6443/apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=24650&timeout=5m55s&timeoutSeconds=355&watch=true: dial tcp [::1]:6443: connect: connection refused\nE1001 23:06:39.765871       1 reflector.go:307] k8s.io/client-go/informers/factory.go:135: Failed to watch *v1.CSINode: Get https://localhost:6443/apis/storage.k8s.io/v1/csinodes?allowWatchBookmarks=true&resourceVersion=20280&timeout=8m5s&timeoutSeconds=485&watch=true: dial tcp [::1]:6443: connect: connection refused\nE1001 23:06:39.767063       1 reflector.go:307] k8s.io/client-go/metadata/metadatainformer/informer.go:89: Failed to watch *v1.PartialObjectMetadata: Get https://localhost:6443/apis/operator.openshift.io/v1/dnses?allowWatchBookmarks=true&resourceVersion=22228&timeout=8m41s&timeoutSeconds=521&watch=true: dial tcp [::1]:6443: connect: connection refused\nI1001 23:06:39.811243       1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition\nF1001 23:06:39.811446       1 controllermanager.go:291] leaderelection lost\n

Comment 1 W. Trevor King 2020-10-07 23:49:32 UTC
From Suresh:

> 2020-10-02 01:50:24.807234 C | etcdserver/membership: cluster cannot be downgraded (current version: 3.3.22 is lower than determined cluster version: 3.4).
> ...
> Did someone try to roll back on master-1? Otherwise this doesn't make much sense.

And indeed, this was a 4.4 -> 4.5 -> 4.4 round-trip attempt:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/artifacts/e2e-gcp-upgrade/gather-extra/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-10-01T23:49:45Z  4.4.27 Partial true
2020-10-01T23:08:11Z 2020-10-01T23:49:35Z 4.5.0-0.ci-2020-10-01-174117 Completed false
2020-10-01T22:31:41Z 2020-10-01T23:05:26Z 4.4.27 Completed false
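
(The same artifact collapsed into chronological version transitions; the trailing Partial 4.4.27 entry is the attempted rollback:)

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/artifacts/e2e-gcp-upgrade/gather-extra/clusterversion.json | jq -r '[.items[].status.history[] | .version + " (" + .state + ")"] | reverse | join(" -> ")'
4.4.27 (Completed) -> 4.5.0-0.ci-2020-10-01-174117 (Completed) -> 4.4.27 (Partial)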

Assigning to test-infra, because we'll need an openshift/release PR to make these release informers a one-way 4.4->4.5 test.

Comment 2 W. Trevor King 2020-10-08 05:36:25 UTC
Took a stab at a PR^, but here are some notes:

Some informer jobs are setting abort-at, e.g. release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5 [1].  But that's explicitly a rollback test, so that's ok.  release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci is not setting abort-at [2], so I'm not sure why it's rolling back, but:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/build-log.txt | grep aborted
Oct  1 23:08:00.578: INFO: Upgrade will be aborted and the cluster will roll back to the current version after 100% of operators have upgraded
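
(For comparison, a quick way to list which job definitions in openshift/release set abort-at at all, assuming a local checkout of the repo:)

$ git --no-pager grep -n 'abort-at' ci-operator/jobs/openshift/release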

Code [3,4] certainly reads to me like the default is zero, i.e. "no abort", but I don't see anything in [5,6] about abort-at.
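
(For the record, grepping the raw prowjob object for that option also comes back empty; the storage.googleapis.com URL is assumed from the build-log path above:)

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/prowjob.json | grep 'abort-at'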

[7] looks like multi-step, though.  [8] seems to be selecting a multi-step target.  Only one update workflow for the release repo:

$ git --no-pager grep 'workflow.*upgrade' ci-operator/config/openshift/release
ci-operator/config/openshift/release/openshift-release-master__ocp-4.5-ci.yaml:    workflow: openshift-upgrade-aws

Which is the e2e-44-stable-to-45-ci entry [9] only used for the periodic-ci-openshift-release-master-ocp-4.5-ci-e2e-44-stable-to-45-ci job [10].  So I'm still a bit fuzzy on how abort-at is being set for release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci.
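
(To narrow that down, one could list every file in openshift/release that mentions the job, again assuming a local checkout:)

$ git --no-pager grep -l 'release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci' ci-operator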

[1]: https://github.com/openshift/release/blob/f180f1a8fb7d7fc4d1a64f29e97c8c4f4f37f5b2/ci-operator/jobs/openshift/release/openshift-release-release-4.5-periodics.yaml#L5775-L5841
[2]: https://github.com/openshift/release/blob/f180f1a8fb7d7fc4d1a64f29e97c8c4f4f37f5b2/ci-operator/jobs/openshift/release/openshift-release-release-4.5-periodics.yaml#L7303-L7369
[3]: https://github.com/openshift/origin/blob/3b3a17d295c52bb6712eaeea2682e682524b83cc/test/e2e/upgrade/upgrade.go#L88-L113
[4]: https://github.com/openshift/origin/blob/3b3a17d295c52bb6712eaeea2682e682524b83cc/test/e2e/upgrade/upgrade.go#L261-L271
[5]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/podinfo.json
[6]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/prowjob.json
[7]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/1311793781384679424/artifacts/e2e-gcp-upgrade/
[8]: https://github.com/openshift/release/blob/f180f1a8fb7d7fc4d1a64f29e97c8c4f4f37f5b2/ci-operator/jobs/openshift/release/openshift-release-release-4.5-periodics.yaml#L7313
[9]: https://github.com/openshift/release/blob/f180f1a8fb7d7fc4d1a64f29e97c8c4f4f37f5b2/ci-operator/config/openshift/release/openshift-release-master__ocp-4.5-ci.yaml#L17-L21
[10]: https://github.com/openshift/release/blob/f180f1a8fb7d7fc4d1a64f29e97c8c4f4f37f5b2/ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml#L455-L465

Comment 3 W. Trevor King 2020-10-08 16:29:41 UTC
Ah, it had been using workflows, but was recently moved back to templates [1].

[1]: https://github.com/openshift/release/pull/12395