A 4.1 to 4.2 upgrade test failed with:

```
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:139]: during upgrade
Unexpected error:
    <*errors.errorString | 0xc002601b50>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards registry.svc.ci.openshift.org/ocp/release:4.2.0-0.ci-2019-07-03-230711: downloading update",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards registry.svc.ci.openshift.org/ocp/release:4.2.0-0.ci-2019-07-03-230711: downloading update
occurred
```

It's unclear whether the timeout was on the client, the network, or the server, but it sounds like we should either retry the download or escalate to the team that should have SLAs on the server the download is served from. Seen in build https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/146
```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/146/artifacts/e2e-aws-upgrade/clusteroperators.json | jq '.items[] | select( .status.versions[] | .name == "operator" and .version != "4.2.0-0.ci-2019-07-03-230711")'
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2019-07-04T02:03:05Z",
    "generation": 1,
    "name": "dns",
    "resourceVersion": "2279",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/dns",
    "uid": "dc6f731d-9dff-11e9-85ac-12b4dd5f15ba"
  },
  "spec": {},
  "status": {
    "conditions": [
      { "lastTransitionTime": "2019-07-04T02:03:18Z", "message": "All desired DNS DaemonSets available and operand Namespace exists", "reason": "AsExpected", "status": "False", "type": "Degraded" },
      { "lastTransitionTime": "2019-07-04T02:03:20Z", "message": "Desired and available number of DNS DaemonSets are equal", "reason": "AsExpected", "status": "False", "type": "Progressing" },
      { "lastTransitionTime": "2019-07-04T02:03:18Z", "message": "At least 1 DNS DaemonSet available", "reason": "AsExpected", "status": "True", "type": "Available" }
    ],
    "extension": null,
    "relatedObjects": [
      { "group": "", "name": "openshift-dns-operator", "resource": "namespaces" },
      { "group": "", "name": "openshift-dns", "resource": "namespaces" }
    ],
    "versions": [
      { "name": "operator", "version": "4.1.4" },
      { "name": "coredns", "version": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a3d8c33144c665f38d479b3f922578e85c6a10dbb721cbb9a103dbea050d621c" },
      { "name": "openshift-cli", "version": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c2b7b83cf5d399b61414052aae326455bf0dc6154a5f9118e2527a0e5b446b4" }
    ]
  }
}
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2019-07-04T02:08:44Z",
    "generation": 1,
    "name": "image-registry",
    "resourceVersion": "13113",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/image-registry",
    "uid": "a6a9daef-9e00-11e9-921d-0e2a683742ec"
  },
  "spec": {},
  "status": {
    "conditions": [
      { "lastTransitionTime": "2019-07-04T02:10:01Z", "message": "The registry is ready", "reason": "Ready", "status": "True", "type": "Available" },
      { "lastTransitionTime": "2019-07-04T02:10:02Z", "message": "The registry is ready", "reason": "Ready", "status": "False", "type": "Progressing" },
      { "lastTransitionTime": "2019-07-04T02:08:45Z", "status": "False", "type": "Degraded" }
    ],
    "extension": null,
    "relatedObjects": [
      { "group": "rbac.authorization.k8s.io", "name": "system:registry", "resource": "clusterroles" },
      { "group": "rbac.authorization.k8s.io", "name": "registry-registry-role", "resource": "clusterrolebindings" },
      { "group": "", "name": "registry", "namespace": "openshift-image-registry", "resource": "serviceaccounts" },
      { "group": "", "name": "serviceca", "namespace": "openshift-image-registry", "resource": "configmaps" },
      { "group": "", "name": "image-registry-certificates", "namespace": "openshift-image-registry", "resource": "configmaps" },
      { "group": "", "name": "image-registry-private-configuration", "namespace": "openshift-image-registry", "resource": "secrets" },
      { "group": "config.openshift.io", "name": "cluster", "resource": "images" },
      { "group": "apps", "name": "node-ca", "namespace": "openshift-image-registry", "resource": "daemonsets" },
      { "group": "", "name": "image-registry", "namespace": "openshift-image-registry", "resource": "services" },
      { "group": "apps", "name": "image-registry", "namespace": "openshift-image-registry", "resource": "deployments" }
    ],
    "versions": [
      { "name": "operator", "version": "4.1.4" }
    ]
  }
}
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2019-07-04T02:03:23Z",
    "generation": 1,
    "name": "machine-config",
    "resourceVersion": "9877",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/machine-config",
    "uid": "e79d6a18-9dff-11e9-85ac-12b4dd5f15ba"
  },
  "spec": {},
  "status": {
    "conditions": [
      { "lastTransitionTime": "2019-07-04T02:03:57Z", "message": "Cluster has deployed 4.1.4", "status": "True", "type": "Available" },
      { "lastTransitionTime": "2019-07-04T02:03:57Z", "message": "Cluster version is 4.1.4", "status": "False", "type": "Progressing" },
      { "lastTransitionTime": "2019-07-04T02:03:24Z", "status": "False", "type": "Degraded" }
    ],
    "extension": {
      "master": "all 3 nodes are at latest configuration rendered-master-941ae957a181d3cd13d9c27dc68609fe",
      "worker": "all 3 nodes are at latest configuration rendered-worker-35742076660d2b75d6deba0aae93b0c1"
    },
    "relatedObjects": [
      { "group": "", "name": "openshift-machine-config-operator", "resource": "namespaces" },
      { "group": "machineconfiguration.openshift.io", "name": "master", "resource": "machineconfigpools" },
      { "group": "machineconfiguration.openshift.io", "name": "worker", "resource": "machineconfigpools" },
      { "group": "machineconfiguration.openshift.io", "name": "cluster", "resource": "controllerconfigs" }
    ],
    "versions": [
      { "name": "operator", "version": "4.1.4" }
    ]
  }
}
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterOperator",
  "metadata": {
    "creationTimestamp": "2019-07-04T02:02:28Z",
    "generation": 1,
    "name": "network",
    "resourceVersion": "9633",
    "selfLink": "/apis/config.openshift.io/v1/clusteroperators/network",
    "uid": "c65d558d-9dff-11e9-85ac-12b4dd5f15ba"
  },
  "spec": {},
  "status": {
    "conditions": [
      { "lastTransitionTime": "2019-07-04T02:02:29Z", "status": "False", "type": "Degraded" },
      { "lastTransitionTime": "2019-07-04T02:07:38Z", "status": "False", "type": "Progressing" },
      { "lastTransitionTime": "2019-07-04T02:03:50Z", "status": "True", "type": "Available" }
    ],
    "extension": null,
    "versions": [
      { "name": "operator", "version": "4.1.4" }
    ]
  }
}
```
Haven't seen this since.
"downloading update" timeouts are currently involved in ~32% of all AWS upgrade failures in CI [1]. Recent examples: $ curl -s 'https://ci-search-ci-search-next.svc.ci.openshift.org/search?name=e2e-aws-upgrade&maxAge=86400s&context=0&search=Cluster+did+not+complete+upgrade.*timed+out.*downloading+update' | jq -r '. | keys[]' https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5440 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5445 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5460 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5464 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5482 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5488 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23581/pull-ci-openshift-origin-master-e2e-aws-upgrade/4082 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/172/pull-ci-openshift-cluster-authentication-operator-master-e2e-aws-upgrade/102 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/340/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-upgrade/431 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/360/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-upgrade/427 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/360/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-upgrade/429 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/282/pull-ci-openshift-cluster-network-operator-master-e2e-aws-upgrade/583 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/283/pull-ci-openshift-cluster-network-operator-master-e2e-aws-upgrade/586 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/1203/pull-ci-openshift-installer-master-e2e-aws-upgrade/1736 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_multus-cni/19/pull-ci-openshift-multus-cni-master-e2e-aws-upgrade/17 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_multus-cni/19/pull-ci-openshift-multus-cni-master-e2e-aws-upgrade/19 https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_multus-cni/19/pull-ci-openshift-multus-cni-master-e2e-aws-upgrade/20 Digging into one of those [2], the build log had: fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:139]: during upgrade Unexpected error: <*errors.errorString | 0xc0019f6ec0>: { s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649: downloading update", } Cluster did not complete upgrade: timed out waiting for the condition: Working towards 
registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649: downloading update occurred the source and targets were 4.2.0 nightlies [3,4]. The cluster-version operator was complaining [5]: - lastTransitionTime: 2019-08-10T00:52:02Z message: 'Unable to retrieve available updates: currently installed version 4.2.0-0.nightly-2019-08-10-002649 not found in the "stable-4.2" channel' reason: RemoteFailed status: "False" type: RetrievedUpdates desired: force: true image: registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649 version: "" I'd have expected 'force' to mean "don't worry that you don't see the next release in your target channel", but maybe it (no longer?) means that? [1]: https://github.com/openshift/installer/pull/1203#issuecomment-520116714 [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496 [3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/release-images-initial/release-images-initial [4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/release-images-latest/release-images-latest [5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/must-gather/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml
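For context, the CI job drives the update by setting the ClusterVersion's desiredUpdate to that image with force: true, which should be roughly equivalent to this client-side request (a sketch; the job may set the spec directly rather than going through oc):

```
$ oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649 --force
```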
Hrm, actually that RetrievedUpdates status is about the current version. One condition up is about the new version:

```
- lastTransitionTime: 2019-08-10T01:32:01Z
  message: 'Working towards registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649: downloading update'
  reason: DownloadingUpdate
  status: "True"
  type: Progressing
```

I'm trying to figure out how that could have hung for over a minute...
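To eyeball all of the ClusterVersion conditions at once, a quick-and-dirty grep over the must-gather artifact [5] is enough (a sketch; it only catches the first line of wrapped message values):

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/must-gather/cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | grep -E '^ *(type|message):'
```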
From the CVO's logs [1]:

```
2019-08-10T01:32:19.013068717Z I0810 01:32:19.012836 1 start.go:19] ClusterVersionOperator v4.2.0-201908081419-dirty
...
2019-08-10T01:32:19.444606723Z I0810 01:32:19.444526 1 cvo.go:370] Desired version from spec is v1.Update{Version:"", Image:"registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649", Force:true}
2019-08-10T01:32:19.444808306Z I0810 01:32:19.444760 1 sync_worker.go:453] Running sync registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649 (force=true) on generation 2 in state Updating at attempt 0
2019-08-10T01:32:19.444808306Z I0810 01:32:19.444791 1 sync_worker.go:459] Loading payload
2019-08-10T01:32:19.444870155Z I0810 01:32:19.444823 1 payload.go:210] Loading updatepayload from "/"
...
2019-08-10T01:32:19.879722853Z I0810 01:32:19.879690 1 sync_worker.go:497] Payload loaded from registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-10-002649 with hash 0574hhi_ipo=
...
2019-08-10T01:32:19.881185677Z I0810 01:32:19.881146 1 sync_worker.go:574] Running sync for namespace "openshift-cluster-version" (1 of 406)
...
2019-08-10T01:37:42.979084375Z I0810 01:37:42.977786 1 sync_worker.go:574] Running sync for deployment "openshift-image-registry/cluster-image-registry-operator" (159 of 406)
...
2019-08-10T01:37:43.452707455Z I0810 01:37:43.452648 1 request.go:530] Throttling request took 168.525252ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-image-registry/deployments/cluster-image-registry-operator
...
2019-08-10T01:37:43.703423876Z I0810 01:37:43.703373 1 request.go:530] Throttling request took 241.654592ms, request: PUT:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-image-registry/deployments/cluster-image-registry-operator
...
2019-08-10T01:37:43.96053968Z I0810 01:37:43.960486 1 request.go:530] Throttling request took 228.278406ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-image-registry/deployments/cluster-image-registry-operator
...
2019-08-10T01:38:04.731347395Z E0810 01:38:04.731330 1 task.go:77] error running apply for deployment "openshift-image-registry/cluster-image-registry-operator" (159 of 406): timed out waiting for the condition
...
2019-08-10T02:51:04.961007914Z I0810 02:51:04.960959 1 apps.go:115] Deployment cluster-image-registry-operator is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1, reason: ProgressDeadlineExceeded, message: ReplicaSet "cluster-image-registry-operator-655557cc97" has timed out progressing.)
...
```

Let's check on that deployment (not sure why we weren't surfacing this error in the ClusterVersion object...).
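For anyone reproducing this triage, the apply errors and not-ready complaints can be pulled straight out of that log [1] with a grep (a sketch; the patterns match the lines quoted above):

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-58ff6d9944-n92ch/cluster-version-operator/cluster-version-operator/logs/current.log | grep -E 'error running apply|is not ready'
```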
Strangely, I only see an image-registry Deployment in the must-gather output [2], but here's the operator Deployment via our direct oc gets:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/deployments.json | gunzip | jq '.items[] | select(.metadata.name == "cluster-image-registry-operator").status'
{
  "availableReplicas": 1,
  "conditions": [
    {
      "lastTransitionTime": "2019-08-10T00:57:09Z",
      "lastUpdateTime": "2019-08-10T00:57:09Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-08-10T01:47:44Z",
      "lastUpdateTime": "2019-08-10T01:47:44Z",
      "message": "ReplicaSet \"cluster-image-registry-operator-655557cc97\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 2,
  "readyReplicas": 1,
  "replicas": 2,
  "unavailableReplicas": 1,
  "updatedReplicas": 1
}
```

Where's that other pod? Is must-gather missing the registry-namespace pods too [3]? Back to our oc get:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-image-registry") | select(.status.phase != "Running" and .status.phase != "Succeeded") | .status.phase + "\t" + .metadata.name + "\t" + ([([(.status.containerStatuses // [])[].state.waiting.reason] + [(.status.conditions // [])[].message] + [""])[] | select(. != null)][0])'
Pending	cluster-image-registry-operator-655557cc97-5h5t8	CreateContainerError
```

Pending? Looks like a busted container:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name == "cluster-image-registry-operator-655557cc97-5h5t8").status.containerStatuses[] | select(.ready | not).state.waiting'
{
  "message": "container create failed: container_linux.go:336: starting container process caused \"exec: \\\"cluster-image-registry-operator-watch\\\": executable file not found in $PATH\"\n",
  "reason": "CreateContainerError"
}
```

Dunno what that's about, but I feel pretty comfortable bumping this over to the registry folks for a closer look.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-58ff6d9944-n92ch/cluster-version-operator/cluster-version-operator/logs/current.log
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-image-registry/apps/deployments/
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/5496/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-image-registry/
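On a live cluster, the equivalent scan for wedged containers in that namespace would be roughly this (a sketch, assuming oc access and jq; prints pod/container plus the waiting reason, if any):

```
$ oc -n openshift-image-registry get pods -o json | jq -r '.items[] | .metadata.name as $pod | (.status.containerStatuses // [])[] | select(.ready | not) | $pod + "/" + .name + "\t" + (.state.waiting.reason // "-")'
```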
Seems like cluster-image-registry-operator-watch is only an issue for the ART nightlies, and it may be involved in the 4.2 nightlies being broken for the last ~2 days [1].

[1]: https://openshift-release.svc.ci.openshift.org/#4.2.0-0.nightly
New cluster-image-registry-operator-watch code landed a day ago via https://github.com/openshift/cluster-image-registry-operator/pull/342, so I suspect it's just not compatible with the ART nightly image builds at the moment.
This was broken because the original PR didn't update Dockerfile.rhel7. Ricardo can fix it when he undoes my revert of his PR.
Also, the registry operator's CI configuration should be fixed to use the same Dockerfile that ART uses, and the other Dockerfile should be deleted, so we don't run into this situation again.
Could upgrade co image-registry from 4.1.11 to 4.2.0-0.nightly-2019-08-13-061659 without error.
Could upgrade co image-registry from 4.2.0-0.nightly-2019-08-08-103722 to 4.2.0-0.nightly-2019-08-13-061659 without error.
Could install a fresh cluster with the 4.2.0-0.nightly-2019-08-13-061659 payload without error for co image-registry.

```
$ oc get pods -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-65f9589bf5-tmqcw   2/2     Running   0          6m3s

$ oc logs -f cluster-image-registry-operator-65f9589bf5-tmqcw -n openshift-image-registry -c cluster-image-registry-operator-watch
I0813 08:57:11.399444 1 cmd.go:296] Waiting for process "cluster-image-registry-operator" PID ...
I0813 08:57:11.400425 1 cmd.go:303] Watching for changes in: ([]string) (len=1 cap=1) {
 (string) (len=61) "/proc/1/root/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem"
}
I0813 08:57:11.412450 1 observer_polling.go:106] Starting file observer
```
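One more quick check worth adding to this kind of verification (a sketch, assuming the same cluster; it should print 4.2.0-0.nightly-2019-08-13-061659):

```
$ oc get clusteroperator image-registry -o jsonpath='{.status.versions[?(@.name=="operator")].version}'
```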
XiuJuan, looks like you cleared "Target Release"? We expect this fix to go out in 4.2.0, right? Am I just missing something about our Bugzilla process?
Hmmm, I think it was a mistake to clear the 4.2.0 "Target Release". This bug's fix needs to go out in 4.2.0. Thanks for fixing it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922