+++ This bug was initially created as a clone of Bug #1734462 +++

Description of problem:
Cluster upgrade seems to time out.

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial] (1h20m37s)

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:139]: during upgrade
Unexpected error:
    <*errors.errorString | 0xc0027f2ac0>: {
        s: "Pools did not complete upgrade: timed out waiting for the condition",
    }
    Pools did not complete upgrade: timed out waiting for the condition
occurred

Link: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831

Not sure if related to Bug 1734462.
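As background, "timed out waiting for the condition" is the generic error the Kubernetes wait helpers return when a poll never succeeds, so the failure means the upgrade test's machine-config-pool poll never saw every pool report Updated=True before its deadline. Below is only a minimal sketch of that kind of poll against a recent client-go dynamic client; the kubeconfig handling, poll interval, and one-hour timeout are assumptions for illustration, not the test's actual values.

```
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// mcpGVR identifies the MachineConfigPool custom resource.
var mcpGVR = schema.GroupVersionResource{
	Group:    "machineconfiguration.openshift.io",
	Version:  "v1",
	Resource: "machineconfigpools",
}

// poolUpdated returns true when the pool's Updated condition is True.
func poolUpdated(pool *unstructured.Unstructured) bool {
	conditions, _, _ := unstructured.NestedSlice(pool.Object, "status", "conditions")
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Updated" && cond["status"] == "True" {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Poll every 10s for up to an hour; on timeout the helper surfaces the
	// generic "timed out waiting for the condition" error seen above.
	err = wait.PollImmediate(10*time.Second, time.Hour, func() (bool, error) {
		pools, err := client.Resource(mcpGVR).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and keep polling
		}
		for i := range pools.Items {
			if !poolUpdated(&pools.Items[i]) {
				fmt.Printf("pool %s has not finished updating\n", pools.Items[i].GetName())
				return false, nil
			}
		}
		return true, nil
	})
	if err != nil {
		fmt.Println("Pools did not complete upgrade:", err)
	}
}
```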
*** Bug 1734462 has been marked as a duplicate of this bug. ***
```
Jul 30 11:09:14.614: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
```
```
E0730 10:23:25.394029 2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure
```
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-lcl6b_machine-config-daemon.log

```
'etcdserver: request timed out, possibly due to previous leader failure' (will not retry!)
E0730 10:23:25.394029 2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure
```

Lots of etcd errors here:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-134-214.ec2.internal_etcd-member.log
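The "(will not retry!)" text indicates the retry wrapper around the MCD's write classified the etcd timeout as terminal, at which point writer.go flips the node to Degraded. The snippet below is only an illustrative sketch of that give-up-and-mark-degraded pattern, not the MCD's actual code; applyUpdate, isRetriable, and setNodeDegraded are hypothetical stand-ins.

```
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isRetriable is a hypothetical classifier: conflict-style errors would be
// retried, while errors such as "etcdserver: request timed out" are terminal.
func isRetriable(err error) bool {
	return strings.Contains(err.Error(), "conflict")
}

// applyUpdate stands in for the node annotation/state write the MCD performs.
func applyUpdate() error {
	return errors.New("etcdserver: request timed out, possibly due to previous leader failure")
}

// setNodeDegraded stands in for writer.go marking the node Degraded.
func setNodeDegraded(reason error) {
	fmt.Println("Marking Degraded due to:", reason)
}

func main() {
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		lastErr = applyUpdate()
		if lastErr == nil {
			return
		}
		if !isRetriable(lastErr) {
			// A terminal error ends the retry loop immediately.
			fmt.Printf("'%v' (will not retry!)\n", lastErr)
			break
		}
		time.Sleep(time.Second)
	}
	setNodeDegraded(lastErr)
}
```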
From the MCPs, https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/machineconfigpools.json:

```
"message": "All nodes are updating to rendered-worker-b5a06c8b4bb82ffea9b2308428039754",
"reason": "",
"status": "True",
"type": "Updating"
```

and also:

```
"degradedMachineCount": 0,
"machineCount": 3,
"observedGeneration": 3,
"readyMachineCount": 0,
"unavailableMachineCount": 1,
"updatedMachineCount": 1
```

Masters upgraded correctly. For the workers: during the upgrade of the first worker, the MCD hit the etcdserver error and marked the node degraded; that worker is on the new config.

Also seeing this in a few etcd members (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-164-172.ec2.internal_etcd-member.log):

`etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1.999997135s) to execute`

As well as (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-164-172.ec2.internal_etcd-member_previous.log):

`2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member`

Not sure about etcd beyond what I've found here; going to let them take a look to confirm.
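For reference, the counts quoted above come straight out of the machineconfigpools.json artifact. A small sketch like the one below is enough to print them per pool and see that the worker pool still had machines on the old config and one unavailable; it assumes the artifact is the usual `oc get -o json` list with an items array, and the struct models only the status fields quoted here.

```
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// poolStatus models only the status counters quoted from the artifact.
type poolStatus struct {
	MachineCount            int `json:"machineCount"`
	ReadyMachineCount       int `json:"readyMachineCount"`
	UpdatedMachineCount     int `json:"updatedMachineCount"`
	UnavailableMachineCount int `json:"unavailableMachineCount"`
	DegradedMachineCount    int `json:"degradedMachineCount"`
}

type pool struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Status poolStatus `json:"status"`
}

type poolList struct {
	Items []pool `json:"items"`
}

func main() {
	raw, err := os.ReadFile("machineconfigpools.json")
	if err != nil {
		panic(err)
	}
	var list poolList
	if err := json.Unmarshal(raw, &list); err != nil {
		panic(err)
	}
	for _, p := range list.Items {
		s := p.Status
		fmt.Printf("%s: %d/%d updated, %d ready, %d unavailable, %d degraded\n",
			p.Metadata.Name, s.UpdatedMachineCount, s.MachineCount,
			s.ReadyMachineCount, s.UnavailableMachineCount, s.DegradedMachineCount)
	}
}
```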
> 2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
> E0730 10:23:25.394029 2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure

# kubelet logs
```
139391:Jul 30 10:23:51 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:51.782459 1196 container.go:464] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
```

# etcd logs
```
2019-07-30 10:23:51.937656 N | pkg/osutil: received terminated signal, shutting down...
2019-07-30 10:23:51.938319 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:51.938415 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:51.938501 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
```

This is a result of etcd being terminated by the MCD requesting a restart of the node, and it is an info-level log from etcd.

# etcd quorum guard then reports the cluster as unhealthy because the local etcd it is trying to communicate with has not yet started.

# below is the start of the etcd instance at 10:23:33.866734

# etcd logs
```
2019-07-30 10:23:33.866734 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
```

# kubelet
```
kubelet_service.log:135217:Jul 30 10:23:19 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:19.661543 1196 prober.go:112] Readiness probe for "etcd-quorum-guard-7f5b4bd65c-q5f9t_openshift-machine-config-operator(64e232e0-b2ac-11e9-8b86-12ab9625f47c):guard" failed (failure):
kubelet_service.log:135218:Jul 30 10:23:19 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:19.661643 1196 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"etcd-quorum-guard-7f5b4bd65c-q5f9t", UID:"64e232e0-b2ac-11e9-8b86-12ab9625f47c", APIVersion:"v1", ResourceVersion:"4264", FieldPath:"spec.containers{guard}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
kubelet_service.log:137369:Jul 30 10:23:29 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:29.652902 1196 prober.go:112] Readiness probe for "etcd-quorum-guard-7f5b4bd65c-q5f9t_openshift-machine-config-operator(64e232e0-b2ac-11e9-8b86-12ab9625f47c):guard" failed (failure):
kubelet_service.log:137370:Jul 30 10:23:29 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:29.653011 1196 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"etcd-quorum-guard-7f5b4bd65c-q5f9t", UID:"64e232e0-b2ac-11e9-8b86-12ab9625f47c", APIVersion:"v1", ResourceVersion:"4264", FieldPath:"spec.containers{guard}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
```

Two seconds later the MCD sets the Degraded state; this feels related.
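The quorum-guard behaviour described above is essentially a readiness probe that fails while the etcd member on the same node is down. Below is a minimal sketch of that idea, assuming the guard only needs to hit the local etcd client port's /health endpoint; the port, certificate handling, and timeout are assumptions for illustration, not the actual etcd-quorum-guard implementation.

```
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Skip certificate verification purely for illustration; a real guard
	// would authenticate against the etcd serving certificates.
	client := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	// While the node is rebooting (or etcd has not come back up yet), this
	// request fails and the readiness probe reports Unhealthy, matching the
	// kubelet events quoted above.
	resp, err := client.Get("https://localhost:2379/health")
	if err != nil {
		fmt.Println("local etcd member not reachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("local etcd member unhealthy, status:", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("local etcd member healthy")
}
```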
As a note, I was unable to reproduce this failure locally.

Looking at the runs around this time in CI: https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/?before=4840

It seems to be an isolated failure.
> As a note, I was unable to reproduce this failure locally.
> Looking at the runs around this time in CI: https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/?before=4840
> It seems to be an isolated failure.

I agree. Moving to 4.3 so we can track this as a possible issue with etcd-quorum-guard, but otherwise I see nothing that etcd could have done differently.
> Jul 30 10:29:51 ip-10-0-134-214 hyperkube[1844]: I0730 10:29:51.487016 1844 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml"
> Jul 30 10:29:51 ip-10-0-134-214 hyperkube[1844]: I0730 10:29:51.488886 1844 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml.mcdorig"

This was a bug that we resolved via https://bugzilla.redhat.com/show_bug.cgi?id=1744580, so marking this as a duplicate.

ref:
- https://github.com/openshift/machine-config-operator/issues/1074
- https://github.com/openshift/machine-config-operator/pull/1075

*** This bug has been marked as a duplicate of bug 1744580 ***
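For context on the root cause: the kubelet reads every file it can parse in the static-pod manifest directory as a pod definition, which is why the log above shows it loading etcd-member.yaml.mcdorig alongside the real manifest; the MCO issue and PR referenced above moved those backups out of the directory. The sketch below is only a rough illustration of spotting such leftovers; the directory path and suffix handling are assumptions, not part of the actual fix.

```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	const manifestDir = "/etc/kubernetes/manifests"

	entries, err := os.ReadDir(manifestDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		name := e.Name()
		// Anything that is not a plain manifest, e.g. a *.mcdorig backup
		// left behind by the MCD, is a candidate for the duplicate static
		// pod problem described in bug 1744580.
		ext := filepath.Ext(name)
		if ext != ".yaml" && ext != ".yml" && ext != ".json" {
			fmt.Printf("unexpected file in manifest dir: %s\n", name)
		}
		if strings.HasSuffix(name, ".mcdorig") {
			fmt.Printf("leftover MCD backup found: %s\n", name)
		}
	}
}
```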