Bug 1734483
Summary: | Test failure: Cluster upgrade should maintain a functioning cluster | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Lokesh Mandvekar <lsm5>
Component: | Machine Config Operator | Assignee: | Sam Batschelet <sbatsche>
Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu>
Severity: | unspecified | Priority: | unspecified
Version: | 4.1.z | Target Release: | 4.3.0
Hardware: | Unspecified | OS: | Unspecified
Clone Of: | 1734462 | CC: | adahiya, jialiu, kgarriso, mfojtik
Target Milestone: | --- | Last Closed: | 2019-11-12 22:34:08 UTC
Description
Lokesh Mandvekar
2019-07-30 16:33:55 UTC
*** Bug 1734462 has been marked as a duplicate of this bug. ***

```
Jul 30 11:09:14.614: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
```

```
E0730 10:23:25.394029    2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure
```

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-lcl6b_machine-config-daemon.log

```
'etcdserver: request timed out, possibly due to previous leader failure' (will not retry!)
E0730 10:23:25.394029    2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure
```

Lots of etcd errors here: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-134-214.ec2.internal_etcd-member.log

From the MachineConfigPools (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/machineconfigpools.json):

```
"message": "All nodes are updating to rendered-worker-b5a06c8b4bb82ffea9b2308428039754",
"reason": "",
"status": "True",
"type": "Updating"
```

and also:

```
"degradedMachineCount": 0,
"machineCount": 3,
"observedGeneration": 3,
"readyMachineCount": 0,
"unavailableMachineCount": 1,
"updatedMachineCount": 1
```

The masters upgraded correctly. For the workers: the first worker hit the etcdserver timeout during its upgrade and was marked degraded, even though that worker is already on the new config.

Also seeing in a few etcd member logs (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-164-172.ec2.internal_etcd-member.log):

`etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1.999997135s) to execute`

As well as (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-164-172.ec2.internal_etcd-member_previous.log):

```
2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
```

Not sure about etcd beyond what I've found here. Going to let the etcd team take a look to confirm.
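For illustration, the "(will not retry!)" log above suggests the MCD's state write gives up on the first etcd error, even though "request timed out, possibly due to previous leader failure" is transient and clears once a new leader is elected. Below is a minimal Go sketch of the retry-on-transient-failure pattern that would avoid that; `updateNodeState`, `setDegradedWithRetry`, and the matched error strings are hypothetical stand-ins, not the MCD's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isTransientEtcdErr reports whether an error looks like a transient etcd
// leadership/timeout failure worth retrying. Matching on error text is a
// simplification; the strings come from the CI logs above.
func isTransientEtcdErr(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "etcdserver: request timed out") ||
		strings.Contains(msg, "etcdserver: leader changed")
}

// updateNodeState is a stand-in for the node annotation write that failed in
// this bug (writer.go:127). Here it always fails with the error from the logs.
func updateNodeState() error {
	return errors.New("etcdserver: request timed out, possibly due to previous leader failure")
}

// setDegradedWithRetry retries transient etcd failures with backoff instead
// of marking the node Degraded on the first one.
func setDegradedWithRetry(attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = updateNodeState(); err == nil {
			return nil
		}
		if !isTransientEtcdErr(err) {
			return err // permanent failure: mark Degraded immediately
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("marking Degraded after %d attempts: %w", attempts, err)
}

func main() {
	if err := setDegradedWithRetry(5, 500*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```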
> 2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
> E0730 10:23:25.394029    2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure

# kubelet logs

```
139391:Jul 30 10:23:51 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:51.782459    1196 container.go:464] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
```

# etcd logs

```
2019-07-30 10:23:51.937656 N | pkg/osutil: received terminated signal, shutting down...
2019-07-30 10:23:51.938319 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:51.938415 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:51.938501 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
```

This is a result of etcd being terminated by the MCD requesting a restart of the node, and it is an info-level log from etcd.

The etcd quorum guard then reports the cluster as unhealthy because the local etcd it is trying to communicate with has not yet started. Below is the start of that etcd instance at 10:23:33.866734.

# etcd logs

```
2019-07-30 10:23:33.866734 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
```

# kubelet

```
kubelet_service.log:135217:Jul 30 10:23:19 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:19.661543    1196 prober.go:112] Readiness probe for "etcd-quorum-guard-7f5b4bd65c-q5f9t_openshift-machine-config-operator(64e232e0-b2ac-11e9-8b86-12ab9625f47c):guard" failed (failure):
kubelet_service.log:135218:Jul 30 10:23:19 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:19.661643    1196 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"etcd-quorum-guard-7f5b4bd65c-q5f9t", UID:"64e232e0-b2ac-11e9-8b86-12ab9625f47c", APIVersion:"v1", ResourceVersion:"4264", FieldPath:"spec.containers{guard}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
kubelet_service.log:137369:Jul 30 10:23:29 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:29.652902    1196 prober.go:112] Readiness probe for "etcd-quorum-guard-7f5b4bd65c-q5f9t_openshift-machine-config-operator(64e232e0-b2ac-11e9-8b86-12ab9625f47c):guard" failed (failure):
kubelet_service.log:137370:Jul 30 10:23:29 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:29.653011    1196 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"etcd-quorum-guard-7f5b4bd65c-q5f9t", UID:"64e232e0-b2ac-11e9-8b86-12ab9625f47c", APIVersion:"v1", ResourceVersion:"4264", FieldPath:"spec.containers{guard}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
```

Two seconds later the MCD sets the Degraded state; this feels related.

As a note, I was unable to reproduce this failure locally. Looking at the runs around this time in CI (https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/?before=4840), it seems to be an isolated failure.

> As a note I was unable to reproduce this failure locally.
> Looking at the runs around this time in CI: https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/?before=4840
> It seems to be an isolated failure.

I agree. Moving to 4.3 so we can track this as a possible issue with etcd-quorum-guard, but otherwise I see nothing that etcd could have done differently.
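To illustrate the quorum guard interaction: etcd serves a `/health` endpoint on its client port, and a readiness probe along the lines below fails with a connection error for the whole window between the member being terminated and it serving again, which matches the kubelet "Readiness probe failed" events above. This is a simplified sketch, not the actual etcd-quorum-guard probe; the endpoint, TLS handling, and timeout are assumptions:

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// healthResponse matches etcd's /health payload, e.g. {"health":"true"}.
type healthResponse struct {
	Health string `json:"health"`
}

func main() {
	// Endpoint and timeout are illustrative; the real quorum guard targets
	// the node-local etcd member with proper client certificates.
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // sketch only
		},
	}

	resp, err := client.Get("https://localhost:2379/health")
	if err != nil {
		// While the member is restarting (as in the logs above), the
		// connection is refused and the probe fails, producing exactly the
		// kind of kubelet events quoted in this comment.
		fmt.Fprintf(os.Stderr, "probe failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var h healthResponse
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil || h.Health != "true" {
		fmt.Fprintf(os.Stderr, "member unhealthy: %+v err=%v\n", h, err)
		os.Exit(1)
	}
	fmt.Println("member healthy")
}
```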
> Jul 30 10:29:51 ip-10-0-134-214 hyperkube[1844]: I0730 10:29:51.487016    1844 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml"
> Jul 30 10:29:51 ip-10-0-134-214 hyperkube[1844]: I0730 10:29:51.488886    1844 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml.mcdorig"

This was a bug that we resolved via https://bugzilla.redhat.com/show_bug.cgi?id=1744580; marking as a duplicate.

ref:
- https://github.com/openshift/machine-config-operator/issues/1074
- https://github.com/openshift/machine-config-operator/pull/1075

*** This bug has been marked as a duplicate of bug 1744580 ***
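For context on the duplicate: the quoted kubelet lines show it reading both `etcd-member.yaml` and the MCD's `etcd-member.yaml.mcdorig` backup from the static pod manifest directory, and the kubelet treats every file in that directory as a pod manifest. A minimal Go sketch of the safer pattern of writing backups outside that directory follows; the backup location and helper name are hypothetical illustrations, not the fix that actually landed in the linked PR:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// backupOutsideManifestDir copies a file that is about to be replaced into a
// backup location *outside* /etc/kubernetes/manifests. Leaving a sibling
// "etcd-member.yaml.mcdorig" next to the original would cause the kubelet to
// parse it as a second etcd pod manifest, as seen in the logs above.
func backupOutsideManifestDir(manifestPath, backupDir string) error {
	data, err := os.ReadFile(manifestPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", manifestPath, err)
	}
	if err := os.MkdirAll(backupDir, 0o755); err != nil {
		return err
	}
	backupPath := filepath.Join(backupDir, filepath.Base(manifestPath)+".mcdorig")
	return os.WriteFile(backupPath, data, 0o644)
}

func main() {
	if err := backupOutsideManifestDir(
		"/etc/kubernetes/manifests/etcd-member.yaml",
		"/etc/machine-config-daemon/orig", // hypothetical backup location
	); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```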