Bug 1734483
Summary: | Test failure: Cluster upgrade should maintain a functioning cluster | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Lokesh Mandvekar <lsm5>
Component: | Machine Config Operator | Assignee: | Sam Batschelet <sbatsche>
Status: | CLOSED DUPLICATE | QA Contact: | ge liu <geliu>
Severity: | unspecified | Priority: | unspecified
Version: | 4.1.z | Target Release: | 4.3.0
Hardware: | Unspecified | OS: | Unspecified
Clone Of: | 1734462 | CC: | adahiya, jialiu, kgarriso, mfojtik
Target Milestone: | --- | Last Closed: | 2019-11-12 22:34:08 UTC
Description
Lokesh Mandvekar
2019-07-30 16:33:55 UTC
*** Bug 1734462 has been marked as a duplicate of this bug. ***

```
Jul 30 11:09:14.614: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
```

```
E0730 10:23:25.394029    2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure
```

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-lcl6b_machine-config-daemon.log

```
'etcdserver: request timed out, possibly due to previous leader failure' (will not retry!)
E0730 10:23:25.394029    2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure
```

Lots of etcd errors here: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-134-214.ec2.internal_etcd-member.log

From the MachineConfigPools (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/machineconfigpools.json):

```
"message": "All nodes are updating to rendered-worker-b5a06c8b4bb82ffea9b2308428039754",
"reason": "",
"status": "True",
"type": "Updating"
```

and also:

```
"degradedMachineCount": 0,
"machineCount": 3,
"observedGeneration": 3,
"readyMachineCount": 0,
"unavailableMachineCount": 1,
"updatedMachineCount": 1
```

The masters upgraded correctly. For the workers: the first worker hit the etcdserver timeout during its upgrade and was marked degraded, even though that worker is already on the new config.

Also seeing in a few etcd member logs (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-164-172.ec2.internal_etcd-member.log):

`etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "error:context deadline exceeded" took too long (1.999997135s) to execute`

As well as (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4831/artifacts/e2e-aws-upgrade/pods/openshift-etcd_etcd-member-ip-10-0-164-172.ec2.internal_etcd-member_previous.log):

```
2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
```

Not sure about etcd beyond what I've found here. Going to let the etcd team take a look to confirm.
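For illustration, the "(will not retry!)" log above suggests the MCD's state write gives up on the first etcd error, even though "request timed out, possibly due to previous leader failure" is transient and clears once a new leader is elected. Below is a minimal Go sketch of the retry-on-transient-failure pattern that would avoid that; `updateNodeState`, `setDegradedWithRetry`, and the matched error strings are hypothetical stand-ins, not the MCD's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isTransientEtcdErr reports whether an error looks like a transient etcd
// leadership/timeout failure worth retrying. Matching on error text is a
// simplification; the strings come from the CI logs above.
func isTransientEtcdErr(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "etcdserver: request timed out") ||
		strings.Contains(msg, "etcdserver: leader changed")
}

// updateNodeState is a stand-in for the node annotation write that failed in
// this bug (writer.go:127). Here it always fails with the error from the logs.
func updateNodeState() error {
	return errors.New("etcdserver: request timed out, possibly due to previous leader failure")
}

// setDegradedWithRetry retries transient etcd failures with backoff instead
// of marking the node Degraded on the first one.
func setDegradedWithRetry(attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = updateNodeState(); err == nil {
			return nil
		}
		if !isTransientEtcdErr(err) {
			return err // permanent failure: mark Degraded immediately
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("marking Degraded after %d attempts: %w", attempts, err)
}

func main() {
	if err := setDegradedWithRetry(5, 500*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}
```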
> 2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
> E0730 10:23:25.394029    2248 writer.go:127] Marking Degraded due to: etcdserver: request timed out, possibly due to previous leader failure

# kubelet logs

```
139391:Jul 30 10:23:51 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:51.782459    1196 container.go:464] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
```

# etcd logs

```
2019-07-30 10:23:51.937656 N | pkg/osutil: received terminated signal, shutting down...
2019-07-30 10:23:51.938319 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:51.938415 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:51.938501 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2019-07-30 10:23:52.438955 I | etcdserver: skipped leadership transfer for stopping non-leader member
```

This is a result of etcd being terminated by the MCD requesting a restart of the node, and it is an info-level log from etcd.

The etcd quorum guard then reports the cluster as unhealthy because the local etcd it is trying to communicate with has not yet started. Below is the start of that etcd instance at 10:23:33.866734.

# etcd logs

```
2019-07-30 10:23:33.866734 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
```

# kubelet

```
kubelet_service.log:135217:Jul 30 10:23:19 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:19.661543    1196 prober.go:112] Readiness probe for "etcd-quorum-guard-7f5b4bd65c-q5f9t_openshift-machine-config-operator(64e232e0-b2ac-11e9-8b86-12ab9625f47c):guard" failed (failure):
kubelet_service.log:135218:Jul 30 10:23:19 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:19.661643    1196 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"etcd-quorum-guard-7f5b4bd65c-q5f9t", UID:"64e232e0-b2ac-11e9-8b86-12ab9625f47c", APIVersion:"v1", ResourceVersion:"4264", FieldPath:"spec.containers{guard}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
kubelet_service.log:137369:Jul 30 10:23:29 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:29.652902    1196 prober.go:112] Readiness probe for "etcd-quorum-guard-7f5b4bd65c-q5f9t_openshift-machine-config-operator(64e232e0-b2ac-11e9-8b86-12ab9625f47c):guard" failed (failure):
kubelet_service.log:137370:Jul 30 10:23:29 ip-10-0-164-172 hyperkube[1196]: I0730 10:23:29.653011    1196 event.go:209] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-machine-config-operator", Name:"etcd-quorum-guard-7f5b4bd65c-q5f9t", UID:"64e232e0-b2ac-11e9-8b86-12ab9625f47c", APIVersion:"v1", ResourceVersion:"4264", FieldPath:"spec.containers{guard}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
```

Two seconds later the MCD sets the Degraded state; this feels related.

As a note, I was unable to reproduce this failure locally. Looking at the runs around this time in CI (https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/?before=4840), it seems to be an isolated failure.

> As a note I was unable to reproduce this failure locally.
> Looking at the runs around this time in CI: https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/?before=4840
> It seems to be an isolated failure.

I agree. Moving to 4.3 so we can track this as a possible issue with etcd-quorum-guard, but otherwise I see nothing that etcd could have done differently.
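To illustrate the quorum guard interaction: etcd serves a `/health` endpoint on its client port, and a readiness probe along the lines below fails with a connection error for the whole window between the member being terminated and it serving again, which matches the kubelet "Readiness probe failed" events above. This is a simplified sketch, not the actual etcd-quorum-guard probe; the endpoint, TLS handling, and timeout are assumptions:

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// healthResponse matches etcd's /health payload, e.g. {"health":"true"}.
type healthResponse struct {
	Health string `json:"health"`
}

func main() {
	// Endpoint and timeout are illustrative; the real quorum guard targets
	// the node-local etcd member with proper client certificates.
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // sketch only
		},
	}

	resp, err := client.Get("https://localhost:2379/health")
	if err != nil {
		// While the member is restarting (as in the logs above), the
		// connection is refused and the probe fails, producing exactly the
		// kind of kubelet events quoted in this comment.
		fmt.Fprintf(os.Stderr, "probe failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var h healthResponse
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil || h.Health != "true" {
		fmt.Fprintf(os.Stderr, "member unhealthy: %+v err=%v\n", h, err)
		os.Exit(1)
	}
	fmt.Println("member healthy")
}
```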
> Jul 30 10:29:51 ip-10-0-134-214 hyperkube[1844]: I0730 10:29:51.487016    1844 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml"
> Jul 30 10:29:51 ip-10-0-134-214 hyperkube[1844]: I0730 10:29:51.488886    1844 file.go:200] Reading config file "/etc/kubernetes/manifests/etcd-member.yaml.mcdorig"

This was a bug that we resolved via https://bugzilla.redhat.com/show_bug.cgi?id=1744580; marking as a duplicate.

ref:
- https://github.com/openshift/machine-config-operator/issues/1074
- https://github.com/openshift/machine-config-operator/pull/1075

*** This bug has been marked as a duplicate of bug 1744580 ***
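For context on the duplicate: the quoted kubelet lines show it reading both `etcd-member.yaml` and the MCD's `etcd-member.yaml.mcdorig` backup from the static pod manifest directory, and the kubelet treats every file in that directory as a pod manifest. A minimal Go sketch of the safer pattern of writing backups outside that directory follows; the backup location and helper name are hypothetical illustrations, not the fix that actually landed in the linked PR:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// backupOutsideManifestDir copies a file that is about to be replaced into a
// backup location *outside* /etc/kubernetes/manifests. Leaving a sibling
// "etcd-member.yaml.mcdorig" next to the original would cause the kubelet to
// parse it as a second etcd pod manifest, as seen in the logs above.
func backupOutsideManifestDir(manifestPath, backupDir string) error {
	data, err := os.ReadFile(manifestPath)
	if err != nil {
		return fmt.Errorf("reading %s: %w", manifestPath, err)
	}
	if err := os.MkdirAll(backupDir, 0o755); err != nil {
		return err
	}
	backupPath := filepath.Join(backupDir, filepath.Base(manifestPath)+".mcdorig")
	return os.WriteFile(backupPath, data, 0o644)
}

func main() {
	if err := backupOutsideManifestDir(
		"/etc/kubernetes/manifests/etcd-member.yaml",
		"/etc/machine-config-daemon/orig", // hypothetical backup location
	); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```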