Bug 1754133 - During upgrade: unable to retrieve cluster version during upgrade: ... connect: connection refused
Status: CLOSED DUPLICATE of bug 1754125
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
Reported: 2019-09-20 23:07 UTC by W. Trevor King
Modified: 2019-09-30 12:42 UTC

Last Closed: 2019-09-30 12:42:31 UTC


Description W. Trevor King 2019-09-20 23:07:29 UTC
Possibly a dup of bug 1754125, although this one is "connection refused" and not "the server could not find the requested resource".  Example in a failed upgrade from 4.2.0-0.ci-2019-09-20-104546 to 4.2.0-0.ci-2019-09-20-130212 [1]:

Sep 20 14:12:05.822: INFO: cluster upgrade is Progressing: Working towards 4.2.0-0.ci-2019-09-20-130212: 0% complete
Sep 20 14:12:15.836: INFO: unable to retrieve cluster version during upgrade: Get https://api.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions/version: dial tcp 3.230.105.193:6443: connect: connection refused
Sep 20 14:12:25.836: INFO: unable to retrieve cluster version during upgrade: Get https://api.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions/version: dial tcp 3.230.105.193:6443: connect: connection refused
Sep 20 14:12:35.872: INFO: cluster upgrade is Progressing: Working towards 4.2.0-0.ci-2019-09-20-130212: 1% complete
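
For reference, the poll above amounts to roughly the following sketch (a hypothetical illustration using the openshift/client-go config clientset of that era, not the actual test code; the kubeconfig path is made up):

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	// Assumes a kubeconfig pointing at the cluster being upgraded.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	for range time.Tick(10 * time.Second) {
		// GET /apis/config.openshift.io/v1/clusterversions/version --
		// the request that is failing with "connection refused" above.
		cv, err := client.ConfigV1().ClusterVersions().Get("version", metav1.GetOptions{})
		if err != nil {
			fmt.Printf("unable to retrieve cluster version during upgrade: %v\n", err)
			continue
		}
		for _, cond := range cv.Status.Conditions {
			if cond.Type == "Progressing" && cond.Status == "True" {
				fmt.Printf("cluster upgrade is Progressing: %s\n", cond.Message)
			}
		}
	}
}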

Lots of concerning error events in that run, including:

Sep 20 14:08:29.877 E ns/openshift-etcd pod/etcd-member-ip-10-0-158-65.ec2.internal node/ip-10-0-158-65.ec2.internal container=etcd-member container exited with code 255 (Error): 06:10.242780 I | rafthttp: stopped peer c63674ca7ba88d2d\n2019-09-20 14:06:10.242788 I | rafthttp: stopping peer 2bef8a836e31cf17...\n2019-09-20 14:06:10.243125 I | rafthttp: closed the TCP streaming connection with peer 2bef8a836e31cf17 (stream MsgApp v2 writer)\n2019-09-20 14:06:10.243147 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (writer)\n2019-09-20 14:06:10.243513 I | rafthttp: closed the TCP streaming connection with peer 2bef8a836e31cf17 (stream Message writer)\n2019-09-20 14:06:10.243528 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (writer)\n2019-09-20 14:06:10.243603 I | rafthttp: stopped HTTP pipelining with peer 2bef8a836e31cf17\n2019-09-20 14:06:10.243700 W | rafthttp: lost the TCP streaming connection with peer 2bef8a836e31cf17 (stream MsgApp v2 reader)\n2019-09-20 14:06:10.243723 E | rafthttp: failed to read 2bef8a836e31cf17 on stream MsgApp v2 (context canceled)\n2019-09-20 14:06:10.243731 I | rafthttp: peer 2bef8a836e31cf17 became inactive (message send to peer failed)\n2019-09-20 14:06:10.243740 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (stream MsgApp v2 reader)\n2019-09-20 14:06:10.243809 W | rafthttp: lost the TCP streaming connection with peer 2bef8a836e31cf17 (stream Message reader)\n2019-09-20 14:06:10.243823 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (stream Message reader)\n2019-09-20 14:06:10.243833 I | rafthttp: stopped peer 2bef8a836e31cf17\n2019-09-20 14:06:10.248374 I | embed: rejected connection from "10.0.137.48:58844" (error "set tcp 10.0.158.65:2380: use of closed network connection", ServerName "")\n2019-09-20 14:06:10.248792 I | embed: rejected connection from "10.0.137.48:58842" (error "set tcp 10.0.158.65:2380: use of closed network connection", ServerName "ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com")\n2019-09-20 14:06:10.250569 I | embed: rejected connection from "10.0.143.6:46688" (error "set tcp 10.0.158.65:2380: use of closed network connection", ServerName "ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com")\n

We've seen five jobs fail with "connection refused" in the past 24 hours [2].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7381
[2]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=unable%20to%20retrieve%20cluster%20version%20during%20upgrade:.*connect:%20connection%20refused

Comment 2 Michal Fojtik 2019-09-30 09:25:59 UTC
In the artifacts, I can see:

Operator "kube-scheduler" is Degraded since 2019-09-20 16:59:39 +0200 CEST because MultipleConditionsMatching: "StaticPodsDegraded: nodes/ip-10-0-143-6.ec2.internal pods/openshift-kube-scheduler-ip-10-0-143-6.ec2.internal container=\"scheduler\" is not ready\nStaticPodsD│
egraded: nodes/ip-10-0-143-6.ec2.internal pods/openshift-kube-scheduler-ip-10-0-143-6.ec2.internal container=\"scheduler\" is waiting: \"CrashLoopBackOff\" - \"Back-off 5m0s restarting failed container=scheduler pod=openshift-kube-scheduler-ip-10-0-143-6.ec2.internal_ope│
nshift-kube-scheduler(8285dbe0871a92ea67b12c973f4ec61e)\"\nNodeControllerDegraded: The master node(s) \"ip-10-0-143-6.ec2.internal\" not ready"

Also:

openshift-kube-apiserver/kube-apiserver-ip-10-0-143-6.ec2.internal                   | Running | 16 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-143-6.ec2.internal | Running | 18 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-143-6.ec2.internal         | Running | 17 restarts


Also odd:

2019-09-20 15:28:59 +0200 CEST | kube-controller-manager | Normal  | unexpected addresses: 10.0.0.75

Comment 3 Michal Fojtik 2019-09-30 09:45:25 UTC
The KCM restarts are likely caused by:

E0920 15:03:53.693688       1 webhook.go:107] Failed to make webhook authenticator request: Post https://localhost:6443/apis/authentication.k8s.io/v1beta1/tokenreviews: dial tcp 127.0.0.1:6443: i/o timeout
E0920 15:03:53.693738       1 authentication.go:65] Unable to authenticate the request due to an error: [invalid bearer token, Post https://localhost:6443/apis/authentication.k8s.io/v1beta1/tokenreviews: dial tcp 127.0.0.1:6443: i/o timeout]

This means the API server is not running on that node, which suggests the node has a problem and can't schedule pods; does that mean etcd has no quorum?
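
The KCM's webhook authenticator delegates TokenReview to the kube-apiserver on localhost:6443, so a trivial probe run on that master would show the same failure (a hypothetical check for illustration, not taken from the gathered artifacts):

package main

import (
	"fmt"
	"net"
	"time"
)

// Run on the affected master: if nothing is listening on 127.0.0.1:6443,
// the KCM's delegated TokenReview POSTs can only time out or be refused,
// exactly as the log above shows.
func main() {
	conn, err := net.DialTimeout("tcp", "127.0.0.1:6443", 5*time.Second)
	if err != nil {
		fmt.Printf("kube-apiserver not reachable on localhost: %v\n", err)
		return
	}
	conn.Close()
	fmt.Println("something is listening on 127.0.0.1:6443")
}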

Comment 4 Michal Fojtik 2019-09-30 09:47:19 UTC
The kube API server on the '.6' node:

W0920 14:58:30.184701       1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.
W0920 14:58:30.986339       1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.
W0920 14:58:30.997344       1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.
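
To double-check from the client side, an etcd status probe against the same endpoint should fail in the same way; a minimal sketch with the etcd clientv3 package (hypothetical: a real probe would also need the etcd serving CA and client cert/key from the master, which are omitted here):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// The endpoint the kube-apiserver on the '.6' node is failing to dial.
	endpoint := "https://etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379"

	// NOTE: clientv3.Config also accepts a TLS *tls.Config; the certificates
	// needed for this to succeed on a healthy cluster are left out of the sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Printf("failed to create etcd client: %v\n", err)
		return
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := cli.Status(ctx, endpoint); err != nil {
		fmt.Printf("etcd endpoint unhealthy: %v\n", err)
		return
	}
	fmt.Println("etcd endpoint reachable")
}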

Comment 5 Michal Fojtik 2019-09-30 11:12:39 UTC
Another interesting find:

ip-10-0-137-48 hyperkube[1176]: I0920 14:08:28.413605    1176 status_manager.go:524] Status for pod "etcd-member-ip-10-0-137-48.ec2.internal_openshift-etcd(04c674e0-dbaa-11e9-bb58-12d30f419b3e)" updated successfully: (1, {Phase:Pending

This means that at 14:08:28 the kubelet noticed the etcd member on .48 was down/pending, and the event in the BZ description says "E ns/openshift-etcd pod/etcd-member-ip-10-0-158-65.ec2.internal node/ip-10-0-158-65.ec2.internal container=etcd-member container exited with code 255", which suggests two etcd members were down at the same time?

Comment 6 Michal Fojtik 2019-09-30 11:21:18 UTC
So:

ip-10-0-158-65 hyperkube[1172]: I0920 14:05:48.711156    1172 status_manager.go:524] Status for pod "etcd-member-ip-10-0-158-65.ec2.internal_openshift-etcd(bcea7785-dbaf-11e9-b16e-12c930de7f36)" updated successfully: (2, {Phase:Pending Conditions:[{
ip-10-0-137-48 hyperkube[1176]: I0920 14:08:28.413605    1176 status_manager.go:524] Status for pod "etcd-member-ip-10-0-137-48.ec2.internal_openshift-etcd(04c674e0-dbaa-11e9-bb58-12d30f419b3e)" updated successfully: (1, {Phase:Pending

This confirms that two etcd members were down at the same time; with a three-member etcd cluster, that is enough to lose quorum.

Comment 7 Michal Fojtik 2019-09-30 12:42:31 UTC
E0920 14:58:42.691526       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ClusterRole: Get https://localhost:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout


This suggests the API server died on that node because of the localhost bug.

*** This bug has been marked as a duplicate of bug 1754125 ***

