Possibly a dup of bug 1754125, although this one is "connection refused" and not "the server could not find the requested resource". Example in a failed upgrade from 4.2.0-0.ci-2019-09-20-104546 to 4.2.0-0.ci-2019-09-20-130212 [1]:

Sep 20 14:12:05.822: INFO: cluster upgrade is Progressing: Working towards 4.2.0-0.ci-2019-09-20-130212: 0% complete
Sep 20 14:12:15.836: INFO: unable to retrieve cluster version during upgrade: Get https://api.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions/version: dial tcp 3.230.105.193:6443: connect: connection refused
Sep 20 14:12:25.836: INFO: unable to retrieve cluster version during upgrade: Get https://api.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions/version: dial tcp 3.230.105.193:6443: connect: connection refused
Sep 20 14:12:35.872: INFO: cluster upgrade is Progressing: Working towards 4.2.0-0.ci-2019-09-20-130212: 1% complete

Lots of concerning error events in that run, including:

Sep 20 14:08:29.877 E ns/openshift-etcd pod/etcd-member-ip-10-0-158-65.ec2.internal node/ip-10-0-158-65.ec2.internal container=etcd-member container exited with code 255 (Error): 06:10.242780 I | rafthttp: stopped peer c63674ca7ba88d2d\n2019-09-20 14:06:10.242788 I | rafthttp: stopping peer 2bef8a836e31cf17...\n2019-09-20 14:06:10.243125 I | rafthttp: closed the TCP streaming connection with peer 2bef8a836e31cf17 (stream MsgApp v2 writer)\n2019-09-20 14:06:10.243147 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (writer)\n2019-09-20 14:06:10.243513 I | rafthttp: closed the TCP streaming connection with peer 2bef8a836e31cf17 (stream Message writer)\n2019-09-20 14:06:10.243528 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (writer)\n2019-09-20 14:06:10.243603 I | rafthttp: stopped HTTP pipelining with peer 2bef8a836e31cf17\n2019-09-20 14:06:10.243700 W | rafthttp: lost the TCP streaming connection with peer 2bef8a836e31cf17 (stream MsgApp v2 reader)\n2019-09-20 14:06:10.243723 E | rafthttp: failed to read 2bef8a836e31cf17 on stream MsgApp v2 (context canceled)\n2019-09-20 14:06:10.243731 I | rafthttp: peer 2bef8a836e31cf17 became inactive (message send to peer failed)\n2019-09-20 14:06:10.243740 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (stream MsgApp v2 reader)\n2019-09-20 14:06:10.243809 W | rafthttp: lost the TCP streaming connection with peer 2bef8a836e31cf17 (stream Message reader)\n2019-09-20 14:06:10.243823 I | rafthttp: stopped streaming with peer 2bef8a836e31cf17 (stream Message reader)\n2019-09-20 14:06:10.243833 I | rafthttp: stopped peer 2bef8a836e31cf17\n2019-09-20 14:06:10.248374 I | embed: rejected connection from "10.0.137.48:58844" (error "set tcp 10.0.158.65:2380: use of closed network connection", ServerName "")\n2019-09-20 14:06:10.248792 I | embed: rejected connection from "10.0.137.48:58842" (error "set tcp 10.0.158.65:2380: use of closed network connection", ServerName "ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com")\n2019-09-20 14:06:10.250569 I | embed: rejected connection from "10.0.143.6:46688" (error "set tcp 10.0.158.65:2380: use of closed network connection", ServerName "ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com")\n

We've seen five jobs fail with "connection refused" in the past 24 hours [2].
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7381
[2]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=unable%20to%20retrieve%20cluster%20version%20during%20upgrade:.*connect:%20connection%20refused
In the artifacts, I can see:

Operator "kube-scheduler" is Degraded since 2019-09-20 16:59:39 +0200 CEST because MultipleConditionsMatching: "StaticPodsDegraded: nodes/ip-10-0-143-6.ec2.internal pods/openshift-kube-scheduler-ip-10-0-143-6.ec2.internal container=\"scheduler\" is not ready\nStaticPodsDegraded: nodes/ip-10-0-143-6.ec2.internal pods/openshift-kube-scheduler-ip-10-0-143-6.ec2.internal container=\"scheduler\" is waiting: \"CrashLoopBackOff\" - \"Back-off 5m0s restarting failed container=scheduler pod=openshift-kube-scheduler-ip-10-0-143-6.ec2.internal_openshift-kube-scheduler(8285dbe0871a92ea67b12c973f4ec61e)\"\nNodeControllerDegraded: The master node(s) \"ip-10-0-143-6.ec2.internal\" not ready"

Also:

openshift-kube-apiserver/kube-apiserver-ip-10-0-143-6.ec2.internal | Running | 16 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-143-6.ec2.internal | Running | 18 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-143-6.ec2.internal | Running | 17 restarts

Another oddity:

2019-09-20 15:28:59 +0200 CEST | kube-controller-manager | Normal | unexpected addresses: 10.0.0.75
The KCM restarts are likely caused by:

E0920 15:03:53.693688 1 webhook.go:107] Failed to make webhook authenticator request: Post https://localhost:6443/apis/authentication.k8s.io/v1beta1/tokenreviews: dial tcp 127.0.0.1:6443: i/o timeout
E0920 15:03:53.693738 1 authentication.go:65] Unable to authenticate the request due to an error: [invalid bearer token, Post https://localhost:6443/apis/authentication.k8s.io/v1beta1/tokenreviews: dial tcp 127.0.0.1:6443: i/o timeout]

This means the API server is not running on that node, which suggests the node has a problem and can't run its pods. Does that mean etcd has lost quorum?
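For reference, a minimal sketch (not from the job artifacts, run on the affected master) of the probe the KCM's webhook token authenticator is effectively making against the node-local apiserver; the address and timeout are assumptions, but the error strings distinguish the two failure modes we see in this bug:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Dial the node-local kube-apiserver port the way the KCM reaches it
	// for TokenReview calls (localhost:6443).
	conn, err := net.DialTimeout("tcp", "localhost:6443", 5*time.Second)
	if err != nil {
		// "connect: connection refused" -> nothing listening on 6443;
		// "i/o timeout" -> the listener or the network path is hung.
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("apiserver port is accepting TCP connections")
}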
The kube API server on the '.6' node:

W0920 14:58:30.184701 1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.
W0920 14:58:30.986339 1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.
W0920 14:58:30.997344 1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379: grpc: the connection is closing; please retry.
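A rough way to check, from a master, whether a given etcd member answers on 2379 is an endpoint Status call; this is only a sketch under assumptions: the endpoint name is taken from this job, the import path depends on the etcd client version, and the real cluster requires etcd client TLS certs, which are omitted here:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

func main() {
	endpoints := []string{
		"https://etcd-0.ci-op-im3z8fv2-77109.origin-ci-int-aws.dev.rhcloud.com:2379",
	}
	// TLS config for the etcd client certs would go here in a real cluster.
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		fmt.Println("client setup failed:", err)
		return
	}
	defer cli.Close()
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		resp, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			// Mirrors the apiserver's "Failed to dial ...:2379" symptom.
			fmt.Printf("%s: unreachable: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: revision %d, leader %x\n", ep, resp.Header.Revision, resp.Leader)
	}
}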
Another interesting find:

ip-10-0-137-48 hyperkube[1176]: I0920 14:08:28.413605 1176 status_manager.go:524] Status for pod "etcd-member-ip-10-0-137-48.ec2.internal_openshift-etcd(04c674e0-dbaa-11e9-bb58-12d30f419b3e)" updated successfully: (1, {Phase:Pending

This means that at 14:08:28 the kubelet noticed the etcd member on .48 was down/pending, and the event in the BZ description says: "E ns/openshift-etcd pod/etcd-member-ip-10-0-158-65.ec2.internal node/ip-10-0-158-65.ec2.internal container=etcd-member container exited with code 255", which suggests 2 etcd members were down at the same time?
So:

ip-10-0-158-65 hyperkube[1172]: I0920 14:05:48.711156 1172 status_manager.go:524] Status for pod "etcd-member-ip-10-0-158-65.ec2.internal_openshift-etcd(bcea7785-dbaf-11e9-b16e-12c930de7f36)" updated successfully: (2, {Phase:Pending Conditions:[{
ip-10-0-137-48 hyperkube[1176]: I0920 14:08:28.413605 1176 status_manager.go:524] Status for pod "etcd-member-ip-10-0-137-48.ec2.internal_openshift-etcd(04c674e0-dbaa-11e9-bb58-12d30f419b3e)" updated successfully: (1, {Phase:Pending

This confirms that 2 etcd members were down at the same time.
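Spelling out the quorum arithmetic behind the question in the earlier comment (sketch only, assuming the usual three-master / three-etcd-member control plane in this job):

package main

import "fmt"

// quorum returns how many healthy members an n-member etcd cluster needs
// to commit writes: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

func main() {
	members := 3 // three masters, one etcd member each
	down := 2    // .65 (14:05:48) and .48 (14:08:28) pending in the same window
	fmt.Println("quorum needed:", quorum(members))                // 2
	fmt.Println("healthy members:", members-down)                 // 1
	fmt.Println("quorum held:", members-down >= quorum(members))  // false
}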
E0920 14:58:42.691526 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.ClusterRole: Get https://localhost:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout

This suggests the API server died on that node because of the localhost issue.

*** This bug has been marked as a duplicate of bug 1754125 ***