Bug 1821654
| Summary: | Unable to connect to the server EOF when run oc command | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> |
| Component: | kube-apiserver | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED NOTABUG | QA Contact: | Xingxing Xia <xxia> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.4 | CC: | aos-bugs, jzmeskal, lsvaty, mfojtik, pelauter, rphillips, slaznick, smaitra, sttts, wsun, xtian |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-14 15:26:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 1
Peter Lauterbach
2020-04-07 11:45:07 UTC
Debugging the cluster is a bit beyond me.
I checked a number of things, from
- the KAS logs,
- through the OAS logs (I can see the login request there) and the etcd logs (these are filled with messages like `embed: rejected connection from "10.35.71.100:37682" (error "EOF", ServerName "")`, so that is a slight hint of something going wrong),
- through the SDN logs (they contain `I0407 08:27:24.882747 2901 roundrobin.go:267] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.35.71.100:6443 10.35.71.103:6443 10.35.71.141:6443]` at about the time of the login attempt above),
- to the KCM logs (I could see
```
E0407 02:33:13.391517 1 leaderelection.go:331] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
I0407 08:54:40.439664 1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
I0407 08:54:40.439755 1 event.go:281] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"kube-controller-manager", UID:"99c190bc-bb02-440b-b9ab-61262aabb4b2", APIVersion:"v1", ResourceVersion:"5888102", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' leopard-hw72r-master-0_0ccdb190-0e35-4e9d-889f-d1e00461dce9 became leader
```
--> note the roughly six-hour unavailability due to leader election; see the spot-check sketch after this comment)
but I honestly can't point my finger at anything specific. If anyone else would like to have a go, that would be dandy.
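A minimal sketch of the kind of spot checks behind the log reading above, assuming cluster-admin access on an OpenShift 4.4 cluster (the ConfigMap lock and its annotation are what the KCM log excerpt refers to):
```
# Who currently holds the kube-controller-manager leader-election lock?
# (Per the log above, the lock is the kube-controller-manager ConfigMap in
# kube-system, recorded in the control-plane.alpha.kubernetes.io/leader
# annotation.)
oc -n kube-system get configmap kube-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# Which apiserver endpoints is the SDN load-balancing to? These should match
# the addresses in the roundrobin.go message above.
oc -n default get endpoints kubernetes -o wide
```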
Using the --insecure-skip-tls-verify=true option is probably not correct. "Unable to connect" errors are usually caused by a TLS certificate problem or some sort of network error; /home/jenkins/workspace/Runner-v3/workdir/ocp4_pm1.kubeconfig may need the load-balancer CA certificate (see the check sketched after the log excerpts below).

It looks like there are serious (non-SDN) networking issues, both in the apiserver logs:
E0408 11:14:33.423946 1 watch.go:256] unable to encode watch object *v1.WatchEvent: write tcp 10.35.70.1:6443->10.35.71.141:60356: write: connection reset by peer (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc0162b1a60), encoder:(*versioning.codec)(0xc03a77d4a0), buf:(*bytes.Buffer)(0xc044d36ab0)})
*and* etcd:
2020-04-08 11:18:59.249080 I | embed: rejected connection from "10.35.71.103:33748" (error "EOF", ServerName "")
2020-04-08 11:19:00.979584 I | embed: rejected connection from "10.130.0.69:60290" (error "EOF", ServerName "")
2020-04-08 11:19:09.242635 I | embed: rejected connection from "10.35.71.103:33934" (error "EOF", ServerName "")
2020-04-08 11:19:10.977984 I | embed: rejected connection from "10.130.0.69:60870" (error "EOF", ServerName "")
2020-04-08 11:19:11.138429 I | embed: rejected connection from "10.35.71.141:48934" (error "EOF", ServerName "")
2020-04-08 11:19:14.642312 I | embed: rejected connection from "10.130.0.10:56396" (error "EOF", ServerName "")
2020-04-08 11:19:19.252096 I | embed: rejected connection from "10.35.71.103:34144" (error "EOF", ServerName "")
we see that connections are closed early. The network interfaces don't show dropped packets though.
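A minimal sketch of how the CA-mismatch theory could be checked, using the kubeconfig path mentioned above and with api.<domain> as a placeholder for the real API hostname:
```
# Extract the CA bundle that the CI kubeconfig trusts (path taken from the
# comment above; api.<domain> is a placeholder).
oc config view --raw \
  --kubeconfig=/home/jenkins/workspace/Runner-v3/workdir/ocp4_pm1.kubeconfig \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > kubeconfig-ca.crt

# Verify the certificate chain served on the load-balanced API endpoint
# against that CA; a non-zero "Verify return code" would point at a cert
# problem rather than a network one.
openssl s_client -connect api.<domain>:6443 -CAfile kubeconfig-ca.crt </dev/null 2>/dev/null \
  | grep -E 'subject=|issuer=|Verify return code'
```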
The kube-apiserver suffers when etcd is not stable:
Trace[2122564133]: [1.863990711s] [1.863965979s] About to write a response
I0408 11:28:55.178204 1 healthz.go:191] [+]ping ok
[+]log ok
[-]etcd failed: reason withheld
Surprisingly, etcd is otherwise silent, with only a couple of log lines reporting that it is slow:
2020-04-08 11:23:06.234114 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/kubecontrollermanagers/cluster\" " with result "range_response_count:1 size:3179" took too long (119.607473ms) to execute
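The etcd health signal that the apiserver reports can also be read directly from its health endpoints; a quick sketch, assuming a working kubeconfig (both calls go through the apiserver itself, so they only work while it is reachable at all):
```
# The [-]etcd line in the healthz output above corresponds to these checks.
oc get --raw '/healthz/etcd'
oc get --raw '/healthz?verbose' | grep -i etcd
```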
Though Stefan already updated the bug, I am updating it from my side as well. Per https://coreos.slack.com/archives/CH76YSYSC/p1586345000184200?thread_ts=1586323570.104000&cid=CH76YSYSC, and after some more experimenting, here is what I found.

I used this script: `while true; do oc login -u xxia1 -p <password> https://api.<domain>:6443; echo "tried $i times login"; let i+=1; done |& tee -a log | grep -i -e "Login successful" -e EOF -e "logged" -e "tried.*times"`.

Running it on my local PC in the office, which (like the automation jobs) has to go through the proxy mentioned above to reach the cluster, it hit EOF in 6 out of 200 attempts; now that the high-memory scale-up pods discussed in the Slack thread have been removed, EOF occurs much less often than before. Running the same loop on the RHV engine mentioned above, which is on the same network as the cluster and needs no proxy, the commands run much faster and hit EOF in 0 out of 1000 attempts. So it looks like the high memory usage, the client-to-cluster network path, and possibly general performance had a big impact on this bug, though that is just a guess. I removed "rhv" from the summary since this does not look like an RHV-specific issue, and set the severity to low since the tests are neither terminated nor otherwise impacted when no proxy is used.

We are no longer seeing this issue since we removed the proxy from our test environment and now access the cluster directly. This issue is NOT a blocker for GA of OCP 4.4 on RHV IPI.

We have a new automation test on an OCP-on-RHV environment without a proxy and no longer hit the same error, so I am closing the bug.
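For readability, the reproduction loop from the comment above, reformatted into a script; the user name, password, and API URL placeholders are kept exactly as in the original one-liner:
```
#!/bin/bash
# Login-retry loop used to measure how often `oc login` hits EOF.
# <password> and <domain> are placeholders, as in the original comment.
i=1
while true; do
  oc login -u xxia1 -p '<password>' 'https://api.<domain>:6443'
  echo "tried $i times login"
  let i+=1
done |& tee -a log | grep -i -e "Login successful" -e EOF -e "logged" -e "tried.*times"
```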