Bug 1821654
| Summary: | Unable to connect to the server EOF when run oc command | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang> |
| Component: | kube-apiserver | Assignee: | Standa Laznicka <slaznick> |
| Status: | CLOSED NOTABUG | QA Contact: | Xingxing Xia <xxia> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.4 | CC: | aos-bugs, jzmeskal, lsvaty, mfojtik, pelauter, rphillips, slaznick, smaitra, sttts, wsun, xtian |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-14 15:26:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 1
Peter Lauterbach
2020-04-07 11:45:07 UTC
Debugging the cluster is a bit beyond me.
I checked a number of things, from
- the KAS logs,
- through the OAS logs (I can see the login request there) and the etcd logs (these are filled with messages like `embed: rejected connection from "10.35.71.100:37682" (error "EOF", ServerName "")`, so that is a slight hint of something going wrong),
- through the SDN logs (they contain `I0407 08:27:24.882747 2901 roundrobin.go:267] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.35.71.100:6443 10.35.71.103:6443 10.35.71.141:6443]` at about the time of the login attempt above),
- to the KCM logs (I could see
```
E0407 02:33:13.391517 1 leaderelection.go:331] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
I0407 08:54:40.439664 1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
I0407 08:54:40.439755 1 event.go:281] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"kube-controller-manager", UID:"99c190bc-bb02-440b-b9ab-61262aabb4b2", APIVersion:"v1", ResourceVersion:"5888102", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' leopard-hw72r-master-0_0ccdb190-0e35-4e9d-889f-d1e00461dce9 became leader
```
--> note the roughly six-hour unavailability due to leader election; see the spot-check sketch after this comment)
but I honestly can't point my finger at anything specific. If anyone else would like to have a go, that would be dandy.
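A minimal sketch of the kind of spot checks behind the log reading above, assuming cluster-admin access on an OpenShift 4.4 cluster (the ConfigMap lock and its annotation are what the KCM log excerpt refers to):
```
# Who currently holds the kube-controller-manager leader-election lock?
# (Per the log above, the lock is the kube-controller-manager ConfigMap in
# kube-system, recorded in the control-plane.alpha.kubernetes.io/leader
# annotation.)
oc -n kube-system get configmap kube-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# Which apiserver endpoints is the SDN load-balancing to? These should match
# the addresses in the roundrobin.go message above.
oc -n default get endpoints kubernetes -o wide
```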
Using the --insecure-skip-tls-verify=true option is probably not correct. "Unable to connect" errors are usually caused by a TLS certificate problem or some sort of network error; /home/jenkins/workspace/Runner-v3/workdir/ocp4_pm1.kubeconfig may need the load-balancer CA certificate (see the check sketched after the log excerpts below).

It looks like there are serious (non-SDN) networking issues, both in the apiserver logs:
E0408 11:14:33.423946 1 watch.go:256] unable to encode watch object *v1.WatchEvent: write tcp 10.35.70.1:6443->10.35.71.141:60356: write: connection reset by peer (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc0162b1a60), encoder:(*versioning.codec)(0xc03a77d4a0), buf:(*bytes.Buffer)(0xc044d36ab0)})
*and* etcd:
2020-04-08 11:18:59.249080 I | embed: rejected connection from "10.35.71.103:33748" (error "EOF", ServerName "")
2020-04-08 11:19:00.979584 I | embed: rejected connection from "10.130.0.69:60290" (error "EOF", ServerName "")
2020-04-08 11:19:09.242635 I | embed: rejected connection from "10.35.71.103:33934" (error "EOF", ServerName "")
2020-04-08 11:19:10.977984 I | embed: rejected connection from "10.130.0.69:60870" (error "EOF", ServerName "")
2020-04-08 11:19:11.138429 I | embed: rejected connection from "10.35.71.141:48934" (error "EOF", ServerName "")
2020-04-08 11:19:14.642312 I | embed: rejected connection from "10.130.0.10:56396" (error "EOF", ServerName "")
2020-04-08 11:19:19.252096 I | embed: rejected connection from "10.35.71.103:34144" (error "EOF", ServerName "")
we see that connections are closed early. The network interfaces don't show dropped packets though.
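A minimal sketch of how the CA-mismatch theory could be checked, using the kubeconfig path mentioned above and with api.<domain> as a placeholder for the real API hostname:
```
# Extract the CA bundle that the CI kubeconfig trusts (path taken from the
# comment above; api.<domain> is a placeholder).
oc config view --raw \
  --kubeconfig=/home/jenkins/workspace/Runner-v3/workdir/ocp4_pm1.kubeconfig \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > kubeconfig-ca.crt

# Verify the certificate chain served on the load-balanced API endpoint
# against that CA; a non-zero "Verify return code" would point at a cert
# problem rather than a network one.
openssl s_client -connect api.<domain>:6443 -CAfile kubeconfig-ca.crt </dev/null 2>/dev/null \
  | grep -E 'subject=|issuer=|Verify return code'
```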
The kube-apiserver suffers when etcd is not stable:
Trace[2122564133]: [1.863990711s] [1.863965979s] About to write a response
I0408 11:28:55.178204 1 healthz.go:191] [+]ping ok
[+]log ok
[-]etcd failed: reason withheld
Surprisingly, etcd is otherwise silent, with only a couple of log lines reporting that it is slow:
2020-04-08 11:23:06.234114 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/kubecontrollermanagers/cluster\" " with result "range_response_count:1 size:3179" took too long (119.607473ms) to execute
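The etcd health signal that the apiserver reports can also be read directly from its health endpoints; a quick sketch, assuming a working kubeconfig (both calls go through the apiserver itself, so they only work while it is reachable at all):
```
# The [-]etcd line in the healthz output above corresponds to these checks.
oc get --raw '/healthz/etcd'
oc get --raw '/healthz?verbose' | grep -i etcd
```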
Though Stefan already updated the bug, I am updating it from my side as well. Per https://coreos.slack.com/archives/CH76YSYSC/p1586345000184200?thread_ts=1586323570.104000&cid=CH76YSYSC, and after some more experimenting, here is what I found.

I used this script: `while true; do oc login -u xxia1 -p <password> https://api.<domain>:6443; echo "tried $i times login"; let i+=1; done |& tee -a log | grep -i -e "Login successful" -e EOF -e "logged" -e "tried.*times"`.

Running it on my local PC in the office, which (like the automation jobs) has to go through the proxy mentioned above to reach the cluster, it hit EOF in 6 out of 200 attempts; now that the high-memory scale-up pods discussed in the Slack thread have been removed, EOF occurs much less often than before. Running the same loop on the RHV engine mentioned above, which is on the same network as the cluster and needs no proxy, the commands run much faster and hit EOF in 0 out of 1000 attempts. So it looks like the high memory usage, the client-to-cluster network path, and possibly general performance had a big impact on this bug, though that is just a guess. I removed "rhv" from the summary since this does not look like an RHV-specific issue, and set the severity to low since the tests are neither terminated nor otherwise impacted when no proxy is used.

We are no longer seeing this issue since we removed the proxy from our test environment and now access the cluster directly. This issue is NOT a blocker for GA of OCP 4.4 on RHV IPI.

We have a new automation test on an OCP-on-RHV environment without a proxy and no longer hit the same error, so I am closing the bug.
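For readability, the reproduction loop from the comment above, reformatted into a script; the user name, password, and API URL placeholders are kept exactly as in the original one-liner:
```
#!/bin/bash
# Login-retry loop used to measure how often `oc login` hits EOF.
# <password> and <domain> are placeholders, as in the original comment.
i=1
while true; do
  oc login -u xxia1 -p '<password>' 'https://api.<domain>:6443'
  echo "tried $i times login"
  let i+=1
done |& tee -a log | grep -i -e "Login successful" -e EOF -e "logged" -e "tried.*times"
```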