This is a blocking defect for OCP 4.4 on RHV IPI; it needs to be triaged immediately.
Debugging the cluster is a bit beyond me. I checked a bunch of things, but I honestly can't point my finger at anything specific:

- KAS logs
- OAS logs (I can see the login request there)
- etcd logs (these are filled with messages like `embed: rejected connection from "10.35.71.100:37682" (error "EOF", ServerName "")`, so that's a slight hint of something going wrong)
- SDN logs (they contain `I0407 08:27:24.882747 2901 roundrobin.go:267] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [10.35.71.100:6443 10.35.71.103:6443 10.35.71.141:6443]` at about the time of the above login attempt)
- KCM logs, where I could see the excerpt below --> notice the ~6 hrs of unavailability due to leader election

```
E0407 02:33:13.391517 1 leaderelection.go:331] error retrieving resource lock kube-system/kube-controller-manager: configmaps "kube-controller-manager" is forbidden: User "system:kube-controller-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
I0407 08:54:40.439664 1 leaderelection.go:252] successfully acquired lease kube-system/kube-controller-manager
I0407 08:54:40.439755 1 event.go:281] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"kube-controller-manager", UID:"99c190bc-bb02-440b-b9ab-61262aabb4b2", APIVersion:"v1", ResourceVersion:"5888102", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' leopard-hw72r-master-0_0ccdb190-0e35-4e9d-889f-d1e00461dce9 became leader
```

If anyone else would like to have a go, that would be dandy.
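For anyone who does want to have a go, here's a rough sketch of how the logs above can be pulled in one pass. The namespaces are the standard OCP 4.x ones, and the `--since` window is an assumption; adjust it to the time of the failed login attempt.

```
# Rough log-collection sketch; the 6h window is an assumption,
# adjust it to cover the time of the failed login attempt.
for ns in openshift-kube-apiserver openshift-apiserver openshift-etcd \
          openshift-sdn openshift-kube-controller-manager; do
  for pod in $(oc get pods -n "$ns" -o name); do
    oc logs -n "$ns" "$pod" --all-containers --since=6h \
      > "$(echo "$ns-$pod" | tr / _).log"
  done
done

# Alternatively, `oc adm must-gather` collects all of the above (and more) in one archive.
```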
Using this option (--insecure-skip-tls-verify=true) is probably not correct... "Unable to connect" errors are usually a TLS certificate problem or some sort of network error.
/home/jenkins/workspace/Runner-v3/workdir/ocp4_pm1.kubeconfig may need the load balancer's CA certificate.
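Rather than using --insecure-skip-tls-verify, the CA could be embedded into that kubeconfig. A minimal sketch, assuming the installer-generated auth/kubeconfig is still available; `<cluster-name>`, `<user>` and `<domain>` are placeholders:

```
# Pull the API CA bundle out of the installer-generated kubeconfig
# (auth/kubeconfig is an assumption about where that file lives).
oc config view --kubeconfig=auth/kubeconfig --raw \
  -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > lb-ca.crt

# Embed it into the kubeconfig used by the Jenkins job (<cluster-name> is a placeholder).
oc --kubeconfig=/home/jenkins/workspace/Runner-v3/workdir/ocp4_pm1.kubeconfig \
  config set-cluster <cluster-name> --certificate-authority=lb-ca.crt --embed-certs=true

# Or pass the CA directly on login instead of skipping verification.
oc login -u <user> https://api.<domain>:6443 --certificate-authority=lb-ca.crt
```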
It looks like there are serious (non-SDN) networking issues. Both in the apiserver logs:

```
E0408 11:14:33.423946 1 watch.go:256] unable to encode watch object *v1.WatchEvent: write tcp 10.35.70.1:6443->10.35.71.141:60356: write: connection reset by peer (&streaming.encoder{writer:(*framer.lengthDelimitedFrameWriter)(0xc0162b1a60), encoder:(*versioning.codec)(0xc03a77d4a0), buf:(*bytes.Buffer)(0xc044d36ab0)})
```

*and* in etcd:

```
2020-04-08 11:18:59.249080 I | embed: rejected connection from "10.35.71.103:33748" (error "EOF", ServerName "")
2020-04-08 11:19:00.979584 I | embed: rejected connection from "10.130.0.69:60290" (error "EOF", ServerName "")
2020-04-08 11:19:09.242635 I | embed: rejected connection from "10.35.71.103:33934" (error "EOF", ServerName "")
2020-04-08 11:19:10.977984 I | embed: rejected connection from "10.130.0.69:60870" (error "EOF", ServerName "")
2020-04-08 11:19:11.138429 I | embed: rejected connection from "10.35.71.141:48934" (error "EOF", ServerName "")
2020-04-08 11:19:14.642312 I | embed: rejected connection from "10.130.0.10:56396" (error "EOF", ServerName "")
2020-04-08 11:19:19.252096 I | embed: rejected connection from "10.35.71.103:34144" (error "EOF", ServerName "")
```

we see that connections are closed early. The network interfaces don't show dropped packets, though.

Kube-apiserver suffers from etcd not being stable:

```
Trace[2122564133]: [1.863990711s] [1.863965979s] About to write a response
I0408 11:28:55.178204 1 healthz.go:191] [+]ping ok [+]log ok [-]etcd failed: reason withheld
```

Surprisingly, etcd is otherwise silent, with only a couple of log lines reporting that it is slow:

```
2020-04-08 11:23:06.234114 W | etcdserver: read-only range request "key:\"/kubernetes.io/operator.openshift.io/kubecontrollermanagers/cluster\" " with result "range_response_count:1 size:3179" took too long (119.607473ms) to execute
```
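For reference, a hedged sketch of the checks behind the statements above; node and pod names are placeholders, and the `etcdctl` container name is an assumption about the 4.4 etcd static pod layout:

```
# Interface drop/error counters on a master (node name is a placeholder).
oc debug node/<master-node> -- chroot /host ip -s link

# The apiserver's own health checks, same source as the healthz.go line above.
oc get --raw '/healthz?verbose'

# Per-member health and latency from inside an etcd pod
# (the etcdctl container name is an assumption).
oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint health
oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table
```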
Though Stefan already updated the bug, I'm adding an update from my side as well. Per https://coreos.slack.com/archives/CH76YSYSC/p1586345000184200?thread_ts=1586323570.104000&cid=CH76YSYSC and some further testing, here is what I found. I used this script:

```
i=1
while true; do
  oc login -u xxia1 -p <password> https://api.<domain>:6443
  echo "tried $i times login"
  let i+=1
done |& tee -a log | grep -i -e "Login successful" -e EOF -e "logged" -e "tried.*times"
```

Running it from my local PC in the office, which (like the automated jobs) has to go through the proxy mentioned above to reach the cluster, I hit EOF in 6 out of 200 attempts; given that the high-memory scale-up pods mentioned in the Slack thread have been removed, EOF now occurs much less often than before they were removed. But running it from the RHV engine mentioned above, which is on the same network as the cluster and does not need the proxy, the commands run much faster and I hit EOF in 0 out of 1000 attempts. So it looks like the high memory usage, the client-to-cluster network path, and possibly performance had a big impact on this bug, though that's just a guess.
Removing rhv from the summary since this does not look like an RHV-specific issue. Setting severity to low since the test does not appear to be terminated or otherwise impacted when no proxy is specified.
We are no longer seeing this issue since we removed the proxy from our test environment and now access the cluster directly. This issue is NOT a blocker for GA of OCP 4.4 on RHV IPI.
We ran a new automation test on the OCP-on-RHV environment without a proxy and no longer hit the same error, so I'm closing the bug.