Description of problem:
On a 20 node cluster with a pod density of 200 pods/node, kube-apiserver started crashlooping continuously with no client access available. oc commands were unavailable and oc adm must-gather was not possible. Grabbed the pod logs and journal directly from one of the crashing masters and will link them in a private comment shortly.

The errors just before the container exited are:

2020-08-10T17:07:32.117353187+00:00 stderr F E0810 17:07:32.117296 258 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
2020-08-10T17:07:32.117430172+00:00 stderr F E0810 17:07:32.117397 258 writers.go:105] apiserver was unable to write a JSON response: http: Handler timeout
2020-08-10T17:07:32.117664013+00:00 stderr F F0810 17:07:32.117605 258 controller.go:165] Unable to perform initial service nodePort check: unable to refresh the port block: the server was unable to return a response in the time allotted, but may still be processing the request (get services)

Followed by a lot of stack traces. See logs/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-166-226.us-east-2.compute.internal_f7af153282ba7bef8411b9002fdbaec5/kube-apiserver/9.log in the tar for an example.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-07-202945

How reproducible:
Unknown - will attempt again. We did have one run earlier in 4.6 that had some failed pods, but did not see the apiserver crash.

Steps to Reproduce:
1. 3 master + 20 compute node cluster on AWS (openshift-sdn is the network plugin)
2. Increase pod density in steps with per node counts of 25, 50, 100 and finally 200
3. Issue oc commands to determine apiserver responsiveness (see the sketch at the end of this description)

Actual results:
At 200 pods/node the issue described above occurred.
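Additional info:
A rough sketch of how the density ramp (step 2) and the responsiveness checks (step 3) can be driven. The exact workload tooling used for this run is not captured here; the namespace, deployment name and image below are only illustrative:

  # create a throwaway project and a pause-style workload (illustrative names/image)
  oc new-project density-test
  oc create deployment density-pause --image=k8s.gcr.io/pause:3.2
  # ramp the total pod count so each of the 20 workers ends up at ~25, 50, 100, then 200 pods
  for count in 500 1000 2000 4000; do
    oc scale deployment/density-pause -n density-test --replicas=$count
    # probe apiserver responsiveness between steps
    time oc get nodes
    oc get --raw /readyz
  done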
Did it happen again?
What are these logs? I only see one API server. Can you provide must-gather logs instead, which are more complete? The 9.log shows that the API server never became healthy and ready. Nevertheless, it is full of nodes being rejected. Nodes should never contact an unhealthy API server. Is this a standard cluster? Normally installed? Anything customized?
The logs are for one API server. oc adm must-gather was not possible because the cluster was completely unresponsive. Let me know what other info we can try to gather manually - but it is painful. This is a standard IPI cluster on AWS: 3 m5.xlarge masters, 20 m5.xlarge workers. Nothing customized beyond applying the workload.
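For completeness, this is roughly the kind of manual collection that is possible when the API is unreachable, assuming SSH access to a crashing master as the core user (node IPs, container IDs and pod directory names below are placeholders):

  # from a bastion or the installer host
  ssh core@<master-ip>

  # kube-apiserver container logs via cri-o
  sudo crictl ps -a | grep kube-apiserver
  sudo crictl logs <container-id>

  # rotated pod logs kept on disk (likely the source of the 9.log in the attached tar)
  sudo ls /var/log/pods/openshift-kube-apiserver_kube-apiserver-*/kube-apiserver/

  # node journal for kubelet and cri-o
  sudo journalctl -u kubelet -u crio > /tmp/journal.txt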
We need all of the API server logs. The one posted is not relevant, as that API server never became healthy and therefore never answered requests.
> 4.6.0-0.nightly-2020-08-07-202945

Mike, can you help me eliminate a few things: is this an RHCOS nightly or a CI nightly?
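In case it helps narrow that down, a couple of ways to confirm which RHCOS build is in play (the release pullspec and node IP are placeholders; the release query needs the nightly payload pullspec, the node queries need SSH access):

  # machine-os version baked into the release payload
  oc adm release info <nightly-release-pullspec> | grep machine-os

  # or directly on a node
  ssh core@<master-ip> cat /etc/os-release
  ssh core@<master-ip> rpm-ostree status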
It was verified that this was caused by a wrong rhcos build, hence this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1873590.

*** This bug has been marked as a duplicate of bug 1873590 ***