Bug 1868025

Summary: kube-apiserver crashlooping on a 20 node cluster with 200 pods/node
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: kube-apiserver
Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE
QA Contact: Ke Wang <kewang>
Severity: high
Priority: unspecified
Version: 4.6
CC: aos-bugs, mfojtik, prubenda, sbatsche, vareti, xxia
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-08-31 07:01:48 UTC

Description Mike Fiedler 2020-08-11 12:51:07 UTC
Description of problem:

On a 20 node cluster with a pod density of 200 pods/node, kube-apiserver started crashlooping continuously and no client access was available: oc commands failed and oc adm must-gather was not possible.

Grabbed the pod logs and journal directly from one of the crashing masters and will link it in a private comment shortly.
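(For reference, a minimal sketch of the kind of manual collection possible over SSH on a master when oc is down; the exact commands and the "1 hour ago" window are illustrative assumptions, not a record of what was actually run:)

# static pod logs for kube-apiserver live under /var/log/pods on the master
sudo tar czf kube-apiserver-logs.tar.gz /var/log/pods/openshift-kube-apiserver_*
# container state via the CRI-O CLI
sudo crictl ps -a | grep kube-apiserver
# kubelet and crio journals for the same window
sudo journalctl -u kubelet -u crio --since "1 hour ago" > journal.txt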

The errors just before the container exited are:

2020-08-10T17:07:32.117353187+00:00 stderr F E0810 17:07:32.117296     258 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
2020-08-10T17:07:32.117430172+00:00 stderr F E0810 17:07:32.117397     258 writers.go:105] apiserver was unable to write a JSON response: http: Handler timeout
2020-08-10T17:07:32.117664013+00:00 stderr F F0810 17:07:32.117605     258 controller.go:165] Unable to perform initial service nodePort check: unable to refresh the port block: the server was unable to return a response in the time allotted, but may still be processing the request (get services)


Followed by a lot of stack traces. See logs/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-166-226.us-east-2.compute.internal_f7af153282ba7bef8411b9002fdbaec5/kube-apiserver/9.log in the tar for an example.


Version-Release number of selected component (if applicable):  4.6.0-0.nightly-2020-08-07-202945


How reproducible: Unknown - will attempt again.   We did have one run earlier in 4.6 that had some failed pods, but did not see the apiserver crash.


Steps to Reproduce:
1.  3 master + 20 compute node cluster on AWS (openshift-sdn is the network plugin)
2.  Increase pod density in steps with per node counts of 25, 50, 100 and finally 200
3.  Issue oc commands to determine apiserver responsiveness (a rough sketch of this ramp follows below)
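A rough sketch of how such a ramp could be driven; the namespace, image, and replica counts below are illustrative assumptions, not the actual test tooling:

# 20 workers, stepping through 25/50/100/200 pods per node
oc create namespace density-test
oc create deployment pause --image=k8s.gcr.io/pause:3.2 -n density-test
for replicas in 500 1000 2000 4000; do
  oc scale deployment/pause -n density-test --replicas=$replicas
  # probe apiserver responsiveness between steps
  time oc get nodes > /dev/null
  time oc get pods -n density-test --no-headers | wc -l
done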

Actual results:

At 200 pods/node the issue described above occurred.

Comment 2 Stefan Schimanski 2020-08-21 15:33:45 UTC
Did it happen again?

Comment 3 Stefan Schimanski 2020-08-21 15:40:52 UTC
What are these logs? I only see one API server. Can you provide must-gather logs instead, which are more complete?
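(For reference, if the cluster becomes reachable again, something along these lines would do; the destination directories are just examples:)

# full must-gather, written to a local directory
oc adm must-gather --dest-dir=./must-gather
# narrower capture of the kube-apiserver namespace and operator
oc adm inspect ns/openshift-kube-apiserver clusteroperator/kube-apiserver --dest-dir=./inspect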

The 9.log shows that the API server never became healthy and ready. Nevertheless, it is full of nodes being rejected. Nodes should never contact an unhealthy API server.

Is this a standard cluster? Normally installed? Anything customized?

Comment 4 Mike Fiedler 2020-08-25 14:34:38 UTC
The logs are for one API server.  oc adm must-gather was not possible because the cluster was completely unresponsive.  Let me know what other info we can try to gather manually - but it is painful.

This is a standard IPI cluster on AWS: 3 m5.xlarge masters, 20 m5.xlarge workers. Nothing customized beyond applying the workload.

Comment 5 Stefan Schimanski 2020-08-28 08:18:56 UTC
We need all API server logs. The one posted is not relevant as it never became healthy and therefore never answered requests.

Comment 6 Sam Batschelet 2020-08-28 16:48:25 UTC
> 4.6.0-0.nightly-2020-08-07-202945

Mike, can you help me eliminate a few things: is this an rhcos nightly or a CI nightly?
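(A few ways to check, assuming the API becomes reachable again; these commands are illustrative:)

# which release payload the cluster is running
oc get clusterversion
# the Component Versions section of the output includes the bundled machine-os (RHCOS) version
oc adm release info
# the RHCOS build actually running on each node (OS IMAGE column)
oc get nodes -o wide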

Comment 7 Stefan Schimanski 2020-08-31 07:01:48 UTC
It was verified that this was caused by an incorrect rhcos build, hence this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1873590.

*** This bug has been marked as a duplicate of bug 1873590 ***