Bug 1707403 - API throttling results in long delays to state update
Summary: API throttling results in long delays to state update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Robert Krawitz
QA Contact: Walid A.
URL:
Whiteboard: aos-scalability-41
Depends On:
Blocks:
 
Reported: 2019-05-07 12:55 UTC by Walid A.
Modified: 2020-01-23 11:03 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:03:57 UTC

Comment 1 Seth Jennings 2019-05-20 20:21:51 UTC
This is a continued attempt to get a feature in via a bug.  The current supported limit is 250 pods/node.

However, Robert, could you try to find an explanation?  Don't spend too much time.  It might actually be the watch-based secret manager from before we disabled it.

Comment 3 Walid A. 2019-05-21 17:39:52 UTC
Created attachment 1571714 [details]
oc_get_events_oc_describe_terminating_pods

Added oc get events, oc describe pod output and oc logs for the terminating pods.

Comment 4 Robert Krawitz 2019-05-21 17:53:22 UTC
Can you also get me the kubelet logs?  I was hoping there would be enough information in oc describe pod, but there isn't.

Comment 36 Robert Krawitz 2019-07-23 18:14:22 UTC
Created attachment 1592947 [details]
Pause test

Run this:

$ oc apply -f minipause.yaml
$ for f in $(seq 0 649) ; do oc apply -f - <<< $(oc process minipause -p SERIAL=$f -p PODS=1 -p NAMESPACE=minipause-$f); done

If you want to use just one namespace:

$ for f in $(seq 0 649) ; do oc apply -f - <<< $(oc process minipause -p SERIAL=$f -p PODS=1); done
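
To see the reported delay while the loop above runs, one rough check (not part of the attached test) is to count pods whose status is not yet Running or that are stuck Terminating:

$ # rough check only; counts every pod whose STATUS column is not exactly "Running"
$ watch "oc get pods --all-namespaces --no-headers | grep -cv ' Running '"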

Comment 45 David Eads 2019-07-24 19:19:50 UTC
One kubelet-related thought occurs to me.  Kubelets have to set up watches on a per-secret basis.  Each pod comes with a secret, but in a single namespace it would all be the same secret, so you have one get/list/watch.  With multiple namespaces, you have many.  Maybe you're using up your rate limit on secrets and your client is rate-limiting the patch.  You could test this by setting pod.spec.automountServiceAccountToken to false.

If you still experience the problem, then it's worth it for us to build an un-rate-limited `oc` to push patches through.  If patches still look slow on the server side, then this can come to the apiserver team as a weird scaling problem.

Comment 46 Robert Krawitz 2019-07-24 19:54:45 UTC
Adding spec.automountServiceAccountToken: false to the pod definition allows things to work correctly.
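
For reference, a minimal pod spec along those lines would look like the sketch below (the pod name and image are illustrative, not taken from the test template):

apiVersion: v1
kind: Pod
metadata:
  name: pause-example
spec:
  # Do not mount the service account token, so the kubelet never has to
  # fetch or watch the token secret for this pod.
  automountServiceAccountToken: false
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1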

Comment 47 David Eads 2019-07-25 10:57:49 UTC
Alright, this suggests that the kubelet is running out of QPS for its clients in these cases.  I'm not completely sure what you want to do about that: you could increase the QPS, change the test harness, consider the watch-based secret/configmap refresh, or do something else, but the API server appears to be functioning correctly.
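
As an aside, the watch-based secret/configmap refresh mentioned above maps to the kubelet's configMapAndSecretChangeDetectionStrategy setting; a sketch of how it could be carried in a KubeletConfig follows (the name and label are placeholders, and whether the KubeletConfig controller passes this field through was not verified here):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: watch-based-refresh
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    # Watch each secret/configmap once instead of polling with repeated GETs
    configMapAndSecretChangeDetectionStrategy: Watch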

Comment 48 Robert Krawitz 2019-07-25 21:58:35 UTC
With the following kubelet config:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 750
    kubeAPIBurst: 200
    kubeAPIQPS: 100

I ran 2000 pods on 3 nodes without anything getting hung up (not using spec.automountServiceAccountToken: false).
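
Note that the machineConfigPoolSelector above only takes effect once the targeted MachineConfigPool carries the matching label; assuming the worker pool is the target, that would be something like:

$ oc label machineconfigpool worker custom-kubelet=large-pods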

Comment 49 Robert Krawitz 2019-07-26 00:53:52 UTC
Also worked fine at a kubeAPIBurst/kubeAPIQPS of 50/25.

Comment 50 Robert Krawitz 2019-07-26 02:31:12 UTC
At 30/15 this started happening around 1250 total pods (~400/node on average).

Comment 52 Robert Krawitz 2019-08-02 18:18:07 UTC
At 30/5 the issue started, if anything, even earlier than with the default 10/5; raising the burst alone does not appear to help.

Comment 53 Walid A. 2019-08-13 09:22:59 UTC
I re-ran the initial 500-pods-per-node scale test with nodejs and mongodb quickstart apps and verified that the bottlenecks preventing us from reaching 500 pods per node were resolved by increasing kubeAPIQPS to 20 and kubeAPIBurst to 40 with this kubelet config:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 500
    kubeAPIBurst: 40
    kubeAPIQPS: 20

I was able to deploy up to 483 pods per node (maxPods 500) before we ran out of IP addresses (hostPrefix was 23) on each of 2 worker nodes with instance type m5.24xlarge.

We may want to update our docs on increasing maxPods to mention the need to also raise the kubeAPIQPS and kubeAPIBurst values to achieve the desired pod density when working with a large number of namespaces.
It would also be helpful to call out specific log messages or metrics that indicate when we are approaching the QPS limits.
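
On that last point: client-go logs a "Throttling request took ..." message when a request waits noticeably on the client-side rate limiter, so one rough way to spot a kubelet nearing its QPS limit (the exact log wording may vary by version) is:

$ oc adm node-logs --role=worker -u kubelet | grep 'Throttling request took'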

Comment 62 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

