Description of problem:
The pod startup latency (P99) has increased from ~4 seconds to ~16+ seconds on a 25-node cluster.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-13-053517 and later

How reproducible:
Always

Steps to Reproduce:
1. Run a node-density-light test, i.e. 250 sleep pods per worker node (a rough sketch of this workload is shown below).
2. Observe the pod startup latency.

Actual results:
~16 seconds

Expected results:
~4 seconds

Additional info:
The last time we observed a ~4 second latency was with the 4.9.0-0.nightly-2021-08-07-175228 build.
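For reference, a minimal way to approximate the node-density-light workload by hand (the actual runs are driven by kube-burner, see the dashboards below); the namespace name, pod names, pod count and image here are illustrative assumptions, not the exact test configuration:

# Create one namespace and N sleep pods (N ~= 250 * number of worker nodes; 6250 for 25 workers).
# Image and names are placeholders for illustration only.
oc create namespace node-density-light
for i in $(seq 1 6250); do
  oc run "sleep-${i}" -n node-density-light \
    --image=registry.k8s.io/pause:3.5 \
    --restart=Never
done
# Pod startup latency is then derived from the PodScheduled/Initialized/Ready
# condition timestamps in each pod's status.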
More data showing the regression...

+-----------------------+--------------------------------------+-----------------------------------+
| metadata              | uuid                                 | value                             |
+-----------------------+--------------------------------------+-----------------------------------+
| value.cluster_version | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 4.9.0-0.nightly-2021-09-14-200602 |
| value.cluster_version | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 4.9.0-0.nightly-2021-09-07-201519 |
| value.cluster_version | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 4.9.0-0.nightly-2021-08-04-081505 |
| value.cluster_version | 060af6bb-cf31-4c49-af59-155209605103 | 4.8.0-0.nightly-2021-09-13-115830 |
+-----------------------+--------------------------------------+-----------------------------------+

+--------------------------------+-----------------+----------+--------------------------------------+---------+
| metricName                     | quantileName    | key      | uuid                                 | value   |
+--------------------------------+-----------------+----------+--------------------------------------+---------+
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 17900.0 |
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 17535.0 |
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 3798.0  |
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 4098.0  |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 122.0   |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 108.0   |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 50.0    |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 61.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 13.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 14.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 15.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 16.0    |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 17900.0 |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 17535.0 |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 3798.0  |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 4098.0  |
+--------------------------------+-----------------+----------+--------------------------------------+---------+

+-----------------------+--------------------------------------+----------------------+--------------------------------------+--------------------+
| metricName            | uuid                                 | key                  | uuid                                 | value              |
+-----------------------+--------------------------------------+----------------------+--------------------------------------+--------------------+
| podLatencyMeasurement | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | avg(podReadyLatency) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 6359.29117147708   |
| podLatencyMeasurement | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | avg(podReadyLatency) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 6250.486408426776  |
| podLatencyMeasurement | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | avg(podReadyLatency) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 2831.8779168075753 |
| podLatencyMeasurement | 060af6bb-cf31-4c49-af59-155209605103 | avg(podReadyLatency) | 060af6bb-cf31-4c49-af59-155209605103 | 3044.3616912888438 |
+-----------------------+--------------------------------------+----------------------+--------------------------------------+--------------------+
Further investigation revealed that each node was keeping up fine until the scheduler suddenly started to pile pods onto it. Because pod network setup is serialized on a node, when many pods land on the same node at once the later ones get stuck waiting for the earlier ones to complete, which drives up the average startup time.

http://people.redhat.com/dwinship/node-pod-counts.txt shows the pattern is even more specific and odd than that: the scheduler behaves fairly until some node reaches 144 +/- 1 pods. At that point it schedules the next ~92 pods to that same node before it starts spreading pods across the other nodes again.

The regression has been narrowed down to having started somewhere between 8/7 and 8/14. Given that the kube 1.22 rebase landed on 8/11, this suggests a kube-scheduler regression in 1.22.

The kube-scheduler logs don't show any obvious misbehavior; in particular, when it starts piling pods onto the first node, it still considers most of the other worker nodes "feasible":

2021-09-17T16:48:36.601707445Z I0917 16:48:36.601658 1 scheduler.go:672] "Successfully bound pod to node" pod="node-density-5af449ee-52d1-4e90-9ce8-249336fd06c1/node-density-2588" node="ip-10-0-244-166.us-west-2.compute.internal" evaluatedNodes=27 feasibleNodes=20

Apparently the reason it eventually stops targeting that node is that the node is no longer considered feasible:

2021-09-17T16:48:41.149640450Z I0917 16:48:41.149561 1 scheduler.go:672] "Successfully bound pod to node" pod="node-density-5af449ee-52d1-4e90-9ce8-249336fd06c1/node-density-2679" node="ip-10-0-244-166.us-west-2.compute.internal" evaluatedNodes=27 feasibleNodes=20
2021-09-17T16:48:41.203106370Z I0917 16:48:41.203053 1 scheduler.go:672] "Successfully bound pod to node" pod="node-density-5af449ee-52d1-4e90-9ce8-249336fd06c1/node-density-2680" node="ip-10-0-128-172.us-west-2.compute.internal" evaluatedNodes=27 feasibleNodes=19

and likewise for the other nodes, eventually ending with feasibleNodes=1.
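For what it's worth, the per-node scheduling pattern above can be extracted directly from a saved kube-scheduler log using the "Successfully bound pod to node" lines shown above; this is just a sketch, and the log file name is a placeholder:

# Count binding decisions per node from a saved kube-scheduler log.
grep '"Successfully bound pod to node"' kube-scheduler.log \
  | grep -o 'node="[^"]*"' \
  | sort | uniq -c | sort -rn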
The 4.8 nightly is now showing the regression as well. A notable change there is that we went from Kubernetes 1.21.1 to 1.21.4, which included some scheduler changes.
For 4.10.0-0.nightly-2021-09-20-032334 it is at ~17 seconds (Kubernetes 1.22.1).
For 4.9.0-0.nightly-2021-07-29-103526, when the test was run at 120-node scale (approx. 30,000 pods in total), the pod startup latency was 3.7 seconds (Kubernetes 1.21.1).
After re-running the benchmark against a 4.8 cluster deployed using a payload based on https://github.com/openshift/kubernetes/pull/973 (revert 102925: Revert calculation for resource scoring), we got good pod startup latency results again:

http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1632235847751&to=1632236517420&var-Datasource=Development-AWS-ES_ripsaw-kube-burner&var-sdn=openshift-sdn&var-job=node-density&var-uuid=9c079180-c079-4e4e-946c-190598ac66b1&var-master=All&var-worker=ip-10-0-128-243.eu-west-2.compute.internal&var-infra=&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All

+--------------------------------+-----------------+----------+--------------------------------------+--------+
| metricName                     | quantileName    | key      | uuid                                 | value  |
+--------------------------------+-----------------+----------+--------------------------------------+--------+
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 4102.0 |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 53.0   |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 13.0   |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 4102.0 |
+--------------------------------+-----------------+----------+--------------------------------------+--------+
*** Bug 2016003 has been marked as a duplicate of this bug. ***
After doing some more tests using a workload consisting of deploying a certain number of pause pods with requests (cpu and memory), we've seen that pods are not scheduled evenly even after configuring pod requests. The workload described below deployed 446 pause pods in the same namespace (node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415).

# Number of pods in the workload's namespace
rsevilla@wonderland ~ $ oc get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 --no-headers | wc -l
446

# All of the deployed pods have resource requests configured
rsevilla@wonderland ~ $ oc get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 node-density-1 -o jsonpath="{.spec.containers[*].resources}"
{"requests":{"cpu":"1m","memory":"10Mi"}}

# Worker nodes total (non-terminated pods per node, all namespaces)
$ oc describe node -l node-role.kubernetes.io/worker | grep -E "(^Name:|^Non-terminated)"
Name:                 ip-10-0-147-142.eu-west-3.compute.internal
Non-terminated Pods:  (249 in total)
Name:                 ip-10-0-158-24.eu-west-3.compute.internal
Non-terminated Pods:  (249 in total)
Name:                 ip-10-0-187-55.eu-west-3.compute.internal
Non-terminated Pods:  (25 in total)
Name:                 ip-10-0-218-220.eu-west-3.compute.internal
Non-terminated Pods:  (31 in total)

# Number of pods per node in the workload's namespace
rsevilla@wonderland ~ $ oc get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 -o wide --no-headers | awk '{node[$7]++ }END{ for (n in node) print n": "node[n]; }'
ip-10-0-147-142.eu-west-3.compute.internal: 218
ip-10-0-187-55.eu-west-3.compute.internal: 5
ip-10-0-158-24.eu-west-3.compute.internal: 223

As shown above, the pods were deployed across only 3 of the 4 worker nodes (one node didn't get any pod at all), and of those 3, one got only 5 pods while the other two were assigned 218 and 223 pods respectively.

# Cluster version
rsevilla@wonderland ~ $ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-rc.4   True        False         16d     Cluster version is 4.9.0-rc.4
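As a sanity check (a minimal sketch, not part of the original test output), the nodes' allocatable resources can be compared against the tiny per-pod requests above to confirm that capacity is not what drove this imbalance; the label selector is the standard worker-role label:

# Print allocatable CPU/memory per worker node; with requests of 1m CPU and
# 10Mi memory per pod, none of these nodes should be anywhere near saturation.
oc get node -l node-role.kubernetes.io/worker \
  -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory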
Hi, our tests with recent 4.10 versions do not show this issue, as the recent k8s 1.23 rebase included the upstream fix https://github.com/kubernetes/kubernetes/pull/105845. On the flip side, we're still observing this issue in 4.9, since the patch hasn't been backported yet (it seems it is going to be part of the 1.22.5 rebase, https://github.com/openshift/kubernetes/pull/1103/commits).
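For anyone checking whether a particular build already carries the rebase (and therefore the scheduler fix), a quick check against a running cluster is shown below; the release pullspec in the second command is a placeholder, and the grep assumes the component versions listed in the release info output include the Kubernetes version:

# On a running cluster, the reported Kubernetes version tells you whether the
# 1.22.5 rebase is included.
oc version

# For an arbitrary release payload (pullspec/tag is a placeholder):
oc adm release info quay.io/openshift-release-dev/ocp-release:<tag> | grep -i kubernetes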
Due to higher priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.
The 4.9 rebase https://github.com/openshift/kubernetes/pull/1103, containing the upstream fix https://github.com/kubernetes/kubernetes/pull/105845, has merged. Raul, would you mind measuring the latency for 4.9 to verify the fix resolves the issue?
If we verify that the problem is fixed, let's close this as a duplicate of bug 2024643, which was used to land the rebase.
(In reply to Jan Chaloupka from comment #17)
> https://github.com/openshift/kubernetes/pull/1103 4.9 rebase containing the
> upstream fix https://github.com/kubernetes/kubernetes/pull/105845 merged.
> Raul, would you mind measuring the latency for 4.9 to verify the fix
> resolves the issue?

Hi, I can confirm the scheduling issue has been fixed and pod startup latencies are back to normal for the node-density scenario (below 5 seconds), thanks to the recent rebase to k8s 1.22.5.

rsevilla@wonderland ~ $ oc version
Client Version: 4.8.0-rc.0
Server Version: 4.9.0-0.nightly-2022-03-03-120755
Kubernetes Version: v1.22.5+5c84e52

rsevilla@wonderland ~ $ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-03-03-120755   True        False         14h     Cluster version is 4.9.0-0.nightly-2022-03-03-120755

http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1646390668032&to=1646391146450&var-Datasource=Development-AWS-ES_ripsaw-kube-burner&var-sdn=openshift-sdn&var-job=node-density&var-uuid=15ac0e8d-4a09-4657-a83a-cca1f84ddc94&var-master=ip-10-0-150-251.eu-west-3.compute.internal&var-worker=ip-10-0-130-40.eu-west-3.compute.internal&var-infra=&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All

In this test the observed P99 pod ready latency was 4.582s.
*** This bug has been marked as a duplicate of bug 2024643 ***