Description of problem:
The pod startup latency (P99) has increased from ~4 seconds to ~16+ seconds on a 25-node cluster.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-08-13-053517 and later

How reproducible:
Always

Steps to Reproduce:
1. Run a node-density-light test, i.e. 250 sleep pods per worker node (a rough sketch of this workload is shown below).
2. Observe the pod startup latency.

Actual results:
~16 seconds

Expected results:
~4 seconds

Additional info:
The last time we observed a ~4 second latency was with the 4.9.0-0.nightly-2021-08-07-175228 build.
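For reference, a minimal way to approximate the node-density-light workload by hand (the actual runs are driven by kube-burner, see the dashboards below); the namespace name, pod names, pod count and image here are illustrative assumptions, not the exact test configuration:

# Create one namespace and N sleep pods (N ~= 250 * number of worker nodes; 6250 for 25 workers).
# Image and names are placeholders for illustration only.
oc create namespace node-density-light
for i in $(seq 1 6250); do
  oc run "sleep-${i}" -n node-density-light \
    --image=registry.k8s.io/pause:3.5 \
    --restart=Never
done
# Pod startup latency is then derived from the PodScheduled/Initialized/Ready
# condition timestamps in each pod's status.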
More data showing the regression...

+-----------------------+--------------------------------------+-----------------------------------+
| metadata              | uuid                                 | value                             |
+-----------------------+--------------------------------------+-----------------------------------+
| value.cluster_version | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 4.9.0-0.nightly-2021-09-14-200602 |
| value.cluster_version | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 4.9.0-0.nightly-2021-09-07-201519 |
| value.cluster_version | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 4.9.0-0.nightly-2021-08-04-081505 |
| value.cluster_version | 060af6bb-cf31-4c49-af59-155209605103 | 4.8.0-0.nightly-2021-09-13-115830 |
+-----------------------+--------------------------------------+-----------------------------------+

+--------------------------------+-----------------+----------+--------------------------------------+---------+
| metricName                     | quantileName    | key      | uuid                                 | value   |
+--------------------------------+-----------------+----------+--------------------------------------+---------+
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 17900.0 |
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 17535.0 |
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 3798.0  |
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 4098.0  |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 122.0   |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 108.0   |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 50.0    |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 61.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 13.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 14.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 15.0    |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 16.0    |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 17900.0 |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 17535.0 |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 3798.0  |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | 060af6bb-cf31-4c49-af59-155209605103 | 4098.0  |
+--------------------------------+-----------------+----------+--------------------------------------+---------+

+-----------------------+--------------------------------------+----------------------+--------------------------------------+--------------------+
| metricName            | uuid                                 | key                  | uuid                                 | value              |
+-----------------------+--------------------------------------+----------------------+--------------------------------------+--------------------+
| podLatencyMeasurement | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | avg(podReadyLatency) | 2b267e00-cc35-4750-a46a-b6e4fdc1aab4 | 6359.29117147708   |
| podLatencyMeasurement | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | avg(podReadyLatency) | a98ff20d-d730-4f60-b8fa-f99f0d7dd160 | 6250.486408426776  |
| podLatencyMeasurement | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | avg(podReadyLatency) | ee91f4b2-14bc-48b7-8fc0-5c052d9631af | 2831.8779168075753 |
| podLatencyMeasurement | 060af6bb-cf31-4c49-af59-155209605103 | avg(podReadyLatency) | 060af6bb-cf31-4c49-af59-155209605103 | 3044.3616912888438 |
+-----------------------+--------------------------------------+----------------------+--------------------------------------+--------------------+
Further investigation revealed that each node was keeping up fine until the scheduler suddenly started to pile pods onto it. Because pod network setup is serialized on a node, when many pods land on the same node at once the later ones get stuck waiting for the earlier ones to complete, which drives up the average startup time.

http://people.redhat.com/dwinship/node-pod-counts.txt shows the pattern is even more specific and odd than that: the scheduler behaves fairly until some node reaches 144 +/- 1 pods. At that point it schedules the next ~92 pods to that same node before it starts spreading pods across the other nodes again.

The regression has been narrowed down to having started somewhere between 8/7 and 8/14. Given that the kube 1.22 rebase landed on 8/11, this suggests a kube-scheduler regression in 1.22.

The kube-scheduler logs don't show any obvious misbehavior; in particular, when it starts piling pods onto the first node, it still considers most of the other worker nodes "feasible":

2021-09-17T16:48:36.601707445Z I0917 16:48:36.601658 1 scheduler.go:672] "Successfully bound pod to node" pod="node-density-5af449ee-52d1-4e90-9ce8-249336fd06c1/node-density-2588" node="ip-10-0-244-166.us-west-2.compute.internal" evaluatedNodes=27 feasibleNodes=20

Apparently the reason it eventually stops targeting that node is that the node is no longer considered feasible:

2021-09-17T16:48:41.149640450Z I0917 16:48:41.149561 1 scheduler.go:672] "Successfully bound pod to node" pod="node-density-5af449ee-52d1-4e90-9ce8-249336fd06c1/node-density-2679" node="ip-10-0-244-166.us-west-2.compute.internal" evaluatedNodes=27 feasibleNodes=20
2021-09-17T16:48:41.203106370Z I0917 16:48:41.203053 1 scheduler.go:672] "Successfully bound pod to node" pod="node-density-5af449ee-52d1-4e90-9ce8-249336fd06c1/node-density-2680" node="ip-10-0-128-172.us-west-2.compute.internal" evaluatedNodes=27 feasibleNodes=19

and likewise for the other nodes, eventually ending with feasibleNodes=1.
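For what it's worth, the per-node scheduling pattern above can be extracted directly from a saved kube-scheduler log using the "Successfully bound pod to node" lines shown above; this is just a sketch, and the log file name is a placeholder:

# Count binding decisions per node from a saved kube-scheduler log.
grep '"Successfully bound pod to node"' kube-scheduler.log \
  | grep -o 'node="[^"]*"' \
  | sort | uniq -c | sort -rn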
The 4.8 nightly is now showing the regression as well. A notable change there is that we went from Kubernetes 1.21.1 to 1.21.4, which included some scheduler changes.
For 4.10.0-0.nightly-2021-09-20-032334 it is at ~17 seconds (Kubernetes 1.22.1).
For 4.9.0-0.nightly-2021-07-29-103526, when the test was run at 120-node scale (approx. 30,000 pods in total), the pod startup latency was 3.7 seconds (Kubernetes 1.21.1).
After re-running the benchmark against a 4.8 cluster deployed using a payload based on https://github.com/openshift/kubernetes/pull/973 (revert 102925: Revert calculation for resource scoring), we got good pod startup latency results again:

http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1632235847751&to=1632236517420&var-Datasource=Development-AWS-ES_ripsaw-kube-burner&var-sdn=openshift-sdn&var-job=node-density&var-uuid=9c079180-c079-4e4e-946c-190598ac66b1&var-master=All&var-worker=ip-10-0-128-243.eu-west-2.compute.internal&var-infra=&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All

+--------------------------------+-----------------+----------+--------------------------------------+--------+
| metricName                     | quantileName    | key      | uuid                                 | value  |
+--------------------------------+-----------------+----------+--------------------------------------+--------+
| podLatencyQuantilesMeasurement | ContainersReady | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 4102.0 |
| podLatencyQuantilesMeasurement | Initialized     | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 53.0   |
| podLatencyQuantilesMeasurement | PodScheduled    | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 13.0   |
| podLatencyQuantilesMeasurement | Ready           | avg(P99) | 9c079180-c079-4e4e-946c-190598ac66b1 | 4102.0 |
+--------------------------------+-----------------+----------+--------------------------------------+--------+
*** Bug 2016003 has been marked as a duplicate of this bug. ***
After doing some more tests using a workload consisting of deploying a certain number of pause pods with requests (cpu and memory), we've seen that pods are not scheduled evenly even after configuring pod requests. The workload described below deployed 446 pause pods in the same namespace (node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415).

# Number of pods in the workload's namespace
rsevilla@wonderland ~ $ oc get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 --no-headers | wc -l
446

# All of the deployed pods have resource requests configured
rsevilla@wonderland ~ $ oc get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 node-density-1 -o jsonpath="{.spec.containers[*].resources}"
{"requests":{"cpu":"1m","memory":"10Mi"}}

# Worker nodes total (non-terminated pods per node, all namespaces)
$ oc describe node -l node-role.kubernetes.io/worker | grep -E "(^Name:|^Non-terminated)"
Name:                 ip-10-0-147-142.eu-west-3.compute.internal
Non-terminated Pods:  (249 in total)
Name:                 ip-10-0-158-24.eu-west-3.compute.internal
Non-terminated Pods:  (249 in total)
Name:                 ip-10-0-187-55.eu-west-3.compute.internal
Non-terminated Pods:  (25 in total)
Name:                 ip-10-0-218-220.eu-west-3.compute.internal
Non-terminated Pods:  (31 in total)

# Number of pods per node in the workload's namespace
rsevilla@wonderland ~ $ oc get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 -o wide --no-headers | awk '{node[$7]++ }END{ for (n in node) print n": "node[n]; }'
ip-10-0-147-142.eu-west-3.compute.internal: 218
ip-10-0-187-55.eu-west-3.compute.internal: 5
ip-10-0-158-24.eu-west-3.compute.internal: 223

As shown above, the pods were deployed across only 3 of the 4 worker nodes (one node didn't get any pod at all), and of those 3, one got only 5 pods while the other two were assigned 218 and 223 pods respectively.

# Cluster version
rsevilla@wonderland ~ $ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-rc.4   True        False         16d     Cluster version is 4.9.0-rc.4
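As a sanity check (a minimal sketch, not part of the original test output), the nodes' allocatable resources can be compared against the tiny per-pod requests above to confirm that capacity is not what drove this imbalance; the label selector is the standard worker-role label:

# Print allocatable CPU/memory per worker node; with requests of 1m CPU and
# 10Mi memory per pod, none of these nodes should be anywhere near saturation.
oc get node -l node-role.kubernetes.io/worker \
  -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory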
Hi, our tests with recent 4.10 versions do not show this issue, as the recent k8s 1.23 rebase included the upstream fix https://github.com/kubernetes/kubernetes/pull/105845. On the flip side, we're still observing this issue in 4.9, since the patch hasn't been backported yet (it seems it is going to be part of the 1.22.5 rebase, https://github.com/openshift/kubernetes/pull/1103/commits).
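For anyone checking whether a particular build already carries the rebase (and therefore the scheduler fix), a quick check against a running cluster is shown below; the release pullspec in the second command is a placeholder, and the grep assumes the component versions listed in the release info output include the Kubernetes version:

# On a running cluster, the reported Kubernetes version tells you whether the
# 1.22.5 rebase is included.
oc version

# For an arbitrary release payload (pullspec/tag is a placeholder):
oc adm release info quay.io/openshift-release-dev/ocp-release:<tag> | grep -i kubernetes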
Due to higher priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.
The 4.9 rebase https://github.com/openshift/kubernetes/pull/1103, containing the upstream fix https://github.com/kubernetes/kubernetes/pull/105845, has merged. Raul, would you mind measuring the latency for 4.9 to verify the fix resolves the issue?
If we verify that the problem is fixed, let's close this as a duplicate of bug 2024643, which was used to land the rebase.
(In reply to Jan Chaloupka from comment #17)
> https://github.com/openshift/kubernetes/pull/1103 4.9 rebase containing the
> upstream fix https://github.com/kubernetes/kubernetes/pull/105845 merged.
> Raul, would you mind measuring the latency for 4.9 to verify the fix
> resolves the issue?

Hi, I can confirm the scheduling issue has been fixed and pod startup latencies are back to normal for the node-density scenario (below 5 seconds), thanks to the recent rebase to k8s 1.22.5.

rsevilla@wonderland ~ $ oc version
Client Version: 4.8.0-rc.0
Server Version: 4.9.0-0.nightly-2022-03-03-120755
Kubernetes Version: v1.22.5+5c84e52

rsevilla@wonderland ~ $ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-03-03-120755   True        False         14h     Cluster version is 4.9.0-0.nightly-2022-03-03-120755

http://grafana.rdu2.scalelab.redhat.com:3000/d/hIBqKNvMz/kube-burner-report?orgId=1&from=1646390668032&to=1646391146450&var-Datasource=Development-AWS-ES_ripsaw-kube-burner&var-sdn=openshift-sdn&var-job=node-density&var-uuid=15ac0e8d-4a09-4657-a83a-cca1f84ddc94&var-master=ip-10-0-150-251.eu-west-3.compute.internal&var-worker=ip-10-0-130-40.eu-west-3.compute.internal&var-infra=&var-namespace=All&var-verb=All&var-resource=All&var-flowschema=All&var-priority_level=All

In this test the observed P99 pod ready latency was 4.582s.
*** This bug has been marked as a duplicate of bug 2024643 ***