Bug 1901363 - High Podready Latency due to timed out waiting for annotations
Summary: High Podready Latency due to timed out waiting for annotations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1908472
 
Reported: 2020-11-24 23:52 UTC by Sai Sindhur Malleni
Modified: 2021-07-11 16:05 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1908472
Environment:
Last Closed: 2021-02-24 15:35:48 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:36:17 UTC

Description Sai Sindhur Malleni 2020-11-24 23:52:45 UTC
Description of problem:

We have a 246-worker-node OCP cluster (252 nodes total counting masters). There are already around 5500 pods running on it for various Kubernetes components and operators. On top of that, we are trying to launch 4626 pods spread across the ~246 worker nodes, which should bring the total to roughly 40 running pods per node. While all the pods eventually launch, we see extremely high startup times (according to the upstream SLAs, the 99th percentile pod startup time should be under 5s). We see the 99th percentile pod ready latency at around 15 minutes. Several pods are stuck in ContainerCreating with errors like:

7s          Warning   FailedCreatePodSandBox   pod/kubelet-density-996-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-996-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_0e2f3d6e-b1b6-4c4b-acc8-cbadfa6fb73b_0(6c3c7c95c44883a3961985daf99a029e26070208cb38098692b01a96259180ad): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-996-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-996-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 6c3c7c95c44883a3961985daf99a029e26070208cb38098692b01a96259180ad] [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-996-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 6c3c7c95c44883a3961985daf99a029e26070208cb38098692b01a96259180ad] timed out waiting for annotations
'
78s         Normal    Scheduled                pod/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Successfully assigned kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 to worker072-fc640
48s         Warning   FailedCreatePodSandBox   pod/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_f951f372-883b-4691-9685-652fbce0861b_0(37a10d08fe6f42f42c35adb0352d47e1341b3a7fad289f14066efc70a428e10f): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 37a10d08fe6f42f42c35adb0352d47e1341b3a7fad289f14066efc70a428e10f] [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 37a10d08fe6f42f42c35adb0352d47e1341b3a7fad289f14066efc70a428e10f] timed out waiting for annotations
'
7s          Warning   FailedCreatePodSandBox   pod/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_f951f372-883b-4691-9685-652fbce0861b_0(6cee444c489b64bde03acb93b72ae7249e856f66af2f8febc03cf1ca0586724a): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 6cee444c489b64bde03acb93b72ae7249e856f66af2f8febc03cf1ca0586724a] [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-997-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 6cee444c489b64bde03acb93b72ae7249e856f66af2f8febc03cf1ca0586724a] timed out waiting for annotations
'
78s         Normal    Scheduled                pod/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Successfully assigned kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 to worker201-r640
48s         Warning   FailedCreatePodSandBox   pod/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_baee6b7b-4550-4f55-843a-0ab838a8b8b0_0(1f9d67e403790464e68390b8e8e6c2585065f03cd3e5b3d36488b670374804b1): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 1f9d67e403790464e68390b8e8e6c2585065f03cd3e5b3d36488b670374804b1] [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 1f9d67e403790464e68390b8e8e6c2585065f03cd3e5b3d36488b670374804b1] timed out waiting for annotations
'
4s          Warning   FailedCreatePodSandBox   pod/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_baee6b7b-4550-4f55-843a-0ab838a8b8b0_0(ffe3b38ab0dd1d4d536d5a3233f8b5dadaa0a9f63ae44dfc75e22c6b0eacf7f0): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 ffe3b38ab0dd1d4d536d5a3233f8b5dadaa0a9f63ae44dfc75e22c6b0eacf7f0] [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-998-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 ffe3b38ab0dd1d4d536d5a3233f8b5dadaa0a9f63ae44dfc75e22c6b0eacf7f0] timed out waiting for annotations
'
78s         Normal    Scheduled                pod/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Successfully assigned kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 to worker231-r640
48s         Warning   FailedCreatePodSandBox   pod/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_2ce4ef4a-c54f-43ec-94c6-5818d8451560_0(1f32e469e379a58a88ce751296250fde9faacad67d7cba51caa7292f206f3be6): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 1f32e469e379a58a88ce751296250fde9faacad67d7cba51caa7292f206f3be6] [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 1f32e469e379a58a88ce751296250fde9faacad67d7cba51caa7292f206f3be6] timed out waiting for annotations
'
3s          Warning   FailedCreatePodSandBox   pod/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9    Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9_2ce4ef4a-c54f-43ec-94c6-5818d8451560_0(3237d341a4c8452bc3e2f78789445eb87f0f9533bc70cb77ee08b9d3d1ff59a5): [kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[kubelet-density-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9/kubelet-density-999-dc3cf8c5-857d-4c4d-8789-d4b3943be2a9 3237d341a4c8452bc3e2f78789445eb87f0f9533bc70cb77ee08b9d3d1

Due to what looks like IPAM issues, we are seeing extremely high pod ready latencies.


All of these pods are created in the same namespace.
Version-Release number of selected component (if applicable):
4.6.4

How reproducible:
100%

Steps to Reproduce:
1. Build a large cluster with OVN
2. Launch several thousand pods
3.

Actual results:
Extremely high pod ready latencies

Expected results:
Pod ready latency should be under 5s

Additional info:

Comment 2 Sai Sindhur Malleni 2020-11-25 01:29:05 UTC
Not sure if it is related, but our cluster-density tests, which load up the API, are also failing with errors like:
  Warning  FailedCreatePodSandBox  36m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build_cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83_095caf13-c62b-49fc-8e2f-83ef90675304_0(765d1a8bb4f886cd746c360f372fa292d430c6a5cc9d8e16762a2674aa2391e9): [cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83/cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83/cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build 765d1a8bb4f886cd746c360f372fa292d430c6a5cc9d8e16762a2674aa2391e9] [cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83/cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build 765d1a8bb4f886cd746c360f372fa292d430c6a5cc9d8e16762a2674aa2391e9] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows

Comment 6 Sai Sindhur Malleni 2020-11-25 14:52:36 UTC
oc adm must-gather --dest-dir="${ARTIFACT_DIR}/network-ovn" -- /usr/bin/gather_network_logs is the command I used to gather the related logs. Can you let me know which command to run to get the logs you care about?

Comment 7 Alexander Constantinescu 2020-11-25 15:31:25 UTC
I guess this is not needed anymore since the problem has been identified in comment #5, right?

For future reference, we'd also appreciate the regular must-gather containing all pod logs:

oc adm must-gather --dest-dir="${ARTIFACT_DIR}/must-gather"

Comment 8 Sai Sindhur Malleni 2020-11-25 16:51:21 UTC
Thanks. I think we need some discussion on what good and safe values for QPS and Burst would be, though. Any thoughts?

Comment 9 Sai Sindhur Malleni 2020-11-25 23:12:48 UTC
I am seeing results similar to Raul's: reducing the QPS of my test driver also reduces pod ready latencies on bare metal. It looks like we need to tune the out-of-the-box QPS and Burst of the OVN-Kubernetes client to handle higher load.
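
For context, the "Throttling request" messages come from client-go's client-side token-bucket rate limiter: requests up to Burst go through immediately, and anything beyond that is queued at the configured QPS. A minimal sketch of that behavior (illustration only, not the ovn-kubernetes code; the values are arbitrary, roughly the client-go defaults):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Low limits chosen only to make the throttling visible.
	limiter := flowcontrol.NewTokenBucketRateLimiter(5 /* qps */, 10 /* burst */)

	start := time.Now()
	for i := 0; i < 30; i++ {
		limiter.Accept() // blocks once the burst is exhausted
	}
	// The first 10 calls are immediate; the remaining 20 trickle in at 5/s,
	// so this prints roughly 4s.
	fmt.Printf("30 requests took %v\n", time.Since(start))
}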

Comment 10 Alexander Constantinescu 2020-11-26 10:07:29 UTC
It's difficult for me to say what the right values are; they obviously depend on the workload. Could you guys - who have access to high-scalability testing - run tests with different workloads and evaluate what the right setting is? If there's no downside to setting a high value, we could just bump the settings in ovn-kubernetes to a high value. We would need to validate that there's no problem with doing that when the workload is low, and what this "high" value is.

Comment 11 Raul Sevilla 2020-11-26 11:43:16 UTC
(In reply to Alexander Constantinescu from comment #10)
> It's difficult for me to say what the right values are; they obviously
> depend on the workload. Could you guys - who have access to
> high-scalability testing - run tests with different workloads and evaluate
> what the right setting is? If there's no downside to setting a high value,
> we could just bump the settings in ovn-kubernetes to a high value. We would
> need to validate that there's no problem with doing that when the workload
> is low, and what this "high" value is.

OpenShiftSDN uses 10 and 20 for QPS and Burst respectively (https://github.com/openshift/sdn/blob/release-4.7/pkg/openshift-sdn/informers.go#L118-L122); however, it does not annotate each pod, so the number of API transactions it performs is much lower.
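
For reference, the pattern at that link amounts to raising QPS and Burst on the client-go rest.Config before building the clientset. A rough sketch of that pattern, assuming the standard client-go fields (the 10/20 values mirror the SDN numbers quoted above, not what ovn-kubernetes ships today):

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// Client-side rate limits; the discussion below is about raising the
	// OVN-Kubernetes equivalents above the ~20 requests/s the workload generates.
	cfg.QPS = 10
	cfg.Burst = 20

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("clientset %T configured with QPS=%v Burst=%v\n", client, cfg.QPS, cfg.Burst)
}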

I've done a small breakdown of which API requests were throttled during the test:

# oc logs ovnkube-master-wjjx4 -c ovnkube-master | grep Throttl | grep -c GET      <- lock renew and get node
55
# oc logs ovnkube-master-wjjx4 -c ovnkube-master | grep Throttl | grep -c PATCH    <- pod annotations
5597
# oc logs ovnkube-master-wjjx4 -c ovnkube-master | grep Throttl | grep -c PUT      <- lock renew (approximately every 20 seconds)
13

Given that our test usually runs at 20 QPS, I think it makes sense to increase the default QPS/Burst values to something slightly higher than 20, maybe 25?

I know there are other factors that might affect the number of transactions (services, network policies, etc.). However, I think the default values are too low for this component. There are situations, such as an AZ outage, where we could see a massive pod eviction/reschedule; increasing the QPS and Burst values will help ease the impact of these kinds of events.


On the other hand, we have server-side protection mechanisms such as API Priority and Fairness that should prevent kube API starvation.

Comment 12 Alexander Constantinescu 2020-11-26 12:06:19 UTC
> I've done a small breakdown of what API transactions were throttled during the test

That's great! Do you guys have the possibility to mimic a production environment (i.e., creating network policies, services, etc.) and validate that these QPS and Burst settings work under those conditions too?

If that is the case, then feel free to open a PR to ovn-org/ovn-kubernetes updating those settings.

/Alexander

Comment 13 Alexander Constantinescu 2020-11-30 16:53:21 UTC
Setting this to blocker, as scale and performance for OVN-Kubernetes in 4.6 is important.

Could you guys from the scale and perf team get back to me on whether the QPS/Burst settings worked with network policies and services?

Comment 14 Raul Sevilla 2020-11-30 17:33:32 UTC
(In reply to Alexander Constantinescu from comment #13)
> Setting this to blocker, as scale and performance for OVN-Kubernetes in 4.6
> is important.
> 
> Could you guys from the scale and perf team get back to me on whether the
> QPS/Burst settings worked with network policies and services?

Hi Alexander, we can't provide an exact answer here. The reason for this is that the required QPS and Burst values depend on the API transaction rate generated by OVN-Kubernetes. Our workload creates 20 pods/second, but again, it totally depends on the object creation/deletion rate generated.
The default QPS/Burst values are too small for most of our scalability tests, so I think it would be a good idea to increase them to something like 25/25. An ideal solution would be to expose these parameters so a user or another component could tune them according to the cluster size.
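
As a rough sketch of what exposing these as tunables could look like (hypothetical flag names for illustration only, not flags ovn-kubernetes actually defines):

package main

import (
	"flag"

	"k8s.io/client-go/rest"
)

// Hypothetical flags; the real option names and defaults may differ.
var (
	kubeAPIQPS   = flag.Float64("k8s-api-qps", 25, "client-side QPS limit for requests to the API server")
	kubeAPIBurst = flag.Int("k8s-api-burst", 25, "client-side burst limit for requests to the API server")
)

// applyRateLimits copies the flag values onto a client-go rest.Config.
func applyRateLimits(cfg *rest.Config) {
	cfg.QPS = float32(*kubeAPIQPS)
	cfg.Burst = *kubeAPIBurst
}

func main() {
	flag.Parse()
	cfg := &rest.Config{} // stand-in; a real binary would load kubeconfig or in-cluster config
	applyRateLimits(cfg)
}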

Hope that helps,

Comment 15 Sai Sindhur Malleni 2020-12-03 19:11:05 UTC
Upstream patch merged: https://github.com/ovn-org/ovn-kubernetes/pull/1878

Comment 16 Raul Sevilla 2020-12-04 11:28:07 UTC
Downstream patch merged as well: https://github.com/openshift/ovn-kubernetes/pull/366

Comment 17 Alexander Constantinescu 2020-12-04 11:41:37 UTC
Indeed, sorry for not updating: I got caught up in some customer escalations. 

Could you re-try this scale test on 4.7 and check how far we get?

Moving to ON_QA

Comment 18 Raul Sevilla 2020-12-04 15:21:02 UTC
The patch is working properly; we got 5.032 seconds P99 pod startup latency after executing the same workload.

Comment 21 errata-xmlrpc 2021-02-24 15:35:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

