Bug 1901363
| Field | Value |
|---|---|
| Summary: | High PodReady latency due to timed out waiting for annotations |
| Product: | OpenShift Container Platform |
| Component: | Networking |
| Networking sub component: | ovn-kubernetes |
| Reporter: | Sai Sindhur Malleni <smalleni> |
| Assignee: | Alexander Constantinescu <aconstan> |
| QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| CC: | aconstan, dblack, jlema, mark.d.gray, rsevilla, wking |
| Version: | 4.6.z |
| Target Release: | 4.7.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | No Doc Update |
| Clones: | 1908472 (view as bug list) |
| Bug Blocks: | 1908472 |
| Type: | Bug |
| Last Closed: | 2021-02-24 15:35:48 UTC |
Description
Sai Sindhur Malleni
2020-11-24 23:52:45 UTC
Not sure if it is related, but our cluster-density tests, which load up the API, are also failing with errors like:

```
Warning  FailedCreatePodSandBox  36m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build_cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83_095caf13-c62b-49fc-8e2f-83ef90675304_0(765d1a8bb4f886cd746c360f372fa292d430c6a5cc9d8e16762a2674aa2391e9): [cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83/cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83/cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build 765d1a8bb4f886cd746c360f372fa292d430c6a5cc9d8e16762a2674aa2391e9] [cluster-density-b4395973-907b-4519-99db-ef3143f748b1-83/cluster-density-2-b4395973-907b-4519-99db-ef3143f748b1-1-build 765d1a8bb4f886cd746c360f372fa292d430c6a5cc9d8e16762a2674aa2391e9] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
```

I used the following command to gather related logs; can you let me know the command to run to get the logs you care about?

```
oc adm must-gather --dest-dir="${ARTIFACT_DIR}/network-ovn" -- /usr/bin/gather_network_logs
```
---

I guess this is not needed anymore since the problem has been identified in comment #5, right? For future reference we'd also appreciate the regular must-gather containing all pod logs:

```
oc adm must-gather --dest-dir="${ARTIFACT_DIR}/must-gather"
```

Thanks. I think we need some discussion on what good and safe values for QPS and Burst would be, though. Any thoughts?

---

I am seeing results similar to Raul, with reduced pod-ready latencies on bare metal when decreasing the QPS of my test driver. It looks like we need to optimize the out-of-the-box QPS and Burst of the OVN client to handle higher load.

---

It's difficult for me to say what the right values are; they obviously depend on the workload. Could you guys, who have access to high-scalability testing, run tests with different workloads and evaluate what the right setting is? If there's no downside to setting a high value we could just bump the settings in ovn-kubernetes to a high value. We would need to validate that there's no problem with doing that when the workload is low, and what this "high" value is.

---

(In reply to Alexander Constantinescu from comment #10)

> It's difficult for me to say what the right values are; they obviously depend on the workload. Could you guys, who have access to high-scalability testing, run tests with different workloads and evaluate what the right setting is? If there's no downside to setting a high value we could just bump the settings in ovn-kubernetes to a high value. We would need to validate that there's no problem with doing that when the workload is low, and what this "high" value is.

OpenShift SDN uses 10 and 20 respectively (https://github.com/openshift/sdn/blob/release-4.7/pkg/openshift-sdn/informers.go#L118-L122); however, it does not annotate each pod, so the number of API transactions it performs is much smaller.
---

I've done a small breakdown of which API transactions were throttled during the test:

```
root@ip-172-31-84-130: ~ # oc logs ovnkube-master-wjjx4 -c ovnkube-master | grep Throttl | grep -c GET
55      # lock renew and get node
# oc logs ovnkube-master-wjjx4 -c ovnkube-master | grep Throttl | grep -c PATCH
5597    # pod annotations
# oc logs ovnkube-master-wjjx4 -c ovnkube-master | grep Throttl | grep -c PUT
13      # lock renew (approximately every 20 seconds)
```

Given that our test usually runs at 20 QPS, I think it makes sense to increase the default QPS/Burst values to something slightly higher than 20, maybe 25? I know there are other factors that might affect this number of transactions (services, network policies, etc.); however, I think the default values are too low for this component. There are situations, such as an AZ outage, where we could see a massive pod eviction/reschedule, and increasing the QPS and Burst values will help ease the impact of this kind of event. On the other hand, we have protection mechanisms in the API, such as API Priority and Fairness, that should prevent kube API starvation.

---

> I've done a small breakdown of which API transactions were throttled during the test
That's great! Do you guys have the possibility to mimic a production environment (i.e. creating network policies, services, etc.) and validate that these QPS and Burst settings work under those conditions too?
If that is the case, then feel free to open a PR to ovn-org/ovn-kubernetes updating those settings.
/Alexander
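As an aside, the grep pipeline from the throttling breakdown above can be expressed in Go, which makes it easy to extend (e.g. to other verbs or to latency histograms). The log format assumed here is the standard client-go throttling message ("Throttling request took ..., request: VERB:url"); the sample lines are illustrative, not taken from the real must-gather.

```go
package main

import (
	"fmt"
	"strings"
)

// countThrottledVerbs counts client-side throttling log lines per HTTP verb,
// mirroring `grep Throttl | grep -c <VERB>` from the breakdown above.
func countThrottledVerbs(logLines []string) map[string]int {
	counts := map[string]int{}
	for _, line := range logLines {
		if !strings.Contains(line, "Throttl") {
			continue
		}
		for _, verb := range []string{"GET", "PATCH", "PUT"} {
			if strings.Contains(line, verb) {
				counts[verb]++
			}
		}
	}
	return counts
}

func main() {
	// Hypothetical sample lines in the client-go throttling log format.
	sample := []string{
		"I1124 Throttling request took 1.2s, request: PATCH:https://api:6443/api/v1/namespaces/x/pods/a",
		"I1124 Throttling request took 0.9s, request: PATCH:https://api:6443/api/v1/namespaces/x/pods/b",
		"I1124 Throttling request took 0.5s, request: GET:https://api:6443/api/v1/nodes/n1",
		"I1124 ordinary log line with no throttling",
	}
	counts := countThrottledVerbs(sample)
	fmt.Println(counts["PATCH"], counts["GET"], counts["PUT"])
}
```

The dominance of PATCH in the real numbers (5597 vs. 55 GET and 13 PUT) is consistent with the per-pod annotation writes being the main consumer of the client's request budget.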
---

Setting this to blocker, as scale and performance for OVN-Kubernetes in 4.6 is important.

Could you guys from the scale and perf team get back to me on whether the QPS/Burst settings worked with network policies and services?

---

(In reply to Alexander Constantinescu from comment #13)

> Setting this to blocker as this scale and performance for OVN-Kubernetes in 4.6 is important.
>
> Could you guys from the scale and perf team get back to me on if the QPS/Burst settings worked with network policy and services?

Hi Alexander, we can't provide an exact answer here. The reason is that the required QPS and Burst values depend on the API transaction rate generated by OVN-Kubernetes. Our workload creates 20 pods/second, but again, it totally depends on the object creation/deletion rate generated. The default QPS/Burst values are too small for most of our scalability tests, so I think it would be a good idea to increase them to something like 25/25. An ideal solution would be to expose these parameters so a user or another entity could tune them according to the cluster size.

Hope that helps.

---

Upstream patch merged: https://github.com/ovn-org/ovn-kubernetes/pull/1878

Downstream patch merged as well: https://github.com/openshift/ovn-kubernetes/pull/366

---

Indeed, sorry for not updating: I got caught up in some customer escalations. Could you re-try this scale test on 4.7 and check how far we get?

Moving to ON_QA.

---

The patch is working properly: we got 5.032 seconds for P99 pod startup latency after executing the same workload.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
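As a closing back-of-the-envelope check of the transaction-rate reasoning in this thread: a sustained 20 pods/second workload needs at least one write per pod (the annotation PATCH), so the demand already exceeds typical client defaults several times over. The calls-per-pod figure below is an assumption for illustration; the 5/10 figures are the well-known client-go rest.Config defaults, not values taken from this bug.

```go
package main

import "fmt"

func main() {
	const (
		podsPerSecond = 20.0 // object creation rate used by the scale test
		callsPerPod   = 1.5  // assumed: ~1 annotation PATCH plus other writes
		defaultQPS    = 5.0  // client-go rest.Config default QPS
		suggestedQPS  = 25.0 // value proposed in this thread
	)
	demand := podsPerSecond * callsPerPod
	fmt.Printf("demand=%.0f req/s, default=%.0f, suggested=%.0f\n",
		demand, defaultQPS, suggestedQPS)
	// A demand of ~30 req/s against a 5 QPS default is consistent with the
	// heavy PATCH throttling observed; even 25 QPS is marginal if each pod
	// needs more than one write.
}
```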