Bug 1886953
| Summary: | cfs throttling in OCP 4.x is higher than 3.x | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jooho lee <jlee> | ||||||
| Component: | Node Tuning Operator | Assignee: | Jenifer Abrams <jhopper> | ||||||
| Status: | CLOSED NOTABUG | QA Contact: | Mike Fiedler <mifiedle> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 4.5 | CC: | akamra, aos-bugs, apaladug, dwalsh, jhopper, jmencak, jokerman, nagrawal, pauld, sejug | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | 4.7.0 | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2020-11-25 12:00:11 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: |
Created attachment 1720334
OCP 3.11 CFS Throttling Metric
@jmencak Is it sufficient to run the same container and CFS test in a customer environment? Can I tell the customer that we have not been able to reproduce the issue internally and are therefore looking for a simple reproducer in their environment?

Thanks
Anand

@jmencak Thanks for the update. A few questions:

1. It looks like the customer is actually running 4.5.6 (although they may have other clusters at 4.5.13). Per the KCS article https://access.redhat.com/solutions/5285071, the issue is fixed in kernel-4.18.0-147.3.1.el8_1. I cannot find the kernel version we ship in OCP 4.5.6 in the release notes. Can you confirm? I am also getting a sos report to confirm it.
2. Judging from the latest updates and commits at the end of https://github.com/kubernetes/kubernetes/issues/67577, do you think the issue is not yet fully resolved for all scenarios? In particular, see the comment from 19 hours ago (THEBOSS619 added a commit to THEBOSS619/Note9-ZeusKernelQ-OneUI-AOSP that referenced this issue 19 hours ago) about non-expiration of per-CPU slices.
3. Would the issue occur only if a CPU limit is set for the container? I am trying to see whether we can offer a workaround in the interim.

Thanks
Anand

The sos report is attached to the case. The kernel version is 4.18.0-193.14.3.el8_2.x86_64. Can you check whether the fix has been backported to this kernel version?

Thanks
Anand

The CFS scheduler code in 4.18.0-193.23.1.el8_2 (OCP 4.5.13) and 4.18.0-193.14.3.el8_2 (OCP 4.5.6) is exactly the same. I've run a reproducer (https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1) for https://bugzilla.kernel.org/show_bug.cgi?id=198197 in several iterations on both OCP 4.5.13 and 4.5.6, and it did not detect any throttling. There could be another CFS issue that this reproducer doesn't catch. Jenifer, can you take a look, please?

Team, the customer has reported that distributing the replicas across many nodes instead of two helped reduce the throttling a bit. But in both cases (i.e. running replicas on many nodes vs. two), 3.11 performed better than 4.5. I am still trying to get numbers for 3.11 vs. 4.5. The customer did configure requests and limits equally for the pod (to get Guaranteed QoS). Would increasing or removing the limits help in this situation (to reduce or avoid throttling)?

Thanks
Anand

Team, the customer provided some clarity and graphs showing the differences between 3.11 and 4.5.6. My comments in response and the files are all in the case. Please review and comment.

Thanks
Anand

As no response was received two weeks after requesting information on whether the issues still persist, and the customer case shows as Closed, closing this BZ. Please re-open if the problems persist.
Created attachment 1720333
OCP 4.5.13 CFS Throttling Metrics

Description of problem:
While troubleshooting the performance of an application that is GSLB'd between two OpenShift clusters in production (a 3.11 cluster and a 4.5 cluster), the customer noticed that CFS throttling is much higher on OpenShift 4. The nodes on both sides have the same bare-metal spec (56-70 cores @ 2.7 GHz) and are nowhere near high resource utilization. Attached are Prometheus metrics showing what we observed with a particular application deployment in production.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
CFS throttling is much higher in OCP 4.x.

Expected results:
CFS throttling in OCP 4.x should be similar to OCP 3.x.

Additional info:
This impacts application performance (response time in OCP 4.x is much higher than in OCP 3.x).
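For anyone who wants to compare throttling on the two clusters outside of Prometheus, here is a minimal sketch of how a throttling ratio could be derived from two snapshots of a container's cgroup `cpu.stat` file. It assumes the cgroup v1 counter names (`nr_periods`, `nr_throttled`, `throttled_time`); the function names and the sample values are hypothetical and are not taken from the customer's environment.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before, after):
    """Fraction of CFS enforcement periods that were throttled between two snapshots."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    # Guard against an idle cgroup where no periods elapsed.
    return throttled / periods if periods > 0 else 0.0


# Hypothetical snapshots; on a node these would be two reads of the
# container's cpu.stat file taken some seconds apart.
sample_before = "nr_periods 1000\nnr_throttled 50\nthrottled_time 2500000000\n"
sample_after = "nr_periods 1600\nnr_throttled 200\nthrottled_time 9000000000\n"

ratio = throttle_ratio(parse_cpu_stat(sample_before), parse_cpu_stat(sample_after))
print(f"throttled periods: {ratio:.2%}")
```

Taking the delta between two snapshots rather than the raw counters keeps the comparison independent of how long each container has been running, which matters when comparing a 3.11 node against a 4.5 node.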