Bug 1886953

Summary: cfs throttling in OCP 4.x is higher than 3.x
Product: OpenShift Container Platform
Component: Node Tuning Operator
Version: 4.5
Reporter: jooho lee <jlee>
Assignee: Jenifer Abrams <jhopper>
QA Contact: Mike Fiedler <mifiedle>
Docs Contact:
Status: CLOSED NOTABUG
Severity: high
Priority: high
CC: akamra, aos-bugs, apaladug, dwalsh, jhopper, jmencak, jokerman, nagrawal, pauld, sejug
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-25 12:00:11 UTC
Type: Bug
Attachments:
Description                               Flags
OCP 4.5.13 CFS Throttling Metrics         none
OCP 3.11 CFS Throttling Metric            none

Description jooho lee 2020-10-09 19:42:37 UTC
Created attachment 1720333 [details]
OCP 4.5.13 CFS Throttling Metrics

Description of problem:
While troubleshooting the performance of an application that is GSLB'd between two OpenShift clusters in production (a 3.11 cluster and a 4.5 cluster), the customer noticed that CFS throttling is much higher on OpenShift 4. The nodes on both sides have the same bare-metal spec, 56-70 cores @ 2.7 GHz, and are nowhere near high resource utilization. Attached are Prometheus metrics of what we have observed for a particular application deployment in production.
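For reference, a minimal sketch of the kind of query behind graphs like the attached ones; the Prometheus route and namespace label below are placeholders, but the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total are what normally give the fraction of CFS periods in which containers were throttled:

import requests

PROM_URL = "https://prometheus-k8s.example.com"   # assumed Prometheus route
QUERY = (
    'sum(rate(container_cpu_cfs_throttled_periods_total{namespace="myapp"}[5m]))'
    ' / '
    'sum(rate(container_cpu_cfs_periods_total{namespace="myapp"}[5m]))'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print("fraction of CFS periods throttled:", result["value"][1])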

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
CFS throttling is much higher in OCP 4.x.

Expected results:
CFS throttling in OCP 4.x should be similar to OCP 3.x.

Additional info:
This impacts application performance (response time in OCP 4.x is much higher than in OCP 3.x).
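As a node-level cross-check, the same throttling counters can also be read straight from a container's cgroup v1 cpu.stat file without going through Prometheus; the path below is only an illustration, the real pod/container slice names differ per node:

from pathlib import Path

CPU_STAT = Path("/sys/fs/cgroup/cpu/kubepods.slice/cpu.stat")  # assumed path

stats = {}
for line in CPU_STAT.read_text().splitlines():
    key, value = line.split()
    stats[key] = int(value)

if stats.get("nr_periods"):
    pct = 100.0 * stats["nr_throttled"] / stats["nr_periods"]
    print(f"{pct:.1f}% of CFS periods throttled, "
          f"{stats['throttled_time'] / 1e9:.1f}s spent throttled")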

Comment 1 jooho lee 2020-10-09 19:43:26 UTC
Created attachment 1720334 [details]
OCP 3.11 CFS Throttling Metric

Comment 5 Anand Paladugu 2020-10-13 16:34:08 UTC
@jmencak 

Is it sufficient to run the same container and CFS test in the customer environment?  Can I update the customer that we have not been able to reproduce the issue internally and are therefore looking for a simple reproducer in their environment?

Thanks

Anand

Comment 7 Anand Paladugu 2020-10-13 20:37:46 UTC
@jmencak 

Thanks for the update.

A few questions.

1. It looks like the customer is actually running 4.5.6 (although they may have other clusters at 4.5.13).  Per the KCS article ->
https://access.redhat.com/solutions/5285071   the issue is fixed in kernel-4.18.0-147.3.1.el8_1.  I cannot find the kernel version we ship in OCP 4.5.6 in the release notes.  Can you confirm?  I am also getting an sosreport to confirm it.

2. Judging from the latest updates/commits at the end of https://github.com/kubernetes/kubernetes/issues/67577, do you think the issue is not yet fully resolved for all scenarios?  In particular, the recent comment (THEBOSS619 added a commit to THEBOSS619/Note9-ZeusKernelQ-OneUI-AOSP that referenced this issue 19 hours ago) about non-expiration of per-CPU slices?

3. Would the issue only occur if a CPU limit is set for the container?  I am trying to see if we can offer a workaround in the interim (see the sketch after this list).
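On question 3, a rough illustration, assuming the usual cgroup v1 layout: CFS throttling only applies to cgroups that have a quota set, and a quota is only written when a CPU limit is configured, so a quick node-side check is to look at cpu.cfs_quota_us for the container's cgroup (the path below is a placeholder):

from pathlib import Path

cgroup = Path("/sys/fs/cgroup/cpu/kubepods.slice")   # placeholder container cgroup
quota = int((cgroup / "cpu.cfs_quota_us").read_text())
period = int((cgroup / "cpu.cfs_period_us").read_text())

if quota == -1:
    print("no CPU quota set -> this cgroup cannot be CFS-throttled")
else:
    print(f"quota allows {quota / period:.2f} CPUs per {period}us period")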

Thanks

Anand

Comment 8 Anand Paladugu 2020-10-13 23:17:02 UTC
The sosreport is attached to the case.  The kernel version is 4.18.0-193.14.3.el8_2.x86_64.  Can you check whether the fix is backported to this kernel version?

Thanks

Anand

Comment 9 Jiří Mencák 2020-10-14 06:48:36 UTC
The CFS scheduler code in 4.18.0-193.23.1.el8_2 (OCP 4.5.13) and 4.18.0-193.14.3.el8_2 (OCP 4.5.6) is exactly the same.
I ran a reproducer (https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1) for
https://bugzilla.kernel.org/show_bug.cgi?id=198197 for several iterations on both OCP 4.5.13 and 4.5.6, and the reproducer
did not discover any throttling.  It could be that there is another CFS issue the reproducer above doesn't catch.
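For context, a rough Python sketch of what that class of reproducer does when run under a CPU quota (the iteration count and thresholds are illustrative, not taken from the gist): burn a few milliseconds of CPU per iteration, sleep, and flag iterations whose wall-clock time is far longer than the CPU actually burned, which points at throttling even though the cgroup stays well under its quota on average:

import time

BURN_MS = 5          # CPU time to burn per iteration
ITERATIONS = 100

for i in range(ITERATIONS):
    start = time.monotonic()
    deadline = time.process_time() + BURN_MS / 1000.0
    while time.process_time() < deadline:    # busy-loop for ~5ms of CPU time
        pass
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > 4 * BURN_MS:              # wall time far above CPU burned
        print(f"iteration {i}: took {elapsed_ms:.1f}ms, likely throttled")
    time.sleep(0.1)                           # stay well under the quota on average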

Jenifer, can you take a look, please?

Comment 10 Anand Paladugu 2020-10-15 02:51:29 UTC
Team,

The customer has conveyed that distributing the replicas across many nodes vs. two helped reduce the throttling a bit.  But in both cases (i.e. running the replicas on many nodes vs. two), 3.11 performed better than 4.5.

I am still trying to get numbers for 3.11 vs. 4.5.

The customer configured requests and limits equally for the pod (to get Guaranteed QoS).  Would increasing the limits or removing the limits help in this situation (to reduce or avoid throttling)?
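To make the trade-off concrete (the numbers below are illustrative, not from the case): with requests equal to limits the pod is Guaranteed QoS, and the CPU limit is translated into a CFS quota of limit x period for each (by default) 100 ms period, so any burst above that within a period is throttled; raising or removing the limit raises or removes that ceiling:

period_us = 100_000                        # default cpu.cfs_period_us
for limit_cores in (0.5, 2, 4):            # illustrative CPU limits
    quota_us = int(limit_cores * period_us)
    print(f"cpu limit {limit_cores} -> cpu.cfs_quota_us={quota_us} "
          f"(at most {quota_us / 1000:.0f}ms of CPU time per 100ms period)")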

Thanks

Anand

Comment 11 Anand Paladugu 2020-10-15 15:29:17 UTC
Team, the customer provided some clarity and graphs showing the differences between 3.11 and 4.5.6.  My comments in response and the files are all in the case.  Please review and comment.

Thanks

Anand

Comment 22 Jiří Mencák 2020-11-25 12:00:11 UTC
As no response was received in the two weeks since requesting information on whether the issues still persist, and the customer case shows as Closed, I am closing this BZ.  Please re-open if the problems persist.