Bug 1707834 - Cluster samples operator has a high read rate from operator
Summary: Cluster samples operator has a high read rate from operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Templates
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.2.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On: 1703233
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-05-08 14:03 UTC by Gabe Montero
Modified: 2019-10-16 06:28 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Even though the samples operator maintained controller/informer based watches for secrets, imagestreams, and templates, it was still using a direct openshift go client to GET those resources during processing. Consequence: Unnecessary activity against the OpenShift API server was a result. Fix: Leverage the informer/lister API to leverage the cache maintained in the controller via its watches on those api object types. Result: GET Activity against the OpenShift API server has been reduced.
Clone Of: 1703233
Environment:
Last Closed: 2019-10-16 06:28:32 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:28:43 UTC

Description Gabe Montero 2019-05-08 14:03:14 UTC
+++ This bug was initially created as a clone of Bug #1703233 +++

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configs",scope="resource",verb="GET"}	0.7333333333333333

Looks like a check that happens frequently but should not happen that often.  Contributes to load on the cluster.

--- Additional comment from Gabe Montero on 2019-04-26 11:14:41 UTC ---

PR https://github.com/openshift/cluster-samples-operator/pull/138 is some low hanging fruit

After that, if need be, access to diagnositc data (operator pod logs) associated with that sample
to make sure something unexpected is not happening would be helpful.

--- Additional comment from Gabe Montero on 2019-04-26 11:18:19 UTC ---

Or entertaining some non-trivial ideas I have.

--- Additional comment from Gabe Montero on 2019-04-26 12:41:02 UTC ---

also have a diagnostic PR up https://github.com/openshift/cluster-samples-operator/pull/140

(where maybe we leave that dbg in long term)

--- Additional comment from Adam Kaplan on 2019-04-26 16:18:17 UTC ---

Moving this to POST as #138 is in merge queue. Will use diagnostic PR #140 to verify.

--- Additional comment from Gabe Montero on 2019-04-29 19:44:35 UTC ---

I'm compiling 2 to 3 commits (1 for imagestreams, 1 for templates, possibly 1 for secrets) that should greatly reduce the number of gets on the config object (from 100s to single digits).

I should have them up in https://github.com/openshift/cluster-samples-operator/pull/140 later today.

We can then debate the risk / reward as needed.

None of them are overly overly complex.  The imagestream changed line delta is the biggest change.  The others are simple.

Depending on image stream import speed and stagger, it is a race between whether templates or imagestreams generate more gets currently.

--- Additional comment from Gabe Montero on 2019-04-29 20:25:39 UTC ---

PR https://github.com/openshift/cluster-samples-operator/pull/140 updated with optimizations.

Will move to on qa when it merges.

--- Additional comment from Gabe Montero on 2019-04-30 17:47:36 UTC ---

use of the shared informer cache store for congfig CRD gets has also been added to https://github.com/openshift/cluster-samples-operator/pull/140

--- Additional comment from Gabe Montero on 2019-05-01 19:20:56 UTC ---

OK the PR has merged ... settled on use of the shared informer cache store for congfig CRD gets and the simple/safe template optimization

ended up the more complex imagestream change did not have an optimal flow during the operator's upgrade test, so that was dropped

--- Additional comment from XiuJuan Wang on 2019-05-06 10:11:45 UTC ---

Gabe, 
I only find this log in pod of openshift-kube-controller-manager with old payload 4.1.0-0.nightly-2019-04-22-192604, and only show one time:
samples.operator.openshift.io/v1, Resource=configs scheduling.k8s.io/v1beta1
don't find this kind log as comment #0 in cluster-samples-operator pod.
Could you help point where to get the log as comment #0?

Thanks~

--- Additional comment from Gabe Montero on 2019-05-06 14:09:43 UTC ---

Hey XiuJuan

I believe that item in https://bugzilla.redhat.com/show_bug.cgi?id=1703233#c0 is a prometheus query that Clayton ran.

I don't know how exactly he executed the query, acceptable values, etc.

We ended up just taking this as a performance exercise, with Ben ultimately setting me on the path of using the informer
cache for GETs since we already had a controller watch in place vs. revisiting the master.  That change should ultimately
normalize  things.

Adam / Ben / Paul - with Clayton at Summit this week, any of you know how Clayton ran the prometheus query?

If they don't XiuJuan, unless Clayton manages to see this in chime in, I think a general regression test of samples operator 
is sufficient.

He'll undoubtedly chime in to devex directly on performance next time he looks at it if need be.

--- Additional comment from Ben Parees on 2019-05-06 15:08:35 UTC ---

Gabe, you can use a prometheus query like this to see the api server traffic from the samples operator:

sum without (instance,verb,resource,subresource,code,scope,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))


play with the args to see the rate from other clients.

--- Additional comment from Gabe Montero on 2019-05-06 20:11:57 UTC ---

Thanks Ben.

So XiuJuan, here's a detailed set of steps:

1) oc project openshift-monitoring 
2) oc get routes
3) copy the prometheus-k8s route and putting a "https://" in front of it, bring it up in your browser
4) click the login from openshift icons ... don't use the initial username/password fields you see
5) you should get the openshift oauth prompt about letting the SA do stuff for you ... similar to logging into jenkins
6) when you get to the classic openshfit login, supply kubeadmin and the kubeadmin password
7) to get a big list of clients, I modified Ben's query to 

sum without (instance,verb,resource,subresource,code,scope,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-*.*"}[5m]))

8) cluster samples operator looks really low compared both to the others, as well as compared to Clayton's sample in #Comment 0

For example, I just got:

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https"}	0.007407407407407408

9) I see both it and others oscillate a good bit.  And when I graph it, at roughly 10 minute intervals, I see bumps to around 0.77 ... presumably when
the operator/controller's relist occurs

10) to prove that, and break things down to presumably catch the qualifiers in #Comment 0 of resource="configs", scope="resource", and verb="GET" I modified
Ben's query again to this:

sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))

That gave a a result that broke down the varioud api objs the samples operator deals with
- config obj
- secrets
- templates
- imagestreams
- clusteroperator

And the various operations:
- UPDATE
- GET
- PUT
- WATCH
- CREATE
- LIST

And what I see is that we are no longer even doing the combination from #Comment 0 of a GET against the operator configs object

So in other words, we've eliminated that entirely since to switch to Ben's suggestion of getting from the controller/informer cache.

Try some of the prior queries if you like, but for me at least the query in 10) captures things such that we verified that the activity 
noted in #Comment 0 is no longer occurring

--- Additional comment from XiuJuan Wang on 2019-05-07 03:40:15 UTC ---

Gabe, Ben
You are aswsome and so kind ^_^, thanks a millon times~

With query in 10 minute intervals:
sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))
with latest payload 4.1.0-0.nightly-2019-05-06-011159, the result of query:
Only these two items about config , they are very low, no GET verb.

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="configs",scope="cluster",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="configs",scope="cluster",verb="WATCH"}    0.003703703703703704

Other items , like templates resources with GET verb, there is higher value, 0.3407407407407408, should we treat it a issue?
But I will verify this bug, since the original issue resolved.

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="GET"}	0.3407407407407408
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="WATCH"}	0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="POST"}	0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="GET"}	0.17777777777777778
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="WATCH"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="clusteroperators",scope="resource",verb="GET"}	0.011111111111111112
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="PUT"}	0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="WATCH"}	0.003703703703703704

--- Additional comment from Gabe Montero on 2019-05-07 14:01:00 UTC ---

Hey XiuJuan

Yep your data looks similar to mine.

On the template GET (and maybe you can imagestream GET as well), I was never given
a threshold on what constitutes "too much".

That said, we could employ similar changes for those two types.

Marking this bug verified seems correct, though, and 
Unless Clayton dictates it for 4.1.0, tackling those two feels like a 4.1.x or 4.2.

--- Additional comment from XiuJuan Wang on 2019-05-08 09:06:26 UTC ---

Gabe, 
Could you keep to fix the template|imagestream GET events?
I will not change this bug status, but keep an eye on it, is that ok?

--- Additional comment from Gabe Montero on 2019-05-08 14:02:23 UTC ---

Hey XiuJuan Wang - just confirmed with Adam, Ben, and Clayton, that the change is too risky to drop now into 4.1.0

I'm about to clone this bug to track imagestream/templates for 4.2, and should have a PR up relatively soon.

Comment 1 Gabe Montero 2019-05-22 14:51:09 UTC
OK PR https://github.com/openshift/cluster-samples-operator/pull/146 has merged.

It now employs the informer/lister for GETs on secrets, imagestreams, and templates, in addition to the use for the operator's config object which occurred with https://bugzilla.redhat.com/show_bug.cgi?id=1703233

Not when you run the query we settled on with the last bug, I saw only 0's for GETs for those 4 types.

You'll only see GET activity for the clusteroperator object.  As we are not watching those objects, there is not informer/lister cache to leverage.  And we do NOT want to start watching that type at this time.

Comment 2 XiuJuan Wang 2019-05-23 09:37:02 UTC
Verified with 4.2.0-0.ci-2019-05-23-003410 payload.

Only GET activity for the clusteroperator is shown, the GET events for configs,secrets, imagestreams, and templates all gone.

Comment 3 errata-xmlrpc 2019-10-16 06:28:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922


Note You need to log in before you can comment on or make changes to this bug.