Bug 1703233 - Cluster samples operator has a high read rate from operator
Summary: Cluster samples operator has a high read rate from operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1707834
 
Reported: 2019-04-25 20:41 UTC by Clayton Coleman
Modified: 2019-06-04 10:48 UTC
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1707834
Environment:
Last Closed: 2019-06-04 10:48:02 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:48:12 UTC

Description Clayton Coleman 2019-04-25 20:41:06 UTC
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configs",scope="resource",verb="GET"}	0.7333333333333333

This looks like a check that happens more frequently than it should; it contributes to load on the cluster.

Comment 1 Gabe Montero 2019-04-26 11:14:41 UTC
PR https://github.com/openshift/cluster-samples-operator/pull/138 is some low-hanging fruit.

After that, if need be, access to diagnostic data (operator pod logs) associated with that sample
would be helpful, to make sure something unexpected is not happening.

Comment 2 Gabe Montero 2019-04-26 11:18:19 UTC
Or entertaining some non-trivial ideas I have.

Comment 3 Gabe Montero 2019-04-26 12:41:02 UTC
I also have a diagnostic PR up: https://github.com/openshift/cluster-samples-operator/pull/140

(we may leave that debug logging in long term)

Comment 4 Adam Kaplan 2019-04-26 16:18:17 UTC
Moving this to POST as #138 is in merge queue. Will use diagnostic PR #140 to verify.

Comment 5 Gabe Montero 2019-04-29 19:44:35 UTC
I'm compiling 2 to 3 commits (1 for imagestreams, 1 for templates, possibly 1 for secrets) that should greatly reduce the number of gets on the config object (from 100s to single digits).

I should have them up in https://github.com/openshift/cluster-samples-operator/pull/140 later today.

We can then debate the risk / reward as needed.

None of them are overly complex.  The imagestream change has the biggest changed-line delta; the others are simple.

Depending on imagestream import speed and stagger, it is currently a toss-up whether templates or imagestreams generate more GETs.

Comment 6 Gabe Montero 2019-04-29 20:25:39 UTC
PR https://github.com/openshift/cluster-samples-operator/pull/140 updated with optimizations.

Will move to ON_QA when it merges.

Comment 7 Gabe Montero 2019-04-30 17:47:36 UTC
Use of the shared informer cache store for config CRD GETs has also been added to https://github.com/openshift/cluster-samples-operator/pull/140

Comment 8 Gabe Montero 2019-05-01 19:20:56 UTC
OK, the PR has merged ... we settled on use of the shared informer cache store for config CRD GETs and the simple/safe template optimization.

It turned out the more complex imagestream change did not have an optimal flow during the operator's upgrade test, so that was dropped.
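
For reference, the informer-cache pattern roughly amounts to the sketch below (illustrative only -- the function name, variable names, and import path are my shorthand here, not the operator's actual code):

    package samplescache

    import (
        "fmt"

        "k8s.io/client-go/tools/cache"

        // illustrative import path for the samples Config type
        samplesv1 "github.com/openshift/api/samples/v1"
    )

    // getConfig reads the samples Config object out of the store that the shared
    // informer already keeps in sync via its WATCH, instead of issuing a GET to
    // the API server on every reconcile.
    func getConfig(informer cache.SharedIndexInformer) (*samplesv1.Config, error) {
        // the samples Config is a cluster-scoped singleton named "cluster"
        obj, exists, err := informer.GetStore().GetByKey("cluster")
        if err != nil {
            return nil, err
        }
        if !exists {
            return nil, fmt.Errorf("samples config %q not found in informer cache", "cluster")
        }
        return obj.(*samplesv1.Config), nil
    }

Since the controller already maintains a watch on the Config object, reads served from that store do not hit the API server at all; only writes (spec/status updates) still go to the master.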

Comment 9 XiuJuan Wang 2019-05-06 10:11:45 UTC
Gabe, 
I only found this log in the openshift-kube-controller-manager pod with the old payload 4.1.0-0.nightly-2019-04-22-192604, and it only showed one time:
samples.operator.openshift.io/v1, Resource=configs scheduling.k8s.io/v1beta1
I don't find the kind of log from comment #0 in the cluster-samples-operator pod.
Could you help point out where to get the log from comment #0?

Thanks~

Comment 10 Gabe Montero 2019-05-06 14:09:43 UTC
Hey XiuJuan

I believe that item in https://bugzilla.redhat.com/show_bug.cgi?id=1703233#c0 is a prometheus query that Clayton ran.

I don't know how exactly he executed the query, acceptable values, etc.

We ended up just taking this as a performance exercise, with Ben ultimately setting me on the path of using the informer
cache for GETs (since we already had a controller watch in place) instead of going back to the master each time. That change should ultimately
normalize things.

Adam / Ben / Paul - with Clayton at Summit this week, any of you know how Clayton ran the prometheus query?

If they don't know, XiuJuan, then unless Clayton manages to see this and chime in, I think a general regression test of the samples operator
is sufficient.

He'll undoubtedly chime in to devex directly on performance next time he looks at it if need be.

Comment 11 Ben Parees 2019-05-06 15:08:35 UTC
Gabe, you can use a prometheus query like this to see the api server traffic from the samples operator:

sum without (instance,verb,resource,subresource,code,scope,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))


play with the args to see the rate from other clients.

Comment 12 Gabe Montero 2019-05-06 20:11:57 UTC
Thanks Ben.

So XiuJuan, here's a detailed set of steps:

1) oc project openshift-monitoring 
2) oc get routes
3) copy the prometheus-k8s route, put "https://" in front of it, and bring it up in your browser
4) click the login-with-OpenShift option ... don't use the initial username/password fields you see
5) you should get the OpenShift OAuth prompt about letting the SA act on your behalf ... similar to logging into Jenkins
6) when you get to the classic OpenShift login, supply kubeadmin and the kubeadmin password
7) to get a big list of clients, I modified Ben's query to 

sum without (instance,verb,resource,subresource,code,scope,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-*.*"}[5m]))

8) the cluster samples operator looks really low, both compared to the others and compared to Clayton's sample in comment #0

For example, I just got:

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https"}	0.007407407407407408

9) I see both it and the others oscillate a good bit. And when I graph it, at roughly 10 minute intervals, I see bumps to around 0.77 ... presumably when
the operator/controller's relist occurs

10) to prove that, and to break things down to presumably catch the qualifiers in comment #0 of resource="configs", scope="resource", and verb="GET", I modified
Ben's query again to this:

sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))

That gave a result that broke down the various API objects the samples operator deals with:
- config obj
- secrets
- templates
- imagestreams
- clusteroperator

And the various operations:
- UPDATE
- GET
- PUT
- WATCH
- CREATE
- LIST

And what I see is that we are no longer even doing the combination from comment #0 of a GET against the operator configs object.

So in other words, we've eliminated that entirely since the switch to Ben's suggestion of getting from the controller/informer cache.

Try some of the prior queries if you like, but for me at least the query in 10) captures enough to verify that the activity
noted in comment #0 is no longer occurring.
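
If you'd rather pull the numbers programmatically than through the console, a rough Go sketch against the Prometheus HTTP API follows (the route host and token are placeholders you'd fill in from "oc get routes -n openshift-monitoring" and "oc whoami -t"; this is just an illustration, not anything we ship):

    package main

    import (
        "context"
        "fmt"
        "net/http"
        "time"

        "github.com/prometheus/client_golang/api"
        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    )

    // bearerRT adds an "Authorization: Bearer <token>" header to every request.
    type bearerRT struct {
        token string
        next  http.RoundTripper
    }

    func (b bearerRT) RoundTrip(req *http.Request) (*http.Response, error) {
        req = req.Clone(req.Context())
        req.Header.Set("Authorization", "Bearer "+b.token)
        return b.next.RoundTrip(req)
    }

    func main() {
        // placeholder: the prometheus-k8s route host from the openshift-monitoring project
        client, err := api.NewClient(api.Config{
            Address:      "https://prometheus-k8s-openshift-monitoring.apps.example.com",
            RoundTripper: bearerRT{token: "REPLACE_WITH_TOKEN", next: api.DefaultRoundTripper},
        })
        if err != nil {
            panic(err)
        }

        // the same breakdown query from step 10
        query := `sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))`

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
        if err != nil {
            panic(err)
        }
        if len(warnings) > 0 {
            fmt.Println("warnings:", warnings)
        }
        fmt.Println(result)
    }

(Depending on the cluster, you may also need to trust the router's serving certificate or set a custom TLS config on the transport.)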

Comment 13 XiuJuan Wang 2019-05-07 03:40:15 UTC
Gabe, Ben
You are awesome and so kind ^_^, thanks a million times~

With this query, at 10 minute intervals:
sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))
with the latest payload 4.1.0-0.nightly-2019-05-06-011159, the result of the query shows
only these two items for configs; they are very low, with no GET verb.

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="configs",scope="cluster",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="configs",scope="cluster",verb="WATCH"}    0.003703703703703704

For other items, like the templates resource with the GET verb, there is a higher value, 0.3407407407407408; should we treat it as an issue?
But I will verify this bug, since the original issue is resolved.

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="GET"}	0.3407407407407408
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="WATCH"}	0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="POST"}	0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="GET"}	0.17777777777777778
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="WATCH"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="LIST"}	0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="clusteroperators",scope="resource",verb="GET"}	0.011111111111111112
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="PUT"}	0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="WATCH"}	0.003703703703703704

Comment 14 Gabe Montero 2019-05-07 14:01:00 UTC
Hey XiuJuan

Yep your data looks similar to mine.

On the template GET (and maybe the imagestream GET as well), I was never given
a threshold on what constitutes "too much".

That said, we could employ similar changes for those two types.

Marking this bug verified seems correct, though.
Unless Clayton dictates it for 4.1.0, tackling those two feels like a 4.1.x or 4.2 item.

Comment 15 XiuJuan Wang 2019-05-08 09:06:26 UTC
Gabe, 
Could you keep working to fix the template/imagestream GET events?
I will not change this bug's status, but will keep an eye on it; is that ok?

Comment 16 Gabe Montero 2019-05-08 14:02:23 UTC
Hey XiuJuan Wang - just confirmed with Adam, Ben, and Clayton that the change is too risky to drop into 4.1.0 now.

I'm about to clone this bug to track imagestream/templates for 4.2, and should have a PR up relatively soon.

Comment 18 errata-xmlrpc 2019-06-04 10:48:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

