{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configs",scope="resource",verb="GET"} 0.7333333333333333 Looks like a check that happens frequently but should not happen that often. Contributes to load on the cluster.
PR https://github.com/openshift/cluster-samples-operator/pull/138 is some low-hanging fruit. After that, if need be, access to diagnostic data (operator pod logs) associated with that sample, to make sure something unexpected is not happening, would be helpful. Or we could entertain some non-trivial ideas I have.
I also have a diagnostic PR up, https://github.com/openshift/cluster-samples-operator/pull/140 (we may leave that debug output in long term).
Moving this to POST as #138 is in the merge queue. Will use diagnostic PR #140 to verify.
I'm compiling 2 to 3 commits (1 for imagestreams, 1 for templates, possibly 1 for secrets) that should greatly reduce the number of GETs on the config object (from 100s to single digits). I should have them up in https://github.com/openshift/cluster-samples-operator/pull/140 later today. We can then debate the risk / reward as needed. None of them are overly complex. The imagestream change has the biggest line delta; the others are simple. Depending on image stream import speed and stagger, it is currently a toss-up whether templates or imagestreams generate more GETs.
PR https://github.com/openshift/cluster-samples-operator/pull/140 updated with optimizations. Will move to ON_QA when it merges.
Use of the shared informer cache store for config CRD GETs has also been added to https://github.com/openshift/cluster-samples-operator/pull/140
OK, the PR has merged. We settled on use of the shared informer cache store for config CRD GETs plus the simple/safe template optimization. The more complex imagestream change did not have an optimal flow during the operator's upgrade test, so it was dropped.
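For reference, a minimal sketch of the informer-cache pattern the merged PR relies on. The Config type, key function, and hard-coded "cluster" name below are illustrative stand-ins, not the operator's actual code: the point is that instead of issuing a GET to the API server on every pass, the controller reads the object from the local store that its existing watch already keeps populated.

package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// Illustrative stand-in for the samples Config CRD object.
type Config struct {
	Name string
}

// Key objects by name, mirroring a cluster-scoped singleton resource.
func configKey(obj interface{}) (string, error) {
	return obj.(*Config).Name, nil
}

func main() {
	// In the real operator a shared informer fills this store from its watch;
	// here it is seeded by hand purely to show the read path.
	store := cache.NewStore(configKey)
	_ = store.Add(&Config{Name: "cluster"})

	// Read from the cache instead of a per-reconcile GET against the API server.
	obj, exists, err := store.GetByKey("cluster")
	if err != nil || !exists {
		fmt.Println("config not cached yet; requeue and retry")
		return
	}
	fmt.Printf("read %q from the cache, no API server round trip\n", obj.(*Config).Name)
}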
Gabe, I only find this log in an openshift-kube-controller-manager pod with the old payload 4.1.0-0.nightly-2019-04-22-192604, and it only shows up one time: samples.operator.openshift.io/v1, Resource=configs scheduling.k8s.io/v1beta1. I don't find this kind of log, as in comment #0, in the cluster-samples-operator pod. Could you help point out where to get the log from comment #0? Thanks~
Hey XiuJuan, I believe that item in https://bugzilla.redhat.com/show_bug.cgi?id=1703233#c0 is a prometheus query that Clayton ran. I don't know exactly how he executed the query, what the acceptable values are, etc. We ended up just taking this as a performance exercise, with Ben ultimately setting me on the path of using the informer cache for GETs (since we already had a controller watch in place) vs. revisiting the master. That change should ultimately normalize things. Adam / Ben / Paul - with Clayton at Summit this week, do any of you know how Clayton ran the prometheus query? If they don't, XiuJuan, then unless Clayton manages to see this and chime in, I think a general regression test of the samples operator is sufficient. He'll undoubtedly raise performance with devex directly the next time he looks at it, if need be.
Gabe, you can use a prometheus query like this to see the api server traffic from the samples operator:

sum without (instance,verb,resource,subresource,code,scope,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))

Play with the args to see the rate from other clients.
Thanks Ben. So XiuJuan, here's a detailed set of steps:

1) oc project openshift-monitoring
2) oc get routes
3) copy the prometheus-k8s route and, putting "https://" in front of it, bring it up in your browser
4) click the "log in with openshift" icon ... don't use the initial username/password fields you see
5) you should get the openshift oauth prompt about letting the SA do stuff for you ... similar to logging into jenkins
6) when you get to the classic openshift login, supply kubeadmin and the kubeadmin password
7) to get a big list of clients, I modified Ben's query to:

sum without (instance,verb,resource,subresource,code,scope,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-*.*"}[5m]))

8) cluster-samples-operator looks really low compared both to the others and to Clayton's sample in comment #0. For example, I just got:

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https"} 0.007407407407407408

9) I see both it and others oscillate a good bit. And when I graph it, at roughly 10 minute intervals, I see bumps to around 0.77 ... presumably when the operator/controller's relist occurs
10) to prove that, and to break things down so as to catch the comment #0 qualifiers of resource="configs", scope="resource", and verb="GET", I modified Ben's query again to this:

sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))

That gave a result broken down by the various api objects the samples operator deals with:
- config obj
- secrets
- templates
- imagestreams
- clusteroperator

and the various operations:
- UPDATE
- GET
- PUT
- WATCH
- CREATE
- LIST

And what I see is that we are no longer doing the comment #0 combination of a GET against the operator's configs object at all. In other words, we've eliminated that entirely with the switch to Ben's suggestion of getting from the controller/informer cache. Try some of the prior queries if you like, but for me at least the query in 10) captures things such that we've verified the activity noted in comment #0 is no longer occurring.
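If clicking through the browser UI is awkward, the same query can be run against the Prometheus HTTP API. A minimal sketch under these assumptions: PROM_URL is the prometheus-k8s route from step 2) with "https://" in front, and TOKEN is a bearer token the route accepts (e.g. the output of `oc whoami -t` after logging in as kubeadmin); both names are illustrative, not anything the operator or cluster defines.

package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := os.Getenv("PROM_URL") // assumed: the prometheus-k8s route, https://...
	token := os.Getenv("TOKEN")      // assumed: a bearer token with access to the route

	query := `sum without (instance,code,job,namespace,pod,service) ` +
		`(rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))`

	// Prometheus exposes instant queries at /api/v1/query?query=<promql>.
	req, err := http.NewRequest("GET",
		promURL+"/api/v1/query?query="+url.QueryEscape(query), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	// Test clusters often use self-signed certs; verification is skipped here
	// only because this is a throwaway diagnostic query.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // raw JSON vector of per-resource / per-verb rates
}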
Gabe, Ben, you are awesome and so kind ^_^, thanks a million times~

With the query from step 10):

sum without (instance,code,job,namespace,pod,service) (rate (apiserver_request_count{client=~"cluster-samples-operator.*"}[5m]))

on the latest payload 4.1.0-0.nightly-2019-05-06-011159, the results are below.

Only these two items involve configs; they are very low, and there is no GET verb:

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="configs",scope="cluster",verb="LIST"} 0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="configs",scope="cluster",verb="WATCH"} 0.003703703703703704

Other items, like the templates resource with the GET verb, show a higher value, 0.3407407407407408 - should we treat that as an issue? But I will verify this bug, since the original issue is resolved.

{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="GET"} 0.3407407407407408
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="WATCH"} 0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="LIST"} 0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="POST"} 0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="GET"} 0.17777777777777778
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="LIST"} 0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="imagestreams",scope="namespace",verb="WATCH"} 0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="templates",scope="namespace",verb="LIST"} 0
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="clusteroperators",scope="resource",verb="GET"} 0.011111111111111112
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="PUT"} 0.003703703703703704
{client="cluster-samples-operator/v0.0.0 (linux/amd64) kubernetes/$Format",contentType="application/json",endpoint="https",resource="secrets",scope="namespace",verb="WATCH"} 0.003703703703703704
Hey XiuJuan, yep, your data looks similar to mine. On the template GET (and maybe the imagestream GET as well), I was never given a threshold on what constitutes "too much". That said, we could employ similar changes for those two types. Marking this bug verified seems correct, though, and unless Clayton dictates it for 4.1.0, tackling those two feels like a 4.1.x or 4.2 item.
Gabe, could you keep working on a fix for the template/imagestream GET events? I will not change this bug's status, but I will keep an eye on it, is that ok?
Hey XiuJuan Wang - just confirmed with Adam, Ben, and Clayton that the change is too risky to drop into 4.1.0 now. I'm about to clone this bug to track imagestreams/templates for 4.2, and should have a PR up relatively soon.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758