Created attachment 1750895 [details] logs from operator

Description of problem:
Probably an issue with the catalog operator. The customer runs OCP 4.6.8 on VMware. Fio and etcd-perf show 1.5ms, but etcdctl check perf fails even for the 'm' test. The etcd logs are full of 'took too long' messages, and compaction takes around 200ms.

As the customer noticed:
There is CPU throttling on one of our master nodes. When we investigated the cause of the CPU consumption, we found that the catalog-operator pod's CPU consumption is the root cause of the CPU throttling on the master node. The catalog-operator pod had been running on master01 until 10:30 am this morning; see the CPU consumption report in the attached master01_Cpu.JPG file. After that we moved the pod to master03, and the CPU throttling also moved to master03 after 10:35 am this morning; see the CPU consumption report in the attached master03_Cpu.JPG file. As a cross-check we looked at master02's CPU usage: there is nothing unusual in its graph in master02_Cpu.JPG. The catalog-operator pod's resource usage parallels the peaks in the master nodes' CPU usage (catalog-operator-pod.JPG).
Created attachment 1750898 [details] cpu3
Created attachment 1750899 [details] cpu1
Created attachment 1750900 [details] cpu2
Created attachment 1750901 [details] cpu of operator
Transferred to OLM since catalog-operator is not part of the Service Catalog component.
Created attachment 1751236 [details] cpu profile
Changing the parameter "interval" from "30m0s" to "10m0s" in the CatalogSources, as shown below, seems to solve the problem, at least in the clusters where this workaround has been tested.

  updateStrategy:
    registryPoll:
      interval: 10m0s
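For anyone applying the workaround to several CatalogSources, the change can be expressed as a JSON merge patch. The following is only a minimal sketch: the helper names (registry_poll_patch, oc_patch_command) are hypothetical, the default namespace and source name are taken from the output later in this report, and the script only prints the oc patch command rather than running it.

```python
import json
import shlex
from typing import List


def registry_poll_patch(interval: str) -> str:
    """Build a JSON merge patch that sets spec.updateStrategy.registryPoll.interval."""
    patch = {"spec": {"updateStrategy": {"registryPoll": {"interval": interval}}}}
    return json.dumps(patch, separators=(",", ":"))


def oc_patch_command(
    name: str,
    namespace: str = "openshift-marketplace",
    interval: str = "10m0s",
) -> List[str]:
    """Return the argv for an `oc patch` call; nothing is executed here."""
    return [
        "oc", "patch", "catalogsource", name,
        "-n", namespace,
        "--type", "merge",
        "-p", registry_poll_patch(interval),
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(oc_patch_command("redhat-operators")))
```

The argv list can be passed to subprocess.run once the printed command has been reviewed; whether the workaround helps on a given cluster still has to be confirmed there.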
*** Bug 1921849 has been marked as a duplicate of this bug. ***
KCS article written describing this problem: https://access.redhat.com/solutions/5759731
Created 15 CatalogSources. I don't see any significant CPU consumption, and the updateStrategy reflects the changes proposed in the PR.

OCP Version: 4.7.0-0.nightly-2021-02-01-060607
OLM version: 0.17.0
git commit: f0875583f6988c91719ef721829ac1d305054a4d

oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-698q7               1/1     Running   0          48m
community-operators-xcpt4               1/1     Running   0          103m
iib-1-zskbg                             1/1     Running   0          5h3m
iib-10-zwcfp                            1/1     Running   0          4h43m
iib-11-5b749                            1/1     Running   0          4h43m
iib-12-75f7n                            1/1     Running   0          4h42m
iib-13-j2qd8                            1/1     Running   0          4h42m
iib-14-sks9s                            1/1     Running   0          4h42m
iib-15-v6q4d                            1/1     Running   0          4h41m
iib-2-mbc4w                             1/1     Running   0          4h59m
iib-3-b4w8z                             1/1     Running   0          4h51m
iib-4-k24f6                             1/1     Running   0          4h51m
iib-5-lfbrz                             1/1     Running   0          4h51m
iib-6-jq942                             1/1     Running   0          4h51m
iib-7-fw699                             1/1     Running   0          4h50m
iib-8-9dzfz                             1/1     Running   0          4h50m
iib-9-jwjn7                             1/1     Running   0          4h49m
marketplace-operator-78c7ccbb67-8v5hd   1/1     Running   0          6h28m
qe-app-registry-82hlw                   1/1     Running   0          3h28m
redhat-marketplace-c4q8w                1/1     Running   0          6h28m
redhat-operators-mm87g                  1/1     Running   0          14m

oc get catalogsource redhat-operators -o yaml -n openshift-marketplace
spec:
  displayName: Red Hat Operators
  icon:
    base64data: ""
    mediatype: ""
  image: registry.redhat.io/redhat/redhat-operator-index:v4.6
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: redhat-operators.openshift-marketplace.svc:50051
    lastConnect: "2021-02-01T19:11:16Z"
    lastObservedState: READY
  latestImageRegistryPoll: "2021-02-01T19:10:46Z"
  registryService:
    createdAt: "2021-02-01T13:01:56Z"
    port: "50051"
    protocol: grpc
    serviceName: redhat-operators
    serviceNamespace: openshift-marketplace
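With this many CatalogSources, checking each updateStrategy by eye gets tedious. As a rough sketch (not part of the verification above), the poll interval of every source could be read from the JSON produced by "oc get catalogsource -n openshift-marketplace -o json". The function name poll_intervals is hypothetical, and the sample document is a trimmed stand-in for real output.

```python
import json
from typing import Dict, Optional


def poll_intervals(catalogsource_list: dict) -> Dict[str, Optional[str]]:
    """Map each CatalogSource name to its registryPoll interval (None if unset)."""
    result: Dict[str, Optional[str]] = {}
    for item in catalogsource_list.get("items", []):
        name = item["metadata"]["name"]
        # Each level may be missing or null, so fall back to an empty dict.
        spec = item.get("spec") or {}
        strategy = spec.get("updateStrategy") or {}
        registry_poll = strategy.get("registryPoll") or {}
        result[name] = registry_poll.get("interval")
    return result


if __name__ == "__main__":
    # Trimmed sample standing in for `oc get catalogsource -o json` output.
    sample = json.loads("""
    {"items": [
      {"metadata": {"name": "redhat-operators"},
       "spec": {"updateStrategy": {"registryPoll": {"interval": "10m0s"}}}},
      {"metadata": {"name": "iib-1"}, "spec": {"sourceType": "grpc"}}
    ]}
    """)
    for name, interval in poll_intervals(sample).items():
        print(name, interval or "no registryPoll configured")
```

In practice the saved command output would be loaded with json.load and passed to the function in place of the sample.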
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days