Bug 1920526 - catalog operator causing CPU spikes and bad etcd performance
Summary: catalog operator causing CPU spikes and bad etcd performance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Daniel Sover
QA Contact: Bruno Andrade
URL:
Whiteboard:
Duplicates: 1921849
Depends On:
Blocks: 1922919
 
Reported: 2021-01-26 13:40 UTC by peter ducai
Modified: 2023-09-18 00:24 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: For catalog sources with registry polling intervals that are multiples of the default 15 minutes (for example, 30 minutes), the reconciliation requeue delay around each polling deadline could be computed as zero, so the catalog update was reconciled continuously. Consequence: High CPU load around polling time for catalogs with these intervals. Fix: The logic that determines the reconciliation requeue delay now handles intervals that are multiples of 15 minutes greater than the 15-minute default (for example, 30 minutes). Result: No more continuous reconciliation, and CPU usage is normal around polling time.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:56:23 UTC
Target Upstream Version:
Embargoed:


Attachments
logs from operator (570.05 KB, text/plain) - 2021-01-26 13:40 UTC, peter ducai
cpu3 (123.21 KB, image/jpeg) - 2021-01-26 13:43 UTC, peter ducai
cpu1 (149.57 KB, image/jpeg) - 2021-01-26 13:43 UTC, peter ducai
cpu2 (127.52 KB, image/jpeg) - 2021-01-26 13:44 UTC, peter ducai
cpu of operator (116.03 KB, image/jpeg) - 2021-01-26 13:45 UTC, peter ducai
cpu profile (2.94 KB, application/gzip) - 2021-01-27 12:29 UTC, peter ducai


Links
GitHub operator-framework/operator-lifecycle-manager pull 1982 (closed): Bug 1920526: Fix zero-delay resyncs for certain registry update policies. Last updated 2021-02-19 21:25:03 UTC
Red Hat Knowledge Base (Solution) 5759731. Last updated 2021-02-24 13:08:16 UTC
Red Hat Product Errata RHSA-2020:5633. Last updated 2021-02-24 15:56:51 UTC

Description peter ducai 2021-01-26 13:40:24 UTC
Created attachment 1750895 [details]
logs from operator

Description of problem:

This is probably an issue with the catalog operator.

The customer runs OCP 4.6.8 on VMware. fio and etcd-perf show 1.5 ms, but 'etcdctl check perf' fails even for the 'm' (medium) workload. The etcd logs are full of 'took too long' messages, and compaction takes around 200 ms.

As the customer reported: there is CPU throttling on one of our master nodes. When we investigated the cause of the CPU consumption, we found that the catalog-operator pod's CPU consumption is the root cause of the CPU throttling on the master node.
The catalog-operator pod had been located on master01 until 10:30 am this morning, and you can see its CPU consumption report in the attached master01_Cpu.JPG file. After that we moved this pod to master03, and the CPU throttling also moved to master03 after 10:35 am. You can see the CPU consumption report in the attached master03_Cpu.JPG file. We looked at master02's CPU usage as a cross-check; there is no unusual usage in its graph (master02_Cpu.JPG).

The catalog-operator pod's resource usage parallels the peaks in the master nodes' CPU usage (catalog-operator-pod.JPG).
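
For reference, a sketch of how the checks mentioned above can be reproduced. The pod names are placeholders, and the namespaces assume a standard OpenShift 4.6 layout (catalog-operator in openshift-operator-lifecycle-manager, etcd in openshift-etcd):

# Confirm that the catalog-operator pod is the CPU consumer:
oc adm top pods -n openshift-operator-lifecycle-manager

# Run the etcd performance check from an etcd member pod;
# --load="m" selects the medium workload mentioned above:
oc exec -n openshift-etcd <etcd-pod> -- etcdctl check perf --load="m"

# Look for the slow-request warnings in the etcd logs:
oc logs -n openshift-etcd <etcd-pod> -c etcd | grep "took too long"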




Comment 1 peter ducai 2021-01-26 13:43:08 UTC
Created attachment 1750898 [details]
cpu3

Comment 2 peter ducai 2021-01-26 13:43:33 UTC
Created attachment 1750899 [details]
cpu1

Comment 3 peter ducai 2021-01-26 13:44:18 UTC
Created attachment 1750900 [details]
cpu2

Comment 4 peter ducai 2021-01-26 13:45:30 UTC
Created attachment 1750901 [details]
cpu of operator

Comment 5 Jesus M. Rodriguez 2021-01-26 16:42:12 UTC
Transferred to OLM, since catalog-operator is not part of the Service Catalog component.

Comment 7 peter ducai 2021-01-27 12:29:48 UTC
Created attachment 1751236 [details]
cpu profile

Comment 16 Lucas López Montero 2021-01-29 15:34:12 UTC
Changing the "interval" parameter from "30m0s" to "10m0s" in the CatalogSources, as shown below, seems to solve the problem, at least in the clusters where this workaround has been tested.

updateStrategy:
  registryPoll:
    interval: 10m0s
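
The same change can be applied non-interactively. A minimal sketch, assuming the affected CatalogSource is redhat-operators in the openshift-marketplace namespace (substitute the name of any other affected catalog):

# Set the registry polling interval to 10 minutes:
oc patch catalogsource redhat-operators -n openshift-marketplace \
  --type merge -p '{"spec":{"updateStrategy":{"registryPoll":{"interval":"10m0s"}}}}'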

Comment 18 Lucas López Montero 2021-01-29 15:38:21 UTC
*** Bug 1921849 has been marked as a duplicate of this bug. ***

Comment 21 Lucas López Montero 2021-02-01 09:35:56 UTC
KCS article written describing this problem: https://access.redhat.com/solutions/5759731

Comment 24 Bruno Andrade 2021-02-01 20:01:15 UTC
Created 15 CatalogSources; I don't see any significant CPU consumption, and I see the updateStrategy reflecting the changes proposed in the PR:



OCP Version: 4.7.0-0.nightly-2021-02-01-060607
OLM version: 0.17.0
git commit: f0875583f6988c91719ef721829ac1d305054a4d

oc get pods -n openshift-marketplace                                                                                                                                                                                                
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-698q7               1/1     Running   0          48m
community-operators-xcpt4               1/1     Running   0          103m
iib-1-zskbg                             1/1     Running   0          5h3m
iib-10-zwcfp                            1/1     Running   0          4h43m
iib-11-5b749                            1/1     Running   0          4h43m
iib-12-75f7n                            1/1     Running   0          4h42m
iib-13-j2qd8                            1/1     Running   0          4h42m
iib-14-sks9s                            1/1     Running   0          4h42m
iib-15-v6q4d                            1/1     Running   0          4h41m
iib-2-mbc4w                             1/1     Running   0          4h59m
iib-3-b4w8z                             1/1     Running   0          4h51m
iib-4-k24f6                             1/1     Running   0          4h51m
iib-5-lfbrz                             1/1     Running   0          4h51m
iib-6-jq942                             1/1     Running   0          4h51m
iib-7-fw699                             1/1     Running   0          4h50m
iib-8-9dzfz                             1/1     Running   0          4h50m
iib-9-jwjn7                             1/1     Running   0          4h49m
marketplace-operator-78c7ccbb67-8v5hd   1/1     Running   0          6h28m
qe-app-registry-82hlw                   1/1     Running   0          3h28m
redhat-marketplace-c4q8w                1/1     Running   0          6h28m
redhat-operators-mm87g                  1/1     Running   0          14m


oc get catalogsource redhat-operators -o yaml -n openshift-marketplace 
spec:
  displayName: Red Hat Operators
  icon:
    base64data: ""
    mediatype: ""
  image: registry.redhat.io/redhat/redhat-operator-index:v4.6
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
status:
  connectionState:
    address: redhat-operators.openshift-marketplace.svc:50051
    lastConnect: "2021-02-01T19:11:16Z"
    lastObservedState: READY
  latestImageRegistryPoll: "2021-02-01T19:10:46Z"
  registryService:
    createdAt: "2021-02-01T13:01:56Z"
    port: "50051"
    protocol: grpc
    serviceName: redhat-operators
    serviceNamespace: openshift-marketplace

Comment 32 errata-xmlrpc 2021-02-24 15:56:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 33 Red Hat Bugzilla 2023-09-18 00:24:23 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

