Bug 1989600 - Registry server RSS and CPU utilization too high during normal operation
Summary: Registry server RSS and CPU utilization too high during normal operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.9
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.9.0
Assignee: Ben Luddy
QA Contact: Tom Buskey
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-03 14:22 UTC by Ben Luddy
Modified: 2021-10-18 17:44 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:44:13 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift operator-framework-olm pull 147 0 None None None 2021-08-03 14:32:43 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:44:29 UTC

Description Ben Luddy 2021-08-03 14:22:36 UTC
The registry server binary that exposes catalog contents to cluster components (including catalog-operator and packageserver) at runtime tends to balloon its memory footprint and waste CPU cycles when transferring all catalog entries to catalog-operator via its ListBundles API.

Steps to reproduce:

1. Extract a registry database file from redhat-operator-index:

$ docker pull registry.redhat.io/redhat/redhat-operator-index:v4.8
$ docker cp $(docker create registry.redhat.io/redhat/redhat-operator-index:v4.8):/database/index.db redhat.db

2. Serve the registry API using opm. This will block, you will need to either background the job or open a second terminal after this step:

$ opm registry serve -d redhat.db

3. Query the ListBundles endpoint:

$ grpcurl -plaintext localhost:50051 api.Registry/ListBundles > /dev/null

4. Measure the resident set size (the column RES in the output of 'top' below) and CPU time (TIME+):

$ top -b -n1 -p $(ps aux | grep 'opm registry serve -d redhat.db' | grep -v grep | awk '{print $2}')

top - 10:48:43 up 52 days, 17:31,  1 user,  load average: 0.82, 0.71, 0.95
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.1 us,  0.8 sy,  0.0 ni, 94.2 id,  0.0 wa,  0.8 hi,  0.0 si,  0.0 st
MiB Mem :  31948.9 total,   2276.9 free,   7766.4 used,  21905.6 buff/cache
MiB Swap:  24264.0 total,  22392.7 free,   1871.3 used.  20637.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2469916 bluddy    20   0 1843444 197988  28524 S   0.0   0.6   0:03.87 opm

Comment 1 Ben Luddy 2021-08-03 14:53:55 UTC
On my machine, following the repro steps using opm built with the proposed changes and using an identical "redhat.db", top shows:

top - 10:50:31 up 52 days, 17:33,  1 user,  load average: 0.76, 0.73, 0.93
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  0.8 sy,  0.0 ni, 97.5 id,  0.0 wa,  0.8 hi,  0.0 si,  0.0 st
MiB Mem :  31948.9 total,   2480.5 free,   7505.3 used,  21963.1 buff/cache
MiB Swap:  24264.0 total,  22392.4 free,   1871.6 used.  20876.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2470165 bluddy    20   0 1633064  42164  27476 S   0.0   0.1   0:00.29 opm

Reducing RES from 197988KiB to 42164KiB and TIME+ from 3.87s to 0.29s.

Comment 3 Tom Buskey 2021-08-04 16:51:53 UTC
Thank you for the detailed directions!

opm version
Version: version.Version{OpmVersion:"v1.17.5-21-g5fc0f422", GitCommit:"5fc0f422", BuildDate:"2021-08-03T14:52:16Z", GoOs:"linux", GoArch:"amd64"}

opm registry serve -d redhat.db
WARN[0000] unable to set termination log path            error="open /dev/termination-log: permission denied"
INFO[0000] Keeping server open for infinite seconds      database=redhat.db port=50051
INFO[0000] serving registry                              database=redhat.db port=50051

 top -b -n1 -p $(ps aux | grep 'opm registry serve -d redhat.db' | grep -v grep | awk '{print $2}')
top - 11:21:55 up 34 days,  1:21,  2 users,  load average: 4.25, 4.41, 3.73
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.6 us,  5.7 sy,  0.0 ni, 68.0 id,  0.0 wa,  0.8 hi,  0.8 si,  0.0 st
MiB Mem :  31949.7 total,   4738.5 free,  17990.3 used,   9221.0 buff/cache
MiB Swap:  20168.0 total,  17991.2 free,   2176.8 used.  11935.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2857413 tbuskey   20   0 1633132  43948  27588 S   0.0   0.1   0:00.45 opm

20 minutes later: 
top - 12:48:23 up 34 days,  2:47,  2 users,  load average: 3.24, 3.66, 3.69
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.7 us,  2.5 sy,  0.0 ni, 89.1 id,  0.0 wa,  0.8 hi,  0.8 si,  0.0 st
MiB Mem :  31949.7 total,   2111.8 free,  18705.1 used,  11132.9 buff/cache
MiB Swap:  20168.0 total,  17994.7 free,   2173.2 used.  10911.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2857413 tbuskey   20   0 1633132  40812  27636 S   0.0   0.1   0:00.99 opm

/lgtm

Comment 6 errata-xmlrpc 2021-10-18 17:44:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759


Note You need to log in before you can comment on or make changes to this bug.