Bug 1899107 - [4.6] ironic-api used by metal3 is over provisioned and consumes a lot of RAM
Summary: [4.6] ironic-api used by metal3 is over provisioned and consumes a lot of RAM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: Dmitry Tantsur
QA Contact: Lubov
URL:
Whiteboard:
Depends On: 1894146
Blocks:
 
Reported: 2020-11-18 14:27 UTC by Dmitry Tantsur
Modified: 2021-02-08 13:51 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The Ironic API service embedded in baremetal IPI now uses 4 workers instead of 8, reducing RAM usage.
Clone Of: 1894146
Environment:
Last Closed: 2021-02-08 13:50:51 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift ironic-image pull 122 (closed): Bug 1899107: Limit the default number of API workers to 4 (last updated 2021-02-02 15:44:09 UTC)
Red Hat Product Errata RHSA-2021:0308 (last updated 2021-02-08 13:51:05 UTC)

Description Dmitry Tantsur 2020-11-18 14:27:15 UTC
+++ This bug was initially created as a clone of Bug #1894146 +++

Description of problem:

Memory consumption by metal3 is pretty high, at least when we compare it to some other things in the deployment. The metal3 pod as a whole uses 880+MiB, of which the ironic-api container is using 530MiB.

Inside the container, I see

# ps -o pid,user,%mem,rss,command -ax
    PID USER     %MEM   RSS COMMAND
      1 root      0.5 86988 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     40 root      0.5 92804 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     41 root      0.5 91944 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     42 root      0.5 92768 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     43 root      0.5 92072 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     44 root      0.5 96096 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     45 root      0.5 92764 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     46 root      0.5 92244 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     47 root      0.5 91452 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     61 root      0.0  3716 bash
    109 root      0.0  3784 ps -o pid,user,%mem,rss,command -ax

So we have 9 API server processes (1 main process and 8 copies managed by oslo.service, as configured in our ironic.conf) each of which is around 90MiB. 

If we think about what will actually be talking to this server, it is at most 3 calls at a time from the baremetal-operator (one per reconcile thread), plus the agents during provisioning.

Version-Release number of selected component (if applicable):

4.6+

How reproducible:

Always

Steps to Reproduce:
1. oc exec -it ${name-of-metal3-pod} -- bash
2. ps -o pid,user,%mem,rss,command -ax

Actual results:

9 processes

Expected results:

Fewer than 9 processes

Additional info:

--- Additional comment from Derek Higgins on 2020-11-03 17:22:39 UTC ---

(In reply to Doug Hellmann from comment #0)
> So we have 9 API server processes (1 main process and 8 copies managed by
> oslo.service, as configured in our ironic.conf) each of which is around
> 90MiB. 
> 

Looks like this is set based on the number of processors

configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))
ironic.conf.j2:api_workers = {{ env.NUMWORKERS }}

if that is the best behaviour or not is another matter
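
A minimal sketch of what lowering that cap could look like in configure-ironic.sh, assuming NUMPROC is still taken from the processor count (e.g. via nproc); the actual change landed in the linked PR and may differ in detail:

NUMPROC=$(nproc)                                   # assumption: NUMPROC is derived from the CPU count
export NUMWORKERS=$(( NUMPROC < 4 ? NUMPROC : 4 )) # cap the default at 4 workers instead of 12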

--- Additional comment from Zane Bitter on 2020-11-03 18:06:31 UTC ---

Each API worker has a thread pool that can process up to 100 requests simultaneously (though not necessarily performantly!), plus it will queue up to 128 further requests before accept()ing them.

AFAIK even on OpenStack underclouds we only configure half as many worker threads as CPUs, and the only reason we have so many workers in general is to make sure that CPU-bound parts of the installation that are dependent on a single service don't get slowed down.

For metal³, the bottleneck is ironic-conductor - we only have one of those. Work is in progress to limit it to provisioning 20 nodes at a time by default.
ironic-api responds to requests from both the baremetal-operator (max 3 at one time) and from IPA running on any non-provisioned nodes. Nonetheless, I'd be surprised if we needed more than one worker to avoid ironic-api being a bottleneck at current scales.

Knowing when and how to scale in future (e.g. when the ironic-conductor is deployed and scaled separately) is more challenging. ironic-api listens on a fixed IP (the provisioning VIP) and port using host networking, and has to work before Ingresses are available, so we can't scale out in a traditional k8s way.
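
(For scale: even the reduced default of 4 workers that the fix settles on allows roughly 4 x 100 = 400 in-flight requests per the thread-pool figure above, plus the accept backlog, which is far above the handful of concurrent calls expected from the baremetal-operator and the agents.)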

--- Additional comment from Doug Hellmann on 2020-11-03 22:10:42 UTC ---

(In reply to Derek Higgins from comment #1)
> (In reply to Doug Hellmann from comment #0)
> > So we have 9 API server processes (1 main process and 8 copies managed by
> > oslo.service, as configured in our ironic.conf) each of which is around
> > 90MiB. 
> > 
> 
> Looks like this is set based on the number of processors
> 
> configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))
> ironic.conf.j2:api_workers = {{ env.NUMWORKERS }}
> 
> if that is the best behaviour or not is another matter

Yeah, good point. I checked this on a dev-scripts deployment. I suppose that means on real hardware we're likely to be running at 12 instead of 8, so consumption would be even worse.

--- Additional comment from Dmitry Tantsur on 2020-11-12 16:39:24 UTC ---

Do we need to clone this for 4.6? I assume yes.

--- Additional comment from Doug Hellmann on 2020-11-12 16:54:28 UTC ---

(In reply to Dmitry Tantsur from comment #4)
> Do we need to clone this for 4.6? I assume yes.

Yes, let's do that.

Comment 3 Lubov 2021-01-28 14:42:27 UTC
Verified on 4.6.0-0.nightly-2021-01-28-083619

[kni@provisionhost-0-0 ~]$ oc exec -it metal3-86c9d47458-55tpg -c metal3-ironic-api -- bash
[root@master-0-2 /]# ps -o pid,user,%mem,rss,command -ax
    PID USER     %MEM   RSS COMMAND
      1 root      0.2 90956 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     31 root      0.2 91196 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     32 root      0.2 91208 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     33 root      0.2 93308 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     34 root      0.2 89480 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     35 root      0.0  3772 bash
     56 root      0.0  3800 ps -o pid,user,%mem,rss,command -ax
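
As an additional check (pod and container names vary per deployment), the rendered worker setting can be confirmed directly; assuming the ironic.conf.j2 template quoted above is rendered into /etc/ironic/ironic.conf, it should now read api_workers = 4:

oc exec -it ${name-of-metal3-pod} -c metal3-ironic-api -- grep api_workers /etc/ironic/ironic.conf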

Comment 6 errata-xmlrpc 2021-02-08 13:50:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.6.16 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0308

