Bug 1894146 - ironic-api used by metal3 is over provisioned and consumes a lot of RAM
Summary: ironic-api used by metal3 is over provisioned and consumes a lot of RAM
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Dmitry Tantsur
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks: 1899107
 
Reported: 2020-11-03 16:20 UTC by Doug Hellmann
Modified: 2021-02-24 15:31 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The Ironic API service embedded in baremetal IPI now uses 4 workers instead of 8, reducing its RAM usage.
Clone Of:
Clones: 1899107
Environment:
Last Closed: 2021-02-24 15:31:05 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  Github openshift/ironic-image pull 119 (closed): Bug 1894146: Limit the default number of API workers to 4 (last updated 2021-02-02 15:41:52 UTC)
  Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:31:09 UTC)

Description Doug Hellmann 2020-11-03 16:20:53 UTC
Description of problem:

Memory consumption by metal3 is fairly high compared to other components in the deployment. The metal3 pod as a whole uses 880+MiB, of which the ironic-api container uses 530MiB.
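
The pod-level figure can be cross-checked from outside the container, for example (assuming the metal3 pod runs in the openshift-machine-api namespace; the pod name will differ per cluster):

$ oc -n openshift-machine-api get pods | grep metal3
$ oc adm top pod -n openshift-machine-api ${name-of-metal3-pod}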

Inside the container, I see

# ps -o pid,user,%mem,rss,command -ax
    PID USER     %MEM   RSS COMMAND
      1 root      0.5 86988 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     40 root      0.5 92804 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     41 root      0.5 91944 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     42 root      0.5 92768 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     43 root      0.5 92072 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     44 root      0.5 96096 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     45 root      0.5 92764 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     46 root      0.5 92244 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     47 root      0.5 91452 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
     61 root      0.0  3716 bash
    109 root      0.0  3784 ps -o pid,user,%mem,rss,command -ax

So we have 9 API server processes (1 main process and 8 copies managed by oslo.service, as configured in our ironic.conf) each of which is around 90MiB. 
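
For reference, the rendered line in ironic.conf that drives this worker count would look something like the following on an 8-CPU host (illustration only; the exact placement in the file may differ):

api_workers = 8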

If we think about what is actually talking to this server, it is at most 3 calls at a time from the baremetal-operator (one per reconcile thread), plus the agents during provisioning.

Version-Release number of selected component (if applicable):

4.6+

How reproducible:

Always

Steps to Reproduce:
1. oc exec -it ${name-of-metal3-pod} -- bash
2. ps -o pid,user,%mem,rss,command -ax
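
Steps 1 and 2 can also be combined into a single command, assuming the ironic-api container is named metal3-ironic-api (as in the verification below):

oc exec -it ${name-of-metal3-pod} -c metal3-ironic-api -- ps -o pid,user,%mem,rss,command -ax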

Actual results:

9 processes

Expected results:

Fewer than 9 processes

Additional info:

Comment 1 Derek Higgins 2020-11-03 17:22:39 UTC
(In reply to Doug Hellmann from comment #0)
> So we have 9 API server processes (1 main process and 8 copies managed by
> oslo.service, as configured in our ironic.conf) each of which is around
> 90MiB. 
> 

Looks like this is set based on the number of processors:

configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))
ironic.conf.j2:api_workers = {{ env.NUMWORKERS }}

Whether or not that is the best behaviour is another matter.
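
A minimal sketch of the kind of cap the linked pull request (openshift/ironic-image pull 119, which limits the default number of API workers to 4) applies to this line; the actual patch may differ in detail:

configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 4 ? NUMPROC : 4 ))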

Comment 2 Zane Bitter 2020-11-03 18:06:31 UTC
Each API worker has a thread pool that can process up to 100 requests simultaneously (though not necessarily performantly!), plus it will queue up to 128 further requests before accept()ing them.

AFAIK even on OpenStack underclouds we only configure half as many worker threads as CPUs, and the only reason we have so many workers in general is to make sure that CPU-bound parts of the installation that are dependent on a single service don't get slowed down.

For metal³, the bottleneck is ironic-conductor - we only have one of those. Work is in progress to limit it to provisioning 20 nodes at a time by default.
ironic-api responds to requests from both the baremetal-operator (max 3 at one time) and from IPA running on any non-provisioned nodes. Nonetheless, I'd be surprised if we needed more than one worker to avoid ironic-api being a bottleneck at current scales.
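
As a rough back-of-the-envelope check using the figures above: a single worker can already hold around 100 in-flight requests plus 128 queued, roughly 228 in total, while expected demand is on the order of 3 concurrent baremetal-operator calls plus occasional agent traffic.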

Knowing when and how to scale in future (e.g. when the ironic-conductor is deployed and scaled separately) is more challenging. ironic-api listens on a fixed IP (the provisioning VIP) and port using host networking, and has to work before Ingresses are available, so we can't scale out in a traditional k8s way.

Comment 3 Doug Hellmann 2020-11-03 22:10:42 UTC
(In reply to Derek Higgins from comment #1)
> (In reply to Doug Hellmann from comment #0)
> > So we have 9 API server processes (1 main process and 8 copies managed by
> > oslo.service, as configured in our ironic.conf) each of which is around
> > 90MiB. 
> > 
> 
> Looks like this is set based on the number of processors
> 
> configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))
> ironic.conf.j2:api_workers = {{ env.NUMWORKERS }}
> 
> if that is the best behaviour or not is another matter

Yeah, good point. I checked this on a dev-scripts deployment. I suppose that means on real hardware we're likely to be running at 12 instead of 8, so consumption would be even worse.
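
Rough arithmetic using the ~90MiB RSS per process observed above: 12 workers plus the parent is 13 x ~90MiB, roughly 1.1GiB of apparent RSS for ironic-api alone, although shared copy-on-write pages mean the unique memory footprint is somewhat lower.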

Comment 4 Dmitry Tantsur 2020-11-12 16:39:24 UTC
Do we need to clone this for 4.6? I assume yes.

Comment 5 Doug Hellmann 2020-11-12 16:54:28 UTC
(In reply to Dmitry Tantsur from comment #4)
> Do we need to clone this for 4.6? I assume yes.

Yes, let's do that.

Comment 7 Lubov 2020-12-02 09:10:02 UTC
Verified on 4.7.0-0.nightly-2020-11-30-172451

1. [kni@provisionhost-0-0 ~]$ oc exec -it metal3-86699bf5fc-hbpgp -c metal3-ironic-api -- bash
2. [root@master-0-2 /]# ps -o pid,user,%mem,rss,command -ax
    PID USER     %MEM   RSS COMMAND
      1 root      0.2 91172 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
     40 root      0.3 99992 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
     41 root      0.3 99892 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
     42 root      0.3 99836 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
     43 root      0.3 101880 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironi
     89 root      0.0  3700 bash
    116 root      0.0  3676 ps -o pid,user,%mem,rss,command -ax
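
The configured value can also be confirmed directly inside the container (assuming grep is available in the image); it should show a line like api_workers = 4:

[root@master-0-2 /]# grep api_workers /etc/ironic/ironic.conf /usr/share/ironic/ironic-dist.conf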

Comment 10 errata-xmlrpc 2021-02-24 15:31:05 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

