Description of problem:

Memory consumption by metal3 is pretty high, at least when we compare it to some other things in the deployment. The metal3 pod as a whole uses 880+ MiB, of which the ironic-api container is using 530 MiB. Inside the container, I see

# ps -o pid,user,%mem,rss,command -ax
  PID USER %MEM   RSS COMMAND
    1 root  0.5 86988 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   40 root  0.5 92804 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   41 root  0.5 91944 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   42 root  0.5 92768 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   43 root  0.5 92072 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   44 root  0.5 96096 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   45 root  0.5 92764 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   46 root  0.5 92244 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   47 root  0.5 91452 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic.conf
   61 root  0.0  3716 bash
  109 root  0.0  3784 ps -o pid,user,%mem,rss,command -ax

So we have 9 API server processes (1 main process and 8 copies managed by oslo.service, as configured in our ironic.conf), each of which is around 90 MiB. If we think about what's going to be talking to the server, it would be at most 3 calls at a time from the baremetal-operator (one per reconcile thread) and then the agents during provisioning.

Version-Release number of selected component (if applicable):
4.6+

How reproducible:
Always

Steps to Reproduce:
1. oc exec -it ${name-of-metal3-pod} -- bash
2. ps -o pid,user,%mem,rss,command -ax

Actual results:
9 processes

Expected results:
Fewer than 9 processes

Additional info:
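For a quick total of the ironic-api memory, the RSS figures above can be summed with a one-liner like the following (a sketch only; it assumes the ironic-api container is named metal3-ironic-api, as in the verification later in this bug, and the totals will of course vary by deployment):

oc exec ${name-of-metal3-pod} -c metal3-ironic-api -- ps -o rss=,command= -ax \
  | awk '/ironic-api/ { kib += $1 } END { printf "%.0f MiB\n", kib / 1024 }'   # sums per-process RSS (KiB) and prints MiB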
(In reply to Doug Hellmann from comment #0)
> So we have 9 API server processes (1 main process and 8 copies managed by
> oslo.service, as configured in our ironic.conf) each of which is around
> 90MiB.

Looks like this is set based on the number of processors:

configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))
ironic.conf.j2:api_workers = {{ env.NUMWORKERS }}

Whether or not that is the best behaviour is another matter.
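To see what that derivation produces on differently sized hosts, here is a minimal sketch (it assumes NUMPROC is populated from the host CPU count, e.g. via nproc, which is not shown in the snippet above):

# Hedged sketch of the configure-ironic.sh arithmetic quoted above.
for NUMPROC in 4 8 16 40; do
  NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))            # cap at 12, otherwise one worker per CPU
  echo "NUMPROC=${NUMPROC} -> api_workers=${NUMWORKERS}"   # value templated into ironic.conf by ironic.conf.j2
done

So an 8-CPU dev-scripts VM gets 8 workers, and anything with 12 or more CPUs gets 12.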
Each API worker has a thread pool that can process up to 100 requests simultaneously (though not necessarily performantly!), plus it will queue up to 128 further requests before accept()ing them. AFAIK even on OpenStack underclouds we only configure half as many worker threads as CPUs, and the only reason we have so many workers in general is to make sure that CPU-bound parts of the installation that are dependent on a single service don't get slowed down.

For metal³, the bottleneck is ironic-conductor - we only have one of those. Work is in progress to limit it to provisioning 20 nodes at a time by default. ironic-api responds to requests from both the baremetal-operator (max 3 at one time) and from IPA running on any non-provisioned nodes. Nonetheless, I'd be surprised if we needed more than one worker to avoid ironic-api being a bottleneck at current scales.

Knowing when and how to scale in future (e.g. when the ironic-conductor is deployed and scaled separately) is more challenging. ironic-api listens on a fixed IP (the provisioning VIP) and port using host networking, and has to work before Ingresses are available, so we can't scale out in a traditional k8s way.
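A back-of-the-envelope capacity check using the figures in this comment (a rough sketch, not a benchmark):

# Even a single worker can hold 100 in-flight requests plus a 128-deep accept backlog,
# while the expected load is ~3 concurrent baremetal-operator calls plus IPA traffic.
workers=1; threads_per_worker=100; backlog=128
echo "capacity with ${workers} worker: $(( workers * threads_per_worker )) in-flight + ${backlog} queued"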
(In reply to Derek Higgins from comment #1)
> (In reply to Doug Hellmann from comment #0)
> > So we have 9 API server processes (1 main process and 8 copies managed by
> > oslo.service, as configured in our ironic.conf) each of which is around
> > 90MiB.
>
> Looks like this is set based on the number of processors
>
> configure-ironic.sh:export NUMWORKERS=$(( NUMPROC < 12 ? NUMPROC : 12 ))
> ironic.conf.j2:api_workers = {{ env.NUMWORKERS }}
>
> if that is the best behaviour or not is another matter

Yeah, good point. I checked this on a dev-scripts deployment. I suppose that means on real hardware we're likely to be running at 12 instead of 8, so consumption would be even worse.
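Rough arithmetic with the ~90 MiB per-process figure from comment #0: 12 workers plus the parent is 13 processes, so roughly 13 × 90 MiB ≈ 1.1 GiB for ironic-api alone on a 12+ CPU host, before counting the other containers in the pod.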
Do we need to clone this for 4.6? I assume yes.
(In reply to Dmitry Tantsur from comment #4)
> Do we need to clone this for 4.6? I assume yes.

Yes, let's do that.
Verified on 4.7.0-0.nightly-2020-11-30-172451

1. [kni@provisionhost-0-0 ~]$ oc exec -it metal3-86699bf5fc-hbpgp -c metal3-ironic-api -- bash
2. [root@master-0-2 /]# ps -o pid,user,%mem,rss,command -ax
  PID USER %MEM    RSS COMMAND
    1 root  0.2  91172 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
   40 root  0.3  99992 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
   41 root  0.3  99892 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
   42 root  0.3  99836 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironic
   43 root  0.3 101880 /usr/bin/python3 /usr/bin/ironic-api --config-file /usr/share/ironic/ironic-dist.conf --config-file /etc/ironic/ironi
   89 root  0.0   3700 bash
  116 root  0.0   3676 ps -o pid,user,%mem,rss,command -ax
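A quicker way to confirm the reduced process count (a sketch; the pod name is deployment-specific, and it assumes grep is available in the image, as bash and ps evidently are):

oc exec metal3-86699bf5fc-hbpgp -c metal3-ironic-api -- bash -c 'ps -o command= -ax | grep -c "[i]ronic-api"'
# expected output here: 5 (1 parent + 4 workers), versus 9 before the fix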
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633