Bug 1871885 - [OSP 16.1]nova_libvirt container "cannot fork child process: Resource temporarily unavailable"
Summary: [OSP 16.1]nova_libvirt container "cannot fork child process: Resource tempora...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Kashyap Chamarthy
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-24 14:31 UTC by Matt Flusche
Modified: 2024-03-25 16:21 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200828163406.94ba270.el8ost python-paunch-5.3.3-1.20200826193407.ed2c015.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:39:26 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1892817 | 0 | None | None | None | 2020-08-25 05:54:15 UTC
OpenStack gerrit | 747831 | 0 | None | MERGED | [USSURI-ONLY] Add new parameter: pids_limit | 2021-02-10 07:31:16 UTC
OpenStack gerrit | 747834 | 0 | None | MERGED | Enable pids_limit support | 2021-02-10 07:31:16 UTC
OpenStack gerrit | 747835 | 0 | None | MERGED | Set a higher PIDs limit for nova_libvirt container | 2021-02-10 07:31:16 UTC
Red Hat Issue Tracker | OSP-11139 | 0 | None | None | None | 2021-12-01 18:49:41 UTC
Red Hat Product Errata | RHEA-2020:4284 | 0 | None | None | None | 2020-10-28 15:39:55 UTC

Description Matt Flusche 2020-08-24 14:31:39 UTC
Description of problem:
OSP 16.1 latest 

Running into the following issue with launching instances:

2020-08-19 21:14:42.722+0000: 34000: error : virFork:274 : cannot fork child process: Resource temporarily unavailable
2020-08-19 21:14:42.724+0000: 34000: error : virFork:274 : cannot fork child process: Resource temporarily unavailable

This is with 184 instances:
# sudo podman exec -ti nova_libvirt virsh list --all  |wc -l
184

This seems to be hitting the PID limit of the nova_libvirt container. Perhaps:

# sudo podman inspect nova_libvirt |grep PidsLimit
            "PidsLimit": 4096,
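
The grep above only matches a text line. A minimal Python sketch of reading the same value from the JSON that `podman inspect` prints could look like the following. The helper names and the trimmed sample document are hypothetical; the search walks the whole structure for a "PidsLimit" key instead of assuming where podman nests it.

```python
import json


def find_pids_limits(node):
    """Recursively collect every value stored under a "PidsLimit" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "PidsLimit":
                found.append(value)
            else:
                found.extend(find_pids_limits(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_pids_limits(item))
    return found


def pids_limit_from_inspect(inspect_json):
    """Return the first PidsLimit found in inspect output, or None."""
    limits = find_pids_limits(json.loads(inspect_json))
    return limits[0] if limits else None


# Trimmed, illustrative sample; real inspect output has many more fields.
sample = '[{"Name": "nova_libvirt", "HostConfig": {"PidsLimit": 4096}}]'
print(pids_limit_from_inspect(sample))
```

With real data, the string passed in would be the captured output of `sudo podman inspect nova_libvirt`.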


How is this config managed?


Version-Release number of selected component (if applicable):
16.1 current

How reproducible:
Unknown

Steps to Reproduce:
1. Launch a significant number of instances.



Additional info:

I'll provide additional libvirtd and nova logs to show the issue.

Comment 4 Matt Flusche 2020-08-24 20:06:15 UTC
Note the podman change log:

https://github.com/containers/podman/blob/v1.6/changelog.txt

- Changelog for v1.6.2-rc1 (2019-10-16)
[...]
* Setup a reasonable default for pids-limit 4096


From https://github.com/containers/podman/blob/v1.6/RELEASE_NOTES.md

1.6.2

Misc
The default PID limit for containers is now set to 4096. It can be adjusted back to the old default (unlimited) by passing --pids-limit 0 to podman create and podman run

It seems this change was not taken into account for the OSP 16.1 containers.

Comment 5 Emilien Macchi 2020-08-24 21:19:16 UTC
This is going to involve patching openstack/paunch & tripleo-ansible (tripleo-container-manage role) to support these options, and then using these options in THT.

Comment 6 Cédric Jeanneret 2020-08-25 06:06:39 UTC
Ussuri patch ready for paunch.
Will work on tripleo-ansible in parallel so that we can add the needed option to nova_libvirt in master already.

Comment 7 Cédric Jeanneret 2020-08-25 06:12:49 UTC
Master patch against tripleo-ansible (needed for master/osp-17) ready for review.

Will now work on t-h-t content, adding a Depends-On the tripleo-ansible patch for master. Backports will need to point to the paunch patch for the Depends-On.

Comment 8 Kashyap Chamarthy 2020-08-25 08:43:22 UTC
To summarize IRC discussion with libvirt developers (thanks, DanPB):

- Removing the PID limit altogether can lead to a fork bomb; so we shouldn't do that.

- DanPB elaborates: libvirtd configures 'TasksMax=32768', which means 32768 should allow one to launch about 1200 guests (given that you were able to launch about 150 guests with the 4096 limit). However, this is more complicated: in Ceph configurations, if you have 100 storage hosts, then Ceph will create 1 thread per host, so each QEMU process will consume 100 threads. So 1200 guests in a non-Ceph configuration may turn into 200 guests in a Ceph-based setup.

In the end, going with the tunable parameter (https://review.opendev.org/#/c/747826/) is indeed a better option.
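
The scaling argument above can be written out as a rough calculation. This is only a sketch under stated assumptions: it takes roughly 150 guests at the 4096 limit as the observed baseline (the figure that the 1200-guest estimate implies), treats task usage as linear in the number of guests, and models the Ceph case as 100 extra threads per QEMU process. The function name and parameters are hypothetical.

```python
from fractions import Fraction


def estimate_max_guests(pids_limit, observed_guests, observed_limit,
                        extra_threads_per_guest=0):
    """Estimate how many guests fit under pids_limit.

    Assumes each guest costs observed_limit / observed_guests tasks,
    plus any extra per-guest threads (for example, Ceph client threads).
    """
    tasks_per_guest = Fraction(observed_limit, observed_guests) + extra_threads_per_guest
    return int(pids_limit / tasks_per_guest)


# Baseline assumption: about 150 guests exhausted the 4096 limit.
print(estimate_max_guests(32768, 150, 4096))                               # 1200
# Same limit, assuming 100 extra threads per guest (100 Ceph storage hosts).
print(estimate_max_guests(32768, 150, 4096, extra_threads_per_guest=100))  # 257
```

The Ceph figure lands in the same range as the roughly 200 guests quoted above; both numbers are order-of-magnitude estimates, not measured limits.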

Comment 9 Cédric Jeanneret 2020-08-25 09:10:50 UTC
Updated my t-h-t patch and added it here.

Comment 10 Kashyap Chamarthy 2020-08-26 12:15:19 UTC
(In reply to Kashyap Chamarthy from comment #8)
> To summarize IRC discussion with libvirt developers (thanks, DanPB):
> 
> - Removing the PID limit altogether can lead to a fork bomb; so we shouldn't
> do that.
> 
> - DanPB elaborates: libvirtd configures 'TasksMax=32768', which means
> 32768 should allow one to launch about 1200 guests (given that you were able
> to launch about 150 guests with the 4096 limit). However, this is more
> complicated: in Ceph configurations, if you have 100 storage hosts, then Ceph
> will create 1 thread per host, so each QEMU process will consume 100 threads.
> So 1200 guests in a non-Ceph configuration may turn into 200 guests in a
> Ceph-based setup.
> 
> In the end, going with the tunable parameter
> (https://review.opendev.org/#/c/747826/) is indeed a better option.

We actually went with setting a higher PID limit, which is also acceptable: https://review.opendev.org/#/c/747835/

(In general, if we _can_ avoid yet-more tunables, it's good.)

Comment 18 Cédric Jeanneret 2020-09-02 12:58:25 UTC
The hotfix consists of only two packages to install on the Undercloud/Director node:
- new tripleo-heat-templates
- new python3-paunch

Comment 27 David Rosenfeld 2020-09-22 16:39:50 UTC
On a standard deployment, I saw the new default PidsLimit:

sudo podman inspect nova_libvirt |grep PidsLimit
            "PidsLimit": 65536,

I also changed /usr/share/openstack-tripleo-heat-templates/deployment/nova/nova-libvirt-container-puppet.yaml to contain a ContainerNovaLibvirtPidsLimit of 60000.
After deploying, the non-default value was in effect:

sudo podman inspect nova_libvirt |grep PidsLimit
            "PidsLimit": 60000,
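
The verification above edited the shipped template directly. A hedged alternative sketch: render a small Heat environment file that overrides the same parameter. The function name is hypothetical, and passing such a file to the deployment with `-e` is an assumption based on the usual `parameter_defaults` convention, not something tested in this bug. Non-positive values are rejected because removing the limit altogether was ruled out in comment 8.

```python
def render_pids_limit_env(limit):
    """Return environment-file text overriding ContainerNovaLibvirtPidsLimit."""
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "parameter_defaults:\n"
        "  ContainerNovaLibvirtPidsLimit: {}\n".format(limit)
    )


# Print the override used in the verification above.
print(render_pids_limit_env(60000), end="")
```

The printed text could be saved to a file of your choosing and included in the deploy command alongside the other environment files.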

Comment 40 Cédric Jeanneret 2020-10-13 14:03:00 UTC
Hello all,

Seeing the amount of things being pulled in with the hotfix (creating issues), and seeing the approaching 16.1.2 release (due this week if everything goes according to the plan), it would be better to actually wait for 16.1.2 and do the tests on it, since we'll get everything pulled in, with the matching versions for both packages and containers.

Would that be OK?

Cheers,

C.

Comment 41 Matt Flusche 2020-10-13 15:58:11 UTC
(In reply to Cédric Jeanneret from comment #40)
> Hello all,
> 
> Seeing the amount of things being pulled in with the hotfix (creating
> issues), and seeing the approaching 16.1.2 release (due this week if
> everything goes according to the plan), it would be better to actually wait
> for 16.1.2 and do the tests on it, since we'll get everything pulled in,
> with the matching versions for both packages and containers.
> 
> Would that be OK?
> 
> Cheers,
> 
> C.

Hi,

I believe that is the best option.  Thanks for the update.

Regards,

Matt

Comment 47 errata-xmlrpc 2020-10-28 15:39:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

