This bug has been migrated to another issue-tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you can do so at the Red Hat Issue Tracker.
Bug 2159663 - Config download fails with [Errno 24] Too many open files at scale
Summary: Config download fails with [Errno 24] Too many open files at scale
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Irina
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Duplicates: 2295413
Depends On:
Blocks:
 
Reported: 2023-01-10 10:22 UTC by Asma Syed Hameed
Modified: 2024-12-23 13:09 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
See the attached KCS for publication
Clone Of:
Environment:
Last Closed: 2024-12-23 13:09:00 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-21285 0 None None None 2023-01-10 10:30:50 UTC
Red Hat Issue Tracker OSP-33298 0 None None None 2024-12-23 13:08:37 UTC
Red Hat Issue Tracker OSPRH-12654 0 None None None 2024-12-23 13:08:59 UTC
Red Hat Knowledge Base (Solution) 7099895 0 None None None 2024-12-13 14:08:41 UTC

Description Asma Syed Hameed 2023-01-10 10:22:49 UTC
Description of problem:

We are deploying OSP 17.1 with TLS-e, with 3 controllers, 220 computes, and 5 cephstorage nodes; config-download is failing:


2023-01-10 06:44:01,442 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.442154 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-29
2023-01-10 06:44:01,443 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.443682 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-3
2023-01-10 06:44:01,445 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.445267 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-30
2023-01-10 06:44:01,447 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.446827 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-31
2023-01-10 06:44:01,448 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.448358 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-198
2023-01-10 06:44:01,450 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.449897 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-199
2023-01-10 06:44:01,451 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.451428 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-2
2023-01-10 06:44:01,457 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.457483 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-32
2023-01-10 06:44:01,463 p=963541 u=stack n=ansible | 2023-01-10 06:44:01.463482 | bc97e1c3-4240-c543-7458-000000075212 |         OK | Ensure we get the ansible interfaces facts | computer640-37
2023-01-10 06:44:01,469 p=963541 u=stack n=ansible | ERROR! Unexpected Exception, this is probably a bug: [Errno 24] Too many open files: '/tmp/tripleoyihd0kky/3afb0e15-9266-460c-8713-af61d49efdec/job_events/478fcbfb-859a-483e-b3dd-241c2fc4c112-partial.json.tmp'
2023-01-10 06:44:01,470 p=963541 u=stack n=ansible | to see the full traceback, use -vvv
2023-01-10 06:44:01,472 p=963541 u=stack n=ansible | the full traceback was:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ansible/plugins/cache/__init__.py", line 169, in set
    self._dump(value, tmpfile_path)
  File "/usr/lib/python3.9/site-packages/ansible/plugins/cache/jsonfile.py", line 63, in _dump
    with codecs.open(filepath, 'w', encoding='utf-8') as f:
  File "/usr/lib64/python3.9/codecs.py", line 905, in open
    file = builtins.open(filename, mode, buffering)
OSError: [Errno 24] Too many open files: '/home/stack/.tripleo/fact_cache/tmp4e42lkvk'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ansible/plugins/strategy/tripleo_free.py", line 313, in run
    result |= self.process_work()
  File "/usr/share/ansible/plugins/strategy/tripleo_free.py", line 264, in process_work
    results = self._process_pending_results(self._iterator)
  File "/usr/lib/python3.9/site-packages/ansible/plugins/strategy/__init__.py", line 157, in inner
    results = func(self, iterator, one_pass=one_pass, max_passes=max_passes, do_handlers=do_handlers)
  File "/usr/lib/python3.9/site-packages/ansible/plugins/strategy/__init__.py", line 754, in _process_pending_results
    self._variable_manager.set_host_facts(target_host, result_item['ansible_facts'].copy())
  File "/usr/lib/python3.9/site-packages/ansible/vars/manager.py", line 677, in set_host_facts
    self._fact_cache[host] = host_cache
  File "/usr/lib/python3.9/site-packages/ansible/vars/fact_cache.py", line 36, in __setitem__
    self._plugin.set(key, value)
  File "/usr/lib/python3.9/site-packages/ansible/plugins/cache/__init__.py", line 171, in set
    display.warning("error in '%s' cache plugin while trying to write to '%s' : %s" % (self.plugin_name, tmpfile_path, to_bytes(e)))
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/display.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/ansible/utils/display.py", line 403, in warning
    self.display(new_msg, color=C.COLOR_WARN, stderr=True)
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/display.py", line 89, in wrapper
    event_context.dump_begin(fileobj)
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/events.py", line 196, in dump_begin
    self.cache.set(":1:ev-{}".format(begin_dict['uuid']), begin_dict)
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/events.py", line 72, in set
    with os.fdopen(os.open(write_location, os.O_WRONLY | os.O_CREAT, stat.S_IRUSR | stat.S_IWUSR), 'w') as f:
OSError: [Errno 24] Too many open files: '/tmp/tripleoyihd0kky/3afb0e15-9266-460c-8713-af61d49efdec/job_events/b1d6856c-2212-4556-999f-5415a29501a3-partial.json.tmp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ansible/cli/__init__.py", line 601, in cli_executor
    exit_code = cli.run()
  File "/usr/lib/python3.9/site-packages/ansible/cli/playbook.py", line 143, in run
    results = pbex.run()
  File "/usr/lib/python3.9/site-packages/ansible/executor/playbook_executor.py", line 190, in run
    result = self._tqm.run(play=play)
  File "/usr/lib/python3.9/site-packages/ansible/executor/task_queue_manager.py", line 321, in run
    play_return = strategy.run(iterator, play_context)
  File "/usr/share/ansible/plugins/strategy/tripleo_free.py", line 324, in run
    display.error("Exception while running task loop: "
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/display.py", line 41, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/ansible/utils/display.py", line 458, in error
    self.display(new_msg, color=C.COLOR_ERROR, stderr=True)
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/display.py", line 89, in wrapper
    event_context.dump_begin(fileobj)
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/events.py", line 196, in dump_begin
    self.cache.set(":1:ev-{}".format(begin_dict['uuid']), begin_dict)
  File "/usr/lib/python3.9/site-packages/ansible_runner/display_callback/events.py", line 72, in set
    with os.fdopen(os.open(write_location, os.O_WRONLY | os.O_CREAT, stat.S_IRUSR | stat.S_IWUSR), 'w') as f:
OSError: [Errno 24] Too many open files: '/tmp/tripleoyihd0kky/3afb0e15-9266-460c-8713-af61d49efdec/job_events/478fcbfb-859a-483e-b3dd-241c2fc4c112-partial.json.tmp'

 
Version-Release number of selected component (if applicable):
RHOS-17.1-RHEL-9-20221130.n.1

How reproducible:
100 %

Steps to Reproduce:
1. Deploy an overcloud with ~200 or more nodes
2. config-download fails with "Too many open files"

Actual results:
Overcloud deployment failing

Expected results:
Overcloud deployed successfully

Additional info:
ANSIBLE_FORKS = 100

[stack@undercloud ~]$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 1540021
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1540021
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
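
For reference, a minimal sketch (not from the original report) of checking the nofile limit that the running config-download process actually inherits on the undercloud; the process name match and paths are assumptions:

~~~
# Find the ansible-playbook process that drives config-download (the name
# match is illustrative) and inspect the nofile limit it inherited
pid=$(pgrep -f ansible-playbook | head -n1)
grep 'Max open files' /proc/"${pid}"/limits

# Count the file descriptors it currently holds; with ~200+ nodes and many
# forks this can approach the 1024 soft limit shown above
ls /proc/"${pid}"/fd | wc -l
~~~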

Comment 6 Asma Syed Hameed 2023-03-15 04:12:16 UTC
Takashi,

We used the RHOS-17.0-RHEL-9-20220615.n.2 puddle during the OSP 17.0 250-node testing.
We are using the default package versions available with the respective puddles.

Comment 7 Takashi Kajinami 2023-03-15 08:01:42 UTC
Hi Asma,

Please correct me if I'm wrong, but the "puddle" applies only to RHOSP packages.
We are using the ansible packages from RHEL, so I need to know the versions of the RHEL packages installed during the tests.

By any chance, do you have more details (such as an sosreport) taken from your director node during these tests?

Comment 10 Jaison Raju 2024-01-24 10:19:03 UTC
For now, Ian will add the following as a configuration workaround in our official scale documentation.

Ensure that you raise the open file limits in the /etc/security/limits.conf file:
*               soft    nofile            4096
*               hard    nofile            4096
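
A minimal sketch of applying and verifying this workaround on the undercloud, assuming the deployment runs as the stack user and that pam_limits applies the new values only to fresh login sessions:

~~~
# As root on the undercloud, append the raised limits from the workaround above
cat >> /etc/security/limits.conf <<'EOF'
*               soft    nofile            4096
*               hard    nofile            4096
EOF

# Then, from a *new* login session as the stack user, confirm the limit
# before re-running the deployment
ulimit -n    # expected: 4096
~~~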

Comment 11 Eric Nothen 2024-10-21 14:44:14 UTC
Is it possible that this is a side effect of the elimination of mistral? In RHOSP 16, mistral seems to have a hardcoded value of 1M:

~~~
[stack.lab ~]$ cat /etc/rhosp-release ; sudo podman exec mistral_engine ulimit -n
Red Hat OpenStack Platform release 16.2.6 (Train)
1048576
[stack.lab ~]$ 
~~~

Without mistral, the jobs are now using the host's default:

~~~
[stack.example.lab ~]$ ulimit -n
1024
[stack.example.lab ~]$ 
~~~

I'm attaching a customer case that failed with the same error (Too many open files), but during an overcloud upgrade. This happened only on their cluster with more than 60 compute nodes; none of the other clusters hit this error.

Comment 12 Eric Nothen 2024-12-12 16:28:57 UTC
Phil, would you mind explaining why this is not a bug?

Comment 13 Brendan Shephard 2024-12-13 09:08:51 UTC
(In reply to Eric Nothen from comment #12)
> Phil, would you mind explaining why this is not a bug?

This seems like a RHEL configuration / administration issue, rather than RHOSP specific. That would be the reason it was closed as not a bug with RHOSP. It seems that the problem here can be solved by increasing the number of file descriptors allowed by the user.

At best, we could probably supplement our RHOSP17 documentation to mention that the ulimit on the Director node should be increased at scale, or maybe a KCS for RHOSP17. But general RHEL administration hasn't been something we have documented in the past, rather we would defer to RHEL documentation and material such as:
https://access.redhat.com/solutions/146233

Comment 14 Eric Nothen 2024-12-13 09:40:24 UTC
(In reply to Brendan Shephard from comment #13)
> (In reply to Eric Nothen from comment #12)
> > Phil, would you mind explaining why this is not a bug?
> 
> This seems like a RHEL configuration / administration issue, rather than
> RHOSP specific. That would be the reason it was closed as not a bug with
> RHOSP. 

I understand it is a RHEL configuration, but so are iptables/nftables, and you do have to address those in RHOSP if you want the deployment to succeed.

Moreover, we are saying above that this likely comes from mistral, which was an RHOSP component. Somebody added that ulimit to the mistral image, likely because they found that things were breaking when it was not set. Now that mistral is gone, the same change is required on the undercloud host, yet somehow it is now for RHEL to address?


> It seems that the problem here can be solved by increasing the number
> of file descriptors allowed by the user.

Yes, agreed. I'm just not convinced that the OSP administrator should have to make this change when, in the past, it was built into the image. FWIW, I am telling the rest of my customers to make this change in advance so that they are covered well before their upgrades.

> At best, we could probably supplement our RHOSP17 documentation to mention
> that the ulimit on the Director node should be increased at scale, or maybe
> a KCS for RHOSP17. But general RHEL administration hasn't been something we
> have documented in the past, rather we would defer to RHEL documentation and
> material such as:
> https://access.redhat.com/solutions/146233

This issue has been triggered during FFU with as few as 60 compute nodes, so I would expect a fair number of customer environments to hit it if there is no fix in the upgrade procedure. That, for me, would be the best way to address the problem.

Failing that, the second-best option would be to add this as a known issue in the documentation, pointing to a KCS, yes.

Comment 15 Brendan Shephard 2024-12-13 10:10:30 UTC
Actually, I don't think we were explicitly setting any ulimit value for the mistral_engine container. We have something set for mistral_executor, but not for engine:

❯ yq .parameters.DockerMistralExecutorUlimit deployment/mistral/mistral-executor-container-puppet.yaml
default: ['nofile=1024']
description: ulimit for Mistral Executor Container
type: comma_delimited_list

I can see the value 1048576 in a few places within tripleo-heat-templates, but none of them seem to be related to ulimit. I think that value might come from the fact that the container is running as root, and as such gets the much higher file descriptor limit associated with the root user. I can see that this is indeed the case if I run a random container with `--privileged` as well:

❯ podman run --privileged -it centos:stream9 ulimit -u
Resolved "centos" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull quay.io/centos/centos:stream9...
Getting image source signatures
Copying blob sha256:88c721e12e28a396c7afa9cfc6727057b7d8b52d1e34522e15689da4892a950a
Copying config sha256:f8ac3c66b3c4402fd57590335365c2d740357dff3c63efcd9066463492689f09
Writing manifest to image destination
1048576
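
For comparison, a sketch of the same check without `--privileged`; the expected values in the comments are assumptions based on the defaults discussed in this bug, not captured output:

~~~
# Without --privileged the container inherits the invoking user's nofile limit
podman run --rm -it centos:stream9 ulimit -n                 # expected: 1024 (RHEL default)

# With --privileged podman applies the host maximums instead
podman run --rm --privileged -it centos:stream9 ulimit -n    # expected: 1048576
~~~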

The reason this default was changed in RHEL seems to be security related, so changing the value via some automated method has security implications that need to be considered.

Happy for dfg:upgrades to look at it and consider the request though. I'm just letting you know the likely reason that it was closed as not a bug.

Comment 16 Eric Nothen 2024-12-13 11:23:50 UTC
Thank you for tracking down the source of the change. It may well be valid for RHEL to lower the default limit, but in practice it turns out to be rather low for mid-size OSP environments.

I think the one thing we can't do is nothing. My customers will be warned in advance, but other customers should also either receive a fix or be able to see a warning in the docs before they upgrade.

Comment 17 Eric Nothen 2024-12-13 14:09:34 UTC
I've created and attached a KCS to this BZ.

Comment 19 pweeks 2024-12-13 15:41:06 UTC
*** Bug 2295413 has been marked as a duplicate of this bug. ***

