Description of problem:
Too many open files

Version-Release number of selected component (if applicable):
16.2 > 17.1 (OVS to OVN migration)

How reproducible:
100%

Steps to Reproduce:
1. # ovn_migration.sh start-migration | sudo tee -a ~/logs/start-migration

Actual results:
2024-06-16 20:31:19.735770 | e0071b6a-fbb0-5077-6137-000000042c68 | FATAL | Ensure we get the ansible interfaces facts | openstack085 | error={"msg": "Unable to execute ssh command line on a controller due to: [Errno 24] Too many open files"}

Expected results:
No errors.

Additional info:
Increase the ulimit:
# ulimit -n 4096
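For reference, the workaround above only raises the soft limit for the shell that runs the migration. A minimal sketch of checking for headroom first (standard shell builtins; the values in the comments are assumptions based on this report):

$ ulimit -Sn          # current soft limit on open files (1024 here)
$ ulimit -Hn          # hard ceiling the soft limit may be raised to
$ ulimit -n 4096      # raise the soft limit for this shell only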
It would help if the reproduction steps could be added so the team can track this down: the bug report says it's 100% reproducible, but it doesn't happen in our regular CI testing. We're going to need more details to get a reproducer. Thanks! daniel
This is based on the standard ulimit set by a default installation. I'll verify the ulimit set on their production environment and on the lab in this scenario so we can see what is there by default. It does seem the ulimit is too low in certain scenarios: they did not hit it in a small lab cluster, but as soon as they ran this on a lab cluster with > 100 nodes they hit it immediately.
Both the old and the fresh install are at:

(undercloud) [stack@openstack01 ~]$ ulimit -Sn
1024

Actually the limits file is completely empty in both cases; everything is commented out:

(undercloud) [stack@openstack02 ~]$ cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means, for example, that setting a limit for wildcard domain here
#can be overridden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overridden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - a user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#        - the wildcard %, can be also used with %group syntax,
#          for maxlogin limit
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open file descriptors
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
#
#<domain>      <type>  <item>         <value>
#
#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file
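For what it's worth, if the higher limit should survive across sessions, an uncommented entry in that file would look like the sketch below. The 4096 value mirrors the workaround from the description; applying it to the stack user is an assumption on my part:

# hypothetical persistent nofile increase in /etc/security/limits.conf
stack    soft    nofile    4096
stack    hard    nofile    4096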
It would be great if you could upload the ~/logs/start-migration file. From a quick look it seems the error is actually coming from TripleO while OVN was being installed/configured.
Were there more migration attempts? It seems the attached file does not contain the error, and based on the timestamp it is from a day after the observed error.

Attached: Monday 17 June 2024 02:19:18 +0200
From the description: 2024-06-16 20:31:19.735770

Can you please attach the logs that contain the error?
Unfortunately these have rotated by now.
I'm not sure how much progress we can make without the logs. Based on the snippet from the description, the error seems to come from TripleO when deploying/configuring OVN: https://opendev.org/openstack/tripleo-ansible/src/commit/6dc26efa6f62648e259cbc356a6f3fc8fc0c4bea/tripleo_ansible/roles/tripleo-podman/tasks/tripleo_podman_install.yml#L29

I'm changing the component to OOO, as configuring the hosts is out of the migration's scope.
It might be caused by having a lot of network interfaces on the compute nodes (OVN ports, etc.), since the failing task looks like Ansible's network fact gathering.

I would try raising the ulimit.

Could you please share a reproducer or log?
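A quick way to gauge whether an affected node is in that territory (plain sysfs listing, nothing OVN-specific assumed):

$ ls -1 /sys/class/net | wc -l    # number of interfaces fact gathering walks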
(In reply to Fabricio from comment #9)
> It might be caused by having a lot of network interfaces on the compute
> nodes (OVN ports, etc.), since the failing task looks like Ansible's
> network fact gathering.
>
> I would try raising the ulimit.
>
> Could you please share a reproducer or log?

I can provide a sosreport of the system if that would help shed some extra light on what might be the cause. And indeed, increasing the ulimit fixes the issue; we described this in the description when the bug was filed. We were aiming to have either something in the documentation pointing out the need to increase the ulimit, or the ansible playbook taking care of it.

Additional info:
Increase the ulimit:
# ulimit -n 4096
I've checked the Ansible interface fact gathering code: https://github.com/ansible/ansible/blob/ab624ad0317205b76e3f3d6d65c2250b6ef6db06/lib/ansible/module_utils/facts/network/linux.py#L136

It loops through /sys/class/net and opens the files inside each device directory. We should document the need to increase the limits when there are a lot of network interfaces.
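A rough shell approximation of why the cost scales with the interface count (the exact set of files the module reads is per the linked source; the attribute names below are standard sysfs entries and this is just a sketch, not the module's code):

# each attribute read is an open()+read()+close(), so the number of file
# operations grows with every OVN-created interface; on top of that sit
# Ansible's own SSH connections and pipes, which is where a 1024 soft
# nofile limit starts to pinch on large clusters
for dev in /sys/class/net/*; do
    cat "$dev/operstate" "$dev/address" "$dev/mtu" 2>/dev/null
done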
This looks like a duplicate of BZ#2159663. It's not isolated to the OVN migration; other `openstack ...` commands fail with the same error, just at different tasks.
*** This bug has been marked as a duplicate of bug 2159663 ***