Description of problem:
Too many open files

Version-Release number of selected component (if applicable):
16.2 > 17.1 (OVS to OVN migration)

How reproducible:
100%

Steps to Reproduce:
1. # ovn_migration.sh start-migration | sudo tee -a ~/logs/start-migration

Actual results:
2024-06-16 20:31:19.735770 | e0071b6a-fbb0-5077-6137-000000042c68 | FATAL | Ensure we get the ansible interfaces facts | openstack085 | error={"msg": "Unable to execute ssh command line on a controller due to: [Errno 24] Too many open files"}

Expected results:
No errors.

Additional info:
Increase the ulimit:
# ulimit -n 4096
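For reference, the workaround above only raises the soft limit for the shell that runs the migration. A minimal sketch of checking for headroom first (standard shell builtins; the values in the comments are assumptions based on this report):

$ ulimit -Sn          # current soft limit on open files (1024 here)
$ ulimit -Hn          # hard ceiling the soft limit may be raised to
$ ulimit -n 4096      # raise the soft limit for this shell only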
It would help if the reproduction steps could be added so the team can track this down: the bug report says it's 100% reproducible, but it doesn't happen in our regular CI testing. We're going to need more details to get a reproducer. Thanks! daniel
This is based on the standard ulimit set by a default installation. I'll verify the ulimit set on their production environment and on the lab in this scenario so we can see what is there by default. It does seem the ulimit is too low in certain scenarios: they did not hit it in a small lab cluster, but as soon as they ran this on a lab cluster with > 100 nodes they hit it immediately.
Both the old and the fresh install are at:

(undercloud) [stack@openstack01 ~]$ ulimit -Sn
1024

Actually the limits file is completely empty in both cases; everything is commented out:

(undercloud) [stack@openstack02 ~]$ cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means, for example, that setting a limit for wildcard domain here
#can be overridden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overridden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - a user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#        - the wildcard %, can be also used with %group syntax,
#          for maxlogin limit
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open file descriptors
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit (KB)
#        - maxlogins - max number of logins for this user
#        - maxsyslogins - max number of logins on the system
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#        - sigpending - max number of pending signals
#        - msgqueue - max memory used by POSIX message queues (bytes)
#        - nice - max nice priority allowed to raise to values: [-20, 19]
#        - rtprio - max realtime priority
#
#<domain>      <type>  <item>         <value>
#
#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file
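For what it's worth, if the higher limit should survive across sessions, an uncommented entry in that file would look like the sketch below. The 4096 value mirrors the workaround from the description; applying it to the stack user is an assumption on my part:

# hypothetical persistent nofile increase in /etc/security/limits.conf
stack    soft    nofile    4096
stack    hard    nofile    4096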
It would be great if you could upload the ~/logs/start-migration file. From a quick look it seems the error is actually coming from TripleO while OVN was being installed/configured.
Were there more migration attempts? It seems the attached file does not contain the error, and based on the timestamp it is from a day after the observed error.

Attached: Monday 17 June 2024 02:19:18 +0200
From the description: 2024-06-16 20:31:19.735770

Can you please attach the logs that contain the error?
Unfortunately these have rotated by now.
I'm not sure how much progress we can make without the logs. Based on the snippet from the description, the error seems to come from TripleO when deploying/configuring OVN: https://opendev.org/openstack/tripleo-ansible/src/commit/6dc26efa6f62648e259cbc356a6f3fc8fc0c4bea/tripleo_ansible/roles/tripleo-podman/tasks/tripleo_podman_install.yml#L29

I'm changing the component to OOO, as configuring the hosts is out of the migration's scope.
It might be caused by having a lot of network interfaces on the compute nodes (OVN ports, etc.), since the failing task looks like Ansible's network fact gathering.

I would try raising the ulimit.

Could you please share a reproducer or log?
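A quick way to gauge whether an affected node is in that territory (plain sysfs listing, nothing OVN-specific assumed):

$ ls -1 /sys/class/net | wc -l    # number of interfaces fact gathering walks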
(In reply to Fabricio from comment #9)
> It might be caused by having a lot of network interfaces on the compute
> nodes (OVN ports, etc.), since the failing task looks like Ansible's
> network fact gathering.
>
> I would try raising the ulimit.
>
> Could you please share a reproducer or log?

I can provide a sosreport of the system if that would help shed some extra light on what might be the cause. And indeed, increasing the ulimit fixes the issue; we described this in the description when the bug was filed. We were aiming to have either something in the documentation pointing out the need to increase the ulimit, or the ansible playbook taking care of it.

Additional info:
Increase the ulimit:
# ulimit -n 4096
I've checked the Ansible interface fact gathering code: https://github.com/ansible/ansible/blob/ab624ad0317205b76e3f3d6d65c2250b6ef6db06/lib/ansible/module_utils/facts/network/linux.py#L136

It loops through /sys/class/net and opens the files inside each device directory. We should document the need to increase the limits when there are a lot of network interfaces.
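A rough shell approximation of why the cost scales with the interface count (the exact set of files the module reads is per the linked source; the attribute names below are standard sysfs entries and this is just a sketch, not the module's code):

# each attribute read is an open()+read()+close(), so the number of file
# operations grows with every OVN-created interface; on top of that sit
# Ansible's own SSH connections and pipes, which is where a 1024 soft
# nofile limit starts to pinch on large clusters
for dev in /sys/class/net/*; do
    cat "$dev/operstate" "$dev/address" "$dev/mtu" 2>/dev/null
done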
This looks like a duplicate of BZ#2159663. It's not isolated to the OVN migration; other `openstack ...` commands fail with the same error, just at different tasks.
*** This bug has been marked as a duplicate of bug 2159663 ***