Bug 2075795 - "Too many open files" error prevents vdsm from working properly
Summary: "Too many open files" error prevents vdsm from working properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-15 11:28 UTC by Marian Jankular
Modified: 2022-08-21 08:03 UTC
CC List: 8 users

Fixed In Version: vdsm-4.50.1.3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-14 12:41:13 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github oVirt vdsm pull 240 0 None open check: Fix fd leak when child process fails 2022-06-07 20:19:45 UTC
Red Hat Issue Tracker RHV-45694 0 None None None 2022-04-15 11:28:48 UTC
Red Hat Product Errata RHBA-2022:5583 0 None None None 2022-07-14 12:41:29 UTC

Description Marian Jankular 2022-04-15 11:28:09 UTC
Description of problem:
"Too many open files" error causes vdsm not being able to work properly

Version-Release number of selected component (if applicable):
vdsm-4.40.90.4-1.el8ev.x86_64

How reproducible:
reproduced on multiple hosts in the datacenter

Steps to Reproduce:
1. make sure the iso storage domain is active in datacenter
2. remove export on the nfs server side or just make iso domain inaccessible

Actual results:
After a couple of hours, vdsm reaches the "Too many open files" error.

Expected results:
vdsm works correctly even if the ISO storage domain is not available.

Additional info:
After a vdsm daemon restart, the hosts started working properly again until the ISO NFS storage domain was restored.
I will attach vdsm logs from hosts.

Comment 1 Michal Skrivanek 2022-04-16 09:30:54 UTC
FWIW ISO domains are deprecated

Comment 2 Arik 2022-04-18 14:43:21 UTC
ISO domains are indeed deprecated but users are still advised to use them for various flows, e.g., import vm from external provider, so it would be better to address this if possible

Nir, can it be related to the FD leak that we see you tried to address in https://gerrit.ovirt.org/c/vdsm/+/113395?

Comment 3 Nir Soffer 2022-04-18 16:29:51 UTC
(In reply to Arik from comment #2)
> ISO domains are indeed deprecated but users are still advised to use them
> for various flows, e.g., import vm from external provider, so it would be
> better to address this if possible
> 
> Nir, can it be related to the FD leak that we see you tried to address in
> https://gerrit.ovirt.org/c/vdsm/+/113395?

That was an early draft; the leak was fixed later in vdsm-4.50.0.3. If this leak
is caused by the same issue, it will be fixed in 4.4 SP1.

Let's try to reproduce this with 4.5.0. Moving to ON_QA and setting as TestOnly.

Comment 5 Arik 2022-04-18 16:52:03 UTC
Ack, so it might be solved by the fix for bz 1926589

Comment 7 sshmulev 2022-05-31 13:52:11 UTC
Trying to reproduce this issue according to the steps provided, on vdsm version vdsm-4.40.100.2-1.el8ev, but I can't reproduce it; after an hour I still don't see the "Too many open files" error in the vdsm log file.

maybe I need some clarification about the steps:

1. "make sure the iso storage domain is active in datacenter" - the SD was unattached, so I attached it to the data center and it was active.
2. "remove export on the nfs server side or just make iso domain inaccessible" - I set the iso domain to maintenance, is that good enough?

Comment 8 Nir Soffer 2022-06-01 14:15:48 UTC
(In reply to sshmulev from comment #7)

Note that I never reproduced this issue, so I don't know whether we can reproduce it.

> Trying to reproduce this issue according to the steps provided, on vdsm
> version vdsm-4.40.100.2-1.el8ev, but I can't reproduce it; after an hour I
> still don't see the "Too many open files" error in the vdsm log file.

One hour may not be enough; one day is more likely.

> maybe I need some clarification about the steps:
> 
> 1. "make sure the iso storage domain is active in datacenter" - the SD was
> unattached, so I attached it to the data center and it was active.
> 2. "remove export on the nfs server side or just make iso domain
> inaccessible" - I set the iso domain to maintenance, is that good enough?

No, this deactivates the domain, so the system stops trying to access it and the
problem cannot be reproduced.

You need to:
1. make sure the iso domain is active
2. make the storage server inaccessible

The easiest way is to remove the export for the iso domain:

1. Edit /etc/exports on the storage server and comment out the line exporting the iso domain
2. Refresh the exports

   exportfs -rv

After that you should start to see warnings in vdsm about the inaccessible ISO domain.

You can verify that your change was correct by trying to mount the ISO domain manually
on some host. The mount should fail or get stuck.

Leave the host running like this for a day, and watch the number of file descriptors
in vdsmd and supervdsmd. The leak fixed in bug 1926589 was in supervdsmd.

# pidof -x /usr/libexec/vdsm/vdsmd
2697

# pidof -x /usr/libexec/vdsm/supervdsmd
1964

# ls -lh /proc/2697/fd/
total 0
lr-x------. 1 vdsm kvm 64 Jun  1 16:58 0 -> /dev/null
l-wx------. 1 vdsm kvm 64 Jun  1 16:58 1 -> /dev/null
l-wx------. 1 vdsm kvm 64 Jun  1 16:58 10 -> /var/log/vdsm/vdsm-schema-error.log
lr-x------. 1 vdsm kvm 64 Jun  1 16:58 11 -> 'pipe:[44038]'
l-wx------. 1 vdsm kvm 64 Jun  1 16:58 12 -> 'pipe:[44038]'
lrwx------. 1 vdsm kvm 64 Jun  1 16:58 13 -> 'socket:[44041]'
...

# ls -lh -1 /proc/2697/fd/ | wc -l
70

# ls -lh -1 /proc/1964/fd/
total 0
lr-x------. 1 root root 64 Jun  1 16:58 0 -> /dev/null
lrwx------. 1 root root 64 Jun  1 16:58 1 -> socket:[37260]
l-wx------. 1 root root 64 Jun  1 16:58 10 -> pipe:[37388]
lrwx------. 1 root root 64 Jun  1 16:58 11 -> socket:[37500]
lrwx------. 1 root root 64 Jun  1 16:58 12 -> socket:[46686]
lrwx------. 1 root root 64 Jun  1 16:58 13 -> socket:[44120]
lrwx------. 1 root root 64 Jun  1 16:58 14 -> socket:[44144]
lrwx------. 1 root root 64 Jun  1 16:58 15 -> socket:[48298]
lrwx------. 1 root root 64 Jun  1 16:58 16 -> socket:[48299]
...

# ls -lh -1 /proc/1964/fd/ | wc -l
27

In 4.4 we expect to see an increasing number of fds (likely socket:*). If we wait long
enough, the number will reach the limit and we will start to see the "Too many open files"
error in the log.

I would write the output of "ls -lh /proc/{pid}/fd" to a file every 10 minutes and
check the results after several hours or a day.

In 4.5 you should see a roughly constant number of fds; it can increase and decrease
from time to time.
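
For reference, below is a minimal Python sketch that automates this sampling. It is not part of vdsm; the daemon paths are taken from the pidof commands above, while the 10-minute interval and the output file name are arbitrary choices for illustration. Run it as root on the host.

import os
import subprocess
import time

# Paths match the "pidof -x" commands above; interval and output name are arbitrary.
DAEMONS = {
    "vdsmd": "/usr/libexec/vdsm/vdsmd",
    "supervdsmd": "/usr/libexec/vdsm/supervdsmd",
}
INTERVAL = 600            # 10 minutes, as suggested above
OUTPUT = "fd-counts.csv"  # hypothetical output file

def pid_of(path):
    # Same lookup as "pidof -x <path>"; returns None if the daemon is not running.
    result = subprocess.run(["pidof", "-x", path],
                            stdout=subprocess.PIPE, universal_newlines=True)
    parts = result.stdout.split()
    return int(parts[0]) if parts else None

def fd_count(pid):
    # Count open fds, like "ls -1 /proc/<pid>/fd | wc -l" (requires root).
    try:
        return len(os.listdir("/proc/%d/fd" % pid))
    except OSError:
        return -1

with open(OUTPUT, "a") as out:
    while True:
        row = [time.strftime("%Y-%m-%d %H:%M:%S")]
        for name, path in DAEMONS.items():
            pid = pid_of(path)
            row.append(str(fd_count(pid)) if pid else "n/a")
        out.write(", ".join(row) + "\n")
        out.flush()
        time.sleep(INTERVAL)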

Comment 9 RHEL Program Management 2022-06-06 15:20:11 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 12 sshmulev 2022-06-06 16:29:05 UTC
Moving this bug to ASSIGNED since there is still an increase in the vdsm fds when there is no connection to the ISO domain.

Steps to reproduce:
1. Created a new ISO SD and activated it.
2. Commented out the export for the ISO domain, after which these errors started to appear in the vdsm log:

2022-06-06 12:12:40,561+0300 ERROR (monitor/29f8492) [storage.monitor] Error checking domain 29f8492f-ea38-442b-b187-9256d781e802 (monitor:453)

3. Made a table of the vdsm fd and supervdsm fd counts and saw that the vdsm fds increased during that time (samples every 10 minutes):

Time	                     vdsm fds supervdsm fds
Mon Jun 6 12:00:00 IDT 2022	418	36
Mon Jun 6 14:20:48 IDT 2022	922	36
Mon Jun 6 14:30:48 IDT 2022	977	36
Mon Jun 6 14:40:48 IDT 2022	1036	36
Mon Jun 6 14:50:48 IDT 2022	1091	36
Mon Jun 6 15:00:48 IDT 2022	1147	36
Mon Jun 6 15:10:48 IDT 2022	1206	36
Mon Jun 6 15:20:48 IDT 2022	1265	36
Mon Jun 6 15:51:07 IDT 2022	1438	36
Mon Jun 6 16:01:07 IDT 2022	1492	36

Will update whether this is a regression (need to check whether it also reproduces on ovirt-engine 4.4).

Comment 14 Nir Soffer 2022-06-07 18:17:18 UTC
More info from the host showing this issue.

We see more ioprocess instances than expected; this usually happens when
ioprocess gets stuck on inaccessible storage.

vdsm       19297  3.1 25.2 4640544 2003540 ?     S<sl Jun06  79:34 /usr/bin/python3 /usr/libexec/vdsm/vdsmd
vdsm       19727  0.0  0.0 411704  6812 ?        S<l  Jun06   0:13  \_ /usr/libexec/ioprocess --read-pipe-fd 94 --write-pipe-fd 93 --max-threads 10 --max-queued-requests 10
vdsm       19765  0.0  0.0 411704  4708 ?        S<l  Jun06   0:07  \_ /usr/libexec/ioprocess --read-pipe-fd 106 --write-pipe-fd 105 --max-threads 10 --max-queued-requests 10
vdsm       19794  0.0  0.0 411704  6832 ?        S<l  Jun06   0:07  \_ /usr/libexec/ioprocess --read-pipe-fd 111 --write-pipe-fd 110 --max-threads 10 --max-queued-requests 10
vdsm       19803  0.0  0.0 411704  6812 ?        S<l  Jun06   0:06  \_ /usr/libexec/ioprocess --read-pipe-fd 117 --write-pipe-fd 116 --max-threads 10 --max-queued-requests 10
vdsm       19825  0.0  0.0 411704  4828 ?        S<l  Jun06   0:07  \_ /usr/libexec/ioprocess --read-pipe-fd 127 --write-pipe-fd 126 --max-threads 10 --max-queued-requests 10
vdsm       19844  0.0  0.0 411704  4732 ?        S<l  Jun06   0:06  \_ /usr/libexec/ioprocess --read-pipe-fd 133 --write-pipe-fd 129 --max-threads 10 --max-queued-requests 10
vdsm       19856  0.0  0.0 411704  4808 ?        S<l  Jun06   0:06  \_ /usr/libexec/ioprocess --read-pipe-fd 138 --write-pipe-fd 137 --max-threads 10 --max-queued-requests 10
vdsm      112352  0.0  0.0 411732  6724 ?        S<l  Jun06   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 164 --write-pipe-fd 163 --max-threads 10 --max-queued-requests 10
vdsm      114160  0.0  0.0 559168  4652 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 72 --write-pipe-fd 71 --max-threads 10 --max-queued-requests 10
vdsm      114163  0.0  0.0 559168  4708 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 78 --write-pipe-fd 77 --max-threads 10 --max-queued-requests 10
vdsm      114170  0.0  0.0 559168  6744 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 83 --write-pipe-fd 82 --max-threads 10 --max-queued-requests 10
vdsm      114227  0.0  0.0 559168  6632 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 101 --write-pipe-fd 97 --max-threads 10 --max-queued-requests 10
vdsm      114231  0.0  0.0 559168  4644 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 143 --write-pipe-fd 142 --max-threads 10 --max-queued-requests 10
vdsm      114240  0.0  0.0 559168  6652 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 149 --write-pipe-fd 148 --max-threads 10 --max-queued-requests 10
vdsm      114249  0.0  0.0 559168  4532 ?        S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 155 --write-pipe-fd 154 --max-threads 10 --max-queued-requests 10

Many "Too many open files" errors:

# grep 'OSError: \[Errno 24\] Too many open files' vdsm.log | wc -l
4375

# readlink /proc/$(pidof -x vdsmd)/fd/* > readlink-vdsm-fd.out

# wc -l readlink-vdsm-fd.out 
4096 readlink-vdsm-fd.out

# grep ^/ readlink-vdsm-fd.out | wc -l
9

# grep pipe: readlink-vdsm-fd.out | wc -l
4045

# grep socket: readlink-vdsm-fd.out | wc -l
32

So we have a pipe leak; this is similar to the pipe leak we had in supervdsm
that was fixed in 4.5.
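
The sketch below illustrates this class of bug in Python. It is not vdsm's or ioprocess's actual code; the function names and the --write-pipe-fd flag are made up to mirror the ps output above. The parent creates a pipe to hand to a child process; if starting the child fails, the parent-side fds are never closed, and every failed start leaks fds that show up as 'pipe:[...]' entries in /proc/<pid>/fd. The fix pattern is to always close the parent's copies on the failure path.

import os
import subprocess

def start_child_leaky(cmd):
    # Parent-side pipe; the write end is handed to the child.
    read_fd, write_fd = os.pipe()
    # If Popen raises (executable missing, fork failure, ...), read_fd and
    # write_fd are never closed: two leaked fds per failed attempt.
    proc = subprocess.Popen(cmd + ["--write-pipe-fd", str(write_fd)],
                            pass_fds=(write_fd,))
    return proc, read_fd

def start_child_fixed(cmd):
    read_fd, write_fd = os.pipe()
    try:
        proc = subprocess.Popen(cmd + ["--write-pipe-fd", str(write_fd)],
                                pass_fds=(write_fd,))
    except Exception:
        # Child did not start: release the read end as well.
        os.close(read_fd)
        raise
    finally:
        # The child (if started) inherited its own copy of write_fd;
        # the parent must always close its copy.
        os.close(write_fd)
    return proc, read_fd

Repeated failed starts following the leaky pattern, while the ISO domain is inaccessible, would produce exactly the kind of growing 'pipe:[...]' count seen above.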

Comment 15 Nir Soffer 2022-06-07 19:56:06 UTC
I added an experimental fix on host_mixed_2, and set up this watch script
on both host_mixed_2 and host_mixed_3:

# cat watch-vdsm-pids.sh 
for i in $(seq 3600); do
	echo "$(date --rfc-3339=seconds), $(readlink /proc/$(pidof -x vdsmd)/fd/* | wc -l)"
	sleep 10
done

# sh watch-vdsm-pids.sh > watch-vdsm-pids.csv

Output on host_mixed_2 (with fix):

[root@storage-ge6-vdsm2 ~]# tail -f watch-vdsm-pids.csv
2022-06-07 22:52:15+03:00, 143
2022-06-07 22:52:25+03:00, 143
2022-06-07 22:52:35+03:00, 143
2022-06-07 22:52:45+03:00, 143
2022-06-07 22:52:55+03:00, 143
2022-06-07 22:53:05+03:00, 143
2022-06-07 22:53:15+03:00, 143
2022-06-07 22:53:25+03:00, 143
2022-06-07 22:53:35+03:00, 143
2022-06-07 22:53:45+03:00, 143
2022-06-07 22:53:55+03:00, 143
2022-06-07 22:54:05+03:00, 143
2022-06-07 22:54:15+03:00, 143
2022-06-07 22:54:25+03:00, 143
2022-06-07 22:54:35+03:00, 143
2022-06-07 22:54:45+03:00, 143
2022-06-07 22:54:55+03:00, 143
2022-06-07 22:55:05+03:00, 143
2022-06-07 22:55:15+03:00, 143

Output on host_mixed_3 (without fix):

[root@storage-ge6-vdsm3 ~]# tail -f watch-vdsm-pids.csv 
2022-06-07 22:52:14+03:00, 374
2022-06-07 22:52:24+03:00, 375
2022-06-07 22:52:34+03:00, 376
2022-06-07 22:52:44+03:00, 377
2022-06-07 22:52:54+03:00, 378
2022-06-07 22:53:04+03:00, 379
2022-06-07 22:53:14+03:00, 379
2022-06-07 22:53:24+03:00, 380
2022-06-07 22:53:34+03:00, 381
2022-06-07 22:53:44+03:00, 384
2022-06-07 22:53:54+03:00, 383
2022-06-07 22:54:04+03:00, 384
2022-06-07 22:54:14+03:00, 385
2022-06-07 22:54:24+03:00, 386
2022-06-07 22:54:34+03:00, 387
2022-06-07 22:54:44+03:00, 388
2022-06-07 22:54:54+03:00, 388
2022-06-07 22:55:04+03:00, 389

Need to check again next morning.

Comment 16 sshmulev 2022-06-08 11:52:48 UTC
(In reply to Nir Soffer from comment #15)

The results on the 4.5.1 version:
Host 3 is in Unassigned status after its fd count leaked up to the limit, with the error message: "OSError: [Errno 24] Too many open files"
started with: 2022-06-07 22:52:14+03:00, 374
ended with: 2022-06-08 08:54:40+03:00, 3815

Host 2 is up and didn't have a significant increase in its fds:
started with: 2022-06-07 22:55:15+03:00, 143
ended with: 2022-06-08 08:50:16+03:00, 150

Regarding the 4.4.10 version, it also reached the "OSError: [Errno 24] Too many open files" error and the host is in Unassigned status after almost a night:
started with: Tue Jun 7 15:22:59 IDT 2022, 309
ended with: Tue Jun 7 21:02:59 IDT 2022, 1869
Not a regression

Comment 19 sshmulev 2022-06-19 15:05:45 UTC
Verified.

Versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev

The host is up and didn't have a significant increase in its fd count (over more than 5 hours) while the ISO domain was inaccessible.
During that time the count occasionally went up and down, but without the sharp increase we saw in the previous version.

started with:
Sun Jun 19 12:25:06 IDT 2022
vdsm_fds:
110
supervdsm_fds:
30

ended with:
Sun Jun 19 18:05:07 IDT 2022
vdsm_fds:
169
supervdsm_fds:
35

Comment 23 errata-xmlrpc 2022-07-14 12:41:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.5.1] update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5583

