Description of problem:
"Too many open files" error causes vdsm to stop working properly.

Version-Release number of selected component (if applicable):
vdsm-4.40.90.4-1.el8ev.x86_64

How reproducible:
Reproduced on multiple hosts in the data center.

Steps to Reproduce:
1. Make sure the iso storage domain is active in the data center.
2. Remove the export on the NFS server side, or otherwise make the iso domain inaccessible.

Actual results:
After a couple of hours vdsm reaches the "Too many open files" error.

Expected results:
vdsm works correctly even if the iso storage domain is not available.

Additional info:
After a vdsm daemon restart the hosts started working properly until the iso NFS SD was restored. I will attach vdsm logs from the hosts.
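A quick way to confirm the symptom on an affected host is to compare vdsm's open file descriptor count with its per-process limit. This is only a sketch using standard /proc interfaces; the commands are not part of the original report:

# pid=$(pidof -x /usr/libexec/vdsm/vdsmd)
# ls /proc/$pid/fd | wc -l                   # number of currently open fds
# grep "Max open files" /proc/$pid/limits    # soft/hard limit the count runs into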
FWIW ISO domains are deprecated
ISO domains are indeed deprecated, but users are still advised to use them for various flows, e.g., importing a VM from an external provider, so it would be better to address this if possible.

Nir, can it be related to the FD leak that we see, which you tried to address in https://gerrit.ovirt.org/c/vdsm/+/113395?
(In reply to Arik from comment #2)
> ISO domains are indeed deprecated, but users are still advised to use them
> for various flows, e.g., importing a VM from an external provider, so it
> would be better to address this if possible.
>
> Nir, can it be related to the FD leak that we see, which you tried to
> address in https://gerrit.ovirt.org/c/vdsm/+/113395?

That was an early draft; the leak it addressed was fixed later in vdsm-4.50.0.3. If the leak reported here is caused by the same issue, it will be fixed in 4.4 SP1.

Let's try to reproduce this with 4.5.0. Moving to ON_QA and setting as TestOnly.
Ack, so it might be solved by the fix for bz 1926589
Trying to reproduce this issue according to the steps provided, on vdsm version vdsm-4.40.100.2-1.el8ev, but I can't reproduce it: after an hour I still don't see the "Too many open files" error in the vdsm log file.

Maybe I need some clarification about the steps:

1. "make sure the iso storage domain is active in datacenter" - the SD was unattached, so I attached it to the data center and it was active.
2. "remove export on the nfs server side or just make iso domain inaccessible" - I set the iso domain to maintenance, is that good enough?
(In reply to sshmulev from comment #7)

Note that I have never reproduced this issue myself, so I don't know if we can reproduce it.

> Trying to reproduce this issue according to the steps provided, on vdsm
> version vdsm-4.40.100.2-1.el8ev, but I can't reproduce it: after an hour I
> still don't see the "Too many open files" error in the vdsm log file.

One hour may not be enough; one day is more likely.

> Maybe I need some clarification about the steps:
>
> 1. "make sure the iso storage domain is active in datacenter" - the SD was
> unattached, so I attached it to the data center and it was active.
> 2. "remove export on the nfs server side or just make iso domain
> inaccessible" - I set the iso domain to maintenance, is that good enough?

No, this deactivates the domain, so the system stops trying to access it, and the problem cannot be reproduced.

You need to:
1. make sure the iso domain is active
2. make the storage server inaccessible

The easiest way is to remove the export for the iso domain:

1. Edit /etc/exports on the storage server and comment out the line exporting the iso domain
2. Refresh the exports:

   exportfs -rv

After that you should start to see warnings in vdsm about the inaccessible iso domain. You can verify that your change was correct by trying to mount the iso domain manually on some host. The mount should fail or get stuck.

Leave the host running like this for a day, and watch the number of file descriptors in vdsmd and supervdsmd. The leak fixed in bug 1926589 was in supervdsmd.

# pidof -x /usr/libexec/vdsm/vdsmd
2697

# pidof -x /usr/libexec/vdsm/supervdsmd
1964

# ls -lh /proc/2697/fd/
total 0
lr-x------. 1 vdsm kvm 64 Jun 1 16:58 0 -> /dev/null
l-wx------. 1 vdsm kvm 64 Jun 1 16:58 1 -> /dev/null
l-wx------. 1 vdsm kvm 64 Jun 1 16:58 10 -> /var/log/vdsm/vdsm-schema-error.log
lr-x------. 1 vdsm kvm 64 Jun 1 16:58 11 -> 'pipe:[44038]'
l-wx------. 1 vdsm kvm 64 Jun 1 16:58 12 -> 'pipe:[44038]'
lrwx------. 1 vdsm kvm 64 Jun 1 16:58 13 -> 'socket:[44041]'
...

# ls -lh -1 /proc/2697/fd/ | wc -l
70

# ls -lh -1 /proc/1964/fd/
total 0
lr-x------. 1 root root 64 Jun 1 16:58 0 -> /dev/null
lrwx------. 1 root root 64 Jun 1 16:58 1 -> socket:[37260]
l-wx------. 1 root root 64 Jun 1 16:58 10 -> pipe:[37388]
lrwx------. 1 root root 64 Jun 1 16:58 11 -> socket:[37500]
lrwx------. 1 root root 64 Jun 1 16:58 12 -> socket:[46686]
lrwx------. 1 root root 64 Jun 1 16:58 13 -> socket:[44120]
lrwx------. 1 root root 64 Jun 1 16:58 14 -> socket:[44144]
lrwx------. 1 root root 64 Jun 1 16:58 15 -> socket:[48298]
lrwx------. 1 root root 64 Jun 1 16:58 16 -> socket:[48299]
...

# ls -lh -1 /proc/1964/fd/ | wc -l
27

In 4.4 we expect to see an increasing number of fds (likely socket:*). If we wait long enough, the number will reach the limit and we will start to see the "Too many open files" error in the log.

I would write the output of "ls -lh /proc/{pid}/fd" to a file every 10 minutes and check the results after several hours or a day.

In 4.5 you should see a constant number of fds - it can increase and decrease from time to time.
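A minimal sketch of the 10-minute sampling suggested above, assuming vdsmd is running; the script name and output location are illustrative, not from this comment:

# cat sample-vdsm-fds.sh
pid=$(pidof -x /usr/libexec/vdsm/vdsmd)
while true; do
    # dump the full fd listing to a timestamped file every 10 minutes
    ls -lh /proc/$pid/fd/ > /var/tmp/vdsm-fd-$(date +%Y%m%d-%H%M%S).out
    sleep 600
done

# nohup sh sample-vdsm-fds.sh &

Comparing the earliest and latest files shows which fd types (pipes, sockets, regular files) are accumulating.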
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
Moving this bug to ASSIGNED since there is still an increase in the fds of vdsm when there is no connection to the iso domain.

Steps to reproduce:
1. Created a new iso SD and activated it.
2. Commented out the export for the iso domain on the NFS server, and then these errors started to appear in the vdsm log:

   2022-06-06 12:12:40,561+0300 ERROR (monitor/29f8492) [storage.monitor] Error checking domain 29f8492f-ea38-442b-b187-9256d781e802 (monitor:453)

3. Made a table with the values of the vdsm fds and supervdsm fds and saw that the vdsm fds increased during that time (samples every 10 min):

Time                          vdsm fds   supervdsm fds
Mon Jun 6 12:00:00 IDT 2022   418        36
Mon Jun 6 14:20:48 IDT 2022   922        36
Mon Jun 6 14:30:48 IDT 2022   977        36
Mon Jun 6 14:40:48 IDT 2022   1036       36
Mon Jun 6 14:50:48 IDT 2022   1091       36
Mon Jun 6 15:00:48 IDT 2022   1147       36
Mon Jun 6 15:10:48 IDT 2022   1206       36
Mon Jun 6 15:20:48 IDT 2022   1265       36
Mon Jun 6 15:51:07 IDT 2022   1438       36
Mon Jun 6 16:01:07 IDT 2022   1492       36

Will update whether this is a regression (need to check if it also reproduces on ovirt-engine 4.4).
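For reference, a sketch of a loop that can produce a table like the one above; the script name and the 10-minute interval are illustrative, assuming the standard vdsm daemon paths:

# cat sample-fds.sh
vdsm_pid=$(pidof -x /usr/libexec/vdsm/vdsmd)
supervdsm_pid=$(pidof -x /usr/libexec/vdsm/supervdsmd)
while true; do
    # one row per sample: timestamp, vdsm fd count, supervdsm fd count
    echo "$(date)  $(ls /proc/$vdsm_pid/fd | wc -l)  $(ls /proc/$supervdsm_pid/fd | wc -l)"
    sleep 600
done

# sh sample-fds.sh >> fd-samples.log &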
More info from the host showing this issue.

We see more ioprocess instances than expected - this usually happens when ioprocess gets stuck on inaccessible storage.

vdsm   19297  3.1 25.2 4640544 2003540 ?  S<sl Jun06  79:34 /usr/bin/python3 /usr/libexec/vdsm/vdsmd
vdsm   19727  0.0  0.0  411704    6812 ?  S<l  Jun06   0:13  \_ /usr/libexec/ioprocess --read-pipe-fd 94 --write-pipe-fd 93 --max-threads 10 --max-queued-requests 10
vdsm   19765  0.0  0.0  411704    4708 ?  S<l  Jun06   0:07  \_ /usr/libexec/ioprocess --read-pipe-fd 106 --write-pipe-fd 105 --max-threads 10 --max-queued-requests 10
vdsm   19794  0.0  0.0  411704    6832 ?  S<l  Jun06   0:07  \_ /usr/libexec/ioprocess --read-pipe-fd 111 --write-pipe-fd 110 --max-threads 10 --max-queued-requests 10
vdsm   19803  0.0  0.0  411704    6812 ?  S<l  Jun06   0:06  \_ /usr/libexec/ioprocess --read-pipe-fd 117 --write-pipe-fd 116 --max-threads 10 --max-queued-requests 10
vdsm   19825  0.0  0.0  411704    4828 ?  S<l  Jun06   0:07  \_ /usr/libexec/ioprocess --read-pipe-fd 127 --write-pipe-fd 126 --max-threads 10 --max-queued-requests 10
vdsm   19844  0.0  0.0  411704    4732 ?  S<l  Jun06   0:06  \_ /usr/libexec/ioprocess --read-pipe-fd 133 --write-pipe-fd 129 --max-threads 10 --max-queued-requests 10
vdsm   19856  0.0  0.0  411704    4808 ?  S<l  Jun06   0:06  \_ /usr/libexec/ioprocess --read-pipe-fd 138 --write-pipe-fd 137 --max-threads 10 --max-queued-requests 10
vdsm  112352  0.0  0.0  411732    6724 ?  S<l  Jun06   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 164 --write-pipe-fd 163 --max-threads 10 --max-queued-requests 10
vdsm  114160  0.0  0.0  559168    4652 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 72 --write-pipe-fd 71 --max-threads 10 --max-queued-requests 10
vdsm  114163  0.0  0.0  559168    4708 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 78 --write-pipe-fd 77 --max-threads 10 --max-queued-requests 10
vdsm  114170  0.0  0.0  559168    6744 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 83 --write-pipe-fd 82 --max-threads 10 --max-queued-requests 10
vdsm  114227  0.0  0.0  559168    6632 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 101 --write-pipe-fd 97 --max-threads 10 --max-queued-requests 10
vdsm  114231  0.0  0.0  559168    4644 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 143 --write-pipe-fd 142 --max-threads 10 --max-queued-requests 10
vdsm  114240  0.0  0.0  559168    6652 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 149 --write-pipe-fd 148 --max-threads 10 --max-queued-requests 10
vdsm  114249  0.0  0.0  559168    4532 ?  S<l  Jun06   0:02  \_ /usr/libexec/ioprocess --read-pipe-fd 155 --write-pipe-fd 154 --max-threads 10 --max-queued-requests 10

Many "Too many open files" errors:

# grep 'OSError: \[Errno 24\] Too many open files' vdsm.log | wc -l
4375

# readlink /proc/$(pidof -x vdsmd)/fd/* > readlink-vdsm-fd.out

# wc -l readlink-vdsm-fd.out
4096 readlink-vdsm-fd.out

# grep ^/ readlink-vdsm-fd.out | wc -l
9

# grep pipe: readlink-vdsm-fd.out | wc -l
4045

# grep socket: readlink-vdsm-fd.out | wc -l
32

So we have a pipe leak - this is similar to the pipe leak we had in supervdsm that was fixed in 4.5.
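Since stuck ioprocess helpers accompany the leak, a quick way to count how many ioprocess children vdsm currently has; the exact command is a suggestion, not taken from this comment:

# ps --ppid $(pidof -x /usr/libexec/vdsm/vdsmd) -o pid,etime,cmd | grep -c ioprocess

Comparing this count over time while the iso domain is unreachable helps confirm whether stuck helpers are accumulating.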
I added an experimental fix on host_mixed_2, and set up this watch script on both host_mixed_2 and host_mixed_3:

# cat watch-vdsm-pids.sh
for i in $(seq 3600); do
    echo "$(date --rfc-3339=seconds), $(readlink /proc/$(pidof -x vdsmd)/fd/* | wc -l)"
    sleep 10
done

# sh watch-vdsm-pids.sh > watch-vdsm-pids.csv

Output on host_mixed_2 (with fix):

[root@storage-ge6-vdsm2 ~]# tail -f watch-vdsm-pids.csv
2022-06-07 22:52:15+03:00, 143
2022-06-07 22:52:25+03:00, 143
2022-06-07 22:52:35+03:00, 143
2022-06-07 22:52:45+03:00, 143
2022-06-07 22:52:55+03:00, 143
2022-06-07 22:53:05+03:00, 143
2022-06-07 22:53:15+03:00, 143
2022-06-07 22:53:25+03:00, 143
2022-06-07 22:53:35+03:00, 143
2022-06-07 22:53:45+03:00, 143
2022-06-07 22:53:55+03:00, 143
2022-06-07 22:54:05+03:00, 143
2022-06-07 22:54:15+03:00, 143
2022-06-07 22:54:25+03:00, 143
2022-06-07 22:54:35+03:00, 143
2022-06-07 22:54:45+03:00, 143
2022-06-07 22:54:55+03:00, 143
2022-06-07 22:55:05+03:00, 143
2022-06-07 22:55:15+03:00, 143

Output on host_mixed_3 (without fix):

[root@storage-ge6-vdsm3 ~]# tail -f watch-vdsm-pids.csv
2022-06-07 22:52:14+03:00, 374
2022-06-07 22:52:24+03:00, 375
2022-06-07 22:52:34+03:00, 376
2022-06-07 22:52:44+03:00, 377
2022-06-07 22:52:54+03:00, 378
2022-06-07 22:53:04+03:00, 379
2022-06-07 22:53:14+03:00, 379
2022-06-07 22:53:24+03:00, 380
2022-06-07 22:53:34+03:00, 381
2022-06-07 22:53:44+03:00, 384
2022-06-07 22:53:54+03:00, 383
2022-06-07 22:54:04+03:00, 384
2022-06-07 22:54:14+03:00, 385
2022-06-07 22:54:24+03:00, 386
2022-06-07 22:54:34+03:00, 387
2022-06-07 22:54:44+03:00, 388
2022-06-07 22:54:54+03:00, 388
2022-06-07 22:55:04+03:00, 389

Need to check again next morning.
(In reply to Nir Soffer from comment #15)
> Need to check again next morning.

The results on the 4.5.1 version:

Host 3 is in Unassigned status after it leaked fds up to its limit, with the error message "OSError: [Errno 24] Too many open files":
started with: 2022-06-07 22:52:14+03:00, 374
ended with:   2022-06-08 08:54:40+03:00, 3815

Host 2 is up and did not have a significant increase in its fds:
started with: 2022-06-07 22:55:15+03:00, 143
ended with:   2022-06-08 08:50:16+03:00, 150

Regarding the 4.4.10 version, it also reached the "OSError: [Errno 24] Too many open files" error and the host is in Unassigned status after almost a night:
started with: Tue Jun 7 15:22:59 IDT 2022, 309
ended with:   Tue Jun 7 21:02:59 IDT 2022, 1869

Not a regression.
Verified.

Versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev

The host is up and did not have a significant increase in its fd count (for more than 5 hours) while the ISO domain was inaccessible. During that time the count sometimes went up and down, but without the sharp increase we saw in the previous version.

started with:
Sun Jun 19 12:25:06 IDT 2022  vdsm_fds: 110  supervdsm_fds: 30

ended with:
Sun Jun 19 18:05:07 IDT 2022  vdsm_fds: 169  supervdsm_fds: 35
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.5.1] update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:5583