Bug 853011
Summary: | 3.1 - [vdsm] logging: 'No free file handlers in pool' when /rhev/data-center/mnt/ contains lots of directories
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya>
Component: | vdsm | Assignee: | Yaniv Bronhaim <ybronhei>
Status: | CLOSED ERRATA | QA Contact: | vvyazmin <vvyazmin>
Severity: | medium | Docs Contact: |
Priority: | unspecified
Version: | 6.3 | CC: | abaron, achan, bazulay, fsimonce, hateya, iheim, ilvovsky, lpeer, rvaknin, vvyazmin, yeylon, ykaul
Target Milestone: | rc | Keywords: | ZStream
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: | infra
Fixed In Version: | vdsm-4.9.6-41.0 | Doc Type: | Bug Fix
Doc Text: |
When /rhev/data-center/mnt contained many directories, creating a new storage pool resulted in the exception "No free file handlers in pool." Because all domains shared one global process pool with a limited number of process slots, the exception occurred when the process limit was reached. The fix gives each domain its own process pool instead of a single global pool, making more file handlers available.
|
Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-12-04 19:08:29 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Attachments: |
Description
Haim
2012-08-30 09:13:59 UTC
Created attachment 608091 [details]
vdsm log
Created attachment 608152 [details]
vdsm log
Haim, does this reproduce with 50 valid NFS domains? i.e. instead of creating garbage directories, really create 50 domains. This would mean a scalability issue. Also, which vdsm version?

(In reply to comment #3)
> Haim, does this reproduce with 50 valid NFS domains?
> i.e. instead of creating garbage directories, really create 50 domains.
> This would mean a scalability issue.
>
> Also, which vdsm version?

vdsm-4.9.6-30.0.el6_3.x86_64. Rami - can we check that?

I see that the exception is during createStoragePool and not connectStoragePool. Let's say that I've created 50 NFS storage domains, how do I createStoragePool?

(In reply to comment #7)
> I see that the exception is during createStoragePool and not
> connectStoragePool. Let's say that I've created 50 NFS storage domains, how
> do I createStoragePool?

Changed title. IIRC, since createStoragePool requires spUUID and sdUUID (and not a list of possible domains), the flow will be:

1) Create a storage domain while /rhev/data-center/mnt/ contains lots of directories.
2) Create a storage pool with the storage domain created in step 1 as msdUUID.
3) Create, attach and activate the additional 99.

hateya, thanks for the title change. Your new scenario still means having garbage directories in /rhev/data-center/mnt/, so I believe it's actually not the scalability scenario that abaron wanted tested, but I tested it anyway. I still get tons of "No free file handlers in pool" exceptions for old directories that still exist (they used to be NFS storage domains that were not cleaned up properly due to another bug). Besides these errors, the valid NFS storage domains are created, attached and activated fine. Tested with vdsm-4.9.6-31.0.el6_3.x86_64 on RHEL 6.3.
Thread-3205063::DEBUG::2012-09-09 12:33:13,921::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,921::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,922::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,922::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-3205063::DEBUG::2012-09-09 12:33:13,923::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,924::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free b79a214a-9c79-46af-bbc2-bd641afc213a' (cwd None)
Thread-3205063::DEBUG::2012-09-09 12:33:13,956::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "b79a214a-9c79-46af-bbc2-bd641afc213a" not found\n'; <rc> = 5
Thread-3205063::WARNING::2012-09-09 12:33:13,958::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "b79a214a-9c79-46af-bbc2-bd641afc213a" not found']
Thread-3205063::DEBUG::2012-09-09 12:33:13,958::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-3205080::WARNING::2012-09-09 12:33:13,973::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path
/rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs176
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-3205081::WARNING::2012-09-09 12:33:13,975::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs113
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-3205082::WARNING::2012-09-09 12:33:13,976::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs105
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool

The bottom line of my reproduction (comment #9) is that createStoragePool does not fail on these exceptions; they are just dumped to vdsm.log, and I find them harmless as creation, attachment and activation of storage domains works well.

Created attachment 611187 [details]
vdsm log
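The failure mode in the tracebacks above can be illustrated with a minimal sketch of a bounded handler pool. The class and variable names here are hypothetical, not vdsm's actual implementation; the point is only that a fixed number of slots, once exhausted by scans of stuck mounts, makes every further request fail.

```python
import threading


class HandlerPool:
    """Minimal sketch of a bounded handler pool: acquiring a slot
    when none are free raises, mirroring the vdsm.log exception."""

    def __init__(self, max_handlers):
        self._lock = threading.Lock()
        self._free = max_handlers

    def acquire(self):
        with self._lock:
            if self._free == 0:
                raise Exception("No free file handlers in pool")
            self._free -= 1

    def release(self):
        with self._lock:
            self._free += 1


pool = HandlerPool(max_handlers=10)  # analogous to a 10-slot process pool
for _ in range(10):
    pool.acquire()  # all slots taken, e.g. by scans of stuck NFS mounts
try:
    pool.acquire()  # the 11th concurrent scan fails
except Exception as e:
    print(e)  # -> No free file handlers in pool
```

Once a slot is released, acquisition succeeds again, which matches the observation that the errors are transient and the storage domains still come up.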
http://gerrit.ovirt.org/#/c/8745
http://gerrit.ovirt.org/#/c/8746

To limit the number of collectMetaData threads running at the same time, we added a maxthreads variable. Saggi says that the value of maxthreads needs to be the number of stuck domains we are willing to handle. Well.. how many? The only limit we have is process_pool_max_per_domain, which is set to 10 in config.py; it means that per domain we can run 10 processes at the same time - in this bz we reached the maximum. We decided to use the process_pool_max_per_domain value because we start outOfProcess with only one global pool that can contain only this number of processes. This fixes the bug. If we want, I can change it to use multiple process pools (one for each domain), and we'll be able to run more processes simultaneously - this can be done in another patch.

Found the same behaviour in RHEVM 3.1 - SI22:

RHEVM: rhevm-3.1.0-22.el6ev.noarch
VDSM: vdsm-4.9.6-39.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Verified the following scenario:
1. Create an iSCSI DC with 2 hosts, one iSCSI SD, an Export NFS domain and an ISO NFS domain.
2. Disconnect the ISO NFS storage domain.
3. VDSM on the SPM host enters a deadlock.

All relevant logs attached. Created attachment 634475 [details]
## Logs vdsm, rhevm, screen-shots
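The first fix described above (capping the number of concurrent metadata scans) can be sketched with a semaphore. The names `collect_meta_files`, `scan_domain` and the cap value are illustrative, not vdsm's exact code; the real patch ties the cap to process_pool_max_per_domain.

```python
import threading

MAX_THREADS = 10  # illustrative cap, analogous to process_pool_max_per_domain

# At most MAX_THREADS scans may hold a slot at once, so the backing
# pool of handler processes can never be over-subscribed.
_scan_slots = threading.BoundedSemaphore(MAX_THREADS)


def scan_domain(path):
    # Stand-in for the real metadata scan (hypothetical helper).
    return "metadata for %s" % path


def collect_meta_files(domain_path):
    # Block until a slot is free instead of spawning an unbounded
    # number of scanning threads against a fixed-size pool.
    with _scan_slots:
        return scan_domain(domain_path)


results = [collect_meta_files("/rhev/data-center/mnt/dom%d" % i)
           for i in range(3)]
```

The trade-off noted in the comment still holds: if more than MAX_THREADS mounts are stuck, later scans wait behind them rather than failing outright.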
Another patch to fix this error: http://gerrit.ovirt.org/#/c/9029/

Our first patch allows limiting the number of threads per process pool, but if one mount folder among the temp folders is stuck (blocked at its origin), we still see the exception. This is because we reach our process limit and try to initiate more processes than a single process pool allows. This patch separates the process pools, one per domain.

Verified on RHEVM 3.1 - SI24:

RHEVM: rhevm-3.1.0-26.el6ev.noarch
VDSM: vdsm-4.9.6-41.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
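The per-domain pool design from the second patch can be sketched as a dict of pools keyed by storage domain UUID. All names here (`ProcessPool`, `get_pool`, `MAX_PER_DOMAIN`) are hypothetical and only illustrate the design change, not vdsm's actual API.

```python
import threading

MAX_PER_DOMAIN = 10  # mirrors the process_pool_max_per_domain idea


class ProcessPool:
    # Placeholder for the real out-of-process helper pool.
    def __init__(self, max_procs):
        self.max_procs = max_procs


_pools = {}
_pools_lock = threading.Lock()


def get_pool(sd_uuid):
    # One pool per storage domain: a stuck domain exhausts only its
    # own MAX_PER_DOMAIN slots instead of starving every other
    # domain's scans, as the single global pool did.
    with _pools_lock:
        if sd_uuid not in _pools:
            _pools[sd_uuid] = ProcessPool(MAX_PER_DOMAIN)
        return _pools[sd_uuid]
```

With a global pool, one blocked NFS mount could consume all shared slots; keying pools by domain isolates the failure to the misbehaving domain.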