Bug 853011
| Summary: | 3.1 - [vdsm] logging: 'No free file handlers in pool' when /rhev/data-center/mnt/ contains lots of directories | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Haim <hateya> |
| Component: | vdsm | Assignee: | Yaniv Bronhaim <ybronhei> |
| Status: | CLOSED ERRATA | QA Contact: | vvyazmin <vvyazmin> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.3 | CC: | abaron, achan, bazulay, fsimonce, hateya, iheim, ilvovsky, lpeer, rvaknin, vvyazmin, yeylon, ykaul |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | infra | | |
| Fixed In Version: | vdsm-4.9.6-41.0 | Doc Type: | Bug Fix |
| Doc Text: | When /rhev/data-center/mnt contained many directories, creating a new storage pool resulted in the exception "No free file handlers in pool." Because all domains shared one global process pool with a limited number of process slots, the exception occurred whenever that limit was reached. The fix gives each domain its own process pool instead of a single global pool, so more file handlers are available. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-12-04 19:08:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Created attachment 608091 [details]
vdsm log
Created attachment 608152 [details]
vdsm log
Haim, does this reproduce with 50 valid NFS domains? I.e. instead of creating garbage directories, really create 50 domains. That would mean a scalability issue. Also, which vdsm version?

(In reply to comment #3)
> Haim, does this reproduce with 50 valid NFS domains?
> i.e. instead of creating garbage directories, really create 50 domains.
> This would mean a scalability issue.
>
> Also, which vdsm version?

vdsm-4.9.6-30.0.el6_3.x86_64.

Rami - can we check that?

I see that the exception is during createStoragePool and not connectStoragePool. Let's say that I've created 50 NFS storage domains; how do I createStoragePool?

(In reply to comment #7)
> I see that the exception is during createStoragePool and not
> connectStoragePool, Let's say that I've created 50 nfs storage domains, how
> do I createStoragePool???

Changed the title. IIRC, since createStoragePool requires spUUID and sdUUID (and not a list of possible domains), the flow will be:
1) create a storage domain while /rhev/data-center/mnt/ contains lots of directories
2) create a storage pool with the storage domain created in step 1 as the master domain (msUUID)
3) create, attach and activate 99 additional domains

hateya, thanks for the title change. Your new scenario still means having garbage directories in /rhev/data-center/mnt/, so I believe it's actually not the scalability scenario that abaron wanted tested, but I tested it anyway.
I still get tons of "No free file handlers in pool" exceptions for old directories that still exist (they used to be NFS storage domains that were not cleaned up properly due to another bug). Besides these errors, the valid NFS storage domains are created, attached and activated well. (A minimal sketch of why these exceptions pile up follows the log excerpt below.)
Tested with vdsm-4.9.6-31.0.el6_3.x86_64 on RHEL6.3.
Thread-3205063::DEBUG::2012-09-09 12:33:13,921::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,921::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,922::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,922::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-3205063::DEBUG::2012-09-09 12:33:13,923::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,924::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disab
le_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,s
ize,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free b79a214a-9c79-46af-bbc2-bd641afc213a' (cwd None)
Thread-3205063::DEBUG::2012-09-09 12:33:13,956::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "b79a214a-9c79-46af-bbc2-bd641afc213a" not found\n'; <rc> = 5
Thread-3205063::WARNING::2012-09-09 12:33:13,958::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "b79a214a-9c79-46af-bbc2-bd641afc213a" not found']
Thread-3205063::DEBUG::2012-09-09 12:33:13,958::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-3205080::WARNING::2012-09-09 12:33:13,973::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs176
Traceback (most recent call last):
File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-3205081::WARNING::2012-09-09 12:33:13,975::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs113
Traceback (most recent call last):
File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-3205082::WARNING::2012-09-09 12:33:13,976::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs105
Traceback (most recent call last):
File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
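
To make the failure mode above easier to follow, here is a minimal, hypothetical sketch; it is not vdsm's actual remoteFileHandler/scanDomains code. Each directory under /rhev/data-center/mnt/ is probed by its own thread, and every probe needs a slot from a fixed-size handler pool; once the pool is exhausted (in the real code the handlers are helper processes that can block on a stale NFS mount), the remaining probes fail with exactly the exception seen in the log. The names HandlerPool and scan_mounts, the glob pattern, and the pool size of 10 (mirroring the process_pool_max_per_domain default discussed below) are illustrative assumptions.

```python
import glob
import os
import threading

POOL_SIZE = 10  # illustrative; mirrors the process_pool_max_per_domain default cited below


class HandlerPool(object):
    """Hypothetical stand-in for a fixed-size pool of file handlers."""

    def __init__(self, size=POOL_SIZE):
        # Non-blocking semaphore: if no slot is free, fail immediately,
        # which is roughly the behaviour the log excerpt above shows.
        self._slots = threading.Semaphore(size)

    def glob(self, pattern):
        if not self._slots.acquire(False):
            raise Exception("No free file handlers in pool")
        try:
            # The real pool hands this to a helper process, which may hang on
            # an unreachable NFS mount and never give its slot back.
            return glob.glob(pattern)
        finally:
            self._slots.release()


def scan_mounts(mnt_root, pool):
    """One thread per mount directory, loosely modelled on fileSD.scanDomains."""
    def collect(path):
        try:
            # Rough approximation of the metadata glob done by collectMetaFiles.
            pool.glob(os.path.join(path, "*", "dom_md", "metadata"))
        except Exception as exc:
            print("Could not collect metadata file for domain path %s: %s" % (path, exc))

    threads = [threading.Thread(target=collect, args=(os.path.join(mnt_root, name),))
               for name in os.listdir(mnt_root)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With dozens of leftover directories, more than POOL_SIZE threads contend for the pool at once, so some of them fail even though nothing is wrong with their particular path; that is why the warnings above are noisy but harmless for valid domains.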

The bottom line of my reproduction (Comment #9) is that createStoragePool does not fail on these exceptions; they are just dumped to vdsm.log, and I find them harmless, since creation, attachment and activation of storage domains work well.

Created attachment 611187 [details]
vdsm log

http://gerrit.ovirt.org/#/c/8745
http://gerrit.ovirt.org/#/c/8746

To limit the number of collectMetaFiles threads running at the same time, we added a maxthreads variable. Saggi says the value of maxthreads should be the number of stuck domains we are willing to handle. Well... how many? The only limit we have is process_pool_max_per_domain, which is set to 10 in config.py; it means that per domain we can run 10 processes at the same time, and in this bz we reached that maximum. We decided to use the process_pool_max_per_domain value because we start OutOfProcess with only one global pool that can hold only that many processes. This fixes the bug. If we want, I can change it to use multiple process pools (one per domain), so we could run more processes simultaneously; that can be done in another patch.

Found the same behaviour in RHEVM 3.1 - SI22

RHEVM: rhevm-3.1.0-22.el6ev.noarch
VDSM: vdsm-4.9.6-39.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Reproduced with the following scenario:
1. Create an iSCSI DC with 2 hosts, with one iSCSI SD, an Export NFS domain and an ISO NFS domain
2. Disconnect the ISO NFS storage domain
3. VDSM on the SPM server enters a deadlock

All relevant logs attached.

Created attachment 634475 [details]
## Logs vdsm, rhevm, screen-shots

Another patch to fix this error: http://gerrit.ovirt.org/#/c/9029/

Our first patch makes it possible to limit the number of threads, but if one of the mount folders among the leftover directories is stuck (blocked on its origin), we still see the exception. This is because we reach the process limit: with only one process pool, we try to start more processes than it allows. This patch separates the process pools, one per domain. (A sketch of this per-domain split follows this comment block.)

Verified on RHEVM 3.1 - SI24

RHEVM: rhevm-3.1.0-26.el6ev.noarch
VDSM: vdsm-4.9.6-41.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
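
To illustrate the per-domain split described in the last patch above (one process pool per storage domain instead of a single global pool), here is a minimal, hypothetical sketch. It is not the actual vdsm outOfProcess/remoteFileHandler implementation; PerDomainPools, pool_factory and the size of 10 (again mirroring process_pool_max_per_domain) are illustrative assumptions.

```python
import threading


class PerDomainPools(object):
    """Hypothetical registry of bounded handler pools, keyed by storage domain UUID.

    A stuck domain can exhaust only its own pool, so scans of healthy domains
    keep getting handlers instead of hitting 'No free file handlers in pool'.
    """

    def __init__(self, pool_factory, max_per_domain=10):
        self._factory = pool_factory          # e.g. the illustrative HandlerPool class above
        self._max_per_domain = max_per_domain
        self._pools = {}
        self._lock = threading.Lock()

    def get(self, sd_uuid):
        # Lazily create one pool per domain on first use.
        with self._lock:
            if sd_uuid not in self._pools:
                self._pools[sd_uuid] = self._factory(self._max_per_domain)
            return self._pools[sd_uuid]


# Usage sketch (HandlerPool is the illustrative class from the earlier sketch):
# pools = PerDomainPools(HandlerPool)
# pools.get("383812a1-6678-4785-b07a-da4fb53fc17d").glob(
#     "/rhev/data-center/mnt/<server>:<export>/*/dom_md/metadata")
```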

Description of problem:

1) make sure that /rhev/data-center/mnt/ contains lots of directories [1]
2) restart vdsm service
3) try to create new storage pool [2]

(A small setup sketch for step 1 appears after the trace below.)

[1] - dir structure:

[root@nott-vds1 ~]# ll /rhev/data-center/mnt/
total 200
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:14 __..
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:14 _..
drwxr-xr-x. 14 vdsm kvm 4096 Aug 26 18:37 blockSD
drwxr-xr-x. 2 vdsm kvm 4096 Aug 12 19:48 hateya-rhevm.qa.lab.tlv.redhat.com:_usr_local_exports_iso
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:25 _kaka_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:07 ls
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:01 _mnt
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:35 _mnt_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:47 _mnt10_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:48 _mnt11_"
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:49 _mnt12_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:50 _mnt14_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:51 _mnt15_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:51 _mnt16_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:51 _mnt17_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:52 _mnt18_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:52 _mnt19_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:02 _mnt2
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:39 _mnt2_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:54 _mnt20_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:54 _mnt21_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:55 _mnt22_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:56 _mnt23_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:56 _mnt24_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:58 _mnt24_; reboot ;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:57 _mnt24_;reboot;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:40 _mnt3_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:41 _mnt4_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:42 _mnt5_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:42 _mnt6_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:43 _mnt7_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:44 _mnt8_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:46 _mnt9_
drwxr-xr-x. 2 vdsm kvm 4096 Aug 2 15:01 orion.qa.lab.tlv.redhat.com:_export_hateya_kaka1
drwxr-xr-x. 2 vdsm kvm 4096 Aug 2 14:52 orion.qa.lab.tlv.redhat.com:_export_hateya_kaka4
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:59 ;reboo;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:03 reboot
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:03 |reboot
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:58 ;reboot
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:58 ;reboot;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:03 reboot>
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:10 reboot;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:04 reboot"
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:09 reboot&&
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:05 rm
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:06 rm
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:06 rm \
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:32 ; rm -rf _opt* ;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 15:31 ; rm -rf _tmp_ ;
drwxr-xr-x. 2 vdsm kvm 4096 Aug 6 16:05 _tmp

[2] - trace upon create storage pool:

Thread-91::INFO::2012-08-30 11:49:04,454::logUtils::37::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='b2ce3cea-66fc-4ab5-84d9-db3a41c17245', poolName='iSCSI', masterDom='383812a1-6678-4785-b07a-da4fb53fc17d', domList=['383812a1-6678-4785-b07a-da4fb53fc17d'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
Thread-91::DEBUG::2012-08-30 11:49:04,454::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-91::DEBUG::2012-08-30 11:49:04,455::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-91::DEBUG::2012-08-30 11:49:04,455::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-91::DEBUG::2012-08-30 11:49:04,456::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-91::DEBUG::2012-08-30 11:49:04,456::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-91::DEBUG::2012-08-30 11:49:04,475::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-91::DEBUG::2012-08-30 11:49:04,476::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-91::DEBUG::2012-08-30 11:49:05,871::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-91::DEBUG::2012-08-30 11:49:05,914::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-91::DEBUG::2012-08-30 11:49:05,915::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,916::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,917::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,917::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,918::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,918::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,919::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-91::DEBUG::2012-08-30 11:49:05,920::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,921::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 383812a1-6678-4785-b07a-da4fb53fc17d' (cwd None)
Thread-91::DEBUG::2012-08-30 11:49:06,170::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' Volume group "383812a1-6678-4785-b07a-da4fb53fc17d" not found\n'; <rc> = 5
Thread-91::WARNING::2012-08-30 11:49:06,172::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "383812a1-6678-4785-b07a-da4fb53fc17d" not found']
Thread-91::DEBUG::2012-08-30 11:49:06,173::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-102::WARNING::2012-08-30 11:49:06,240::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/_mnt4_
Traceback (most recent call last):
File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-103::WARNING::2012-08-30 11:49:06,261::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/_mnt20_
Traceback (most recent call last):
File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-91::ERROR::2012-08-30 11:49:06,372::task::853::TaskManager.Task::(_setError) Task=`b0dc9ded-4aa5-4b96-82fa-b4480062695f`::Unexpected error
Traceback (most recent call last):
File "/usr/share/vdsm/storage/task.py", line 861, in _run
return fn(*args, **kargs)
File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
res = f(*args, **kwargs)
File "/usr/share/vdsm/storage/hsm.py", line 761, in createStoragePool
msd = sdCache.produce(sdUUID=masterDom)
File "/usr/share/vdsm/storage/sdc.py", line 110, in produce
dom.getRealDomain()
File "/usr/share/vdsm/storage/sdc.py", line 51, in getRealDomain
return self._cache._realProduce(self._sdUUID) :
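
For completeness, a small, hypothetical helper for step 1 of the reproduction above: it simply fills /rhev/data-center/mnt/ with leftover-style directories before the vdsm restart and pool creation in steps 2 and 3. The function name, the count and the naming pattern are arbitrary choices, not part of the original report.

```python
import os

MNT_ROOT = "/rhev/data-center/mnt"  # vdsm mount root from the report


def make_garbage_dirs(count=50):
    """Create many empty directories so that scanDomains has lots of paths to probe."""
    for i in range(count):
        path = os.path.join(MNT_ROOT, "_mnt%d_" % i)
        if not os.path.exists(path):
            os.makedirs(path)


if __name__ == "__main__":
    make_garbage_dirs()
    # Then restart the vdsm service and try to create a new storage pool (steps 2 and 3).
```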