Bug 853011

Summary: 3.1 - [vdsm] logging: 'No free file handlers in pool' when /rhev/data-center/mnt/ contains lots of directories
Product: Red Hat Enterprise Linux 6 Reporter: Haim <hateya>
Component: vdsm Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED ERRATA QA Contact: vvyazmin <vvyazmin>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.3 CC: abaron, achan, bazulay, fsimonce, hateya, iheim, ilvovsky, lpeer, rvaknin, vvyazmin, yeylon, ykaul
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: vdsm-4.9.6-41.0 Doc Type: Bug Fix
Doc Text:
When /rhev/data-center/mnt contained many directories, creating a new storage pool resulted in the exception "No free file handlers in pool." Because all domains shared a single global process pool with a limited number of process slots, the exception occurred whenever that limit was reached. The fix gives each domain its own process pool instead of using one global pool, changing the design so that more file handlers are available.
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-04 19:08:29 UTC Type: Bug
Attachments (all flags: none):
  vdsm log
  vdsm log
  vdsm log
  ## Logs vdsm, rhevm, screen-shots

Description Haim 2012-08-30 09:13:59 UTC
Description of problem:

1) make sure that /rhev/data-center/mnt/ contains lots of directories [1]
2) restart vdsm service
3) try to create new storage pool [2]

[1] - dir structure:

[root@nott-vds1 ~]# ll /rhev/data-center/mnt/
total 200
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:14 __..
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:14 _..
drwxr-xr-x. 14 vdsm kvm 4096 Aug 26 18:37 blockSD
drwxr-xr-x.  2 vdsm kvm 4096 Aug 12 19:48 hateya-rhevm.qa.lab.tlv.redhat.com:_usr_local_exports_iso
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:25 _kaka_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:07 ls
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:01 _mnt
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:35 _mnt_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:47 _mnt10_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:48 _mnt11_"
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:49 _mnt12_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:50 _mnt14_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:51 _mnt15_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:51 _mnt16_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:51 _mnt17_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:52 _mnt18_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:52 _mnt19_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:02 _mnt2
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:39 _mnt2_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:54 _mnt20_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:54 _mnt21_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:55 _mnt22_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:56 _mnt23_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:56 _mnt24_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:58 _mnt24_; reboot ;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:57 _mnt24_;reboot;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:40 _mnt3_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:41 _mnt4_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:42 _mnt5_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:42 _mnt6_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:43 _mnt7_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:44 _mnt8_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:46 _mnt9_
drwxr-xr-x.  2 vdsm kvm 4096 Aug  2 15:01 orion.qa.lab.tlv.redhat.com:_export_hateya_kaka1
drwxr-xr-x.  2 vdsm kvm 4096 Aug  2 14:52 orion.qa.lab.tlv.redhat.com:_export_hateya_kaka4
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:59 ;reboo;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:03 reboot
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:03 |reboot
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:58 ;reboot
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:58 ;reboot;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:03 reboot>
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:10 reboot;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:04 reboot"
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:09 reboot&&
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:05 rm
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:06 rm 
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:06 rm \
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:32 ; rm -rf _opt* ;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 15:31 ; rm -rf _tmp_ ;
drwxr-xr-x.  2 vdsm kvm 4096 Aug  6 16:05 _tmp

[2] - trace upon create storage pool:

Thread-91::INFO::2012-08-30 11:49:04,454::logUtils::37::dispatcher::(wrapper) Run and protect: createStoragePool(poolType=None, spUUID='b2ce3cea-66fc-4ab5-84d9-db3a41c17245', poolName='iSCSI', masterDom='383812a1-
6678-4785-b07a-da4fb53fc17d', domList=['383812a1-6678-4785-b07a-da4fb53fc17d'], masterVersion=1, lockPolicy=None, lockRenewalIntervalSec=5, leaseTimeSec=60, ioOpTimeoutSec=10, leaseRetries=3, options=None)
Thread-91::DEBUG::2012-08-30 11:49:04,454::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.sdc.refreshStorage)
Thread-91::DEBUG::2012-08-30 11:49:04,455::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-91::DEBUG::2012-08-30 11:49:04,455::misc::1080::SamplingMethod::(__call__) Trying to enter sampling method (storage.iscsi.rescan)
Thread-91::DEBUG::2012-08-30 11:49:04,456::misc::1082::SamplingMethod::(__call__) Got in to sampling method
Thread-91::DEBUG::2012-08-30 11:49:04,456::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/iscsiadm -m session -R' (cwd None)
Thread-91::DEBUG::2012-08-30 11:49:04,475::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-91::DEBUG::2012-08-30 11:49:04,476::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-91::DEBUG::2012-08-30 11:49:05,871::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/multipath' (cwd None)
Thread-91::DEBUG::2012-08-30 11:49:05,914::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
Thread-91::DEBUG::2012-08-30 11:49:05,915::lvm::460::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,916::lvm::462::OperationMutex::(_invalidateAllPvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,917::lvm::472::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,917::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,918::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,918::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,919::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-91::DEBUG::2012-08-30 11:49:05,920::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-91::DEBUG::2012-08-30 11:49:05,921::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_
cache_state=0 disable_after_error_count=3 filter = [ \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b -
-nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 383812a1-6678-4785-b07a-da4fb53fc17d' (cwd None)
Thread-91::DEBUG::2012-08-30 11:49:06,170::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = '  Volume group "383812a1-6678-4785-b07a-da4fb53fc17d" not found\n'; <rc> = 5
Thread-91::WARNING::2012-08-30 11:49:06,172::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "383812a1-6678-4785-b07a-da4fb53fc17d" not found']
Thread-91::DEBUG::2012-08-30 11:49:06,173::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-102::WARNING::2012-08-30 11:49:06,240::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/_mnt4_
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-103::WARNING::2012-08-30 11:49:06,261::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/_mnt20_
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-91::ERROR::2012-08-30 11:49:06,372::task::853::TaskManager.Task::(_setError) Task=`b0dc9ded-4aa5-4b96-82fa-b4480062695f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 761, in createStoragePool
    msd = sdCache.produce(sdUUID=masterDom)
  File "/usr/share/vdsm/storage/sdc.py", line 110, in produce
    dom.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
:

Comment 1 Haim 2012-08-30 09:17:04 UTC
Created attachment 608091 [details]
vdsm log

Comment 2 Haim 2012-08-30 10:53:29 UTC
Created attachment 608152 [details]
vdsm log

Comment 3 Ayal Baron 2012-09-02 06:35:53 UTC
Haim, does this reproduce with 50 valid NFS domains?
i.e. instead of creating garbage directories, really create 50 domains.
This would mean a scalability issue.

Also, which vdsm version?

Comment 4 Haim 2012-09-02 07:24:21 UTC
(In reply to comment #3)
> Haim, does this reproduce with 50 valid NFS domains?
> i.e. instead of creating garbage directories, really create 50 domains.
> This would mean a scalability issue.
> 
> Also, which vdsm version?

vdsm-4.9.6-30.0.el6_3.x86_64.

Rami - can we check that?

Comment 7 Rami Vaknin 2012-09-03 15:33:05 UTC
I see that the exception occurs during createStoragePool and not connectStoragePool. Let's say I've created 50 NFS storage domains - how do I call createStoragePool?

Comment 8 Haim 2012-09-09 09:21:04 UTC
(In reply to comment #7)
> I see that the exception is during createStoragePool and not
> connectStoragePool, Let's say that I've created 50 nfs storage domains, how
> do I createStoragePool???

Changed the title. IIRC, since createStoragePool requires an spUUID and an sdUUID (and not a list of possible domains), the flow will be:

1) create a storage domain while /rhev/data-center/mnt/ contains lots of directories
2) create a storage pool with the storage domain created in step 1 as msUUID
3) create, attach and activate an additional 99 domains.

Comment 9 Rami Vaknin 2012-09-09 11:43:16 UTC
hateya, thanks for the title change. Your new scenario still involves garbage directories in /rhev/data-center/mnt/, so I believe it's actually not the scalability scenario that abaron wanted tested, but I tested it anyway.

I still get tons of "No free file handlers in pool" exceptions for old directories that still exist (they used to be NFS storage domains that were not cleaned up properly due to another bug). Besides these errors, the valid NFS storage domains are created, attached and activated without problems.

Tested with vdsm-4.9.6-31.0.el6_3.x86_64 on RHEL6.3.

Thread-3205063::DEBUG::2012-09-09 12:33:13,921::lvm::474::OperationMutex::(_invalidateAllVgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,921::lvm::493::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,922::lvm::495::OperationMutex::(_invalidateAllLvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,922::misc::1090::SamplingMethod::(__call__) Returning last result
Thread-3205063::DEBUG::2012-09-09 12:33:13,923::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-3205063::DEBUG::2012-09-09 12:33:13,924::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disab
le_after_error_count=3 filter = [ \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,s
ize,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free b79a214a-9c79-46af-bbc2-bd641afc213a' (cwd None)
Thread-3205063::DEBUG::2012-09-09 12:33:13,956::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = '  Volume group "b79a214a-9c79-46af-bbc2-bd641afc213a" not found\n'; <rc> = 5
Thread-3205063::WARNING::2012-09-09 12:33:13,958::lvm::356::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "b79a214a-9c79-46af-bbc2-bd641afc213a" not found']
Thread-3205063::DEBUG::2012-09-09 12:33:13,958::lvm::379::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' released the operation mutex
Thread-3205080::WARNING::2012-09-09 12:33:13,973::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs176
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-3205081::WARNING::2012-09-09 12:33:13,975::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs113
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool
Thread-3205082::WARNING::2012-09-09 12:33:13,976::fileSD::421::scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/qanashead.qa.lab.tlv.redhat.com:_export_rami_scale_nfs105
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 410, in collectMetaFiles
    constants.UUID_GLOB_PATTERN, sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 287, in callCrabRPCFunction
    raise Exception("No free file handlers in pool")
Exception: No free file handlers in pool

Comment 10 Rami Vaknin 2012-09-09 11:46:39 UTC
The bottom line of my reproduction (Comment #9) is that createStoragePool does not fail on these exceptions; they are just dumped to vdsm.log, and I find them harmless since creation, attachment and activation of storage domains work well.

Comment 11 Rami Vaknin 2012-09-09 11:47:50 UTC
Created attachment 611187 [details]
vdsm log

Comment 15 Yaniv Bronhaim 2012-10-24 09:09:09 UTC
http://gerrit.ovirt.org/#/c/8745
http://gerrit.ovirt.org/#/c/8746

To limit the number of collectMetaData threads running at the same time, we added a maxthreads variable.
Saggi says the value of maxthreads should be the number of stuck domains we are willing to handle.
Well... how many?

The only limit we have is process_pool_max_per_domain, which is set to 10 in config.py; it means that per domain we can run 10 processes at the same time - in this bz we reached that maximum.
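
For illustration only, here is a minimal Python sketch (not the actual vdsm code; HandlerPool and call() are hypothetical names) of how a bounded handler pool produces the error above: once every slot is held by a scan of a stuck mount, the next caller finds no free handler and fails, which is what callCrabRPCFunction reports in the traceback.

import threading

class HandlerPool(object):
    """Bounded pool of handler slots, analogous to the out-of-process
    file handler pool sized by process_pool_max_per_domain (10)."""

    def __init__(self, max_handlers=10, timeout=5):
        self._slots = threading.Semaphore(max_handlers)
        self._timeout = timeout

    def call(self, func, *args):
        # Every remote call needs a free slot; when scans of many stuck
        # mounts hold all of them, acquire() times out and we raise the
        # same error seen in the log.
        if not self._slots.acquire(timeout=self._timeout):
            raise Exception("No free file handlers in pool")
        try:
            return func(*args)
        finally:
            self._slots.release()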

Comment 16 Yaniv Bronhaim 2012-10-24 16:28:41 UTC
We decided to use the process_pool_max_per_domain value because we start out of process with only one global pool that can contain only this number of processes.

This fixes the bug.

If we want, I can change it to use multiple process pools (one per domain), so we'll be able to run more processes simultaneously - this can be done in another patch.

Comment 17 vvyazmin@redhat.com 2012-10-28 10:02:27 UTC
Found the same behaviour in RHEVM 3.1 - SI22

RHEVM: rhevm-3.1.0-22.el6ev.noarch 
VDSM: vdsm-4.9.6-39.0.el6_3.x86_64 
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64 
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64 
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Scenario:
1. Create an iSCSI DC with 2 hosts, with one iSCSI SD, an NFS export domain and an NFS ISO domain
2. Disconnect the ISO NFS storage domain
3. VDSM on the SPM server enters a deadlock

All relevant logs attached.

Comment 18 vvyazmin@redhat.com 2012-10-28 10:04:21 UTC
Created attachment 634475 [details]
## Logs vdsm, rhevm, screen-shots

Comment 19 Yaniv Bronhaim 2012-11-05 15:33:49 UTC
Another patch to fix this error:
http://gerrit.ovirt.org/#/c/9029/

Our first patch limits the number of threads for each process, but if one mount folder among the temp folders is stuck (blocked on its origin), we still see the exception. This is because we reach the process limit and try to start more processes than allowed with only one process pool.

This patch creates a separate process pool for each domain.
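
For illustration only, a minimal sketch (hypothetical names, not the actual patch) of keeping a separate bounded pool per domain, so a stuck mount can exhaust only its own slots while scans of healthy domains still find free handlers:

import threading

_pools = {}
_pools_lock = threading.Lock()

def _pool_for(sdUUID, max_per_domain=10):
    # Lazily create a bounded slot pool dedicated to this domain.
    with _pools_lock:
        if sdUUID not in _pools:
            _pools[sdUUID] = threading.Semaphore(max_per_domain)
        return _pools[sdUUID]

def call_on_domain(sdUUID, func, *args):
    # A stuck mount exhausts only its own domain's slots; other
    # domains keep their own free handlers.
    slot = _pool_for(sdUUID)
    if not slot.acquire(timeout=5):
        raise Exception("No free file handlers in pool")
    try:
        return func(*args)
    finally:
        slot.release()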

Comment 20 vvyazmin@redhat.com 2012-11-06 11:52:54 UTC
Verified on RHEVM 3.1 - SI24

RHEVM: rhevm-3.1.0-26.el6ev.noarch
VDSM: vdsm-4.9.6-41.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Comment 22 errata-xmlrpc 2012-12-04 19:08:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html