Bug 707217

Summary: blocked on lvm scalability: vdsm fails to connect to storage pool for 5 minutes - on a loaded environment
Product: Red Hat Enterprise Linux 6
Component: vdsm
Version: 6.2
Hardware: Unspecified
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Target Milestone: rc
Reporter: Moran Goldboim <mgoldboi>
Assignee: Eduardo Warszawski <ewarszaw>
QA Contact: Moran Goldboim <mgoldboi>
CC: abaron, bazulay, ewarszaw, fsimonce, hateya, iheim, ilvovsky, ykaul
Fixed In Version: vdsm-4.9-80.el6
Doc Type: Bug Fix
Bug Depends On: 658639
Last Closed: 2011-12-06 07:19:11 UTC
Attachments: vdsm log

Description Moran Goldboim 2011-05-24 12:42:34 UTC
Created attachment 500606: vdsm log

Description of problem:
Running a host with 60 VMs and restarting libvirt: vdsm restarts and receives a connectStoragePool command from RHEV-M. The command is not answered for 3 minutes, so RHEV-M starts the migration process for the running VMs.
Some LVM operations take a very long time:
Thread-321::DEBUG::2011-05-24 11:33:42,997::lvm::359::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 filter = [ \\"a%/dev/mapper/36006016066102900341493a64a79e011|/dev/mapper/36006016066102900351493a64a79e011|/dev/mapper/36006016066102900361493a64a79e011|/dev/mapper/360060160661029007e98a17ffa81e011|/dev/mapper/36006016066102900be8c283b0344e011%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --refresh 40b5a575-8e2e-44a7-ab56-46ab5aa4bab9' (cwd None)
Thread-321::DEBUG::2011-05-24 11:35:04,221::lvm::359::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
Thread-321::DEBUG::2011-05-24 11:35:04,340::lvm::527::OperationMutex::(_invalidatevgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...
Thread-321::DEBUG::2011-05-24 11:36:33,083::lvm::527::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' got the operation mutex
Thread-321::DEBUG::2011-05-24 11:36:33,106::lvm::529::OperationMutex::(_invalidatevgs) Operation 'lvm invalidate operation' released the operation mutex
Thread-321::DEBUG::2011-05-24 11:36:33,187::lvm::538::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' got the operation mutex
Thread-321::DEBUG::2011-05-24 11:36:33,214::lvm::550::OperationMutex::(_invalidatelvs) Operation 'lvm invalidate operation' released the operation mutex
Thread-321::INFO::2011-05-24 11:36:33,227::sp::862::Storage.StoragePool::(_linkStorageDomain) Linking /rhev/data-center/mnt/blockSD/40b5a575-8e2e-44a7-ab56-46ab5aa4bab9 to /rhev/data-center/06b29e0f-8c22-431e-a6f1-75989a582c52/40b5a575-8e2e-44a7-ab56-46ab5aa4bab9
Thread-321::DEBUG::2011-05-24 11:36:33,279::persistentDict::144::Storage.PersistentDict::(transaction) Starting transaction
Thread-321::DEBUG::2011-05-24 11:36:33,289::persistentDict::137::Storage.PersistentDict::(transaction) Reusing active transaction
Thread-321::DEBUG::2011-05-24 11:36:33,291::persistentDict::137::Storage.PersistentDict::(transaction) Reusing active transaction
Thread-321::DEBUG::2011-05-24 11:36:33,295::persistentDict::152::Storage.PersistentDict::(transaction) Finished transaction
Thread-321::DEBUG::2011-05-24 11:36:33,305::lvm::359::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 filter = [ \\"a%/dev/mapper/36006016066102900341493a64a79e011|/dev/mapper/36006016066102900351493a64a79e011|/dev/mapper/36006016066102900361493a64a79e011|/dev/mapper/360060160661029007e98a17ffa81e011|/dev/mapper/36006016066102900be8c283b0344e011%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --refresh 2b27b725-9063-4a27-899a-3ec8d02c1ade' (cwd None)
Thread-321::DEBUG::2011-05-24 11:37:59,899::lvm::359::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0
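
The OperationMutex lines above show the pattern behind the stall: vdsm serializes all LVM cache operations behind a single named mutex, so one slow vgchange holds up every other storage thread (Thread-321 waits from 11:35:04 to 11:36:33 for the 'lvm reload operation' to finish). A minimal sketch of that pattern in Python, with the class name and log format modeled on the log above (illustrative, not vdsm's actual implementation):

import logging
import threading

log = logging.getLogger("OperationMutex")

class OperationMutex(object):
    """Allows one named operation at a time; waiters log who blocks them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._operation = None

    def acquire(self, operation):
        # Try a non-blocking acquire first, purely so that contention can
        # be logged the way the vdsm log above does.
        if not self._lock.acquire(False):
            log.debug("Operation '%s' is holding the operation mutex, "
                      "waiting...", self._operation)
            self._lock.acquire()
        self._operation = operation
        log.debug("Operation '%s' got the operation mutex", operation)

    def release(self):
        operation, self._operation = self._operation, None
        log.debug("Operation '%s' released the operation mutex", operation)
        self._lock.release()

# Every cache reload/invalidate wraps itself in the mutex, e.g.:
#   mutex.acquire("lvm invalidate operation")
#   try: ... finally: mutex.release()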

In addition, interleaved getVolumeSize calls for VMs fail in the middle of this:
Thread-49::INFO::2011-05-24 11:33:17,045::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getVolumeSize, args: ( sdUUID=2b27b725-9063-4a27-899a-3ec8d02c1ade spUUID=06b29e0f-8c22-431e-a6f1-75989a582c52 imgUUID=46e43834-4827-408a-b444-9754446d762d volUUID=044189e5-8b79-4644-8a75-ed4037e713d2)
clientIFinit::INFO::2011-05-24 11:33:17,050::clientIF::527::vds::(_getNetworkIp) network None: using 0
Thread-49::DEBUG::2011-05-24 11:33:17,054::task::492::TaskManager.Task::(_debug) Task 1a0ba393-b832-4418-8d47-4c4b4f7489b2: moving from state init -> state preparing
clientIFinit::INFO::2011-05-24 11:33:17,057::clientIF::670::vds::(create) vmContainerLock aquired by vm 57965f76-667b-4d56-aa9f-d1508dc88ef8
Thread-49::ERROR::2011-05-24 11:33:17,058::task::856::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 864, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1611, in public_getVolumeSize
    self.validateConnectedPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 84, in validateConnectedPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('06b29e0f-8c22-431e-a6f1-75989a582c52',)
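
The traceback explains the interleaved failures: until connectStoragePool completes, the pool UUID is not registered on the host, so every pool-scoped verb is rejected up front. A minimal sketch of that guard in Python, assuming the connected pools are kept in a dict keyed by spUUID (the HSM attribute and method bodies here are illustrative, reconstructed from the traceback):

class StoragePoolUnknown(Exception):
    def __init__(self, spUUID):
        # Matches the message format seen in the traceback above.
        Exception.__init__(
            self, "Unknown pool id, pool not connected: %r" % ((spUUID,),))

class HSM(object):
    def __init__(self):
        # Populated only once connectStoragePool finishes; while the slow
        # vgchange calls are still running, this stays empty.
        self.pools = {}

    def validateConnectedPool(self, spUUID):
        if spUUID not in self.pools:
            raise StoragePoolUnknown(spUUID)

    def getVolumeSize(self, sdUUID, spUUID, imgUUID, volUUID):
        self.validateConnectedPool(spUUID)
        # Volume lookup elided; it is never reached while the pool is
        # unknown, which is what produces the errors above.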

Comment 2 Federico Simoncelli 2011-06-10 14:15:23 UTC
There are two commands that take a long time to complete; both are lvm vgchange invocations:

Thread-321::DEBUG::2011-05-24 11:33:42,997::lvm::359::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 filter = [ \\"a%/dev/mapper/36006016066102900341493a64a79e011|/dev/mapper/36006016066102900351493a64a79e011|/dev/mapper/36006016066102900361493a64a79e011|/dev/mapper/360060160661029007e98a17ffa81e011|/dev/mapper/36006016066102900be8c283b0344e011%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --refresh 40b5a575-8e2e-44a7-ab56-46ab5aa4bab9' (cwd None)
Thread-321::DEBUG::2011-05-24 11:35:04,221::lvm::359::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0

Thread-321::DEBUG::2011-05-24 11:36:33,305::lvm::359::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 filter = [ \\"a%/dev/mapper/36006016066102900341493a64a79e011|/dev/mapper/36006016066102900351493a64a79e011|/dev/mapper/36006016066102900361493a64a79e011|/dev/mapper/360060160661029007e98a17ffa81e011|/dev/mapper/36006016066102900be8c283b0344e011%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --refresh 2b27b725-9063-4a27-899a-3ec8d02c1ade' (cwd None)
Thread-321::DEBUG::2011-05-24 11:37:59,899::lvm::359::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = ''; <rc> = 0

The --refresh option is known to take a long time in VGs with many LVs.
Resolution of this issue depends on bug 658639 - [RHEL 6.2] [RFE] LVM scalability for RHEV.
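
For reference, the slow path can be timed outside vdsm by re-running the same vgchange --refresh invocation the log shows. A minimal sketch in Python (run as root; the VG name is taken from the log above, and the devices { filter = [...] } section is omitted for brevity, so this is an approximation of what vdsm executes):

import subprocess
import time

# Modeled on the --config string in the vdsm log above.
LVM_CONFIG = ('global { locking_type=1 prioritise_write_locks=1 '
              'wait_for_locks=1 }')

def refresh_vg(vg_name):
    """Time one 'vgchange --refresh'; cost grows with the number of LVs."""
    cmd = ["lvm", "vgchange", "--config", LVM_CONFIG, "--refresh", vg_name]
    start = time.time()
    rc = subprocess.call(cmd)
    print("vgchange --refresh %s: rc=%d in %.1f s"
          % (vg_name, rc, time.time() - start))

refresh_vg("40b5a575-8e2e-44a7-ab56-46ab5aa4bab9")  # VG from the log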

Comment 5 Eduardo Warszawski 2011-06-24 05:42:23 UTC
http://gerrit.usersys.redhat.com/#change,621

Comment 8 Haim 2011-07-20 14:37:31 UTC
Verified (4.9-83). On a host loaded with 120 LUNs, connectStoragePool takes a minute and a half.
There are still 5 lvm commands: 4 for specific VGs and one general one.
There is still work to be done here, but this BZ can be closed.

Comment 9 errata-xmlrpc 2011-12-06 07:19:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2011-1782.html