Description of problem:
VMs go into a non responding state when an export domain is detached.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization 3.5
vdsm-4.16.20-1.el6ev.x86_64

How reproducible:
100% in the customer environment.

Steps to Reproduce:
1. Detach an export domain.
2. VMs go into a non responding state.

Actual results:
VMs went into a non responding state during the detach of the export domain.

Expected results:
VMs should stay up.

Additional info:
Hi nijin,

From the logs it appears that vdsm is flooded with getVolumeSize() calls. Were they executed by the user? Why? The refreshStoragePool call takes a few minutes because the lvm lock is being held:
-----------------------------------------------------------------------------
Thread-265024::DEBUG::2015-11-06 10:05:17,833::lvm::503::OperationMutex::(_invalidatevgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...
-----------------------------------------------------------------------------
The lock is taken by one of the calls to getVolumeSize() (though obviously it could be any operation). The executed command that gets stuck is:
Thread-206320::INFO::2015-11-06 10:05:10,408::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='aa34a2b0-7637-491b-8496-8ae433c4ffb3', spUUID='16c37571-a75d-4933-860a-09219abc5b8d', imgUUID='13774fa6-cde3-469f-92fc-5ca38cbcf64a', volUUID='1ee495f8-3661-46de-9056-8e3d19b47fa9', options=None)
and specifically:
-----------------------------------------------------------------------------
Thread-206320::DEBUG::2015-11-06 10:05:17,887::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/350002ac0003e1b26|/dev/mapper/350002ac000431b26|/dev/mapper/350002ac000441b26|/dev/mapper/350002ac0004b1b26|/dev/mapper/350002ac000561b26|/dev/mapper/350002ac001641b26|/dev/mapper/350002ac001a01b26|/dev/mapper/350002ac001a91b26|/dev/mapper/350002ac001c71b26|/dev/mapper/350002ac001c81b26|/dev/mapper/350002ac002091b26|/dev/mapper/350002ac0020a1b26|/dev/mapper/350002ac002111b26|/dev/mapper/350002ac002121b26|/dev/mapper/350002ac002151b26|/dev/mapper/350002ac0021c1b26|/dev/mapper/350002ac0021d1b26|/dev/mapper/350002ac0021e1b26|/dev/mapper/350002ac0021f1b26|/dev/mapper/350002ac002201b26|/dev/mapper/350002ac002281b26|/dev/mapper/350002ac002571b26|/dev/mapper/350002ac003591b26|/dev/mapper/350002ac0035a1b26|/dev/mapper/360002ac000000000000000810000b6af|/dev/mapper/360002ac000000000000001ae0000b6af|/dev/mapper/360002ac000000000000001af0000b6af|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name aa34a2b0-7637-491b-8496-8ae433c4ffb3' (cwd None)
Thread-210030::DEBUG::2015-11-06 10:05:17,925::misc::748::SamplingMethod::(__call__) Returning last result
-----------------------------------------------------------------------------

nijin, it seems the host possibly had a problem accessing the storage at that time, which caused the operation to get stuck and the VMs to move to non responding as well. Aside from possible improvements to the code (fewer lvm operations) in that area, there is no obvious bug here. How many times did you try to reproduce, and did you see the issue occur each time?

btw, adding a small correction - the performed operation was DeactivateStorageDomain.

Adding needinfo? on Nir as well to see if something here looks familiar.
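To illustrate the contention pattern described above, here is a minimal sketch (not vdsm's actual code; all names and timings are hypothetical) of how many short getVolumeSize()-style holders of a single operation mutex can delay a refresh-style caller queued behind them:

```python
import threading
import time

# Hypothetical stand-in for vdsm's lvm "operation mutex": every LVM
# operation serializes on one lock, so a flood of short callers delays
# any caller that queues behind them.
lvm_operation_mutex = threading.Lock()

def get_volume_size(results, idx):
    with lvm_operation_mutex:        # each caller holds the mutex briefly
        time.sleep(0.01)             # stand-in for running 'lvm vgs'
        results[idx] = 1024

def refresh_storage_pool():
    start = time.time()
    with lvm_operation_mutex:        # must wait for the callers ahead of it
        pass
    return time.time() - start

results = {}
workers = [threading.Thread(target=get_volume_size, args=(results, i))
           for i in range(20)]
for w in workers:
    w.start()
waited = refresh_storage_pool()      # delayed by whichever callers got in first
for w in workers:
    w.join()
print(len(results))
```

With enough concurrent callers (the logs show hundreds of threads), the wait observed by the refresh grows to minutes even though each individual `lvm vgs` is short.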
>> as it seems the host possibly had a problem to access the storage on that time that causes the operation to get stuck and the vms to move to non responding as well.

The issue seems to happen only when the export storage domain is detached. As far as I can see, the data storage domain looks fine.

>> How many times did you try to reproduce and saw that the issue occurred? if the issue does reproduce 100%, is that always on the same host?

According to the customer, the issue used to happen on all the hosts intermittently.

>> you could attach the var/log/messages file from that date, currently it's missing.

Attaching the messages file.

>> if the issue reproduces we can try to enable lvm logging on that machine, currently it's disabled.

I will ask the customer to do it and will provide you with the output.
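For reference, enabling lvm debug logging could look something like the following lvm.conf fragment (a sketch only; the file path is illustrative, and the exact settings should be confirmed against lvm.conf(5) on the customer's version):

```
log {
    # Write debug output to a dedicated file; level 7 is the most verbose.
    file = "/var/log/lvm2.log"
    level = 7
    # Reduce duplicate noise in syslog while file logging is enabled.
    syslog = 0
}
```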
(In reply to Liron Aravot from comment #7)
> adding needinfo? on Nir as well to see if something here looks familiar.

Yes, this is typical behavior when an lvm command gets stuck. This should be solved in RHEL 7.2, and improved in 6.7. See bug 880738.

Please check /var/log/messages; if lvm was stuck, we may find evidence there of a problem with the devices/scsi/multipath.
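To narrow down the messages file, something like the following could filter for the kind of evidence meant above (an illustrative helper of my own, not part of vdsm or any tool; the patterns are examples of typical kernel/multipathd error lines, not an exhaustive list):

```python
import re

# Hypothetical helper: scan /var/log/messages-style lines for the kinds of
# errors that typically accompany a stuck lvm command -- buffer I/O errors,
# failed/offline multipath paths, SCSI aborts and timeouts.
PATTERNS = [
    r"Buffer I/O error",
    r"multipathd?: .*(fail|offline)",
    r"sd [0-9:]+ .*(abort|timeout|error)",
]

def find_storage_errors(lines):
    regexes = [re.compile(p, re.IGNORECASE) for p in PATTERNS]
    return [line for line in lines
            if any(rx.search(line) for rx in regexes)]

# Sample input standing in for /var/log/messages content.
sample = [
    "Nov  6 10:05:12 host kernel: Buffer I/O error on device dm-7, logical block 0",
    "Nov  6 10:05:13 host multipathd: 350002ac0003e1b26: sdb - path offline",
    "Nov  6 10:05:14 host ntpd[1234]: time reset +0.1s",
]
hits = find_storage_errors(sample)
print(len(hits))
```

Here only the I/O error and the offline path would be flagged; the unrelated ntpd line is ignored.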
(In reply to nijin ashok from comment #0)

nijin, please provide the versions of these packages on the host:
- kernel
- device-mapper-multipath
- lvm2
kernel 2.6.32-504.el6.x86_64
device-mapper-multipath-0.4.9-80.el6.x86_64
lvm2-2.02.111-2.el6.x86_64
Red Hat Enterprise Linux Server release 6.6 (Santiago)
Adding need-info on Nir since the versions were supplied.
(In reply to nijin ashok from comment #12)
> device-mapper-multipath-0.4.9-80.el6.x86_64

This should be good enough regarding bug 1099932.

Ben, are the fixes for bug 1254292 available in this version, or for el6 at all?
(In reply to nijin ashok from comment #12)
> kernel 2.6.32-504.el6.x86_64

Chris, do we have a fix for bug 1139038 for el6?