| Summary: | VMs are going into non-responding state when an export domain is detached | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | nijin ashok <nashok> |
| Component: | vdsm | Assignee: | Idan Shaby <ishaby> |
| Status: | CLOSED WONTFIX | QA Contact: | Aharon Canan <acanan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.5.6 | CC: | amureini, bazulay, bmarzins, cleech, gklein, kgoldbla, laravot, lsurette, nashok, nsoffer, tnisan, yeylon, ykaul, ylavi |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-22 14:43:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |

Description
nijin ashok
2015-12-23 22:58:38 UTC
Hi nijin,
From the logs it appears that vdsm is flooded with getVolumeSize() calls. Were they executed by the user? Why?
The refreshStoragePool call takes a few minutes because the lvm lock is being held:
-----------------------------------------------------------------------------
Thread-265024::DEBUG::2015-11-06 10:05:17,833::lvm::503::OperationMutex::(_invalidatevgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...
-----------------------------------------------------------------------------
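The waiting seen in the log can be illustrated with a minimal sketch (this is not vdsm's actual code, just the pattern the log implies): while one thread holds the operation mutex for an "lvm reload operation", every other storage call that needs LVM state queues up behind it, so a single slow lvm command stalls many vdsm threads at once.

```python
import threading
import time

class OperationMutex:
    # Sketch of the serialization the log shows: waiters block here and
    # log "... is holding the operation mutex, waiting...".
    def __init__(self):
        self._lock = threading.Lock()

    def run(self, func):
        with self._lock:  # all other callers queue up on this lock
            return func()

def slow_lvm_reload():
    # Stand-in for an `lvm vgs` command that is slow because the
    # storage is not responding.
    time.sleep(0.2)
    return "reloaded"

mutex = OperationMutex()
results = []

def worker():
    results.append(mutex.run(slow_lvm_reload))

threads = [threading.Thread(target=worker) for _ in range(5)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# Five callers serialized behind one slow reload take ~5 x 0.2 s in total,
# not 0.2 s -- which is how one stuck command freezes the whole pipeline.
```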
The lock is taken by one of the calls to getVolumeSize() (but obviously it could be any operation). The executed command that gets stuck is:
Thread-206320::INFO::2015-11-06 10:05:10,408::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='aa34a2b0-7637-491b-8496-8ae433c4ffb3', spUUID='16c37571-a75d-4933-860a-09219abc5b8d', imgUUID='13774fa6-cde3-469f-92fc-5ca38cbcf64a', volUUID='1ee495f8-3661-46de-9056-8e3d19b47fa9', options=None)
and specifically -
-----------------------------------------------------------------------------
Thread-206320::DEBUG::2015-11-06 10:05:17,887::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/350002ac0003e1b26|/dev/mapper/350002ac000431b26|/dev/mapper/350002ac000441b26|/dev/mapper/350002ac0004b1b26|/dev/mapper/350002ac000561b26|/dev/mapper/350002ac001641b26|/dev/mapper/350002ac001a01b26|/dev/mapper/350002ac001a91b26|/dev/mapper/350002ac001c71b26|/dev/mapper/350002ac001c81b26|/dev/mapper/350002ac002091b26|/dev/mapper/350002ac0020a1b26|/dev/mapper/350002ac002111b26|/dev/mapper/350002ac002121b26|/dev/mapper/350002ac002151b26|/dev/mapper/350002ac0021c1b26|/dev/mapper/350002ac0021d1b26|/dev/mapper/350002ac0021e1b26|/dev/mapper/350002ac0021f1b26|/dev/mapper/350002ac002201b26|/dev/mapper/350002ac002281b26|/dev/mapper/350002ac002571b26|/dev/mapper/350002ac003591b26|/dev/mapper/350002ac0035a1b26|/dev/mapper/360002ac000000000000000810000b6af|/dev/mapper/360002ac000000000000001ae0000b6af|/dev/mapper/360002ac000000000000001af0000b6af|\', \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name aa34a2b0-7637-491b-8496-8ae433c4ffb3' (cwd None)
Thread-210030::DEBUG::2015-11-06 10:05:17,925::misc::748::SamplingMethod::(__call__) Returning last result
-----------------------------------------------------------------------------
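The "Returning last result" line in the excerpt above reflects a sampling pattern: when a scan is already in flight, later callers wait for it and reuse its result instead of launching another expensive LVM command. A minimal sketch, assuming that is the semantics of vdsm's SamplingMethod (the class name is taken from the log, the implementation here is illustrative only):

```python
import threading
import time

class SamplingMethod:
    # Illustrative sketch: concurrent callers share one in-flight sample.
    def __init__(self, func):
        self._func = func
        self._cond = threading.Condition()
        self._running = False
        self._last = None

    def __call__(self):
        with self._cond:
            if self._running:
                while self._running:
                    self._cond.wait()  # piggyback on the in-flight call
                return self._last      # "Returning last result"
            self._running = True
        result = self._func()
        with self._cond:
            self._last = result
            self._running = False
            self._cond.notify_all()
        return result

calls = []
def expensive_scan():
    calls.append(1)
    time.sleep(0.2)  # stand-in for a slow `lvm vgs`
    return "vg metadata"

scan = SamplingMethod(expensive_scan)
results = []
threads = [threading.Thread(target=lambda: results.append(scan()))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All five callers get a result, but the expensive scan runs far fewer
# than five times -- the overlapping callers reused the in-flight sample.
```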
nijin, it seems the host may have had a problem accessing the storage at that time, which caused the operation to get stuck and the VMs to move to non-responding as well. Regardless of possible improvements to the code in that area (fewer lvm operations), there is no obvious bug here. How many times did you try to reproduce, and did you see the issue occur each time? By the way, a small correction: the performed operation was DeactivateStorageDomain.
adding needinfo? on Nir as well to see if something here looks familiar.
>> as it seems the host possibly had a problem to access the storage on that time that causes the operation to get stuck and the vms to move to non responding as well.

The issue seems to happen only when the export storage domain is detached. As far as I can see, the data storage domain looks fine.

>> How many times did you try to reproduce and saw that the issue occurred? if the issue does reproduce 100%, is that always on the same host?

As per the customer, the issue used to happen on all the hosts intermittently.

>> you could attach the /var/log/messages file from that date, currently it's missing.

Attaching the messages file.

>> if the issue reproduces we can try to enable lvm logging on that machine, currently it's disabled.

I will ask the customer to do it and will provide you the output.

(In reply to Liron Aravot from comment #7)
> adding needinfo? on Nir as well to see if something here looks familiar.

Yes, this is typical behavior when an lvm command gets stuck. This should be solved in RHEL 7.2, and improved in 6.7. See bug 880738.

Please check /var/log/messages; if lvm was stuck, we may find evidence there about a problem with the devices/scsi/multipath.

(In reply to nijin ashok from comment #0)

nijin, please provide the versions of these packages on the host:
- kernel
- device-mapper-multipath
- lvm2

kernel 2.6.32-504.el6.x86_64
device-mapper-multipath-0.4.9-80.el6.x86_64
lvm2-2.02.111-2.el6.x86_64
Red Hat Enterprise Linux Server release 6.6 (Santiago)

Adding need-info on Nir since the versions were supplied.

(In reply to nijin ashok from comment #12)
> device-mapper-multipath-0.4.9-80.el6.x86_64

This should be good enough regarding bug 1099932.

Ben, are fixes for bug 1254292 available in this version? Or for el6?

(In reply to nijin ashok from comment #12)
> kernel 2.6.32-504.el6.x86_64

Chris, do we have a fix for bug 1139038 for el6?
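The suggestion above to check /var/log/messages for evidence of device/scsi/multipath trouble can be sketched as a small keyword scan. The sample log lines below are made up for illustration; in a real case you would read the attached messages file instead.

```python
import re

# Hypothetical excerpt standing in for lines from /var/log/messages;
# these contents are invented for illustration only.
sample_log = """\
Nov  6 10:05:01 host kernel: sd 2:0:0:1: [sdb] Unhandled error code
Nov  6 10:05:02 host multipathd: 350002ac0003e1b26: sdb - path offline
Nov  6 10:05:10 host kernel: device-mapper: multipath: Failing path 8:16.
Nov  6 10:05:12 host crond[1234]: (root) CMD (run-parts /etc/cron.hourly)
"""

# Keywords that typically accompany stuck LVM commands caused by bad paths.
PATTERN = re.compile(r"multipath|scsi|sd [0-9]|device-mapper|I/O error",
                     re.IGNORECASE)

suspicious = [line for line in sample_log.splitlines()
              if PATTERN.search(line)]
for line in suspicious:
    print(line)
```

Here the cron entry is filtered out while the SCSI error, the offline multipath path, and the device-mapper path failure are kept, which is the kind of evidence that would confirm the storage-access theory.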