Bug 1293991

Summary: VMs go into a non-responding state when an export domain is detached
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: vdsm
Assignee: Idan Shaby <ishaby>
Status: CLOSED WONTFIX
QA Contact: Aharon Canan <acanan>
Severity: high
Docs Contact:
Priority: high
Version: 3.5.6
CC: amureini, bazulay, bmarzins, cleech, gklein, kgoldbla, laravot, lsurette, nashok, nsoffer, tnisan, yeylon, ykaul, ylavi
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-22 14:43:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description nijin ashok 2015-12-23 22:58:38 UTC
Description of problem:

VMs go into a non-responding state when an export domain is detached.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization 3.5
vdsm-4.16.20-1.el6ev.x86_64

How reproducible:
100% in customer environment.

Steps to Reproduce:
1. Detach an export domain.
2. VMs go into a non-responding state.

Actual results:
VMs went into a non-responding state during the detach of the export domain.

Expected results:
VMs should remain up.


Additional info:

Comment 7 Liron Aravot 2016-01-07 13:32:38 UTC
Hi nijin,
From the logs it appears that vdsm is flooded with getVolumeSize() calls. Were they executed by the user? Why?

The refreshStoragePool call takes a few minutes because the lvm lock is being held:

-----------------------------------------------------------------------------
Thread-265024::DEBUG::2015-11-06 10:05:17,833::lvm::503::OperationMutex::(_invalidatevgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...
-----------------------------------------------------------------------------

The lock is taken by one of the calls to getVolumeSize() (though it could have been any operation). The executed command that got stuck is:

Thread-206320::INFO::2015-11-06 10:05:10,408::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='aa34a2b0-7637-491b-8496-8ae433c4ffb3', spUUID='16c37
571-a75d-4933-860a-09219abc5b8d', imgUUID='13774fa6-cde3-469f-92fc-5ca38cbcf64a', volUUID='1ee495f8-3661-46de-9056-8e3d19b47fa9', options=None)


and specifically - 

-----------------------------------------------------------------------------

Thread-206320::DEBUG::2015-11-06 10:05:17,887::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\
\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/350002ac0003e1b26|/dev/mapper/350002
ac000431b26|/dev/mapper/350002ac000441b26|/dev/mapper/350002ac0004b1b26|/dev/mapper/350002ac000561b26|/dev/mapper/350002ac001641b26|/dev/mapper/350002ac001a01b26|/dev/mapper
/350002ac001a91b26|/dev/mapper/350002ac001c71b26|/dev/mapper/350002ac001c81b26|/dev/mapper/350002ac002091b26|/dev/mapper/350002ac0020a1b26|/dev/mapper/350002ac002111b26|/dev
/mapper/350002ac002121b26|/dev/mapper/350002ac002151b26|/dev/mapper/350002ac0021c1b26|/dev/mapper/350002ac0021d1b26|/dev/mapper/350002ac0021e1b26|/dev/mapper/350002ac0021f1b
26|/dev/mapper/350002ac002201b26|/dev/mapper/350002ac002281b26|/dev/mapper/350002ac002571b26|/dev/mapper/350002ac003591b26|/dev/mapper/350002ac0035a1b26|/dev/mapper/360002ac
000000000000000810000b6af|/dev/mapper/360002ac000000000000001ae0000b6af|/dev/mapper/360002ac000000000000001af0000b6af|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise
_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free
,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name aa34a2b0-7637-491b-8496-8ae433c4ffb3' (cwd None)
Thread-210030::DEBUG::2015-11-06 10:05:17,925::misc::748::SamplingMethod::(__call__) Returning last result
-----------------------------------------------------------------------------


nijin, it seems that the host may have had a problem accessing the storage at that time, which caused the operation to get stuck and the VMs to move to non-responding as well. Aside from possible improvements to the code (fewer lvm operations) in that area, there is no obvious bug. How many times did you try to reproduce, and how often did the issue occur? By the way, a small correction: the operation performed was DeactivateStorageDomain.

Adding needinfo? on Nir as well to see if something here looks familiar.
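The "Returning last result" line in the log above comes from vdsm's SamplingMethod wrapper, which prevents a storm of identical calls (such as the repeated getVolumeSize() invocations) from each running the underlying lvm operation. A minimal sketch of that pattern, written for illustration and not taken from vdsm's actual code:

```python
import threading

class SamplingMethod:
    """Illustrative sketch of the pattern behind the "Returning last
    result" log line: if a sampling call is already in flight, callers
    that arrive while it runs wait for it to finish and reuse its
    result instead of re-running the expensive operation."""

    def __init__(self, func):
        self._func = func
        self._cond = threading.Condition()
        self._running = False
        self._last_result = None

    def __call__(self):
        with self._cond:
            if self._running:
                # Another thread is already sampling; wait and reuse.
                while self._running:
                    self._cond.wait()
                return self._last_result  # "Returning last result"
            self._running = True
        result = self._func()
        with self._cond:
            self._last_result = result
            self._running = False
            self._cond.notify_all()
        return result
```

Note the trade-off this pattern exposes in the bug: when the wrapped operation (here, an lvm command) blocks for minutes, every caller that arrived during that window blocks with it, which matches the flood of stuck getVolumeSize() threads in the log.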

Comment 8 nijin ashok 2016-01-14 13:22:50 UTC
>> it seems that the host may have had a problem accessing the storage at that time, which caused the operation to get stuck and the VMs to move to non-responding as well.

The issue seems to happen only when the export storage domain is detached. As far as I can tell, the data storage domain looks fine.

>> How many times did you try to reproduce, and how often did the issue occur? If the issue does reproduce 100%, is it always on the same host?

As per the customer, the issue used to happen on all the hosts intermittently.

>> You could attach the /var/log/messages file from that date; currently it's missing.

Attaching the messages file.

>> If the issue reproduces, we can try to enable lvm logging on that machine; currently it's disabled.

I will ask the customer to do it and will provide you the output.

Comment 10 Nir Soffer 2016-01-17 13:47:09 UTC
(In reply to Liron Aravot from comment #7)
> adding needinfo? on Nir as well to see if something here looks familiar.

Yes, this is typical behavior when an lvm command gets stuck. This should be solved
in RHEL 7.2, and improved in 6.7. See bug 880738.

Please check /var/log/messages; if lvm was stuck, we may find evidence there of a
problem with devices/SCSI/multipath.
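A quick way to do the scan Nir suggests is to filter the messages file for storage-layer keywords. A small sketch; the keyword list below is my own illustrative choice, not an official diagnostic set:

```python
import re

# Keywords that commonly mark device/SCSI/multipath trouble in
# /var/log/messages (illustrative list, not exhaustive).
PATTERN = re.compile(
    r"multipath|device-mapper|scsi|I/O error|queue_if_no_path|"
    r"failing path|remaining active paths",
    re.IGNORECASE,
)

def storage_errors(lines):
    """Return the lines that look related to device/SCSI/multipath problems."""
    return [line for line in lines if PATTERN.search(line)]

# Typical usage (path assumed from the comment above):
#     with open("/var/log/messages") as f:
#         for line in storage_errors(f):
#             print(line, end="")
```

Lines such as multipathd path-failure messages or kernel I/O errors around the 10:05 timestamps in the vdsm log would be the evidence to look for.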

Comment 11 Nir Soffer 2016-01-17 13:53:39 UTC
(In reply to nijin ashok from comment #0)
nijin, please provide the versions of these packages on the host:
- kernel
- device-mapper-multipath
- lvm2

Comment 12 nijin ashok 2016-01-18 12:43:32 UTC
kernel 2.6.32-504.el6.x86_64
device-mapper-multipath-0.4.9-80.el6.x86_64
lvm2-2.02.111-2.el6.x86_64

Red Hat Enterprise Linux Server release 6.6 (Santiago)

Comment 13 Idan Shaby 2016-01-31 14:39:18 UTC
Adding need-info on Nir since the versions were supplied.

Comment 14 Nir Soffer 2016-02-20 20:46:59 UTC
(In reply to nijin ashok from comment #12)
> device-mapper-multipath-0.4.9-80.el6.x86_64

This should be good enough with regard to bug 1099932.

Ben, are the fixes for bug 1254292 available in this version, or in el6?

Comment 15 Nir Soffer 2016-02-20 20:51:58 UTC
(In reply to nijin ashok from comment #12)
> kernel 2.6.32-504.el6.x86_64

Chris, do we have a fix for bug 1139038 for el6?