Bug 1293991 - VMs are going into a non-responding state when an export domain is detached
Summary: VMs are going into a non-responding state when an export domain is detached
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Idan Shaby
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-23 22:58 UTC by nijin ashok
Modified: 2022-07-13 07:35 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-22 14:43:01 UTC
oVirt Team: Storage
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47582 0 None None None 2022-07-13 07:35:09 UTC

Description nijin ashok 2015-12-23 22:58:38 UTC
Description of problem:

VMs are going into a non-responding state when an export domain is detached.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Virtualization 3.5
vdsm-4.16.20-1.el6ev.x86_64

How reproducible:
100% in customer environment.

Steps to Reproduce:
1. Detach an export domain.
2. VMs go into a non-responding state.

Actual results:
VMs went into a non-responding state while the export domain was being detached.

Expected results:
VMs should remain up.


Additional info:

Comment 7 Liron Aravot 2016-01-07 13:32:38 UTC
Hi nijin,
From the logs it appears that vdsm is flooded with getVolumeSize() calls. Were they executed by the user? Why?

The refreshStoragePool call takes a few minutes because the LVM lock is being held:

-----------------------------------------------------------------------------
Thread-265024::DEBUG::2015-11-06 10:05:17,833::lvm::503::OperationMutex::(_invalidatevgs) Operation 'lvm reload operation' is holding the operation mutex, waiting...
-----------------------------------------------------------------------------

The lock is taken by one of the calls to getVolumeSize() (though it could be any operation). The executed command that gets stuck is:

Thread-206320::INFO::2015-11-06 10:05:10,408::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='aa34a2b0-7637-491b-8496-8ae433c4ffb3', spUUID='16c37571-a75d-4933-860a-09219abc5b8d', imgUUID='13774fa6-cde3-469f-92fc-5ca38cbcf64a', volUUID='1ee495f8-3661-46de-9056-8e3d19b47fa9', options=None)


and specifically:

-----------------------------------------------------------------------------

Thread-206320::DEBUG::2015-11-06 10:05:17,887::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'a|/dev/mapper/350002ac0003e1b26|/dev/mapper/350002ac000431b26|/dev/mapper/350002ac000441b26|/dev/mapper/350002ac0004b1b26|/dev/mapper/350002ac000561b26|/dev/mapper/350002ac001641b26|/dev/mapper/350002ac001a01b26|/dev/mapper/350002ac001a91b26|/dev/mapper/350002ac001c71b26|/dev/mapper/350002ac001c81b26|/dev/mapper/350002ac002091b26|/dev/mapper/350002ac0020a1b26|/dev/mapper/350002ac002111b26|/dev/mapper/350002ac002121b26|/dev/mapper/350002ac002151b26|/dev/mapper/350002ac0021c1b26|/dev/mapper/350002ac0021d1b26|/dev/mapper/350002ac0021e1b26|/dev/mapper/350002ac0021f1b26|/dev/mapper/350002ac002201b26|/dev/mapper/350002ac002281b26|/dev/mapper/350002ac002571b26|/dev/mapper/350002ac003591b26|/dev/mapper/350002ac0035a1b26|/dev/mapper/360002ac000000000000000810000b6af|/dev/mapper/360002ac000000000000001ae0000b6af|/dev/mapper/360002ac000000000000001af0000b6af|\', \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name aa34a2b0-7637-491b-8496-8ae433c4ffb3' (cwd None)
Thread-210030::DEBUG::2015-11-06 10:05:17,925::misc::748::SamplingMethod::(__call__) Returning last result
-----------------------------------------------------------------------------
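As an aside, the filter = [ 'a|...|', 'r|.*|' ] clause in that vgs command is an LVM device filter: 'a|' patterns accept a device, 'r|' patterns reject it, the first matching pattern wins, and the trailing 'r|.*|' rejects everything not explicitly accepted. A minimal, hypothetical re-implementation of that first-match rule (the helper name and shortened pattern list are made up for illustration):

```python
import re

def lvm_filter_accepts(device, patterns):
    """First matching pattern wins: 'a|...|' accepts, 'r|...|' rejects.
    A device matching no pattern is accepted (LVM's default)."""
    for pat in patterns:
        action, regex = pat[0], pat[2:-1]  # strip the 'a|'/'r|' prefix and trailing '|'
        if re.search(regex, device):
            return action == "a"
    return True

# Patterns shaped like the vgs command above (only one WWID kept for brevity).
patterns = [
    "a|/dev/mapper/350002ac0003e1b26|",
    "r|.*|",
]
print(lvm_filter_accepts("/dev/mapper/350002ac0003e1b26", patterns))  # True
print(lvm_filter_accepts("/dev/sda", patterns))                       # False
```

So only the multipath devices listed in the accept pattern are scanned; every other block device is skipped by the catch-all reject.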


nijin, it seems the host may have had a problem accessing the storage at that time, which caused the operation to get stuck and the VMs to move to non-responding as well. Regardless of possible improvements to the code (fewer LVM operations) in that area, there is no obvious bug here. How many times did you try to reproduce, and did you see the issue occur each time? By the way, a small correction - the performed operation was DeactivateStorageDomain.
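The contention above can be sketched with a small model of the operation-mutex pattern: one thread holds the mutex for a slow LVM reload while a getVolumeSize() caller queues behind it. This is illustrative Python, not vdsm's actual code; the class, names, and timings are invented for the example:

```python
import threading
import time

class OperationMutex:
    """Toy analogue of vdsm's operation mutex: one LVM operation
    at a time; every other caller waits for the holder."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None

    def run(self, name, operation):
        while not self._lock.acquire(timeout=0.05):
            # Mirrors the log line naming the operation holding the mutex.
            print("Operation %r is holding the operation mutex, waiting..." % self._holder)
        self._holder = name
        try:
            return operation()
        finally:
            self._holder = None
            self._lock.release()

results = []
mutex = OperationMutex()

def slow_reload():
    # Stands in for the multi-minute "lvm vgs" reload seen in the logs.
    time.sleep(0.3)
    results.append("reload done")

def get_volume_size():
    # Stands in for a getVolumeSize() call stuck behind the reload.
    results.append("getVolumeSize done")

t1 = threading.Thread(target=lambda: mutex.run("lvm reload operation", slow_reload))
t2 = threading.Thread(target=lambda: mutex.run("getVolumeSize", get_volume_size))
t1.start()
time.sleep(0.1)  # let the reload grab the mutex first
t2.start()
t1.join()
t2.join()
print(results)  # the reload always finishes before getVolumeSize can run
```

When the real reload takes minutes instead of milliseconds, every queued caller (and the VM monitoring that depends on it) stalls the same way.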

Adding needinfo? on Nir as well to see if anything here looks familiar.

Comment 8 nijin ashok 2016-01-14 13:22:50 UTC
>> it seems the host may have had a problem accessing the storage at that time, which caused the operation to get stuck and the VMs to move to non-responding as well.

The issue seems to happen only when the export storage domain is detached. As far as I can see, the data storage domains look fine.

>> How many times did you try to reproduce and see the issue occur? If the issue reproduces 100%, is it always on the same host?

As per the customer, the issue used to happen on all hosts intermittently.

>> You could attach the /var/log/messages file from that date; currently it's missing.

Attaching the messages file.

>> If the issue reproduces, we can try to enable LVM logging on that machine; currently it's disabled.

I will ask the customer to do it and will provide you with the output.

Comment 10 Nir Soffer 2016-01-17 13:47:09 UTC
(In reply to Liron Aravot from comment #7)
> adding needinfo? on Nir as well to see if something here looks familiar.

Yes, this is typical behavior when an LVM command gets stuck. This should be solved
in RHEL 7.2, and improved in 6.7. See bug 880738.

Please check /var/log/messages; if LVM was stuck, we may find evidence there of a
problem with the devices/SCSI/multipath.
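To illustrate what such a scan might look for, here is a quick filter over a messages-style log. The sample log content and the pattern list are fabricated for the example; the patterns are a starting point, not an exhaustive set of failure strings:

```python
import re

# Fabricated /var/log/messages excerpt; on a real host you would read
# the actual file instead of this sample string.
sample = """\
Nov  6 10:05:02 host multipathd: 350002ac0003e1b26: sdb - path offline
Nov  6 10:05:03 host kernel: sd 2:0:0:1: [sdb] Unhandled error code
Nov  6 10:05:09 host ntpd[1234]: synchronized to 10.0.0.1
"""

# Strings that commonly indicate device/SCSI/multipath trouble.
trouble = re.compile(r"multipath|scsi|sd \d|I/O error|path offline", re.IGNORECASE)

suspect = [line for line in sample.splitlines() if trouble.search(line)]
for line in suspect:
    print(line)
```

Any hits around the time of the stuck vgs command would point at a storage-side problem rather than a vdsm bug.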

Comment 11 Nir Soffer 2016-01-17 13:53:39 UTC
(In reply to nijin ashok from comment #0)
nijin, please provide the versions of these packages on the host:
- kernel
- device-mapper-multipath
- lvm2

Comment 12 nijin ashok 2016-01-18 12:43:32 UTC
kernel 2.6.32-504.el6.x86_64
device-mapper-multipath-0.4.9-80.el6.x86_64
lvm2-2.02.111-2.el6.x86_64

Red Hat Enterprise Linux Server release 6.6 (Santiago)

Comment 13 Idan Shaby 2016-01-31 14:39:18 UTC
Adding need-info on Nir since the versions were supplied.

Comment 14 Nir Soffer 2016-02-20 20:46:59 UTC
(In reply to nijin ashok from comment #12)
> device-mapper-multipath-0.4.9-80.el6.x86_64

This should be good enough regarding bug 1099932.

Ben, are fixes for bug 1254292 available in this version, or for el6?

Comment 15 Nir Soffer 2016-02-20 20:51:58 UTC
(In reply to nijin ashok from comment #12)
> kernel 2.6.32-504.el6.x86_64

Chris, do we have a fix for bug 1139038 for el6?

