Bug 1022976 - SD is partially accessible after extending.
Summary: SD is partially accessible after extending.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Sergey Gotliv
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On: 1023206
Blocks: 1025467 3.3snap3
 
Reported: 2013-10-24 11:49 UTC by Pavel Zhukov
Modified: 2018-12-09 17:14 UTC
CC List: 21 users

Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
LvmCache did not invalidate stale filters, so after adding a new FC or iSCSI LUN to a volume group, hosts could not access the storage domains and became non-operational. Now, all filters are validated after a new device is added and before the storage domain is extended, so hosts can access storage domains which have been extended.
Clone Of:
Clones: 1025467
Environment:
Last Closed: 2014-01-21 16:19:15 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 522173 0 None None None Never
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC
oVirt gerrit 20552 0 None None None Never
oVirt gerrit 21223 0 None None None Never

Description Pavel Zhukov 2013-10-24 11:49:14 UTC
Description of problem:
After extending the SD, hosts went to non-operational status because the SD became inaccessible.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-23.0.el6ev.x86_64

How reproducible:
Not yet known

Steps to Reproduce:
1. Map new LUN
2. Run multipath -r (optional)
3. Extend SD

Actual results:
Hosts went to non-operational state; VMs started to migrate but failed and got stuck in 'Migrating from' status.

Expected results:
Hosts are up 

Additional info:

Comment 7 Pavel Zhukov 2013-10-24 20:37:49 UTC
It looks like the vgs command (_reloadvgs) returns 0 even if PVs are missing [1]; as a result, domainThreadMonitor uses the wrong lvmcache (with stale filters).
vgck returns 5 if PVs are missing, and the cmd function invalidates the filters and retries with new ones. But chkVG doesn't update the lvmcache.

[1] 
Thread-688802::DEBUG::2013-10-23 11:09:01,967::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)
Thread-688802::DEBUG::2013-10-23 11:09:02,639::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = "  Couldn't find device with uuid hegDlo-Q0sQ-bmf3-E293-bIJF-0fj3-85jMDP.\n  Couldn't find device with uuid FkqpOc-6XDn-2igg-nA2n-110Q-LHlU-TwqiL9.\n  Couldn't find device with uuid VUSiHy-oTWE-ORNh-HkxU-TDu6-GBNk-pgwZBo.\n  Couldn't find device with uuid lGs4mM-wZix-pYYn-8uTr-i3As-Ocnm-PM1Aia.\n  Couldn't find device with uuid 4WtfsQ-Kpxm-uylt-MQ2R-hJjb-KEUU-SI1v66.\n"; <rc> = 0
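
As a rough sketch of the flow described above (names are illustrative, not the actual vdsm code): cmd() retries with a rebuilt filter only on a non-zero return code, so vgs (rc 0 despite missing PVs) never triggers the retry, while vgck (rc 5) does:

import subprocess

def scan_multipath_devices():
    # Illustrative stand-in for vdsm's multipath device scan; returns
    # WWIDs like the ones visible in the filter in the log above.
    return ["20017380063ea059a", "20017380063ea059b"]

def run_lvm(args, devices):
    # Build the same kind of --config device filter seen in the log.
    config = 'devices { filter = [ "a%%%s%%", "r%%.*%%" ] }' % "|".join(devices)
    p = subprocess.Popen(["lvm"] + args + ["--config", config],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, out, err

class LvmCache(object):
    def __init__(self):
        self._filter = None  # cached device list; goes stale when a LUN is mapped

    def _get_filter(self):
        if self._filter is None:
            self._filter = scan_multipath_devices()
        return self._filter

    def invalidate_filter(self):
        self._filter = None

    def cmd(self, args):
        rc, out, err = run_lvm(args, self._get_filter())
        if rc != 0:
            # Suspected stale filter: rebuild it and retry once. vgs
            # returns 0 even when PVs are missing, so it never reaches
            # this branch; vgck returns 5 and does, but the VG data
            # cached from the earlier vgs run is not refreshed.
            self.invalidate_filter()
            rc, out, err = run_lvm(args, self._get_filter())
        return rc, out, err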

Comment 10 Nir Soffer 2013-10-27 23:58:17 UTC
This seems to be the error in this log:

This vgck command failed because of stale filters:

Thread-688802::DEBUG::2013-10-23 11:09:02,647::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)
Thread-688802::DEBUG::2013-10-23 11:09:03,230::misc::83::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = "  Couldn't find device with uuid hegDlo-Q0sQ-bmf3-E293-bIJF-0fj3-85jMDP.\n  Couldn't find device with uuid FkqpOc-6XDn-2igg-nA2n-110Q-LHlU-TwqiL9.\n  Couldn't find device with uuid VUSiHy-oTWE-ORNh-HkxU-TDu6-GBNk-pgwZBo.\n  Couldn't find device with uuid lGs4mM-wZix-pYYn-8uTr-i3As-Ocnm-PM1Aia.\n  Couldn't find device with uuid 4WtfsQ-Kpxm-uylt-MQ2R-hJjb-KEUU-SI1v66.\n  The volume group is missing 5 physical volumes.\n"; <rc> = 5

Then the filters are invalidated and the command is run again and succeeds:

Thread-688802::DEBUG::2013-10-23 11:09:03,238::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f|20017380063ea08c5|20017380063ea08c6|20017380063ea08c7|20017380063ea08c8|20017380063ea08c9%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)

But it seems that the vg.partial flag was not corrected, so selftest() raises:

Thread-688802::ERROR::2013-10-23 11:09:13,460::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 8a9259ec-90c7-455a-ba90-9d29584425e4 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/blockSD.py", line 805, in selftest
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('8a9259ec-90c7-455a-ba90-9d29584425e4',)

So it seems that at least a partial solution is to update the VG status after running vgck.
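
A sketch of that partial solution, building on the sketch in comment 7 (illustrative, not the actual patch; invalidate_vg is a hypothetical helper): after vgck succeeds with a fresh filter, drop the stale cached VG entry so the next lookup reloads it without the partial flag.

def chk_vg(lvm_cache, vg_name):
    rc, out, err = lvm_cache.cmd(["vgck", vg_name])
    if rc == 0:
        # vgck saw all PVs with the rebuilt filter, so any cached
        # "partial" state for this VG is stale; invalidate it so the
        # next selftest() sees the refreshed VG attributes.
        lvm_cache.invalidate_vg(vg_name)  # illustrative helper, not in vdsm
    return rc == 0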

Comment 26 Zdenek Kabelac 2013-10-31 21:46:48 UTC
Looking at comment 15 here, and putting it into context with Bug 1020401, I'd suggest applying the same workaround:

Modify /etc/lvm/lvm.conf  - devices { obtain_device_list_from_udev=0 }
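
In file form, the workaround fragment in /etc/lvm/lvm.conf would look like this (merge it into the existing devices section if one is present):

devices {
    obtain_device_list_from_udev = 0
}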

Udev in RHEL6.4/6.5 is unfortunately broken and can't be fixed to work reliably under a heavy workload.

Also, please remove the dependency on 1023206: vgs is not a tool for checking consistency, it's a reporting tool.

Comment 27 Nir Soffer 2013-10-31 23:24:09 UTC
(In reply to Zdenek Kabelac from comment #26)
> Looking at comment 15 here, and putting it into context with Bug 1020401,
> I'd suggest applying the same workaround:
> 
> Modify /etc/lvm/lvm.conf  - devices { obtain_device_list_from_udev=0 }
> 
> Udev in RHEL6.4/6.5 is unfortunately broken and can't be fixed to work
> reliably under a heavy workload.

There is no load here; this is caused by wrong filter caching in vdsm. The PV is not found because our filter is missing the new PV.

Comment 28 Aharon Canan 2013-11-24 16:46:32 UTC
Verified using is24.1.

Verification followed comment #22 and the steps in the description, after consulting with Sergey.

Comment 29 errata-xmlrpc 2014-01-21 16:19:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

