Bug 1022976 - SD is partially accessible after extending.
Summary: SD is partially accessible after extending.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assignee: Sergey Gotliv
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On: 1023206
Blocks: 1025467 3.3snap3
 
Reported: 2013-10-24 11:49 UTC by Pavel Zhukov
Modified: 2018-12-09 17:14 UTC
CC List: 21 users

Fixed In Version: is24
Doc Type: Bug Fix
Doc Text:
LvmCache did not invalidate stale filters, so after adding a new FC or iSCSI LUN to a volume group, hosts could not access the storage domains and became non-operational. Now, all filters are validated after a new device is added and before the storage domain is extended, so hosts can access storage domains which have been extended.
Clone Of:
Clones: 1025467
Environment:
Last Closed: 2014-01-21 16:19:15 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 522173 0 None None None Never
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC
oVirt gerrit 20552 0 None None None Never
oVirt gerrit 21223 0 None None None Never

Description Pavel Zhukov 2013-10-24 11:49:14 UTC
Description of problem:
After extending the SD, hosts went to non-operational status because the SD became inaccessible.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-23.0.el6ev.x86_64

How reproducible:
Not yet known

Steps to Reproduce:
1. Map new LUN
2. Run multipath -r (optional)
3. Extend SD

Actual results:
Hosts went to non-operational state; VMs started to migrate but failed and got stuck in 'Migrating from' status.

Expected results:
Hosts are up 

Additional info:

Comment 7 Pavel Zhukov 2013-10-24 20:37:49 UTC
It looks like the vgs command (_reloadvgs) returns 0 even if PVs are missing [1]; as a result, domainThreadMonitor uses the wrong lvmcache (with stale filters).
vgck returns 5 if PVs are missing, and the cmd function invalidates the filters and retries with new ones. But chkVG doesn't update the lvmcache.

[1] 
Thread-688802::DEBUG::2013-10-23 11:09:01,967::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)
Thread-688802::DEBUG::2013-10-23 11:09:02,639::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = "  Couldn't find device with uuid hegDlo-Q0sQ-bmf3-E293-bIJF-0fj3-85jMDP.\n  Couldn't find device with uuid FkqpOc-6XDn-2igg-nA2n-110Q-LHlU-TwqiL9.\n  Couldn't find device with uuid VUSiHy-oTWE-ORNh-HkxU-TDu6-GBNk-pgwZBo.\n  Couldn't find device with uuid lGs4mM-wZix-pYYn-8uTr-i3As-Ocnm-PM1Aia.\n  Couldn't find device with uuid 4WtfsQ-Kpxm-uylt-MQ2R-hJjb-KEUU-SI1v66.\n"; <rc> = 0
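
As a rough sketch of the flow described above (names are illustrative, not the actual vdsm code): cmd() retries with a rebuilt filter only on a non-zero return code, so vgs (rc 0 despite missing PVs) never triggers the retry, while vgck (rc 5) does:

import subprocess

def scan_multipath_devices():
    # Illustrative stand-in for vdsm's multipath device scan; returns
    # WWIDs like the ones visible in the filter in the log above.
    return ["20017380063ea059a", "20017380063ea059b"]

def run_lvm(args, devices):
    # Build the same kind of --config device filter seen in the log.
    config = 'devices { filter = [ "a%%%s%%", "r%%.*%%" ] }' % "|".join(devices)
    p = subprocess.Popen(["lvm"] + args + ["--config", config],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, out, err

class LvmCache(object):
    def __init__(self):
        self._filter = None  # cached device list; goes stale when a LUN is mapped

    def _get_filter(self):
        if self._filter is None:
            self._filter = scan_multipath_devices()
        return self._filter

    def invalidate_filter(self):
        self._filter = None

    def cmd(self, args):
        rc, out, err = run_lvm(args, self._get_filter())
        if rc != 0:
            # Suspected stale filter: rebuild it and retry once. vgs
            # returns 0 even when PVs are missing, so it never reaches
            # this branch; vgck returns 5 and does, but the VG data
            # cached from the earlier vgs run is not refreshed.
            self.invalidate_filter()
            rc, out, err = run_lvm(args, self._get_filter())
        return rc, out, err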

Comment 10 Nir Soffer 2013-10-27 23:58:17 UTC
This seems to be the error in this log:

This vgck command failed because of stale filters:

Thread-688802::DEBUG::2013-10-23 11:09:02,647::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)
Thread-688802::DEBUG::2013-10-23 11:09:03,230::misc::83::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = "  Couldn't find device with uuid hegDlo-Q0sQ-bmf3-E293-bIJF-0fj3-85jMDP.\n  Couldn't find device with uuid FkqpOc-6XDn-2igg-nA2n-110Q-LHlU-TwqiL9.\n  Couldn't find device with uuid VUSiHy-oTWE-ORNh-HkxU-TDu6-GBNk-pgwZBo.\n  Couldn't find device with uuid lGs4mM-wZix-pYYn-8uTr-i3As-Ocnm-PM1Aia.\n  Couldn't find device with uuid 4WtfsQ-Kpxm-uylt-MQ2R-hJjb-KEUU-SI1v66.\n  The volume group is missing 5 physical volumes.\n"; <rc> = 5

Then the filters are invalidated and the command is run again and succeeds:

Thread-688802::DEBUG::2013-10-23 11:09:03,238::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f|20017380063ea08c5|20017380063ea08c6|20017380063ea08c7|20017380063ea08c8|20017380063ea08c9%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)

But it seems that the vg.partial flag was not corrected, so selftest() raises:

Thread-688802::ERROR::2013-10-23 11:09:13,460::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 8a9259ec-90c7-455a-ba90-9d29584425e4 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/blockSD.py", line 805, in selftest
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('8a9259ec-90c7-455a-ba90-9d29584425e4',)

So it seems that at least a partial solution is to update the VG status after running vgck.
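
A sketch of that partial solution, building on the sketch in comment 7 (illustrative, not the actual patch; invalidate_vg is a hypothetical helper): after vgck succeeds with a fresh filter, drop the stale cached VG entry so the next lookup reloads it without the partial flag.

def chk_vg(lvm_cache, vg_name):
    rc, out, err = lvm_cache.cmd(["vgck", vg_name])
    if rc == 0:
        # vgck saw all PVs with the rebuilt filter, so any cached
        # "partial" state for this VG is stale; invalidate it so the
        # next selftest() sees the refreshed VG attributes.
        lvm_cache.invalidate_vg(vg_name)  # illustrative helper, not in vdsm
    return rc == 0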

Comment 26 Zdenek Kabelac 2013-10-31 21:46:48 UTC
Looking at comment 15 here, and putting it into context with Bug 1020401, I'd suggest applying the same workaround:

Modify /etc/lvm/lvm.conf  - devices { obtain_device_list_from_udev=0 }
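
In file form, the workaround fragment in /etc/lvm/lvm.conf would look like this (merge it into the existing devices section if one is present):

devices {
    obtain_device_list_from_udev = 0
}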

Udev in RHEL6.4/6.5 is unfortunately broken and can't be fixed to work reliably under a heavy workload.

Also, please remove the dependency on 1023206: vgs is not a tool for checking consistency, it's a reporting tool.

Comment 27 Nir Soffer 2013-10-31 23:24:09 UTC
(In reply to Zdenek Kabelac from comment #26)
> Looking at comment 15 here, and putting it into context with Bug 1020401,
> I'd suggest applying the same workaround:
> 
> Modify /etc/lvm/lvm.conf  - devices { obtain_device_list_from_udev=0 }
> 
> Udev in RHEL6.4/6.5 is unfortunately broken and can't be fixed to work
> reliably under a heavy workload.

There is no load here; this is caused by wrong filter caching in vdsm. The PV is not found because our filter is missing the new PV.

Comment 28 Aharon Canan 2013-11-24 16:46:32 UTC
Verified using is24.1.

Verification followed comment #22 and the steps in the description, after consulting with Sergey.

Comment 29 errata-xmlrpc 2014-01-21 16:19:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html

