Description of problem:
After extending the SD, hosts went to non-operational status because of an inaccessible SD.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-23.0.el6ev.x86_64

How reproducible:
Unknown yet

Steps to Reproduce:
1. Map a new LUN
2. Run multipath -r (optional)
3. Extend the SD

Actual results:
Hosts went to N/O state. VMs started to migrate but failed and got stuck in 'Migrating from' status.

Expected results:
Hosts are up

Additional info:
It looks like the vgs command (_reloadvgs) returns 0 even if PVs are missing [1]; as a result, domainThreadMonitor uses the wrong lvmcache (with stale filters). vgck returns 5 if PVs are missing, and the cmd function invalidates the filters and retries with new ones. But chkVG doesn't update the lvmcache.

[1]
Thread-688802::DEBUG::2013-10-23 11:09:01,967::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)
Thread-688802::DEBUG::2013-10-23 11:09:02,639::misc::83::Storage.Misc.excCmd::(<lambda>) SUCCESS: <err> = " Couldn't find device with uuid hegDlo-Q0sQ-bmf3-E293-bIJF-0fj3-85jMDP.\n Couldn't find device with uuid FkqpOc-6XDn-2igg-nA2n-110Q-LHlU-TwqiL9.\n Couldn't find device with uuid VUSiHy-oTWE-ORNh-HkxU-TDu6-GBNk-pgwZBo.\n Couldn't find device with uuid lGs4mM-wZix-pYYn-8uTr-i3As-Ocnm-PM1Aia.\n Couldn't find device with uuid 4WtfsQ-Kpxm-uylt-MQ2R-hJjb-KEUU-SI1v66.\n"; <rc> = 0
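To make the described flow concrete, here is a minimal Python sketch of the invalidate-and-retry logic, assuming illustrative names (run_lvm, LVMCache, build_filter) rather than vdsm's exact internals:

import subprocess

def run_lvm(argv, dev_filter):
    # Run an lvm command with the device filter passed via --config,
    # as in the log excerpt above.
    config = 'devices { filter = [ %s ] }' % dev_filter
    p = subprocess.Popen(['/sbin/lvm'] + argv + ['--config', config],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    return p.returncode, out, err

class LVMCache(object):
    # Illustrative cache: the device filter is built once and only
    # rebuilt when a command fails.
    def __init__(self, build_filter):
        self._build_filter = build_filter
        self._filter = build_filter()

    def cmd(self, argv):
        rc, out, err = run_lvm(argv, self._filter)
        if rc != 0:
            # A failing command (e.g. vgck's rc=5 on missing PVs)
            # invalidates the cached filter and retries with a fresh one.
            self._filter = self._build_filter()
            rc, out, err = run_lvm(argv, self._filter)
        # The gap described above: vgs exits 0 even when PVs are missing,
        # so on that path the stale filter is never rebuilt.
        return rc, out, err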
This seems to be the error in this log. The vgck command failed because of stale filters:

Thread-688802::DEBUG::2013-10-23 11:09:02,647::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)
Thread-688802::DEBUG::2013-10-23 11:09:03,230::misc::83::Storage.Misc.excCmd::(<lambda>) FAILED: <err> = " Couldn't find device with uuid hegDlo-Q0sQ-bmf3-E293-bIJF-0fj3-85jMDP.\n Couldn't find device with uuid FkqpOc-6XDn-2igg-nA2n-110Q-LHlU-TwqiL9.\n Couldn't find device with uuid VUSiHy-oTWE-ORNh-HkxU-TDu6-GBNk-pgwZBo.\n Couldn't find device with uuid lGs4mM-wZix-pYYn-8uTr-i3As-Ocnm-PM1Aia.\n Couldn't find device with uuid 4WtfsQ-Kpxm-uylt-MQ2R-hJjb-KEUU-SI1v66.\n The volume group is missing 5 physical volumes.\n"; <rc> = 5

Then the filters are invalidated and the command is run again and succeeds:

Thread-688802::DEBUG::2013-10-23 11:09:03,238::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgck --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%20017380063ea059a|20017380063ea059b|20017380063ea06a3|20017380063ea06a4|20017380063ea06a5|20017380063ea06a6|20017380063ea06a7|20017380063ea06a8|20017380063ea06b6|20017380063ea06b7|20017380063ea06b8|20017380063ea06b9|20017380063ea06ba|20017380063ea06bb|20017380063ea06bc|20017380063ea06bd|20017380063ea06be|20017380063ea06bf|20017380063ea06c0|20017380063ea06c1|20017380063ea06c2|20017380063ea06c3|20017380063ea06c4|20017380063ea06c5|20017380063ea06c6|20017380063ea06c7|20017380063ea06c8|20017380063ea06c9|20017380063ea06ca|20017380063ea06cb|20017380063ea084b|20017380063ea084c|20017380063ea084d|20017380063ea084e|20017380063ea084f|20017380063ea08c5|20017380063ea08c6|20017380063ea08c7|20017380063ea08c8|20017380063ea08c9%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " 8a9259ec-90c7-455a-ba90-9d29584425e4' (cwd None)

But it seems that the vg.partial flag was not corrected - therefore selftest() raises.
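A sketch of why the stale cache still bites after the successful retry, assuming selftest() consults the cached VG record (the VG type here is illustrative; compare the actual traceback in the next comment):

from collections import namedtuple

# Illustrative VG record as it might sit in the lvmcache.
VG = namedtuple('VG', ['name', 'partial'])

class StorageDomainAccessError(Exception):
    pass

def selftest(vg):
    # vg comes from the lvmcache; because chkVG never refreshed it, the
    # 'partial' flag set during the stale-filter run is still True even
    # though the retried vgck succeeded.
    if vg.partial:
        raise StorageDomainAccessError(vg.name)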
Thread-688802::ERROR::2013-10-23 11:09:13,460::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 8a9259ec-90c7-455a-ba90-9d29584425e4 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/blockSD.py", line 805, in selftest
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: ('8a9259ec-90c7-455a-ba90-9d29584425e4',)

So it seems that at least a partial solution is to update the VG status after running vgck.
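A sketch of that partial solution, reusing LVMCache and StorageDomainAccessError from the sketches above; _invalidatevgs is my illustrative name for the missing refresh step, as the report only says the VG status should be updated after vgck:

def _invalidatevgs(vgName):
    # Stub for the assumed cache-invalidation helper: in vdsm this would
    # drop the VG's entry from the lvmcache so the next getVG() reloads it.
    pass

def chkVG(vgName, cache):
    # cache.cmd() retries vgck with a fresh filter on rc != 0, as seen
    # in the logs above.
    rc, out, err = cache.cmd(['vgck', vgName])
    if rc != 0:
        raise StorageDomainAccessError(vgName)
    # The missing step identified above: refresh the cached VG so a
    # now-healthy VG clears its stale 'partial' flag before selftest().
    _invalidatevgs(vgName)
    return True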
Looking here into comment 15 - and putting it into context with Bug 1020401 - I'd suggest applying the same workaround:

Modify /etc/lvm/lvm.conf:

devices { obtain_device_list_from_udev=0 }

Udev in RHEL6.4/6.5 is unfortunately broken and can't be fixed to work reliably under heavy workload.

Also, please remove the dependency on bug 1023206 - vgs is not a tool for checking consistency; it's a reporting tool.
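For reference, the same setting as it would appear inside the devices section of /etc/lvm/lvm.conf (the explanatory comment is mine, not stock lvm.conf text):

devices {
    # Take the device list from a direct /dev scan instead of udev,
    # working around the udev unreliability described above.
    obtain_device_list_from_udev = 0
}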
(In reply to Zdenek Kabelac from comment #26)
> Looking here into comment 15 - and putting it into context with Bug 1020401
> - I'd suggest applying the same workaround:
>
> Modify /etc/lvm/lvm.conf:
>
> devices { obtain_device_list_from_udev=0 }
>
> Udev in RHEL6.4/6.5 is unfortunately broken and can't be fixed to work
> reliably under heavy workload.

There is no load here - this is caused by wrong filter caching in vdsm. The PV is not found because our filter is missing the new PV.
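A hedged illustration of that root cause: the accept pattern is built from a snapshot of known WWIDs, so a LUN mapped afterwards falls through to the reject-all rule (build_filter is illustrative; compare the two filter strings in the vgck logs above, where the retry adds 20017380063ea08c5 through 20017380063ea08c9):

def build_filter(wwids):
    # Accept only the listed multipath devices and reject everything
    # else, mirroring the filter strings in the log excerpts above.
    return "'a%" + "|".join(sorted(wwids)) + "%', 'r%.*%'"

# Filter built before the new LUN was mapped:
stale = build_filter(['20017380063ea059a', '20017380063ea059b'])
# The new PV 20017380063ea08c5 matches no accept rule of the stale
# filter, so LVM reports "Couldn't find device with uuid ..." until the
# filter is rebuilt from the current device list:
fresh = build_filter(['20017380063ea059a', '20017380063ea059b',
                      '20017380063ea08c5'])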
Verified using is24.1, after consulting with Sergey, following comment #22 and the steps in the description.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html