| Summary: | [BLOCKED] Starting a VM with many disks sometimes fails. (blocked on guest LVs visible to host issue - bug 1374545) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.4 | CC: | amureini, bazulay, hannsj_uhl, ishaby, lsurette, michal.skrivanek, nsoffer, rhodain, srevivo, tnisan, ycui, ykaul, ylavi |
| Target Milestone: | ovirt-4.1.0-beta | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-26 14:27:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 1374545 | | |
| Bug Blocks: | | | |
Full logs please?

Idan, as this week's QA contact, please take a look.

oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

- How much is 'many disks'?
- I assume this is a Hitachi storage, but I did not see much more information on it (I did not look at the logs, though) - is it high-end/low-end? Is it fully utilized, or is it not working very hard, etc.?

Reducing to 'medium' severity for the time being, as:
- We did not see it happening in other environments.
- It seems to be a (rare?) race.
- We have improved (in 4.0) the monitoring code as well as other code dealing with LV refreshes. I hope those changes will eliminate such possible scale issues (or reduce them further).

(In reply to Yaniv Kaul from comment #12)
> - How much is 'many disks'?

In this case there are 2 disks from SDs and 25 direct LUNs.

> - I assume this is a Hitachi storage, but I did not see much more information
> on it (I did not look at the logs, though) - is it high-end/low-end? Is it
> fully utilized, or is it not working very hard, etc.?

I am not sure about this, but it is rather high-end, judging by the size of the environment.

> Reducing to 'medium' severity for the time being, as:
> - We did not see it happening in other environments.

Well, it does not happen that often, I suppose.

> - It seems to be a (rare?) race.

Sorry, yes, it is a race condition.

> - We have improved (in 4.0) the monitoring code as well as other code
> dealing with LV refreshes. I hope those changes will eliminate such possible
> scale issues (or reduce them further).

Maybe yes and maybe no; this deserves more investigation.

Basically, if the SD refresh still deactivates unused LVs while disk preparation is taking place, we can hit this issue.

What happens for 5 disks, in short, is:

Prepare disk1 - disk1 is activated
Prepare disk2 - disk2 is activated
Refresh SD - disk1 and disk2 are deactivated
Prepare disk3 - disk3 is activated
Prepare disk4 - disk4 is activated
Prepare disk5 - disk5 is activated

VM start fails, as disk1 and disk2 are deactivated and are not visible to libvirt.

Nir - yet another LVM auto-activation bug?

(In reply to Yaniv Kaul from comment #14)
> Nir - yet another LVM auto-activation bug?

Dup / close enough to bug 1374545.

*** This bug has been marked as a duplicate of bug 1374545 ***

(In reply to Roman Hodain from comment #13)
> Basically, if the SD refresh still deactivates unused LVs while disk
> preparation is taking place, we can hit this issue.
>
> [...]
>
> Refresh SD - disk1 and disk2 are deactivated

We don't deactivate disks when refreshing storage domains. The refresh includes only dropping the LVM cache and reading it again from storage. We deactivate disks only when starting vdsm.

> VM start fails, as disk1 and disk2 are deactivated and are not visible to
> libvirt.

Please reopen this if you can prove that we deactivate disks.
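The interleaving described in comment 13 can be sketched as a minimal Python model (vdsm itself is written in Python). All names here — `prepare_disk`, `refresh_sd`, `active`, `in_use_snapshot` — are illustrative, not vdsm's actual API; the point is only that a refresh acting on a stale "in use" view deactivates LVs that a concurrent VM start has just activated:

```python
# Hypothetical model of the race from comment 13 (illustrative names only).
active = set()           # LVs currently activated on the host
in_use_snapshot = set()  # what the refresh believes is in use (stale: empty)

def prepare_disk(lv):
    """Model of prepare: activate the LV (lvchange --available y)."""
    active.add(lv)

def refresh_sd():
    """Model of the refresh: deactivate every LV the stale snapshot
    says is unused (lvchange --available n)."""
    for lv in list(active):
        if lv not in in_use_snapshot:
            active.discard(lv)

# The interleaving from the bug report:
prepare_disk("disk1")
prepare_disk("disk2")
refresh_sd()             # disk1 and disk2 are deactivated again
prepare_disk("disk3")
prepare_disk("disk4")
prepare_disk("disk5")

print(sorted(active))    # → ['disk3', 'disk4', 'disk5']
```

Only disk3–disk5 survive, so the VM start fails exactly as described: disk1 and disk2 are gone by the time libvirt tries to open them.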
Description of problem:

A VM start sometimes fails with the following ERROR message:

```
libvirtError: Cannot access storage file '/rhev/data-center/5a94dbe6-3fdc-4584-b434-318354a66c6f/0f62ea24-5128-4616-87da-23515c9a3b0c/images/e07ab723-cb0a-43c4-aa14-592807775de4/450459de-ea79-42bd-aa98-296dfc67fed5' (as uid:107, gid:107): No such file or directory
```

Version-Release number of selected component (if applicable):
vdsm 4.17.23.2-1.el7ev

How reproducible:
Sometimes (race condition)

Steps to Reproduce:
1. Start a VM with many disks as many times as needed to hit the issue.

Actual results:
VM fails to start.

Expected results:
VM starts without any issue.

Additional info:

Prepare image is triggered:

```
Thread-80::INFO::2016-04-12 22:53:17,998::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID=u'0f62ea24-5128-4616-87da-23515c9a3b0c', spUUID=u'5a94dbe6-3fdc-4584-b434-318354a66c6f', imgUUID=u'e07ab723-cb0a-43c4-aa14-592807775de4', leafUUID=u'450459de-ea79-42bd-aa98-296dfc67fed5')
```

The related LV is activated:

```
Thread-80::DEBUG::2016-04-12 22:53:18,658::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-19 /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060e801332e400502032e400001002|/dev/mapper/360060e801332e400502032e400001003|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --autobackup n --available y 0f62ea24-5128-4616-87da-23515c9a3b0c/450459de-ea79-42bd-aa98-296dfc67fed5 (cwd None)
```

Then a storage refresh is triggered and deactivates the LV again:

```
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=a14c9141-fb3b-4b8d-b0f1-ffb7159ce853
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=68a4d368-605b-4441-8b51-1b9fc449762f
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=81723af0-2f9f-4c18-a787-b4fb86ca4c61
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=318f3e3c-084c-41d4-abcb-b6c71bcb7f79
storageRefresh::DEBUG::2016-04-12 22:53:25,044::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=59bbc838-44c1-4c98-8fc8-fa7f4efea44a
storageRefresh::DEBUG::2016-04-12 22:53:25,044::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=570748d6-aa90-4302-8e08-198744b17c8e
storageRefresh::INFO::2016-04-12 22:53:25,044::lvm::666::Storage.LVM::(bootstrap) Deactivating lvs: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lvs=['450459de-ea79-42bd-aa98-296dfc67fed5']
storageRefresh::DEBUG::2016-04-12 22:53:25,044::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-19 /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060e801332e400502032e400001002|/dev/mapper/360060e801332e400502032e400001003|'\'', '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --autobackup n --available n 0f62ea24-5128-4616-87da-23515c9a3b0c/450459de-ea79-42bd-aa98-296dfc67fed5 (cwd None)
```

We then get:

```
/rhev/data-center/5a94dbe6-3fdc-4584-b434-318354a66c6f/0f62ea24-5128-4616-87da-23515c9a3b0c/images/e07ab723-cb0a-43c4-aa14-592807775de4/450459de-ea79-42bd-aa98-296dfc67fed5' is a block device
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vmdevices/storage.py", line 213, in blockDev
    self._blockDev = utils.isBlockDevice(self.path)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 116, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
```

and when libvirt starts the VM, it fails with the error:

```
libvirtError: Cannot access storage file '/rhev/data-center/5a94dbe6-3fdc-4584-b434-318354a66c6f/0f62ea24-5128-4616-87da-23515c9a3b0c/images/e07ab723-cb0a-43c4-aa14-592807775de4/450459de-ea79-42bd-aa98-296dfc67fed5' (as uid:107, gid:107): No such file or directory
```
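For reference, the check shown in the traceback (vdsm's `utils.isBlockDevice`) is a thin wrapper over the standard library: once the LV is deactivated, its device node is gone, so `os.stat` raises the same "No such file or directory" that libvirt later reports. A minimal sketch, assuming only the stdlib behavior (the path below is illustrative, not the real one from the log):

```python
import os
import stat

def is_block_device(path):
    # Same stdlib check the traceback above shows; raises OSError
    # (errno ENOENT) if the path no longer exists.
    return stat.S_ISBLK(os.stat(path).st_mode)

# Illustrative path: the /rhev/data-center entry is a symlink into
# /dev/mapper, which stops resolving after lvchange --available n.
path = "/rhev/data-center/example-pool/example-domain/images/example-img/example-vol"

try:
    is_block_device(path)
except OSError as e:
    print("libvirt fails the same way:", e.strerror)  # No such file or directory
```

This is why the race surfaces only at VM start: the path is checked (and opened by libvirt) after the refresh has already torn the node down.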