Bug 1326828 - [BLOCKED] Starting a VM with many disks sometimes fails. (blocked on guest LVs visible to host issue - bug 1374545)
Status: CLOSED DUPLICATE of bug 1374545
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.0-beta
Assignee: Nir Soffer
QA Contact: Raz Tamir
Depends On: 1374545
 
Reported: 2016-04-13 13:33 UTC by Roman Hodain
Modified: 2019-04-28 10:08 UTC

Doc Type: Bug Fix
Last Closed: 2016-12-26 14:27:46 UTC
oVirt Team: Storage



Description Roman Hodain 2016-04-13 13:33:49 UTC
Description of problem:
Starting a VM sometimes fails with the following ERROR message:

     libvirtError: Cannot access storage file '/rhev/data-center/5a94dbe6-3fdc-4584-b434-318354a66c6f/0f62ea24-5128-4616-87da-23515c9a3b0c/images/e07ab723-cb0a-43c4-aa14-592807775de4/450459de-ea79-42bd-aa98-296dfc67fed5' (as uid:107, gid:107): No such file or directory

Version-Release number of selected component (if applicable):
    vdsm 4.17.23.2-1.el7ev

How reproducible:
Sometimes (race condition)

Steps to Reproduce:
1. Start a VM with many disks repeatedly until the issue is hit.

Actual results:
     VM fails to start

Expected results:
     VM starts without any issue

Additional info:

Prepare image is triggered:

Thread-80::INFO::2016-04-12 22:53:17,998::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID=u'0f62ea24-5128-4616-87da-23515c9a3b0c', spUUID=u'5a94dbe6-3fdc-4584-b434-318354a66c6f', imgUUID=u'e07ab723-cb0a-43c4-aa14-592807775de4', leafUUID=u'450459de-ea79-42bd-aa98-296dfc67fed5')

related LV is activated:

Thread-80::DEBUG::2016-04-12 22:53:18,658::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-19 /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060e801332e400502032e400001002|/dev/mapper/360060e801332e400502032e400001003|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --available y 0f62ea24-5128-4616-87da-23515c9a3b0c/450459de-ea79-42bd-aa98-296dfc67fed5 (cwd None)

Then a storage refresh is triggered, which deactivates the LV again:

storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=a14c9141-fb3b-4b8d-b0f1-ffb7159ce853
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=68a4d368-605b-4441-8b51-1b9fc449762f
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=81723af0-2f9f-4c18-a787-b4fb86ca4c61
storageRefresh::DEBUG::2016-04-12 22:53:25,043::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=318f3e3c-084c-41d4-abcb-b6c71bcb7f79
storageRefresh::DEBUG::2016-04-12 22:53:25,044::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=59bbc838-44c1-4c98-8fc8-fa7f4efea44a
storageRefresh::DEBUG::2016-04-12 22:53:25,044::lvm::661::Storage.LVM::(bootstrap) Skipping open lv: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lv=570748d6-aa90-4302-8e08-198744b17c8e
storageRefresh::INFO::2016-04-12 22:53:25,044::lvm::666::Storage.LVM::(bootstrap) Deactivating lvs: vg=0f62ea24-5128-4616-87da-23515c9a3b0c lvs=['450459de-ea79-42bd-aa98-296dfc67fed5']
storageRefresh::DEBUG::2016-04-12 22:53:25,044::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-19 /usr/bin/sudo -n /usr/sbin/lvm lvchange --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/360060e801332e400502032e400001002|/dev/mapper/360060e801332e400502032e400001003|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --autobackup n --available n 0f62ea24-5128-4616-87da-23515c9a3b0c/450459de-ea79-42bd-aa98-296dfc67fed5 (cwd None)

We then get:
/rhev/data-center/5a94dbe6-3fdc-4584-b434-318354a66c6f/0f62ea24-5128-4616-87da-23515c9a3b0c/images/e07ab723-cb0a-43c4-aa14-592807775de4/450459de-ea79-42bd-aa98-296dfc67fed5' is a block device
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vmdevices/storage.py", line 213, in blockDev
    self._blockDev = utils.isBlockDevice(self.path)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 116, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)

and when libvirt starts the VM, it fails with the error:
libvirtError: Cannot access storage file '/rhev/data-center/5a94dbe6-3fdc-4584-b434-318354a66c6f/0f62ea24-5128-4616-87da-23515c9a3b0c/images/e07ab723-cb0a-43c4-aa14-592807775de4/450459de-ea79-42bd-aa98-296dfc67fed5' (as uid:107, gid:107): No such file or directory
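The truncated traceback above ends at the os.stat() call in vdsm's isBlockDevice helper. A minimal sketch of the same check, showing how a deactivated LV produces exactly this failure (the path below is illustrative, not the real one):

```python
import os
import stat

def is_block_device(path):
    # Mirrors the stat.S_ISBLK(os.stat(path).st_mode) check from the
    # traceback above; os.stat() raises OSError ("No such file or
    # directory") if the device node disappeared, e.g. because the LV
    # was deactivated between prepareImage and the VM start.
    return stat.S_ISBLK(os.stat(path).st_mode)

# A deactivated LV leaves a dangling /rhev/... symlink behind, so the
# stat fails the same way libvirt's access check does:
try:
    is_block_device("/rhev/data-center/illustrative/images/vol")
except OSError as e:
    print(e.strerror)  # prints: No such file or directory
```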

Comment 2 Allon Mureinik 2016-04-14 14:08:30 UTC
Full logs please?

Comment 4 Allon Mureinik 2016-04-20 12:45:27 UTC
Idan, as this week's QA contact, please take a look.

Comment 8 Yaniv Lavi 2016-05-09 11:01:03 UTC
oVirt 4.0 Alpha has been released, moving to oVirt 4.0 Beta target.

Comment 12 Yaniv Kaul 2016-05-29 13:39:16 UTC
- How many is 'many disks'?
- I assume this is Hitachi storage, but I did not see much more information on it (I did not look at the logs, though) - is it high-end or low-end? Is it fully utilized, or is it not working very hard, etc.?

Reducing to 'medium' severity for the time being, as:
- We did not see it happening in other environments.
- It seems to be a (rare?) race.
- We have improved (in 4.0) the monitoring code as well as other code dealing with LV refreshes. I hope those changes will eliminate such possible scale issues (or reduce them further).

Comment 13 Roman Hodain 2016-06-15 07:43:36 UTC
(In reply to Yaniv Kaul from comment #12)
> - How much is 'many disks' ?

   In this case there are 2 disks from SDs and 25 direct LUNs.

> - I assume this is a Hitachi storage, but did not see a lot more information
> on it (did not look at the logs though) - is it high-end/low-end? Is it
> fully utilized or is not working very hard, etc.?

   I am not sure about this, but it is rather high end due to the size of the environment.

> 
> Reducing the 'medium' severity for the time being, as:
> - We did not see it happening in other environments.

    Well, it does not happen that often, I suppose.

> - It seems to be a (rare?) race.

    Sorry, yes it is a race condition.

> - We have improved (in 4.0) the monitoring code as well as other code
> dealing with LV refreshes. I hope those changes will eliminate such possible
> scale issue (or reduce them furthermore).

    Well, maybe yes and maybe no. This deserves more investigation.

Basically, if the SD refresh still deactivates unused LVs while disk preparation is in progress, we can hit this issue.

In short, what happens with 5 disks is:

Prepare disk1
disk1 is activated
Prepare disk2
disk2 is activated
Refresh SD - disk1 and disk2 are deactivated.
Prepare disk3
disk3 is activated
Prepare disk4
disk4 is activated
Prepare disk5
disk5 is activated

VM start fails because disk1 and disk2 are deactivated and not visible to libvirt.
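The interleaving above can be sketched deterministically (a toy model of the reported race, not vdsm code; all names are illustrative):

```python
# Simulate the interleaving from the timeline above: a storage-domain
# refresh that deactivates LVs runs between per-disk prepare calls.
active = set()

def prepare(disk):
    active.add(disk)          # corresponds to: lvchange --available y

def refresh_sd():
    # Bug scenario: deactivation of "unused" LVs runs during refresh
    # and tears down LVs that were just prepared but not yet opened.
    active.clear()            # corresponds to: lvchange --available n

prepare("disk1")
prepare("disk2")
refresh_sd()                  # disk1 and disk2 deactivated here
prepare("disk3")
prepare("disk4")
prepare("disk5")

missing = {"disk1", "disk2", "disk3", "disk4", "disk5"} - active
print(sorted(missing))        # prints: ['disk1', 'disk2'] -> VM start fails
```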

Comment 14 Yaniv Kaul 2016-11-21 10:59:50 UTC
Nir - yet another LVM auto-activation bug?

Comment 15 Yaniv Kaul 2016-11-21 11:40:15 UTC
(In reply to Yaniv Kaul from comment #14)
> Nir - yet another LVM auto-activation bug?

Dup / close-enough to bug 1374545

Comment 16 Raz Tamir 2016-12-26 14:27:46 UTC

*** This bug has been marked as a duplicate of bug 1374545 ***

Comment 17 Nir Soffer 2017-03-06 22:12:58 UTC
(In reply to Roman Hodain from comment #13)
> (In reply to Yaniv Kaul from comment #12)
> > - How much is 'many disks' ?
> 
>    In this case there are 2 disks from SDs and 25 direct luns.
> 
> > - I assume this is a Hitachi storage, but did not see a lot more information
> > on it (did not look at the logs though) - is it high-end/low-end? Is it
> > fully utilized or is not working very hard, etc.?
> 
>    I am not sure about this, but it is rather high end due to the size of
> the environment.
> 
> > 
> > Reducing the 'medium' severity for the time being, as:
> > - We did not see it happening in other environments.
> 
>     Well it does not happened that often I suppose.
> 
> > - It seems to be a (rare?) race.
> 
>     Sorry, yes it is a race condition.
> 
> > - We have improved (in 4.0) the monitoring code as well as other code
> > dealing with LV refreshes. I hope those changes will eliminate such possible
> > scale issue (or reduce them furthermore).
> 
>     Well may be yes and may be no. This would deserve more investigation.
> 
> Basically if we run the SD refresh and we still deactivate non used LVs
> during the process we can hit the time when the disk preparation takes place
> and we hit this issue.
> 
> What happens for 5 disks in short is:
> 
> Prepare disk1
> disk1 is activated
> Prepare disk2
> disk2 is activated
> Refresh SD - disk1 and disk2 are deactivated.

We don't deactivate disks when refreshing storage domains. The refresh
includes only dropping the LVM cache and reading it again from storage.

We deactivate disks only when starting vdsm.

> Prepare disk3
> disk3 is activated
> Prepare disk4
> disk4 is activated
> Prepare disk5
> disk5 is activated
> 
> Vm start fails as disk1 and disk2 are deactivated and are not visible for
> libvirt.

Please reopen this if you can prove that we deactivate disks.
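The "Skipping open lv" lines in the bootstrap log above suggest a filter of roughly this shape (a hedged sketch, assuming open LVs are detected via their open count; this is not the actual vdsm implementation, and the names and predicate are hypothetical):

```python
def bootstrap_deactivate(vg_lvs, is_open):
    """Deactivate only LVs that are not open.

    vg_lvs: iterable of LV names; is_open: predicate, e.g. backed by
    the LV open count reported by 'lvs'. Open LVs are skipped, matching
    the 'Skipping open lv' log lines; the race in this bug is an LV
    that has been prepared but is not yet open when the deactivation
    list is computed, so it is wrongly considered unused.
    """
    to_deactivate = [lv for lv in vg_lvs if not is_open(lv)]
    for lv in to_deactivate:
        # In vdsm this would run: lvchange --available n vg/lv
        print("Deactivating lv:", lv)
    return to_deactivate

# Example: two LVs are open (in use by running VMs), one was prepared
# a moment ago but not opened yet -- only the latter gets deactivated.
opened = {"a14c9141", "68a4d368"}
result = bootstrap_deactivate(
    ["a14c9141", "68a4d368", "450459de"],
    lambda lv: lv in opened,
)
```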

