Bug 1400528 - [Bug RHV 4.0.4] Intermittent direct lun vm failed to start with error "VolumeError: Bad volume specification".
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.0.4
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ovirt-4.1.1
: ---
Assignee: Nir Soffer
QA Contact: Lilach Zitnitski
URL:
Whiteboard:
Depends On: 1374545
Blocks: 1424819
 
Reported: 2016-12-01 12:37 UTC by Sachin Raje
Modified: 2020-02-14 18:14 UTC
CC List: 19 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
doctext provided in bug 1424819, no reason to have it copy-pasted twice.
Clone Of:
Clones: 1424819
Environment:
Last Closed: 2017-04-25 00:42:35 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2017:0998 | 0 | normal | SHIPPED_LIVE | VDSM bug fix and enhancement update 4.1 GA | 2017-04-18 20:11:39 UTC

Comment 2 Yaniv Kaul 2016-12-01 12:44:39 UTC
Please attach complete logs. 
Unclear if it's 3.6 or 4.0.4 -  please clarify.

Comment 5 Sachin Raje 2016-12-01 12:56:20 UTC
It's RHV 4.0.4. I'll attach the sosreport and other data shortly.

Comment 11 Marina Kalinin 2016-12-07 19:29:36 UTC
Looking into related bugs, mentioned earlier:
Some dirty LUNs are not usable in RHEV
https://bugzilla.redhat.com/show_bug.cgi?id=1253640

[Nimble Storage] multipath unable to add new path
https://bugzilla.redhat.com/show_bug.cgi?id=1309409

Specifically here:
https://bugzilla.redhat.com/show_bug.cgi?id=1253640#c15

I think maybe this is not a vdsm bug at all?

Nir, can you please check and move to platform, if needed?
This is an urgent customer bug.

Comment 12 Yaniv Kaul 2016-12-07 19:33:27 UTC
(In reply to Marina from comment #11)
> Looking into related bugs, mentioned earlier:
> Some dirty LUNs are not usable in RHEV
> https://bugzilla.redhat.com/show_bug.cgi?id=1253640
> 
> [Nimble Storage] multipath unable to add new path
> https://bugzilla.redhat.com/show_bug.cgi?id=1309409
> 
> Specifically here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1253640#c15
> 
> I think maybe this is not a vdsm bug at all?
> 
> Nir, can you please check and move to platform, if needed?

This was comment 3. I don't think it's our issue.

> This is an urgent customer bug.

Comment 13 Nir Soffer 2016-12-07 19:38:33 UTC
I agree with Yaniv, if we don't see the device in the scsi layer, vdsm can do
nothing about it. Maybe we do not rescan scsi devices correctly?

I suggest moving this to platform.

Comment 16 Daniel Erez 2016-12-12 11:02:23 UTC
Hi Zdenek,

Do you think an lvm filter with a 'white-list' is the best solution for this issue as well? (as you've suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1374545#c74)

Thanks!
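
For readers unfamiliar with the idea, a white-list filter in lvm.conf accepts only the listed devices and rejects everything else, so LVM on the host never scans guest LUNs. The following is a minimal sketch of how such a filter line could be generated. `build_lvm_filter` is a hypothetical helper (not part of vdsm or LVM), the device path is only an example, and paths containing regex metacharacters would need escaping first.

```python
def build_lvm_filter(allowed_devices):
    """Return an lvm.conf filter line that accepts only allowed_devices.

    Each allowed device becomes an anchored accept pattern ("a|^...$|"),
    followed by a single reject-everything pattern ("r|.*|").
    """
    accept = ['"a|^%s$|"' % dev for dev in allowed_devices]
    reject = ['"r|.*|"']
    return "filter = [ %s ]" % ", ".join(accept + reject)


if __name__ == "__main__":
    # Example only: /dev/sda2 stands in for the host's own PV.
    print(build_lvm_filter(["/dev/sda2"]))
```

The order matters: LVM applies the first pattern that matches, so the reject-all entry has to come last.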

Comment 17 Yaniv Kaul 2017-01-02 14:28:46 UTC
We've also reached the conclusion that disabling (and stopping) lvmetad is a good idea and we'll implement it in 4.0.7 and 4.1. Can you try that?
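
To check whether a host already has lvmetad disabled in its LVM configuration, one can look for `use_lvmetad = 0` in lvm.conf. The sketch below is a simplified check, not what vdsm does: `lvmetad_disabled` is a hypothetical helper, it ignores config sections, and it treats a missing setting as "not disabled", which is an assumption about the host's default.

```python
import re


def lvmetad_disabled(conf_text):
    """Return True if the last active use_lvmetad setting in conf_text is 0."""
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        match = re.match(r"\s*use_lvmetad\s*=\s*(\d+)", line)
        if match:
            value = int(match.group(1))
    # Assumption: an absent setting means lvmetad may still be in use.
    return value == 0


if __name__ == "__main__":
    with open("/etc/lvm/lvm.conf") as f:
        print("lvmetad disabled:", lvmetad_disabled(f.read()))
```

This only inspects the configuration file; it does not tell you whether the daemon is currently running.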

Comment 18 Nir Soffer 2017-02-14 12:03:52 UTC
Should be fixed in 4.1.1 by disabling lvmetad.

Comment 19 Yaniv Kaul 2017-02-14 12:42:29 UTC
(In reply to Nir Soffer from comment #18)
> Should be fixed in 4.1.1 by disabling lvmetad.

Current target milestone (for this bug) is 4.0.7. If it's intended to be fixed there, need clone, backport, etc. Otherwise - set target milestone to 4.1.1.

Comment 22 Nir Soffer 2017-02-19 16:28:07 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to Nir Soffer from comment #18)
> > Should be fixed in 4.1.1 by disabling lvmetad.
> 
> Current target milestone (for this bug) is 4.0.7. If it's intended to be
> fixed there, need clone, backport, etc.

This fix is also available in 4.0.7, so we should be good.

I don't think we can verify this, since we don't know how to reproduce the issue; it
is caused by a race between multipath and lvm when a new device is discovered.

Comment 24 Lilach Zitnitski 2017-02-20 08:35:57 UTC
(In reply to Nir Soffer from comment #22)
> (In reply to Yaniv Kaul from comment #19)
> > (In reply to Nir Soffer from comment #18)
> > > Should be fixed in 4.1.1 by disabling lvmetad.
> > 
> > Current target milestone (for this bug) is 4.0.7. If it's intended to be
> > fixed there, need clone, backport, etc.
> 
> This fix is also available in 4.0.7, so we should be good.
> 
> I don't think we can verify this, since we don't know how to reproduce the
> issue; it
> is caused by a race between multipath and lvm when a new device is discovered.

Can I verify this bug using the steps to reproduce from the first comment?
Or are there different steps to make sure this bug is fixed?

Comment 25 Nir Soffer 2017-02-20 18:53:09 UTC
(In reply to Lilach Zitnitski from comment #24)
> > I don't think we can verify this, since we don't know how to reproduce the
> > issue; it
> > is caused by a race between multipath and lvm when a new device is discovered.
> 
> Can I verify this bug using the steps to reproduce from the first comment?
> Or are there different steps to make sure this bug is fixed?

To verify this you need to reproduce it with an older version, and show that
it works with the new version.

Reproducing this should be very hard, since the issue is a race between lvm and
multipath and the chance of hitting this race is very low.

You can try this:

1. Add a new LUN on storage, and expose it to 2 hosts
2. On one host attach the LUN to the vm as direct LUN
3. Inside the guest, create a pv from the lun
4. Inside the guest, create a vg with that pv
5. Inside the guest, create an lv in the new vg
6. Try to migrate the vm to another host
7. If you are lucky, multipath will fail to grab the LUN because LVM will grab
   the LUN before multipath, activating the lv you created inside the guest in 
   step 5

I don't think you will reproduce this, since the LUN will probably be discovered
on the second host before you created the lv and multipath will grab it before
LVM.

If you are lucky and manage to reproduce this, you will have to repeat the entire
setup from scratch using 4.1. Even then, proving that the fix works is very hard,
since the race between LVM and multipath is hard to hit on demand.
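
One way to see which side won the race on the destination host is to look at what holds the LUN's SCSI device in sysfs: device-mapper UUIDs start with `mpath-` for multipath maps and `LVM-` for LVM volumes. The sketch below illustrates that check under stated assumptions; `classify_dm_uuid` and `holder_types` are hypothetical helpers, and the device name `sdb` is only an example.

```python
import os


def classify_dm_uuid(dm_uuid):
    """Classify a device-mapper UUID by its well-known prefix."""
    if dm_uuid.startswith("mpath-"):
        return "multipath"
    if dm_uuid.startswith("LVM-"):
        return "lvm"
    return "other"


def holder_types(device, sysfs="/sys/block"):
    """Map each holder of `device` (e.g. "sdb") to its classification."""
    holders_dir = os.path.join(sysfs, device, "holders")
    result = {}
    for name in sorted(os.listdir(holders_dir)):
        uuid_path = os.path.join(sysfs, name, "dm", "uuid")
        try:
            with open(uuid_path) as f:
                result[name] = classify_dm_uuid(f.read().strip())
        except OSError:
            # Holder is not a device-mapper device, or vanished meanwhile.
            result[name] = "other"
    return result


if __name__ == "__main__":
    # Example only: replace "sdb" with the path device of the direct LUN.
    print(holder_types("sdb"))
```

If a path device of the LUN is held by an "lvm" entry instead of a "multipath" one, LVM activated the guest's lv directly on that path, which is the situation described in step 7.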

Comment 26 Nir Soffer 2017-02-22 09:21:22 UTC
There is doc text in the downstream clone; use it if needed.

Comment 27 Lilach Zitnitski 2017-02-23 08:50:46 UTC
According to Nir's comment (comment #25), and because I didn't manage to reproduce this bug, moving to CLOSED.

Comment 29 Lilach Zitnitski 2017-03-06 12:35:16 UTC
Moving to VERIFIED without reproducing this bug (it can't be reproduced, see comment #25).

