Bug 1400528

Summary: [Bug RHV 4.0.4] Intermittent direct lun vm failed to start with error "VolumeError: Bad volume specification".
Product: Red Hat Enterprise Virtualization Manager Reporter: Sachin Raje <sraje>
Component: vdsm Assignee: Nir Soffer <nsoffer>
Status: CLOSED ERRATA QA Contact: Lilach Zitnitski <lzitnits>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.0.4 CC: amureini, bazulay, gklein, gwatson, lsurette, lzitnits, mkalinin, mwest, nashok, nsoffer, ratamir, sherold, sraje, srevivo, tnisan, ycui, ykaul, ylavi, zkabelac
Target Milestone: ovirt-4.1.1 Keywords: Reopened, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
doctext provided in bug 1424819, no reason to have it copy-pasted twice.
Story Points: ---
Clone Of:
: 1424819 Environment:
Last Closed: 2017-04-25 00:42:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1374545    
Bug Blocks: 1424819    

Comment 2 Yaniv Kaul 2016-12-01 12:44:39 UTC
Please attach complete logs. 
Unclear if it's 3.6 or 4.0.4 -  please clarify.

Comment 5 Sachin Raje 2016-12-01 12:56:20 UTC
It's RHV 4.0.4. I'll attach the sosreport and other data shortly.

Comment 11 Marina Kalinin 2016-12-07 19:29:36 UTC
Looking into related bugs, mentioned earlier:
Some dirty LUNs are not usable in RHEV
https://bugzilla.redhat.com/show_bug.cgi?id=1253640

[Nimble Storage] multipath unable to add new path
https://bugzilla.redhat.com/show_bug.cgi?id=1309409

Specifically here:
https://bugzilla.redhat.com/show_bug.cgi?id=1253640#c15

I think maybe this is not a vdsm bug at all?

Nir, can you please check and move to platform, if needed?
This is an urgent customer bug.

Comment 12 Yaniv Kaul 2016-12-07 19:33:27 UTC
(In reply to Marina from comment #11)
> Looking into related bugs, mentioned earlier:
> Some dirty LUNs are not usable in RHEV
> https://bugzilla.redhat.com/show_bug.cgi?id=1253640
> 
> [Nimble Storage] multipath unable to add new path
> https://bugzilla.redhat.com/show_bug.cgi?id=1309409
> 
> Specifically here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1253640#c15
> 
> I think maybe this is not a vdsm bug at all?
> 
> Nir, can you please check and move to platform, if needed?

This was comment 3. I don't think it's our issue.

> This is an urgent customer bug.

Comment 13 Nir Soffer 2016-12-07 19:38:33 UTC
I agree with Yaniv: if we don't see the device in the SCSI layer, vdsm can do
nothing about it. Maybe we do not rescan SCSI devices correctly?

I suggest moving this to platform.

Comment 16 Daniel Erez 2016-12-12 11:02:23 UTC
Hi Zdenek,

Do you think an lvm filter with a 'white-list' is the best solution for this issue as well? (as you've suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1374545#c74)

Thanks!
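
For readers unfamiliar with the idea, a white-list filter in lvm.conf accepts a fixed set of devices and rejects everything else. The snippet below is only a minimal sketch of how such a `filter` line could be generated. The helper name `build_lvm_filter` and the device path `/dev/sda2` are hypothetical, and this is not the exact filter proposed in bug 1374545.

```python
def build_lvm_filter(allowed_devices):
    """Build an lvm.conf 'filter' line that accepts only the listed devices.

    Assumption: each entry is a plain device path without regex
    metacharacters; anything else would need escaping first.
    """
    # "a|...|" accepts a device matching the pattern, "r|...|" rejects it.
    accept = ['"a|^%s$|"' % dev for dev in allowed_devices]
    # The final catch-all rule rejects every device that was not accepted.
    reject_all = '"r|.*|"'
    return "filter = [ %s ]" % ", ".join(accept + [reject_all])


# Hypothetical example: allow only the host's own PV.
print(build_lvm_filter(["/dev/sda2"]))
```

With a filter of this shape, LVM on the host would ignore a direct LUN even if the guest created a PV on it.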

Comment 17 Yaniv Kaul 2017-01-02 14:28:46 UTC
We've also reached the conclusion that disabling (and stopping) lvmetad is a good idea and we'll implement it in 4.0.7 and 4.1. Can you try that?
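
To check whether lvmetad is still enabled on a host, one option is to look at the `use_lvmetad` setting in lvm.conf. The snippet below is a rough sketch under that assumption, not what vdsm itself does. The function name `lvmetad_enabled` is hypothetical, and the sample text stands in for the contents of /etc/lvm/lvm.conf.

```python
import re


def lvmetad_enabled(lvm_conf_text):
    """Return True or False for the use_lvmetad setting, or None if unset."""
    for line in lvm_conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        stripped = line.split("#", 1)[0].strip()
        match = re.match(r"use_lvmetad\s*=\s*(\d+)$", stripped)
        if match:
            return match.group(1) != "0"
    return None


# Sample text standing in for the real configuration file.
sample = """
global {
    # use_lvmetad = 1
    use_lvmetad = 0
}
"""
print(lvmetad_enabled(sample))
```

On RHEL 7 the daemon itself runs as the lvm2-lvmetad.service and lvm2-lvmetad.socket units, so stopping it also involves those units in addition to the configuration setting.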

Comment 18 Nir Soffer 2017-02-14 12:03:52 UTC
Should be fixed in 4.1.1 by disabling lvmetad.

Comment 19 Yaniv Kaul 2017-02-14 12:42:29 UTC
(In reply to Nir Soffer from comment #18)
> Should be fixed in 4.1.1 by disabling lvmetad.

Current target milestone (for this bug) is 4.0.7. If it's intended to be fixed there, need clone, backport, etc. Otherwise - set target milestone to 4.1.1.

Comment 22 Nir Soffer 2017-02-19 16:28:07 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to Nir Soffer from comment #18)
> > Should be fixed in 4.1.1 by disabling lvmetad.
> 
> Current target milestone (for this bug) is 4.0.7. If it's intended to be
> fixed there, need clone, backport, etc.

This fix is also available in 4.0.7, so we should be good.

I don't think we can verify this, since we don't know how to reproduce the issue.
It is caused by a race between multipath and LVM when a new device is discovered.

Comment 24 Lilach Zitnitski 2017-02-20 08:35:57 UTC
(In reply to Nir Soffer from comment #22)
> (In reply to Yaniv Kaul from comment #19)
> > (In reply to Nir Soffer from comment #18)
> > > Should be fixed in 4.1.1 by disabling lvmetad.
> > 
> > Current target milestone (for this bug) is 4.0.7. If it's intended to be
> > fixed there, need clone, backport, etc.
> 
> This fix is also available in 4.0.7, so we should be good.
> 
> I don't think we can verify this, since we don't know how to reproduce the
> issue. It is caused by a race between multipath and LVM when a new device is
> discovered.

Can I verify this bug using the steps to reproduce from the first comment?
Or are there different steps to make sure this bug is fixed?

Comment 25 Nir Soffer 2017-02-20 18:53:09 UTC
(In reply to Lilach Zitnitski from comment #24)
> > I don't think we can verify this, since we don't know how to reproduce the
> > issue. It is caused by a race between multipath and LVM when a new device is
> > discovered.
> 
> Can I verify this bug using the steps to reproduce from the first comment?
> Or are there different steps to make sure this bug is fixed?

To verify this you need to reproduce it with an older version, and show that
it works with the new version.

Reproducing this should be very hard, since the issue is a race between lvm and
multipath and the chance to get this race is very low.

You can try like this:

1. Add a new LUN on storage, and expose it to 2 hosts
2. On one host attach the LUN to the vm as direct LUN
3. Inside the guest, create a pv from the lun
4. Inside the guest, create a vg with that pv
5. Inside the guest, create a lv on the new vg
6. Try to migrate the vm to another host
7. If you are lucky, multipath will fail to grab the LUN because LVM will grab
   the LUN before multipath, activating the lv you created inside the guest in 
   step 5

I don't think you will reproduce this, since the LUN will probably be discovered
on the second host before you created the lv and multipath will grab it before
LVM.

If you are lucky and could reproduce this, you will have to repeat the entire 
setup from scratch using 4.1. Even if you could reproduce it, proving that
the fix works is very hard since the race between LVM and multipath is hard
to reproduce.
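
Steps 3 to 5 above, run inside the guest, could look roughly like the commands generated below. This is only a sketch: the function name `guest_lvm_commands`, the device `/dev/sdb`, the names `testvg` and `testlv`, and the 1G size are hypothetical placeholders, and the code only prints the commands instead of running them.

```python
def guest_lvm_commands(device, vg_name="testvg", lv_name="testlv", size="1G"):
    """Return the LVM commands for steps 3-5 as argument lists."""
    return [
        ["pvcreate", device],                              # step 3: create the pv
        ["vgcreate", vg_name, device],                     # step 4: create the vg
        ["lvcreate", "-n", lv_name, "-L", size, vg_name],  # step 5: create the lv
    ]


# Dry run: print the commands for a hypothetical direct LUN seen as /dev/sdb.
for cmd in guest_lvm_commands("/dev/sdb"):
    print(" ".join(cmd))
```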

Comment 26 Nir Soffer 2017-02-22 09:21:22 UTC
There is doc text in the downstream clone; use it if needed.

Comment 27 Lilach Zitnitski 2017-02-23 08:50:46 UTC
According to Nir's comment (comment #25), and because I didn't manage to reproduce this bug, moving to CLOSED.

Comment 29 Lilach Zitnitski 2017-03-06 12:35:16 UTC
Moving to VERIFIED without reproducing this bug (it can't be reproduced, see comment #25).