Bug 1020401

Summary: Creating and modifying logical volumes concurrently fails with Volume group not found error
Product: Red Hat Enterprise Linux 6
Reporter: Nir Soffer <nsoffer>
Component: lvm2
Assignee: Peter Rajnoha <prajnoha>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: medium
Version: 6.5
CC: abaron, agk, amureini, cmarthal, dwysocha, ewarszaw, fsimonce, heinzm, jbrassow, kay, msnitzer, nperic, nsoffer, prajnoha, prockai, scohen, thornber, tlavigne, zkabelac
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: lvm2-2.02.100-7.el6
Doc Type: Bug Fix
Doc Text:
This bug was a result of recent changes in the version of lvm2 planned for 6.5 (the WATCH udev rule enabled for LVM devices), and it has been fixed for 6.5. No doc text required.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 23:29:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1024449    
Attachments:
  test script and logs (flags: none)
  test script and logs (flags: none)
  test script and logs (flags: none)
  output of pvs -o all (flags: none)
  output of vgs -o all (flags: none)

Description Nir Soffer 2013-10-17 15:21:32 UTC
Created attachment 813392 [details]
test script and logs

Description of problem:

Creating and modifying logical volumes concurrently fails randomly with a "Volume ... not found" error.

Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 6.5 Beta (Santiago)
lvm2-2.02.100-5.el6.x86_6

How reproducible:

Random. The chance of failure increases with concurrency.

Steps to Reproduce:

1. Start many threads or processes running this flow:
   1. Create a new logical volume with a unique name
   2. Deactivate it
   3. List current logical volumes
   4. Remove the logical volume

Note: see flow.log for the detailed command lines used

Note: each process/thread modifies only one logical volume. There is no concurrent modification of a single logical volume.
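
For illustration, the flow above can be sketched in Python. This is a hypothetical skeleton, not the attached test.py; the VG name, LV size and exact lvm options are assumptions:

```python
import subprocess
import threading

def lv_flow_commands(vg, lv):
    """Build the four LVM command lines for one create/deactivate/list/remove cycle."""
    return [
        ["lvcreate", "--name", lv, "--size", "128m", vg],  # 1. create LV with a unique name
        ["lvchange", "-an", "%s/%s" % (vg, lv)],           # 2. deactivate it
        ["lvs", vg],                                       # 3. list current LVs
        ["lvremove", "-f", "%s/%s" % (vg, lv)],            # 4. remove it
    ]

def run_flow(vg, lv):
    for cmd in lv_flow_commands(vg, lv):
        subprocess.check_call(cmd)

if __name__ == "__main__":
    # 30 concurrent workers, each modifying only its own uniquely named LV.
    threads = [threading.Thread(target=run_flow, args=("testvg", "lv%03d" % i))
               for i in range(30)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Each worker touches a distinct LV, matching the note above that there is no concurrent modification of a single logical volume.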

Actual results:

One of the operations fails randomly.

Expected results:

All operations succeed.

Additional info:

Workaround: disable device filtering. I could create 20,000 volumes without an error when filtering was disabled.

Attached files:

failure-lvchange.log   lvchange failure verbose log
failure-lvcreate.log   lvcreate failure verbose log
flow.log               log with detailed lvm commands used
pvdisplay.log          output of pvdisplay on used pvs
pvs.log                available pvs
test.py                test script
vgdisplay.log          output of vgdisplay on used vg
vgs.log                output of vgs

Using the test script:

First, update the vg_name and devices list manually.

To see the flow:

    python test.py -d

To run 30 concurrent jobs 10 times and get verbose log of the failed command:

    python test.py -c30 -i10 -v

To run the same test without filtering:

    python test.py -c30 -i10 -v -n

Comment 2 Nir Soffer 2013-10-17 15:31:55 UTC
Another workaround: enabling logging to a file with a high log level (e.g. 7) avoids the failures. This is the reason I did not provide lvm.log during the test runs.

Comment 3 Nir Soffer 2013-10-18 10:56:55 UTC
Created attachment 813714 [details]
test script and logs

The file failure-lvcreate.log in the first attachment was invalid - the test lvs were not cleaned up after the previous error.

This attachment replaces the first one, adding new failure logs for lvcreate, lvremove and lvs.

Comment 4 Nir Soffer 2013-10-18 14:35:57 UTC
Created attachment 813798 [details]
test script and logs

The new test script uses -vvvv, creating more detailed logs.

Comment 5 Zdenek Kabelac 2013-10-18 17:03:41 UTC
Please attach more info about the 'backend' devices used in your test -

        '/dev/mapper/3600144f09dbd05000000522dde040011',
        '/dev/mapper/3600144f09dbd05000000522dde070013',

Are you able to reproduce your problem on plain
/dev/sdX (or /dev/loopX) devices?


In the trace it's obvious the volume group is not found.

I've seen something similar in the past, where multiple layered devices were in the middle and a read from disk returned some already overwritten content.

Anyway - please try to reproduce on a native disk or loop device.

Comment 6 Zdenek Kabelac 2013-10-18 17:20:46 UTC
Please attach also output of:


vgs -o all

pvs -o all

Comment 8 Nir Soffer 2013-10-18 21:20:10 UTC
(In reply to Zdenek Kabelac from comment #5)
> Please attach more info about what are the 'backend' devices used in your
> test -
> 
>         '/dev/mapper/3600144f09dbd05000000522dde040011',
>         '/dev/mapper/3600144f09dbd05000000522dde070013',

Both are iscsi targets on local network.

> 
> Are you able to reproduce your problem on plain 
> /dev/sdX  (or /dev/loopX)  devices ?

Will try.

Comment 9 Nir Soffer 2013-10-18 21:27:34 UTC
Created attachment 813911 [details]
output of pvs -o all

Comment 10 Nir Soffer 2013-10-18 21:28:12 UTC
Created attachment 813913 [details]
output of vgs -o all

Comment 11 Nir Soffer 2013-10-18 22:21:36 UTC
(In reply to Zdenek Kabelac from comment #5)
> Are you able to reproduce your problem on plain 
> /dev/sdX  (or /dev/loopX)  devices ?

Cannot reproduce this with a vg created from /dev/loop0 and /dev/loop1. Created 15,000 lvs without an error.

Comment 12 Zdenek Kabelac 2013-10-19 19:24:27 UTC
I'd start to suspect iscsi disk synchronization.

This reminds me that okozina@ is also seeing some weird disk errors which should not normally happen - they should be impossible to hit with a correctly working kernel - and he is also using iscsi for his kvm.

AFAIK okozina@ was busy with other tasks and could not investigate this issue in depth - but I guess your reproducer could easily be deployed on his setup to check whether it's the same issue. The symptoms seem very similar, although he later gets into deep trouble with a thin pool as the metadata gets out of sync in a pretty weird way - but if iscsi doesn't maintain data integrity across disk flushes, that would easily explain what he gets.

So this case certainly needs more investigation at the kernel level, and your simple python script seems to be a pretty easy-to-use reproducer.
lvm2 relies on direct disk access and on disk flushes being properly finished when locks are released - if this fails on iscsi, it may easily explain what you see.


Maybe there are some iscsi parameters to change and retest?

Comment 13 Nir Soffer 2013-10-19 20:43:32 UTC
(In reply to Zdenek Kabelac from comment #12)
> lvm2 relies on direct disk access and on disk flushes being properly finished
> when locks are released - if this fails on iscsi, it may easily explain what
> you see.

Can you suggest a minimal test case without lvm that demonstrates the same problem? (e.g. obtain a lock, open a device with O_DIRECT, write a block, release the lock, read from another thread concurrently)?
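
A minimal sketch of such a test might look like the following. This is hypothetical and not from this bug: it uses flock for locking and page-aligned mmap buffers, which O_DIRECT requires; Linux-only:

```python
import fcntl
import mmap
import os

BLOCK = 4096  # block-sized I/O as required by O_DIRECT

def aligned_buffer(data, size=BLOCK):
    """O_DIRECT needs an aligned, block-sized buffer; mmap memory is page-aligned."""
    buf = mmap.mmap(-1, size)
    buf[:len(data)] = data
    return buf

def write_block(path, data):
    # Writer: take an exclusive lock, write one block with O_DIRECT, flush, unlock.
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        os.write(fd, aligned_buffer(data))
        os.fsync(fd)
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)

def read_block(path):
    # Concurrent reader: take a shared lock and read the block back for verification.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH)
        os.readv(fd, [buf])  # read into the aligned buffer
        fcntl.flock(fd, fcntl.LOCK_UN)
        return bytes(buf)
    finally:
        os.close(fd)
```

Running writers and readers from separate threads against the same device path and comparing the data read with the data last written would mimic the lock/flush pattern described above.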

Comment 14 Ayal Baron 2013-10-19 20:51:12 UTC
Marking as Regression and reflagging for 6.5 since this does not happen on 6.4 (Nir tried to reproduce on both platforms).

Comment 17 Zdenek Kabelac 2013-10-20 20:56:01 UTC
There are 2 easier tests to try first -

As comment 14 suggests - rhel6.4 with its kernel and lvm2 is ok - so please try 2 more tests:

From the setup which works (6.4), take its kernel and try it instead of the 6.5 kernel on your 6.5 machine.

Another simple test would be to upgrade 'just' the lvm2 package on the 6.4 machine.

Hopefully both cases work properly?

These tests would at least make it clear whether it's a kernel issue rather than the user-space lvm2 tool (kernel bisect).

Then I guess we may try to think about some simpler test case to trigger the bug; otherwise we will need a bisect (lvm2 bisect).

Comment 18 Zdenek Kabelac 2013-10-21 10:35:08 UTC
Also please attach info about kernel version in use.

We have tried to test with: rhel6-a 2.6.32-424.el6.x86_64 (latest 6.5 kernel)
without hitting this  test.py  issue.

Comment 19 Nir Soffer 2013-10-21 18:56:50 UTC
(In reply to Zdenek Kabelac from comment #18)

Reproduced with:
- RHEL 6.5
  - kernel 2.6.32-358, 2.6.32-422 and 2.6.32-424
  - lvm 2.02.100 and 2.0.104-git.

Not reproducible with:
- RHEL 6.4
  - kernel 2.6.32-538 and 2.6.32-424
  - lvm 2.02.98.
- RHEL 6.5
  - kernel 2.6.32-422
  - lvm 2.02.98.

lvm obtains the device list from udev. It seems that udev sometimes returns an incomplete list, which causes lvm to think that the vg is missing.

A possible workaround is to disable the obtain_device_list_from_udev option in lvm.conf. When it is disabled, the issue cannot be reproduced even when creating 800 lvs.
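
For reference, this workaround corresponds to the following setting in /etc/lvm/lvm.conf (the option lives in the devices section):

```
devices {
    # Scan /dev directly instead of obtaining the device list from udev.
    obtain_device_list_from_udev = 0
}
```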

Comment 20 Zdenek Kabelac 2013-10-21 19:54:53 UTC
So I've played a bit more in depth with modified version of lvm2.

It seems that at certain points, when udev is queried for the list of devices, one or the other of the 2 PVs in the example is not present - so the VG is updated with the PV marked as missing. To repair such a VG, 'vgextend --restoremissing' is needed.

The main difference between RHEL 6.4 and RHEL 6.5 which triggers the failure is the addition of the WATCH rule in 13-dm-disk.rules.

When this watch rule is commented out, it 'magically' starts to work.
Moving this to prajnoha - there could be a number of theories about what could be wrong here.

ATM the easiest workaround is probably to avoid using the udev scan for stacked devices, as mentioned in comment 19.

Comment 21 Peter Rajnoha 2013-10-23 10:08:54 UTC
Based on the debug info we have: if there's a device missing while iterating over the list we get from the udev db/libudev, and this "missing" state is caused by parallel CHANGE event processing (the WATCH rule in this exact case), then it looks like libudev does not provide any form of transaction when reading and updating the udev db at the same time. This ends up with a transient outage of the information. If this is the case, then it's definitely a udev bug - the udev db must not lose information transiently if the device is not actually removed.

But I don't see into the udev/libudev code at this level of detail, so CC-ing Kay here to add a few comments on this. Kay - how is the udev db updated? Is it possible that udev db info goes missing transiently - I mean, can it get lost from the list libudev provides just because it's being processed/updated at that moment?

Also, maybe worth mentioning that the udev/libudev used in RHEL6 is relatively old; it may be that this problem is already resolved in recent versions of udev and that we only need a backport... But that's something the udev team should comment on here.

Comment 22 Kay Sievers 2013-10-23 13:07:39 UTC
Udev device enumeration is not based on database content; it just iterates
over /sys/{bus,class}/ (depending on the set filters). The list created
by enumerate is basically nothing else than:
  ls -l /sys/class/block

There is no smartness inside udev about any running events, or transactions
which should interfere with enumerate; it's all entirely "dumb on purpose".

I have no idea how an existing device could be missed here.
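
Kay's point - that enumeration is just a directory listing, with no transactional awareness - can be illustrated with a small sketch (hypothetical; the real libudev enumeration is C code):

```python
import os

def enumerate_block_devices(sysfs_root="/sys/class/block"):
    """A 'dumb' enumeration, equivalent in spirit to `ls /sys/class/block`:
    no locking, no transactions, no awareness of in-flight udev events."""
    try:
        return sorted(os.listdir(sysfs_root))
    except FileNotFoundError:
        return []
```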

Comment 23 Peter Rajnoha 2013-10-23 13:10:12 UTC
This is the part of the code that gets the list of block devices from udev:
  https://git.fedorahosted.org/cgit/lvm2.git/tree/lib/device/dev-cache.c#n494

Comment 24 Kay Sievers 2013-10-23 13:25:27 UTC
Oh, you probably mean the "device open", not the enumerator. The device
is probably returned by enumerate, but when opening it, it will not carry
any non-kernel information.

This is expected behaviour in RHEL6, and it was only changed quite some time
later. A running change event in RHEL6 might delete and re-create the
database, and if asked in that time window, the device will not carry any
information.

Comment 25 Peter Rajnoha 2013-10-23 13:27:40 UTC
Hmmm, then we can't rely on this information and we need to advise using devices/obtain_device_list_from_udev=0 (at least for such a workload as described in this bz).

I guess the change is invasive and not quite backportable, right?

Comment 27 Nir Soffer 2013-10-23 13:50:33 UTC
(In reply to Peter Rajnoha from comment #25)
> I guess the change is invasive and not quite backportable, right?

Since the problem does not exist on 6.4, we don't have any problem if obtain_device_list_from_udev does not exist in old lvm versions.

We are not happy to disable the udev cache, as it may have a performance impact when working with a large number of lvs (thousands).

However, there is another direction that Zdenek describes in comment 20: there is a new watch rule in the udev dm-disk rules, added in 6.5. According to Zdenek, this rule is required for lvmetad, which is disabled by default and planned for RHEL 7.

When this watch rule is disabled (e.g. #OPTIONS+="watch"), we cannot reproduce this issue. We did not compare the performance vs. disabling the udev cache, but this is closer to the setup we had in 6.4.

This is the added rule:

# Add inotify watch to track changes on this device.
# Using the watch rule is not optimal - it generates a lot of spurious
# and useless events whenever the device opened for read-write is closed.
# The best would be to generate the event directly in the tool changing
# relevant information so only relevant events will be processed
# (like creating a filesystem, changing filesystem label etc.).
#
# But let's use this until we have something better...
LABEL="dm_watch"
OPTIONS+="watch"
LABEL="dm_end"
 
Is it possible that lvm will revert this new rule until lvmetad is enabled by default, or add this rule only when lvmetad is running?

Comment 28 Zdenek Kabelac 2013-10-23 13:53:43 UTC
Could you please quantify the performance impact?

Removal of the watch event is less preferable - since it sometimes makes the content of the udev db irrelevant in terms of e.g. the filesystem on the device.

Comment 29 Kay Sievers 2013-10-23 13:59:32 UTC
(In reply to Peter Rajnoha from comment #25)
> I guess the change is invasive and not quite backportable, right?

It's a pretty invasive change; we need to calculate the difference of
the old device to the new, carry that state through the event processing,
and apply only the resulting delta to the filesystem, in an atomic manner.

I wouldn't really feel too comfortable to introduce such large changes in
a released product.

Comment 30 Peter Rajnoha 2013-10-23 14:03:37 UTC
(In reply to Kay Sievers from comment #29)
> (In reply to Peter Rajnoha from comment #25)
> > I guess the change is invasive and not quite backportable, right?
> 
> It's a pretty invasive change; we need to calculate the difference of
> the old device to the new, carry that state through the event processing,
> and apply only the resulting delta to the filesystem, in an atomic manner.
> 
> I wouldn't really feel too comfortable to introduce such large changes in
> a released product.

Yeah, and it could introduce nasty regressions, I can imagine. So let's just keep the udev as it is...

Comment 32 Nenad Peric 2013-10-23 14:10:52 UTC
Adding QA ack as SanityOnly for now, but we would really prefer to have a nice (read: sort of reliable) reproducer in order to make sure that the issue is not present after the patch or in future builds.

Comment 33 Peter Rajnoha 2013-10-23 15:27:48 UTC
devices/obtain_device_list_from_udev=0 used as default in 6.5 now.

Comment 36 Nir Soffer 2013-10-29 16:46:48 UTC
We have not yet checked how this change affects vdsm performance. Please see bug 1024449 for more info.

Comment 37 errata-xmlrpc 2013-11-21 23:29:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1704.html