Bug 1020401
Summary: Creating and modifying logical volumes concurrently fails with "Volume group not found" error

Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.5
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Nir Soffer <nsoffer>
Assignee: Peter Rajnoha <prajnoha>
QA Contact: Cluster QE <mspqa-list>
CC: abaron, agk, amureini, cmarthal, dwysocha, ewarszaw, fsimonce, heinzm, jbrassow, kay, msnitzer, nperic, nsoffer, prajnoha, prockai, scohen, thornber, tlavigne, zkabelac
Target Milestone: rc
Target Release: ---
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Fixed In Version: lvm2-2.02.100-7.el6
Doc Type: Bug Fix
Doc Text: This bug was a result of recent changes in the version of lvm2 planned for 6.5 (WATCH udev rule enabled for LVM devices) and it has been fixed for 6.5. No doc text required.
Last Closed: 2013-11-21 23:29:40 UTC
Type: Bug
Bug Blocks: 1024449
Another workaround: enabling logging to a file with a high log level (e.g. 7) avoids the failures. This is the reason I did not provide lvm.log during the test runs.

Created attachment 813714 [details]
test script and logs
The file failure-lvcreate.log in the first attachment was invalid - the test lvs were not cleaned up after the previous error.
This attachment replaces the first one, adding new failure logs for lvcreate, lvremove and lvs.
Created attachment 813798 [details]
test script and logs
The new test script uses -vvvv, creating more detailed logs.
Please attach more info about the 'backend' devices used in your test:

'/dev/mapper/3600144f09dbd05000000522dde040011',
'/dev/mapper/3600144f09dbd05000000522dde070013'

Are you able to reproduce your problem on plain /dev/sdX (or /dev/loopX) devices? In the trace it's obvious the volume group is not found. I've seen something similar in the past, where multiple layered devices were in the middle and a read from disk returned already overwritten content. Anyway, please try to reproduce on a native disk or loop device. Please also attach the output of:

vgs -o all
pvs -o all

(In reply to Zdenek Kabelac from comment #5)
> Please attach more info about what are the 'backend' devices used in your
> test

Both are iscsi targets on the local network.

> Are you able to reproduce your problem on plain
> /dev/sdX (or /dev/loopX) devices ?

Will try.

Created attachment 813911 [details]
output of pvs -o all
Created attachment 813913 [details]
output of vgs -o all
(In reply to Zdenek Kabelac from comment #5)
> Are you able to reproduce your problem on plain
> /dev/sdX (or /dev/loopX) devices ?

Cannot reproduce this with a vg created from /dev/loop0 and /dev/loop1. Created 15,000 lvs without an error.

I'd start to suspect iscsi disk synchronization. This reminds me that okozina@ is also seeing some weird disk errors which should not normally happen - they should be impossible to hit with a correctly working kernel - and he is also using iscsi for his kvm. AFAIK okozina@ was busy with other tasks to investigate this issue in depth, but I guess your reproducer could be easily deployed on his setup to check whether it's the same issue. The symptoms seem very similar, although he later gets into deep trouble with the thin pool as the metadata gets out of sync in a pretty weird way - but if iscsi doesn't maintain data integrity across disk flushes, that would easily explain what he gets. So this case certainly needs more investigation at the kernel level, and your simple python script seems to be a pretty easy-to-use reproducer.

lvm2 relies on direct disk access and on disk flushes being properly finished when locks are released - if this fails on iscsi, it may easily explain what you see. Maybe there are some iscsi parameters to change and retest?

(In reply to Zdenek Kabelac from comment #12)
> lvm2 relies on direct disk access and on disk flushes being properly
> finished when locks are released - if this fails on iscsi, it may easily
> explain what you see.

Can you suggest a minimal test case without lvm that can demonstrate the same problem? (e.g. obtain lock, open a device with O_DIRECT, write block, release lock, read from another thread concurrently)

Marking as Regression and reflagging for 6.5 since this does not happen on 6.4 (Nir tried to reproduce on both platforms).
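A hypothetical sketch of the minimal test case asked for above (not part of the bug report): one thread writes a block under an exclusive lock and flushes it before releasing the lock, while another thread reads the same block under the same lock and checks that it never sees torn content. A real device test would additionally open with os.O_DIRECT and use 512-byte-aligned buffers; a plain temp file keeps the sketch runnable anywhere.

```python
import os
import fcntl
import tempfile
import threading

BLOCK = 512
ROUNDS = 200

# stand-in for the block device; a real test would use /dev/sdX or /dev/loopX
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\0" * BLOCK)
tmp.close()
path = tmp.name

torn = []  # blocks observed with mixed content

def writer():
    for i in range(ROUNDS):
        fd = os.open(path, os.O_WRONLY)    # real test: os.O_WRONLY | os.O_DIRECT
        fcntl.flock(fd, fcntl.LOCK_EX)     # take the "VG lock"
        try:
            os.write(fd, bytes([i % 256]) * BLOCK)
            os.fsync(fd)                   # flush must finish before lock release
        finally:
            os.close(fd)                   # closing the fd releases the flock

def reader():
    for _ in range(ROUNDS):
        fd = os.open(path, os.O_RDONLY)
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            buf = os.read(fd, BLOCK)
        finally:
            os.close(fd)
        if len(set(buf)) != 1:             # a correct block is uniform
            torn.append(buf[:8])

w = threading.Thread(target=writer)
r = threading.Thread(target=reader)
w.start(); r.start(); w.join(); r.join()
print("torn reads:", len(torn))            # → torn reads: 0
os.unlink(path)
```

On a local filesystem this should always print zero torn reads; the interesting question in this bug is whether the same holds when the backing store is iscsi.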
There are 2 easier tests to try first. As comment 14 suggests, rhel6.4 with its kernel and lvm2 is ok, so please try 2 more tests: from the setup which works (6.4), take its kernel and try it instead of the 6.5 kernel on your 6.5 machine. Another simple test would be to upgrade 'just' the lvm2 package on the 6.4 machine. Hopefully both cases work properly? This would at least make clear whether it's a kernel issue not related to the user space lvm2 tool (kernel bisect). Then I guess we may try to think about some simpler test case to trigger the bug, otherwise we will need a bisect (lvm2 bisect).

Also please attach info about the kernel version in use. We have tried to test with rhel6-a 2.6.32-424.el6.x86_64 (latest 6.5 kernel) without hitting this test.py issue.

(In reply to Zdenek Kabelac from comment #18)

Reproduced with:
- RHEL 6.5 - kernel 2.6.32-358, 2.6.32-422 and 2.6.32-424 - lvm 2.02.100 and 2.0.104-git.

Not reproducible with:
- RHEL 6.4 - kernel 2.6.32-538 and 2.6.32-424 - lvm 2.02.98.
- RHEL 6.5 - kernel 2.6.32-422 - lvm 2.02.98.

lvm is obtaining the device list from udev. It seems that udev sometimes returns an incomplete list, which causes lvm to think that the vg is missing. A possible workaround is to disable the obtain_device_list_from_udev option in lvm.conf. When it is disabled, the issue cannot be reproduced when creating 800 lvs.

So I've played a bit more in depth with a modified version of lvm2. It seems that at certain points, when udev is queried for the list of devices, one or the other of the 2 PVs in the example is not present - so the VG is updated with the PV marked as missing, and to repair such a VG, 'vgextend --restoremissing' is needed. The main difference between Rhel6.4 and Rhel6.5 that triggers the failure is the addition of the WATCH rule in 13-dm-disk.rules. When this watch rule is commented out, it 'magically' starts to work. Moving this to prajnoha - there could be a number of theories about what could be wrong here.
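A sketch of the obtain_device_list_from_udev workaround mentioned above. It is shown against a local sample copy so it is safe to run; on a real host the file to edit is /etc/lvm/lvm.conf, and the surrounding config content here is an illustrative assumption.

```shell
# Create a minimal sample of the relevant lvm.conf section
cat > lvm.conf.sample <<'EOF'
devices {
    # default in this lvm2 version: get the device list from udev
    obtain_device_list_from_udev = 1
}
EOF

# The workaround: flip the option to 0 so lvm scans devices itself
sed -i 's/obtain_device_list_from_udev = 1/obtain_device_list_from_udev = 0/' lvm.conf.sample
grep obtain_device_list_from_udev lvm.conf.sample
```

As noted below, disabling the udev cache may have a performance cost with thousands of lvs, so this trade-off should be measured before deploying it.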
ATM the easiest workaround is probably to avoid using the udev scan for stacked devices, as mentioned in comment 19.

Based on the debug info we have: if there's a missing device while iterating over the list we get from the udev db/libudev, and this "missing" state is caused by parallel CHANGE event processing (the WATCH rule in this exact case), then it looks like libudev does not provide any form of transaction when reading and updating the udev db at the same time, which ends up with a transient outage of the information. If this is the case, then it's definitely a udev bug - the udev db must not lose the information transiently if the device is not actually removed. But I don't see into the udev/libudev code at this detailed level, so CC-ing Kay here to add a few comments.

Kay - how is the udev db updated? Is it possible that the udev db info is missing transiently - I mean, can it get lost from the list libudev provides just because it's being processed/updated at the moment?

Also, maybe worth mentioning: the udev/libudev used in RHEL6 is relatively old. It may be that this problem is already resolved in recent versions of udev and we only need a backport, but that's something the udev team should comment on here.

Udev device enumeration is not based on database content; it just iterates over /sys/{bus,class}/ (depending on the set filters). The list created by enumerate is basically nothing other than:

    ls -l /sys/class/block

There is no smartness inside udev about any running events, or transactions which should interfere with enumerate; it's all entirely "dumb on purpose". I have no idea how an existing device could be missed here.

This is the part of the code that gets the list of block devices from udev: https://git.fedorahosted.org/cgit/lvm2.git/tree/lib/device/dev-cache.c#n494

Oh, you probably mean the "device open", not the enumerator. The device is probably returned by enumerate, but when opening it, it will not carry any non-kernel information.
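A small illustrative sketch of the enumeration behaviour described above: the device list udev hands back amounts to a directory listing of /sys/class/block, so it reflects kernel state only, not udev db state. This is not libudev code, just a stand-in showing the principle; the fallback to an empty list is an assumption for systems without sysfs.

```python
import os

def enumerate_block_devices(sysfs="/sys/class/block"):
    """Roughly what udev enumeration does for the block subsystem:
    list the kernel's sysfs entries, nothing more."""
    try:
        return sorted(os.listdir(sysfs))
    except OSError:
        # no sysfs available (e.g. non-Linux host) - illustrative fallback
        return []

devices = enumerate_block_devices()
print("kernel-visible block devices:", devices)
```

The gap this bug exposes is therefore not in enumeration itself: a device present in this list can still temporarily lack its udev db record while a CHANGE event is being processed.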
This is expected behaviour in RHEL6, and was only changed quite some time later. A running change event in RHEL6 might delete and re-create the database, and if asked in that time window, the device will not carry any information.

Hmmm, then we can't rely on this information and we need to advise using devices/obtain_device_list_from_udev=0 (at least for a workload such as the one described in this bz). I guess the change is invasive and not quite backportable, right?

(In reply to Peter Rajnoha from comment #25)
> I guess the change is invasive and not quite backportable, right?

Since the problem does not exist on 6.4, we don't have any problem if obtain_device_list_from_udev does not exist on old lvm versions. We are not happy to disable the udev cache, as it may have a performance impact when working with a large number of lvs (1000's).

However, there is another direction that Zdenek describes in comment 20. There is a new watch rule in the udev dm-disk rules added in 6.5. According to Zdenek, this rule is required for lvmetad, which is disabled by default and planned for RHEL 7. When this watch rule is disabled (e.g. #OPTIONS+=watch), we cannot reproduce this issue. We did not compare the performance vs. disabling the udev cache, but this is closer to the setup we had in 6.4.

This is the added rule:

    # Add inotify watch to track changes on this device.
    # Using the watch rule is not optimal - it generates a lot of spurious
    # and useless events whenever the device opened for read-write is closed.
    # The best would be to generate the event directly in the tool changing
    # relevant information so only relevant events will be processed
    # (like creating a filesystem, changing filesystem label etc.).
    #
    # But let's use this until we have something better...
    LABEL="dm_watch"
    OPTIONS+="watch"
    LABEL="dm_end"

Is it possible that lvm will revert this new rule until lvmetad is enabled by default, or add this configuration only when lvmetad is running?
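A sketch of the other workaround discussed above: commenting out the WATCH rule. It operates on a local sample copy of the rule fragment quoted in the comment so it is safe to run; on a real RHEL 6.5 host the file is the installed 13-dm-disk.rules (the exact path varies by release and is not confirmed here).

```shell
# Sample of the watch-rule fragment from 13-dm-disk.rules
cat > 13-dm-disk.rules.sample <<'EOF'
LABEL="dm_watch"
OPTIONS+="watch"
LABEL="dm_end"
EOF

# Comment out the WATCH rule, as in the "#OPTIONS+=watch" workaround
sed -i 's/^OPTIONS+="watch"/#OPTIONS+="watch"/' 13-dm-disk.rules.sample
grep OPTIONS 13-dm-disk.rules.sample
```

As noted below in comment 28, removing the watch event has its own downside: the udev db can then go stale with respect to on-device content such as filesystem labels.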
Could you please quantify the performance impact? Removal of the watch event is less preferable, since it sometimes makes the content of the udev db irrelevant in terms of e.g. the filesystem on the device.

(In reply to Peter Rajnoha from comment #25)
> I guess the change is invasive and not quite backportable, right?

It's a pretty invasive change: we need to calculate the difference of the old device to the new, carry that state through the event processing, and apply only the resulting delta to the filesystem, in an atomic manner. I wouldn't really feel too comfortable introducing such large changes in a released product.

(In reply to Kay Sievers from comment #29)
> I wouldn't really feel too comfortable to introduce such large changes in
> a released product.

Yeah, and it could introduce nasty regressions, I can imagine. So let's just keep udev as it is...

Adding QA ack as SanityOnly for now, but we would really prefer to have a nice (read: sort of reliable) reproducer in order to make sure that the issue is not present after the patch or in future builds.

devices/obtain_device_list_from_udev=0 is used as the default in 6.5 now. We did not check yet how this change affects vdsm performance. Please see bug 1024449 for more info.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1704.html
Created attachment 813392 [details]
test script and logs

Description of problem:
Creating and modifying logical volumes concurrently fails randomly with a "Volume ... not found" error.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 6.5 Beta (Santiago)
lvm2-2.02.100-5.el6.x86_6

How reproducible:
Random. The chance of failure increases with concurrency.

Steps to Reproduce:
1. Start many threads or processes running this flow:
   1. Create a new logical volume with a unique name
   2. Deactivate it
   3. List current logical volumes
   4. Remove the logical volume

Note: see flow.log for the detailed command lines used.
Note: each process/thread modifies only one logical volume. There is no concurrent modification of a single logical volume.

Actual results:
One of the operations fails randomly.

Expected results:
All operations succeed.

Additional info:
Workaround: disable device filtering. I could create 20,000 volumes without an error when filtering is disabled.

Attached files:
failure-lvchange.log  lvchange failure verbose log
failure-lvcreate.log  lvcreate failure verbose log
flow.log              log with detailed lvm commands used
pvdisplay.log         output of pvdisplay on used pvs
pvs.log               available pvs
test.py               test script
vgdisplay.log         output of vgdisplay on used vg
vgs.log               output of vgs

Using the test script:

First update the vg_name and devices list manually.

To see the flow:
    python test.py -d

To run 30 concurrent jobs 10 times and get a verbose log of the failed command:
    python test.py -c30 -i10 -v

To run the same test without filtering:
    python test.py -c30 -i10 -v -n
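The per-worker flow in the steps above can be sketched roughly as follows. This is not the attached test.py, just an illustrative outline of the command sequence; the VG name, LV size, and argument order are assumptions and should be adapted to the actual setup.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def lv_flow_commands(vg, n):
    """The four lvm commands each worker runs on its own uniquely
    named LV: create, deactivate, list, remove."""
    lv = "test-lv-%d" % n
    return [
        ["lvcreate", "--name", lv, "--size", "128m", vg],  # 1. create
        ["lvchange", "-an", "%s/%s" % (vg, lv)],           # 2. deactivate
        ["lvs", vg],                                       # 3. list
        ["lvremove", "-f", "%s/%s" % (vg, lv)],            # 4. remove
    ]

def flow(vg, n):
    for cmd in lv_flow_commands(vg, n):
        # check_call raises on the sporadic "Volume group not found" failure
        subprocess.check_call(cmd, stdout=subprocess.DEVNULL)

# Only runs the real lvm commands when a VG name and job count are given,
# e.g.: python sketch.py my-vg 30  (requires root and an existing VG)
if __name__ == "__main__" and len(sys.argv) > 2:
    vg, jobs = sys.argv[1], int(sys.argv[2])
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for f in [pool.submit(flow, vg, i) for i in range(jobs)]:
            f.result()  # propagate the first failure, if any
```

With enough concurrent workers, one of the four commands eventually fails on an affected 6.5 system even though no two workers ever touch the same LV.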