Bug 464877
Summary: Avoid scanning devices to find LV/VG in LVM commands

Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Takahiro Yasui <tyasui>
Assignee: Petr Rockai <prockai>
QA Contact: Corey Marthaler <cmarthal>
CC: agk, borgan, coughlan, djansa, dwysocha, heinzm, iannis, iheim, jbrassow, ltroan, lwang, masaki.kimura.kz, mbroz, msnitzer, noboru.obata.ar, prajnoha, prockai, saguchi, ssaha, takahiro.yasui.mp
Target Milestone: beta
Target Release: 6.3
Keywords: FutureFeature, TechPreview
Fixed In Version: lvm2-2.02.95-1.el6
Doc Type: Technology Preview
Doc Text:
    Title: Dynamic aggregation of LVM metadata via lvmetad
    Most LVM commands require an accurate view of the LVM metadata stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O operations in systems that have a large number of disks.
    The purpose of the lvmetad daemon is to eliminate the need for this scanning by dynamically aggregating metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, LVM performs a scan as it normally would.
    This feature is provided as a Technology Preview and is disabled by default in Red Hat Enterprise Linux 6.3. To enable it, refer to the use_lvmetad parameter in the /etc/lvm/lvm.conf file, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
Cloned To: 857530 (view as bug list)
Last Closed: 2012-06-20 14:50:52 UTC
Bug Depends On: 816724, 1173739
Bug Blocks: 464724, 697866, 705085, 718103, 756082, 857530
Attachments: Summary of IOs for pvscan (attachment 320583); systemtap script to record IOs completed on devices (attachment 321648)
Description (Takahiro Yasui, 2008-09-30 22:53:11 UTC)
I created this bugzilla to make the issues reported by bug #464724 public.

I have been looking into this (using upstream code as of 10/13/08 and iSCSI with wire traces). Even on systems without broken disks, there are definitely multiple I/Os issued to the same area of the disk in a very short time. For instance, with 4 disks, both 'pvscan' and 'pvs' issue IO to the same area of one disk 3 times (LBA 0, length 8 and LBA 8, length 8). This is with a system with no VGs, just PVs that have been initialized. I added a single VG across the 4 PVs, and it got even worse:
1) 'pvscan': 7 IOs to LUN 0, LBA 8, len 8
2) 'vgscan': 10 IOs to the same area
3) 'pvs': 30 IOs
4) 'vgs': 9 IOs
I understand the reason for some of this from a code organizational standpoint, but it is not acceptable behavior at the I/O level.

Here is some more detailed analysis of pvscan. In all, there are 12 IOs issued to a single device (NOTE: the 7 IOs to offset 4096, len 8 correspond to LBA 8 / len 8 as stated earlier). The duplicate IOs are the result of a couple of things:
1) different subsystems within LVM checking for different things, but on the same area of the disk; you can see this below, for example, with device filtering on partitions and md reading the same area of the disk, as well as the label reading
2) the IO subsystem within LVM does aligned reads, which normally results in a nice 4K IO alignment, regardless of the initial read offset.

Brief summary of all IOs (breakpoint at the _io() routine, in the while loop):
io1.txt:  offset == 0,        length == 4096; device filtering; filter out any device with a partition table
io2.txt:  offset == 20905984, length == 4096; device filtering; filter out md device
io3.txt:  offset == 20963328, length == 4096; device filtering; filter out md device
io4.txt:  offset == 0,        length == 4096; device filtering; filter out md device
io5.txt:  offset == 4096,     length == 4096; device filtering; filter out md device
io6.txt:  offset == 0,        length == 4096; label reading
io7.txt:  offset == 4096,     length == 4096; label reading / reading vgname from mda header
io8.txt:  offset == 4096,     length == 4096; label reading / reading vgname from mda header
io9.txt:  offset == 4096,     length == 4096; label reading / reading vgname from mda header
io10.txt: offset == 4096,     length == 4096; vg reading / parsing
io11.txt: offset == 4096,     length == 4096; vg reading / parsing
io12.txt: offset == 4096,     length == 4096; vg reading / parsing
I will attach all the text files that show the backtraces of all these IOs. (A rough way to observe these duplicate reads with standard tools is sketched after the pvscan output below.)

Created attachment 320583 [details]
Summary of IOs for pvscan
Output from pvscan was as follows:
PV /dev/xvda2 VG VolGroup00 lvm2 [5.75 GB / 0 free]
PV /dev/sda VG vgtest lvm2 [16.00 MB / 16.00 MB free]
PV /dev/sdb VG vgtest lvm2 [16.00 MB / 16.00 MB free]
PV /dev/sdc VG vgtest lvm2 [16.00 MB / 16.00 MB free]
PV /dev/sdd VG vgtest lvm2 [16.00 MB / 16.00 MB free]
Total: 5 [5.81 GB] / in use: 5 [5.81 GB] / in no VG: 0 [0 ]
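As a rough sketch (not part of the original analysis or the attached traces), the duplicate device opens described above can also be counted with strace; note that only the open() lines carry the device path, so this counts opens per device rather than reads per offset:

  # trace the device accesses pvscan makes; illustrative invocation
  strace -tt -e trace=open,read,lseek pvscan 2> pvscan.strace
  grep '/dev/sd' pvscan.strace    # which devices were opened, and how many times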
One other key thing I should have mentioned: LVM opens devices with O_DIRECT, so the page cache is bypassed and we actually get the duplicate IOs to the storage. I have been researching various checkins, but so far it is not clear why LVM needs O_DIRECT in all cases, especially if we are just reading the disks and not updating them. Mikulas and mbroz have pointed out that O_DIRECT is needed with:
1) clustered lvm
2) suspended devices (non-direct IO could cause a deadlock)

I did a very quick hack to disable direct IO (see below) and confirmed that it does cut down on the duplicate IOs but does not eliminate them. We might be able to safely disable O_DIRECT for some commands if at runtime we address the above (and any other) issues. Mikulas pointed out that this does not really solve the problem for broken disks, though. I wondered if we could add an option to do an initial read to a device and then dynamically add a filter if the read failed. I think Milan was working on the broken device issue for the lvmcache work. I am working on adding a device parameter to pvscan which would scan only the specified device and should eventually address the first issue (issue I/Os only to target disks).

@@ -447,7 +453,7 @@ int dev_open_quiet(struct device *dev)
 
 	flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 
-	return dev_open_flags(dev, flags, 1, 1);
+	return dev_open_flags(dev, flags, 0, 1);
 }
 
 int dev_open(struct device *dev)
@@ -456,7 +462,7 @@ int dev_open(struct device *dev)
 
 	flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 
-	return dev_open_flags(dev, flags, 1, 0);
+	return dev_open_flags(dev, flags, 0, 0);
 }
 
 int dev_test_excl(struct device *dev)
@@ -467,7 +473,7 @@ int dev_test_excl(struct device *dev)
 	flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 	flags |= O_EXCL;
 
-	r = dev_open_flags(dev, flags, 1, 1);
+	r = dev_open_flags(dev, flags, 0, 1);
 
 	if (r)
 		dev_close_immediate(dev);

One last thing. The second issue ("avoid reading the same label many times and minimize the number of I/Os") involves the way LVM is structured into subsystems; I am not sure there is an easy fix to consolidate the IOs. We may be better off deferring it until we have the next generation of storage scanning infrastructure.

Created attachment 321648 [details]
Very simple systemtap script to record/print IOs completed on various devices (hook bio_endio)
Useful systemtap script to capture IOs completed on devices. I used this with 'script' to capture the IOs that completed on various devices while running various LVM scanning and reporting commands. The results matched the iSCSI traces I took for pvscan, pvs, etc.
It might be useful to put this, or something like it, into the nightly tests to measure the IO cost of various commands.
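The attached script itself is not reproduced here; as a sketch, a typical invocation of such a script while exercising a command might look like this (the file name is hypothetical):

  stap -v record_bio_endio.stp > pvscan-io.log &   # compile and load the probe module
  sleep 15                                         # give stap time to start probing
  pvscan                                           # run the command under test
  kill $!                                          # stop the probe; completed IOs are in pvscan-io.log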
> Expected results:
> - Issue I/Os only to target disks
> - Avoid reading the same label many times and minimize the number of I/Os
> - Refrain from issuing I/Os to broken disks, once disk failures are detected
Could you tell me the current status of this problem? I tested LVM commands
again on 5.3. Thanks to Milan, I found that the second point, avoiding reading
the same label many times, has improved compared to 5.2, but these problems
still remain. The third point in particular is very important. These are the
test results for the third point.
* Environment
- LVM structure
vg00: /dev/sdc (*no response*), /dev/sdd
vg01: /dev/sde (*no response*), /dev/sdf
- Timeout
/sys/block/sd[c-f]/device/timeout: 3
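(The exact method used to set this timeout is not stated; as an illustration, a setting like the one above could be applied along these lines:)

  # shorten the SCSI command timeout on the test disks; requires root
  for d in sdc sdd sde sdf; do
      echo 3 > /sys/block/$d/device/timeout
  done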
* vgscan results
# vgscan
Reading all physical volumes. This may take a while...
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sdc: open failed: No such device or address
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: open failed: No such device or address
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
Couldn't find device with uuid 'jihpS9-FDxW-61f1-Y6N8-c60p-ryKk-OyG9n7'.
Found volume group "vg01" using metadata type lvm2
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
Couldn't find device with uuid '0zRZNX-IZcO-bGsZ-7nMr-ieDd-Njqr-h9iI9J'.
Found volume group "vg00" using metadata type lvm2
- The broken disks, /dev/sdc and /dev/sde, are accessed several times, and
the vgscan command takes more than 10 minutes.
- The vgscan command scans disks in get_vgids() and detects the disk errors, but
it still accesses those failed disks.
I would appreciate it if you could share the current status and the target date
for a fix.
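(Not discussed in this report, but a common interim workaround is to exclude known-unresponsive disks with the device filter in /etc/lvm/lvm.conf so that scanning commands skip them entirely; the device names below are taken from the test configuration above and are illustrative only:)

  # Add to the devices { } section of /etc/lvm/lvm.conf:
  #     filter = [ "r|^/dev/sdc$|", "r|^/dev/sde$|", "a|.*|" ]
  # Then re-run the scan to confirm the rejected disks are no longer opened:
  vgscan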
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

Some improvements have gone into 6.0 and more will go into 6.1 and beyond.

*** Bug 636001 has been marked as a duplicate of this bug. ***

Ongoing work: more improvements went into 6.1, and more are to come. Moving to 6.2 for a reappraisal of how far we have got and how much more we can do.

We now have a consensus about the design, but the implementation is still in a rather early phase. Since the planned solution is quite invasive, any late-coming bugs could cause significant trouble -- it would be advisable not to rush the implementation. Therefore, I nominate this for inclusion in 6.3 (but not 6.2), which should give us a reasonable timeframe to ensure the implementation is robust.

Peter, thank you for handling this. I understand the current status and I am very pleased to hear that this feature is ongoing. I expect it will be supported in 6.3. Thanks, Taka

Pushed to 6.3.

We are still trying for 6.3, although we did encounter more resistance than planned with lvmetad (which is the planned solution for the problem).

This change is intended to be transparent to the user. Some LVM operations should be faster as a result of this change, but that is the only visible impact. The test plan consists of the standard LVM regression tests. Monitor performance to ensure it stays the same or improves. Stress tests should also be done, involving repeated scanning of large configurations while the system is busy, shutdown/reboot, and add/remove/resize operations while the system is busy.

*** Bug 464724 has been marked as a duplicate of this bug. ***

The lvmetad support is added in 6.3 as a tech preview. It is disabled by default. To enable it, you need to set the global/use_lvmetad lvm.conf setting and enable the lvmetad daemon by running/enabling the lvm2-lvmetad init script.
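For reference, enabling the tech preview as described above amounts to something like the following (a sketch; the setting and init script names are as given in this comment):

  # 1. In the global section of /etc/lvm/lvm.conf, set:
  #        use_lvmetad = 1
  # 2. Enable and start the daemon through its init script:
  chkconfig lvm2-lvmetad on
  service lvm2-lvmetad start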
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Please add this as a Tech. Preview:
Most LVM commands require an accurate view of the LVM metadata, stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O in systems that have a large number of disks. The purpose of lvmetad is to eliminate the need for this scanning by dynamically aggregating the metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, then LVM will perform a scan as it has in the past. This feature is off by default in RHEL 6.3. To enable it, refer to the use_lvmetad parameter in lvm.conf, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1,7 @@
-Please add this as a Tech. Preview:
+Title: Dynamic aggregation of LVM metadata via lvmetad
-Most LVM commands require an accurate view of the LVM metadata, stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O in systems that have a large number of disks. The purpose of lvmetad is to eliminate the need for this scanning by dynamically aggregating the metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, then LVM will perform a scan as it has in the past. This feature is off by default in RHEL 6.3. To enable it, refer to the use_lvmetad parameter in lvm.conf, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
+Most LVM commands require an accurate view of the LVM metadata stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O operations in systems that have a large number of disks.
+
+The purpose of the lvmetad daemon is to eliminate the need for this scanning by dynamically aggregating metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, LVM performs a scan as it normally would.
+
+This feature is provided as a Technology Preview and is disabled by default in Red Hat Enterprise Linux 6.3. To enable it, refer to the use_lvmetad parameter in the /etc/lvm/lvm.conf file, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.

Basic regression tests now pass with lvmetad running. Marking this verified (SanityOnly).

2.6.32-269.el6.x86_64
lvm2-2.02.95-8.el6                        BUILT: Wed May 9 03:33:32 CDT 2012
lvm2-libs-2.02.95-8.el6                   BUILT: Wed May 9 03:33:32 CDT 2012
lvm2-cluster-2.02.95-8.el6                BUILT: Wed May 9 03:33:32 CDT 2012
udev-147-2.41.el6                         BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-8.el6               BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-libs-1.02.74-8.el6          BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-event-1.02.74-8.el6         BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-event-libs-1.02.74-8.el6    BUILT: Wed May 9 03:33:32 CDT 2012
cmirror-2.02.95-8.el6                     BUILT: Wed May 9 03:33:32 CDT 2012

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html