Bug 464877 - Avoid scanning devices to find LV/VG in LVM commands
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.0
Hardware: All Linux
Priority: high Severity: high
Target Milestone: beta
Target Release: 6.3
Assigned To: Petr Rockai
QA Contact: Corey Marthaler
Keywords: FutureFeature, TechPreview
Duplicates: 464724
Depends On: 816724 1173739
Blocks: 705085 464724 697866 718103 756082 857530
Reported: 2008-09-30 18:53 EDT by Takahiro Yasui
Modified: 2014-12-12 15:02 EST

See Also:
Fixed In Version: lvm2-2.02.95-1.el6
Doc Type: Technology Preview
Doc Text:
Title: Dynamic aggregation of LVM metadata via lvmetad

Most LVM commands require an accurate view of the LVM metadata stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O operations in systems that have a large number of disks.

The purpose of the lvmetad daemon is to eliminate the need for this scanning by dynamically aggregating metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, LVM performs a scan as it normally would.

This feature is provided as a Technology Preview and is disabled by default in Red Hat Enterprise Linux 6.3. To enable it, refer to the use_lvmetad parameter in the /etc/lvm/lvm.conf file, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
Story Points: ---
Clone Of:
: 857530 (view as bug list)
Environment:
Last Closed: 2012-06-20 10:50:52 EDT

Attachments
Summary of IOs for pvscan (2.62 KB, application/x-compressed-tar)
2008-10-16 13:24 EDT, Dave Wysochanski
Very simple systemtap script to record/print IOs completed on various devices (hook bio_endio) (1.44 KB, application/octet-stream)
2008-10-27 15:59 EDT, Dave Wysochanski

Description Takahiro Yasui 2008-09-30 18:53:11 EDT
Description of problem:
  The current implementation of the LVM commands issues read I/O to
  every disk in the system many times in order to find the target LV/VG,
  even when those disks are unrelated to the operation. This becomes a
  serious problem once a disk fails. If a disk does not respond, each
  LVM command waits for the I/O to time out even though the target
  devices are healthy, so the command takes a long time to finish.

Version-Release number of selected component (if applicable):
  lvm2-2.02.32-4.el5

How reproducible:
  Just executing lvm commands.

Actual results:
  For example, in the following environment, the vgs and vgscan commands
  access the broken disks (PVs) 200 times. If every access to a broken
  disk has to time out, LVM commands take a long time to finish.

  - LUs: 32 (LU#01 ... LU#32) Broken LUs: LU#01, 03, 05, 07, ... 15
  - PVs: 32 (PV#01 ... PV#32)
  - VGs: 16
      VG#01 (PV#01 and PV#02)
      VG#02 (PV#03 and PV#04)
              ...
      VG#16 (PV#31 and PV#32)

Expected results:
  - Issue I/Os only to target disks
  - Avoid reading the same label many times and minimize the number of I/Os
  - Refrain from issuing I/Os to broken disks, once disk failures are detected
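
A minimal sketch of the first two expectations, for illustration only. It assumes a hypothetical `read_label` callback standing in for the on-disk label read; the class and its names are not taken from the LVM sources. Only the PVs named by the caller are touched, and each label is read at most once.

```python
from typing import Callable, Dict, Iterable, List


class LabelCache:
    """Cache PV labels so each device is read at most once (illustrative only)."""

    def __init__(self, read_label: Callable[[str], str]) -> None:
        self._read_label = read_label      # hypothetical stand-in for a disk read
        self._labels: Dict[str, str] = {}  # device -> cached label
        self.reads: Dict[str, int] = {}    # device -> number of real reads issued

    def label(self, dev: str) -> str:
        # Issue a real read only on the first request for this device.
        if dev not in self._labels:
            self.reads[dev] = self.reads.get(dev, 0) + 1
            self._labels[dev] = self._read_label(dev)
        return self._labels[dev]

    def labels_for(self, devs: Iterable[str]) -> List[str]:
        # Touch only the devices the caller names, never the whole system.
        return [self.label(dev) for dev in devs]


if __name__ == "__main__":
    vg_layout = {"VG#01": ["PV#01", "PV#02"], "VG#02": ["PV#03", "PV#04"]}
    cache = LabelCache(lambda dev: "label-of-" + dev)
    for _ in range(3):
        cache.labels_for(vg_layout["VG#01"])
    print(cache.reads)
```

With this layout, three lookups of VG#01 cost one read each on PV#01 and PV#02, and the PVs of VG#02 are never read.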

Additional info:
  N/A
Comment 1 Takahiro Yasui 2008-09-30 19:06:17 EDT
I created this bugzilla to make the issues reported in bug #464724 public.
Comment 2 Dave Wysochanski 2008-10-16 12:01:12 EDT
I have been looking into this (using upstream code as of 10/13/08 and iSCSI w/wire traces).  Even on systems without broken disks, there are definitely multiple I/Os issued to the same area of the disk in a very short time.

For instance, with 4 disks both 'pvscan' and 'pvs' issue IO to the same area of one disk 3 times (LBA 0, length 8 and LBA 8, length 8).  This is on a system with no VGs, just PVs that have been initialized.  I added a single VG across the 4 PVs, and it got even worse:
1) 'pvscan': 7 IOs to LUN 0, LBA 8, len 8
2) 'vgscan': 10 IOs to same area
3) 'pvs': 30 IOs
4) 'vgs': 9 IOs

I understand the reason for some of this from a code organizational standpoint, but it is not acceptable behavior at the I/O level.
Comment 3 Dave Wysochanski 2008-10-16 13:21:36 EDT
Here's some more detailed analysis of pvscan.  In all, there are 12 IOs issued to a single device (NOTE: the 7 IOs at byte offset 4096, length 4096 are the ones reported earlier as LBA 8, length 8, counted in 512-byte sectors).  The duplicate IOs are the result of a couple of things:
1) different subsystems within LVM checking for different things, but on the same area of the disk; you can see this below, for example, with device filtering on partitions and md reading the same area of the disk, as well as the label reading
2) the IO subsystem within LVM does aligned reads, which normally results in 4K-aligned IO regardless of the initial read offset, so reads of different fields within the same 4K block show up as identical IOs
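
To illustrate the second point, here is a small sketch of how rounding a read out to block boundaries turns different small requests into the same device IO. The function name and the 4096-byte default are assumptions chosen to match the trace; this is not LVM's actual _io() code.

```python
from typing import Tuple


def aligned_window(offset: int, length: int, block: int = 4096) -> Tuple[int, int]:
    """Return (start, size) of the block-aligned read covering [offset, offset + length)."""
    if length <= 0 or block <= 0:
        raise ValueError("length and block must be positive")
    start = (offset // block) * block                        # round start down
    end = ((offset + length + block - 1) // block) * block   # round end up
    return start, end - start


if __name__ == "__main__":
    # Two different small reads inside the second 4K block of the device
    # both become the same IO: offset 4096, length 4096.
    print(aligned_window(4096, 512))
    print(aligned_window(4608, 1024))
```

Both calls return (4096, 4096), which is why unrelated readers of that block cannot be told apart at the wire level.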

Brief summary of all IOs (breakpoint at _io() routine, in the while loop):
io1.txt: offset == 0, length == 4096; device filtering; filter out any device with a partition table
io2.txt: offset == 20905984, length == 4096; device filtering; filter out md device
io3.txt: offset == 20963328, length == 4096; device filtering; filter out md device
io4.txt: offset == 0, length == 4096; device filtering; filter out md device
io5.txt: offset == 4096, length == 4096; device filtering; filter out md device
io6.txt: offset == 0, length == 4096; label reading
io7.txt: offset == 4096, length == 4096; label reading / reading vgname from mda header
io8.txt: offset == 4096, length == 4096; label reading / reading vgname from mda header
io9.txt: offset == 4096, length == 4096; label reading / reading vgname from mda header
io10.txt: offset == 4096, length == 4096; vg reading / parsing
io11.txt: offset == 4096, length == 4096; vg reading / parsing
io12.txt: offset == 4096, length == 4096; vg reading / parsing
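
The list above can be tallied mechanically. The sketch below groups the 12 IOs by location (converted to LBA and sector count, assuming 512-byte sectors) and by purpose; the data is copied from the list, while the helper names are only illustrative.

```python
from collections import Counter

SECTOR = 512  # bytes per sector, assumed for the LBA conversion

# (byte offset, byte length, purpose) for io1.txt .. io12.txt
IOS = [
    (0, 4096, "device filtering"),
    (20905984, 4096, "device filtering"),
    (20963328, 4096, "device filtering"),
    (0, 4096, "device filtering"),
    (4096, 4096, "device filtering"),
    (0, 4096, "label reading"),
    (4096, 4096, "label reading"),
    (4096, 4096, "label reading"),
    (4096, 4096, "label reading"),
    (4096, 4096, "vg reading"),
    (4096, 4096, "vg reading"),
    (4096, 4096, "vg reading"),
]


def by_lba(ios):
    """Count IOs per (LBA, length in sectors)."""
    return Counter((off // SECTOR, length // SECTOR) for off, length, _ in ios)


def by_purpose(ios):
    """Count IOs per LVM subsystem that issued them."""
    return Counter(purpose for _, _, purpose in ios)


if __name__ == "__main__":
    for (lba, sectors), n in sorted(by_lba(IOS).items()):
        print("LBA %d, len %d: %d IOs" % (lba, sectors, n))
    print(dict(by_purpose(IOS)))
```

The tally gives 3 IOs at LBA 0 and 7 IOs at LBA 8, split across 5 device-filtering, 4 label-reading and 3 VG-reading IOs.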

I will attach all the text files that show the backtraces of all these IOs.
Comment 4 Dave Wysochanski 2008-10-16 13:24:30 EDT
Created attachment 320583 [details]
Summary of IOs for pvscan

Output from pvscan was as follows:

  PV /dev/xvda2   VG VolGroup00   lvm2 [5.75 GB / 0    free]
  PV /dev/sda     VG vgtest       lvm2 [16.00 MB / 16.00 MB free]
  PV /dev/sdb     VG vgtest       lvm2 [16.00 MB / 16.00 MB free]
  PV /dev/sdc     VG vgtest       lvm2 [16.00 MB / 16.00 MB free]
  PV /dev/sdd     VG vgtest       lvm2 [16.00 MB / 16.00 MB free]
  Total: 5 [5.81 GB] / in use: 5 [5.81 GB] / in no VG: 0 [0   ]
Comment 5 Dave Wysochanski 2008-10-16 14:25:19 EDT
One other key thing I should have mentioned.  LVM opens devices with O_DIRECT, so the page cache is bypassed and we actually get the duplicate IOs to the storage.

I have been researching various checkins but so far it is not clear why LVM needs O_DIRECT in all cases, especially if we are just reading the disks and not updating them.
Comment 6 Dave Wysochanski 2008-10-16 16:50:53 EDT
Mikulas and mbroz have pointed out that O_DIRECT is needed with:
1) clustered lvm
2) suspended devices (non-direct IO could cause deadlock)

I did a very quick hack to disable direct IO (see below) and confirmed that it does cut down on the duplicate IOs but does not eliminate them.  We might be able to safely disable O_DIRECT for some commands if at runtime we address the above (and any other) issues.

Mikulas pointed out this doesn't really solve the problem though for broken disks.  I wondered if we could add an option to do an initial read to a device and then dynamically add a filter if the read failed.  I think Milan was working on the broken device issue for the lvmcache work.
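
A rough sketch of the "initial read, then dynamically add a filter" idea, for discussion only. It assumes a hypothetical `probe` callback that raises OSError when the read fails; none of these names exist in LVM.

```python
from typing import Callable, Iterable, List, Set


class DynamicFilter:
    """Exclude a device from later passes once a probe read on it fails."""

    def __init__(self, probe: Callable[[str], bytes]) -> None:
        self._probe = probe            # hypothetical: raises OSError on I/O failure
        self.rejected: Set[str] = set()

    def usable(self, devs: Iterable[str]) -> List[str]:
        ok = []
        for dev in devs:
            if dev in self.rejected:
                continue               # known bad: no further I/O, no further timeout
            try:
                self._probe(dev)
            except OSError:
                self.rejected.add(dev)  # filter it out for the rest of the command
                continue
            ok.append(dev)
        return ok
```

In this sketch a failing device costs one timeout per command instead of one per scanning pass. Healthy devices are still probed on every pass, so it addresses only the broken-disk case, not the duplicate IOs.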

I am working on adding a device parameter to pvscan which would just scan the device specified and should eventually address the first issue (Issue I/Os only to target disks).


@@ -447,7 +453,7 @@ int dev_open_quiet(struct device *dev)
 
        flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 
-       return dev_open_flags(dev, flags, 1, 1);
+       return dev_open_flags(dev, flags, 0, 1);
 }
 
 int dev_open(struct device *dev)
@@ -456,7 +462,7 @@ int dev_open(struct device *dev)
 
        flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 
-       return dev_open_flags(dev, flags, 1, 0);
+       return dev_open_flags(dev, flags, 0, 0);
 }
 
 int dev_test_excl(struct device *dev)
@@ -467,7 +473,7 @@ int dev_test_excl(struct device *dev)
        flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
        flags |= O_EXCL;
 
-       r = dev_open_flags(dev, flags, 1, 1);
+       r = dev_open_flags(dev, flags, 0, 1);
        if (r)
                dev_close_immediate(dev);
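
For readers less familiar with the open flags involved, the following sketch models what the hack changes. It assumes that the third argument of dev_open_flags() toggles direct IO, which is inferred from the description above rather than checked against the source; `open_flags` itself is a hypothetical helper.

```python
import os


def open_flags(write_lock_held: bool, direct: bool, exclusive: bool = False) -> int:
    """Build open(2) flags the way the patched callers appear to (illustrative)."""
    flags = os.O_RDWR if write_lock_held else os.O_RDONLY
    if exclusive:
        flags |= os.O_EXCL
    if direct:
        # O_DIRECT bypasses the page cache; it is not defined on every platform.
        flags |= getattr(os, "O_DIRECT", 0)
    return flags


if __name__ == "__main__":
    print(oct(open_flags(write_lock_held=False, direct=True)))   # before the hack
    print(oct(open_flags(write_lock_held=False, direct=False)))  # after the hack
```

Without O_DIRECT, repeated reads of the same block can be served from the page cache, which matches the reduction in duplicate IOs seen with the hack.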
Comment 7 Dave Wysochanski 2008-10-16 16:56:48 EDT
One last thing.  The second issue:
- Avoid reading the same label many times and minimize the number of I/Os

involves the way LVM is structured into subsystems - not sure there's an easy fix to consolidate the IOs.  We may be better off deferring until we have the next generation of storage scanning infrastructure.
Comment 8 Dave Wysochanski 2008-10-27 15:59:09 EDT
Created attachment 321648 [details]
Very simple systemtap script to record/print IOs completed on various devices (hook bio_endio)

Useful systemtap script to capture IOs completed on devices.  I used this with 'script' to capture the IOs that completed on various devices while running various lvm scanning / reporting commands.  Verified results matched the iSCSI traces I took for pvscan, pvs, etc.

Might be useful to put this or something like it into the nightly test to measure IO cost of various commands.
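
One possible shape for such a nightly check, as a sketch only. The observed counts are the per-area figures from comment 2; the budget values and the `over_budget` helper are hypothetical.

```python
from typing import Dict, List


def over_budget(observed: Dict[str, int], budget: Dict[str, int]) -> List[str]:
    """Return the commands whose observed IO count exceeds the allowed budget."""
    # A command with no budget entry is treated as having a budget of zero.
    return sorted(cmd for cmd, n in observed.items() if n > budget.get(cmd, 0))


if __name__ == "__main__":
    # IOs to the same area of one LUN, as measured in comment 2.
    observed = {"pvscan": 7, "vgscan": 10, "pvs": 30, "vgs": 9}
    # Hypothetical thresholds; real ones would come from a known-good baseline.
    budget = {"pvscan": 10, "vgscan": 10, "pvs": 10, "vgs": 10}
    print(over_budget(observed, budget))
```

With these made-up thresholds only 'pvs' is flagged. The test would fail the run when the list is non-empty.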
Comment 9 Takahiro Yasui 2009-01-29 18:53:26 EST
> Expected results:
>   - Issue I/Os only to target disks
>   - Avoid reading the same label many times and minimize the number of I/Os
>   - Refrain from issuing I/Os to broken disks, once disk failures are detected

Could you tell me the current status of this problem? I tested the lvm
commands again on 5.3. Thanks to Milan, the second point (avoid reading the
same label many times) has improved compared to 5.2, but the problems still
remain. The third point is especially important. Here are the test
results for the third point.

* Environment
  - LVM structure
    vg00: /dev/sdc (*no response*), /dev/sdd
    vg01: /dev/sde (*no response*), /dev/sdf

  - Timeout
    /sys/block/sd[c-f]/device/timeout: 3

* vgscan results

# vgscan
  Reading all physical volumes.  This may take a while...
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdc: open failed: No such device or address
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: open failed: No such device or address
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid 'jihpS9-FDxW-61f1-Y6N8-c60p-ryKk-OyG9n7'.
  Found volume group "vg01" using metadata type lvm2
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  Couldn't find device with uuid '0zRZNX-IZcO-bGsZ-7nMr-ieDd-Njqr-h9iI9J'.
  Found volume group "vg00" using metadata type lvm2

  - The broken disks, /dev/sdc and /dev/sde, are accessed several times, and
    the vgscan command takes more than 10 minutes.
  - The vgscan command scans the disks in get_vgids() and detects the disk
    errors, but still accesses the failed disks afterwards.
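
Tallying the error lines in the vgscan output above makes the repeated access explicit. The snippet below is only a counting aid; the error lines are copied from the output, and the regular expression is an assumption about their format based on those lines alone.

```python
import re
from collections import Counter

VGSCAN_ERRORS = """\
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdc: open failed: No such device or address
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: open failed: No such device or address
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
"""

ERROR_LINE = re.compile(r"^\s*(/dev/[^:\s]+): (read|open) failed")


def failures_per_device(text: str) -> Counter:
    """Count (device, operation) failures reported in LVM command output."""
    counts = Counter()
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (dev, op), n in sorted(failures_per_device(VGSCAN_ERRORS).items()):
        print("%s %s failed x%d" % (dev, op, n))
```

Each broken disk shows three failed reads and one failed open in a single vgscan run, that is, four reported accesses after the first failure could already have been remembered.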

I would appreciate it if you could share the current status and the target
date for a fix.
Comment 10 RHEL Product and Program Management 2009-02-05 18:34:00 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 13 Alasdair Kergon 2010-05-17 08:18:17 EDT
Some improvements have gone into 6.0 and more will go in 6.1 and beyond.
Comment 15 Peter Rajnoha 2010-09-21 06:37:58 EDT
*** Bug 636001 has been marked as a duplicate of this bug. ***
Comment 17 Alasdair Kergon 2011-02-09 19:14:12 EST
Ongoing work - more improvements got into 6.1 and more to come.
Moving to 6.2 for reappraisal of how far we've got and how much more we can do.
Comment 19 Petr Rockai 2011-06-02 17:04:04 EDT
We now have a consensus about the design, but the implementation
is still in a rather early phase. Since the planned solution is
quite invasive, any late-coming bugs could cause significant
trouble -- it would be advisable to not rush the
implementation. Therefore, I nominate this for inclusion in
6.3 (but not 6.2), which should give us a reasonable timeframe to
ensure the implementation is robust.
Comment 20 Takahiro Yasui 2011-06-06 04:08:03 EDT
Peter,

Thank you for handling this. I understand the current status and I'm very pleased to hear this feature is ongoing. I expect it would be supported on 6.3.

Thanks,
Taka
Comment 22 Larry Troan 2011-11-10 09:59:53 EST
Pushed to 6.3.
Comment 23 Petr Rockai 2012-01-09 04:33:02 EST
We are still trying for 6.3, although we did encounter more resistance than planned with lvmetad (which is the planned solution for the problem).
Comment 27 Tom Coughlan 2012-01-12 08:33:01 EST
This change is intended to be transparent to the user. Some LVM operations should be faster as a result of this change, but that is the only visible impact. The test plan consists of the standard LVM regression tests. Monitor performance to ensure it stays the same or improves. Stress tests should also be done: repeated scanning of large configurations while the system is busy, shutdown/reboot, and add/remove/resize while the system is busy.
Comment 29 Peter Rajnoha 2012-03-13 10:00:22 EDT
*** Bug 464724 has been marked as a duplicate of this bug. ***
Comment 30 Peter Rajnoha 2012-03-13 10:14:37 EDT
The lvmetad support is added in 6.3 as a tech preview. It's disabled by default. To enable it, you need to set the global/use_lvmetad lvm.conf setting and enable the lvmetad daemon by running/enabling the lvm2-lvmetad init script.
Comment 32 Tom Coughlan 2012-03-28 17:43:23 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Please add this as a Tech. Preview:

Most LVM commands require an accurate view of the LVM metadata, stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O in systems that have a large number of disks. The purpose of lvmetad is to eliminate the need for this scanning by dynamically aggregating the metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, then LVM will perform a scan as it has in the past. This feature is off by default in RHEL 6.3. To enable it, refer to the use_lvmetad parameter in lvm.conf, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
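
The behaviour described in this note can be modelled with a toy cache, shown below as a sketch under stated assumptions. The class, the event method names and the metadata dictionary layout are all hypothetical; this is not lvmetad's real protocol or data model.

```python
from typing import Callable, Dict, List, Optional


class MetadataCache:
    """Toy model of a metadata cache kept current by device events."""

    def __init__(self) -> None:
        self._by_device: Dict[str, dict] = {}

    def device_added(self, dev: str, metadata: dict) -> None:
        # In the real design this would be triggered by a udev rule.
        self._by_device[dev] = metadata

    def device_removed(self, dev: str) -> None:
        self._by_device.pop(dev, None)

    def vg_devices(self, vg: str) -> List[str]:
        return sorted(d for d, md in self._by_device.items() if md.get("vg") == vg)


def find_vg_devices(vg: str,
                    cache: Optional[MetadataCache],
                    full_scan: Callable[[], Dict[str, dict]]) -> List[str]:
    """Answer from the cache when it exists; otherwise scan every device."""
    if cache is not None:
        return cache.vg_devices(vg)      # no device I/O at all
    scanned = full_scan()                # daemon not running: scan as before
    return sorted(d for d, md in scanned.items() if md.get("vg") == vg)
```

The point of the design is the first branch: once events keep the cache current, a command no longer touches unrelated or failed devices to locate a VG.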
Comment 34 Martin Prpic 2012-04-03 08:30:04 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,3 +1,7 @@
-Please add this as a Tech. Preview:
+Title: Dynamic aggregation of LVM metadata via lvmetad
 
-Most LVM commands require an accurate view of the LVM metadata, stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O in systems that have a large number of disks. The purpose of lvmetad is to eliminate the need for this scanning by dynamically aggregating the metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, then LVM will perform a scan as it has in the past. This feature is off by default in RHEL 6.3. To enable it, refer to the use_lvmetad parameter in lvm.conf, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
+Most LVM commands require an accurate view of the LVM metadata stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O operations in systems that have a large number of disks.
+
+The purpose of the lvmetad daemon is to eliminate the need for this scanning by dynamically aggregating metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, LVM performs a scan as it normally would.
+
+This feature is provided as a Technology Preview and is disabled by default in Red Hat Enterprise Linux 6.3. To enable it, refer to the use_lvmetad parameter in the /etc/lvm/lvm.conf file, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
Comment 36 Corey Marthaler 2012-05-11 14:59:43 EDT
Basic regression tests now pass with lvmetad running. Marking this verified (SanityOnly).


2.6.32-269.el6.x86_64
lvm2-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
lvm2-libs-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
lvm2-cluster-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
device-mapper-libs-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
device-mapper-event-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
device-mapper-event-libs-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
cmirror-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
Comment 38 errata-xmlrpc 2012-06-20 10:50:52 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html
