Bug 464877
Summary: Avoid scanning devices to find LV/VG in LVM commands

Product: Red Hat Enterprise Linux 6
Component: lvm2
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Takahiro Yasui <tyasui>
Assignee: Petr Rockai <prockai>
QA Contact: Corey Marthaler <cmarthal>
CC: agk, borgan, coughlan, djansa, dwysocha, heinzm, iannis, iheim, jbrassow, ltroan, lwang, masaki.kimura.kz, mbroz, msnitzer, noboru.obata.ar, prajnoha, prockai, saguchi, ssaha, takahiro.yasui.mp
Target Milestone: beta
Target Release: 6.3
Keywords: FutureFeature, TechPreview
Fixed In Version: lvm2-2.02.95-1.el6
Doc Type: Technology Preview
Doc Text:
    Title: Dynamic aggregation of LVM metadata via lvmetad
    Most LVM commands require an accurate view of the LVM metadata stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O operations in systems that have a large number of disks.
    The purpose of the lvmetad daemon is to eliminate the need for this scanning by dynamically aggregating metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, LVM performs a scan as it normally would.
    This feature is provided as a Technology Preview and is disabled by default in Red Hat Enterprise Linux 6.3. To enable it, refer to the use_lvmetad parameter in the /etc/lvm/lvm.conf file, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
Cloned To: 857530 (view as bug list)
Last Closed: 2012-06-20 14:50:52 UTC
Bug Depends On: 816724, 1173739
Bug Blocks: 464724, 697866, 705085, 718103, 756082, 857530
Attachments: Summary of IOs for pvscan (attachment 320583); systemtap script to record IOs completed on devices (attachment 321648)
Description (Takahiro Yasui, 2008-09-30 22:53:11 UTC)
I created this bugzilla to make the issues reported by bug #464724 public.

I have been looking into this (using upstream code as of 10/13/08 and iSCSI with wire traces). Even on systems without broken disks, there are definitely multiple I/Os issued to the same area of the disk in a very short time. For instance, with 4 disks, both 'pvscan' and 'pvs' issue IO to the same area of one disk 3 times (LBA 0, length 8 and LBA 8, length 8). This is with a system with no VGs, just PVs that have been initialized. I added a single VG across the 4 PVs, and it got even worse:
1) 'pvscan': 7 IOs to LUN 0, LBA 8, len 8
2) 'vgscan': 10 IOs to the same area
3) 'pvs': 30 IOs
4) 'vgs': 9 IOs
I understand the reason for some of this from a code organizational standpoint, but it is not acceptable behavior at the I/O level.

Here is some more detailed analysis of pvscan. In all, there are 12 IOs issued to a single device (NOTE: the 7 IOs to offset 4096, len 8 correspond to LBA 8 / len 8 as stated earlier). The duplicate IOs are the result of a couple of things:
1) different subsystems within LVM checking for different things, but on the same area of the disk; you can see this below, for example, with device filtering on partitions and md reading the same area of the disk, as well as the label reading
2) the IO subsystem within LVM does aligned reads, which normally results in a nice 4K IO alignment, regardless of the initial read offset.

Brief summary of all IOs (breakpoint at the _io() routine, in the while loop):
io1.txt:  offset == 0,        length == 4096; device filtering; filter out any device with a partition table
io2.txt:  offset == 20905984, length == 4096; device filtering; filter out md device
io3.txt:  offset == 20963328, length == 4096; device filtering; filter out md device
io4.txt:  offset == 0,        length == 4096; device filtering; filter out md device
io5.txt:  offset == 4096,     length == 4096; device filtering; filter out md device
io6.txt:  offset == 0,        length == 4096; label reading
io7.txt:  offset == 4096,     length == 4096; label reading / reading vgname from mda header
io8.txt:  offset == 4096,     length == 4096; label reading / reading vgname from mda header
io9.txt:  offset == 4096,     length == 4096; label reading / reading vgname from mda header
io10.txt: offset == 4096,     length == 4096; vg reading / parsing
io11.txt: offset == 4096,     length == 4096; vg reading / parsing
io12.txt: offset == 4096,     length == 4096; vg reading / parsing
I will attach all the text files that show the backtraces of all these IOs. (A rough way to observe these duplicate reads with standard tools is sketched after the pvscan output below.)

Created attachment 320583 [details]
Summary of IOs for pvscan
Output from pvscan was as follows:
PV /dev/xvda2 VG VolGroup00 lvm2 [5.75 GB / 0 free]
PV /dev/sda VG vgtest lvm2 [16.00 MB / 16.00 MB free]
PV /dev/sdb VG vgtest lvm2 [16.00 MB / 16.00 MB free]
PV /dev/sdc VG vgtest lvm2 [16.00 MB / 16.00 MB free]
PV /dev/sdd VG vgtest lvm2 [16.00 MB / 16.00 MB free]
Total: 5 [5.81 GB] / in use: 5 [5.81 GB] / in no VG: 0 [0 ]
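As a rough sketch (not part of the original analysis or the attached traces), the duplicate device opens described above can also be counted with strace; note that only the open() lines carry the device path, so this counts opens per device rather than reads per offset:

  # trace the device accesses pvscan makes; illustrative invocation
  strace -tt -e trace=open,read,lseek pvscan 2> pvscan.strace
  grep '/dev/sd' pvscan.strace    # which devices were opened, and how many times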
One other key thing I should have mentioned: LVM opens devices with O_DIRECT, so the page cache is bypassed and we actually get the duplicate IOs to the storage. I have been researching various checkins, but so far it is not clear why LVM needs O_DIRECT in all cases, especially if we are just reading the disks and not updating them. Mikulas and mbroz have pointed out that O_DIRECT is needed with:
1) clustered lvm
2) suspended devices (non-direct IO could cause a deadlock)

I did a very quick hack to disable direct IO (see below) and confirmed that it does cut down on the duplicate IOs but does not eliminate them. We might be able to safely disable O_DIRECT for some commands if at runtime we address the above (and any other) issues. Mikulas pointed out that this does not really solve the problem for broken disks, though. I wondered if we could add an option to do an initial read to a device and then dynamically add a filter if the read failed. I think Milan was working on the broken device issue for the lvmcache work. I am working on adding a device parameter to pvscan which would scan only the specified device and should eventually address the first issue (issue I/Os only to target disks).

@@ -447,7 +453,7 @@ int dev_open_quiet(struct device *dev)
 
 	flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 
-	return dev_open_flags(dev, flags, 1, 1);
+	return dev_open_flags(dev, flags, 0, 1);
 }
 
 int dev_open(struct device *dev)
@@ -456,7 +462,7 @@ int dev_open(struct device *dev)
 
 	flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 
-	return dev_open_flags(dev, flags, 1, 0);
+	return dev_open_flags(dev, flags, 0, 0);
 }
 
 int dev_test_excl(struct device *dev)
@@ -467,7 +473,7 @@ int dev_test_excl(struct device *dev)
 	flags = vg_write_lock_held() ? O_RDWR : O_RDONLY;
 	flags |= O_EXCL;
 
-	r = dev_open_flags(dev, flags, 1, 1);
+	r = dev_open_flags(dev, flags, 0, 1);
 
 	if (r)
 		dev_close_immediate(dev);

One last thing. The second issue ("avoid reading the same label many times and minimize the number of I/Os") involves the way LVM is structured into subsystems; I am not sure there is an easy fix to consolidate the IOs. We may be better off deferring it until we have the next generation of storage scanning infrastructure.

Created attachment 321648 [details]
Very simple systemtap script to record/print IOs completed on various devices (hook bio_endio)
Useful systemtap script to capture IOs completed on devices. I used this with 'script' to capture the IOs that completed on various devices while running various LVM scanning and reporting commands. The results matched the iSCSI traces I took for pvscan, pvs, etc.
It might be useful to put this, or something like it, into the nightly tests to measure the IO cost of various commands.
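The attached script itself is not reproduced here; as a sketch, a typical invocation of such a script while exercising a command might look like this (the file name is hypothetical):

  stap -v record_bio_endio.stp > pvscan-io.log &   # compile and load the probe module
  sleep 15                                         # give stap time to start probing
  pvscan                                           # run the command under test
  kill $!                                          # stop the probe; completed IOs are in pvscan-io.log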
> Expected results:
> - Issue I/Os only to target disks
> - Avoid reading the same label many times and minimize the number of I/Os
> - Refrain from issuing I/Os to broken disks, once disk failures are detected
Could you tell me the current status of this problem? I tested LVM commands
again on 5.3. Thanks to Milan, I found that the second point, avoiding reading
the same label many times, has improved compared to 5.2, but these problems
still remain. The third point in particular is very important. These are the
test results for the third point.
* Environment
- LVM structure
vg00: /dev/sdc (*no response*), /dev/sdd
vg01: /dev/sde (*no response*), /dev/sdf
- Timeout
/sys/block/sd[c-f]/device/timeout: 3
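(The exact method used to set this timeout is not stated; as an illustration, a setting like the one above could be applied along these lines:)

  # shorten the SCSI command timeout on the test disks; requires root
  for d in sdc sdd sde sdf; do
      echo 3 > /sys/block/$d/device/timeout
  done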
* vgscan results
# vgscan
Reading all physical volumes. This may take a while...
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sdc: open failed: No such device or address
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: open failed: No such device or address
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
Couldn't find device with uuid 'jihpS9-FDxW-61f1-Y6N8-c60p-ryKk-OyG9n7'.
Found volume group "vg01" using metadata type lvm2
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error
/dev/sde: read failed after 0 of 4096 at 0: Input/output error
Couldn't find device with uuid '0zRZNX-IZcO-bGsZ-7nMr-ieDd-Njqr-h9iI9J'.
Found volume group "vg00" using metadata type lvm2
- The broken disks, /dev/sdc and /dev/sde, are accessed several times, and
the vgscan command takes more than 10 minutes.
- The vgscan command scans disks in get_vgids() and detects the disk errors, but
it still accesses those failed disks.
I would appreciate it if you could share the current status and the target date
for a fix.
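(Not discussed in this report, but a common interim workaround is to exclude known-unresponsive disks with the device filter in /etc/lvm/lvm.conf so that scanning commands skip them entirely; the device names below are taken from the test configuration above and are illustrative only:)

  # Add to the devices { } section of /etc/lvm/lvm.conf:
  #     filter = [ "r|^/dev/sdc$|", "r|^/dev/sde$|", "a|.*|" ]
  # Then re-run the scan to confirm the rejected disks are no longer opened:
  vgscan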
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

Some improvements have gone into 6.0 and more will go into 6.1 and beyond.

*** Bug 636001 has been marked as a duplicate of this bug. ***

Ongoing work: more improvements went into 6.1, and more are to come. Moving to 6.2 for a reappraisal of how far we have got and how much more we can do.

We now have a consensus about the design, but the implementation is still in a rather early phase. Since the planned solution is quite invasive, any late-coming bugs could cause significant trouble -- it would be advisable not to rush the implementation. Therefore, I nominate this for inclusion in 6.3 (but not 6.2), which should give us a reasonable timeframe to ensure the implementation is robust.

Peter, thank you for handling this. I understand the current status and I am very pleased to hear that this feature is ongoing. I expect it will be supported in 6.3. Thanks, Taka

Pushed to 6.3.

We are still trying for 6.3, although we did encounter more resistance than planned with lvmetad (which is the planned solution for the problem).

This change is intended to be transparent to the user. Some LVM operations should be faster as a result of this change, but that is the only visible impact. The test plan consists of the standard LVM regression tests. Monitor performance to ensure it stays the same or improves. Stress tests should also be done, involving repeated scanning of large configurations while the system is busy, shutdown/reboot, and add/remove/resize operations while the system is busy.

*** Bug 464724 has been marked as a duplicate of this bug. ***

The lvmetad support is added in 6.3 as a tech preview. It is disabled by default. To enable it, you need to set the global/use_lvmetad lvm.conf setting and enable the lvmetad daemon by running/enabling the lvm2-lvmetad init script.
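For reference, enabling the tech preview as described above amounts to something like the following (a sketch; the setting and init script names are as given in this comment):

  # 1. In the global section of /etc/lvm/lvm.conf, set:
  #        use_lvmetad = 1
  # 2. Enable and start the daemon through its init script:
  chkconfig lvm2-lvmetad on
  service lvm2-lvmetad start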
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Please add this as a Tech. Preview:
Most LVM commands require an accurate view of the LVM metadata, stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O in systems that have a large number of disks. The purpose of lvmetad is to eliminate the need for this scanning by dynamically aggregating the metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, then LVM will perform a scan as it has in the past. This feature is off by default in RHEL 6.3. To enable it, refer to the use_lvmetad parameter in lvm.conf, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1,7 @@
-Please add this as a Tech. Preview:
+Title: Dynamic aggregation of LVM metadata via lvmetad
-Most LVM commands require an accurate view of the LVM metadata, stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O in systems that have a large number of disks. The purpose of lvmetad is to eliminate the need for this scanning by dynamically aggregating the metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, then LVM will perform a scan as it has in the past. This feature is off by default in RHEL 6.3. To enable it, refer to the use_lvmetad parameter in lvm.conf, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.
+Most LVM commands require an accurate view of the LVM metadata stored on the disk devices on the system. With the current LVM design, if this information is not available, LVM must scan all the physical disk devices in the system. This requires a significant amount of I/O operations in systems that have a large number of disks.
+
+The purpose of the lvmetad daemon is to eliminate the need for this scanning by dynamically aggregating metadata information each time the status of a device changes. These events are signaled to lvmetad by udev rules. If lvmetad is not running, LVM performs a scan as it normally would.
+
+This feature is provided as a Technology Preview and is disabled by default in Red Hat Enterprise Linux 6.3. To enable it, refer to the use_lvmetad parameter in the /etc/lvm/lvm.conf file, and enable the lvmetad daemon by configuring the lvm2-lvmetad init script.

Basic regression tests now pass with lvmetad running. Marking this verified (SanityOnly).

2.6.32-269.el6.x86_64
lvm2-2.02.95-8.el6                        BUILT: Wed May 9 03:33:32 CDT 2012
lvm2-libs-2.02.95-8.el6                   BUILT: Wed May 9 03:33:32 CDT 2012
lvm2-cluster-2.02.95-8.el6                BUILT: Wed May 9 03:33:32 CDT 2012
udev-147-2.41.el6                         BUILT: Thu Mar 1 13:01:08 CST 2012
device-mapper-1.02.74-8.el6               BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-libs-1.02.74-8.el6          BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-event-1.02.74-8.el6         BUILT: Wed May 9 03:33:32 CDT 2012
device-mapper-event-libs-1.02.74-8.el6    BUILT: Wed May 9 03:33:32 CDT 2012
cmirror-2.02.95-8.el6                     BUILT: Wed May 9 03:33:32 CDT 2012

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0962.html