Bug 636001

Summary: [RFE] LVM operations should not scan all devices
Product: Red Hat Enterprise Linux 6
Reporter: Itamar Heim <iheim>
Component: lvm2
Assignee: Petr Rockai <prockai>
Status: CLOSED ERRATA
QA Contact: Corey Marthaler <cmarthal>
Severity: high
Docs Contact:
Priority: high
Version: 6.1
CC: abaron, agk, coughlan, dwysocha, heinzm, jbrassow, nperic, prajnoha, prockai
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: lvm2-2.02.98-1.el6
Doc Type: Enhancement
Doc Text:
A new optional metadata caching daemon (lvmetad) is available as part of this update of LVM2, along with udev integration for device scanning. Repeated scans of all block devices in the system with each LVM command are avoided if the daemon is enabled (see lvm.conf for details). The original behaviour can be restored at any time by disabling lvmetad in lvm.conf.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 08:09:07 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 655920, 697866, 749672, 756082    

Description Itamar Heim 2010-09-21 09:56:51 UTC
Description of problem:
LVM scans all devices for each operation.
Today we limit this via a filter provided with each command, which prevents us from taking advantage of the LVM cache.

LVM should only scan the devices relevant to the requested operation: if the scope of the operation is limited to a certain VG, it should scan only the PVs/devices known to contain that VG's metadata.
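For illustration, the per-command filtering referred to above is typically done with an lvm.conf override on the command line; the device path pattern here is hypothetical:

lvs --config 'devices { filter = [ "a|^/dev/mapper/36006048.*|", "r|.*|" ] }' VG1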

Comment 6 Ayal Baron 2010-11-30 21:57:25 UTC
Please note that in all our setups only one PV in the VG holds the MDA. After the initial scan is performed, only the devices containing MDAs should be accessed by all subsequent commands (any changes to the list of devices to be scanned should be evident from the VG metadata read from the already known devices).

Also note that since the scan is performed sequentially, in a setup with 500 LUNs it is enough for just a few of them to have high latency to make lvs/pvs/vgs stall for a long time.

Comment 16 Corey Marthaler 2012-05-11 19:40:05 UTC
It appears to make no difference whether or not there's an MDA on the PV; all devices are scanned the second time regardless. Also, I see no devel unit test results proving otherwise. Marking this FailsQA and removing the 6.3 flag. This should be pulled out and moved to rhel6.4.


2.6.32-269.el6.x86_64
lvm2-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
lvm2-libs-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
lvm2-cluster-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
device-mapper-libs-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
device-mapper-event-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
device-mapper-event-libs-1.02.74-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012
cmirror-2.02.95-8.el6    BUILT: Wed May  9 03:33:32 CDT 2012



[root@hayes-01 ~]# pvcreate /dev/etherd/e1.1p1 /dev/etherd/e1.1p2
  Writing physical volume data to disk "/dev/etherd/e1.1p1"
  Physical volume "/dev/etherd/e1.1p1" successfully created
  Writing physical volume data to disk "/dev/etherd/e1.1p2"
  Physical volume "/dev/etherd/e1.1p2" successfully created

[root@hayes-01 ~]# pvcreate --pvmetadatacopies 0 /dev/etherd/e1.1p3 /dev/etherd/e1.1p4
  Writing physical volume data to disk "/dev/etherd/e1.1p3"
  Physical volume "/dev/etherd/e1.1p3" successfully created
  Writing physical volume data to disk "/dev/etherd/e1.1p4"
  Physical volume "/dev/etherd/e1.1p4" successfully created

[root@hayes-01 ~]# pvs -a -o +pv_mda_free,pv_mda_size
  PV                  VG Fmt  Attr PSize   PFree   PMdaFree  PMdaSize 
  /dev/etherd/e1.1p1     lvm2 a--  908.23g 908.23g   509.50k  1020.00k
  /dev/etherd/e1.1p2     lvm2 a--  908.23g 908.23g   509.50k  1020.00k
  /dev/etherd/e1.1p3     lvm2 a--  908.23g 908.23g        0         0 
  /dev/etherd/e1.1p4     lvm2 a--  908.23g 908.23g        0         0 

[root@hayes-01 ~]# vgcreate VG1 /dev/etherd/e1.1p[13]
  Volume group "VG1" successfully created
[root@hayes-01 ~]# vgcreate VG2 /dev/etherd/e1.1p[24]
  Volume group "VG2" successfully created

[root@hayes-01 ~]# vgs -vvvv VG1 > /tmp/vg1.a 2>&1
[root@hayes-01 ~]# vgs -vvvv VG1 > /tmp/vg1.b 2>&1
[root@hayes-01 ~]# diff /tmp/vg1.a /tmp/vg1.b
574c574
< #metadata/vg.c:59         Allocated VG VG1 at 0x1a8bcb0.
---
> #metadata/vg.c:59         Allocated VG VG1 at 0x1f8ccb0.
582c582
< #metadata/vg.c:74         Freeing VG VG1 at 0x1a8fcc0.
---
> #metadata/vg.c:74         Freeing VG VG1 at 0x1f90cc0.
588c588
< #metadata/vg.c:74         Freeing VG VG1 at 0x1a8bcb0.
---
> #metadata/vg.c:74         Freeing VG VG1 at 0x1f8ccb0.


[root@hayes-01 ~]# vgs -vvvv VG2 > /tmp/vg2.a 2>&1
[root@hayes-01 ~]# vgs -vvvv VG2 > /tmp/vg2.b 2>&1
[root@hayes-01 ~]# diff /tmp/vg2.a /tmp/vg2.b
574c574
< #metadata/vg.c:59         Allocated VG VG2 at 0x3289cb0.
---
> #metadata/vg.c:59         Allocated VG VG2 at 0x2ffbcb0.
582c582
< #metadata/vg.c:74         Freeing VG VG2 at 0x328dcc0.
---
> #metadata/vg.c:74         Freeing VG VG2 at 0x2fffcc0.
588c588
< #metadata/vg.c:74         Freeing VG VG2 at 0x3289cb0.
---
> #metadata/vg.c:74         Freeing VG VG2 at 0x2ffbcb0.


# SHOULDN'T THE NON MDA DEVICES NOT BE IN THE 2ND SCAN???

[root@hayes-01 ~]# grep e1.1p3 /tmp/vg1.b 
#device/dev-cache.c:333         /dev/etherd/e1.1p3: Added to device cache
#device/dev-cache.c:330         /dev/block/152:275: Aliased to /dev/etherd/e1.1p3 in device cache
#device/dev-io.c:524         Opened /dev/etherd/e1.1p3 RO O_DIRECT
#device/dev-io.c:271       /dev/etherd/e1.1p3: size is 1904693766 sectors
#device/dev-io.c:577         Closed /dev/etherd/e1.1p3
#device/dev-io.c:271       /dev/etherd/e1.1p3: size is 1904693766 sectors
#device/dev-io.c:524         Opened /dev/etherd/e1.1p3 RO O_DIRECT
#device/dev-io.c:137         /dev/etherd/e1.1p3: block size is 1024 bytes
#device/dev-io.c:577         Closed /dev/etherd/e1.1p3
#filters/filter-composite.c:31         Using /dev/etherd/e1.1p3
#device/dev-io.c:524         Opened /dev/etherd/e1.1p3 RO O_DIRECT
#device/dev-io.c:137         /dev/etherd/e1.1p3: block size is 1024 bytes
#label/label.c:156       /dev/etherd/e1.1p3: lvm2 label detected at sector 1
#cache/lvmcache.c:1337         lvmcache: /dev/etherd/e1.1p3: now in VG #orphans_lvm2 (#orphans_lvm2) with 0 mdas
#device/dev-io.c:577         Closed /dev/etherd/e1.1p3
#label/label.c:266         Using cached label for /dev/etherd/e1.1p3
#cache/lvmcache.c:1337         lvmcache: /dev/etherd/e1.1p3: now in VG VG1 (UP4Q1C0dfn10T6wRU8gEa5iDTu4NB4xW) with 0 mdas
#metadata/pv_manip.c:327         /dev/etherd/e1.1p3 0:      0 232506: NULL(0:0)

# SHOULDN'T THE NON MDA DEVICES NOT BE IN THE 2ND SCAN???

[root@hayes-01 ~]# grep e1.1p4 /tmp/vg2.b 
#device/dev-cache.c:333         /dev/etherd/e1.1p4: Added to device cache
#device/dev-cache.c:330         /dev/block/152:276: Aliased to /dev/etherd/e1.1p4 in device cache
#device/dev-io.c:524         Opened /dev/etherd/e1.1p4 RO O_DIRECT
#device/dev-io.c:271       /dev/etherd/e1.1p4: size is 1904693766 sectors
#device/dev-io.c:577         Closed /dev/etherd/e1.1p4
#device/dev-io.c:271       /dev/etherd/e1.1p4: size is 1904693766 sectors
#device/dev-io.c:524         Opened /dev/etherd/e1.1p4 RO O_DIRECT
#device/dev-io.c:137         /dev/etherd/e1.1p4: block size is 1024 bytes
#device/dev-io.c:577         Closed /dev/etherd/e1.1p4
#filters/filter-composite.c:31         Using /dev/etherd/e1.1p4
#device/dev-io.c:524         Opened /dev/etherd/e1.1p4 RO O_DIRECT
#device/dev-io.c:137         /dev/etherd/e1.1p4: block size is 1024 bytes
#label/label.c:156       /dev/etherd/e1.1p4: lvm2 label detected at sector 1
#cache/lvmcache.c:1337         lvmcache: /dev/etherd/e1.1p4: now in VG #orphans_lvm2 (#orphans_lvm2) with 0 mdas
#device/dev-io.c:577         Closed /dev/etherd/e1.1p4
#label/label.c:266         Using cached label for /dev/etherd/e1.1p4
#cache/lvmcache.c:1337         lvmcache: /dev/etherd/e1.1p4: now in VG VG2 (duQVk7ME9UoItbXvp4VjBNruXcFFscYd) with 0 mdas
#metadata/pv_manip.c:327         /dev/etherd/e1.1p4 0:      0 232506: NULL(0:0)


[root@hayes-01 ~]# pvs -a -o +pv_mda_free,pv_mda_size
  PV                   VG   Fmt  Attr PSize   PFree   PMdaFree  PMdaSize 
  /dev/etherd/e1.1p1   VG1  lvm2 a--  908.23g 908.23g   508.50k  1020.00k
  /dev/etherd/e1.1p10            ---       0       0         0         0 
  /dev/etherd/e1.1p2   VG2  lvm2 a--  908.23g 908.23g   508.50k  1020.00k
  /dev/etherd/e1.1p3   VG1  lvm2 a--  908.23g 908.23g        0         0 
  /dev/etherd/e1.1p4   VG2  lvm2 a--  908.23g 908.23g        0         0

Comment 17 Petr Rockai 2012-05-13 13:41:16 UTC
I think there is a misunderstanding about the bug here, or maybe about the solution. The proposed fix is to use lvmetad, which avoids the scans altogether (and if it doesn't, that is definitely a bug and grounds for FailsQA); this is a strict improvement over what the bug asks for (scanning only some devices).

I don't think there is much interest in optimizing the non-lvmetad code paths for reducing scans, since the preferred future solution is to use lvmetad on big systems. It would be good to know what the original submitter thinks about this.

I am in favour of WONTFIX in case the request is for non-lvmetad setups to behave this way. With lvmetad, I think the bug is already fixed. Opinions?
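
For reference, a minimal sketch of what enabling lvmetad looks like here; use_lvmetad is the lvm.conf switch used during verification below, while the init script name is an assumption:

# /etc/lvm/lvm.conf
global {
    use_lvmetad = 1
}

# start the daemon (init script name assumed)
service lvm2-lvmetad start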

Comment 19 Alasdair Kergon 2012-05-14 16:58:56 UTC
The original description is unambiguous.

If a VG has 50 PVs in it, but only one of those has an MDA, only that one disk should be accessed - the others should be skipped.

The bug proposed that this would be fixed provided lvmetad was used.

This means that the 'lvm' process (the 'client' that is talking to lvmetad) should only be accessing the one device that lvmetad tells it is the one containing the metadata.

Investigation is needed to understand why those "Opened" lines still appear in the log messages.

Comment 24 Petr Rockai 2012-10-14 09:38:51 UTC
I would suggest using strace -e open on an lvm command to determine what files/devices it is opening. Something like:

strace -o strace.log -e open pvs
grep /dev strace.log

To get rid of device reads, you also need to disable MD component detection in lvm.conf, since it currently forces the device filter to read bits from each device. Moreover, other filters open the devices to get their size (without reading anything, though). So with current upstream, you may want to check for reads instead (strace -e open,read) and verify that the devices are not being read. I.e., the QA check should be that when lvmetad is active, lvm commands do not *read* from devices, even though they may open them to obtain their size.
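
A rough sketch of that check (the log file name is a placeholder; md_component_detection is the lvm.conf option touched by the patch below):

# in /etc/lvm/lvm.conf, stop the MD filter from reading every device:
#   devices { md_component_detection = 0 }
strace -o strace.log -e open,read pvs
grep '^open("/dev' strace.log    # opens are still allowed (size checks)
# then confirm that no read() follows the open() of a device without an MDA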

Ideally however, with lvmetad active, there would be no device opens for read-only commands, and only MDA devices should be opened for read-write commands. That can be achieved with the following patch:

diff --git a/lib/commands/toolcontext.c b/lib/commands/toolcontext.c
index 5177f41..0ee6ddb 100644
--- a/lib/commands/toolcontext.c
+++ b/lib/commands/toolcontext.c
@@ -774,7 +774,7 @@ static struct dev_filter *_init_filter_components(struct cmd_context *cmd)
         * Listed first because it's very efficient at eliminating
         * unavailable devices.
         */
-       if (find_config_tree_bool(cmd, "devices/sysfs_scan",
+       if (!lvmetad_active() && find_config_tree_bool(cmd, "devices/sysfs_scan",
                             DEFAULT_SYSFS_SCAN)) {
                if ((filters[nr_filt] = sysfs_filter_create(cmd->sysfs_dir)))
                        nr_filt++;
@@ -791,27 +791,29 @@ static struct dev_filter *_init_filter_components(struct cmd_context *cmd)
        } else
                nr_filt++;
 
-       /* device type filter. Required. */
-       cn = find_config_tree_node(cmd, "devices/types");
-       if (!(filters[nr_filt] = lvm_type_filter_create(cmd->proc_dir, cn))) {
-               log_error("Failed to create lvm type filter");
-               goto bad;
-       }
-       nr_filt++;
+       if (!lvmetad_active()) {
+               /* device type filter. Required. */
+               cn = find_config_tree_node(cmd, "devices/types");
+               if (!(filters[nr_filt] = lvm_type_filter_create(cmd->proc_dir, cn))) {
+                       log_error("Failed to create lvm type filter");
+                       goto bad;
+               }
+               nr_filt++;
 
-       /* md component filter. Optional, non-critical. */
-       if (find_config_tree_bool(cmd, "devices/md_component_detection",
-                            DEFAULT_MD_COMPONENT_DETECTION)) {
-               init_md_filtering(1);
-               if ((filters[nr_filt] = md_filter_create()))
-                       nr_filt++;
-       }
+               /* md component filter. Optional, non-critical. */
+               if (find_config_tree_bool(cmd, "devices/md_component_detection",
+                                         DEFAULT_MD_COMPONENT_DETECTION)) {
+                       init_md_filtering(1);
+                       if ((filters[nr_filt] = md_filter_create()))
+                               nr_filt++;
+               }
 
-       /* mpath component filter. Optional, non-critical. */
-       if (find_config_tree_bool(cmd, "devices/multipath_component_detection",
-                            DEFAULT_MULTIPATH_COMPONENT_DETECTION)) {
-               if ((filters[nr_filt] = mpath_filter_create(cmd->sysfs_dir)))
-                       nr_filt++;
+               /* mpath component filter. Optional, non-critical. */
+               if (find_config_tree_bool(cmd, "devices/multipath_component_detection",
+                                         DEFAULT_MULTIPATH_COMPONENT_DETECTION)) {
+                       if ((filters[nr_filt] = mpath_filter_create(cmd->sysfs_dir)))
+                               nr_filt++;
+               }
        }
 
        /* Only build a composite filter if we really need it. */

Comment 26 Nenad Peric 2012-11-19 09:35:25 UTC
Tested by running read-only LVM commands while lvmetad is running and use_lvmetad is set to 1 in lvm.conf.
Even though devices were opened, only the ones with metadata areas were read.

Since this was stated as the requirement in Comment #24, I am marking this BZ as verified.

If lvmetad was off (or otherwise misconfigured, e.g. running lvmetad without use_lvmetad set in lvm.conf), the scan included reading all the devices, as expected.


  PV         VG       Fmt  Attr PSize  PFree  PMdaFree  PMdaSize 
  /dev/sda1  smallvg  lvm2 a--   9.99g  9.99g        0   1020.00k
  /dev/sdb1  smallvg  lvm2 a--   9.99g  9.99g        0   1020.00k
  /dev/sdc1           lvm2 a--  10.00g 10.00g        0         0 
  /dev/sdd1           lvm2 a--  10.00g 10.00g        0         0 
  /dev/vda2  VolGroup lvm2 a--   9.51g     0         0   1020.00k


open("/dev/sda1", O_RDONLY|O_DIRECT|O_NOATIME) = 5
open("/dev/sda1", O_RDONLY)             = 6
open("/dev/sda1", O_RDONLY)             = 5
open("/dev/sda1", O_RDONLY|O_DIRECT|O_NOATIME) = 5
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
read(5, "\253\364/# LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 1024) = 1024
open("/dev/sdb1", O_RDONLY|O_DIRECT|O_NOATIME) = 5
open("/dev/sdb1", O_RDONLY)             = 6
open("/dev/sdb1", O_RDONLY)             = 5
open("/dev/sdb1", O_RDONLY|O_DIRECT|O_NOATIME) = 5
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
read(5, "\253\364/# LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 1024) = 1024
read(4, "response=\"OK\"\nname=\"VolGroup\"\nme", 32) = 32
read(4, "tadata {\n\tid=\"1QA3xc-9cdT-Pces-3"..., 1024) = 1024
read(4, "in\"\n\t\t\tcreation_time=1349267710\n"..., 1056) = 189
open("/dev/vda2", O_RDONLY|O_DIRECT|O_NOATIME) = 5
open("/dev/vda2", O_RDONLY)             = 6
open("/dev/vda2", O_RDONLY)             = 5
open("/dev/vda2", O_RDONLY|O_DIRECT|O_NOATIME) = 5
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(5, "\254YT\374 LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 4096) = 4096
open("/proc/self/task/31826/attr/current", O_RDONLY) = 5


No other devices were being opened for reading. 



Tested with:

lvm2-2.02.98-3.el6

Comment 28 errata-xmlrpc 2013-02-21 08:09:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html