Bug 736027

Summary: [lvm] [scale] pvs command takes more than 3 minutes when VG combined from 300 pvs
Product: Red Hat Enterprise Linux 6
Reporter: Haim <hateya>
Component: lvm2
Assignee: Zdenek Kabelac <zkabelac>
Status: CLOSED ERRATA
QA Contact: Corey Marthaler <cmarthal>
Severity: high
Docs Contact:
Priority: medium
Version: 6.2
CC: agk, bsettle, danken, dwysocha, heinzm, iheim, jbrassow, kzak, lyarwood, mgoldboi, prajnoha, prockai, thornber, yeylon, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: lvm2-2.02.118-1.el6
Doc Type: Bug Fix
Doc Text:
When a VG is built from many PVs (say >50), we advise users to consider the performance/security consequences. The overall speed of lvm2 validation has been significantly improved, but if read access to a PV is slow, the read bottleneck cannot be eliminated. Thus on a system without lvmetad, it is very useful to limit the number of metadata areas in a VG [--metadatacopies] (so they are not read/verified/written 50 times with each use). On the other hand, if the PVs holding metadata are lost or taken out of a VG, the remaining PVs without metadata areas become useless, so users need to balance the risk here. A system with lvmetad enabled should significantly reduce the operation time of individual commands even if the VG has metadata on all PVs.
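As a sketch of the workaround described above (device names and the VG name are illustrative, not taken from this report), metadata areas can be limited either per PV at pvcreate time or VG-wide:

```shell
# Carry metadata on only a few PVs; PVs created with
# --metadatacopies 0 have no metadata area at all.
pvcreate --metadatacopies 1 /dev/sdb /dev/sdc
pvcreate --metadatacopies 0 /dev/sdd /dev/sde

# Alternatively, cap the number of metadata copies VG-wide and
# let lvm2 manage which PVs hold them:
vgcreate --vgmetadatacopies 2 myvg /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Note the trade-off stated above: if all PVs holding metadata are lost, the remaining metadata-less PVs cannot be used to recover the VG.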
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-22 07:36:46 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 1075802, 1159926    
Attachments:
test script for internal lvm test suite

Description Haim 2011-09-06 13:06:49 UTC
Description of problem:

The pvs command takes more than 3 minutes to return when used against a VG with 300+ physical devices.
Attached is an strace log of the pvs command.

topology: 

- RHEL6.2 host 
- host connected with iSCSI connection to 300 LUNs 
- each LUN is a PV in the system
- all 300 LUNs assemble one VG

  --- Volume group ---
  VG Name               6940d483-3c30-4b17-8490-86a966679d6d
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  15
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                6
  Open LV               0
  Max PV                0
  Cur PV                302
  Act PV                302
  VG Size               2.03 TiB
  PE Size               128.00 MiB
  Total PE              16657
  Alloc PE / Size       31 / 3.88 GiB
  Free  PE / Size       16626 / 2.03 TiB
  VG UUID               xn2iVg-Drvm-PCYt-jmg7-bvQs-IqsJ-R0sjGH

Comment 1 Alasdair Kergon 2011-09-06 13:20:49 UTC
lvmdump or at least -vvvv output?
(and when you attach it, make sure the strace has timestamps)

Comment 3 Zdenek Kabelac 2011-09-07 12:14:02 UTC
I've made my own test case on a local device, and I have not seen 3 minutes for 300 devices - so the delay is probably caused by read latency.
(Can you attach strace -ttt ?)

Since lvm2 currently doesn't read multiple devices at once (no threads, no asynchronous reads), the read latency is crucial here.

Since you have not mentioned the lvm2 version in use - maybe you could try the latest RHEL6 brew builds and see if they make it any better?

From a development POV, there is ongoing progress on the lvmetad daemon (in an early development stage) to keep the metadata cached, making this operation much faster.

Comment 5 Zdenek Kabelac 2011-09-25 17:30:51 UTC
Can you attach a timed strace:

 'strace -ttt -o timetrace pvs'

so we can be sure this guess is correct?

Comment 8 Zdenek Kabelac 2011-09-27 18:12:34 UTC
Created attachment 525184 [details]
test script for internal lvm test suite

Ok - as a quick help, you could probably try:

1. A more precise filter line for your pvs command - i.e. avoid scanning /dev/sdXXX devices if you do not need to scan them.

2. If you are sure the devices scanned by the pvs command are never part of any mdraid,
you may further reduce scanning time significantly (since iSCSI has its speed limits) by disabling the mdraid scan, i.e.:

pvs --config 'devices { md_component_detection = 0 }'
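As a concrete sketch combining both suggestions (the filter pattern and device names are assumptions about a typical iSCSI/multipath setup, not taken from this report):

```shell
# lvm.conf (devices section): accept only multipath devices and
# reject everything else, so pvs never opens local /dev/sdXXX disks:
#   filter = [ "a|^/dev/mapper/mpath.*|", "r|.*|" ]

# One-off equivalent on the command line, additionally disabling
# mdraid component detection for the scanned devices:
pvs --config 'devices { filter = [ "a|^/dev/mapper/mpath.*|", "r|.*|" ] md_component_detection = 0 }'
```

Only use such a filter if you are certain no PVs live outside the accepted pattern, since rejected devices become invisible to all lvm2 commands.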

From a devel POV, this scan relies on improvements in the usage of the udev scan (the patch resolving this particular issue has not yet been accepted upstream) and will ultimately be resolved by lvmetad.

Petr, can we figure out some acceptable solution to accelerate these udev scanning filters further?

(Attaching a test-suite script for simple testing - there are many opens per device for the various scanning activities.)

Comment 9 RHEL Product and Program Management 2011-10-07 15:55:37 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 11 Peter Rajnoha 2012-01-04 14:21:35 UTC
(In reply to comment #8)
> Petr can we figure out some acceptable solution to accelerate this udev
> scanning filters more?

When we introduced the patch for reading the list of existing block devices from udev database instead of scanning it directly in /dev, there was also a patch proposed to read additional information we already have in udev database provided by blkid. It holds the information about the type of the device (along with other useful info), e.g. "this is an LVM PV". So we could just read the udev db record and see directly if the block device is a PV or not. If not, we could just skip the device since it's not a part of the LVM at all and speed up the scanning process.

However, this additional patch was rejected as we had to completely trust blkid code to detect PVs correctly. It was suggested that the blkid should be transformed into a utility with support for external plugins and each "type of the device" would provide its own plugin to detect its own devices, maintained directly by the owner of the device/metadata writer. So in our case, it would be a plugin to read PV labels and maintained by us (so any future changes in lvm2 would be directly reflected in that plugin that blkid would use).

I'm still not quite sure whether this model would be acceptable to all (if we do the plugin, then everybody else should follow the same model and provide their own plugins as well). From this point of view, blkid would just act as a scanning kernel, calling back into plugins to check whether the device being processed is owned by someone.

(...putting Karel Zak on CC, the blkid maintainer)

Karel, do you think such an idea is feasible? (I'd say that we could just trust blkid and provide any additional patches if needed, in case there's any change that would prevent blkid from detecting PVs correctly, but let's see...)

Comment 12 Karel Zak 2012-01-10 17:09:11 UTC
(In reply to comment #11)
> When we introduced the patch for reading the list of existing block devices
> from udev database instead of scanning it directly in /dev, there was also a
> patch proposed to read additional information we already have in udev database
> provided by blkid. It holds the information about the type of the device (along
> with other useful info), e.g. "this is an LVM PV". So we could just read the
> udev db record and see directly if the block device is a PV or not. If not, we
> could just skip the device since it's not a part of the LVM at all and speed up
> the scanning process.

A few questions come to mind:

 - what exactly does "read additional information" mean?
 - is there any situation where libblkid is not able to detect an LVM PV?
 
> However, this additional patch was rejected as we had to completely trust blkid
> code to detect PVs correctly. It was suggested that the blkid should be
> transformed into a utility with support for external plugins and each "type of
> the device" would provide its own plugin to detect its own devices maintained
> directly by the owner of the device/metadata writer. So in our case, it would
> be a plugin to read PV labels and maintained by us (so any future changes in
> lvm2 would be directly reflected in that plugin that blkid would use).

Why can't libblkid be used for first-level scanning, with the additional information read by LVM-specific tools called from udev rules?

> I'm still not quite sure whether this model would be acceptable by all (if we
> do the plugin, then everybody else should follow the same model and provide its
> own plugins as well).

Who is "everybody else" in this case? Filesystems maintainers? :-) 

> Karel, do you think such an idea is feasible? 

Are you changing the on-disk format so quickly that standard collaboration between projects has to be replaced with plug-ins? I don't think so... it seems more like an attempt to create an academically perfect system.

> (I'd say that we could just
> trust blkid and provide any additional patches if needed in case there's any 
> change that would prevent the blkid to detect PVs correctly, but let's see...)

Yep, I think that resolving the problem with the current non-plug-in concept should be our first attempt; then we will see...

Comment 13 Milan Broz 2012-03-22 15:06:32 UTC
I think the recent progress is that blkid will be used to scan LVM members; let's try setting the 6.4 flags...

Comment 14 Zdenek Kabelac 2012-03-22 15:27:43 UTC
Two fixes are needed here - a smarter implementation of for_each_pv(), and better handling of filters: currently even 'read-only' commands (like pvs) do not properly cache results for dm devices and repeatedly recheck such devices.

Comment 15 Zdenek Kabelac 2012-03-26 15:59:29 UTC
Initial proposal for review:

https://www.redhat.com/archives/lvm-devel/2012-March/msg00170.html

It probably needs further extension, but it already shows nice improvements.

Comment 16 Zdenek Kabelac 2012-04-25 08:15:42 UTC
Waiting for review.

Comment 22 Zdenek Kabelac 2015-03-04 12:25:30 UTC
While I do have a patch that accelerates some pieces of this slowness puzzle, I still lack a more global view.

It looks like we need to consider a wider range of patches to accelerate lvm2 commands with larger sets of PVs.

We use too many scans per vg/lv command - with quadratic complexity.

Some trivial improvements may likely still be included; meanwhile, the suggestion to use lvmetad & fewer metadata areas still applies.

Comment 23 Zdenek Kabelac 2015-03-09 08:39:12 UTC
So some 'low-hanging fruit' has been pushed upstream, starting with this patch:
https://www.redhat.com/archives/lvm-devel/2015-March/msg00026.html

Yet we still have an outstanding issue (vgcreate) which does not scale well.

Also, while with the current upstream code the 'pvs' operation with 300 PVs in a single VG is nearly instant on fast PV devices, we still need to read the metadata of all 300 PVs to validate correctness. So if there is latency while reading content from individual devices, it will remain slow - the patch only reduces the amount of metadata parsing and validation.

In the case of slower storage, using 'lvmetad' and lowering the number of metadata areas in a VG is the main way to increase the performance of lvm2 commands.


The speed of 'vgcreate' on 300 PVs is not ideal, but that could likely be dealt with in another bugzilla.

Comment 24 Peter Rajnoha 2015-03-13 08:03:35 UTC
There's also another patchset, included in lvm2 version 2.02.116 and higher, which causes LVM to read some information from the udev database that is then used for filtering decisions; this avoids the scans that would otherwise need to be done for filtering devices (opening devices and scanning for signatures, or checking the type of the device being scanned). This can speed up processing considerably too.

To enable this, set devices/external_device_info_source="udev". I also recommend setting devices/obtain_device_list_from_udev=1 along with it.
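In lvm.conf terms, the two settings mentioned above would look roughly like this (a sketch; consult lvm.conf(5) on your system for the exact section layout and defaults):

```shell
# /etc/lvm/lvm.conf
devices {
    # Take the list of block devices from the udev database
    # instead of scanning /dev directly
    obtain_device_list_from_udev = 1

    # Reuse device-type info already gathered by udev/blkid for
    # filtering decisions, avoiding extra opens and signature scans
    external_device_info_source = "udev"
}
```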

Comment 26 Corey Marthaler 2015-05-06 23:21:09 UTC
Looks like we're at 30 seconds now. Marking verified in the latest rpms.

2.6.32-554.el6.x86_64
lvm2-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
lvm2-libs-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
lvm2-cluster-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
udev-147-2.61.el6    BUILT: Mon Mar  2 05:08:11 CST 2015
device-mapper-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-libs-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-event-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-event-libs-1.02.95-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.118-2.el6    BUILT: Wed Apr 15 06:34:08 CDT 2015



[root@host-075 ~]# time pvs
[...]

real    0m30.083s
user    0m0.534s
sys     0m0.183s

[root@host-075 ~]# vgs
  VG         #PV #LV #SN Attr   VSize   VFree  
  VG         304   0   0 wz--n- 199.50g 199.50g

Comment 28 errata-xmlrpc 2015-07-22 07:36:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1411.html