Bug 1986158
| Summary: | udev enumerate interface is slow | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | David Teigland <teigland> |
| Component: | systemd | Assignee: | Michal Sekletar <msekleta> |
| Status: | CLOSED WORKSFORME | QA Contact: | Frantisek Sumsal <fsumsal> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.3 | CC: | dtardon, jbrassow, msekleta, prajnoha, systemd-maint-list, zkabelac |
| Target Milestone: | beta | Keywords: | Performance, Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-10 15:59:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Teigland
2021-07-26 19:22:43 UTC
udev_enumerate_scan_devices() crawls /sys under the hood and you're calling it multiple times. I assume that's the cause of the slowness.

---

We call udev_enumerate_scan_devices only once.

I tried this in a VM with over 1400 devices attached through virtio-scsi, comparing the latest RHEL7 (systemd-219-78.el7_9.3.x86_64) and the latest RHEL8 (systemd-239-50.el8.2.x86_64) versions of systemd/systemd-udevd, both with and without libudev being used to get the list of block devices (devices/obtain_device_list_from_udev in lvm.conf). I noticed a performance drop while using the libudev interface. Looking at strace/ltrace/callgrind:

- the number of certain syscalls (open/openat/fstat...) is much higher on the RHEL8 version (e.g. taking 'open' + 'openat', it is 13037 + 4315 in RHEL7 vs 79143 in RHEL8)
- 'udev_enumerate_scan_devices' takes 41643 usecs/call in RHEL7 vs 1455003 usecs in RHEL8 (other functions have slowed down a bit too, like udev_device_new_from_syspath...)
- comparing the callgrind logs from both, the performance drop seems to be caused by the internal hashmap use inside libudev and its siphash backend (probably because the systemd code in RHEL8 is now more shared with udev in recent versions? Just a guess...)

Here are all the logs I've collected: https://prajnoha.fedorapeople.org/bz1986158/ (I'll also attach them to this bz)

Also, is it really necessary to parse sysfs when the only thing we need is the list of block devices and/or a few properties that udev already knows? This information is already in the udev database, so why the need to parse sysfs? Would it be possible to provide an extension to the libudev API for getting this information directly from the udev db, bypassing any sysfs access/parsing?

---

Created attachment 1820775 [details]
strace, ltrace, callgrind logs while using and not using libudev on RHEL7 and RHEL8

The logs from strace, ltrace and callgrind on RHEL7 vs RHEL8, both with and without using libudev in lvm to get the full list of block devices.
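For reference, here is a minimal sketch of the single-scan enumeration pattern being measured, using only public libudev calls; this is not the actual lvm2 code, just an illustration of the interface under discussion:

```c
/* Minimal sketch of single-scan block device enumeration via libudev;
 * not the actual lvm2 code. Build with: gcc enum.c -ludev */
#include <libudev.h>
#include <stdio.h>

int main(void)
{
        struct udev *udev = udev_new();
        struct udev_enumerate *e;
        struct udev_list_entry *entry;

        if (!udev)
                return 1;

        e = udev_enumerate_new(udev);
        udev_enumerate_add_match_subsystem(e, "block");

        /* The call whose cost regressed: it walks /sys internally. */
        udev_enumerate_scan_devices(e);

        udev_list_entry_foreach(entry, udev_enumerate_get_list_entry(e)) {
                const char *syspath = udev_list_entry_get_name(entry);
                struct udev_device *dev =
                        udev_device_new_from_syspath(udev, syspath);
                if (dev) {
                        const char *node = udev_device_get_devnode(dev);
                        printf("%s\n", node ? node : syspath);
                        udev_device_unref(dev);
                }
        }

        udev_enumerate_unref(e);
        udev_unref(udev);
        return 0;
}
```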
Does udev_enumerate_scan_devices change the list of devices it gets from sysfs before giving it to the caller? If so, what change is it making, and if not, why are we not going to sysfs directly?

---

Comment 7, Zdenek Kabelac:

I assume the basic idea is: udev is supposed to give us the list of block devices we should see as usable/potential PV devices. So the so-called private devices (i.e. mpath legs, raid legs...) are not passed to us. If we are not using all the knowledge the udev DB has, lvm2 would need to reproduce every type of hook currently used by udev to recognize these devices (which lvm2 already does for numerous types).

---

(In reply to Zdenek Kabelac from comment #7)
> I assume the basic idea is: udev is supposed to give us the list of block
> devices we should see as usable/potential PV devices.
>
> So the so-called private devices (i.e. mpath legs, raid legs...) are not
> passed to us.

It doesn't do any filtering AFAICT. One thing that udev_enumerate_scan_devices does is add link names to the basic set of devices in sysfs.

---

Comment 9, Zdenek Kabelac:

I assume the function 'udev_device_get_is_initialized()' is provided by udev to recognize whether we are accessing a device that is meant to be accessed.

I believe there is chaos here, since udev block device maintenance is a mess on all sides, but IMHO the fix should be system-wide: all the 'disk' utilities should understand the privacy logic, so fixing it 'just for lvm2' still leaves us in trouble when other tools randomly access our devices unexpectedly. I.e., the existing issue is that while we do implement a 'retry' operation for public LVs, there is no such 'retry' for subLVs, so if there is a parallel racing access/open on those, lvm2 may currently leak them in the table.

---

(In reply to Zdenek Kabelac from comment #9)
> I believe there is chaos here, since udev block device maintenance is a
> mess on all sides, but IMHO the fix should be system-wide: all the 'disk'
> utilities should understand the privacy logic, so fixing it 'just for
> lvm2' still leaves us in trouble when other tools randomly access our
> devices unexpectedly.

Yes, the way it works now with udev is that each block subsystem marks "private" or "unusable" devices in its own way in udev rules, leaving any udev db user to check all the possible variables these subsystems may set. There's no coordination nor any standard defined here for how such devices should be marked in a single way. However, I don't think this is really a problem for udev itself, because it only concerns block devices. That is actually the part SID is trying to cover within one of its primary goals (to define a standard for marking block device state for sharing among other users/subsystems).

The primary, clear issue here is that udev should provide an easy and quick way of simply enumerating devices with basic filtering like "all block devices" etc. And this is where the regression is seen right now: the traces point to some internal hash usage which wasn't the bottleneck before (my guess is this is due to the code being merged/shared with systemd code).

Also, if we're not interested in any sysfs info and we just need the list of devices as the udev db has it in its records, we should have a way to simply bypass the sysfs scans and get the list from the udev db directly, as it's all already there.
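To illustrate the per-subsystem marking problem, a udev db consumer today ends up with checks along these lines. This is a hedged sketch: DM_UDEV_DISABLE_OTHER_RULES_FLAG is the marker set by the device-mapper udev rules, and other subsystems set their own, differently named variables, which is exactly the lack of standardization described above:

```c
/* Sketch of the per-subsystem "private device" checks a udev db consumer
 * has to do today. Only the device-mapper flag is shown as an example;
 * other block subsystems set their own variables in their udev rules. */
#include <libudev.h>
#include <stdbool.h>
#include <string.h>

static bool device_is_usable(struct udev_device *dev)
{
        const char *dm_flag;

        /* Skip devices whose udev rules have not finished running yet;
         * their db records may be incomplete. */
        if (udev_device_get_is_initialized(dev) <= 0)
                return false;

        /* Properties are answered from the udev db record, not sysfs. */
        dm_flag = udev_device_get_property_value(dev,
                        "DM_UDEV_DISABLE_OTHER_RULES_FLAG");
        if (dm_flag && strcmp(dm_flag, "1") == 0)
                return false;   /* a device-mapper "private" device */

        return true;
}
```

Since udev_device_get_property_value() already answers from the udev database, a fast db-only enumeration entry point, as requested above, would be enough to avoid the sysfs walk entirely.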
---

I was trying to reproduce this on the latest RHEL-8 and I can see that lvm commands are much slower compared to RHEL-9 (on a system with the same number of block devices); however, I don't observe the slowness mentioned in the bug description (i.e. pvscan taking 100 seconds to complete). Hence I am closing the BZ for now (based on my offline discussion with Peter Rajnoha). Feel free to double-check my findings and reopen if needed.
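For anyone re-checking the numbers before reopening, a small hand-rolled harness like the following (a sketch, not part of the attached tests) reproduces the usecs/call measurement quoted above without needing ltrace:

```c
/* Hypothetical micro-benchmark: time one udev_enumerate_scan_devices()
 * call, comparable to the ltrace usecs/call figures quoted earlier.
 * Build with: gcc bench.c -ludev */
#include <libudev.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        struct udev *udev = udev_new();
        struct udev_enumerate *e = udev ? udev_enumerate_new(udev) : NULL;
        struct timespec t0, t1;
        long usec;

        if (!e)
                return 1;

        udev_enumerate_add_match_subsystem(e, "block");

        clock_gettime(CLOCK_MONOTONIC, &t0);
        udev_enumerate_scan_devices(e);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        usec = (t1.tv_sec - t0.tv_sec) * 1000000L +
               (t1.tv_nsec - t0.tv_nsec) / 1000L;
        printf("udev_enumerate_scan_devices: %ld usec\n", usec);

        udev_enumerate_unref(e);
        udev_unref(udev);
        return 0;
}
```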