Bug 1073909 - [RFE] Timeout of udev rules needs solution
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: systemd-maint
QA Contact: qe-baseos-daemons
Blocks: 1173739 1298243 74systemd 1420851
 
Reported: 2014-03-07 13:15 UTC by Zdenek Kabelac
Modified: 2020-12-15 07:29 UTC
CC: 8 users

Doc Type: Enhancement
Last Closed: 2020-12-15 07:29:46 UTC



Description Zdenek Kabelac 2014-03-07 13:15:37 UTC
Description of problem:

This can be seen as a continuation of Bug 918511 and its RUN_TIMEOUT+="..." discussion.

The lvm2 codebase is gradually becoming more dependent on the consistency of the udev database, and for that it really needs the information in the udev database to be reliable.

Unfortunately, killing udev device processing on a timeout is not something we can rely on, unless we at least know that the information for the particular device is in an unreliable state.

Also, in Kay's own words, udev is just a 'kernel' extension; udev should not autonomously (and, depending on the timeout value, essentially randomly) kill rule processing.


Here is a proposal to begin with:

Udev starts scanning a device, which on a heavily loaded system can easily take minutes (e.g. a large enterprise system with a huge disk array). Udev keeps a flag in its DB indicating that a scan for this device is in progress, and leaves the process running without any timeout (so no auto-kill).

When reading the content of the udev db, it is then known whether some device is in the 'scan' state. If e.g. another 'change' event arrives for a device whose scan is already in progress, it just queues one more 'change' for the in-progress task. That way there is no growing list of systemd-udevd scanning processes, but on the other hand there is also no problem with 'lost event' processing, i.e. killing (-9) the processing of an event in the middle of udev rules.

This means the number of udev scan processes will not be constant and may grow; the heuristics get trickier here.

It also means a 'change' event will not be lost, and for a frozen disk there will be a 'frozen' udev scan process; communication with a frozen process blocked in kernel I/O may therefore need some extra care...
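
Below is a minimal sketch of the proposed coalescing, written as a shell script standing in for the per-device worker logic. Everything in it is hypothetical: the state directory, the lock scheme, and run_scan_rules are illustrative names, not real udevd code.

  #!/bin/bash
  # Hypothetical sketch of the proposal: one scan per device at a time;
  # 'change' events arriving mid-scan are coalesced into one pending re-scan.
  dev=$(basename "$1")                  # e.g. "sda"
  statedir=/run/udev-scan               # hypothetical state directory
  mkdir -p "$statedir"

  # Try to become the scanner for this device; mkdir is atomic, so a
  # concurrent invocation fails here and only records a pending 'change'.
  if ! mkdir "$statedir/$dev.scanning" 2>/dev/null; then
      touch "$statedir/$dev.pending"
      exit 0
  fi

  while :; do
      rm -f "$statedir/$dev.pending"
      run_scan_rules "$dev"             # stand-in for the actual rule processing
      # An event that arrived during the scan triggers exactly one more pass.
      [ -e "$statedir/$dev.pending" ] || break
  done
  rmdir "$statedir/$dev.scanning"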



Comment 2 RHEL Program Management 2014-03-22 05:50:30 UTC
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Comment 3 Michal Sekletar 2014-07-30 10:04:50 UTC
A patch from Hannes recently landed on systemd-devel, and Kay has already merged it upstream. Zdenek, do you think it would solve the LVM2 issue?

http://lists.freedesktop.org/archives/systemd-devel/2014-July/021601.html

Comment 4 Peter Rajnoha 2014-07-30 10:46:34 UTC
(In reply to Michal Sekletar from comment #3)
> A patch from Hannes recently landed on systemd-devel, and Kay has already
> merged it upstream. Zdenek, do you think it would solve the LVM2 issue?
> 
> http://lists.freedesktop.org/archives/systemd-devel/2014-July/021601.html

Unfortunately, it won't. It is just a global timeout for the udevd daemon. The same thing can already be done per event via an existing udev rule option, and we are already using it (OPTIONS+="event_timeout=180"); see the sketch at the end of this comment.

The problem here is that we simply do not know what a safe value for the timeout is, since environments differ a lot in the number of devices, in performance, and in how much runs in parallel and slows down udev processing. Even event_timeout=180 may not be enough.

So there is still the possibility that event processing is killed because of a timeout. The requirement in this report is to provide a way to detect such situations: that event processing did not finish properly and that the udev db content may hold stale or incorrect information that does not reflect the actual state. As Zdenek commented in comment #0, we need to get at this information somehow, and the current udev interface does not expose it.

As already commented in bug #918511, the best option would be the possibility to mark the udev db entry as incomplete/dirty, plus the possibility to run a cleanup/notification rule on timeout, so that the rest of the system knows something went wrong and the tools relying on udev and the udev db can react (cleanup, error/warning messages, etc.).
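
For reference, this is how the per-event timeout we use today is installed; a sketch with an illustrative rules-file name (event_timeout itself is the existing udev option quoted above):

  # Install the per-event timeout rule and reload udev.
  cat > /etc/udev/rules.d/00-lvm-event-timeout.rules <<'EOF'
  # 180 s is only a guess, which is exactly the problem described above.
  OPTIONS+="event_timeout=180"
  EOF
  udevadm control --reload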

Comment 5 Marian Csontos 2014-12-16 14:44:57 UTC
The udev worker timing out is a recurring issue.

Since scripts forked by udev rules that take more than 30 s are considered buggy, we need to start gathering information about all of these bugs. So the first step is to make these issues traceable, and sending SIGKILL does not help with that at all.

Please make SIGSEGV the default, so that a core is dumped.

If SIGSEGV cannot be made the default (and why couldn't it be?), I want at least to be able to override the signal.
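
For completeness, collecting the core would also need the usual core-dump plumbing; a sketch, assuming a writable /var/crash and a non-zero core limit for the udevd unit:

  # Route cores somewhere inspectable (assumption: /var/crash exists).
  sysctl -w kernel.core_pattern='/var/crash/core.%e.%p'
  # udevd itself must allow cores, e.g. LimitCORE=infinity in the
  # systemd-udevd unit (a local configuration assumption).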

Comment 6 Kay Sievers 2015-03-18 15:58:56 UTC
(In reply to Peter Rajnoha from comment #4)
> So there is still the possibility that event processing is killed because
> of a timeout. The requirement in this report is to provide a way to detect
> such situations: that event processing did not finish properly and that
> the udev db content may hold stale or incorrect information that does not
> reflect the actual state. As Zdenek commented in comment #0, we need to
> get at this information somehow, and the current udev interface does not
> expose it.

This should be improved now. A killed worker now causes the database content
to be cleared, and the raw kernel event is forwarded to tell userspace that
udev has no better idea about the device:

  http://cgit.freedesktop.org/systemd/systemd/commit/?id=6969c349df91a3cc5fc2cf559a14e32a84db969d
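
With that change, userspace can at least notice that the db entry was cleared. One possible check, where ID_FS_TYPE is only an example of a property the rules normally fill in:

  # Sketch: flag a device whose udev db entry looks incomplete.
  dev=/dev/sdb1
  if ! udevadm info --query=property --name="$dev" | grep -q '^ID_FS_TYPE='; then
      echo "udev db entry for $dev looks incomplete (worker killed?)" >&2
  fi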

Comment 7 Jan Synacek 2016-06-15 11:52:35 UTC
We still lack a way to specify a rule that would be run on the timeout.
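
To illustrate what is being asked for, such a rule might look something like this; the ACTION value and the helper path are purely hypothetical, and no such syntax exists in udev today:

  # Hypothetical only: run a cleanup/notification helper when event
  # processing for a device is killed on timeout.
  ACTION=="timeout", RUN+="/usr/sbin/lvm2-udev-cleanup $env{DEVNAME}"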

Comment 10 RHEL Program Management 2020-12-15 07:29:46 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

