200514 – Udev initialisation takes so long it can affect fsck

Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	udev
Version:	5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Harald Hoyer

Reported:	2006-07-28 11:49 UTC by David Howells
Modified:	2007-11-30 22:11 UTC
CC List:	4 users
Last Closed:	2007-09-20 11:04:08 UTC
Type:	---

Description David Howells 2006-07-28 11:49:03 UTC

Description of problem:

On my Dual 200MHz PPro testbox, with SELinux enabled and using a vanilla
kernel (approximately FC6's kernel), the system will almost always wind up in
the filesystem repair shell.

Version-Release number of selected component (if applicable):

e2fsprogs-1.38-12
udev-084-13
initscripts-8.31.5-1
linux-2.6.18-rc2 up to date with git to 27th July 2006.

How reproducible:

Almost 100% (it appears timing related), but it requires a slow system to show
the effect.

Steps to Reproduce:
1. Install vanilla kernel
2. Boot to it
3. Wait for fsck to fail or complete.

Actual results:

System jumps to filesystem repair shell.

Expected results:

System should boot normally.

Additional info:

The problem appears to be that udev takes such a long time to run that it gets
backgrounded by the boot procedure. However, shortly after it is
backgrounded, fsck is run. What appears to be happening is that udev hasn't
actually created any block device references at this point, and so fsck goes
searching through all the chardev lists in /sys - of which there are a lot,
since each tty dev entry points back to the list of tty dev entries.
Eventually fsck dies on SIGKILL.

From examining things with strace, I can say:

(1) rc.sysinit isn't SIGKILL'ing fsck.

(2) stracing fsck will cause fsck to succeed - I think because it slows fsck
and thus allows udev to catch up.

(3) fsck goes and does a lot of stat64'ing of chardevs in /sys, indirectly
via /dev/.udev. The paths look like this:

/dev/.udev/failed/devices@platform@serial8250/tty:ttyS0/subsystem/ptyc3/subsystem/ptyc3/dev

Note that each of the four tty:ttyS[0-3] entries is iterated through, as are
all 584 "?ty??" entries in the first subsystem directory, and again in the
second subsystem directory (the recursion comes from following symlinks).
That is 4 x 584 x 584, on the order of 1.4 million chardevs.
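
The blow-up described above can be reproduced on a small scale. The following is a minimal sketch, not the libblkid code: it builds a fake tty class directory in a temporary location (the names class_tty, tty0, tty1 and so on are made up for illustration), gives every entry a "subsystem" symlink pointing back at the class directory, and counts the "dev" files reachable by following that symlink twice, as in the path shown above.

```python
import os
import tempfile


def build_fake_tty_class(base, n_entries):
    """Create a hypothetical class directory whose entries link back to it."""
    class_dir = os.path.join(base, "class_tty")
    os.mkdir(class_dir)
    for i in range(n_entries):
        dev_dir = os.path.join(class_dir, f"tty{i}")
        os.mkdir(dev_dir)
        with open(os.path.join(dev_dir, "dev"), "w") as f:
            f.write(f"4:{i}\n")
        # Each entry's "subsystem" symlink points back at the whole class.
        os.symlink(class_dir, os.path.join(dev_dir, "subsystem"))
    return class_dir


def count_paths(device_dir, levels):
    """Count 'dev' files reached after following 'subsystem' `levels` times."""
    if levels == 0:
        return 1 if os.path.exists(os.path.join(device_dir, "dev")) else 0
    subsystem = os.path.join(device_dir, "subsystem")
    return sum(
        count_paths(os.path.join(subsystem, name), levels - 1)
        for name in os.listdir(subsystem)
    )


def demo(n_entries=6, n_start=4, levels=2):
    """Total paths visited when starting from the first n_start entries."""
    with tempfile.TemporaryDirectory() as base:
        class_dir = build_fake_tty_class(base, n_entries)
        starts = sorted(os.listdir(class_dir))[:n_start]
        return sum(
            count_paths(os.path.join(class_dir, s), levels) for s in starts
        )


# Same arithmetic with the numbers from this report: 4 starts, 584 entries.
REPORTED_PATHS = 4 * 584 ** 2

if __name__ == "__main__":
    print(demo(), REPORTED_PATHS)
```

With 6 entries and 4 starting devices the walk already visits 4 x 6 x 6 paths; the count grows with the square of the number of entries, which is why 584 entries is enough to reach the figure quoted above.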

Using gdb shows the search is being conducted in libblkid from
e2fsprogs-libs.

I've upgraded my testbox from an early FC5 to the latest FC5 and that doesn't
change the problem.

As I said above, I think the root of the problem is that udev takes so long to
run (judging by the way the PIDs crank it's running nearly 2000 programs), and
this is a problem on a slow machine.

Running fsck with the same parameter list once the repair shell is available
works almost instantly.

fsck -T -t noopts=_netdev -A -a -C

I'm not sure whether this belongs against the udev, initscripts or e2fsprogs
packages, but I think the first has to be the major culprit: udev needs to be
faster or optional.

Comment 1 Kay Sievers 2006-08-02 14:46:29 UTC

Searching /dev for device nodes that way can't really work; it's a weird
concept, and in this implementation obviously broken. (For that reason, on SUSE,
we patched mount and fsck to use libvolume_id provided by udev.)

Comment 2 Bill Nottingham 2006-08-02 14:52:37 UTC

However, shouldn't having udevsettle in the udev start procedure handle this
for any reasonably local devices (IDE, SCSI)?

Comment 3 Kay Sievers 2006-08-03 07:39:36 UTC

Hmm, isn't this the problem:
 "3) fsck goes and does a lot of stat64'ing of chardevs in /sys"
 "This is on the order of 1.4 million chardevs."

What do you mean? What should udevsettle handle?

Comment 4 Bill Nottingham 2006-08-03 13:02:11 UTC

Maybe I misread; I thought the problem was:

udev starts, starts coldplug, start scripts exit
  <loads scsi, sata, etc driver>
  <disk scan starts>
fsck runs, can't find device
  <disk scan finishes>

and udevsettle would help with that. Perhaps I missed the actual problem here.

Comment 5 Kay Sievers 2006-08-03 15:12:26 UTC

Oh right, looks like it. You need a newer udev, or a backport of udevsettle, for that.
Udevd needs to export the current seqnum, which udevsettle can compare against
the actual kernel number to see if the kernel has events in the queue and not
only in the udev daemon queue. Maybe that's the reason for the failure?