Bug 735866

Summary:

boot hangs with udevadm settle - timeout of 120 seconds

Product:

[Fedora] Fedora

Reporter:

Mads Kiilerich <mads>

Component:

udev

Assignee:

udev-maint

Status:

CLOSED WONTFIX

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

amcnabb, awilliam, cra, harald, jason, jdulaney, jonathan, kay, kchamart, kparal, nojunk311, pquiring, robatino, satellitgo, tflink

Target Milestone:

---

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

RejectedBlocker, AcceptedNTH

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-02-14 02:28:36 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

713565

Attachments:

Description	Flags
smolt and dmesg	none
udevadm error in KVM	none
dmesg with the delay	none
dmesg failing on "regular" i3 laptop for the first time	none

Description Mads Kiilerich 2011-09-05 19:55:00 UTC

Created attachment 521545 [details]
smolt and dmesg

[   12.974328] udevd[780]: failed to execute '/sbin/initqueue' '/sbin/initqueue --settled --onetime --unique /sbin/dmsquash-live-root /dev/sdc': No such file or directory
[  130.927626] udevadm[494]: udevadm settle - timeout of 120 seconds reached, the event queue contains:
[  130.927758] udevadm[494]: /sys/devices/virtual/tty/tty4 (2124)
[  130.932233] systemd[1]: udev-settle.service: main process exited, code=exited, status=1
[  130.965870] systemd[1]: Unit udev-settle.service entered failed state.

The initqueue issue has been reported on bug 734096 - I don't know if it is related.

I think this happens regularly when booting a live usb but rarely when booting from HD.

udev-173-3.fc16.x86_64
dracut-013-8.fc16.noarch
systemd-35-1.fc16.x86_64
kernel-3.1.0-0.rc4.git0.0.fc16.x86_64

Comment 1 Mads Kiilerich 2011-09-07 13:03:59 UTC

The hanging-for-2-minutes symptoms also showed up on bug 733046 - but that was apparently not related to the real issue there ...

Comment 2 Adam Williamson 2011-09-14 00:14:13 UTC

I've seen this too in several boots of F16 live images.

I didn't assign it a very high priority, but I think I just found an unfortunate consequence.

It's not just the boot invocation of udevadm settle which times out and fails, but *any* invocation...and unfortunately, anaconda does udevadm settle when starting up in the live environment, and won't run if it doesn't work.

so, if you hit this bug when booting live, it doesn't just cause you two minutes of inconvenience: it prevents you from installing.

proposing as a Beta blocker.

the 'event queue' message is different for me - it's:

the event queue contains:
   /sys/devices/platform/microcode/firmware/microcode

mads, can you confirm that when hitting this bug, you can't run the live installer?

Comment 3 Adam Williamson 2011-09-14 00:15:43 UTC

I do see the 'failed to execute '/sbin/initqueue' '/sbin/initqueue
--settled --onetime --unique /sbin/dmsquash-live-root /dev/sdc': No such file
or directory' message (though for me it's sdc1).

Comment 4 Adam Williamson 2011-09-14 00:42:12 UTC

hum - I also see that message on boots where the bug does *not* occur. so it alone can't be the reason for the failure.

Comment 5 Adam Williamson 2011-09-14 00:43:47 UTC

* Wed Jul 20 2011 Harald Hoyer <harald> 172-2 
- removed loop devices (should be compiled in the kernel)

reads vaguely like it could be related, I guess.

Comment 6 Mads Kiilerich 2011-09-14 09:03:42 UTC

(In reply to comment #2)
> mads, can you confirm that when hitting this bug, you can't run the live
> installer?

Not easily. I'm using a custom livecd without anaconda.


Note also that I don't _know_ if the initqueue error and the 2 minute delay are related, so I intended to track the delay here and the initqueue error on bug 734096.

I assume that the best strategy for understanding and fixing this issue is to focus on bug 734096 first.

Comment 7 Mads Kiilerich 2011-09-14 13:05:51 UTC

Bug 670964 from f14 seems related ... but it seems like it shows up more frequently now.

Comment 8 Adam Williamson 2011-09-14 16:11:15 UTC

you can still 'yum install anaconda'.

Comment 9 John Dulaney 2011-09-15 19:06:57 UTC

I started seeing this with TC1, it seemed to get a little worse for TC2.
I'm going to estimate 40% of the time I hit the bug; 100% of the time
when I hit it, it doesn't install.  I did not realize the non installation
issue at first because the time I tried a test install the bug did not occur.
I seem to get it an equal amount on an i5 based desktop and P4 based
laptop; I also hit it in a VM on the desktop.

I'll say +1 for blocker.

Comment 10 Adam Williamson 2011-09-15 22:11:28 UTC

can you check with RC1 and see how often you hit it?

it occurs to me that the use of a non-debug kernel in RC1 will likely change the timings on this if it's some kind of race; might make it more common, might make it less common.

Comment 11 Adam Williamson 2011-09-15 23:48:16 UTC

since it seems like there may be multiple causes of this, John, can you paste your exact error message, and possibly also attach your whole /var/log/messages on an affected boot? Thanks!

Comment 12 Kamil Páral 2011-09-16 10:49:13 UTC

Created attachment 523539 [details]
udevadm error in KVM

I started seeing this in KVM machines probably since F16 Beta TC1. It persists also in RC1. I hit this irregularly, maybe in 10-20% of all anaconda installations. After I hit "Start installation" option in GRUB, I see a white screen. Usually there are some kernel printouts after 10-15 seconds and then anaconda starts up, but sometimes I got an udevadm error printed out and the machine hangs. See screenshot.

I am +1 to blocker, Beta or Final.

Comment 13 Adam Williamson 2011-09-16 17:49:05 UTC

Discussed at the 2011-09-16 blocker review meeting. We agreed this is still very vague and could actually be multiple bugs (various things can trigger a stuck udevadm settle), but the impact is sufficient that we're accepting it as a blocker for now.

Can everyone who is hitting this please verify they still hit it in RC1, and please attach a full /var/log/messages from an affected boot? Thanks!

Kamil: it sounds like you're hitting this is the DVD installer, not live - is that accurate?

Comment 14 Adam Williamson 2011-09-16 19:19:17 UTC

so here's a thought: can anyone who can still reproduce this quite easily please test with enforcing=0 and see if it changes anything? I suppose this could be related to the early boot AVCs tracked in https://bugzilla.redhat.com/show_bug.cgi?id=733086 .

Comment 15 Andre Robatino 2011-09-16 20:01:39 UTC

Sorry, didn't intend to remove the needinfo by subscribing to this (but that shouldn't happen when there is no comment).

Comment 16 Mads Kiilerich 2011-09-16 21:40:25 UTC

(In reply to comment #14)
> so here's a thought: can anyone who can still reproduce this quite easily
> please test with enforcing=0 and see if it changes anything?

The original report and dmesg was with enforcing=0.

Instead I would put my first $0.05 on bug 734096. That is the first indication that there is something very strange going on in this area.

Comment 17 Mads Kiilerich 2011-09-16 23:21:02 UTC

Created attachment 523665 [details]
dmesg with the delay

I reproduced it on an a-couple-days old f16+updates-testing live usb with
dracut-013-8.fc16.noarch
systemd-35-1.fc16.x86_64
kernel-3.1.0-0.rc5.git0.0.fc16.x86_64
selinux-policy-targeted-3.10.0-25.fc16.noarch

I can however not reproduce on latest updates-testing with 
kernel-3.1.0-0.rc6.git0.0.fc16.x86_64 (koji)
selinux-policy-targeted-3.10.0-28.fc16.noarch
udev-173-3.fc16.x86_64

Comment 18 Adam Williamson 2011-09-16 23:36:14 UTC

mads: RC1 has all of those versions, I believe - kernel, selinux-policy, and udev.

Comment 19 Adam Williamson 2011-09-17 00:02:40 UTC

I just reproduced this on a RC1-ish live image (one I'm using to test stuff for the selinux bug) booting to runlevel 3, dunno if that was coincidence or not, on my laptop.

queue contains:

/sys/devices/pci0000:00/0000:00:1b.0/sound/card0/pcmC0D0c
/sys/devices/pci0000:00/0000:00:1b.0/sound/card0

Comment 20 Harald Hoyer 2011-09-19 10:51:30 UTC

(In reply to comment #6)
> (In reply to comment #2)
> > mads, can you confirm that when hitting this bug, you can't run the live
> > installer?
> 
> Not easily. I'm using a custom livecd without anaconda.
> 
> 
> Note also that I don't _know_ if the initqueue error and the 2 minute delay are
> related, so I intended to track the delay here and the initqueue error on bug
> 734096.
> 
> I assume that the best strategy for understanding and fixing this issue is to
> focus on bug 734096 first.

I don't think, the initqueue thing has anything to do with it, but I have a fix at hand for that.

Comment 21 Adam Williamson 2011-09-21 21:35:59 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=670964 is rather similar to this bug, note. there's some info there. can people check for the clocksource thing on their affected machines?

Comment 22 Mads Kiilerich 2011-09-21 22:48:45 UTC

Created attachment 524279 [details]
dmesg failing on "regular" i3 laptop for the first time

I just got this issue again - but this time on a shiny laptop where I haven't seen it before.

udevadm settle - timeout of 120 seconds reached, the event queue contains:
  /sys/devices/platform/i8042/serio1/input/input6/mouse1 (1047)
  /sys/devices/platform/i8042 (1318)
  /sys/devices/platform/i8042/serio0 (1319)
  /sys/devices/platform/i8042/serio0/input/input3 (1320)
  /sys/devices/platform/i8042/serio0/input/input3/event3 (1321)
  /sys/devices/platform/i8042/serio1 (1322)
  /sys/devices/platform/i8042/serio1/input/input6 (1323)
  /sys/devices/platform/i8042/serio1/input/input6/event6 (1324)
  /sys/devices/platform/i8042/serio1/input/input6/mouse1 (1325)

I also notice that it seems like there is both a 120 and a 180 s timeout involved:

[    4.807865] generic-usb 0003:413C:2003.0002: input,hidraw1: USB HID v1.10 Keyboard [Dell Dell USB Keyboard] on usb-0000:00:1d.0-1.2.4/input0
[  125.258689] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[  125.359671] dracut: Checking ext4: /dev/disk/by-uuid/7832dd26-f5ce-4653-ac8d-6e4c4ab95aba
[  125.363768] dracut: issuing e2fsck -a  /dev/disk/by-uuid/7832dd26-f5ce-4653-ac8d-6e4c4ab95aba
[  125.447758] dracut: _Fedora-15-x86_6: clean, 158234/14467072 files, 2233159/57848320 blocks
[  125.452319] dracut: Remounting /dev/disk/by-uuid/7832dd26-f5ce-4653-ac8d-6e4c4ab95aba with -o ro
[  125.582402] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[  125.597878] dracut: Mounted root filesystem /dev/sda4
[  186.093651] dracut: Switching root

I notice some time jumping and that it detects the unstable TSC and disable it - quoted directly:

[    0.227489] TSC synchronization [CPU#0 -> CPU#1]:
[    0.227589] Measured 4200644 cycles TSC warp between CPUs, turning off TSC clock.
[    0.001999] Marking TSC unstable due to check_tsc_sync_source failed
[    0.099013] NMI watchdog enabled, takes one hw-pmu counter.
[    0.099204]  #2
[    0.099247] smpboot cpu 2: start_ip = 98000
[    0.170064] NMI watchdog enabled, takes one hw-pmu counter.
[    0.170260]  #3 Ok.
[    0.170353] smpboot cpu 3: start_ip = 98000
[    0.241057] NMI watchdog enabled, takes one hw-pmu counter.

Grepping for interesting keywords shows this:

[    0.000000] ACPI: HPET 00000000caffce98 00038 (v01 TOSCPL TOSCPL00 01072009 AMI. 00000004)
[    0.000000] ACPI: HPET id: 0x8086a701 base: 0xfed00000
[    0.000000] hpet clockevent registered
[    0.000000] Fast TSC calibration using PIT
[    0.227489] TSC synchronization [CPU#0 -> CPU#1]:
[    0.227589] Measured 4200644 cycles TSC warp between CPUs, turning off TSC clock.
[    0.001999] Marking TSC unstable due to check_tsc_sync_source failed
[    2.008712] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[    2.009160] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[    2.010711] Switching to clocksource hpet
[    2.845964] rtc0: alarms up to one month, y3k, 242 bytes nvram, hpet irqs
[    2.858417] rtc_cmos 00:05: setting system clock to 2011-09-22 00:23:23 UTC (1316651003)

- but I really don't know what else to look for. There is no doubt that this is a "random" and very timing sensitive bug, so many completely unrelated change could have a huge and misleading impact in either direction.

/me wonders: What is in the settlement queue? What is it waiting for? How can these events possibly be lost?

Comment 23 Adam Williamson 2011-09-22 00:28:56 UTC

what udevadm settle actually *does* is simply sit there and monitor the udev event queue. when it's empty, it exits and returns success. if the queue doesn't empty before the timeout, it exits and returns failure. the default timeout is 120 secs, but there's a parameter you can pass to set it to whatever you like.

seems like your case is at least possibly related to the USB input stuff...

Comment 24 Kay Sievers 2011-09-22 00:37:03 UTC

9 of 10 of these cases are kernel modules that fail with some piece
of hardware. Please type 'dmesg' in a shell and check if there are any
kernel errors that could be the reason.

Comment 25 Mads Kiilerich 2011-09-22 01:09:28 UTC

Kay, are the dmesg's that already has been attached good enough, or what should be done differently?

FWIW I don't see any indication of any problems in the dmesg's I have seen. It seems like the queue content varies a lot, which could indicate an issue close to udev rather than at the hardware/kernel level.

Comment 26 Adam Williamson 2011-09-22 01:19:29 UTC

mads: it can also indicate that we're dealing with *multiple* kernel bugs.

Comment 27 Tim Flink 2011-09-23 18:40:36 UTC

Discussed in the 2011-09-23 blocker review meeting. We agreed to reject as a blocker for Fedora 16 beta since this doesn't seem to be as severe or as common as originally thought.

However, it was accepted as a NTH for Fedora 16 beta since it would be nice to see fixed.

If this bug does turn out to be more severe/common and there is more data to work off of, please re-submit as a blocker.

Comment 28 Fedora Admin XMLRPC Client 2011-10-20 16:08:24 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 29 Fedora Admin XMLRPC Client 2011-10-20 16:10:36 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 30 Fedora Admin XMLRPC Client 2011-10-20 16:12:48 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 31 Fedora Admin XMLRPC Client 2011-10-20 16:17:13 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 32 Andre Robatino 2011-10-20 20:51:44 UTC

Seeing this on a text-based install from the 16 Final TC2 DVD in VirtualBox 4.1.4. Screen shows

udevadm settle - timeout of 300 seconds reached, the event queue contains: /sys/devices/virtual/vc/vcsa3 (1296) /sys/devices/virtual/vc/vcsa3 (1562)

Comment 33 Andre Robatino 2011-10-20 20:55:09 UTC

Should have mentioned this was the x86_64 DVD. I reran the same install and this time it works normally.

Comment 34 Charles R. Anderson 2011-11-14 14:44:18 UTC

I see udevadm settle hanging during the liveinst installer (whenver the storage code invokes it) on my ThinkPad T61.  The system boots fine, only hangs during liveinst.  I had to repeatedly type "killall udevadm" from a terminal so the install would complete this century.  The install completed and boots successfully, despite having done this workaround.

Comment 35 HughieB 2012-01-12 00:35:06 UTC

This happens to me boot from HD once in a while.
The messages aren't saved in /var/log/messages so I can't give much detail.
Only it was looking at mouse drivers
Most of the time it boots fine.
Mine is a virtualbox guest x64 install on W7 host

Comment 36 pquiring 2012-01-25 20:11:06 UTC

I'm getting this on F16 Final.
The messages are related to snd device.
I had to reboot MANY times before it didn't happen.

Comment 37 Fedora End Of Life 2013-01-17 00:42:17 UTC

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 38 Fedora End Of Life 2013-02-14 02:28:41 UTC

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.