Bug 735866
Summary: | boot hangs with udevadm settle - timeout of 120 seconds | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Mads Kiilerich <mads> | ||||||||||
Component: | udev | Assignee: | udev-maint | ||||||||||
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||||
Severity: | unspecified | Docs Contact: | |||||||||||
Priority: | unspecified | ||||||||||||
Version: | 16 | CC: | amcnabb, awilliam, cra, harald, jason, jdulaney, jonathan, kay, kchamart, kparal, nojunk311, pquiring, robatino, satellitgo, tflink | ||||||||||
Target Milestone: | --- | ||||||||||||
Target Release: | --- | ||||||||||||
Hardware: | Unspecified | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | RejectedBlocker, AcceptedNTH | ||||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2013-02-14 02:28:36 UTC | Type: | --- | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 713565 | ||||||||||||
Attachments: |
|
Description
Mads Kiilerich
2011-09-05 19:55:00 UTC
The hanging-for-2-minutes symptoms also showed up on bug 733046 - but that was apparently not related to the real issue there ... I've seen this too in several boots of F16 live images. I didn't assign it a very high priority, but I think I just found an unfortunate consequence. It's not just the boot invocation of udevadm settle which times out and fails, but *any* invocation...and unfortunately, anaconda does udevadm settle when starting up in the live environment, and won't run if it doesn't work. so, if you hit this bug when booting live, it doesn't just cause you two minutes of inconvenience: it prevents you from installing. proposing as a Beta blocker. the 'event queue' message is different for me - it's: the event queue contains: /sys/devices/platform/microcode/firmware/microcode mads, can you confirm that when hitting this bug, you can't run the live installer? I do see the 'failed to execute '/sbin/initqueue' '/sbin/initqueue --settled --onetime --unique /sbin/dmsquash-live-root /dev/sdc': No such file or directory' message (though for me it's sdc1). hum - I also see that message on boots where the bug does *not* occur. so it alone can't be the reason for the failure. * Wed Jul 20 2011 Harald Hoyer <harald> 172-2 - removed loop devices (should be compiled in the kernel) reads vaguely like it could be related, I guess. (In reply to comment #2) > mads, can you confirm that when hitting this bug, you can't run the live > installer? Not easily. I'm using a custom livecd without anaconda. Note also that I don't _know_ if the initqueue error and the 2 minute delay are related, so I intended to track the delay here and the initqueue error on bug 734096. I assume that the best strategy for understanding and fixing this issue is to focus on bug 734096 first. Bug 670964 from f14 seems related ... but it seems like it shows up more frequently now. you can still 'yum install anaconda'. I started seeing this with TC1, it seemed to get a little worse for TC2. I'm going to estimate 40% of the time I hit the bug; 100% of the time when I hit it, it doesn't install. I did not realize the non installation issue at first because the time I tried a test install the bug did not occur. I seem to get it an equal amount on an i5 based desktop and P4 based laptop; I also hit it in a VM on the desktop. I'll say +1 for blocker. can you check with RC1 and see how often you hit it? it occurs to me that the use of a non-debug kernel in RC1 will likely change the timings on this if it's some kind of race; might make it more common, might make it less common. since it seems like there may be multiple causes of this, John, can you paste your exact error message, and possibly also attach your whole /var/log/messages on an affected boot? Thanks! Created attachment 523539 [details]
udevadm error in KVM
I started seeing this in KVM machines probably since F16 Beta TC1. It persists also in RC1. I hit this irregularly, maybe in 10-20% of all anaconda installations. After I hit "Start installation" option in GRUB, I see a white screen. Usually there are some kernel printouts after 10-15 seconds and then anaconda starts up, but sometimes I got an udevadm error printed out and the machine hangs. See screenshot.
I am +1 to blocker, Beta or Final.
Discussed at the 2011-09-16 blocker review meeting. We agreed this is still very vague and could actually be multiple bugs (various things can trigger a stuck udevadm settle), but the impact is sufficient that we're accepting it as a blocker for now. Can everyone who is hitting this please verify they still hit it in RC1, and please attach a full /var/log/messages from an affected boot? Thanks! Kamil: it sounds like you're hitting this is the DVD installer, not live - is that accurate? so here's a thought: can anyone who can still reproduce this quite easily please test with enforcing=0 and see if it changes anything? I suppose this could be related to the early boot AVCs tracked in https://bugzilla.redhat.com/show_bug.cgi?id=733086 . Sorry, didn't intend to remove the needinfo by subscribing to this (but that shouldn't happen when there is no comment). (In reply to comment #14) > so here's a thought: can anyone who can still reproduce this quite easily > please test with enforcing=0 and see if it changes anything? The original report and dmesg was with enforcing=0. Instead I would put my first $0.05 on bug 734096. That is the first indication that there is something very strange going on in this area. Created attachment 523665 [details]
dmesg with the delay
I reproduced it on an a-couple-days old f16+updates-testing live usb with
dracut-013-8.fc16.noarch
systemd-35-1.fc16.x86_64
kernel-3.1.0-0.rc5.git0.0.fc16.x86_64
selinux-policy-targeted-3.10.0-25.fc16.noarch
I can however not reproduce on latest updates-testing with
kernel-3.1.0-0.rc6.git0.0.fc16.x86_64 (koji)
selinux-policy-targeted-3.10.0-28.fc16.noarch
udev-173-3.fc16.x86_64
mads: RC1 has all of those versions, I believe - kernel, selinux-policy, and udev. I just reproduced this on a RC1-ish live image (one I'm using to test stuff for the selinux bug) booting to runlevel 3, dunno if that was coincidence or not, on my laptop. queue contains: /sys/devices/pci0000:00/0000:00:1b.0/sound/card0/pcmC0D0c /sys/devices/pci0000:00/0000:00:1b.0/sound/card0 (In reply to comment #6) > (In reply to comment #2) > > mads, can you confirm that when hitting this bug, you can't run the live > > installer? > > Not easily. I'm using a custom livecd without anaconda. > > > Note also that I don't _know_ if the initqueue error and the 2 minute delay are > related, so I intended to track the delay here and the initqueue error on bug > 734096. > > I assume that the best strategy for understanding and fixing this issue is to > focus on bug 734096 first. I don't think, the initqueue thing has anything to do with it, but I have a fix at hand for that. https://bugzilla.redhat.com/show_bug.cgi?id=670964 is rather similar to this bug, note. there's some info there. can people check for the clocksource thing on their affected machines? Created attachment 524279 [details]
dmesg failing on "regular" i3 laptop for the first time
I just got this issue again - but this time on a shiny laptop where I haven't seen it before.
udevadm settle - timeout of 120 seconds reached, the event queue contains:
/sys/devices/platform/i8042/serio1/input/input6/mouse1 (1047)
/sys/devices/platform/i8042 (1318)
/sys/devices/platform/i8042/serio0 (1319)
/sys/devices/platform/i8042/serio0/input/input3 (1320)
/sys/devices/platform/i8042/serio0/input/input3/event3 (1321)
/sys/devices/platform/i8042/serio1 (1322)
/sys/devices/platform/i8042/serio1/input/input6 (1323)
/sys/devices/platform/i8042/serio1/input/input6/event6 (1324)
/sys/devices/platform/i8042/serio1/input/input6/mouse1 (1325)
I also notice that it seems like there is both a 120 and a 180 s timeout involved:
[ 4.807865] generic-usb 0003:413C:2003.0002: input,hidraw1: USB HID v1.10 Keyboard [Dell Dell USB Keyboard] on usb-0000:00:1d.0-1.2.4/input0
[ 125.258689] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[ 125.359671] dracut: Checking ext4: /dev/disk/by-uuid/7832dd26-f5ce-4653-ac8d-6e4c4ab95aba
[ 125.363768] dracut: issuing e2fsck -a /dev/disk/by-uuid/7832dd26-f5ce-4653-ac8d-6e4c4ab95aba
[ 125.447758] dracut: _Fedora-15-x86_6: clean, 158234/14467072 files, 2233159/57848320 blocks
[ 125.452319] dracut: Remounting /dev/disk/by-uuid/7832dd26-f5ce-4653-ac8d-6e4c4ab95aba with -o ro
[ 125.582402] EXT4-fs (sda4): mounted filesystem with ordered data mode. Opts: (null)
[ 125.597878] dracut: Mounted root filesystem /dev/sda4
[ 186.093651] dracut: Switching root
I notice some time jumping and that it detects the unstable TSC and disable it - quoted directly:
[ 0.227489] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.227589] Measured 4200644 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.001999] Marking TSC unstable due to check_tsc_sync_source failed
[ 0.099013] NMI watchdog enabled, takes one hw-pmu counter.
[ 0.099204] #2
[ 0.099247] smpboot cpu 2: start_ip = 98000
[ 0.170064] NMI watchdog enabled, takes one hw-pmu counter.
[ 0.170260] #3 Ok.
[ 0.170353] smpboot cpu 3: start_ip = 98000
[ 0.241057] NMI watchdog enabled, takes one hw-pmu counter.
Grepping for interesting keywords shows this:
[ 0.000000] ACPI: HPET 00000000caffce98 00038 (v01 TOSCPL TOSCPL00 01072009 AMI. 00000004)
[ 0.000000] ACPI: HPET id: 0x8086a701 base: 0xfed00000
[ 0.000000] hpet clockevent registered
[ 0.000000] Fast TSC calibration using PIT
[ 0.227489] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.227589] Measured 4200644 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.001999] Marking TSC unstable due to check_tsc_sync_source failed
[ 2.008712] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[ 2.009160] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[ 2.010711] Switching to clocksource hpet
[ 2.845964] rtc0: alarms up to one month, y3k, 242 bytes nvram, hpet irqs
[ 2.858417] rtc_cmos 00:05: setting system clock to 2011-09-22 00:23:23 UTC (1316651003)
- but I really don't know what else to look for. There is no doubt that this is a "random" and very timing sensitive bug, so many completely unrelated change could have a huge and misleading impact in either direction.
/me wonders: What is in the settlement queue? What is it waiting for? How can these events possibly be lost?
what udevadm settle actually *does* is simply sit there and monitor the udev event queue. when it's empty, it exits and returns success. if the queue doesn't empty before the timeout, it exits and returns failure. the default timeout is 120 secs, but there's a parameter you can pass to set it to whatever you like. seems like your case is at least possibly related to the USB input stuff... 9 of 10 of these cases are kernel modules that fail with some piece of hardware. Please type 'dmesg' in a shell and check if there are any kernel errors that could be the reason. Kay, are the dmesg's that already has been attached good enough, or what should be done differently? FWIW I don't see any indication of any problems in the dmesg's I have seen. It seems like the queue content varies a lot, which could indicate an issue close to udev rather than at the hardware/kernel level. mads: it can also indicate that we're dealing with *multiple* kernel bugs. Discussed in the 2011-09-23 blocker review meeting. We agreed to reject as a blocker for Fedora 16 beta since this doesn't seem to be as severe or as common as originally thought. However, it was accepted as a NTH for Fedora 16 beta since it would be nice to see fixed. If this bug does turn out to be more severe/common and there is more data to work off of, please re-submit as a blocker. This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component. This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component. This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component. This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component. Seeing this on a text-based install from the 16 Final TC2 DVD in VirtualBox 4.1.4. Screen shows udevadm settle - timeout of 300 seconds reached, the event queue contains: /sys/devices/virtual/vc/vcsa3 (1296) /sys/devices/virtual/vc/vcsa3 (1562) Should have mentioned this was the x86_64 DVD. I reran the same install and this time it works normally. I see udevadm settle hanging during the liveinst installer (whenver the storage code invokes it) on my ThinkPad T61. The system boots fine, only hangs during liveinst. I had to repeatedly type "killall udevadm" from a terminal so the install would complete this century. The install completed and boots successfully, despite having done this workaround. This happens to me boot from HD once in a while. The messages aren't saved in /var/log/messages so I can't give much detail. Only it was looking at mouse drivers Most of the time it boots fine. Mine is a virtualbox guest x64 install on W7 host I'm getting this on F16 Final. The messages are related to snd device. I had to reboot MANY times before it didn't happen. This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed. |