Description of problem: anaconda is hanging or freezing during the initial storage scan. There appears to be no CPU or disk activity and nothing is being logged to any of the VTs or log files. Version-Release number of selected component (if applicable): Fedora 17 Alpha RC4 (Gold) - x86_64 How reproducible: 100% Steps to Reproduce: 1. Attempt to install Fedora 17 Alpha 2. Leave "basic storage devices" selected 3. Hit next Actual results: After a period of disk activity, everything stops, including the progress bar. Expected results: Installation should continue
Created attachment 565802 [details] anaconda.log
Created attachment 565807 [details] program.log
Created attachment 565808 [details] storage.log
Created attachment 565809 [details] X.log
Created attachment 565810 [details] output of "ps fax"
I just tried installing with the netinst iso without any sort of problem in a KVM VM. Looking at your storage.log I notice that you have two disks - what storage configuration is on them right now, if any? I think that might be relevant here.
(In reply to comment #6) > I just tried installing with the netinst iso without any sort of problem in a > KVM VM. Looking at your storage.log I notice that you have two disks - what > storage configuration is on them right now, if any? I think that might be > relevant here. You're probably right. Pretty much every Fedora release has been a battle to install since the anaconda storage rewrite. Basic layout is 2 disks, partitioned identically into 12 partitions. With a couple of exceptions, all of the partitions are paired up into MD RAID-1 devices. Those RAID-1 devices are then used as LVM PVs (except for md0, which is my /boot device).
"udevadm settle" appears to be the problem. It looks like anaconda calls "udevadm settle --timeout=300" 5 times every time it stops an MD device. After it stops the last MD device, this is not returning until the 5 minute timeout expires -- 5 times. So installation does continue after a 25 minute delay.
Created attachment 565818 [details] Updated program.log showing udevadm 5-minute timeout (line 1221)
How about the syslog, please? Either /tmp/syslog (non-live) or /var/log/messages (live).
Created attachment 566081 [details] syslog (In reply to comment #10) > How about the syslog, please? Either /tmp/syslog (non-live) or > /var/log/messages (live). Attached. In this case, the storage scan began around 15:36:03. There's a bunch of software RAID-related activity for about 50 seconds, and then everything goes quiet until all of the udevadm timeouts expire -- at which point I bring up the network interface to scp the syslog over. HTH
I don't see anything obvious, although it is pretty difficult with a bunch of subsets of the log set, all from different runs. If the problem is that udevadm settle is timing out, that is not a problem in anaconda. FWIW I have heard some reports of _extremely_ slow disk I/O in F17 that seem to only occur on some bare-metal BIOS systems.
Created attachment 566187 [details] anaconda.log
Created attachment 566188 [details] storage.log
Created attachment 566189 [details] X.log
Created attachment 566190 [details] program.log
Created attachment 566191 [details] syslog
New set of log files uploaded. Changing component to udev.
The last comment in bug 569019 sounds very similar to what I'm seeing. Unfortunately, there's no indication of any resolution.
(In reply to comment #18) > New set of log files uploaded. > > Changing component to udev. What makes you think udev is the culprit? Calling "udevadm settle" and getting a timeout can mean a lot of things. You might want to set "udev.log-priority=info" on the kernel command line. Or even "udev.log-priority=debug". anaconda should use libudev or libgudev and not rely on "udevadm settle"
(In reply to comment #20) > What makes you think udev is the culprit? It's the obvious place to look. > Calling "udevadm settle" and getting a timeout can mean a lot of things. Great. How can I find out what it means in this case? > You might want to set "udev.log-priority=info" on the kernel command line. > Or even "udev.log-priority=debug". Trying with "udev.log-priority=debug" right now, and will attach new logs when the timeouts expire. Unfortunately, there's no additional information in the syslog. > anaconda should use libudev or libgudev and not rely on "udevadm settle" (In reply to comment #12) > If the problem is that udevadm settle is timing out, that is not a problem in > anaconda. So the problem isn't in anaconda or udev?
Created attachment 566339 [details] anaconda.log (with udev.log-priority=debug)
Created attachment 566340 [details] program.log (with udev.log-priority=debug)
Created attachment 566341 [details] storage.log (with udev.log-priority=debug)
Created attachment 566342 [details] syslog (with udev.log-priority=debug)
Created attachment 566343 [details] X.log (with udev.log-priority=debug)
Created attachment 566344 [details] strace of "hung" udevadm process
This may be related to bug 795050, with the RC4 kernel I saw extremely slow boot times and very long delays when doing storage scan and disk partitioning.
(In reply to comment #21) > (In reply to comment #12) > > If the problem is that udevadm settle is timing out, that is not a problem in > > anaconda. > > So the problem isn't in anaconda or udev? If udev is busy longer than the default timeout, you just have to wait longer... or wait endlessly...: while ! udevadm settle; echo "Still waiting for udev to finish"; done
(In reply to comment #29) > If udev is busy longer than the default timeout, you just have to wait > longer... > > or wait endlessly...: I'm going to re-ask the following question: > > Calling "udevadm settle" and getting a timeout can mean a lot of things. > > Great. How can I find out what it means in this case? I would appreciate a constructive answer.
(In reply to comment #27) > Created attachment 566344 [details] > strace of "hung" udevadm process this is bogus :) "udevadm settle" is not hanging, it just takes a peek, if the udevd work queue is empty and does not return unless this state is given or the timeout was reached.
(In reply to comment #30) > (In reply to comment #29) > > If udev is busy longer than the default timeout, you just have to wait > > longer... > > > > or wait endlessly...: > > I'm going to re-ask the following question: > > > > Calling "udevadm settle" and getting a timeout can mean a lot of things. > > > > Great. How can I find out what it means in this case? > > I would appreciate a constructive answer. It means, that udevd has received an event and is still processing it. Might be, that a helper tool was spawned and is hanging. "ps" is your friend.
(In reply to comment #31) > this is bogus :) "udevadm settle" is not hanging, it just takes a peek, if the > udevd work queue is empty and does not return unless this state is given or the > timeout was reached. That's why I put it in quotes. How can I see what's in the udev event queue? Google give me some hits about a /etc/.udev directory (on Ubuntu?), but there doesn't seem to be anything like that on Fedora. (In reply to comment #32) > It means, that udevd has received an event and is still processing it. Might > be, that a helper tool was spawned and is hanging. "ps" is your friend. That's not happening here. See the ps output I attached when I first created the bug.
There was a /dev/.udev directory in the past, never a /etc/.udev. Today it's /run/udev. The queue export is a binary file and stored in that directory. The udevadm settle command prints the remaining events in the udev event queue, when it has reached the timeout. It looks like: $ udevadm settle --timeout=1 udevadm settle - timeout of 1 seconds reached, the event queue contains: /sys/devices/virtual/mem/full (2425) /sys/devices/virtual/mem/kmsg (2426) /sys/devices/virtual/mem/mem (2427) ... Maybe that reveals something what's going on with the storage management.
"udevadm settle --timeout=5" returns immediately with no output. (This is while the udevadm process started by anaconda is sitting there waiting for its timeout to expire.) What does that indicate?
(In reply to comment #35) > "udevadm settle --timeout=5" returns immediately with no output. (This is > while the udevadm process started by anaconda is sitting there waiting for its > timeout to expire.) > > What does that indicate? Something strange. :) Can you attach strace again to the looping udevadm process, with -v. It should show something like: open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 4 fstat(4, {st_dev=makedev(0, 18), st_ino=740329, and we need the st_dev and inode number. Then run from another udevadm from a login: strace -v udevadm settle which does not not block. Are the dev_t and inode numbers of both processes the same? Why? A wild guess is that something has over-mounted /run and the old udevadm process sees the old file which never changes, because the running daemon operates on the overmounted and different one. If the files are identical, please attach the binary file here, and I will take a look at the content.
Ian, you are right: attachment 566344 [details] ... sorry for comment 31 strace from the "hung" "udevadm settle": open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 12 read(12, "\245\f\0\0\0\0\0\0", 4096) = 8 sequence 0x0ff5 == 4085 this seems to be bogus, because: open("/sys/kernel/uevent_seqnum", O_RDONLY|O_CLOEXEC) = 14 read(14, "3238\n", 32) = 5 Can this be an "old" "/run/udev/queue.bin" ?? I hope /run is still mounted from the initramfs and does not contain leftovers from some anaconda image build process.
It looks like they're reading the same file. The strace of the looping process which was started by anaconda includes: open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 12 fstat(12, {st_dev=makedev(0, 17), st_ino=25069, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=8, st_atime=2012/03/01-19:24:03, st_mtime=2012/03/01-19:24:03, st_ctime=2012/03/01-19:24:03}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa72e9e000 read(12, "\3\r\0\0\0\0\0\0", 4096) = 8 read(12, "", 4096) = 0 open("/sys/kernel/uevent_seqnum", O_RDONLY|O_CLOEXEC) = 14 read(14, "3332\n", 32) = 5 close(14) = 0 close(12) = 0 The strace of the process run from a shell, which returns immediately includes: open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 4 fstat(4, {st_dev=makedev(0, 17), st_ino=25069, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=8, st_atime=2012/03/01-19:24:03, st_mtime=2012/03/01-19:24:03, st_ctime=2012/03/01-19:24:03}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4e212c7000 read(4, "\3\r\0\0\0\0\0\0", 4096) = 8 read(4, "", 4096) = 0 open("/sys/kernel/uevent_seqnum", O_RDONLY|O_CLOEXEC) = 5 read(5, "3332\n", 32) = 5 close(5) = 0 close(4) = 0 So the two processes appear to be reading from the same file and seeing the same contents -- which match the queue.bin file that I copied: 03 0D 00 00 00 00 00 00 0x0D03 = 3331, and both processes are reading 3332 from uevent_seqnum.
what is the contents of /proc/<pid>/cmdline of the looping settle?
(In reply to comment #39) > what is the contents of /proc/<pid>/cmdline of the looping settle? udevadm.settle.--timeout=300. (where the . characters represent 0x00)
(In reply to comment #37) > Ian, you are right: attachment 566344 [details] ... sorry for comment 31 > > strace from the "hung" "udevadm settle": > > open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 12 > read(12, "\245\f\0\0\0\0\0\0", 4096) = 8 > > sequence 0x0ff5 == 4085 > > this seems to be bogus, because: > > open("/sys/kernel/uevent_seqnum", O_RDONLY|O_CLOEXEC) = 14 > read(14, "3238\n", 32) = 5 > > Can this be an "old" "/run/udev/queue.bin" ?? > > I hope /run is still mounted from the initramfs and does not contain leftovers > from some anaconda image build process. ok, this was miscalculated... # echo -e '\0245\f'|hexdump 0000000 0ca5 000a 0000003 # echo $((0x0ca5)) 3237 So, we still miss one event.
(In reply to comment #38) > It looks like they're reading the same file. The strace of the looping process > which was started by anaconda includes: > > open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 12 > fstat(12, {st_dev=makedev(0, 17), st_ino=25069, st_mode=S_IFREG|0644, > st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=8, > st_atime=2012/03/01-19:24:03, st_mtime=2012/03/01-19:24:03, > st_ctime=2012/03/01-19:24:03}) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7ffa72e9e000 > read(12, "\3\r\0\0\0\0\0\0", 4096) = 8 > read(12, "", 4096) = 0 > open("/sys/kernel/uevent_seqnum", O_RDONLY|O_CLOEXEC) = 14 > read(14, "3332\n", 32) = 5 > close(14) = 0 > close(12) = 0 > > The strace of the process run from a shell, which returns immediately includes: > > open("/run/udev/queue.bin", O_RDONLY|O_CLOEXEC) = 4 > fstat(4, {st_dev=makedev(0, 17), st_ino=25069, st_mode=S_IFREG|0644, > st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=8, > st_atime=2012/03/01-19:24:03, st_mtime=2012/03/01-19:24:03, > st_ctime=2012/03/01-19:24:03}) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7f4e212c7000 > read(4, "\3\r\0\0\0\0\0\0", 4096) = 8 > read(4, "", 4096) = 0 > open("/sys/kernel/uevent_seqnum", O_RDONLY|O_CLOEXEC) = 5 > read(5, "3332\n", 32) = 5 > close(5) = 0 > close(4) = 0 > > So the two processes appear to be reading from the same file and seeing the > same contents -- which match the queue.bin file that I copied: > > 03 0D 00 00 00 00 00 00 > > 0x0D03 = 3331, and both processes are reading 3332 from uevent_seqnum. And 3331 is less than 3332 and your shell "udevadm settle" should not return with success. How do you invoke it?
(In reply to comment #5) > Created attachment 565810 [details] > output of "ps fax" The strange thing here is, that we see no udevd childs (udev worker). Was the udevd killed and restarted? Maybe by systemd?
Any chance you can provide us a login to a box that shows that weird behaviour? I have no real idea what's going on here, and what can cause this.
Created attachment 567986 [details] anaconda.log
Created attachment 567988 [details] program.log
Created attachment 567990 [details] output of "ps fax"
Created attachment 567991 [details] /run/udev/queue.bin
Created attachment 567992 [details] output of "ls -li /run/udev"
Created attachment 567993 [details] storage.log
Created attachment 567994 [details] syslog
Created attachment 567995 [details] X.log
Created attachment 567996 [details] strace of looping udevadm process (udevadm settle --timeout=300)
Created attachment 567997 [details] strace of "udevadm settle --timeout=5" run from shell
I just uploaded a whole new set of files, just so everything would be consistent. Nothing particularly new. Both udevadm processes are reading 3553 (0x0DE1) from queue.bin, which matches the contents of the file. The inode number also matches the output of "ls -li". Both processes are also reading 3554 from uevent_seqnum. (In reply to comment #42) > And 3331 is less than 3332 and your shell "udevadm settle" should not return > with success. How do you invoke it? I'm just typing "udevadm settle --timeout=5" in the shell. Complete strace attached. (In reply to comment #43) > Was the udevd killed and restarted? Maybe by systemd? Just in case you missed it, both "ps fax" output files that I've attached show /usr/lib/udev/udevd running (PID 390 in both cases). (In reply to comment #44) > Any chance you can provide us a login to a box that shows that weird behaviour? > > I have no real idea what's going on here, and what can cause this. I'm not opposed to the idea. This is my main home workstation, so the logistics may be a bit challenging. If you want to access it while it's in the "looping" state, I'll need to be in front of it to start the install, etc. We would need to set up a time and hop on the phone/IRC to coordinate. If you just want to access the system in its normal state (running Fedora 16), that would be easier. Let me know what you need.
Looking at the strace of the udevadm process started from the shell, I see that poll is timing out 5 times. So I may have been wrong about the fact that it was returning immediately. (I thought that 5 seconds was a sufficiently long timeout that I would be able to tell the difference, but perhaps not.) So if it is actually timing out, the question would be why isn't it giving me any output?
whatever this is, it isn't anaconda. reassigning to udev.
There is a slight chance that it is caused by what is tried to be fixed here: http://git.kernel.org/?p=linux/kernel/git/gregkh/driver-core.git;a=commitdiff;h=7b60a18da393ed70db043a777fd9e6d5363077c4
*** Bug 803080 has been marked as a duplicate of this bug. ***
reassigning to kernel because of comment 58
That patch made it into 3.4, which means it should have been in the final F17 image. Is this still a problem ?
(In reply to comment #61) > That patch made it into 3.4, which means it should have been in the final > F17 image. Is this still a problem ? I did not see this problem when I installed F17 TC3.
(In reply to comment #62) > (In reply to comment #61) > > That patch made it into 3.4, which means it should have been in the final > > F17 image. Is this still a problem ? > > I did not see this problem when I installed F17 TC3. Thanks for letting us know.