Description of problem: With the latest update to Qemu (version 1.0), the 'check' target in the Makefile hangs with: ... febootstrap: internal insmod virtio_ring.ko.gz febootstrap: internal insmod virtio-rng.ko.gz febootstrap: internal insmod virtio_console.ko.gz febootstrap: internal insmod virtio_balloon.ko.gz febootstrap: internal insmod virtio_pci.ko.gz febootstrap: internal insmod virtio_blk.ko.gz [ 240.210126] INFO: task init:1 blocked for more than 120 seconds. [ 240.211982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Full build log is here: http://paste.ubuntu.com/772629/ This is on an Arch Linux x86_64 machine with the package versions shown below. When it hangs, "ps aux | grep qemu-kvm" shows this: http://paste.kde.org/174722/ qemu doesn't seem to be doing anything when the build hangs as the CPU usage of the process is so low that it doesn't show up in top/htop. Version-Release number of selected component (if applicable): febootstrap 3.12 Both libguestfs stable 1.14.7 and development 1.15.11 qemu-kvm (with spice patches) 1.0 How reproducible: Always Steps to Reproduce: 1. Compile and install qemu-kvm 1.0 (with or without spice). 2. Compile libguestfs with "make" This part is okay. 3. Check the build using "make check" Actual results: "make check" hangs. Expected results: "make check" should pass just like it did with qemu 0.15
This problem was also mentioned on the mailing list. https://www.redhat.com/archives/libguestfs/2011-December/msg00132.html For some reason the Arch kernel has disabled ordinary kernel messages, or else we are not picking them up. Nevertheless, I can reproduce this problem with Fedora, kernel 3.2.0, and upstream qemu from git. The errors reported by the Fedora kernel are below. My first impression would be that this is simply a bug in qemu. [ 0.853049] vda: unknown partition table [ 1.106371] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input1 [ 240.494566] INFO: task init:1 blocked for more than 120 seconds. [ 240.496486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 240.499265] init D 0000000000000000 3280 1 0 0x00000000 [ 240.501114] ffff88001edc5798 0000000000000046 000000000ca403db ffffffff81c18f00 [ 240.502707] ffff88001edc8000 ffff88001edc5fd8 ffff88001edc5fd8 ffff88001edc5fd8 [ 240.503903] ffffffff81c0d020 ffff88001edc8000 ffff88001edc5798 ffffffff810affad [ 240.505035] Call Trace: [ 240.505755] [<ffffffff810affad>] ? ktime_get_ts+0xad/0xe0 [ 240.506682] [<ffffffff81144240>] ? __lock_page+0x70/0x70 [ 240.507765] [<ffffffff81675c3f>] schedule+0x3f/0x60 [ 240.508738] [<ffffffff81675cef>] io_schedule+0x8f/0xd0 [ 240.512676] [<ffffffff8114424e>] sleep_on_page+0xe/0x20 [ 240.513240] [<ffffffff8167657a>] __wait_on_bit_lock+0x5a/0xc0 [ 240.513646] [<ffffffff81144237>] __lock_page+0x67/0x70 [ 240.513646] [<ffffffff810a4ec0>] ? autoremove_wake_function+0x50/0x50 [ 240.513646] [<ffffffff811457b7>] do_read_cache_page+0x167/0x190 [ 240.513646] [<ffffffff81225ced>] ? find_valid_gpt+0x9d/0x610 [ 240.513646] [<ffffffff811e73c0>] ? blkdev_write_begin+0x30/0x30 [ 240.513646] [<ffffffff8114582c>] read_cache_page_async+0x1c/0x20 [ 240.513646] [<ffffffff8114583e>] read_cache_page+0xe/0x20 [ 240.513646] [<ffffffff81222410>] read_dev_sector+0x30/0x90 [ 240.513646] [<ffffffff8122571c>] read_lba+0xbc/0x120 [ 240.513646] [<ffffffff81225d0b>] find_valid_gpt+0xbb/0x610 [ 240.520796] [<ffffffff8114e46a>] ? __alloc_pages_nodemask+0x14a/0xa30 [ 240.521910] [<ffffffff81226260>] ? find_valid_gpt+0x610/0x610 [ 240.522749] [<ffffffff812262e2>] efi_partition+0x82/0x3f0 [ 240.523914] [<ffffffff81310264>] ? snprintf+0x34/0x40 [ 240.524949] [<ffffffff81226260>] ? find_valid_gpt+0x610/0x610 [ 240.526420] [<ffffffff81223278>] rescan_partitions+0x1b8/0x500 [ 240.527573] [<ffffffff811e824e>] __blkdev_get+0x2ce/0x430 [ 240.528722] [<ffffffff811c8e3b>] ? unlock_new_inode+0x5b/0x80 [ 240.529928] [<ffffffff811e7268>] ? bdget+0x118/0x140 [ 240.531364] [<ffffffff811e8403>] blkdev_get+0x53/0x350 [ 240.532907] [<ffffffff811c8e3b>] ? unlock_new_inode+0x5b/0x80 [ 240.533769] [<ffffffff81305f17>] ? kobject_put+0x27/0x60 [ 240.534425] [<ffffffff812f30dd>] ? disk_get_part+0x1d/0x190 [ 240.535311] [<ffffffff812f3722>] register_disk+0x152/0x170 [ 240.535981] [<ffffffff812f37e6>] add_disk+0xa6/0x350 [ 240.536625] [<ffffffffa0059fa7>] virtblk_probe+0x48f/0x534 [virtio_blk] [ 240.537631] [<ffffffffa0059110>] ? virtblk_config_changed+0x30/0x30 [virtio_blk] [ 240.538824] [<ffffffff813a8840>] ? vp_find_vqs+0xc0/0xc0 [ 240.540082] [<ffffffff813a6bb3>] virtio_dev_probe+0xe3/0x140 [ 240.540626] [<ffffffff813f71ec>] driver_probe_device+0x9c/0x300 [ 240.541617] [<ffffffff813f74fb>] __driver_attach+0xab/0xb0 [ 240.543005] [<ffffffff813f7450>] ? driver_probe_device+0x300/0x300 [ 240.544110] [<ffffffff813f7450>] ? driver_probe_device+0x300/0x300 [ 240.544797] [<ffffffff813f5fc4>] bus_for_each_dev+0x64/0xa0 [ 240.546129] [<ffffffff813f6dee>] driver_attach+0x1e/0x20 [ 240.547065] [<ffffffff813f69e0>] bus_add_driver+0x1c0/0x2b0 [ 240.547978] [<ffffffffa0067000>] ? 0xffffffffa0066fff [ 240.548613] [<ffffffff813f7a76>] driver_register+0x76/0x140 [ 240.549885] [<ffffffff81677a1e>] ? mutex_unlock+0xe/0x10 [ 240.551095] [<ffffffffa0067000>] ? 0xffffffffa0066fff [ 240.551942] [<ffffffff813a6e10>] register_virtio_driver+0x20/0x30 [ 240.552699] [<ffffffffa0067057>] init+0x57/0x1000 [virtio_blk] [ 240.553806] [<ffffffffa0067000>] ? 0xffffffffa0066fff [ 240.554962] [<ffffffff81002040>] do_one_initcall+0x40/0x180 [ 240.556334] [<ffffffff810ce0f0>] sys_init_module+0xc0/0x230 [ 240.556996] [<ffffffff81682082>] system_call_fastpath+0x16/0x1b [ 240.557816] 3 locks held by init/1: [ 240.558681] #0: (&__lockdep_no_validate__){......}, at: [<ffffffff813f74ab>] __driver_attach+0x5b/0xb0 [ 240.561323] #1: (&__lockdep_no_validate__){......}, at: [<ffffffff813f74b9>] __driver_attach+0x69/0xb0 [ 240.562886] #2: (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811e7ffe>] __blkdev_get+0x7e/0x430
Thanks for replying, Richard! With a quick grep of some of your kernel output, it seems that Arch Linux puts all the kernel messages in /var/log/kernel.log. Anyway, here's my kernel output: Dec 12 15:21:00 localhost kernel: [ 2157.064991] INFO: task qemu-kvm:13549 blocked for more than 120 seconds. Dec 12 15:21:00 localhost kernel: [ 2157.064996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 12 15:21:00 localhost kernel: [ 2157.065000] qemu-kvm D 0000000000000002 0 13549 1 0x00000000 Dec 12 15:21:00 localhost kernel: [ 2157.065009] ffff8800ad87f978 0000000000000082 ffff8800ad87f8b8 ffffffff810fd165 Dec 12 15:21:00 localhost kernel: [ 2157.065017] ffff8801236d6bd0 ffff8800ad87ffd8 ffff8800ad87ffd8 ffff8800ad87ffd8 Dec 12 15:21:00 localhost kernel: [ 2157.065025] ffff880136fc0e60 ffff8801236d6bd0 ffff8800ad87fac0 ffff880130852468 Dec 12 15:21:00 localhost kernel: [ 2157.065033] Call Trace: Dec 12 15:21:00 localhost kernel: [ 2157.065039] [<ffffffff810fd165>] ? find_get_pages_tag+0x125/0x150 Dec 12 15:21:00 localhost kernel: [ 2157.065047] [<ffffffff81109435>] ? pagevec_lookup_tag+0x25/0x40 Dec 12 15:21:00 localhost kernel: [ 2157.065067] [<ffffffffa01426de>] ? write_cache_pages_da+0x15e/0x460 [ext4] Dec 12 15:21:00 localhost kernel: [ 2157.065075] [<ffffffff8101c139>] ? read_tsc+0x9/0x20 Dec 12 15:21:00 localhost kernel: [ 2157.065082] [<ffffffff810fd470>] ? __lock_page+0x70/0x70 Dec 12 15:21:00 localhost kernel: [ 2157.065089] [<ffffffff81406dbf>] schedule+0x3f/0x60 Dec 12 15:21:00 localhost kernel: [ 2157.065097] [<ffffffff81406e6f>] io_schedule+0x8f/0xd0 Dec 12 15:21:00 localhost kernel: [ 2157.065103] [<ffffffff810fd47e>] sleep_on_page+0xe/0x20 Dec 12 15:21:00 localhost kernel: [ 2157.065111] [<ffffffff8140761f>] __wait_on_bit+0x5f/0x90 Dec 12 15:21:00 localhost kernel: [ 2157.065118] [<ffffffff810fd668>] wait_on_page_bit+0x78/0x80 Dec 12 15:21:00 localhost kernel: [ 2157.065126] [<ffffffff81084fb0>] ? autoremove_wake_function+0x40/0x40 Dec 12 15:21:00 localhost kernel: [ 2157.065134] [<ffffffff810fd7dc>] filemap_fdatawait_range+0x10c/0x1a0 Dec 12 15:21:00 localhost kernel: [ 2157.065144] [<ffffffff810fe7c8>] filemap_write_and_wait_range+0x68/0x80 Dec 12 15:21:00 localhost kernel: [ 2157.065163] [<ffffffffa013922f>] ext4_sync_file+0x6f/0x450 [ext4] Dec 12 15:21:00 localhost kernel: [ 2157.065174] [<ffffffff8118a725>] generic_write_sync+0x55/0x70 Dec 12 15:21:00 localhost kernel: [ 2157.065182] [<ffffffff810fee56>] generic_file_aio_write+0xc6/0xe0 Dec 12 15:21:00 localhost kernel: [ 2157.065200] [<ffffffffa0138f5f>] ext4_file_write+0xbf/0x250 [ext4] Dec 12 15:21:00 localhost kernel: [ 2157.065210] [<ffffffff8115dd35>] ? rw_copy_check_uvector+0xf5/0x130 Dec 12 15:21:00 localhost kernel: [ 2157.065228] [<ffffffffa0138ea0>] ? ext4_llseek+0x170/0x170 [ext4] Dec 12 15:21:00 localhost kernel: [ 2157.065237] [<ffffffff8115db73>] do_sync_readv_writev+0xd3/0x110 Dec 12 15:21:00 localhost kernel: [ 2157.065246] [<ffffffff8115dd35>] ? rw_copy_check_uvector+0xf5/0x130 Dec 12 15:21:00 localhost kernel: [ 2157.065255] [<ffffffff811e6eac>] ? security_file_permission+0x2c/0xb0 Dec 12 15:21:00 localhost kernel: [ 2157.065264] [<ffffffff8115d271>] ? rw_verify_area+0x61/0xf0 Dec 12 15:21:00 localhost kernel: [ 2157.065272] [<ffffffff8115de3d>] do_readv_writev+0xcd/0x1d0 Dec 12 15:21:00 localhost kernel: [ 2157.065281] [<ffffffff81076d45>] ? kill_pid_info+0x55/0x80 Dec 12 15:21:00 localhost kernel: [ 2157.065289] [<ffffffff810781de>] ? sys_kill+0x9e/0x1c0 Dec 12 15:21:00 localhost kernel: [ 2157.065298] [<ffffffff8115df7c>] vfs_writev+0x3c/0x50 Dec 12 15:21:00 localhost kernel: [ 2157.065306] [<ffffffff8115e2c2>] sys_pwritev+0xa2/0xc0 Dec 12 15:21:00 localhost kernel: [ 2157.065313] [<ffffffff8140a502>] system_call_fastpath+0x16/0x1b
Bugzilla's line wrapping makes that kind of hard to read. Here's the kernel output at a pastebin: http://paste.kde.org/174764/
I guess that is the host kernel (not the appliance kernel)? That's even odder because I wouldn't expect the host kernel to have problems -- and the Fedora host kernel doesn't have anything like that.
Right, it's the host kernel. I'll try compiling the kernel with Fedora's patches and config to see if anything changes.
I'm currently running a git bisect on qemu.git. This could be related to: https://lists.gnu.org/archive/html/qemu-devel/2011-12/msg02206.html
git bisect result was bogus. However there are some more results in the thread mentioned in comment 6. No conclusions yet unfortunately, but it appears that the act of loading virtio_blk.ko in the appliance causes the hangs, and it is easily reproducible.
Created attachment 548223 [details] 0001-HACK-Add-vga-cirrus-to-work-around-virtio-hang.patch This patch is a hack, but Anthony worked out that the bug happens only when -vga cirrus is missing from the command line, so adding it cures the bug.
Reassigning bug to qemu, since it appears to be a qemu bug, not a libguestfs bug.
It's a SeaBIOS problem. Specifically, it's a problem with the binary SeaBIOS bundled with qemu 1.0. If you replace this with upstream SeaBIOS then it works. The easiest way to fix this is: git clone git://git.seabios.org/seabios.git cd seabios make clean make copy out/bios.bin over qemu's pc-bios/bios.bin recompile qemu in the normal way
I can confirm that using SeaBIOS git works.
Okay, with SeaBIOS 1.6.3, it works also. However the workaround patch causes "make check" to fail with an unmounting error, but I guess it shouldn't matter now that there's a proper solution.
Comment on attachment 548223 [details] 0001-HACK-Add-vga-cirrus-to-work-around-virtio-hang.patch Obsolete this patch since the correct solution is to update SeaBIOS.
F17 and rawhide now use 1.6.3. Is this still broken on f15 or f16?
No, it's certainly working fine in F16 and F15 is nearly obsolete. I'm going to close this bug.