Bug 768508
Summary: qemu 1.0 with -nodefaults: guest kernel hangs when loading virtio-blk driver in the guest

Product: [Fedora] Fedora
Component: seabios
Version: rawhide
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: Andrew Gunnerson <accounts+fedora>
Assignee: Richard W.M. Jones <rjones>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: amit.shah, bazanluis20, berrange, dougsland, dwmw2, ehabkost, itamar, jaswinder, jforbes, juzhang, knoel, mbooth, pbonzini, rjones, scottt.tw, tburke, virt-maint
Doc Type: Bug Fix
Last Closed: 2012-03-21 22:41:15 UTC
Description
Andrew Gunnerson
2011-12-16 20:40:12 UTC
This problem was also mentioned on the mailing list:
https://www.redhat.com/archives/libguestfs/2011-December/msg00132.html

For some reason the Arch kernel has disabled ordinary kernel messages, or else we are not picking them up. Nevertheless, I can reproduce this problem with Fedora, kernel 3.2.0, and upstream qemu from git. The errors reported by the Fedora kernel are below. My first impression would be that this is simply a bug in qemu.

    [ 0.853049] vda: unknown partition table
    [ 1.106371] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input1
    [ 240.494566] INFO: task init:1 blocked for more than 120 seconds.
    [ 240.496486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 240.499265] init D 0000000000000000 3280 1 0 0x00000000
    [ 240.501114] ffff88001edc5798 0000000000000046 000000000ca403db ffffffff81c18f00
    [ 240.502707] ffff88001edc8000 ffff88001edc5fd8 ffff88001edc5fd8 ffff88001edc5fd8
    [ 240.503903] ffffffff81c0d020 ffff88001edc8000 ffff88001edc5798 ffffffff810affad
    [ 240.505035] Call Trace:
    [ 240.505755] [<ffffffff810affad>] ? ktime_get_ts+0xad/0xe0
    [ 240.506682] [<ffffffff81144240>] ? __lock_page+0x70/0x70
    [ 240.507765] [<ffffffff81675c3f>] schedule+0x3f/0x60
    [ 240.508738] [<ffffffff81675cef>] io_schedule+0x8f/0xd0
    [ 240.512676] [<ffffffff8114424e>] sleep_on_page+0xe/0x20
    [ 240.513240] [<ffffffff8167657a>] __wait_on_bit_lock+0x5a/0xc0
    [ 240.513646] [<ffffffff81144237>] __lock_page+0x67/0x70
    [ 240.513646] [<ffffffff810a4ec0>] ? autoremove_wake_function+0x50/0x50
    [ 240.513646] [<ffffffff811457b7>] do_read_cache_page+0x167/0x190
    [ 240.513646] [<ffffffff81225ced>] ? find_valid_gpt+0x9d/0x610
    [ 240.513646] [<ffffffff811e73c0>] ? blkdev_write_begin+0x30/0x30
    [ 240.513646] [<ffffffff8114582c>] read_cache_page_async+0x1c/0x20
    [ 240.513646] [<ffffffff8114583e>] read_cache_page+0xe/0x20
    [ 240.513646] [<ffffffff81222410>] read_dev_sector+0x30/0x90
    [ 240.513646] [<ffffffff8122571c>] read_lba+0xbc/0x120
    [ 240.513646] [<ffffffff81225d0b>] find_valid_gpt+0xbb/0x610
    [ 240.520796] [<ffffffff8114e46a>] ? __alloc_pages_nodemask+0x14a/0xa30
    [ 240.521910] [<ffffffff81226260>] ? find_valid_gpt+0x610/0x610
    [ 240.522749] [<ffffffff812262e2>] efi_partition+0x82/0x3f0
    [ 240.523914] [<ffffffff81310264>] ? snprintf+0x34/0x40
    [ 240.524949] [<ffffffff81226260>] ? find_valid_gpt+0x610/0x610
    [ 240.526420] [<ffffffff81223278>] rescan_partitions+0x1b8/0x500
    [ 240.527573] [<ffffffff811e824e>] __blkdev_get+0x2ce/0x430
    [ 240.528722] [<ffffffff811c8e3b>] ? unlock_new_inode+0x5b/0x80
    [ 240.529928] [<ffffffff811e7268>] ? bdget+0x118/0x140
    [ 240.531364] [<ffffffff811e8403>] blkdev_get+0x53/0x350
    [ 240.532907] [<ffffffff811c8e3b>] ? unlock_new_inode+0x5b/0x80
    [ 240.533769] [<ffffffff81305f17>] ? kobject_put+0x27/0x60
    [ 240.534425] [<ffffffff812f30dd>] ? disk_get_part+0x1d/0x190
    [ 240.535311] [<ffffffff812f3722>] register_disk+0x152/0x170
    [ 240.535981] [<ffffffff812f37e6>] add_disk+0xa6/0x350
    [ 240.536625] [<ffffffffa0059fa7>] virtblk_probe+0x48f/0x534 [virtio_blk]
    [ 240.537631] [<ffffffffa0059110>] ? virtblk_config_changed+0x30/0x30 [virtio_blk]
    [ 240.538824] [<ffffffff813a8840>] ? vp_find_vqs+0xc0/0xc0
    [ 240.540082] [<ffffffff813a6bb3>] virtio_dev_probe+0xe3/0x140
    [ 240.540626] [<ffffffff813f71ec>] driver_probe_device+0x9c/0x300
    [ 240.541617] [<ffffffff813f74fb>] __driver_attach+0xab/0xb0
    [ 240.543005] [<ffffffff813f7450>] ? driver_probe_device+0x300/0x300
    [ 240.544110] [<ffffffff813f7450>] ? driver_probe_device+0x300/0x300
    [ 240.544797] [<ffffffff813f5fc4>] bus_for_each_dev+0x64/0xa0
    [ 240.546129] [<ffffffff813f6dee>] driver_attach+0x1e/0x20
    [ 240.547065] [<ffffffff813f69e0>] bus_add_driver+0x1c0/0x2b0
    [ 240.547978] [<ffffffffa0067000>] ? 0xffffffffa0066fff
    [ 240.548613] [<ffffffff813f7a76>] driver_register+0x76/0x140
    [ 240.549885] [<ffffffff81677a1e>] ? mutex_unlock+0xe/0x10
    [ 240.551095] [<ffffffffa0067000>] ? 0xffffffffa0066fff
    [ 240.551942] [<ffffffff813a6e10>] register_virtio_driver+0x20/0x30
    [ 240.552699] [<ffffffffa0067057>] init+0x57/0x1000 [virtio_blk]
    [ 240.553806] [<ffffffffa0067000>] ? 0xffffffffa0066fff
    [ 240.554962] [<ffffffff81002040>] do_one_initcall+0x40/0x180
    [ 240.556334] [<ffffffff810ce0f0>] sys_init_module+0xc0/0x230
    [ 240.556996] [<ffffffff81682082>] system_call_fastpath+0x16/0x1b
    [ 240.557816] 3 locks held by init/1:
    [ 240.558681] #0: (&__lockdep_no_validate__){......}, at: [<ffffffff813f74ab>] __driver_attach+0x5b/0xb0
    [ 240.561323] #1: (&__lockdep_no_validate__){......}, at: [<ffffffff813f74b9>] __driver_attach+0x69/0xb0
    [ 240.562886] #2: (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811e7ffe>] __blkdev_get+0x7e/0x430

Thanks for replying, Richard! With a quick grep of some of your kernel output, it seems that Arch Linux puts all the kernel messages in /var/log/kernel.log. Anyway, here's my kernel output:

    Dec 12 15:21:00 localhost kernel: [ 2157.064991] INFO: task qemu-kvm:13549 blocked for more than 120 seconds.
    Dec 12 15:21:00 localhost kernel: [ 2157.064996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Dec 12 15:21:00 localhost kernel: [ 2157.065000] qemu-kvm D 0000000000000002 0 13549 1 0x00000000
    Dec 12 15:21:00 localhost kernel: [ 2157.065009] ffff8800ad87f978 0000000000000082 ffff8800ad87f8b8 ffffffff810fd165
    Dec 12 15:21:00 localhost kernel: [ 2157.065017] ffff8801236d6bd0 ffff8800ad87ffd8 ffff8800ad87ffd8 ffff8800ad87ffd8
    Dec 12 15:21:00 localhost kernel: [ 2157.065025] ffff880136fc0e60 ffff8801236d6bd0 ffff8800ad87fac0 ffff880130852468
    Dec 12 15:21:00 localhost kernel: [ 2157.065033] Call Trace:
    Dec 12 15:21:00 localhost kernel: [ 2157.065039] [<ffffffff810fd165>] ? find_get_pages_tag+0x125/0x150
    Dec 12 15:21:00 localhost kernel: [ 2157.065047] [<ffffffff81109435>] ? pagevec_lookup_tag+0x25/0x40
    Dec 12 15:21:00 localhost kernel: [ 2157.065067] [<ffffffffa01426de>] ? write_cache_pages_da+0x15e/0x460 [ext4]
    Dec 12 15:21:00 localhost kernel: [ 2157.065075] [<ffffffff8101c139>] ? read_tsc+0x9/0x20
    Dec 12 15:21:00 localhost kernel: [ 2157.065082] [<ffffffff810fd470>] ? __lock_page+0x70/0x70
    Dec 12 15:21:00 localhost kernel: [ 2157.065089] [<ffffffff81406dbf>] schedule+0x3f/0x60
    Dec 12 15:21:00 localhost kernel: [ 2157.065097] [<ffffffff81406e6f>] io_schedule+0x8f/0xd0
    Dec 12 15:21:00 localhost kernel: [ 2157.065103] [<ffffffff810fd47e>] sleep_on_page+0xe/0x20
    Dec 12 15:21:00 localhost kernel: [ 2157.065111] [<ffffffff8140761f>] __wait_on_bit+0x5f/0x90
    Dec 12 15:21:00 localhost kernel: [ 2157.065118] [<ffffffff810fd668>] wait_on_page_bit+0x78/0x80
    Dec 12 15:21:00 localhost kernel: [ 2157.065126] [<ffffffff81084fb0>] ? autoremove_wake_function+0x40/0x40
    Dec 12 15:21:00 localhost kernel: [ 2157.065134] [<ffffffff810fd7dc>] filemap_fdatawait_range+0x10c/0x1a0
    Dec 12 15:21:00 localhost kernel: [ 2157.065144] [<ffffffff810fe7c8>] filemap_write_and_wait_range+0x68/0x80
    Dec 12 15:21:00 localhost kernel: [ 2157.065163] [<ffffffffa013922f>] ext4_sync_file+0x6f/0x450 [ext4]
    Dec 12 15:21:00 localhost kernel: [ 2157.065174] [<ffffffff8118a725>] generic_write_sync+0x55/0x70
    Dec 12 15:21:00 localhost kernel: [ 2157.065182] [<ffffffff810fee56>] generic_file_aio_write+0xc6/0xe0
    Dec 12 15:21:00 localhost kernel: [ 2157.065200] [<ffffffffa0138f5f>] ext4_file_write+0xbf/0x250 [ext4]
    Dec 12 15:21:00 localhost kernel: [ 2157.065210] [<ffffffff8115dd35>] ? rw_copy_check_uvector+0xf5/0x130
    Dec 12 15:21:00 localhost kernel: [ 2157.065228] [<ffffffffa0138ea0>] ? ext4_llseek+0x170/0x170 [ext4]
    Dec 12 15:21:00 localhost kernel: [ 2157.065237] [<ffffffff8115db73>] do_sync_readv_writev+0xd3/0x110
    Dec 12 15:21:00 localhost kernel: [ 2157.065246] [<ffffffff8115dd35>] ? rw_copy_check_uvector+0xf5/0x130
    Dec 12 15:21:00 localhost kernel: [ 2157.065255] [<ffffffff811e6eac>] ? security_file_permission+0x2c/0xb0
    Dec 12 15:21:00 localhost kernel: [ 2157.065264] [<ffffffff8115d271>] ? rw_verify_area+0x61/0xf0
    Dec 12 15:21:00 localhost kernel: [ 2157.065272] [<ffffffff8115de3d>] do_readv_writev+0xcd/0x1d0
    Dec 12 15:21:00 localhost kernel: [ 2157.065281] [<ffffffff81076d45>] ? kill_pid_info+0x55/0x80
    Dec 12 15:21:00 localhost kernel: [ 2157.065289] [<ffffffff810781de>] ? sys_kill+0x9e/0x1c0
    Dec 12 15:21:00 localhost kernel: [ 2157.065298] [<ffffffff8115df7c>] vfs_writev+0x3c/0x50
    Dec 12 15:21:00 localhost kernel: [ 2157.065306] [<ffffffff8115e2c2>] sys_pwritev+0xa2/0xc0
    Dec 12 15:21:00 localhost kernel: [ 2157.065313] [<ffffffff8140a502>] system_call_fastpath+0x16/0x1b

Bugzilla's line wrapping makes that kind of hard to read. Here's the kernel output at a pastebin: http://paste.kde.org/174764/

I guess that is the host kernel (not the appliance kernel)? That's even odder because I wouldn't expect the host kernel to have problems -- and the Fedora host kernel doesn't have anything like that.

Right, it's the host kernel. I'll try compiling the kernel with Fedora's patches and config to see if anything changes.

I'm currently running a git bisect on qemu.git. This could be related to:
https://lists.gnu.org/archive/html/qemu-devel/2011-12/msg02206.html

The git bisect result was bogus. However, there are some more results in the thread mentioned in comment 6. No conclusions yet unfortunately, but it appears that the act of loading virtio_blk.ko in the appliance causes the hangs, and it is easily reproducible.

Created attachment 548223 [details]
0001-HACK-Add-vga-cirrus-to-work-around-virtio-hang.patch
This patch is a hack, but Anthony worked out that the bug
happens only when -vga cirrus is missing from the command
line, so adding it cures the bug.
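For reference, the workaround amounts to adding an emulated VGA adapter back to an otherwise bare -nodefaults invocation. A minimal sketch of the kind of command line involved -- the image path, memory size, and surrounding options here are illustrative assumptions, not the exact flags libguestfs passes:

```shell
# Hypothetical reproducer sketch; test.img and -m 512 are made up.
# With -nodefaults, qemu 1.0 creates no VGA device, and the guest
# hangs while the virtio-blk driver probes the disk:
qemu-kvm -nodefaults -m 512 \
    -drive file=test.img,if=virtio \
    -serial stdio

# The hack patch simply puts the VGA device back, which masks the bug:
qemu-kvm -nodefaults -m 512 -vga cirrus \
    -drive file=test.img,if=virtio \
    -serial stdio
```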
Reassigning bug to qemu, since it appears to be a qemu bug, not a libguestfs bug.

It's a SeaBIOS problem. Specifically, it's a problem with the SeaBIOS binary bundled with qemu 1.0. If you replace it with upstream SeaBIOS then it works. The easiest way to fix this is:

1. git clone git://git.seabios.org/seabios.git
2. cd seabios
3. make clean
4. make
5. copy out/bios.bin over qemu's pc-bios/bios.bin
6. recompile qemu in the normal way

I can confirm that using SeaBIOS git works.

Okay, with SeaBIOS 1.6.3 it works also. However, the workaround patch causes "make check" to fail with an unmounting error, but I guess it shouldn't matter now that there's a proper solution.

Comment on attachment 548223 [details]
0001-HACK-Add-vga-cirrus-to-work-around-virtio-hang.patch
Obsolete this patch since the correct solution is to update SeaBIOS.
F17 and rawhide now use 1.6.3. Is this still broken on F15 or F16?

No, it's certainly working fine in F16, and F15 is nearly obsolete. I'm going to close this bug.
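The fix described in the comments can be collected into a small script. This is only a sketch of the steps as stated above, not a tested recipe: the QEMU_SRC location is a placeholder, and current SeaBIOS or qemu trees may lay files out differently.

```shell
#!/bin/sh
# Sketch of the SeaBIOS replacement steps from this bug report.
# QEMU_SRC is an assumed placeholder for the qemu 1.0 source tree.
set -e
QEMU_SRC=${QEMU_SRC:?set QEMU_SRC to your qemu source directory}

# 1. Build upstream SeaBIOS from git.
git clone git://git.seabios.org/seabios.git
cd seabios
make clean
make

# 2. Overwrite the BIOS image bundled with the qemu sources.
cp out/bios.bin "$QEMU_SRC/pc-bios/bios.bin"

# 3. Rebuild qemu in the normal way.
cd "$QEMU_SRC"
./configure
make
```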