Description of problem:
Installing a FreeBSD 6.2 32-bit kernel as a fullvirt guest on an x86_64 dom0 (i.e. FV 32-on-64), with heavy load on the machine, causes qemu-dm to segfault with an error such as:

qemu-dm[10011]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000041400c18 error 14

Version-Release number of selected component (if applicable):
xen-3.1.0-0.rc7.1.fc7
Linux lambda 2.6.20-2925.8.fc7xen #1 SMP Thu May 10 17:47:43 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
Other Fedora 7 components were fully up to date at the time this was posted.

How reproducible:
Intermittently on my test machine. Much easier to reproduce under heavy load, as described below.

Steps to Reproduce:
1. Download ftp://ftp.uk.freebsd.org/pub/FreeBSD/releases/i386/ISO-IMAGES/6.2/6.2-RELEASE-i386-bootonly.iso
2. Begin the install as in the attached instructions.
3. During the install, run 'make -j 4' of a kernel and at the same time boot and shut down other guests.

Actual results:
qemu-dm segfaults. The visible sign in virt-manager is that the FreeBSD console is suddenly lost with the message "The console is currently unavailable", although virt-manager continues (incorrectly, I think) to show the FreeBSD guest as running.

Expected results:
qemu-dm shouldn't segfault.

Additional info:
I was starting up and shutting down two other guests, both PV. No other FV guests were running apart from the FreeBSD installer.

It's not very clear, but this upstream bug might be a manifestation of the same thing:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=542
Created attachment 154641 [details] FreeBSD installation notes
I also reproduced this bug with just load, no other guests running.

On Dom0 (a 4 core Athlon) I am running:

cd linux-2.6.21.1; while true; do make -j 4; make clean; done

No guests are running, except a FreeBSD 6.2 FV 32-on-64 install.

After a little while the install stops, and in Dom0's dmesg:

qemu-dm[3075]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000041400c18 error 14
This bug also happens with an updated Xen hypervisor. [Background: Dan pointed out that cset 15038, http://xenbits.xensource.com/xen-3.1-testing.hg?rev/c00b2ab8af2c looked like it might have had something to do with this, but even with this change the segfault is still happening.]
Created attachment 154722 [details]
Core dump from qemu-dm

Core dump from qemu-dm. Corresponding binary:

$ rpm -qf /usr/lib64/xen/bin/qemu-dm
xen-3.1.0-0.rc7.1.fc7

Stack trace (from gdb):

Core was generated by `/usr/lib64/xen/bin/qemu-dm -d 2 -vcpus 1 -boot d -serial pty -acpi -domain-name'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000042c085 in dma_thread_func (opaque=<value optimized out>) at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/hw/ide.c:2402
#2  0x00000030310061b5 in start_thread () from /lib64/libpthread.so.0
#3  0x00000030304d043d in clone () from /lib64/libc.so.6

[Quite amazingly this 60K file expands to the full 39MB core dump with md5sum 189a904867814006d199f4d92c2f642c]
Stack trace from each thread:

(gdb) thread apply all bt

Thread 3 (process 29295):
#0  0x00000030304c9952 in select () from /lib64/libc.so.6
#1  0x0000000000409555 in main_loop_wait (timeout=10) at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/vl.c:5216
#2  0x000000000046d251 in main_loop () at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/target-i386-dm/helper2.c:628
#3  0x000000000040b206 in main (argc=19, argv=0x7fff0c4f2e08) at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/vl.c:6903
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000404809 in _start ()

Thread 2 (process 29306):
#0  0x000000303100cabb in read () from /lib64/libpthread.so.0
#1  0x000000303180197a in read_all (fd=5, data=0xc9d2f0, len=16) at /usr/include/bits/unistd.h:35
#2  0x00000030318019f2 in read_message (h=0xc9b6b0) at xs.c:768
#3  0x0000003031801b4c in read_thread (arg=<value optimized out>) at xs.c:821
#4  0x00000030310061b5 in start_thread () from /lib64/libpthread.so.0
#5  0x00000030304d043d in clone () from /lib64/libc.so.6

Thread 1 (process 29426):
#0  0x0000000000000000 in ?? ()
#1  0x000000000042c085 in dma_thread_func (opaque=<value optimized out>) at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/hw/ide.c:2402
#2  0x00000030310061b5 in start_thread () from /lib64/libpthread.so.0
#3  0x00000030304d043d in clone () from /lib64/libc.so.6
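
A note on reading the dmesg line "segfault at 0000000000000000 rip 0000000000000000 ... error 14": the kernel prints the page-fault error code in hex, so 14 is 0x14. The sketch below decodes it using the x86 page-fault error-code bits. The struct and function names (pf_info, decode_pf) are made up for illustration; only the bit meanings come from the architecture. Together with frame #0 sitting at address 0 directly under dma_thread_func, the decoded value is consistent with a call through a null function pointer, though the trace alone does not prove which pointer was null.

```c
#include <assert.h>
#include <stdio.h>

/* x86 page-fault error-code bits (named as in the Linux fault handler). */
enum {
    PF_PROT  = 1 << 0, /* 0: page not present, 1: protection violation */
    PF_WRITE = 1 << 1, /* 0: read access,      1: write access         */
    PF_USER  = 1 << 2, /* 0: kernel mode,      1: user mode            */
    PF_RSVD  = 1 << 3, /* reserved bit set in a page-table entry       */
    PF_INSTR = 1 << 4  /* fault happened on an instruction fetch       */
};

/* Hypothetical helper type: one flag per error-code bit. */
struct pf_info {
    int prot, write, user, rsvd, instr;
};

/* Hypothetical helper: split an error code into its flags. */
static struct pf_info decode_pf(unsigned long code)
{
    struct pf_info i;
    i.prot  = (code & PF_PROT)  != 0;
    i.write = (code & PF_WRITE) != 0;
    i.user  = (code & PF_USER)  != 0;
    i.rsvd  = (code & PF_RSVD)  != 0;
    i.instr = (code & PF_INSTR) != 0;
    return i;
}

/* Print a one-line summary of the decoded flags. */
static void print_pf(unsigned long code)
{
    struct pf_info i = decode_pf(code);
    printf("error %lx: %s, %s, %s mode%s\n", code,
           i.prot ? "protection violation" : "page not present",
           i.instr ? "instruction fetch" : (i.write ? "write" : "read"),
           i.user ? "user" : "kernel",
           i.rsvd ? ", reserved bit set" : "");
}
```

Calling print_pf(0x14) reports a user-mode instruction fetch from a page that is not present, which matches rip being 0.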
I compiled qemu-dm with -O0 -g and generated another core dump:

http://annexia.org/tmp/qemu-dm.bz2
http://annexia.org/tmp/core.qemu-dm.10152.1179249168.bz2
Created attachment 154763 [details] Patch to pass structure instead of pointers to the IDE DMA thread. This patch is currently looking solid. The FreeBSD install has got much further than before. If it stays up overnight I'll feed it upstream.
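
For readers without the attachment, here is a minimal sketch of the general technique the patch title names: handing a worker thread a self-contained structure by value instead of a pointer into state that the main thread may change in the meantime. This is not the attached patch and not the ioemu code. The pipe-based hand-off and every name in it (dma_request, submit_request, worker, run_demo) are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback type invoked by the worker for each request. */
typedef void (*dma_cb_t)(void *opaque, int sector);

/* Hypothetical request record: everything the worker needs, by value. */
struct dma_request {
    dma_cb_t cb;
    void *opaque;
    int sector;
};

static int request_pipe[2];

/* Copy the request into the pipe. The record is far smaller than
 * PIPE_BUF, so POSIX guarantees the write is atomic. */
static int submit_request(dma_cb_t cb, void *opaque, int sector)
{
    struct dma_request req = { cb, opaque, sector };
    ssize_t n = write(request_pipe[1], &req, sizeof req);
    return n == (ssize_t)sizeof req ? 0 : -1;
}

/* Worker: reads whole records and runs the snapshotted callback.
 * A record with cb == NULL is a stop sentinel, never called. */
static void *worker(void *unused)
{
    struct dma_request req;
    (void)unused;
    for (;;) {
        ssize_t n = read(request_pipe[0], &req, sizeof req);
        if (n != (ssize_t)sizeof req || req.cb == NULL)
            break;
        req.cb(req.opaque, req.sector);
    }
    return NULL;
}

/* Example callback: add the sector number to an int counter. */
static void add_sector(void *opaque, int sector)
{
    *(int *)opaque += sector;
}

/* Submit sectors 1, 2, 3, stop the worker, return the total. */
static int run_demo(void)
{
    pthread_t tid;
    int total = 0;
    int s;

    if (pipe(request_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    for (s = 1; s <= 3; s++)
        if (submit_request(add_sector, &total, s) != 0)
            return -1;
    submit_request(NULL, NULL, 0);   /* stop sentinel */
    pthread_join(tid, NULL);         /* total is safe to read after join */
    close(request_pipe[0]);
    close(request_pipe[1]);
    return total;
}
```

The point of the design is that the worker only ever sees the copy it read from the pipe, so a later change to the submitter's own state cannot turn the callback into a null pointer between submission and execution. Build with -pthread.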
FreeBSD install finished successfully for the first time under load. Patch sent upstream.
Created attachment 154836 [details]
Screenshot of FreeBSD install failing.

Unfortunately this patch hasn't corrected the problem. I'm still seeing FreeBSD fail during the install at the same place as before, although with a different error. This time qemu-dm isn't segfaulting, but FreeBSD itself is giving an error as shown in the screenshot. The error is:

panic: initiate_write_inodeblock_ufs2: already started
FYI regarding this bug: There was a recent exchange with someone complaining about IDE multi-threading problems. Keir has checked in a patch to 3.2/3.1 that fixes that particular problem; it may also be relevant here: http://lists.xensource.com/archives/html/xen-devel/2008-01/msg01151.html Chris Lalancette
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping