Bug 431698 - qemu-dm segfault with Windows 32 bit FV on heavily loaded machine
Keywords:
Status: CLOSED DUPLICATE of bug 250988
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Chris Lalancette
QA Contact: Virtualization Bugs
Depends On: 250988
Blocks: 448899
Reported: 2008-02-06 16:04 UTC by Adam Stokes
Modified: 2009-12-14 21:03 UTC (History)

Doc Type: Bug Fix
Last Closed: 2008-09-09 10:11:57 UTC


Attachments
core from segfault (215.60 KB, application/x-gzip)
2008-02-06 16:04 UTC, Adam Stokes

Description Adam Stokes 2008-02-06 16:04:04 UTC
+++ This bug was initially created as a clone of Bug #240009 +++

Description of problem:

Install of FreeBSD 6.2 32 bit kernel in fullvirt on an x86_64 dom0 (i.e. FV
32-on-64), with heavy load on the machine, causes qemu-dm to segfault with an
error such as:

qemu-dm[10011]: segfault at 0000000000000000 rip 0000000000000000 rsp
0000000041400c18 error 14
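
For reference, the trailing "error 14" can be decoded as an x86 page-fault error code. The sketch below assumes the kernel prints that number in hexadecimal (so the value is 0x14), which is how x86_64 kernels of this era format the message as far as I know; the helper name describe_pf is made up for this example. Under that assumption the code means a user-mode instruction fetch from a page that is not present, which fits a jump to address 0.

```c
#include <assert.h>
#include <stdio.h>

/* Architectural x86 page-fault error code bits. */
#define PF_PROT  (1UL << 0)  /* set: protection violation; clear: page not present */
#define PF_WRITE (1UL << 1)  /* set: write access; clear: read access */
#define PF_USER  (1UL << 2)  /* set: fault taken in user mode */
#define PF_RSVD  (1UL << 3)  /* set: reserved bit set in a page-table entry */
#define PF_INSTR (1UL << 4)  /* set: fault on an instruction fetch */

/* Hypothetical helper: write a readable description of `code` into `buf`.
 * Returns the number of characters snprintf would have written. */
static int describe_pf(unsigned long code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s, %s, %s mode%s",
                    (code & PF_PROT) ? "protection violation" : "page not present",
                    (code & PF_INSTR) ? "instruction fetch"
                                      : (code & PF_WRITE) ? "write" : "read",
                    (code & PF_USER) ? "user" : "kernel",
                    (code & PF_RSVD) ? ", reserved bit set" : "");
}
```

Calling describe_pf(0x14UL, buf, sizeof buf) fills buf with "page not present, instruction fetch, user mode".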

Version-Release number of selected component (if applicable):

xen-3.1.0-0.rc7.1.fc7
Linux lambda 2.6.20-2925.8.fc7xen #1 SMP Thu May 10 17:47:43 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

Other Fedora 7 components are fully up to date as of the moment this was posted.

How reproducible:

Intermittently on my test machine.  Much easier to reproduce under heavy load,
as described below.

Steps to Reproduce:
1. Download
ftp://ftp.uk.freebsd.org/pub/FreeBSD/releases/i386/ISO-IMAGES/6.2/6.2-RELEASE-i386-bootonly.iso
2. Begin the install as in the attached instructions.
3. During the install, run 'make -j 4' of a kernel and at the same time boot and
shut down other guests.
  
Actual results:

qemu-dm segfaults (the visual indication of this in virt-manager is that the
FreeBSD console is suddenly lost with the message "The console is currently
unavailable", although virt-manager continues -- incorrectly, I think -- to
display that the FreeBSD guest is running).

Expected results:

qemu-dm shouldn't segfault.

Additional info:

I was starting up and shutting down two other guests which were both PV.  No
other FV guests were running apart from the FreeBSD installer.

It's not very clear, but this upstream bug might be a manifestation of the same
thing:

http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=542

-- Additional comment from rjones on 2007-05-14 07:29 EST --
Created an attachment (id=154641)
FreeBSD installation notes


-- Additional comment from rjones on 2007-05-14 08:42 EST --
I also reproduced this bug with just load, no other guests running.

On Dom0 (a 4 core Athlon) I am running:
  cd linux-2.6.21.1; while true; do make -j 4; make clean; done

No guests are running, except a FreeBSD 6.2 FV 32-on-64 install.  After a little
while the install stops, and in Dom0's dmesg:

qemu-dm[3075]: segfault at 0000000000000000 rip 0000000000000000 rsp
0000000041400c18 error 14


-- Additional comment from rjones on 2007-05-14 10:30 EST --
This bug also happens with an updated Xen hypervisor.  [Background: Dan pointed
out that cset 15038,
http://xenbits.xensource.com/xen-3.1-testing.hg?rev/c00b2ab8af2c looked like it
might have had something to do with this, but even with this change the segfault
is still happening.]

-- Additional comment from rjones on 2007-05-15 08:10 EST --
Created an attachment (id=154722)
Core dump from qemu-dm

Core dump from qemu-dm.

Corresponding binary:
$ rpm -qf /usr/lib64/xen/bin/qemu-dm
xen-3.1.0-0.rc7.1.fc7

Stack trace (from gdb):

Core was generated by `/usr/lib64/xen/bin/qemu-dm -d 2 -vcpus 1 -boot d -serial
pty -acpi -domain-name'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000042c085 in dma_thread_func (opaque=<value optimized out>)
    at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/hw/ide.c:2402
#2  0x00000030310061b5 in start_thread () from /lib64/libpthread.so.0
#3  0x00000030304d043d in clone () from /lib64/libc.so.6

[Quite amazingly this 60K file expands to the full 39MB core dump with md5sum
189a904867814006d199f4d92c2f642c]

-- Additional comment from rjones on 2007-05-15 08:22 EST --
Stack trace from each thread:

(gdb) thread apply all bt

Thread 3 (process 29295):
#0  0x00000030304c9952 in select () from /lib64/libc.so.6
#1  0x0000000000409555 in main_loop_wait (timeout=10)
    at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/vl.c:5216
#2  0x000000000046d251 in main_loop ()
    at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/target-i386-dm/helper2.c:628
#3  0x000000000040b206 in main (argc=19, argv=0x7fff0c4f2e08)
    at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/vl.c:6903
#4  0x000000303041da54 in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000404809 in _start ()

Thread 2 (process 29306):
#0  0x000000303100cabb in read () from /lib64/libpthread.so.0
#1  0x000000303180197a in read_all (fd=5, data=0xc9d2f0, len=16)
    at /usr/include/bits/unistd.h:35
#2  0x00000030318019f2 in read_message (h=0xc9b6b0) at xs.c:768
#3  0x0000003031801b4c in read_thread (arg=<value optimized out>) at xs.c:821
#4  0x00000030310061b5 in start_thread () from /lib64/libpthread.so.0
#5  0x00000030304d043d in clone () from /lib64/libc.so.6

Thread 1 (process 29426):
#0  0x0000000000000000 in ?? ()
#1  0x000000000042c085 in dma_thread_func (opaque=<value optimized out>)
    at /usr/src/debug/xen-3.1.0-testing.hg-rc7/tools/ioemu/hw/ide.c:2402
#2  0x00000030310061b5 in start_thread () from /lib64/libpthread.so.0
#3  0x00000030304d043d in clone () from /lib64/libc.so.6


-- Additional comment from rjones on 2007-05-15 13:49 EST --
I compiled qemu-dm with -O0 -g and generated another core dump:

http://annexia.org/tmp/qemu-dm.bz2
http://annexia.org/tmp/core.qemu-dm.10152.1179249168.bz2


-- Additional comment from rjones on 2007-05-15 15:05 EST --
Created an attachment (id=154763)
Patch to pass structure instead of pointers to the IDE DMA thread.

This patch is currently looking solid.  The FreeBSD install has got much
further than before.  If it stays up overnight I'll feed it upstream.
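
For readers unfamiliar with the approach, here is a minimal sketch of the general idea of handing a request to a consumer by value instead of by pointer. It is not the attached patch and not the ioemu code: every name in it (dma_request, shared_func, queue_request, drain_requests, run_demo) is hypothetical. The motivation is an assumption drawn from the backtrace, where frame #0 at address 0 called from dma_thread_func looks like a call through a null function pointer. If the producer copies the callback into the request when it queues it, a later reset of the shared state cannot change what the consumer calls.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

typedef int (*dma_func_t)(int arg);   /* hypothetical callback type */

struct dma_request {                  /* hypothetical request, passed by value */
    dma_func_t func;                  /* callback captured at queue time */
    int arg;                          /* argument for the callback */
};

static int req_pipe[2];               /* [0] read end, [1] write end */
static dma_func_t shared_func;        /* stands in for state the producer may reset */

static int double_it(int x) { return 2 * x; }

/* Producer side: snapshot the shared callback into the request itself.
 * A pipe write of at most PIPE_BUF bytes is atomic, so whole records arrive. */
static int queue_request(int arg)
{
    struct dma_request req = { shared_func, arg };
    return write(req_pipe[1], &req, sizeof req) == (ssize_t)sizeof req ? 0 : -1;
}

/* Consumer side: has the signature of a pthread start routine, but only
 * uses the copied request and never reads shared_func. */
static void *drain_requests(void *opaque)
{
    int *total = opaque;
    struct dma_request req;

    assert(total != NULL);
    while (read(req_pipe[0], &req, sizeof req) == (ssize_t)sizeof req) {
        if (req.func != NULL)         /* defensive check before the indirect call */
            *total += req.func(req.arg);
    }
    return NULL;
}

/* Queue two requests, reset the shared state, then drain the pipe. */
static int run_demo(void)
{
    int total = 0;

    if (pipe(req_pipe) != 0)
        return -1;
    shared_func = double_it;
    if (queue_request(1) != 0 || queue_request(2) != 0)
        return -1;
    shared_func = NULL;               /* reset happens before the consumer runs */
    close(req_pipe[1]);               /* EOF ends the drain loop */
    drain_requests(&total);
    close(req_pipe[0]);
    return total;
}
```

The consumer is called directly here to keep the example deterministic; in a threaded program it would be the thread's start routine. Even though shared_func is cleared before the requests are processed, both callbacks still run, because each request carries its own copy.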

-- Additional comment from rjones on 2007-05-15 17:28 EST --
FreeBSD install finished successfully for the first time under load.  Patch sent
upstream.

-- Additional comment from rjones on 2007-05-16 11:34 EST --
Created an attachment (id=154836)
Screenshot of FreeBSD install failing.

Unfortunately this patch hasn't corrected the problem.  I'm still seeing
FreeBSD failing during the install at the same place as before, although with a
different error.  This time qemu-dm isn't segfaulting, but FreeBSD itself is
giving an error as shown in the screenshot.

The error is:

panic: initiate_write_inodeblock_ufs2: already started

-- Additional comment from clalance on 2008-01-30 12:58 EST --
FYI regarding this bug:

There was a recent exchange with someone complaining about IDE multi-threading
problems.  Keir has checked in a patch to 3.2/3.1 that fixes that particular
problem; it may also be relevant here:

http://lists.xensource.com/archives/html/xen-devel/2008-01/msg01151.html

Chris Lalancette
----


This happens on xen 3.0.3-41 as well

Comment 1 Adam Stokes 2008-02-06 16:04:04 UTC
Created attachment 294119
core from segfault

Comment 2 RHEL Program Management 2008-02-06 16:17:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Daniel Berrangé 2008-03-04 21:22:06 UTC
Comment #10 in bug 240009 refers to this mailing list thread

  http://lists.xensource.com/archives/html/xen-devel/2008-01/msg01147.html

About QEMU crashes under high load. This resulted in the following patch:

  http://xenbits.xen.org/xen-3.1-testing.hg?rev/df56245d48f5

which fixes one known race condition. I can't say for certain whether this bug
reporter is hitting this particular race condition, but it is certainly a likely
candidate and a race which should be fixed.  In the absence of any other
explanation for the crashes, I'd recommend we apply the upstream patch I
reference above and see if QEMU remains crash-free under load.


Comment 10 RHEL Program Management 2008-03-11 19:36:42 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 18 RHEL Program Management 2008-06-02 20:19:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 19 Daniel Berrangé 2008-07-11 13:14:06 UTC
This bug can probably be closed as a DUP of 

https://bugzilla.redhat.com/show_bug.cgi?id=250988

Comment 20 Chris Lalancette 2008-07-11 13:41:36 UTC
Yes, most likely.  However, I would like to leave it open just until I get
another report of whether this patch works for a customer.  Assuming that is
successful, we can then close this out as a dup.

Chris Lalancette

Comment 21 Scott Dodson 2008-08-27 21:44:30 UTC
In the past I've hit this bug quite frequently and I've been able to reproduce it simply by restarting multiple Windows DomU at the same time on a machine with many other DomU (24 of them). After installing xen-3.0.3-68.el5 I'm no longer able to reproduce the problem.

Comment 22 Daniel Berrangé 2008-09-09 10:11:57 UTC
On the basis of the confirmation in comment #21, I'm closing this as a dup of bug 250988.

If someone manages to get this problem to recur with xen >= 3.0.3-68.el5, then open a new bug with a reproducible test case.

*** This bug has been marked as a duplicate of bug 250988 ***

