Bug 234160 - Fully virtualized RHEL-5 guests hang at random times
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Xen Maintainance List
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 241714
Depends On:
Blocks:
 
Reported: 2007-03-27 15:17 UTC by Chris Lalancette
Modified: 2018-10-19 22:52 UTC
CC List: 13 users

Fixed In Version: 5.1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-08-14 01:53:49 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Chris Lalancette 2007-03-27 15:17:08 UTC
Description of problem:
I now have three reports of this, so I'm filing this BZ so that all of them can
be updated at the same time.  Mostly from the certification side, I've had
reports that a fully virtualized RHEL-5 guest can sometimes just hang.  That is,
it sits spinning, eating CPU time, but making no progress.  Trying to ping,
ssh, or attach to the serial console all fail.  These are x86_64 guests with
500MB of memory and 1 VCPU.  I've had one unconfirmed report on i386 as well.
This may be restricted to certain Intel CPUs, and further, might be restricted
to certain CPU models.  The current case can be reproduced on an in-house
machine by just running "date" in a loop for about an hour.  Changing the
console screen-blanking settings does not seem to make a difference.

I'm still not certain if this is a RHEL-5 kernel bug or a Xen problem, but I am
gathering more information.
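A minimal sketch of the "date in a loop" reproducer described above (the
iteration bound is mine so the sketch terminates; the original test simply
looped until the guest hung, roughly an hour in):

```shell
#!/bin/sh
# Hypothetical reproducer sketch: repeatedly fork/exec `date` inside the
# guest. In the failing setups the guest hangs partway through a long run.
ITERATIONS=${ITERATIONS:-5}

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    date                  # each iteration is a fresh fork/exec of /bin/date
    i=$((i + 1))
done
echo "completed $i iterations without hanging"
```

Run unbounded (e.g. `ITERATIONS=999999`) inside the guest to attempt a hang.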

Comment 1 Chris Lalancette 2007-03-27 15:26:35 UTC
Analysis so far:

I've done a few things to look at this.  First, I took a look at what qemu-dm
was doing.  It seems to be just sitting in a select loop waiting for events,
which always time out.  That seems reasonable for the driver domain, since it is
waiting on something from the guest domain itself.

I then took a core via "xm dump-core <dom>", which worked.  Looking at the
kernel with crash shows that every process in the system is in "schedule", so
there is not a lot of information there.  However, one thing I have noticed,
after taking two different dumps from two different machines, is that
"ld-linux-x86-64" always seems to be the running process when the guest hangs.
The backtrace looks like this:

PID: 3810   TASK: ffff81001045f820  CPU: 0   COMMAND: "ld-linux-x86-64"
 #0 [ffff810005b99e98] schedule at ffffffff80060ab8
 #1 [ffff810005b99ea0] sys_mprotect at ffffffff80020604
 #2 [ffff810005b99f80] tracesys at ffffffff8005b2c1
    RIP: 00005555555679f7  RSP: 00007fffc3d4c218  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: ffffffff8005b2c1  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 00000000001ff000  RDI: 0000003bf2807000
    RBP: 0000000000000002   R8: 0000000000000004   R9: 0000000000000000
    R10: 0000000000000802  R11: 0000000000000206  R12: 00007fffc3d4c588
    R13: 00007fffc3d4c290  R14: 00002aaaaaad94b0  R15: 00007fffc3d4c4c0
    ORIG_RAX: 000000000000000a  CS: 0033  SS: 002b

It seems to be in user mode (based on the CS), but nothing seems to be
happening.  The next step is to confirm that I can reproduce this by forcing
prelink/ld-linux-x86-64 to run.

Chris Lalancette

Comment 2 Chris Lalancette 2007-03-27 15:33:45 UTC
OK, just confirming: running /etc/cron.daily/prelink by hand puts the system
into the same state, very quickly.  I now have a reliable reproducer.

Chris Lalancette

Comment 3 Chris Lalancette 2007-03-28 14:58:29 UTC
Additionally, I'm confirming that this issue does *not* happen on RHEL-4 fully
virtualized guests.  I was able to run the same prelink command in a loop on
RHEL-4 with no ill effects, so it definitely seems limited to RHEL-5.
Additionally, I was able to reproduce this on both AMD and Intel dom0, so it is
not processor-specific.  I ran prelink through strace, and the last few lines
look like:

[pid  2363] fstat(4, {st_mode=S_IFREG|0755, st_size=1678480, ...}) = 0
[pid  2363] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaba000
[pid  2363] mmap(0x379ae00000, 3461272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x379ae00000
[pid  2363] mprotect(0x379af44000, 2097152, PROT_NONE) = 0
[pid  2363] mmap(0x379b144000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x144000) = 0x379b144000
[pid  2363] mmap(0x379b149000, 16536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x379b149000
[pid  2363] clo


Nothing else ever gets printed.  The mmap() calls in the strace, combined with
the sys_mprotect in the core backtrace, lead me to believe the problem lies
somewhere in that memory-mapping path.  I'm still working on getting more info.

Chris Lalancette

Comment 6 Chris Lalancette 2007-03-31 08:10:57 UTC
One additional comment: I was only able to make this happen with the "default"
certification recommendation for fully virt, which is 1 vCPU and 500 MB of
memory.  Once I kicked the memory up to 1024MB, I couldn't seem to reproduce it.

Chris Lalancette

Comment 14 Gary Case 2007-04-05 20:44:21 UTC
Intel has asked me to raise the severity of this BZ, as it's blocking all
their certification requests. It's their top RHEL5 issue at the moment.

Comment 15 Matt Chorman 2007-04-06 20:11:43 UTC
When we are installing the RHEL5 guest, we have been entering a product key that
enables the virtualization software, and in the process gives us the RHEL5 Xen
kernel.

It has been discovered that if we use the default non-Xen kernel in the guest,
ALL signs of instability seem to go away. Previously, we have been unable to
have a machine stay alive for more than a few hours. Now, using the non-Xen
kernel, all guests have passed the HTS cert suite with no signs of instability.
This is on multiple machines using Clovertown and Woodcrest processors, with
one machine allocated 512MB of virtual memory and the other 2048MB.

Are you using the Xen kernel in a guest to reproduce this? Is this a supported
certification configuration, i.e. should we be using a standard kernel inside of
the guest? This would make sense as I doubt anyone is going to have a "guest
inside a guest". 

Comment 17 Rik van Riel 2007-04-10 04:40:21 UTC
OK, exactly what are you doing to reproduce the bug?

I am running the prelink script now on my x86-64 Xen system, with kernel-xen
2.6.18-8.el5 in dom0 and kernel 2.6.18-8.el5 in my guest.  The guest has 200MB
of memory.

I have run the prelink script about 50,000 times now, and the bug has not yet
hit.  Am I doing something wrong?

Comment 20 Chris Lalancette 2007-04-16 18:02:55 UTC
By the way, this is the script I've been using to test out:

===============================================================
#!/bin/bash

echo -n "Filling filesystem..."
dd if=/dev/zero of=/tmp/zerotest.img &> /dev/null
echo "done"
echo -n "Removing temporary file..."
rm -f /tmp/zerotest.img
echo "done"
echo "Running prelink test"
while true ; do echo -n . ; /usr/sbin/prelink -av ; done
===============================================================

Note that the first dd fills the whole filesystem (and hence the buffer cache);
that may have something to do with the failure, since running prelink on its
own *sometimes* fails to reproduce the problem for me.

Chris Lalancette

Comment 21 Tom Horsley 2007-04-24 22:32:32 UTC
I can confirm this happens on Opterons, also running fully virt, 1 CPU, 500MB
memory. For me, it definitely seems to be related to disk I/O. It has crashed
downloading the openoffice-core update, it has crashed doing makewhatis. Four
other linux distributions (including rhel4 update 4) are running under the same
Dom0 (FC6 x86_64) with no problems (yet). I'm just using the standard
virt-manager created file-backed qemu virtual disk.

The Dom0 system is a fairly huge server - 4 dual core opterons, 12 GB memory,
several scsi disks, but I'm just telling each VM it has 1 cpu.

Comment 22 Tom Horsley 2007-04-25 16:00:06 UTC
Another data point, same system as comment #21. I just tried to install
a Fedora Core 6 HVM, and the install hung with much the same symptoms as
RHEL5 was having. I see the kernel on the FC6 DVD is also 2.6.18.1
(which might be relevant :-). On a subsequent attempt, I gave the VM 1500
MB of memory instead of the default 500 MB, and that install went OK.

Comment 23 Gary Case 2007-04-25 16:03:54 UTC
Tom, 

Are you using the 'kernel-xen' kernel or the 'kernel' kernel inside the FV
guest? We know that using the kernel-xen kernel will cause crashes. We didn't
explicitly mention this in our documentation (that's proposed for our next
release notes and documentation), but users should not use the kernel-xen kernel
inside FV guests. 
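A quick way to check which kernel variant a guest is running (a hypothetical
helper, not from the bug; it relies on the RHEL-5 convention that the
kernel-xen package puts "xen" in its release string, e.g. 2.6.18-8.el5xen):

```shell
#!/bin/sh
# Hypothetical helper: warn if a guest kernel release string looks like
# the kernel-xen variant, which should not be used inside FV guests.
check_guest_kernel() {
    case "$1" in
        *xen*) echo "WARNING: kernel-xen detected; use the standard kernel in FV guests" ;;
        *)     echo "OK: standard kernel" ;;
    esac
}

check_guest_kernel "2.6.18-8.el5xen"
check_guest_kernel "2.6.18-8.1.1.el5"
```

In practice you would run it inside the guest with `check_guest_kernel "$(uname -r)"`.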

Not to sidestep the issue, but have you tried using paravirt guests for your
RHEL5 virtual machines? You should get better performance from paravirt guests.

Also, have you tried using 512MB of RAM instead of 500MB? The minimum amount we
recommend for RHEL5 is actually 512MB. I'm not sure why the default in the
virtual machine setup is 500MB.

Comment 24 Tom Horsley 2007-04-25 16:17:22 UTC
Definitely not using the kernel-xen, in the rhel5 system I've got
kernel-2.6.18-8.1.1.el5 and in the fc6 system (which I haven't
updated yet) I've got kernel-2.6.18-1.2798.fc6.

I've got both using 1500MB now instead of 500, and they are working fine so
far; the rhel5 system even made it through all the cron-triggered stuff that
kept crashing it before (prelink, makewhatis, etc.).

Comment 25 Tom Horsley 2007-04-25 18:21:23 UTC
Continuing to add data points: The fc6 HVM, even with 1500MB of memory
just crashed the same way. The last thing I noticed it doing was running
beagle-build-index. I updated it to kernel-2.6.20-1.2944.fc6 and I'll
see if it crashes again.

Comment 26 Stephen Tweedie 2007-04-25 18:57:53 UTC
Thanks.  Please, though, open a separate bugzilla for this.

This BZ is very specifically for bugs with a RHEL-5 host.  We'll need to track
separately any FC-6 bugs causing guests to stall.


Comment 31 Tom Horsley 2007-04-25 21:50:30 UTC
OK, I added bug 237895 for FC6 specific updates (but I'll mention that the
2.6.20 kernel is doing much better so far :-).


Comment 33 Chris Lalancette 2007-07-02 19:42:18 UTC
I tested this out with the 5.1 Beta packages, and this bug seems to be fixed.

Chris Lalancette

Comment 34 Chris Lalancette 2007-07-05 11:51:15 UTC
*** Bug 241714 has been marked as a duplicate of this bug. ***

Comment 35 Larry Troan 2007-07-31 11:55:45 UTC
Using the RHEL5.1 beta on partner systems, Andy Prowse reports:

On Mon, 2007-07-30 at 20:52 -0400, Andy Prowse wrote:
> Hi Larry,
>
> I have had the 64-bit DomU up for 8 hours so far, and have run the hts
> test in it OK.  Looks like the issue I was experiencing with RHEL5
> x86_64 is fixed.

Comment 36 Larry Troan 2007-08-09 22:10:34 UTC
Per comment #35, can we mark this as fixed in 5.1?

Comment 37 Larry Troan 2007-08-09 22:12:13 UTC
Since the code is already in the 5.1 beta, I don't believe we need to request an
exception.

