Description of problem: I now have 3 reports of this, so I'm making this BZ so I can update all of them at the same time. Mostly from the certification side, I've had reports that a fully virtualized RHEL5 guest can sometimes just hang. That is, it sits spinning, eating CPU time, but making no progress. Trying to ping, ssh, and attach to the serial console all fail.

These are x86_64 guests with 500MB of memory and 1 VCPU. I've had one report of i386 as well, but that is unconfirmed. This may be restricted to certain Intel CPUs, and further, might be restricted to certain CPU models.

The current case can be reproduced on an in-house machine just by running "date" in a loop for about an hour. Changing the screen blanking settings does not seem to make a difference. I'm still not certain whether this is a RHEL-5 kernel bug or a Xen problem, but I am gathering more information.
Analysis so far: I've done a few things to look at this. First, I took a look at what qemu-dm was doing. It seems to be just sitting in a select loop waiting for events, which always time out. That seems reasonable for the driver domain, since it is waiting on something from the guest itself. I then took a core via "xm dump-core <dom>", which worked. Looking at the kernel with crash shows that every process in the system is in "schedule", so there is not a lot of information there. However, one thing I have noticed, after taking dumps from two different machines, is that "ld-linux-x86-64" always seems to be the running process when it dies. The backtrace looks like this:

PID: 3810   TASK: ffff81001045f820  CPU: 0   COMMAND: "ld-linux-x86-64"
 #0 [ffff810005b99e98] schedule at ffffffff80060ab8
 #1 [ffff810005b99ea0] sys_mprotect at ffffffff80020604
 #2 [ffff810005b99f80] tracesys at ffffffff8005b2c1
    RIP: 00005555555679f7  RSP: 00007fffc3d4c218  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: ffffffff8005b2c1  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 00000000001ff000  RDI: 0000003bf2807000
    RBP: 0000000000000002   R8: 0000000000000004   R9: 0000000000000000
    R10: 0000000000000802  R11: 0000000000000206  R12: 00007fffc3d4c588
    R13: 00007fffc3d4c290  R14: 00002aaaaaad94b0  R15: 00007fffc3d4c4c0
    ORIG_RAX: 000000000000000a  CS: 0033  SS: 002b

It seems to be in user mode (based on the CS), but nothing seems to be happening. Next is to confirm that I can reproduce this by forcing prelink/ld-linux-x86-64 to run.

Chris Lalancette
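The capture workflow above can be sketched as a small dom0 helper. This is only a sketch under assumptions: the helper name, the dump file path, and the debuginfo vmlinux location are illustrative; only "xm dump-core <dom>" and the use of crash come from the analysis itself.

```shell
#!/bin/bash
# Sketch: capture a core of a hung FV guest from dom0, then open it in
# crash. Assumes xm and crash are installed; paths are illustrative.

dump_and_analyze() {
    local dom="$1"
    local core="/var/tmp/${dom}.core"
    if [ -z "$dom" ]; then
        echo "usage: dump_and_analyze <domid-or-name>" >&2
        return 1
    fi
    # Dump the guest's memory image while it is hung; this works from
    # dom0 even though the guest itself is making no progress.
    xm dump-core "$dom" "$core" || return 1
    # Analyze with crash; this needs the *guest* kernel's debuginfo
    # vmlinux (version shown is the one from this BZ, as an example).
    crash /usr/lib/debug/lib/modules/2.6.18-8.el5/vmlinux "$core"
}
```

Inside crash, commands like "ps" and "bt" are what produced the output above (every task in schedule, with ld-linux-x86-64 current).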
OK, just confirming....running /etc/cron.daily/prelink by hand causes the system to get into the same situation, very quickly. I now have a good reproducer. Chris Lalancette
Additionally, I'm confirming that this issue does *not* happen on RHEL-4 fully virtualized guests. I was able to run the same prelink command in a loop on RHEL-4 with no ill effects, so it definitely seems limited to RHEL-5. I was also able to reproduce this on both AMD and Intel dom0, so it is not processor specific. I ran prelink through strace, and the last few lines look like:

[pid  2363] fstat(4, {st_mode=S_IFREG|0755, st_size=1678480, ...}) = 0
[pid  2363] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2aaaaaaba000
[pid  2363] mmap(0x379ae00000, 3461272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x379ae00000
[pid  2363] mprotect(0x379af44000, 2097152, PROT_NONE) = 0
[pid  2363] mmap(0x379b144000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x144000) = 0x379b144000
[pid  2363] mmap(0x379b149000, 16536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x379b149000
[pid  2363] clo

Nothing else ever gets printed. The mmap() calls in the strace, combined with the sys_mprotect in the core backtrace, lead me to believe the problem is somewhere in that area. I'm still working on getting more info.

Chris Lalancette
One additional comment: I was only able to make this happen with the "default" certification recommendation for fully virt, which is 1 vCPU and 500 MB of memory. Once I kicked the memory up to 1024MB, I couldn't seem to reproduce it. Chris Lalancette
Intel has asked for me to raise the severity on the BZ as it's blocking all their certification requests. It's their top RHEL5 issue at this moment.
When we are installing the RHEL5 guest, we have been entering a product key that enables the virtualization software, which in the process gives us the RHEL5 Xen kernel. It has been discovered that if we use the default non-Xen kernel in the guest, ALL signs of instability seem to go away. Previously, we had been unable to keep a machine alive for more than a few hours; now, using the non-Xen kernel, all guests have passed the HTS cert suite with no signs of instability. This is on multiple machines using Clovertown and Woodcrest processors, with one guest allocated 512MB of memory and the other 2048MB. Are you using the Xen kernel in a guest to reproduce this? Is this a supported certification configuration, i.e. should we be using a standard kernel inside of the guest? This would make sense, as I doubt anyone is going to run a "guest inside a guest".
OK, exactly what are you doing to reproduce the bug? I am running the prelink script now on my x86-64 Xen system, with kernel-xen 2.6.18-8.el5 in dom0 and kernel 2.6.18-8.el5 in my guest. The guest has 200MB of memory. I have run the prelink script about 50,000 times now, and the bug has not yet hit. Am I doing something wrong?
By the way, this is the script I've been using to test this out:

===============================================================
#!/bin/bash

echo -n "Filling filesystem..."
dd if=/dev/zero of=/tmp/zerotest.img &> /dev/null
echo "done"

echo -n "Removing temporary file..."
rm -f /tmp/zerotest.img
echo "done"

echo "Running prelink test"
while true ; do echo -n . ; /usr/sbin/prelink -av ; done
===============================================================

Note that the first dd fills the whole filesystem (and hence the buffer cache); that may have something to do with the failure, since running prelink on its own *sometimes* fails to reproduce the problem for me.

Chris Lalancette
I can confirm this happens on Opterons, also running fully virt, 1 CPU, 500MB memory. For me, it definitely seems to be related to disk I/O: it has crashed downloading the openoffice-core update, and it has crashed doing makewhatis. Four other Linux distributions (including RHEL4 Update 4) are running under the same dom0 (FC6 x86_64) with no problems (yet). I'm just using the standard virt-manager-created file-backed qemu virtual disk. The dom0 system is a fairly huge server - 4 dual-core Opterons, 12GB memory, several SCSI disks - but I'm just telling each VM it has 1 CPU.
Another data point, same system as comment #21. I just tried to install a Fedora Core 6 HVM, and the install hung with much the same symptoms as RHEL5 was having. I see the kernel on the FC6 DVD is also 2.6.18.1 (which might be relevant :-). On a subsequent attempt, I gave the VM 1500MB of memory instead of the default 500MB, and that install went OK.
Tom,

Are you using the 'kernel-xen' kernel or the plain 'kernel' package inside the FV guest? We know that using the kernel-xen kernel will cause crashes. We didn't explicitly mention this in our documentation (that's proposed for our next release notes and documentation), but users should not use the kernel-xen kernel inside FV guests.

Not to sidestep the issue, but have you tried using paravirt guests for your RHEL5 virtual machines? You should get better performance from paravirt guests. Also, have you tried using 512MB of RAM instead of 500MB? The minimum amount we recommend for RHEL5 is actually 512MB; I'm not sure why virt-manager makes the default 500MB.
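Since the thread repeatedly ties the hang to guest memory size (the 500MB virt-manager default versus the 512MB documented minimum, with 1024MB+ masking the bug), here is an illustrative fragment of an HVM guest config file. Xen parses these files as Python; the hvmloader path is the stock RHEL5 location, and the specific values are examples drawn from this thread, not a prescribed configuration.

```python
# Illustrative /etc/xen/<guest> fragment for a RHEL5 fully virtualized
# (HVM) guest. The memory bump from the 500 MB default is the point.
memory = 512                # MB; 512 is the documented RHEL5 minimum
vcpus = 1                   # the certification configuration uses 1 VCPU
builder = "hvm"             # fully virtualized guest
kernel = "/usr/lib/xen/boot/hvmloader"
```

Raising memory further (e.g. to 1024) only hides the bug per comment #15 in this thread, so it is a workaround, not a fix.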
Definitely not using the kernel-xen, in the rhel5 system I've got kernel-2.6.18-8.1.1.el5 and in the fc6 system (which I haven't updated yet) I've got kernel-2.6.18-1.2798.fc6. I've got both using 1500MB now instead of 500, and they are working fine so far, the rhel5 system even made it through all the cron triggered stuff that kept crashing it before (prelink, makewhatis, etc).
Continuing to add data points: the FC6 HVM, even with 1500MB of memory, just crashed the same way. The last thing I noticed it doing was running beagle-build-index. I updated it to kernel-2.6.20-1.2944.fc6 and I'll see if it crashes again.
Thanks. Please, though, open a separate bugzilla for this. This BZ is very specifically for bugs with a RHEL-5 host. We'll need to track separately any FC-6 bugs causing guests to stall.
OK, I added bug 237895 for FC6 specific updates (but I'll mention that the 2.6.20 kernel is doing much better so far :-).
I tested this out with the 5.1 Beta packages, and this bug seems to be fixed. Chris Lalancette
*** Bug 241714 has been marked as a duplicate of this bug. ***
Using the RHEL5.1 beta on partners, Andy Prowse reports:

On Mon, 2007-07-30 at 20:52 -0400, Andy Prowse wrote:
> Hi Larry,
>
> I have had the 64 bit DomU up for 8 hours so far, and have run the hts
> test in it OK. Looks like the issue I was experiencing with RHEL5
> x86_64 is fixed.
Per comment #35, can we mark this as fixed in 5.1?
Since the code is already in the 5.1 beta, I don't believe we need to request an exception.