Red Hat Bugzilla – Bug 485956
Dom0 reboots whenever a paravirt guest is started.
Last modified: 2009-05-01 16:22:55 EDT
Description of problem: On the dom0 kernel 2.6.18-128.1.1.el5xen (uname -a: Linux dom0 2.6.18-128.1.1.el5xen #1 SMP Mon Jan 26 14:34:58 EST 2009 i686 i686 i386 GNU/Linux), my system (dom0) reboots whenever I start a paravirt guest. This also happened on kernel 2.6.18-128. I have several systems on which I can reproduce this.
Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-128.1.1.el5
How reproducible:
Every time I start a paravirt guest.
Steps to Reproduce:
1. Boot the system into kernel 2.6.18-128.1.1.el5xen.
2. Start a paravirt guest.
3. Watch the system reboot.
I am willing to provide anything you need for debugging this issue.
Before closing this as WORKSFORME, it would be good to know on what systems it works and what paravirt guests you are trying to start.
Starting a paravirt guest is one of the basic tests QE does before allowing a new kernel to go out and many people are using the 5.3 Xen kernel, so there must be something in your setup that makes it different from everybody else...
A good start would be a list of the systems this happens on.
There are at least two of us in GSS, plus one person in sales, seeing this issue. I cannot comment on their setups, so I will have them update this bug with their configs.
>>> Starting a paravirt guest is one of the basic tests QE does before allowing a
>>> new kernel to go out and many people are using the 5.3 Xen kernel, so there
>>> must be something in your setup that makes it different from everybody else...
I have a two-node cluster as virt guests, and dom0 is set up as a single-node cluster so I can use fence_xen. My guests are two RHEL 5 and one RHEL 4, which is not clustered. Dom0 reboots shortly after any of the guests is started. I can be reached on #gss on IRC as bennyturns if you have any specific questions or would like access to this system.
>>> A good start would be a list of the systems this happens on.
My system is a Dell Precision t5400, I will get the others for you shortly.
I have the same problem on my T61 with 5.3. Everything works as intended with the 2.6.18-92.1.22 kernel. But I will do some further testing with a vanilla install.
I have heard a couple of reports of this now, but details have been scarce. This is the first time I've heard of it working on 5.2 but not on 5.3; that is a good start. The obvious next step is to get a stack trace (via serial console) and/or a kernel core from when the machine reboots; at least then we can see what the problem might be. Another thing to try, if you have a bit of spare time, is to bisect the interim kernel builds between 5.2 and 5.3 to see if we can narrow down where this was introduced. If you don't have time for this, I can probably do it next week, assuming you give me access to one of the boxes in question.
I just got access to a serial cable today, so I will try to grab the stack trace and core sometime today. I will also try to isolate which kernel the problem first appeared in. I will get back to you with my findings.
Can I get a core dump from dom0 with kdump? I spent some time yesterday trying but I didn't have any luck. I thought that was a limitation of the Xen kernel on dom0. If it is possible could you point me in the right direction? If not I will just provide the console output.
(In reply to comment #7)
> Can I get a core dump from dom0 with kdump? I spent some time yesterday
> trying but I didn't have any luck. I thought that was a limitation of the Xen
> kernel on dom0. If it is possible could you point me in the right direction?
> If not I will just provide the console output.
Yes, kdump has worked with Xen dom0 since 5.1. Essentially, you do the same as you would for bare metal, with a few caveats. The first is that you add the "crashkernel" command-line option to the *hypervisor* command line, not the kernel's. So your grub.conf entry would look something like:
title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
kernel /xen.gz-2.6.18-128.el5 crashkernel=128M@32M
module /vmlinuz-2.6.18-128.el5xen ro root=/dev/VolGroup00/RHEL5x86_64
The second caveat is that you can't kexec *into* a Xen kernel. Therefore, you also need to have the normal kernel installed; in the i386 case, that would be the -PAE variant by default. Once you have those in place, and you've successfully started the kdump service, kdump should work.
All of that being said, console output is a good start, so I would concentrate on getting that first and fiddle with kdump afterwards.
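To make the crashkernel caveat concrete, here is a small sketch of my own (not an official kdump tool) that checks which line of a grub.conf Xen entry carries the crashkernel= option; the paths and versions mirror the example entry above.

```shell
#!/bin/sh
# Sketch: verify that a grub.conf Xen entry puts crashkernel= on the
# hypervisor line (kernel /xen.gz...) and NOT on the dom0 kernel line
# (module /vmlinuz...). Helper name and logic are illustrative only.
check_xen_crashkernel() {
    conf="$1"
    # The hypervisor line must carry the crashkernel= option.
    if grep -E '^[[:space:]]*kernel[[:space:]]+[^[:space:]]*xen' "$conf" \
            | grep -q 'crashkernel='; then
        echo "ok: crashkernel is on the hypervisor line"
    else
        echo "error: crashkernel missing from the hypervisor (xen.gz) line"
        return 1
    fi
    # The dom0 kernel (module) line must NOT carry it.
    if grep -E '^[[:space:]]*module[[:space:]]+[^[:space:]]*vmlinuz' "$conf" \
            | grep -q 'crashkernel='; then
        echo "error: crashkernel belongs on the xen.gz line, not the module line"
        return 1
    fi
}
```

Run it against /boot/grub/grub.conf after editing; a non-zero exit status means the crash kernel reservation will not take effect under Xen.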
Any luck reproducing, getting console output, or getting kdump output? If there is a machine that you can reproduce this on, it would be great if I could log into it and poke around for a while.
Here is the majority of my console output, I uploaded a file with everything:
tap tap-1-51712: 2 getting info
blktap: ring-ref 770, event-channel 9, protocol 1 (x86_32-abi)
(XEN) mm.c:625:d1 Non-privileged (1) attempt to map I/O space 000000f8
(XEN) mm.c:625:d1 Non-privileged (1) attempt to map I/O space 000000f0
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000b2da: ed=ff1c4080(0), sd=ff238080, caf=800000021
(XEN) mm.c:649:d0 Error getting mfn b2da (pfn 117b7) from L1 entry 000000000b2da063 for dom0
(XEN) mm.c:649:d0 Error getting mfn b2db (pfn 1578a) from L1 entry 000000000b2db063 for dom0
BUG: unable to handle kernel paging request at virtual address db5b2000
29ed2000 -> *pde = 00000000:56e71001
2a071000 -> *pme = 00000000:3c0d1067
000d1000 -> *pte = 00000000:00000000
Oops: 0002 [#1]
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:09:02.0/irq
Modules linked in: xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_std
EIP: 0061:[<c0417f62>] Not tainted VLI
EFLAGS: 00010206 (2.6.18-128.1.1.el5xen #1)
EIP is at xen_create_contiguous_region+0x88/0x3c3
eax: 00000000 ebx: 00004000 ecx: 00000800 edx: 00000000
esi: c082a6a0 edi: db5b2000 ebp: 00000004 esp: e9e8dcf8
ds: 007b es: 007b ss: 0069
Process /usr/share/virt (pid: 4721, ti=e9e8d000 task=c2105550 task.ti=e9e8d000)
Stack: 00000000 c067d380 00000000 00000002 db5b0000 00000000 00000002 000204d0
c0681d80 00000000 c0775120 00000004 00000000 00000000 00007ff0 e9e8dd4c
00000001 00000002 00000000 00007ff0 00000000 000204d0 00000002 c082a6a0
Code: 89 6c 24 2c 66 c7 44 24 38 f0 7f 89 4c 24 44 89 d9 c7 44 24 40 01 00 00 00 c1 e9 02 66 c7
EIP: [<c0417f62>] xen_create_contiguous_region+0x88/0x3c3 SS:ESP 0069:e9e8dcf8
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
PCI: Not using MMCONFIG.
Mounting proc filesystem
Mounting sysfs filesystem
Created attachment 334703 [details]
The console output I uploaded is from a crash during a VM install. I wiped my system and started new so I didn't have a previously installed VM to try to load.
OK, thanks. That's a big help. There are two things here:
1) Upstream has since moved away from using skbuff_ctor at all, which shows up in your stack trace. Essentially, it was a performance optimization, but with dubious performance gains, so they removed it. It might be a good idea for us to also remove that, although I would have to check with the performance team to make sure we didn't actually lose any performance.
2) That being said, I don't think that the trace leading to skbuff_ctor is actually the root cause of the problem. The good news is that the root cause here *looks* to be the same as in BZ 479754; that is, during a grant operation, it looks like there is one extra reference to a page than the hypervisor expects. The bad news is that I still do not know what is causing that.
Oh, I should also note that this is the mail thread where skbuff_ctor was discussed:
And, just for my own reference, it was actually removed in upstream xen-unstable c/s 13494.
The steps to reproduce this problem are:
1. Boot into Xen kernel.
>> I am running kernel 2.6.18-128.1.1.el5
2. Start a VM.
>> I didn't have any VMs installed because I did a fresh install for this test.
>> The system crashed during the install of a new VM. I opened virt-manager and
>> began a new VM install. I made it through the graphical installer right to
>> the point where the system starts copying files. This is when the stack trace
>> appeared and shortly after the system was rebooted.
I have also reproduced this with xm create at the command line. This happens every time I try to start a VM on kernel versions 2.6.18-128.1.1 and 2.6.18-128 on my Dell Precision T5400. I have heard from coworkers that on some systems you have to start multiple Xen guests, but on my system you only have to start one.
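For reference, reproducing from the shell instead of virt-manager only needs a domain config file and xm create. This is a hypothetical minimal paravirt guest config; the guest name, paths, and sizes below are illustrative, not taken from this bug.

```python
# Hypothetical /etc/xen/rhel5-test -- Xen domain configs use python syntax.
name       = "rhel5-test"
memory     = 512
vcpus      = 1
disk       = ['tap:aio:/var/lib/xen/images/rhel5-test.img,xvda,w']
vif        = ['bridge=xenbr0']
bootloader = "/usr/bin/pygrub"
on_reboot  = 'restart'
on_crash   = 'destroy'
```

The guest is then started with "xm create /etc/xen/rhel5-test -c", which on an affected machine should trigger the dom0 crash shortly afterwards.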
Any chance you can download the kernel from:
And try it out? It has a couple of patches backed out that I think may be responsible for the issue, so if you can do a couple of tests with that kernel, it would be really useful.
Created attachment 336212 [details]
console output from the test kernel
It crashed during the install just like the other kernel. I attached the console output for you to take a look at.
OK, thanks for the test. That seems to confirm a similar report from another tester. I'll keep on it.
I also recently encountered this dom0 rebooting problem under RHEL 5.3, kernel-xen-2.6.18-128.el5.x86_64, on a Dell PowerEdge 2950 with 8 CPUs and 32GB RAM. I also tried updating the kernel to kernel-xen-2.6.18-128.1.6.el5.x86_64 and still got the same rebooting problem.
Both dom0 and domU are using the same kernel version. I was able to reproduce the problem by starting the domU, sshing over to it, and running the commands "ls -R /*" and "du -hs /*"; after a while the whole system reboots. domU was configured with 4 CPUs and 16GB of RAM; later I tried 4 CPUs and 4GB of RAM and got the same result. I've tried this on two Dell PowerEdge 2950 servers.
Later I installed the older RHEL 5.2 kernel, kernel-xen-2.6.18-92.el5.x86_64, on both dom0 and domU and ran the same test to verify whether the problem still exists. So far it seems to be stable on that version.
The workaround I have been giving my customers is to run the older kernel which is stable.
Yeah, that is also what I did in the meantime to resolve the issue. But I still have to find a creative way to explain to the customer why the new kernel is not stable for VMs =(
Just to be clear: I can easily reproduce this on the 5.2 kernel as well. It's probably just a combination of other bugs that prevents it from happening reliably on the 5.2 kernel.
I think, at this point, I'm actually going to close this as a dup of 479754, since I'm pretty sure they are the same issue. That way we won't have missing information in one or both.
*** This bug has been marked as a duplicate of bug 479754 ***