Bug 479754 - RH5.3 x64 RC2 reboots while installing a virtual machine
Summary: RH5.3 x64 RC2 reboots while installing a virtual machine
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Chris Lalancette
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 479343 483279 485956 496700 496741 503139 505352 506859 645043
Depends On:
Blocks: 460955 497812 707285
 
Reported: 2009-01-12 21:20 UTC by Jim Evans
Modified: 2018-11-14 21:00 UTC
CC List: 37 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:04:35 UTC
Target Upstream Version:
Embargoed:


Attachments
sosreport (685.77 KB, application/octet-stream)
2009-03-04 14:51 UTC, Jim Evans
today's messages file showing memory squeeze message (191.77 KB, text/plain)
2009-03-13 16:18 UTC, Jim Evans
the whole xend.log file (900.55 KB, text/plain)
2009-03-13 16:19 UTC, Jim Evans
today's xend.log (40.87 KB, text/plain)
2009-03-17 17:44 UTC, Jim Evans
serial output of crash (105.58 KB, text/plain)
2009-03-23 13:29 UTC, Jim Evans
latest console capture for the latest crash (55.81 KB, text/plain)
2009-03-25 19:31 UTC, Jim Evans
Serial console output for wgold's comment. (3.24 KB, text/plain)
2009-06-18 10:09 UTC, Werner Gold
crash in ssh session (9.85 KB, text/plain)
2009-07-09 14:40 UTC, Tamas Vincze


Links
System ID: Red Hat Product Errata RHSA-2009:1243
Private: 0  Priority: normal  Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Jim Evans 2009-01-12 21:20:25 UTC
Description of problem:
System rebooted itself after appearing hung. This was during a VM install while another VM was rebooting.


Version-Release number of selected component (if applicable):
RHEL 5.3 RC2 x86_64

How reproducible:
Not sure but have the vmcore file

Steps to Reproduce:
1. Install RC2 for Dom0
2. Install RC1-para for VM1
3. Start installing RC1-para for VM2
4. While VM2 is installing, reboot VM1
  
Actual results:
System mouse froze, graphics colors got all funky, and after a couple of minutes the system rebooted itself.

Expected results:
The system to stay up

Additional info:
This was all on a local disk so no boot from SAN was involved.
HP BL480c with 12GB of ram
one disk

I have the vmcore file, which I will attach after logging this issue. It is
12GB, so it is taking a while to FTP to my laptop. It may be quicker to have me drive it up to Westford and have someone copy it off my USB disk than to add it as an attachment, but I'll try the attachment route. If there is an RH FTP site you prefer I upload it to, let me know. 

Is there any other file RH would like off this system?

Jim in Marlborough

Comment 1 Jim Evans 2009-01-13 14:39:31 UTC
This priority needs to be set to Urgent but I can't seem to change it.

Comment 2 Jim Evans 2009-01-13 17:02:35 UTC
"The file you are trying to attach is 560265 kilobytes (KB) in size. Non-patch attachments cannot be more than 20000 KB."

Even after compressing the file it is too big to upload. I had to upload it to the dropbox FTP site. The vmcore file is 479754.gz

Comment 3 Chris Lalancette 2009-01-16 09:32:24 UTC
*** Bug 479343 has been marked as a duplicate of this bug. ***

Comment 4 Chris Lalancette 2009-01-16 17:39:17 UTC
Mostly notes, but:

I started to take a look at this core.  Unfortunately, in this particular case the core isn't hugely helpful, but it's at least somewhat helpful.  I extracted the hypervisor logs from it, and I see this:

(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000005fb08: ed=ffff8300ceefa080(0), sd=ffff8300cefc6080, caf=80000002, taf=00000000e8000001
(XEN)
(XEN) mm.c:649:d0 Error getting mfn 5fb08 (pfn 40dd) from L1 entry 001000005fb08067 for dom0
(XEN) mm.c:649:d0 Error getting mfn 5fb09 (pfn 40dc) from L1 entry 001000005fb09067 for dom0

So at a high level, what happened is that some domain tried to do a steal_page (despite the message saying "gnttab_transfer", there are actually two ways to get here), but that failed.  Later on, when dom0 goes to use that page, it crashes because the page isn't mapped into its address space.

Now, let's take a look at why that is.  Given the above information, we can see that caf=0x80000002, which is "x" in the source code.  And there is this check:

        x = y;
        if (unlikely((x & (PGC_count_mask|PGC_allocated)) !=
                     (1 | PGC_allocated)) || unlikely(_nd != _d)) { 
            MEM_LOG("gnttab_transfer: Bad page %p: ed=%p(%u), sd=%p,"
                    " caf=%08x, taf=%" PRtype_info "\n", 
                    (void *) page_to_mfn(page),
                    d, d->domain_id, unpickle_domptr(_nd), x, 
                    page->u.inuse.type_info);

PGC_count_mask|PGC_allocated == 0x9fffffff, and 1|PGC_allocated == 0x80000001.  ANDing x with 0x9fffffff gives back 0x80000002, which does not equal 1|PGC_allocated, which means we hit the check and it's all downhill from there.
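
To make that arithmetic concrete, here is a minimal standalone sketch of the same check. The PGC_allocated and PGC_count_mask values are inferred from the 0x9fffffff mask quoted above, not copied from the Xen headers:

    #include <stdio.h>

    /* Flag values inferred from the 0x9fffffff mask in the analysis above
     * (assumed for illustration, not copied from the Xen headers). */
    #define PGC_allocated  0x80000000u  /* page is allocated to a domain */
    #define PGC_count_mask 0x1fffffffu  /* low bits hold the reference count */

    int main(void)
    {
        unsigned int caf = 0x80000002u;  /* count-and-flags from the crash log */

        /* The gnttab_transfer check: a transferable page must be allocated
         * and hold exactly one reference. */
        unsigned int masked   = caf & (PGC_count_mask | PGC_allocated);
        unsigned int expected = 1u | PGC_allocated;

        printf("masked=%08x expected=%08x -> %s\n", masked, expected,
               masked != expected ? "Bad page" : "OK");
        /* Prints: masked=80000002 expected=80000001 -> Bad page
         * (the reference count is 2, so something holds an extra mapping). */
        return 0;
    }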

Now, the real problem here is that x == 0x80000002, when this code is clearly expecting it to be 0x80000001.  That, in turn, means that the page count is too high, meaning that someone mapped the page twice (or something like that).

I'll have to continue to dig further, to see what is going on.  One quick thing that might be useful, though, is to try to find out when this started occurring.  We know RC1 and RC2 had it.  Did the 5.3 Beta kernel have it?  Did 5.2 have it?  At least a rough estimate like this can help us narrow it down somewhat.

Chris Lalancette

Comment 5 Jim Evans 2009-01-16 17:57:27 UTC
Chris,

I didn't test xen with 5.2 so I can't help there and I didn't see it with any of the snapshots. RC1 was my first time seeing it.

jim

Comment 6 Hector Arteaga 2009-01-16 21:37:20 UTC
I encountered a very similar issue with RH 5.3 snapshot 5 (see bugzilla 476294) but was not able to reproduce it in snapshot 6.  I also tested 5.2 and don't recall seeing this issue, though I'm not sure I was ever rebooting a VM while installing another.

Comment 10 Chris Lalancette 2009-02-18 14:07:58 UTC
Is there any chance I can get remote access to one of the blades that this is happening on?  I can't reproduce it internally, and the core isn't giving me a whole bunch more information, so I think I just need to spend some time with one of these blades and see what I can do.

Chris Lalancette

Comment 11 Jim Evans 2009-02-18 14:52:25 UTC
Chris,

I don't know of a mechanism to give you remote access; however, if you are in Westford and can come down to Marlborough, I can give you hands-on access to the machine. You can set up any logging you want. Let me know.

jim

Comment 12 Ronald Pacheco 2009-02-19 01:44:21 UTC
Jim,

Chris is in the UK.

Comment 14 Jim Evans 2009-02-23 19:07:38 UTC
I have discussed this issue with my counterparts and off the top we can't think of a way to give someone outside of HP access to our internal network.

Does Red Hat have some access I don't know about?

Other than that, Chris may have to send me the commands he'd like me to try,
or someone from Westford could come to my lab and have me shadow them for the day while s/he communicates with Chris.

Comment 15 Chris Lalancette 2009-02-24 14:23:44 UTC
Jim,
     OK, let's just start with something simple.  Let's start with RHEL 5.2, and see if the problem happens there.  If it doesn't, then I can feed you some kernel packages to try to narrow down which patch between 5.2 and 5.3 started causing the problem.  So I just need you to install the RHEL-5.2 kernel (which should be 2.6.18-92.el5xen), and re-run the test.  Let me know if you don't have access to that; I can give it to you otherwise.

Thanks,
Chris Lalancette

Comment 16 Chris Lalancette 2009-02-24 14:27:31 UTC
Oh, I forgot to add...

I see that the boxes you've had problems on here have a bit of memory.  As another test, can you try the RHEL 5.3 kernel, but pass "mem=4G" on the *hypervisor* command-line, and then run the test again?  That might show us if it is a larger memory related problem.

Chris Lalancette

Comment 17 Jeff Hillman 2009-02-25 16:18:38 UTC
I am also seeing this problem on 5.3 x86_64 on HP BL460c G1 blades with 16 GB of RAM.

It mainly happens when installing a Xen guest, which is also x86_64, although some guest installs work just fine.

The customer installed guests using 5.2 on the same hardware before my arrival onsite and had no issues.

More to come as I gather info.

Comment 18 Jim Evans 2009-02-27 16:03:11 UTC
Oh yeah. Setting mem=4G made a WORLD of difference.

I just installed three VMs at the same time (started one after another), then went through rebooting them as they became available. The last test was to simultaneously boot all three VMs. That worked great.

I'd say load up a server with beaucoup of memory and you'll reproduce it.

These were all para-virt RH5.3 x86_64 VMs.

jim

Comment 19 Jeff Hillman 2009-02-27 17:42:01 UTC
Interestingly, on different blades in the same chassis, each with 16 GB of RAM and the same proc setup, we were able to install two VMs, each on a different blade, without problems.

About to install 4 more on machines with only 8 GB of RAM.

Comment 20 Chris Lalancette 2009-03-04 09:53:22 UTC
(In reply to comment #18)
> Oh yeah. Setting mem=4G made a WORLD of difference.
> 
> I just installed three VMs at the same time in a serial fashion, then went
> through rebooting them as they became available. Last test was to
> simultaneously boot all three VMs. That worked great.
> 
> I'd say load up a server with beaucoup of memory and you'll reproduce it.

Still no luck reproducing here.  We'll have to continue on your systems.  OK, so at this point, we know that 5.2 doesn't have the problem, 5.3 does, and restricting the hypervisor to < 4GB makes the problem go away.  So, there are a few things I would like to see next:

1.  Please get an sosreport right after booting the 5.3 dom0.  This will just give me some additional information (like the output from xm dmesg), so I don't have to keep bothering you for it.

2.  I'm not sure if these machines have EPT or not, but that was one of the big features that went in for 5.3.  Run: xm dmesg | grep -i "Hardware Assisted Paging".  If that comes up with something, then you are using EPT.  In that case, try to boot again, but pass "hap=0" on the hypervisor command-line.  The previous command should then say "Hardware Assisted Paging detected, but disabled".  Then try the test again to see if it makes a difference.

3.  One thing I've found helpful in the past was to narrow down the problem between the hypervisor and the kernel.  In this case, I really am thinking this is a problem in the hypervisor, but it would be good to get confirmation.  Try to boot with a 5.3 hypervisor, but a 5.2 dom0 kernel, run the test, and see if it makes a difference.  Then swap, and boot with a 5.2 hypervisor and a 5.3 dom0.  This will at least narrow down where the regression is.

I'll continue to try to reproduce here, but it hasn't been looking good so far.

Thanks,
Chris Lalancette

Comment 21 Jim Evans 2009-03-04 14:51:23 UTC
Created attachment 334003 [details]
sosreport

The md5sum is: 6cdaa016b7f4aab7f88d0f18e46e7dda

Comment 22 Jim Evans 2009-03-04 14:53:44 UTC
1. sosreport is now there

2. There is no hardware listed in xm dmesg

3. Will try to get to this soon

jim

Comment 23 Chris Lalancette 2009-03-04 15:57:07 UTC
(In reply to comment #22)
> 1. sosreport is now there
> 
> 2. There is no hardware listed in xm dmesg
> 
> 3. Will try to get to this soon

Great, thanks.  One interesting thing I noticed in your xm dmesg is that VMX is disabled.  That's not a widely tested configuration, since people generally want to do full-virt as well as paravirt.  Maybe while you are at it you can add:

4.  Enable VMX (in the BIOS), power-off and power-on the machine, and then try the test again to see if that makes a difference.

Chris Lalancette

Comment 24 Jim Evans 2009-03-04 18:05:07 UTC
Chris,

I'm embarrassed to say I forgot to enable that as part of the initial system setup. It is part of my checklist and somehow was overlooked. I checked all of my other Xen servers and it was enabled, so this one was missed.

That said, after correcting that I was able to reboot all three of my existing VMs while installing an additional three. I thought we had it until the system crashed while installing the 6th VM, just after the anaconda line.

I'm uploading the vmcore file now to the dropbox incoming directory and will post the file name when it has completed.

jim

Comment 25 Jim Evans 2009-03-04 19:06:54 UTC
479754-vmcore-during-vm-install.tgz  has been uploaded

Comment 26 Chris Lalancette 2009-03-12 21:20:52 UTC
Jim,
     Can you try a test kernel?  I have a couple of suspect patches that went in between 5.2 and 5.3, and I'd like a test with one of them reverted.  If you grab the kernel at:

http://people.redhat.com/clalance/bz479754

and install it, can you run a test to see if it makes a difference?  FYI, I've finally been able to reproduce it here, but on the machine I have locally it takes quite a bit of time to reproduce.  So if you can do some of these tests quickly, that would help a lot.

Thanks,
Chris Lalancette

Comment 27 Jim Evans 2009-03-13 16:16:27 UTC
Was anything gleaned from the March 4th vmcore? 

As for the new debug kernel you provided: the system crashed while running it, but it never produced a vmcore file. What was in messages was a bunch of 

kernel: xen_net: Memory squeeze in netback driver.

right before the system rebooted. I'm going to attach the messages file for today and the whole xend.log file. Maybe it will help you determine what happened with the Mar 4th panic as well.
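
For context on that message, the following is a schematic paraphrase (assumed shape, not the actual RHEL 5 netback source; alloc_mfn is a stand-in name for the driver's frame-reservation helper) of the pattern that produces it. Netback needs a fresh machine frame for each packet it hands to a guest, and it logs the squeeze warning and drops the packet when that reservation fails:

    /* Schematic sketch only -- paraphrased, not the RHEL 5 driver code. */
    unsigned long new_mfn = alloc_mfn();    /* try to reserve a machine frame */
    if (new_mfn == 0) {
            /* dom0 has no spare pages for the rx path: drop the packet
             * and log a rate-limited warning. */
            if (net_ratelimit())
                    printk(KERN_WARNING
                           "xen_net: Memory squeeze in netback driver.\n");
    }

In other words, the message itself is a symptom of memory pressure in dom0, which fits the ballooning experiments that come later in this bug.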

I had 2 new VMs installing rhel 5.3 para x86_64 while trying to reboot 4 older rhel 5.3 para VMs. 

10:25 rebooted system after installing new kernel
10:49 system rebooted but no vmcore generated

jim

Comment 28 Jim Evans 2009-03-13 16:18:23 UTC
Created attachment 335112 [details]
today's messages file showing memory squeeze message

Comment 29 Jim Evans 2009-03-13 16:19:22 UTC
Created attachment 335114 [details]
the whole xend.log file

Comment 30 Chris Lalancette 2009-03-17 00:55:14 UTC
Jim,

OK.  So based on your test, we know the problem isn't in the "[xen] avoid dom0 hang when tearing down domains" patch; that's actually good news, since that patch was one of the trickier ones.  I'll try again to reproduce by doing what you did: keep rebooting a few domains while installing two others.

In terms of the second core, I haven't yet had time to look at it, but I will soon.  I have some ideas for gathering more information with another debug kernel, but that may take a little time to code up.  I'll keep you posted.  Have you attempted with a 5.2 hypervisor and a 5.3 kernel yet?  I would like to try to pin down which of the two it is, and that would help.

Thanks,
Chris Lalancette

Comment 31 Jim Evans 2009-03-17 15:15:27 UTC
Chris,

I have not had much time on this system due to other priorities but I do try to get to it when I have a chance. I'll try some more installs and reboots the way it is.

I have not had a chance to install a 5.2 hypervisor and try that.

jim

Comment 32 Jim Evans 2009-03-17 17:41:27 UTC
Chris,

This reboot was even easier to trigger with your kernel, and I got the same message too:

"Memory squeeze in netback driver"

I deleted vm5 and vm6 from the previous reboot so I could have LUNs to install to. I brought up vm1 through vm4 and got them ready for reboot but didn't start the process yet. I started installing vm5, got to selecting the LUN to install to, and the system went down.

I'll attach today's portion of the xend.log.

jim

Comment 33 Jim Evans 2009-03-17 17:44:35 UTC
Created attachment 335572 [details]
today's xend.log

Comment 34 Justin M. Forbes 2009-03-18 14:49:05 UTC
The March 4th vmcore is basically the same as the previous vmcore.

(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000010306d: ed=ffff8300ceef8080(0), sd=ffff8300cefba080, caf=80000003, taf=00000000e8000001
(XEN) mm.c:649:d0 Error getting mfn 10306d (pfn b3b6) from L1 entry 001000010306d067 for dom0
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000010fc85: ed=ffff8300ceef8080(0), sd=ffff8300ceef8080, caf=80000003, taf=00000000e8000002
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000005e4e6: ed=ffff8300ceef8080(0), sd=ffff8300cfdce080, caf=80000002, taf=00000000e8000001
(XEN) mm.c:649:d0 Error getting mfn 5e4e6 (pfn 1faff) from L1 entry 001000005e4e6067 for dom0

Comment 35 Chris Lalancette 2009-03-20 11:29:54 UTC
(In reply to comment #32)
> Chris,
> 
> This reboot was even easier with your kernel. I got the same message too
> 
> "Memory squeeze in netback driver"
> 
> I deleted vm5 and vm6 from the previous reboot so I can have luns to install
> to. I brought up vm1 through vm4 and got them ready for reboot but didn't start
> the process yet. I started installing vm5 and got to selecting the lun to
> install to and it went down.
> 
> I'll attach today's portion of the xend.log.

Just for future reference, the xend.log isn't really very interesting for this problem; you have sufficiently explained what it is you are doing, and I can't reproduce reliably, so we'll just keep going on with what you are doing.

What *is* interesting is the serial console log; if you can get that from every time we try and fail, that would be useful.  It's mostly to ensure that the crash signature remains constant; I want to make sure that patches that I am adding/removing from these test builds don't cause other problems.

In any case, I've now built another kernel with a different set of patches backed out, namely, the 2MB page table stuff that went into 5.3.  This is another patch set that touches page tables in a number of places, so it is another good candidate for a "forgotten" put_page.  It's available here (you want the -135 version):

http://people.redhat.com/clalance/bz479754

If you could give that a whirl (along with doing the 5.2 HV with 5.3 kernel test), that would be great.

Thanks,
Chris Lalancette

Comment 36 Joseph Szczypek 2009-03-20 15:49:21 UTC
I've also encountered what looks like the same problem as described in comments #4 and #34. Adding myself so I can track this.

Comment 37 Jim Evans 2009-03-20 18:53:50 UTC
Chris,

I had a chance to install the -135 kernel, and with a quick check the system rebooted by itself again. Now to work on setting up the serial console for you.

I brought up the initial four VMs and had them sitting there.

I started installing two VMs. On one I got to the initial screen; on the other I got to where it showed "Anaconda", and that seemed to freeze. When I see that, I know the system is going down in about 5-10 seconds.

Same "netback driver" comments in the messages file.

I want to get you a serial console output of this before installing the 5.2 HV.

jim

Comment 38 Jim Evans 2009-03-23 13:29:09 UTC
Created attachment 336288 [details]
serial output of crash

Comment 39 Jim Evans 2009-03-23 13:39:02 UTC
Chris,

I've attached the serial console output of this crash. I captured a boot sequence, the crash, then the following boot sequence.


"xen_net: Memory squeeze in netback driver.
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000000000123f81: ed=ffff8300cefd4080(0), sd=ffff8300cefd4080, caf=80000002, taf=00000000e8000001
(XEN) 
printk: 3 messages suppressed.
xen_net: Memory squeeze in netback driver.
printk: 4 messages suppressed.
xen_net: Memory squeeze in netback driver.
(XEN) mm.c:1808:d0 Bad type (saw 00000000e8000001 != exp 0000000080000000) for mfn 123f81 (pfn 1fff81)
(XEN) mm.c:2098:d0 Error while pinning mfn 123f81
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
invalid opcode: 0000 [1] SMP (XEN) Domain 0 crashed: rebooting machine in 5 seconds."

I'll wait to hear from you before installing anything else.

Method of causing the crash:
1. Bring up the 4 existing VMs and open a terminal window in each.
2. Start setting up for a new install to VM5 but don't hit the finish button
3. Start setting up for a new install to VM6 and watch the screen pause before the crash.

I don't know if I need to open terminal windows or if I need all 4 VMs booted, but since I can crash this machine this way every time, I keep doing it to be consistent and to ensure I actually get the crash quickly.

jim

Comment 40 Chris Lalancette 2009-03-23 13:51:19 UTC
OK, good.  Your serial output shows the same crash.  That's all I want to do here; make sure that when we crash, we are getting the same crash, and not running into something more.  Also, whatever your test method is that reliably reproduces it, I wouldn't change it :).

The good news is that I've created a script that can now semi-reliably reproduce it locally.  It takes between 2 - 4 hours for it to do so, though.  The bad news is that given the continued failures, I don't have good ideas at the moment.

So, going forward, I would like to see 2 tests:
1)  Start up the 5.3 Xen kernel as normal.  However, before you start the test that usually causes the crash, run:

# xm mem-set 0 3000

which will balloon dom0 down to 3000 MB before the test.  I'm thinking that the "Memory squeeze" messages may be related to the crash, and hand-ballooning dom0 might actually avoid the problem.  This should just be a quick test.

2)  On your machine, please run the test where you boot with a 5.2 HV (2.6.18-92.el5xen) and a 5.3 kernel (2.6.18-128.el5xen), and let's see what happens.  Again, that will let us concentrate on the HV vs. the kernel proper.

Thanks,
Chris Lalancette

Comment 41 Jim Evans 2009-03-23 14:53:24 UTC

Comment 42 Jim Evans 2009-03-23 15:01:20 UTC
OOPS.

Ignore #41. I hit refresh and it must have re-written my previous one.

Well, I did the "xm mem-set 0 3000" and the behavior was indeed different.

I was able to install both VM5 and VM6 together along with rebooting VMs 1 through 3. VM1's screen said it crashed, but it was flickering between "run" and "pause". While it was in the "run" screen I was eventually able to log in and try a shutdown.

It was still ugly so I rebooted the whole server to try again.

jim

Comment 43 Chris Lalancette 2009-03-23 21:38:11 UTC
OK.  Well, if only guests are crashing, that's actually a vast improvement; that's probably some bug with the tools.  The important bit is that dom0 is not crashing.  It's good that "xm mem-set" improved the situation for you; that agrees with my testing results locally.

Interestingly, I ran the test with a 5.2 HV and a 5.3 dom0 kernel, and I still got the crash (it looked ever so slightly different, but I think it is the same root cause).  I'm trying again now with a 5.3 HV and a 5.2 dom0 kernel to see if that is any better.  I'll keep running the test with different combinations now that I have a reproducer.

However, I would still like to see the results of 5.2 HV and 5.3 dom0 on your hardware, and vice-versa.  At least it's an additional data point against my testing, which is always good to have.

Chris Lalancette

Comment 44 Jim Evans 2009-03-24 15:32:13 UTC
Chris,

I re-ran the test from yesterday without VM1. That one becomes a zombie domain so I'm going to blow that away and rebuild it. 

I was able to install VM5 and VM6 while rebooting VMs 2, 3 and 4 using xm mem-set.

Now for the RH5.2 HV. I went looking in my RH52/Server directory for a xen RPM and only see xen-libs RPMs. Same thing for RH53/Server.

Do the xen RPMs get loaded directly from Red Hat when using the RH number during the installation, are they hidden elsewhere, or is it not a xen-#.#...rpm file?

What will be the procedure to install the RH5.2 HV on RH5.3?

jim

Comment 45 Chris Lalancette 2009-03-24 18:48:54 UTC
Ah, yeah, I should have mentioned.  So, actually, the hypervisor is shipped as part of the kernel-xen package, so you don't want to mess with the xen package (that is only the userland tools).  Basically, to run a 5.2 hypervisor with a 5.3 dom0, I install:

kernel-xen-2.6.18-92.el5xen
kernel-xen-2.6.18-128.el5xen

And then I add a grub.conf entry that looks like:

title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
	root (hd0,2)
	kernel /xen.gz-2.6.18-92.el5
	module /vmlinuz-2.6.18-128.el5xen ro root=/dev/HostGroup/RHEL5x86_64
	module /initrd-2.6.18-128.el5xen.img

You can, of course, then switch it to a 5.3 hypervisor with a 5.2 dom0 by having an entry like:

title Red Hat Enterprise Linux Server (2.6.18-92.el5xen)
	root (hd0,2)
	kernel /xen.gz-2.6.18-128.el5
	module /vmlinuz-2.6.18-92.el5xen ro root=/dev/HostGroup/RHEL5x86_64
	module /initrd-2.6.18-92.el5xen.img

Chris Lalancette

Comment 46 Jim Evans 2009-03-24 21:12:45 UTC
Chris,

I updated the grub.conf with the appropriate files and was able to get in the RH53 kernel/RH52 HV test this afternoon. I'll do the other tomorrow.

I was able to install both VM5 and VM6 ok but my VM3 is now a zombie too. First thing tomorrow is to reinstall VM1 and VM3 before proceeding. VM2 and VM4 were rebooting during the installations.

I did see two "Memory squeeze in Netback Driver" messages on the console but at least the server stayed up.

jim

Comment 47 Jim Evans 2009-03-25 15:24:53 UTC
Chris,

I reinstalled VM1 and VM3.

I booted the RH52 kernel and 53 hypervisor.

I booted VMs 1 through 4

I set up VM5. Before it was ready to start the installation, I set up VM6 so that I could start configuring it as soon as VM5 began.

As soon as I hit "finish" on VM5 to begin installing, I clicked "finish" for VM6 so that I could start setting it up. At that point VM5 was stuck at "starting the install process" and VM6 showed "that directory could not be mounted from the server". The console spewed the memory squeeze message for about 10 minutes before I killed and then deleted both VM5 and VM6. I was expecting the server to go down but it didn't. I also ftp'd into that directory, so the install server was fine.

I then started the same process but waited until the files started loading on VM5 before beginning to set up for VM6 and that went fine. No memory squeeze issues at any time. I rebooted VMs 1-4 a couple of times so there was a lot of activity going on. Installations finished without any extra messages on the console.

jim

Comment 48 Jim Evans 2009-03-25 16:57:20 UTC
Chris,

I tried to duplicate the issue I logged above and I can't. I restarted the test and both VM5 and 6 are installing like they should. No console messages.

So that means I can't crash the system with RH53 kernel + RH52 HV, nor with RH52 kernel + RH53 HV. Only RH53 kernel and HV. Let's hope I haven't lost the touch.

jim

Comment 49 Jim Evans 2009-03-25 19:29:07 UTC
Yes, I can still crash it. I tried it again with your -135 test kernel.

Same scenario:
1. Bring up 4 VMs and open a terminal window.
2. Get VM5 all queued up
3. Get VM6 ready to install
4. Start VM5 and as soon as that begins start the VM6 process.
5. VM6 gets to the anaconda line and the system goes down shortly afterwards.


printk: 4 messages suppressed.
xen_net: Memory squeeze in netback driver.
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000000000105556: ed=ffff8300cefd4080(0), sd=ffff8300cf122080, caf=80000002, taf=00000000e8000001
(XEN) 
(XEN) mm.c:649:d0 Error getting mfn 105556 (pfn 1326a) from L1 entry 0010000105556067 for dom0
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 00000000000c1f00: ed=ffff8300cefd4080(0), sd=ffff8300cf122080, caf=80000004, taf=00000000e8000001Unable to handle kernel paging request
 at ffff8801fe351f58 RIP: 
(XEN) 
 [<ffffffff80261f19>] __memcpy+0x15/0xac
(XEN) PGD 2df4067 mm.c:649:d0 Error getting mfn c1f00 (pfn 1e0f7) from L1 entry 00100000c1f00067 for dom0PUD 3bfc067 
PMD 3dee067 PTE 0(XEN) 
mm.c:2768:d0 gnttab_transfer: Bad page 00000000000bda28: ed=ffff8300cefd4080(0), sd=ffff8300cf122080, caf=80000002, taf=00000000e8000001Oops: 0002 [1] 
(XEN) 
SMP (XEN) mm.c:649:d0 Error getting mfn bda28 (pfn 125cf) from L1 entry 00100000bda28067 for dom0

last sysfs file: /class/fc_host/host5/speed
CPU 1 (XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000000000111182: ed=ffff8300cefd4080(0), sd=ffff8300cefb6080, caf=80000002, taf=00000000e8000001

(XEN) Modules linked in:
 nfs lockd(XEN) mm.c:649:d0 Error getting mfn 111182 (pfn 163) from L1 entry 0010000111182067 for dom0
 fscache nfs_acl xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev tg3 i5000_edac libphy hpilo bnx2 edac_mc serio_raw sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 10422, comm: python Not tainted 2.6.18-135.el5bz479754xen #1
RIP: e030:[<ffffffff80261f19>]  [<ffffffff80261f19>] __memcpy+0x15/0xac
RSP: e02b:ffff8802c947fde8  EFLAGS: 00010203
RAX: ffff8801fe351f58 RBX: ffff8801fe350000 RCX: 0000000000000001
RDX: 00000000000000a8 RSI: ffff8802c947ff58 RDI: ffff8801fe351f58
RBP: ffff8802b20a7040 R08: 000000001e972a50 R09: ffff8802c947ff58
R10: 0000000000010800 R11: 0000000000001000 R12: ffff8801fe351f58
R13: ffff8802c48a17e0 R14: 0000000041387250 R15: 00000000003d0f00
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Comment 50 Jim Evans 2009-03-25 19:31:03 UTC
Created attachment 336695 [details]
latest console capture for the latest crash


I started this capture after the four VMs were booted and before creating VM5

Comment 51 Chris Lalancette 2009-03-25 21:22:55 UTC
OK, thanks for the testing.  Actually, we seem to have reversed positions; I can definitely reproduce the error with a 5.2 HV and a 5.3 kernel, and I can also reproduce the problem with a 5.3 HV and a 5.2 kernel.  That seems to say we've had this problem for a while.  In any case, I've been adding some debugging to try to narrow this down.  I'll let you know what I find.

Thanks again,
Chris Lalancette

Comment 52 Chris Lalancette 2009-04-10 07:32:10 UTC
*** Bug 485956 has been marked as a duplicate of this bug. ***

Comment 53 Chris Lalancette 2009-04-10 07:33:58 UTC
*** Bug 483279 has been marked as a duplicate of this bug. ***

Comment 55 Chris Lalancette 2009-04-15 09:58:38 UTC
I've uploaded a new test kernel that has a possible fix here:

http://people.redhat.com/clalance/bz479754

Can people who are affected by this bug please download the appropriate kernel from there, and see if it makes a difference in their testing?  Besides testing to see if it solves the crash, I would also appreciate any performance data people are able to share with this kernel in place.  There is a portion of the patch that could, in theory, cause a performance regression.  In practice, I don't expect it to change very much, but it would be good to confirm that.

Thanks,
Chris Lalancette

Comment 56 Jim Evans 2009-04-15 17:04:15 UTC
Chris,

Much better. I tried twice and both VM5 and 6 install with VMs1-4 rebooting.

On occasion I did see what I thought were performance-related issues. One was when I clicked "next" during the install of VM5 and it took at least a minute where it should have taken seconds. VM6, being set up at the same time, didn't see that delay. Another was after both 5 and 6 were installed but before the first reboot: the mouse seemed elusive. It was behind a window and it took a bunch of coaxing to get it into view. I couldn't see it, so I couldn't tell if it was moving or not.

jim

Comment 57 Martin Jürgens 2009-04-21 17:18:57 UTC
Can anyone check if bug 496741 is a dup of this bug? I am not sure.

Comment 58 Martin Jürgens 2009-04-21 22:26:33 UTC
I've had a similar problem (bug 496741), and with 2.6.18-138.el5bz479754xen installed I do not run into the issue anymore.

Comment 59 Chris Lalancette 2009-04-22 09:35:17 UTC
*** Bug 496700 has been marked as a duplicate of this bug. ***

Comment 60 Chris Lalancette 2009-04-22 10:12:16 UTC
*** Bug 496741 has been marked as a duplicate of this bug. ***

Comment 61 Bob Kozdemba 2009-04-23 21:05:29 UTC
The test kernel seems to work fine on my T500. As an added bonus, my wireless LED now illuminates. You guys rock!

Bob

Comment 64 Chris Lalancette 2009-05-06 08:42:06 UTC
*** Bug 454285 has been marked as a duplicate of this bug. ***

Comment 65 Martin Jürgens 2009-05-11 13:51:23 UTC
Chris, I have been using 2.6.18-141.el5bz479754perfxen without problems. Today I tried 2.6.18-144.el5bz479754perf, and with that kernel my virtual machines do not get any network access; the error messages are:

printk: 7 messages suppressed.
netfront: rx->offset: 0, size: 4294967295
printk: 7 messages suppressed.
netfront: rx->offset: 0, size: 4294967295
printk: 9 messages suppressed.
netfront: rx->offset: 0, size: 4294967295
printk: 7 messages suppressed.
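
An aside on that size value: 4294967295 is 0xffffffff, i.e. -1 stored in an unsigned 32-bit field, which suggests an error status (rather than a real length) being printed through netfront's unsigned size field. A minimal C check of the representation:

    #include <stdio.h>

    int main(void)
    {
        int status = -1;                           /* an error status code */
        unsigned int size = (unsigned int)status;  /* viewed as an unsigned size */
        printf("size: %u\n", size);                /* prints: size: 4294967295 */
        return 0;
    }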

Comment 66 Don Zickus 2009-05-12 17:40:40 UTC
in kernel-2.6.18-146.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 68 Martin Jürgens 2009-05-12 18:35:28 UTC
kernel-2.6.18-146.el5 is fine for me. I had to reboot my domUs in order to get rid of the error mentioned in comment #65.

Comment 69 Alexander Pierce 2009-06-01 20:47:14 UTC
The test kernel (2.6.18-144.el5bz479754perf) didn't help, except that it pointed out the Intel HDA driver on my T61 as being the issue. As soon as I blacklisted the intel_hda kernel modules and limited dom0 to 768 MB of RAM, I have been rock solid. The memory limiting didn't help until I removed the intel_hda driver.

Comment 70 Chris Lalancette 2009-06-02 06:59:53 UTC
(In reply to comment #69)
> the test kernel (2.6.18-144.el5bz479754perf) didn't help, except that it
> pointed out the intel hda driver on my T61 as being the issue. As soon as I
> blacklisted the intel_hda kernel modules, and limited dom0 to 768MB of RAM, I
> have been rock solid. The memory limiting didn't help until I removed the
> intel_hda driver.  

Hm, then this sounds like a different bug.  Can you set up a serial console to collect a stack trace, or set up kdump to collect a core?  That way we can tell if it has the same trace (which I suspect it will not), and then we can open a different bug about it.

Chris Lalancette

Comment 71 Alexander Pierce 2009-06-02 17:59:28 UTC
Which error?  I could not even boot dom0 with the -141 test kernel, which is how I saw the error.

I can try one of the test kernels and just copy some stack info off the screen, since I don't know if I can get kdump without dom0 fully booting, or I can set up kdump and see what I get with -128.1.10 and the Intel driver not blacklisted.

Comment 72 Chris Lalancette 2009-06-03 06:48:12 UTC
(In reply to comment #71)
> Which error?  I could not even boot dom0 with the -141 test kernel, which is
> how I saw the error.

I'm looking to get a stack trace of what causes this to fail.  Since you can't even boot, it sounds like a very different bug.

> 
> I can try one of the test kernels, and just copy some stack info off the screen

The test kernels aren't worth it.  They are much older now than anything in the 5.4 kernel.  What I would like to see is a boot with the latest 5.4 kernel (it should be -151 at this point), and get the stack trace.

> since I don't know if I can get kdump without dom0 fully booting, or I can
> setup kdump and see what i get with -128.1.10 and the intel driver not
> blacklisted.  

You are right, kdump won't work without fully booting first.  But now I'm confused; what are you talking about with the -128.1.10 kernel?  I thought this only started happening with the 5.4 preview kernels?

In any case, please open a new BZ with all of your information at this point.  It's almost certainly a different bug, and we are just cluttering this one up.

Chris Lalancette

Comment 73 Don Zickus 2009-06-09 18:33:20 UTC
Setting to POST to pick up a small revert in this patch.

Comment 74 Don Zickus 2009-06-11 15:36:36 UTC
in kernel-2.6.18-153.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 76 Chris Lalancette 2009-06-17 07:08:54 UTC
*** Bug 503139 has been marked as a duplicate of this bug. ***

Comment 77 Bob Kozdemba 2009-06-17 21:27:35 UTC
The 152 x86_64 kernel seems to behave on my T500 laptop but the 153 kernel hangs at xend during boot up. Let me know if and how I can provide more details.

Comment 78 Chris Lalancette 2009-06-18 10:00:34 UTC
(In reply to comment #77)
> The 152 x86_64 kernel seems to behave on my T500 laptop but the 153 kernel
> hangs at xend during boot up. Let me know if and how I can provide more
> details.  

Please open a new Bugzilla for that, since we definitely want to track it, but it's almost certainly a different bug.

Chris Lalancette

Comment 79 Werner Gold 2009-06-18 10:07:03 UTC
I have the same or a similar bug on 32-bit. I'll check kernel 153 from Don and let you know if that helps.

I'll add the serial console output to the attachments.

Comment 80 Werner Gold 2009-06-18 10:09:24 UTC
Created attachment 348405 [details]
Serial console output for wgold's comment.

Comment 84 Bob Kozdemba 2009-06-22 17:06:58 UTC
> The 152 x86_64 kernel seems to behave on my T500 laptop but the 153 kernel
> hangs at xend during boot up. Let me know if and how I can provide more
> details.  

Both the 154 and 155 work with my T105 (x86_64).

Bob

Comment 85 Martin Jenner 2009-06-22 20:40:49 UTC
(Jim Evans - HP)

As the original reporter, would you be able to test the -155.el5 kernel located at

  http://people.redhat.com/dzickus/el5/

and verify that you no longer see the initial problem on your hardware?

Comment 86 Jim Evans 2009-06-23 15:31:27 UTC
I am happy to report in the 86th entry of this bugzilla that I do not see the issue after installing the -155 kernel.

Simultaneously I had four VMs installing while two existing ones were rebooting. Earlier I could panic the machine in a couple of minutes.

Hats off to the engineers at Red Hat! Thank you.

jim

Comment 87 Evan McNabb 2009-06-25 17:17:58 UTC
Thanks everyone for testing.

Comment 89 Chris Lalancette 2009-07-02 12:34:43 UTC
*** Bug 505352 has been marked as a duplicate of this bug. ***

Comment 90 Tamas Vincze 2009-07-09 14:40:55 UTC
Created attachment 351083 [details]
crash in ssh session

I have seen several crashes since upgrading to 5.3; see the attached log of the latest one.
There was one domU running, to which I was logged in via ssh from dom0.
I started Firefox in the domU (tunneling X through ssh), started browsing redhat.com,
and a few minutes later it crashed.
The domU is 64-bit and uses the bridged network setup.

Comment 91 Chris Lalancette 2009-07-09 15:27:03 UTC
Yep, that's exactly this bug.  Should be fixed in 5.4 now.

Chris Lalancette

Comment 92 Werner Gold 2009-07-17 17:41:06 UTC
Is this patch also part of the latest official kernel update, or do we get a patched version with the latest security update applied from your repository?

Comment 93 Bill Burns 2009-07-17 19:08:31 UTC
The patch has not yet been released in an update kernel for RHEL 5.3. It is currently targeted for the next update around the end of the month.

Comment 97 Chris Lalancette 2009-07-28 14:11:03 UTC
*** Bug 506859 has been marked as a duplicate of this bug. ***

Comment 101 errata-xmlrpc 2009-09-02 08:04:35 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 107 Paolo Bonzini 2011-01-18 12:55:58 UTC
*** Bug 645043 has been marked as a duplicate of this bug. ***

