Bug 483279

Summary: Starting 5th domU causes dom0 to reboot on RHEL 5.3 AP i386
Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.3
Hardware: i686
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: low
Target Milestone: rc
Reporter: Thomas Cameron <tcameron>
Assignee: Chris Lalancette <clalance>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: clalance, rjones, sputhenp, syeghiay, xen-maint
Doc Type: Bug Fix
Last Closed: 2009-04-10 07:33:58 UTC
Attachments:
Description                                                Flags
sosreport from dom0                                        none
xml file for the guests - all are identically configured   none

Description Thomas Cameron 2009-01-30 17:19:12 UTC
Created attachment 330492
sosreport from dom0

Description of problem:
I am trying to set up 6 RHEL 4.6 i386 domUs on a RHEL 5.3 i386 dom0.  I am using kickstart to build them.  Four domUs built and ran successfully, but when I start the 5th domU, dom0 reboots every time.  The only thing odd in /var/log/messages is "kernel: xen_net: Memory squeeze in netback driver" several hundred times (suppressed).

The dom0 machine is a Dell Optiplex GX280 with an Intel 2.8GHz processor, 4GB memory and a 250GB SATA drive.  I am installing the domUs to 4GB LVM slices (/dev/mapper/XenVol-host1, /dev/mapper/XenVol-host2 and so on) on dom0.  Each domU is set up with 1 virtual CPU and 384MB memory.  This should leave plenty of memory for dom0: even with the memory lost to PCI, dom0 sees 3.2GB, so 5 guests (384MB * 5 = 1920MB) should still leave over a gig of memory for dom0.
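
The arithmetic behind that estimate can be sketched in a few lines of Python. This is a rough check only: it uses the 3.2GB and 384MB figures from the description, assumes 1GB = 1024MB, and ignores whatever the hypervisor itself and per-domain overhead consume. The function name is made up for illustration.

```python
# Rough dom0 memory budget, using only figures from the description.
# Assumptions: 1 GB = 1024 MB; hypervisor and per-domain overhead are ignored.
DOM0_VISIBLE_MB = 3.2 * 1024   # memory dom0 sees after the PCI hole
GUEST_MB = 384                 # memory assigned to each domU


def dom0_remaining_mb(guest_count, guest_mb=GUEST_MB, total_mb=DOM0_VISIBLE_MB):
    """Return the memory (in MB) left for dom0 once the guests have their share."""
    return total_mb - guest_count * guest_mb


for n in (4, 5, 6):
    print(f"{n} guests: {n * GUEST_MB} MB for guests, "
          f"{dom0_remaining_mb(n):.0f} MB left for dom0")
```

On paper, then, five guests still leave dom0 more than 1024MB, which is why a reboot at the fifth guest is unexpected from a pure capacity point of view.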


Version-Release number of selected component (if applicable):
xen-3.0.3-80.el5

How reproducible:
Install 4 domUs then start a 5th.

Steps to Reproduce:
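1.  On a RHEL 5.3 i386 dom0, kickstart four RHEL 4.6 i386 domUs (1 virtual CPU, 384MB memory each).
2.  Start (kickstart) a 5th domU.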
  
Actual results:
dom0 reboots spontaneously

Expected results:
The 5th guest installs and runs.

Additional info:

Comment 1 Thomas Cameron 2009-01-30 17:20:21 UTC
Created attachment 330493
xml file for the guests - all are identically configured

Comment 2 Thomas Cameron 2009-01-30 18:11:17 UTC
I think this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=454285 although in this case, the dom0 actually rebooted instead of just throwing errors in /var/log/messages.

I changed "kernel /xen.gz-2.6.18-128.el5" to "kernel /xen.gz-2.6.18-128.el5 dom0_mem=1024MB" in grub.conf and I was able to boot up 6 total domUs with no issues.
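
The workaround amounts to appending one argument to the hypervisor line (the `kernel /xen.gz...` line), not to the `module` lines. A minimal Python sketch of that edit follows. `add_hypervisor_arg` is a hypothetical helper written for illustration, not part of any Xen or grub tooling, and the sample stanza mirrors this machine's grub.conf entry.

```python
def add_hypervisor_arg(grub_conf_text, arg):
    """Append `arg` to each hypervisor line (`kernel /xen.gz...`) that lacks it.

    Commented-out lines and `module` lines are left untouched.
    """
    out = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        is_hypervisor = (
            len(words) >= 2
            and words[0] == "kernel"
            and words[1].startswith("/xen.gz")
        )
        if is_hypervisor and arg not in words[2:]:
            line = line.rstrip() + " " + arg
        out.append(line)
    return "\n".join(out) + "\n"


sample = """\
title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-128.el5
        module /vmlinuz-2.6.18-128.el5xen ro root=LABEL=/
        module /initrd-2.6.18-128.el5xen.img
"""

# The value is copied verbatim from the comment above.
print(add_hypervisor_arg(sample, "dom0_mem=1024MB"))
```

Running the helper twice leaves the file unchanged the second time, so it is safe to re-apply.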

Comment 3 Chris Lalancette 2009-02-02 07:34:37 UTC
Hm, I don't really think it's the same, though.  While I (and plenty of others) have run into the "Memory squeeze in netback driver" many times before, it's never caused a reboot before.  My guess is that there are 2 separate issues here, and the dom0_mem=1024M is working around both of them somehow.  I have 2 requests for testing, if you can:

1)  Try out the kernel at http://people.redhat.com/clalance/virttest; it has a patch for the Memory squeeze problem that may or may not help.

2)  Either get a serial console output, or a core-dump (via kdump) when the dom0 crashes.  That way we can at least see the stack trace that is causing the crash.

Thanks,
Chris Lalancette

Comment 4 Thomas Cameron 2009-02-02 15:06:31 UTC
It appears that kdump is not available for dom0.  http://kbase.redhat.com/faq/docs/DOC-10126

I will have to pick up a serial cable next time I'm at Fry's, I don't have one now.

I installed the test kernel and removed the "dom0_mem=1024MB" entry from grub.conf.  When I started the 4th domU I got:

Feb  2 14:37:11 molly kernel: xenbr0: topology change detected, propagating
Feb  2 14:37:11 molly kernel: xenbr0: port 5(vif3.0) entering forwarding state
Feb  2 14:37:29 molly kernel: device vif4.0 entered promiscuous mode
Feb  2 14:37:30 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:30 molly last message repeated 2 times
Feb  2 14:37:31 molly kernel: blkback: ring-ref 8, event-channel 8, protocol 1 (x86_32-abi)
Feb  2 14:37:31 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:38 molly last message repeated 7 times
Feb  2 14:37:40 molly kernel: xenbr0: topology change detected, propagating
Feb  2 14:37:40 molly kernel: xenbr0: port 6(vif4.0) entering forwarding state
Feb  2 14:37:40 molly kernel: printk: 1 messages suppressed.
Feb  2 14:37:40 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:47 molly kernel: printk: 548 messages suppressed.
Feb  2 14:37:47 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:51 molly kernel: printk: 50 messages suppressed.
Feb  2 14:37:51 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:38:00 molly kernel: printk: 68 messages suppressed.
Feb  2 14:38:00 molly kernel: xen_net: Memory squeeze in netback driver.

I was able to start the 5th and 6th domUs successfully, though.
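
To get a feel for how often the netback message fired in that excerpt, the log lines can be tallied. The Python sketch below is illustrative only. It counts the lines that were actually logged, the syslog "last message repeated N times" follow-ups, and the kernel's "printk: N messages suppressed" counters. The suppressed messages are not identified by content, so attributing all of them to the netback driver is an assumption, and that figure is best read as an upper bound.

```python
import re

# Lines copied from the /var/log/messages excerpt above.
LOG = """\
Feb  2 14:37:30 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:30 molly last message repeated 2 times
Feb  2 14:37:31 molly kernel: blkback: ring-ref 8, event-channel 8, protocol 1 (x86_32-abi)
Feb  2 14:37:31 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:38 molly last message repeated 7 times
Feb  2 14:37:40 molly kernel: xenbr0: topology change detected, propagating
Feb  2 14:37:40 molly kernel: xenbr0: port 6(vif4.0) entering forwarding state
Feb  2 14:37:40 molly kernel: printk: 1 messages suppressed.
Feb  2 14:37:40 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:47 molly kernel: printk: 548 messages suppressed.
Feb  2 14:37:47 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:37:51 molly kernel: printk: 50 messages suppressed.
Feb  2 14:37:51 molly kernel: xen_net: Memory squeeze in netback driver.
Feb  2 14:38:00 molly kernel: printk: 68 messages suppressed.
Feb  2 14:38:00 molly kernel: xen_net: Memory squeeze in netback driver.
"""

SQUEEZE = "Memory squeeze in netback driver"
REPEATED = re.compile(r"last message repeated (\d+) times")
SUPPRESSED = re.compile(r"printk: (\d+) messages suppressed")


def tally(log_text):
    """Return (logged, repeated, suppressed) counts for the squeeze message."""
    logged = repeated = suppressed = 0
    previous_was_squeeze = False
    for line in log_text.splitlines():
        rep = REPEATED.search(line)
        sup = SUPPRESSED.search(line)
        if rep:
            # A "repeated" line refers to the message logged just before it.
            if previous_was_squeeze:
                repeated += int(rep.group(1))
            continue
        if sup:
            suppressed += int(sup.group(1))
            previous_was_squeeze = False
            continue
        previous_was_squeeze = SQUEEZE in line
        if previous_was_squeeze:
            logged += 1
    return logged, repeated, suppressed


logged, repeated, suppressed = tally(LOG)
print(f"logged={logged} repeated={repeated} suppressed(upper bound)={suppressed}")
```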

I wanted to try exactly what I had done before - kickstarting a 5th domU.  Oddly, when I went through the virt-manager interface and tried to kickstart host5 again, it would not allow me to choose a bridged network.  None showed up in the drop-down menu.

I shut down my guests, rebooted dom0 and tried again.  This time, starting the kickstart of the 5th domU caused dom0 to reboot while I was still in the GUI creating the domU; I had not even started the installation yet.

After reboot, I still do not have the ability to kickstart a guest on the shared physical network.  There is no shared network to choose in virt-manager's installation GUI.  I went ahead and kickstarted the 5th domU on the default 192.168.122 network and it did come up.  No instances of the "Memory squeeze in netback driver" error during the kickstart of the 5th domU.  Once the 5th domU was built, I was able to start the saved 6th domU with no issues.

I am not sure if the test kernel helped.  The first time I tried to use it the system did reboot.  The second time it seemed to work.

I'd like to try to rebuild all of the guests again but they really need to be on the shared physical network, not the 192.168.122 network.

Comment 5 Chris Lalancette 2009-02-02 15:20:44 UTC
dom0 *does* support kdump, since 5.1.  That kbase article is just wrong.  You'll need to add the "crashkernel" parameter to the hypervisor line (the line that has xen.gz on it), and then you'll need to make sure you start the kdump service.  After that, it should work.  It doesn't seem like the test kernel made a difference for you, so for now, let's try to stick with the base 5.3 kernel and see what we can figure out there.  Hopefully you'll be able to get a successful core dump; once that happens, I can at least look at the trace.

Chris Lalancette

Comment 6 Thomas Cameron 2009-02-02 16:02:14 UTC
I've rebooted with the crashkernel section in grub.conf but I am still having problems:

[root@molly ~]# cat /boot/grub/grub.conf 
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/sda2
#          initrd /initrd-version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-128.el5 crashkernel=128M@16M
        module /vmlinuz-2.6.18-128.el5xen ro root=LABEL=/
        module /initrd-2.6.18-128.el5xen.img



[root@molly ~]# service kdump propagate
Using existing keys...
/root/.ssh/kdump_id_rsa.pub has been added to ~kdump/.ssh/authorized_keys2 on 172.31.100.1



[root@molly ~]# service kdump restart
Stopping kdump:                                            [  OK  ]
No kdump kernel image found.                               [WARNING]
Tried to locate /boot/vmlinuz-2.6.18-128.el5PAE
Starting kdump:                                            [FAILED]



[root@molly ~]# uname -a
Linux molly.tc.redhat.com 2.6.18-128.el5xen #1 SMP Wed Dec 17 12:22:24 EST 2008 i686 i686 i386 GNU/Linux


Not sure why the kdump service is looking for vmlinuz-2.6.18-128.el5PAE when that is not the kernel I am running.  Thoughts?

Comment 7 Thomas Cameron 2009-02-02 16:03:34 UTC
Sorry, this is probably important as well:

[root@molly ~]# grep -v "^#" /etc/kdump.conf 
net kdump.100.1

Comment 8 Chris Lalancette 2009-02-02 16:40:23 UTC
Oh, right.  Yes, you can't kexec *into* a Xen kernel, so the kdump service falls back to the default kernel, which would be PAE.  So you need to install the PAE kernel as well; then it can use that.

Chris Lalancette
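
The fallback described here can be illustrated with a small Python sketch. It is not the kdump init script's actual logic: the suffix swap from "xen" to "PAE" is inferred only from the "Tried to locate" message in comment 6 and the explanation above, and the function name is hypothetical.

```python
def kdump_kernel_path(running_release, boot_dir="/boot"):
    """Guess which kernel image kdump would try to load on this i386 host.

    Illustration only: a Xen kernel cannot be the kexec target, so the
    "xen" suffix is swapped for "PAE", matching the path seen in comment 6.
    """
    if running_release.endswith("xen"):
        release = running_release[: -len("xen")] + "PAE"
    else:
        release = running_release
    return f"{boot_dir}/vmlinuz-{release}"


# `uname -r` on the affected dom0 reports 2.6.18-128.el5xen.
print(kdump_kernel_path("2.6.18-128.el5xen"))
```

If the printed path does not exist on disk, the kdump service has nothing to load, which matches the "No kdump kernel image found" warning.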

Comment 9 Thomas Cameron 2009-02-02 18:52:12 UTC
OK, got it to reboot again kickstarting the 5th domU.  What I did was nuke all of the guests and start fresh, installing to the 192.168.122 network.  For some reason I can no longer build domUs with the bridged network, the drop-down is blank.

Anyway, I set up netconsole and got it logging on my workstation.  When I brought up the 5th domU for kickstart, I got this in the log:


Feb  2 12:45:51 172.31.100.3 BUG: unable to handle kernel paging request
Feb  2 12:45:51 172.31.100.3  at virtual address e541f000 
Feb  2 12:45:51 172.31.100.3  printing eip: 
Feb  2 12:45:51 172.31.100.3 c04540e9 
Feb  2 12:45:51 172.31.100.3 29b44000 -> *pde = 00000000:b7873001 
Feb  2 12:45:51 172.31.100.3 28273000 -> *pme = 00000000:3c121067 
Feb  2 12:45:51 172.31.100.3 00121000 -> *pte = 00000000:00000000 
Feb  2 12:45:51 172.31.100.3 Oops: 0002 [#1] 
Feb  2 12:45:51 172.31.100.3 SMP 
Feb  2 12:45:51 172.31.100.3  
Feb  2 12:45:51 172.31.100.3 last sysfs file: /class/net/lo/type 
Feb  2 12:45:51 172.31.100.3 Modules linked in:
Feb  2 12:45:51 172.31.100.3  xt_physdev
Feb  2 12:45:51 172.31.100.3  netloop
Feb  2 12:45:51 172.31.100.3  netbk
Feb  2 12:45:51 172.31.100.3  blktap
Feb  2 12:45:51 172.31.100.3  blkbk
Feb  2 12:45:51 172.31.100.3  ipt_MASQUERADE
Feb  2 12:45:51 172.31.100.3  iptable_nat
Feb  2 12:45:51 172.31.100.3  ip_nat
Feb  2 12:45:51 172.31.100.3  xt_state
Feb  2 12:45:51 172.31.100.3  ip_conntrack
Feb  2 12:45:51 172.31.100.3  nfnetlink
Feb  2 12:45:51 172.31.100.3  ipt_REJECT
Feb  2 12:45:51 172.31.100.3  xt_tcpudp
Feb  2 12:45:51 172.31.100.3  iptable_filter
Feb  2 12:45:51 172.31.100.3  ip_tables
Feb  2 12:45:51 172.31.100.3  x_tables
Feb  2 12:45:51 172.31.100.3  bridge
Feb  2 12:45:51 172.31.100.3  netconsole
Feb  2 12:45:51 172.31.100.3  autofs4
Feb  2 12:45:51 172.31.100.3  hidp
Feb  2 12:45:51 172.31.100.3  rfcomm
Feb  2 12:45:51 172.31.100.3  l2cap
Feb  2 12:45:51 172.31.100.3  bluetooth
Feb  2 12:45:51 172.31.100.3  sunrpc
Feb  2 12:45:51 172.31.100.3  xfrm_nalgo
Feb  2 12:45:51 172.31.100.3  crypto_api
Feb  2 12:45:51 172.31.100.3  dm_multipath
Feb  2 12:45:51 172.31.100.3  scsi_dh
Feb  2 12:45:51 172.31.100.3  video
Feb  2 12:45:51 172.31.100.3  hwmon
Feb  2 12:45:51 172.31.100.3  backlight
Feb  2 12:45:51 172.31.100.3  sbs
Feb  2 12:45:51 172.31.100.3  i2c_ec
Feb  2 12:45:51 172.31.100.3  button
Feb  2 12:45:51 172.31.100.3  battery
Feb  2 12:45:51 172.31.100.3  asus_acpi
Feb  2 12:45:51 172.31.100.3  ac
Feb  2 12:45:51 172.31.100.3  lp
Feb  2 12:45:51 172.31.100.3  sg
Feb  2 12:45:51 172.31.100.3  parport_pc
Feb  2 12:45:51 172.31.100.3  i2c_i801
Feb  2 12:45:51 172.31.100.3  parport
Feb  2 12:45:51 172.31.100.3  snd_intel8x0
Feb  2 12:45:51 172.31.100.3  snd_ac97_codec
Feb  2 12:45:51 172.31.100.3  ac97_bus
Feb  2 12:45:51 172.31.100.3  snd_seq_dummy
Feb  2 12:45:51 172.31.100.3  snd_seq_oss
Feb  2 12:45:51 172.31.100.3  snd_seq_midi_event
Feb  2 12:45:51 172.31.100.3  snd_seq
Feb  2 12:45:51 172.31.100.3  snd_seq_device
Feb  2 12:45:51 172.31.100.3  snd_pcm_oss
Feb  2 12:45:51 172.31.100.3  ide_cd
Feb  2 12:45:51 172.31.100.3  i2c_core
Feb  2 12:45:51 172.31.100.3  serio_raw
Feb  2 12:45:51 172.31.100.3  snd_mixer_oss
Feb  2 12:45:51 172.31.100.3  
Feb  2 12:45:51 172.31.100.3  [<c041e393>] 
Feb  2 12:45:51 172.31.100.3  [<c0410a4b>] 
Feb  2 12:45:51 172.31.100.3  [<c061088c>] 
Feb  2 12:45:51 172.31.100.3  [<c0453ece>] 
Feb  2 12:45:51 172.31.100.3 pte_alloc_one+0x11/0x29 
Feb  2 12:45:51 172.31.100.3  [<c041e393>] 
Feb  2 12:45:51 172.31.100.3  [<c0405413>] 

The dom0 machine then rebooted.
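
Even though the trace is truncated, the "Oops: 0002" line carries some information. On x86 the low three bits of the page-fault error code say whether the page was present, whether the access was a write, and whether it came from user mode. The Python sketch below decodes just those bits; the function name is made up for illustration, and higher bits are ignored.

```python
def decode_page_fault_error(code):
    """Describe the low three bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel (supervisor) mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


# "Oops: 0002" from the trace above; the value is printed in hex.
print(decode_page_fault_error(0x0002))
```

For 0002 this reads as a kernel-mode write to a page that is not present, which is consistent with the all-zero `*pte` value in the same trace.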

Comment 10 Chris Lalancette 2009-02-02 21:59:43 UTC
Ugh, that's unfortunate; most of the interesting pieces of the stack trace got truncated.  I might be able to squeeze a little bit of information out of this by seeing exactly what eip c04540e9 is; I'll try that tomorrow.  In the meantime, if you get a chance to get a serial cable and get a full dump, that would be best.

Thanks!
Chris Lalancette
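
One common way to see what an eip corresponds to is to look it up in the System.map for the exact kernel build: the symbol with the largest address not exceeding the eip is the enclosing function. The Python sketch below shows that lookup. The map fragment in it is entirely hypothetical (made-up names and addresses), since the real symbol table is not part of this report; in practice the text would come from the System.map file matching the running kernel.

```python
import bisect


def load_symbols(system_map_text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the nearest symbol at or below `address`."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address is below every known symbol
    base, name = symbols[index]
    return f"{name}+{address - base:#x}"


# HYPOTHETICAL fragment: these names and addresses are invented for the demo.
HYPOTHETICAL_MAP = """\
c0453000 T example_func_a
c0454000 T example_func_b
c0454200 T example_func_c
"""

print(resolve(load_symbols(HYPOTHETICAL_MAP), 0xc04540e9))
```

With the real map in place of the hypothetical fragment, the same lookup can be applied to each bare `[<...>]` address in the truncated trace.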

Comment 11 Richard W.M. Jones 2009-02-10 13:49:31 UTC
I had a similar problem yesterday, starting 4 guests (on
starting the 4th guest, the host hard rebooted).

Host is RHEL 5.3 Xen x86_64:
Linux intel-mb 2.6.18-128.el5xen #1 SMP Wed Dec 17 12:01:40 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Guests were RHEL 5, RHEL 4, 32 and 64 bits, all PV.  It was
while I was starting up the install of the fourth one
(RHEL 4 32 bit) that the reboot happened.

This looked easily reproducible so ask me if you need more
information.

Comment 12 Thomas Cameron 2009-02-11 14:15:48 UTC
Richard - can you capture a dump?  I've been travelling heavily so this has been on my backburner.  I still don't have a serial cable to capture anything on the console and kdump isn't working for me.

Comment 13 Richard W.M. Jones 2009-02-24 16:29:12 UTC
Unfortunately I cannot reproduce this now, even starting and
stopping lots more domains than before.  If it reoccurs I'll
try to capture a crashdump.

Comment 14 Chris Lalancette 2009-04-10 07:33:58 UTC
I'm pretty sure this is the same as 479754, so I'm going to close this as a dup.  If it turns out to be different, we can re-open.

Chris Lalancette

*** This bug has been marked as a duplicate of bug 479754 ***