Bug 591784

Summary: RHEL 6 x86_64 beta VMs don't boot correctly in RHEL 6 x86_64 beta
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.0
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: urgent
Priority: low
Target Milestone: rc
Reporter: Justin Clift <justin>
Assignee: Karen Noel <knoel>
QA Contact: Virtualization Bugs <virt-bugs>
CC: alex.williamson, berrange, hagberg, kai, mikolaj, notting, tudor.georgescu, virt-maint
Doc Type: Bug Fix
Last Closed: 2010-06-07 15:34:09 UTC
Attachments:
  - Showing hang during startup.
  - top showing cpu pegged.
  - qemu log file.
  - All of /var/log/messages from when the VM was rebooted.
  - Graphical console output while serial console redirection is in place.
  - Serial console log file.

Description Justin Clift 2010-05-13 06:38:45 UTC
Description of problem:

On servers running RHEL 6 beta x86_64, no installation of RHEL 6 beta x86_64 as a virtual machine will boot successfully.

Installation of the virtual machines is flawless; however, no subsequent boot (including the first boot) makes it through to completion.

In every case the boot hangs shortly into the boot process, sometimes pegging a CPU at 100% and sometimes not at all (it's inconsistent).

RHEL 5.4 x86_64 VMs on the same hosts work fine.

Screenshots showing the boot failures are attached, along with the qemu log file and the portion of /var/log/messages from when the VM was booted.


Version-Release number of selected component (if applicable):

$ rpm -qa | grep qemu
qemu-kvm-tools-0.12.1.2-2.17.el6.x86_64
qemu-kvm-0.12.1.2-2.17.el6.x86_64
gpxe-roms-qemu-0.9.7-6.2.el6.noarch
qemu-img-0.12.1.2-2.17.el6.x86_64
$ rpm -qa | grep libvirt
libvirt-cim-0.5.8-1.el6.x86_64
fence-virtd-libvirt-0.2.1-3.el6.x86_64
libvirt-qpid-0.2.17-7.el6.x86_64
libvirt-java-0.4.1-1.el6.noarch
libvirt-client-0.7.6-2.el6.x86_64
libvirt-python-0.7.6-2.el6.x86_64
libvirt-0.7.6-2.el6.x86_64
$


How reproducible:

Every time (unfortunately)


Steps to Reproduce:
1. Install RHEL 6 beta x86_64 as a virtual machine through virt-manager (GUI).
   Any installation type will do (e.g. Minimal).
2. Reboot as normal at the end of installation, optionally pressing Escape during startup to view the boot messages.
3. The hang occurs here, during startup.

  
Actual results:

Hang during bootup process.


Expected results:

RHEL 6 VMs should start and function normally.


Additional info:

When installing the VM, the OS type was set to "Linux", and the OS Version was set to "Red Hat Enterprise Linux 6".

Using different types of backend storage for the VM makes no difference (e.g. local disk, iSCSI, network block device).

Comment 1 Justin Clift 2010-05-13 06:39:53 UTC
Created attachment 413644 [details]
Showing hang during startup.

Comment 2 Justin Clift 2010-05-13 06:40:21 UTC
Created attachment 413645 [details]
top showing cpu pegged.

Comment 3 Justin Clift 2010-05-13 06:40:51 UTC
Created attachment 413646 [details]
qemu log file.

Comment 4 Justin Clift 2010-05-13 06:41:42 UTC
Created attachment 413647 [details]
All of /var/log/messages from when the VM was rebooted.

Comment 5 Justin Clift 2010-05-13 06:43:00 UTC
As a thought, I'm open to further suggestions on how to collect relevant info and log messages.

The VMs in question don't get far enough into the boot process to write to /var/log/messages on the VM disk, so no useful information is there.  (I checked, just in case)

Comment 7 Justin Clift 2010-05-13 06:52:44 UTC
Useful additional info: when starting with the kernel option "init=/bin/bash" to bypass the init scripts, the kernel loads fine and gives a working bash prompt at the appropriate place.

Looks like the cause is somewhere in the init script chain.
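
At the GRUB menu, press 'a' to append to the kernel arguments (or 'e' to edit the kernel line). The line below is only an illustration; the real kernel line comes from the guest's /boot/grub/grub.conf:

  kernel /vmlinuz-2.6.32-xx.el6.x86_64 ro root=/dev/VolGroup/lv_root init=/bin/bash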

Comment 8 RHEL Program Management 2010-05-13 07:59:37 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 9 Bill Nottingham 2010-05-13 14:38:38 UTC
What happens if you remove 'rhgb' and/or 'quiet' from the boot arguments? You might also try attaching a serial console to the virtual machine and getting kernel messages.
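
Something along these lines should do it: add a serial device to the guest with "virsh edit <guest>" (a minimal sketch; substitute your own guest name), put console=ttyS0 on the guest's kernel line, and watch the output with "virsh console <guest>":

  <serial type='pty'>
    <target port='0'/>
  </serial>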

Comment 10 Justin Clift 2010-05-14 16:57:14 UTC
Thanks Bill.  Using the serial console allows more to be seen, but there's no smoking gun.

Bits of interest:

 + Using init=/bin/bash and dropping through directly works every time.
   Not useful for much other than enabling/disabling scripts.

 + Disabling *all* of the init service scripts also allows the boot to
   complete every time on a "Minimal" installation, however even that
   doesn't work if done with a "Desktop" installation.

After noticing that disabling all the init scripts on a Minimal installation allowed the boot to complete and the VM to function, I suspected one of the init scripts was the problem.

So I then went through enabling them one at a time to see which one caused it.  None of them did individually. :(

Leaving them all on, then disabling only "udev-post", "lvm2-monitor" and "network", sometimes allows a VM to function.  Even that is inconsistent though, with VMs still hanging roughly a third of the time.
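
For reference, the toggling is just chkconfig from inside the guest, roughly:

  # disable the three suspect services (chkconfig with no --level
  # flag acts on runlevels 2-5)
  for svc in udev-post lvm2-monitor network; do
      chkconfig "$svc" off
  done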

This is all on a server that runs RHEL 5.4 VMs with no issues, so I have no idea what the real cause is at this stage.

Attaching:

  + A screenshot of the latest interesting error message on a VM console during boot.  This is from a "Desktop" installation, with all of the init scripts disabled.
  + The complete serial console output from the same boot, showing it getting up to the same point.

Maybe there's a smoking gun in the serial console output I'm not seeing?

Comment 11 Justin Clift 2010-05-14 16:58:01 UTC
Created attachment 414110 [details]
Graphical console output while serial console redirection is in place.

Comment 12 Justin Clift 2010-05-14 16:58:51 UTC
Created attachment 414111 [details]
Serial console log file

Comment 13 Justin Clift 2010-05-15 02:36:51 UTC
Avi, would having remote root access to these boxes (via ssh) help?

Comment 14 Avi Kivity 2010-05-16 15:30:12 UTC
Yes please.

Comment 15 Justin Clift 2010-05-17 14:33:37 UTC
Thanks Avi.  Details for remote login have just been emailed to you. :)

Comment 16 Eric Hagberg 2010-05-17 20:32:03 UTC
I saw something like this, and it appeared to be related to an interaction between the virtio_balloon driver and kvm. After that driver loaded (via start_udev in rc.sysinit), /proc/meminfo showed about 170MB instead of the expected 4GB for MemTotal.
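
A quick way to confirm it from inside the guest:

  lsmod | grep virtio_balloon   # check whether the balloon driver is loaded
  grep MemTotal /proc/meminfo   # showed ~170MB here instead of the expected 4GB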

If I put the virtio_balloon driver in the blacklist config file (in the guest VM, after booting with "init=/bin/bash") so it didn't load, then all was well.

Comment 17 Justin Clift 2010-05-18 01:17:09 UTC
Thanks Eric, initial cursory testing of blacklisting the virtio_balloon driver looks promising.

Another symptom that had been occurring was that *sometimes* a VM would show Out Of Memory (OOM) errors during init, with processes being automatically killed to free up RAM (and strange subsequent errors in the boot log as a consequence).

When manually entering (via init=/bin/bash) a VM showing those symptoms, top generally showed just under 90MB of RAM.

Just tried fresh installations of RHEL 6 VMs here, and with the virtio_balloon driver blacklisted after the install (prior to reboot) things worked perfectly.

Then going into the same VMs and removing the blacklist entry caused them to peg the CPU at 100% and hang during boot every time.

I'll test this in more depth today.

Comment 18 Justin Clift 2010-05-18 16:42:35 UTC
I've run through the install and boot process with just under a hundred individual VMs today, mostly using kickstart, and blacklisting virtio_balloon is definitely the make-or-break factor here.

With virtio_balloon still being loaded, they *all* fail at some point during boot or shortly afterwards.

With virtio_balloon blacklisted, no problems are encountered (related to this bug anyway).

Comment 19 Mikolaj Kucharski 2010-05-27 00:50:38 UTC
I experience exactly the same problem with Fedora 12, Fedora 13 and Ubuntu 10.04 as guests. Is there any way to disable the virtio balloon in qemu-kvm via libvirt (in the XML)?

host machine: redhat-release-6-6.0.0.24.el6.x86_64

Comment 20 Justin Clift 2010-05-27 03:41:05 UTC
Hi Mikolaj, are you using kickstart scripts for building the VMs?

I haven't looked into using libvirt to disable the balloon driver, but adding this %post installation snippet to kickstart scripts for building VMs works here:

  # Post installation script
  %post
  echo "blacklist virtio_balloon" >> /etc/modprobe.d/blacklist.conf
  %end

Hope that helps. :)

Comment 21 Kai Meyer 2010-05-27 15:17:10 UTC
I can echo Justin's findings exactly. With virtio_balloon, the machine crashes as soon as the module is loaded. Without it, the machines install and boot flawlessly.

I wonder if it is related to the issue I see in virsh dumpxml, where something is forgetting to convert from MB to KB, or vice versa:

From libvirt:
[root@kvm0 ~]# virsh dumpxml pxe | grep -i mem
  <memory>524288</memory>
  <currentMemory>536870912</currentMemory>

From the configured XML file:
[root@kvm0 ~]# grep -i mem /etc/libvirt/qemu/pxe.xml 
  <memory>524288</memory>
  <currentMemory>524288</currentMemory>


Running VMs show the correct memory value, but multiplied by 1024. Virt-manager shows exactly the same thing. I don't have enough memory to run a VM whose configured memory times 1024 is less than the physical memory on the server, so I can't test whether such a VM would run properly. If libvirt thinks the running memory is "out of bounds" for the amount of physical memory available, I could see that causing the issue with the virtio_balloon kernel module.

Comment 22 Mikolaj Kucharski 2010-05-28 09:22:54 UTC
I see that on Ubuntu 10.04 (libvirt-bin 0.7.5-5ubuntu27) qemu-kvm is started without the

-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3

parameter that it gets on RHEL 6 (libvirt-0.7.6-2.el6.x86_64). Justin, thanks for the tip; I had figured that out, but I would like to know how to pass the option

-balloon none

to qemu-kvm via libvirt, or how to remove the above ``-device virtio...'' default completely. Is this possible?
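
For later readers: libvirt releases newer than the 0.7.x builds discussed here added a <memballoon> device element to the domain XML, which allows exactly this. A sketch (not available in libvirt-0.7.6):

  <memballoon model='none'/>

With the versions in this report, the guest-side blacklist from comment #20 appears to be the practical workaround.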

Comment 23 Daniel Berrangé 2010-06-07 15:34:09 UTC
As Eric notes in comment #16, the problem is with the balloon driver. There was an unexpected units change in the QEMU balloon monitor from kilobytes to bytes, and the corresponding change to libvirt missed the release.
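
Concretely, using the 512 MB guest from comment #21:

  524288 KiB            = 536870912 bytes  (the 512 MiB the guest should have)
  524288 read as bytes  = 512 KiB          (the target the guest actually gets)

So a target sent in kilobytes but read as bytes asks the guest to shrink to half a megabyte; that is why the guests balloon down until they OOM, and why values read back from QEMU appear multiplied by 1024.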

Disabling the guest balloon driver is probably the easiest quick workaround that I know of.

The real fix is tracked in bug 566261.

*** This bug has been marked as a duplicate of bug 566261 ***

Comment 24 Alex Williamson 2010-06-10 14:06:04 UTC
*** Bug 601782 has been marked as a duplicate of this bug. ***