Bug 1163749 - RHEL7.1 guest goes into grub rescue mode on boot
Summary: RHEL7.1 guest goes into grub rescue mode on boot
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RHEV for Power
Classification: Retired
Component: qemu-kvm-rhev
Version: unspecified
Hardware: ppc64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: David Gibson
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: RHV4.1PPC
 
Reported: 2014-11-13 12:22 UTC by Joy Pu
Modified: 2016-09-07 21:30 UTC (History)
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-07 21:30:12 UTC
Embargoed:


Attachments (Terms of Use)
serial output (6.25 KB, text/plain)
2014-11-13 12:23 UTC, Joy Pu
screen dump at the end (35.93 KB, image/jpeg)
2014-11-13 12:24 UTC, Joy Pu
serial log with more info (4.07 KB, text/plain)
2015-01-14 09:01 UTC, Joy Pu

Description Joy Pu 2014-11-13 12:22:47 UTC
Description of problem:

A RHEL7.1 ppc64 BE guest goes into grub rescue mode when booting up.


Version-Release number of selected component (if applicable):

# rpm -qa |grep qemu
qemu-common-2.0.0-2.1.pkvm2_1_1.20.38.ppc64
qemu-2.0.0-2.1.pkvm2_1_1.20.38.ppc64
ipxe-roms-qemu-20130517-2.gitc4bce43.f19.2.noarch
qemu-system-x86-2.0.0-2.1.pkvm2_1_1.20.38.ppc64
qemu-img-2.0.0-2.1.pkvm2_1_1.20.38.ppc64
libvirt-daemon-driver-qemu-1.2.5-1.1.pkvm2_1_1.20.28.ppc64
qemu-system-ppc-2.0.0-2.1.pkvm2_1_1.20.38.ppc64
qemu-kvm-2.0.0-2.1.pkvm2_1_1.20.38.ppc64
libvirt-daemon-qemu-1.2.5-1.1.pkvm2_1_1.20.28.ppc64
qemu-kvm-tools-2.0.0-2.1.pkvm2_1_1.20.38.ppc64


host version:
# uname -r
3.10.42-2018.1.pkvm2_1_1.46.ppc64

guest version:
3.10.0-195.el7.ppc64


How reproducible:
Rarely; only seen once so far.


Steps to Reproduce:
1. Boot up guest with following command line:
/bin/qemu-kvm \
    -S  \
    -name 'virt-tests-vm1'  \
    -sandbox off  \
    -machine pseries,accel=kvm  \
    -nodefaults  \
    -device VGA,id=video0  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/monitor-qmpmonitor1-20141112-051047-m2L27gfs,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -serial unix:'/tmp/serial-serial0-20141112-051047-m2L27gfs',server,nowait \
    -device ich9-usb-uhci1,id=usb1,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=04 \
    -drive id=drive_image1,if=none,cache=none,snapshot=off,aio=native,file=/var/lib/libvirt/images/autotest/client/tests/virt/shared/data/images/rhel71-ppc64-be.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device spapr-vlan,mac=9a:a5:a6:a7:a8:a9,id=idyps53X,netdev=idis7TcH  \
    -netdev tap,id=idis7TcH,fd=22  \
    -m 2048  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'POWER8' \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off  \
    -machine accel=kvm

2. Check the output from serial port

Actual results:

Get this from serial port:

2014-11-12 05:10:54: 
2014-11-12 05:10:54: Trying to load:  from: disk ...
2014-11-12 05:10:54:   Successfully loaded
2014-11-12 05:11:01: error: mismatched names.
2014-11-12 05:11:01: Entering rescue mode...
2014-11-12 05:11:01: grub rescue>

And the boot screen gets stuck at the "Successfully loaded" message.


Expected results:

The guest boots up normally.


Additional info:

Just before this test, a basic live_snapshot test was run, and that guest shut down normally without any error messages; not sure if this is related. After hitting this once, booting with the same image again worked normally. Neither qemu-img check nor the guest kernel reported any errors about the image.

Comment 1 Joy Pu 2014-11-13 12:23:19 UTC
Created attachment 957162 [details]
serial output

Comment 2 Joy Pu 2014-11-13 12:24:46 UTC
Created attachment 957163 [details]
screen dump at the end

Comment 4 David Gibson 2014-11-14 00:38:58 UTC
Aside:  using a UHCI USB controller isn't generally a good idea on ppc64, but that's unlikely to be the cause of the problem here.

Comment 5 David Gibson 2014-11-14 00:39:56 UTC
I am very confused by the "mismatched names" error which seems to be relevant here.

I can't find such an error message in either grub or SLOF, so I'm not sure where it's coming from.

Comment 6 David Gibson 2014-11-14 00:42:47 UTC
Do you still have the disk image which triggered this problem?

Comment 7 Joy Pu 2014-11-17 03:29:22 UTC
(In reply to David Gibson from comment #6)
> Do you still have the disk image which triggered this problem?

Hi David

I am sorry, the image has already been reinstalled by the following test, as this is a test case in the middle of the test loop. I will keep the image if I hit this again.

The error message may come from grub; I found this in the newest grub source code:

/* Load a module using a symbolic name.  */
grub_dl_t
grub_dl_load (const char *name)
{
  char *filename;
......

  if (grub_strcmp (mod->name, name) != 0)
    grub_error (GRUB_ERR_BAD_MODULE, "mismatched names");

  return mod;
}

And thanks for your advice about UHCI. I will update our config files so future tests no longer use it.

Comment 8 David Gibson 2014-11-17 05:13:43 UTC
Ok, I can't do much without an image to try reproducing this.

My understanding is that you've only ever seen this happen once.  Is that correct?


To clarify about UHCI:
  * There's no inherent reason UHCI won't work, but it would be very unusual to see a UHCI controller on a (real hardware) non-x86 machine (the ICH9 chipset that the UHCI is usually part of is x86-specific).
  * Because of that, POWER guests may not include drivers for UHCI (though it happens that RHEL does).
  * Although IBM's PowerKVM includes a UHCI controller, the Red Hat KVM for Power host coming in RHEL 7.2 will have the UHCI controller disabled.
  * Generally you should use an OHCI controller instead ("pci-ohci" in qemu, IIRC). EHCI and XHCI should also be available, but OHCI is the default qemu uses on Power when keyboard and tablet devices are needed.
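As an illustration of that last point (a sketch only, derived from the reproducer command line in the bug description; the advice to use "pci-ohci" is from this comment, so treat the exact device name as unverified):

```shell
# Sketch only: the reproducer above attaches the tablet through an
# ICH9 UHCI controller:
#     -device ich9-usb-uhci1,id=usb1,bus=pci.0,addr=03
# Per the advice in this comment, an OHCI controller on a pseries
# guest would be requested like this instead (other options unchanged):
#     -device pci-ohci,id=usb1,bus=pci.0,addr=03
#     -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1
```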

Comment 10 David Gibson 2014-11-19 02:20:41 UTC
Thanks for the update.  Because the disk image is now booting, it sounds like this is an intermittent problem in the firmware or grub itself, rather than something incorrectly configured during the install.

Have you tried the test loop with OHCI instead of UHCI yet?  That is probably not the cause, but it's not impossible that SLOF or grub has an intermittent bug when dealing with the UHCI controller.

Do you have the serial output for the second failure? I want to verify whether the symptoms are identical or just similar.

Comment 11 David Gibson 2014-11-19 05:42:49 UTC
I got the disk image and tried booting, with a command line similar to the one given, about 25 times, and haven't managed to reproduce it so far.  I think I'll need to write some sort of script to detect the failure state and run in a loop for a while to see if we can hit this again.

Since I'm running qemu manually, I'm using a slightly different command line.  I don't think the changes I've made are likely to affect whether the bug triggers, but just in case, can you also attach the libvirt XML for this machine, and the commands that your test scripts use to boot the machine via libvirt?
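A detection loop of that sort might look like the following sketch (entirely illustrative: the qemu invocation is elided and the log path is made up; only the "grub rescue>" marker comes from the serial logs attached to this bug):

```shell
#!/bin/sh
# Sketch of a failure-detection loop: boot the guest repeatedly,
# capture the serial console to a file, and stop as soon as grub
# drops into rescue mode so the failing image can be preserved.
SERIAL_LOG=/tmp/virt-tests-serial.log

hit_rescue_mode() {
    # Both failure logs in this bug end with a "grub rescue>" prompt
    # on the serial console, so grep for that marker.
    grep -q 'grub rescue>' "$1"
}

for i in $(seq 1 100); do
    : > "$SERIAL_LOG"
    # qemu invocation elided; redirect the serial console into the
    # log, e.g. with: -serial file:"$SERIAL_LOG"
    if hit_rescue_mode "$SERIAL_LOG"; then
        echo "hit grub rescue mode on iteration $i; keeping the image"
        break
    fi
done
```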

Comment 12 Joy Pu 2014-11-19 06:11:46 UTC
(In reply to David Gibson from comment #11)
> I got the disk image, and tried booting with a command line similar to the
> one given about 25 times, and haven't managed to reproduce so far.  I think
> I'll need to write some sort of script to detect the failure state and run
> in a loop for a while to see if we can hit this again.
> 
> Since I'm running qemu manually, I'm using a slightly different command
> line.  I don't think the changes I've made are likely to affectwhether the
> bug triggers, but just in case, can you also attach the libvirt XML for this
> machine, and the commands that your test scripts use to boot the machine via
> libvirt?

We are using autotest, which starts the qemu-kvm process directly, so there is no XML file for virsh.

I have only hit this in my test loop. Maybe we can make the script sleep in the test case that triggers the problem, so the environment is preserved if we hit it again. I will try that on my machine and let you know if the guest gets stuck in rescue mode.

Comment 13 David Gibson 2014-11-20 00:13:31 UTC
Ah, ok, I had assumed libvirt was involved because of the -netdev tap,id=idis7TcH,fd=22.  Without libvirt, do you know what is managing this fd being passed to qemu?

Your plan for attempting to capture a stuck state sounds like a good place to start, thanks.

Comment 14 Joy Pu 2014-11-25 09:58:01 UTC
(In reply to David Gibson from comment #13)
> Ah, ok, I had assumed libvirt was involved because of the -netdev
> tap,id=idis7TcH,fd=22.  Without libvirt do you know what is managing this fd
> being passed to qemu?
> 
> Your plan for attempting to capture a stuck state sounds like a good place
> to start, thanks.

Sorry for the delayed reply. Autotest has functions to open the fds and obtain them from the host. I have run the test loop several times without reproducing the issue so far. I will update the bug when it shows up again.

Comment 15 Joy Pu 2015-01-14 09:01:25 UTC
Created attachment 979894 [details]
serial log with more info

Comment 16 Joy Pu 2015-01-14 09:03:23 UTC
I met another kind of failure in the serial log during the test. It also triggered right after the live_snapshot test case; it seems easier to trigger after I/O-heavy test cases. Not sure whether the two failures have the same cause.



2015-01-12 11:31:47: Trying to load:  from: disk ...
2015-01-12 11:31:52:   Successfully loaded
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: not a correct XFS inode.
2015-01-12 11:31:57: error: attempt to read or write outside of partition.
2015-01-12 11:31:57: error: file `/grub2/powerpc-ieee1275/normal.mod' not found.
2015-01-12 11:31:57: Entering rescue mode...
2015-01-12 11:31:57: grub rescue>

Comment 17 David Gibson 2015-01-16 03:04:37 UTC
Ok, this looks like some sort of error reading the disk.

It's not clear if this is caused by the same problem as the first occurrence or not.

Was this test run using any of the same disk images as the live snapshot test?

I'm wondering if there could be some sort of image corruption that doesn't cause an error at the time, but causes problems in subsequent tests.
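One generic way to look for that kind of latent corruption (a sketch, not something done in this report; qemu-img check is the standard qcow2 consistency checker, and the image path is the one from the reproducer command line) would be to check the image between test cases:

```shell
# Check the qcow2 image used by the tests for internal inconsistencies.
# A healthy image reports "No errors were found on the image";
# corruptions or leaked clusters would point at damage from an
# earlier test case.
qemu-img check \
    /var/lib/libvirt/images/autotest/client/tests/virt/shared/data/images/rhel71-ppc64-be.qcow2
```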

Comment 18 Joy Pu 2015-01-27 08:25:54 UTC
(In reply to David Gibson from comment #17)
> Ok, this looks like some sort of error reading the disk.
> 
> It's not clear if this is caused by the same problem as the first occurrence
> or not.
> 
> Was this test run using any of the same disk images as the live snapshot
> test?

No, they are not using the same image. The images are newly created and installed in every loop of our tests.

> 
> I'm wondering if there could be some sort of image corruption that doesn't
> cause an error at the time, but causes problems in subsequent tests.

Comment 20 Thomas Huth 2016-05-17 17:12:17 UTC
Is this bug still an issue with the latest version of RHEV? If not, would it be ok to close it?

Comment 21 Qunfang Zhang 2016-09-07 10:26:05 UTC
(In reply to Thomas Huth from comment #20)
> Is this bug still an issue with the latest version of RHEV? If not, would it
> be ok to close it?

QE did not hit this bug during our RHEL 7.3 functional testing, with either autotest or manual tests.

Comment 22 Thomas Huth 2016-09-07 21:30:12 UTC
Thanks a lot for checking! So I assume this has been fixed by switching RHEV from PowerKVM to RHEL (or by some other bugfix in between), and it should be OK to close this bug now. If the problem occurs again, please feel free to reopen this bug ticket (or open a new one).

