Bug 214494

Summary: The kernel cannot be booted after installing.
Product: [Fedora] Fedora
Reporter: IBM Bug Proxy <bugproxy>
Component: anaconda
Assignee: Anaconda Maintenance Team <anaconda-maint-list>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 6
CC: cagney, dhowells, dwmw2, gal, jgirouar, marksmit, wtogami
Keywords: Reopened
Hardware: powerpc
OS: Linux
Fixed In Version: yum-3.0.2
Doc Type: Bug Fix
Last Closed: 2007-02-26 18:42:51 UTC
Attachments:
fc6_working.tar (flags: none)
FC6 install logs - not working (flags: none)

Description IBM Bug Proxy 2006-11-07 21:00:39 UTC
LTC Owner is: rosalesa.com
LTC Originator is: zhengyzy.com

Problem description:
After installing the FC6 release over the network (NFS), the installed kernel
cannot be booted. It crashes during startup.

Hardware Environment
    Cpu type (Power4, Power5, IA-64, etc.): POWER 5

Is this reproducible?(YES)
    If so, how long does it (did it) take to reproduce it?(5)
    Describe the steps:
 
   step1) install FC6 through network(NFS)
   step2) reboot the system
   step3) will get the bug.

Is the system (not just the application) hung? YES.
    If so, describe how you determined this:
    The kernel stopped during startup and gave no response.

Did the system produce an OOPS message on the console?(YES)
    If so, copy it here:
Welcome
Welcome to yaboot version 1.3.13 (Red Hat 1.3.13-2.fc6)
Enter "help" to get some basic usage information
boot: linux
Please wait, loading kernel...
   Elf32 kernel loaded...
Loading ramdisk...
ramdisk loaded at 04000000, size: 1600 Kbytes
OF stdout device is: /pci@400000000110/isa@3/serial@i3f8
command line: ro console=ttyS0 rhgb quiet root=LABEL=/
memory layout at init:
  alloc_bottom : 04190000
  alloc_top    : 00000000
  alloc_top_hi : 00000000
  rmo_top      : 00000000
  ram_top      : 00000000
Looking for displays
alloc_down() called with mem not initialized
EXIT called ok
0 >

Is the system sitting in a debugger right now? NO.
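
In the console log above, every "memory layout at init" value except alloc_bottom is zero, right before "alloc_down() called with mem not initialized". The following is a minimal Python sketch for pulling those values out of a captured console log so that failing and working boots can be compared. The function names and the "zero means uninitialized" heuristic are assumptions made for illustration, not part of any Fedora or kernel tool.

```python
import re

# Field names as they appear under "memory layout at init:" in the logs.
LAYOUT_LINE = re.compile(
    r"^\s*(alloc_bottom|alloc_top|alloc_top_hi|rmo_top|ram_top)\s*:\s*([0-9a-fA-F]+)\s*$"
)


def parse_memory_layout(log_text):
    """Return {field: int} for the memory layout lines found in a console log."""
    layout = {}
    for line in log_text.splitlines():
        match = LAYOUT_LINE.match(line)
        if match:
            layout[match.group(1)] = int(match.group(2), 16)
    return layout


def zeroed_fields(layout):
    """Names of layout fields that are zero (hypothetical heuristic only)."""
    return sorted(name for name, value in layout.items() if value == 0)


# Values copied from the failing boot in the problem description.
SAMPLE = """\
memory layout at init:
  alloc_bottom : 04190000
  alloc_top    : 00000000
  alloc_top_hi : 00000000
  rmo_top      : 00000000
  ram_top      : 00000000
"""

layout = parse_memory_layout(SAMPLE)
print(zeroed_fields(layout))  # -> ['alloc_top', 'alloc_top_hi', 'ram_top', 'rmo_top']
```

Running the same parser over a log where alloc_top and ram_top are non-zero returns an empty list, which makes the difference between the two kinds of boot easy to spot in a batch of logs.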

----------------------------------------------------------------------------

Hi all,

same for me on a ppc64 box. In my case it is a JS21.
I tested it on a few JS21s with different configurations and firmware levels.
And I use http as preferred media. AND: I always used VNC.
Always the same issue.

A few mins ago I used SOL instead of VNC and it works. Interesting.

Torsten
----------------------------------------------------------------------------
Hi all,

today I made a second test - again with a JS21.
And again with SOL. And now I got the same hang as on a VNC installation.
-----------

boot: linux
Please wait, loading kernel...
   Elf32 kernel loaded...
Loading ramdisk...
ramdisk loaded at 02700000, size: 2311 Kbytes
OF stdout device is: /vdevice/vty@30000000
command line: root=/dev/VolGroup00/LogVol00 ro console=hvc0 rhgb quiet
memory layout at init:
  alloc_bottom : 02942000
  alloc_top    : 30000000
  alloc_top_hi : ff000000
  rmo_top      : 30000000
  ram_top      : ff000000
Looking for displays
instantiating rtas at 0x076cd000 ... done
00000000 : boot cpu     00000000
00000001 : starting cpu hw idx 00000001... done
00000002 : starting cpu hw idx 00000002... done
00000003 : starting cpu hw idx 00000003... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x02b43000 -> 0x02b4429b
Device tree struct  0x02b45000 -> 0x02b60000
Calling quiesce ...
returning from prom_init
DEFAULT CATCH!, exception-handler=fff00300
at   %SRR0: 0000000000c3c23c   %SRR1: 8000000000003002
Call History
------------
@  - c3c1d0
find-method  - c46b4c
(poplocals)  - c3a738
$call-method  - c46c04
(poplocals)  - c3a738
key-fillq  - c4717c
?xoff  - c47278
(poplocals)  - c3a738
(stdout-write)  - c478a4
(type)  - c47930
_syscatch  - c4dce4
_exception  - c4d258
<excp>  - c3982c
_syscatch  - c4dc48
_syscatch  - c4dc48
invalid pointer - 0

Client's Fix Pt Regs:
 00 ffffffffff07d198 ffffffffff07d198 00000000deadbeef fffffffffffffffc
 04 0000000000000000 ffffffffffffffff fffffffc00000000 0000000000000001
 08 0000000008000000 0000000000000044 00000000003ff000 80000000000fcc58
 0c 0000000022000082 0000000000000000 0000000000000000 0000000000000000
 10 0000000000e8cc90 0000000000e8cc90 0000000000c4694c 0000000000c46b4c
 14 0000000000000000 0000000001bfff81 00000000024799b0 0000000002479bcc
 18 0000000000c13000 0000000000c38000 0000000000c14f40 0000000000c16fc0
 1c 0000000000c20000 0000000000c3fce0 0000000000c11f98 0000000000c10fd0
Special Regs:
    %IV: 00000300     %CR: 82000082    %XER: 00000000  %DSISR: 08000000
  %SRR0: 0000000000c3c23c   %SRR1: 8000000000003002
    %LR: 0000000000c3c1d0    %CTR: 0000000000000000
   %DAR: ffffffffff07d198
Virtual PID = 0
 ofdbg
0 >
 ----------------------------------------------------------------------------
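
To make the firmware register dump above easier to read, here is a small Python sketch that extracts the "Special Regs" values and names the interrupt vector. It assumes that %IV holds the interrupt vector offset (consistent with "exception-handler=fff00300" in the dump) and uses the standard PowerPC vector offsets; the table is deliberately incomplete and the function names are hypothetical.

```python
import re

# Standard PowerPC interrupt vector offsets (small, non-exhaustive subset).
VECTORS = {
    0x100: "system reset",
    0x200: "machine check",
    0x300: "data storage (DSI)",
    0x400: "instruction storage (ISI)",
    0x600: "alignment",
    0x700: "program",
}

# Matches entries such as "%IV: 00000300" or "%SRR0: 0000000000c3c23c".
REG = re.compile(r"%(\w+):\s*([0-9a-fA-F]+)")


def parse_special_regs(dump):
    """Return {register name: int} for every %NAME: value pair in the dump."""
    return {name: int(value, 16) for name, value in REG.findall(dump)}


def describe(regs):
    """One-line summary: vector name plus SRR0 and DAR when present."""
    parts = [VECTORS.get(regs.get("IV"), "unknown vector")]
    for name in ("SRR0", "DAR"):
        if name in regs:
            parts.append(f"{name}=0x{regs[name]:x}")
    return ", ".join(parts)


# Register values copied from the dump above.
DUMP = """\
Special Regs:
    %IV: 00000300     %CR: 82000082    %XER: 00000000  %DSISR: 08000000
  %SRR0: 0000000000c3c23c   %SRR1: 8000000000003002
    %LR: 0000000000c3c1d0    %CTR: 0000000000000000
   %DAR: ffffffffff07d198
"""

print(describe(parse_special_regs(DUMP)))
```

Under these assumptions the dump reads as a data storage interrupt: the instruction at SRR0 touched the address in DAR, which lines up with the "invalid pointer" line in the call history.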

Red Hat,
Mirroring this bug for your awareness.  We are trying to gather install logs.
-thanks

Comment 1 IBM Bug Proxy 2006-11-07 21:51:06 UTC
Created attachment 140606 [details]
fc6_working.tar

Comment 2 IBM Bug Proxy 2006-11-07 21:51:12 UTC
----- Additional Comments From TBLOTH.com  2006-11-07 16:44 EDT -------
 
FC6 install logs - working 

Comment 3 IBM Bug Proxy 2006-11-07 21:53:16 UTC
----- Additional Comments From TBLOTH.com  2006-11-07 16:45 EDT -------
Aloha,

please find above my first bunch of logs for the WORKING boot.
The current firmware is Global Firmware (PHYP) MB245_300_000. 

Comment 4 Peter Jones 2006-11-08 19:38:22 UTC
How you guys came to the conclusion that this should be filed against grub is
beyond me.

Can you boot without "rhgb quiet" and get a log of the boot sequence that way?

For now, I'm reassigning this to kernel.  If the logs show something else, we
can reassign it again.

Comment 5 IBM Bug Proxy 2006-11-08 21:33:15 UTC
Created attachment 140718 [details]
FC6 install logs - not working

please find attached the logs of failed boot after a successful installation.
--------
Elapsed time since release of system processors: 0 mins 50 secs

Config file read, 1024 bytes
Welcome
Welcome to yaboot version 1.3.13 (Red Hat 1.3.13-2.fc6)
Enter "help" to get some basic usage information
boot: linux
Please wait, loading kernel...
   Elf32 kernel loaded...
Loading ramdisk...
ramdisk loaded at 02700000, size: 1469 Kbytes
OF stdout device is: /vdevice/vty@30000000
command line: ro console=hvc0 rhgb quiet root=LABEL=/1
memory layout at init:
  alloc_bottom : 02870000
  alloc_top    : 30000000
  alloc_top_hi : ff000000
  rmo_top      : 30000000
  ram_top      : ff000000
Looking for displays
instantiating rtas at 0x0764c000 ... done
00000000 : boot cpu	00000000
00000001 : starting cpu hw idx 00000001... done
00000002 : starting cpu hw idx 00000002... done
00000003 : starting cpu hw idx 00000003... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x02b71000 -> 0x02b7229b
Device tree struct  0x02b73000 -> 0x02b8f000
Calling quiesce ...
returning from prom_init

--------
This problem seems to only occur on 32 bit and not 64 bit.
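
Since the failing boots all show yaboot's "Elf32 kernel loaded..." line, a captured console log can be triaged automatically. The sketch below is a minimal Python helper for that; it assumes the 64-bit case prints the same line with "Elf64" (later comments only say the working systems report booting Elf64), and the function name is made up for illustration.

```python
import re

# yaboot prints "Elf32 kernel loaded..." in the failing logs; the Elf64
# wording is assumed to follow the same pattern.
ELF_LINE = re.compile(r"\bElf(32|64) kernel loaded")


def loaded_kernel_bits(console_log):
    """Return 32 or 64 from yaboot's 'ElfNN kernel loaded' line, or None."""
    match = ELF_LINE.search(console_log)
    return int(match.group(1)) if match else None


failing = "Please wait, loading kernel...\n   Elf32 kernel loaded...\nLoading ramdisk...\n"
print(loaded_kernel_bits(failing))  # -> 32
```

A result of 32 on ppc64 hardware is the signature reported in this bug; None means the log did not reach the yaboot kernel load step.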

Comment 6 IBM Bug Proxy 2006-11-14 02:20:56 UTC
----- Additional Comments From marksmit.com  2006-11-13 21:16 EDT -------
My bug got dup'd to this one. Pardon if I am re-stating what is already known, 
but I could not find it summarized in this bug.
My power5 ppc64 machines and added ones at OSDL are experiencing this problem.
Not all lpars hit this problem, but what separates the successful ones from 
the recreates is the kernel it is trying to boot.
All of the lpars that succeed are booting Elf64.
All of the lpars that recreate this bug are attempting to boot elf32.
I have an OpenPower 710 that has one lpar succeeding and another lpar 
recreating this bug.
By booting in rescue mode and copying the successful kernel and initrd 
from /boot onto the victim, this bug no longer recreates - and it says it is 
booting Elf64.

This bug does not recreate on the Oct12 snapshot of rawhide, but does recreate 
on the FC6-gold that followed a few days later.
What puzzles me is what is causing the install process to install a 32bit 
kernel in one lpar and the 64bit kernel on another lpar in the same system.

Comment 7 IBM Bug Proxy 2006-11-14 02:40:59 UTC
----- Additional Comments From marksmit.com  2006-11-13 21:37 EDT -------
I can work around the problem, with 
rpm -Uhv kernel-2.6.18-1.2798.fc6.ppc64.rpm --force
I copied that rpm over to the lpar when I was network-booted in rescue mode.  
I found that copying over the kernel and initrd (from a working system) got me 
booted, but all the modules, including networking, were broken until I 
reinstalled the kernel rpm and rebooted. 

Comment 8 IBM Bug Proxy 2006-11-14 03:51:10 UTC
----- Additional Comments From marksmit.com  2006-11-13 22:45 EDT -------
Interesting recreate:
original victim is booted using the workaround just posted.
I then yum update to install the Nov10 FC6 updates just available.
Upon reboot, the bug is recreated, this time with the new kernel, and it is 
attempting to load Elf32 again.

The 2nd lpar in this system - the lpar that never recreated - successfully did 
the same yum update, but it installed and boots Elf64.

If any logs from the yum update would assist, please let me know. 

Comment 9 David Woodhouse 2006-11-16 08:28:16 UTC
Refiling against anaconda. We shouldn't be installing 32-bit kernels on 64-bit
hardware.

Please reproduce with anaconda debugging (however that's done).

If yum also installs the wrong kernel, please file a separate bug against yum.

Comment 10 Paul Nasrat 2006-11-16 12:42:48 UTC
Do we have an lpar that is exhibiting this with FC6?

Comment 11 David Woodhouse 2006-11-16 12:59:18 UTC
Not in-house. I haven't seen this on any LPAR FC6 installs. Feel free to blow
away uranus.cambridge in an attempt to reproduce though, assuming dhowells concurs.

Comment 12 Paul Nasrat 2006-11-16 13:00:43 UTC
Janice, have you seen this on any of the Westford machines?

Comment 13 David Howells 2006-11-16 13:25:09 UTC
I've no objection to uranus being reinstalled.

Comment 14 Janice Girouard - IBM on-site partner 2006-11-16 17:14:57 UTC
Paul, I have not installed FC6 on the typical systems in the pool.  I did install
a cellblade the other day, and this problem did not appear.  For that system,
FC6 gold (/vol/engineering/redhat/released/FC-6/GOLD) was installed. 

My notes show that you were the last person who requested squad6-lp1.  I would
expect this to be available for testing.  I noticed that someone installed 5.0
on this about 5 days ago.  Was this you?     

Janice

Comment 16 David Woodhouse 2006-11-22 18:24:04 UTC
*** Bug 213092 has been marked as a duplicate of this bug. ***

Comment 17 Andrew Cagney 2006-11-22 21:33:48 UTC
*** Bug 214902 has been marked as a duplicate of this bug. ***

Comment 18 Mark Smith 2006-12-15 19:23:30 UTC
I would like to offer "kiwi", the OpenPower 710, to help.
It is at IBM, but I can make it available for debugging.
I just recently re-installed FC6 and this continues to recreate in one lpar 
and not recreate in a 2nd lpar on the same machine. (cannot tell what is 
causing the diff)
The failing lpar has had the workaround applied and is now up on the network.
I can give access to Janice and let her investigate, or can collect logs, etc 
and attach them.

Comment 19 IBM Bug Proxy 2006-12-21 05:35:58 UTC
----- Additional Comments From marksmit.com  2006-12-21 00:31 EDT -------
On my VIO client that recreates easily with FC6-gold and updates, I am now 
observing a successful "yum update" and reboot with the Dec 15 FC6 updates 
repository.
Specifically, reinstalling with FC6-gold - it recreates.
On Dec 15, updated to kernel-2.6.18-1.2798.fc6.ppc64.rpm and it recreated.
On Dec 20, updated to kernel-2.6.18-1.2868.fc6.ppc64.rpm and it no longer 
recreates.
(note: when updating, I ran "yum update" so it was not just the kernel, but a 
whole slew of updates being applied.)

Is anyone else seeing a similar change?
I will re-attempt recreate including another FC6-gold install, just to be sure. 

Comment 20 IBM Bug Proxy 2006-12-22 06:55:47 UTC
----- Additional Comments From marksmit.com  2006-12-22 01:52 EDT -------
Recreate conclusions:
The most significant selections are
a. software selection - in addition to default "core" offering, check the 
boxes to include "development system" and "web server"
b. yum install kernel-2.6.18-1.2849.fc6   recreated on systems already running 
the ppc64 kernel (however the sampling is limited to lpars that previously 
recreated)

After dozens of FC6-gold recreate attempts - manual reinstalls of 2 lpars:
1. OpenPower 710 lpar 0.5 proc units, 2GB mem, vscsi - 1 disk, virtual ethernet
2. p5-550, 4 dedicated procs, 8GB mem, IPR scsi disks, e1000 ethernet, lpar 
contains all system resources.
 (i.e. very different lpar configs, but it recreates on both)

network boot (yaboot, vmlinuz and ramdisk from ppc/ppc64 dir on server)
network install: server has expanded RPMs into a dir; not using ISO's 

a. Most installs were done on the first disk only (i.e. /dev/sda); once I 
de-selected all disks but sda, I always accepted the default auto-partition scheme.
b. "remove-all partitions" on sda versus "remove-Linux partitions" (default 
offering) on sda does not seem to matter
c. One sample was recreated using all the IPR disks (sda-sdr), accepting the 
default partitioning offering.

text versus vnc does not matter
nfs versus ftp (anonymous) does not seem to matter
static eth versus dhcp eth does not matter
kickstart files were not used

recreates using yum install:
There were a few nfs install attempts that succeeded (attempting to install 
just the "core" default software selection, for example) where the proper 
ppc64 kernel did install (ie. no bug recreate).  In those cases, I could still 
do a specific yum install that would install the 32bit ppc kernel.
Use the (older) Nov12 kernel in the yum repository thus:
yum install kernel-2.6.18-1.2849.fc6 

Comment 21 IBM Bug Proxy 2006-12-22 14:22:08 UTC
----- Additional Comments From marksmit.com  2006-12-22 09:16 EDT -------
Installed both, picking only the "core" software offered by default.
Both successfully installed the ppc64 kernel.
Upon boot after install, then
yum install kernel  (recreates, installing the 4MB ppc 32bit kernel)
yum update kernel   (succeeds, installing the 6MB ppc64 kernel) 

Comment 22 IBM Bug Proxy 2006-12-22 21:46:03 UTC
----- Additional Comments From marksmit.com  2006-12-22 16:42 EDT -------
Even on a system that I cannot get to recreate this problem at install, I run:
yum install kernel-2.6.18-1.2849.fc6   
and it will recreate.  In the transaction, it only offers to download and 
install one package.  Whereas
yum install kernel
will pull the newer kernel version, but offer to download 2 packages - the ppc 
and ppc64 packages, and the resulting install is the correct ppc64 kernel.

So in theory this should be recreatable on any ppc64 system installed with FC6.  
Just run: 
yum install kernel-2.6.18-1.2849.fc6 

Comment 23 IBM Bug Proxy 2006-12-22 21:50:45 UTC
----- Additional Comments From marksmit.com  2006-12-22 16:46 EDT -------
Oops.  correction to previous post:
On this system that does not recreate this problem at install, 
yum install kernel-2.6.18-1.2849.fc6   recreates, but so does
yum install kernel
It will offer to download 2 packages in the 2nd case, but still chooses to 
install the ppc 32bit one. 
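
To confirm which kernel architecture yum actually left on a system, the installed kernel packages can be compared against the machine architecture. The sketch below is a hedged Python illustration: it assumes the input lines come from something like rpm -q kernel --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' and that the machine architecture is what uname -m reports (ppc64 on these systems). The function name and the sample package list are hypothetical.

```python
def mismatched_kernels(machine_arch, installed):
    """Return (name-version-release, arch) pairs whose arch differs from machine_arch.

    installed: iterable of lines such as 'kernel-2.6.18-1.2849.fc6.ppc'.
    """
    mismatches = []
    for line in installed:
        line = line.strip()
        if not line:
            continue
        # The architecture is everything after the last dot.
        nvr, _, arch = line.rpartition(".")
        if arch != machine_arch:
            mismatches.append((nvr, arch))
    return mismatches


# Illustrative package list, not taken from a real system.
sample = [
    "kernel-2.6.18-1.2849.fc6.ppc",
    "kernel-2.6.18-1.2868.fc6.ppc64",
]
print(mismatched_kernels("ppc64", sample))
# -> [('kernel-2.6.18-1.2849.fc6', 'ppc')]
```

This check is only meaningful for the kernel package, where a ppc build on ppc64 hardware is the failure described here; it says nothing about why yum selected that architecture.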

Comment 24 Jeremy Katz 2007-01-03 18:47:23 UTC
Fixed with yum 3.0.2 or later

Comment 25 IBM Bug Proxy 2007-01-12 22:26:16 UTC
----- Additional Comments From marksmit.com  2007-01-12 17:23 EDT -------
This still recreates for the victim that originally recreated at install
(FC6-gold installs 32bit kernel).
The scenario that recreates:
# yum install kernel-2.6.18-1.2849.fc6

this recreated with 
yum-3.0.1-2.fc6.noarch.rpm        
and again when I enabled the /etc/yum.repos.d/fedora-updates-testing.repo
# yum update yum
to yum-3.0.3-1.fc6.noarch.rpm 

Comment 26 IBM Bug Proxy 2007-02-08 18:12:30 UTC
Reopening per comment 25 as this problem was recreated using
yum-3.0.3-1.fc6.noarch.rpm

Comment 27 IBM Bug Proxy 2007-02-08 18:15:17 UTC
----- Additional Comments From rosalesa@us.ibm.com (prefers email at rosalesa@austin.ibm.com)  2007-02-08 13:11 EDT -------
Reopening as this was recreated in:
yum-3.0.3-1.fc6.noarch.rpm

Comment 28 IBM Bug Proxy 2007-02-13 19:30:28 UTC
----- Additional Comments From jklewis.com  2007-02-13 14:27 EDT -------
I have this problem on just one of my Cell blades, it has a hardware revision of
40. My Cells that install and run FC6 Gold just fine have a revision of 31. Will
that help solve this problem?

Also, while trying to recover this manually (how is that done, BTW?), I now can't
boot anything, not even kernels that used to boot. I get "Not a valid ELF image"
on every kernel. Did I somehow mess up my yaboot.conf file? I have a copy of it
if needed. 

Unless I missed something this defect should have a much higher severity level. 

Comment 29 IBM Bug Proxy 2007-02-13 23:50:19 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jklewis.com




------- Additional Comments From jklewis.com  2007-02-13 18:45 EDT -------
Please ignore what I said about "Not a valid ELF image" in Comment #44.
Boneheaded mistake made by me in yaboot.conf.

My system is finally up and running fine. I was able to boot into Rescue mode
and scp the relevant kernel RPM over. After installing it I had files in /boot
with both fc6smp and fc6 in the name. Using the "file" command I was able to
find which ones were 64 bit (the fc6 ones) and after updating yaboot.conf the
install proceeded normally.
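
The check done with the "file" command above can also be scripted. The following Python sketch reads the ELF identification bytes directly: per the ELF format, the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. It assumes the images in /boot are plain ELF files whose names start with "vmlinuz", as the yaboot messages suggest; the function names are invented for this example.

```python
import os

ELF_MAGIC = b"\x7fELF"
EI_CLASS = 4  # index of the class byte within e_ident
CLASS_NAMES = {1: "ELF32", 2: "ELF64"}


def elf_class(path):
    """Return 'ELF32', 'ELF64', 'unknown', or None if the file is not ELF."""
    with open(path, "rb") as handle:
        ident = handle.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return CLASS_NAMES.get(ident[EI_CLASS], "unknown")


def classify_boot_images(directory="/boot"):
    """Map each vmlinuz* file name in directory to its ELF class."""
    results = {}
    for name in sorted(os.listdir(directory)):
        if name.startswith("vmlinuz"):
            results[name] = elf_class(os.path.join(directory, name))
    return results
```

Calling classify_boot_images() from rescue mode, with the directory pointed at the installed system's /boot, would show which image to reference in yaboot.conf.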

So, it's not very clear where we are on this defect. Something, I don't know
what, is obviously installing the wrong kernel, but ONLY on certain systems. I
currently have some Cell blades that install fine, and one that doesn't, so if I
can help with this let me know. 

Comment 30 Jeremy Katz 2007-02-26 18:42:51 UTC
Huh?  Comment #25 won't do anything at all to help the fact that the install was
wrong to begin with.  

Comment 31 IBM Bug Proxy 2007-02-26 19:00:23 UTC
----- Additional Comments From jklewis.com  2007-02-26 13:59 EDT -------
It's still not clear to me where we are on this one, and also why the severity
is not higher. Has anyone been able to reproduce this in Fedora 7?

I had mentioned earlier that I had one Cell blade that exhibits this wrong
behavior, and several that don't. Unfortunately, the blade that has the install
failure is no longer working properly and I have to send it in for repair.