Bug 440654 - Installation or reboot failures of js20 blade
Summary: Installation or reboot failures of js20 blade
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: powerpc
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steve Best
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 461304
Reported: 2008-04-04 12:40 UTC by Vivek Goyal
Modified: 2010-09-14 03:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-09-14 03:54:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Watchdog Log file (67.92 KB, application/octet-stream)
2008-04-04 16:40 UTC, Brad Peters
Installer log (15.81 KB, text/plain)
2008-04-04 16:41 UTC, Brad Peters
Panic with 2.6.9-88.EL (10.00 KB, text/plain)
2009-04-14 19:26 UTC, Jeff Burke


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 43831 0 None None None Never

Description Vivek Goyal 2008-04-04 12:40:39 UTC
Description of problem:

This js20 blade is showing erratic behavior. Sometimes installation of RHEL4U6
as well as RHEL4U7 will fail. If installation is successful, the blade then
fails on subsequent reboots.

We see the issue only on this particular machine.
 
ibm-js20-1.test.redhat.com

The problem happens when mounting the root file system.

- Once it said partition table is corrupted.
- Now it is complaining about file system inconsistency

Some failure log links:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2524169
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2506636


Version-Release number of selected component (if applicable):

RHEL4U6

How reproducible:

Pretty frequent

Steps to Reproduce:
1.
2.
3.
  
Actual results:

Reboot of the machine after installation fails.

Expected results:


Additional info:

This might very well be a bad hardware issue, but we don't know at this
point in time.

Comment 1 Ed Pollard 2008-04-04 14:24:31 UTC
Reassigning to Brad. I will be glad to help out, but Power is his hardware.

Brad, let me know if you need something on this.

Comment 2 Brad Peters 2008-04-04 16:39:35 UTC
Looking through the watchdog logs, this appears to be caused by a simple HDD
failure.  Note the following:

(...)
Checking root filesystem
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00 
/dev/VolGroup00/LogVol00 contains a file system with errors, check forced.
/dev/VolGroup00/LogVol00: Inode 4866129 has a bad extended attribute block
9732616.  

/dev/VolGroup00/LogVol00: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
	(i.e., without -a or -p options)
[FAILED]


*** An error occurred during the file system check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell.

(...)


I think we need to replace the disk and see if this problem goes away.

Second opinions would be welcome.

-Brad
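
The excerpt above does not show the exit status that fsck.ext3 returned. As a minimal sketch, assuming the status follows the bit values documented in fsck(8), the helper below decodes such a status into readable conditions. An "UNEXPECTED INCONSISTENCY" abort would be expected to set the "errors left uncorrected" bit (4), but that is an assumption, not something the watchdog log confirms. The function name `decode_fsck_status` is hypothetical.

```python
# Exit-status bits as documented in the fsck(8) manual page.
FSCK_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the list of conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    # The status is a bitmask, so several conditions may be set at once.
    return [text for bit, text in sorted(FSCK_BITS.items()) if status & bit]


if __name__ == "__main__":
    for status in (0, 1, 4, 6):
        print(status, decode_fsck_status(status))
```

A boot script could use such a decoder to log why it dropped to a maintenance shell instead of only printing [FAILED].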

Comment 3 Brad Peters 2008-04-04 16:40:41 UTC
Created attachment 301020 [details]
Watchdog Log file

Shows apparent HDD failure

Comment 4 Brad Peters 2008-04-04 16:41:02 UTC
Created attachment 301028 [details]
Installer log

Comment 5 Jeff Burke 2008-04-08 17:33:22 UTC
Brad,
  This has happened again, with the same signature.

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:04.1
AMD8111: chipset revision 3
AMD8111: 0000:00:04.1 (rev 03) UDMA133 controller
AMD8111: 100% native mode on irq 32
    ide0: BM-DMA at 0x7c00-0x7c07, BIOS settings: hda:pio, hdb:pio
    ide1: BM-DMA at 0x7c08-0x7c0f, BIOS settings: hdc:pio, hdd:pio
hda: TOSHIBA MK6026GAXB, ATA DISK drive
Using cfq io scheduler
ide0 at 0x7400-0x7407,0x6c02 on irq 32
hda: max request size: 128KiB
hda: 117210240 sectors (60011 MB), CHS=65535/16/63, UDMA(33)
 hda: unknown partition table               <---------Note first sign of failure
ide-floppy driver 0.99.newide
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
NET: Registered protocol family 2
IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 262144 (order: 10, 4194304 bytes)
TCP: Hash tables configured (established 262144 bind 262144)
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Freeing unused kernel memory: 216k freed
Red Hat nash version 4.2.1.13 starting
Mounted /proc filesystem
Mounting sysfs
Creating /dev
Starting udev
Loading scsi_mod.ko module
SCSI subsystem initialized
Loading sd_mod.ko module
Loading scsi_transport_fc.ko module
Loading qla2xxx.ko module
QLogic Fibre Channel HBA Driver
Loading qla2300.ko module
qla2300 0000:01:01.0: Found an ISP2312, irq 40, iobase 0xe000000080000000
qla2300 0000:01:01.0: Configuring PCI space...
qla2300 0000:01:01.0: Configure NVRAM parameters...
qla2300 0000:01:01.0: Verifying loaded RISC code...
qla2300 0000:01:01.0: Extended memory detected (512 KB)...
qla2300 0000:01:01.0: Resizing request queue depth (2048 -> 4096)...
qla2300 0000:01:01.0: Waiting for LIP to complete...
qla2300 0000:01:01.0: LOOP UP detected (2 Gbps).
qla2300 0000:01:01.0: Topology - (F_Port), Host Loop address 0xffff
scsi0 : qla2xxx
qla2300 0000:01:01.0: 
 QLogic Fibre Channel HBA Driver: 8.01.07-d4-rhel4.7-01
  QLogic IBM FCEC - 
  ISP2312: PCI-X (133 MHz) @ 0000:01:01.0 hdma-, host#=0, fw=3.03.20 IPX
qla2300 0000:01:01.1: Found an ISP2312, irq 41, iobase 0xe000000080001000
qla2300 0000:01:01.1: Configuring PCI space...
qla2300 0000:01:01.1: Configure NVRAM parameters...
qla2300 0000:01:01.1: Verifying loaded RISC code...
qla2300 0000:01:01.1: Extended memory detected (512 KB)...
qla2300 0000:01:01.1: Resizing request queue depth (2048 -> 4096)...
qla2300 0000:01:01.1: Waiting for LIP to complete...
qla2300 0000:01:01.1: Cable is unplugged...
scsi1 : qla2xxx
qla2300 0000:01:01.1: 
 QLogic Fibre Channel HBA Driver: 8.01.07-d4-rhel4.7-01
  QLogic IBM FCEC - 
  ISP2312: PCI-X (133 MHz) @ 0000:01:01.1 hdma-, host#=1, fw=3.03.20 IPX
Loading dm-mod.ko module
device-mapper: 4.5.5-ioctl (2006-12-01) initialised: dm-devel
Loading jbd.ko module
Loading ext3.ko module
Loading dm-mirror.ko module
Loading dm-zero.ko module
Loading dm-snapshot.ko module
Making device-mapper control node
Scanning logical volumes
  Reading all physical volumes.  This may take a while...
  No volume groups found
Activating logical volumes
  Volume group "VolGroup00" not found
ERROR: /bin/lvm exited abnormally! (pid 470)
Creating root device
Mounting root filesystem
mount: error 6 mounting ext3
mount: error 2 mounting none

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2579448

If you believe that this is a hardware issue, can you please have the hardware
sent to the RDU lab?

We have hit the RHEL4.7 kernel freeze and we are not sure if this is hardware or
software.
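
To check how often this signature shows up across saved console logs, something like the following could be used. It is a minimal sketch: the patterns are taken from the messages quoted in this report, while the labels and the function name `scan_console` are illustrative assumptions.

```python
import re

# Failure signatures quoted in this report; the labels are illustrative.
SIGNATURES = [
    ("partition-table", re.compile(r"unknown partition table")),
    ("lvm-vg-missing", re.compile(r'Volume group "[^"]+" not found')),
    ("root-mount", re.compile(r"mount: error \d+ mounting")),
    ("fsck", re.compile(r"UNEXPECTED INCONSISTENCY")),
    ("panic", re.compile(r"Kernel panic")),
]


def scan_console(lines):
    """Yield (line_number, label, text) for each line matching a signature."""
    for number, line in enumerate(lines, start=1):
        for label, pattern in SIGNATURES:
            if pattern.search(line):
                yield number, label, line.strip()


if __name__ == "__main__":
    sample = [
        " hda: unknown partition table",
        "Loading dm-mod.ko module",
        '  Volume group "VolGroup00" not found',
        "mount: error 6 mounting ext3",
    ]
    for hit in scan_console(sample):
        print(hit)
```

Running it over each console.txt and counting hits per label would show whether the partition-table failure always precedes the LVM and mount errors.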

Comment 6 IBM Bug Proxy 2008-05-21 21:08:52 UTC
------- Comment From bpeters.com 2008-05-21 17:03 EDT-------
Jeff, could you provide details as to the system you saw this on?  Was it the
same one on which Vivek saw this problem?

Comment 7 IBM Bug Proxy 2008-06-19 16:00:40 UTC
------- Comment From bpeters.com 2008-06-19 11:57 EDT-------
My best guess is that this is a simple HDD failure.  Tracking down the failing
disk is a hands-on job, but should be reasonably simple given light-path
diagnostics.  I recommend you contact your local RDU equivalent of the Westford
engineering.  If they refuse to support this box, then send an email
to me and Mark Wisner (who is onsite and may be able to assist).

Comment 9 RHEL Program Management 2008-09-03 13:16:31 UTC
Updating PM score.

Comment 11 Jeff Burke 2009-01-12 18:01:19 UTC
Subhendu,
   Brad Peters is no longer here at Red Hat. He was the onsite partner for IBM but was replaced by Ameet Paranjape <aparanja>

Jeff

Comment 13 Mike Gahagan 2009-02-16 19:39:09 UTC
Did this turn out to be hardware? If so, can we go ahead and close this bug?

Comment 20 Jeff Burke 2009-04-14 19:26:53 UTC
Created attachment 339558 [details]
Panic with 2.6.9-88.EL

Switching to new root
exec of init (/bin/sh) failed!!!: 5
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!
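
The trailing numbers in these messages (5 and 2 here, 6 and 22 elsewhere in this report) look like raw errno values, although the report does not say so explicitly. Under that assumption, this minimal sketch maps them to their symbolic names and messages using Python's standard `errno` module. The helper name `describe_errno` is hypothetical.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # Values that appear in the console output quoted in this report.
    for code in (2, 5, 6, 22):
        name, message = describe_errno(code)
        print(code, name, message)
```

If the assumption holds, error 5 is EIO (an I/O error while executing init) and error 6 is ENXIO (the root device was not present).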

Comment 21 Ameet Paranjape 2009-04-15 14:55:20 UTC
I've seen the lvm scan fail intermittently on this JS20:

<snip...>
  Reading all physical volumes.  This may take a while...
Activating logical volumes
  Volume group "VolGroup00" not found
ERROR: /bin/lvm exited abnormally! (pid 471)
Creating root device
Mounting root filesystem
mount: error 6 mounting ext3
mount: error 2 mounting none
Switching to new root
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!

A manual filesystem check passed and I don't see any hardware complaints in the kernel logs either, so I thought I would try updating the blade FW to the latest level (it was more than 2 years old):  

http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-55553&brandind=5000020

I'm running a cron script to see whether the boot failure recurs.

Also, I am not sure what effect the old RAID configuration has on the disk layout, but I am continuing to investigate that.
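
The cron script itself is not attached to this report. As a rough sketch of one way such a reboot loop could work, assuming it is started at boot (for example from an `@reboot` crontab entry), the code below counts successful boots in a state file and reboots again until a limit is reached. The state-file path, the limit of 50, and the function names are all hypothetical.

```python
import os
import subprocess
import sys


def record_boot(state_path):
    """Increment and return the boot counter stored in state_path."""
    count = 0
    if os.path.exists(state_path):
        with open(state_path) as handle:
            count = int(handle.read().strip() or 0)
    count += 1
    with open(state_path, "w") as handle:
        handle.write(str(count))
    return count


def should_reboot(count, limit):
    """Keep cycling until the requested number of boots has succeeded."""
    return count < limit


if __name__ == "__main__":
    # Hypothetical default location for the counter file.
    state = sys.argv[1] if len(sys.argv) > 1 else "/var/tmp/boot-count"
    limit = 50  # hypothetical number of reboot cycles
    boots = record_boot(state)
    if should_reboot(boots, limit):
        subprocess.call(["/sbin/shutdown", "-r", "now"])
```

Because the script only runs after a successful boot, a counter that stops short of the limit indicates the cycle on which the blade failed to come up.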

Comment 24 Jeff Burke 2009-04-21 12:27:31 UTC
Ameet,
   It happened again on last night's kernel: 2.6.9-89.EL. This time it just hung.

http://rhts.redhat.com/testlogs/55163/185171/1548457/console.txt

Activating logical volumes
  2 logical volume(s) in volume group "VolGroup00" now active
Creating root device
Mounting root filesystem
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Switching to new root
INIT: version 2.85 booting
INIT: No inittab file found

   All the other systems seem to be fine. Is there any way we can pull this thing from RHTS until we get some answers about what is going wrong?

Thanks,
Jeff

Comment 27 Kevin W Monroe 2009-12-01 22:10:43 UTC
Reassigning to Steve Best, the new IBM on-site partner.

Comment 28 IBM Bug Proxy 2009-12-01 22:30:54 UTC
------- Comment From mjr.ibm.com 2009-12-01 17:24 EDT-------
It's closed on this side. Does it need to be reopened?

Comment 34 Subhendu Ghosh 2010-09-14 03:54:53 UTC
Closing - NOTABUG - HDD failure

