Bug 36519
Summary: | megaraid doesn't work on HP boards | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | ville.sulko |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brock Organ <borgan> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.1 | CC: | bbrock, howanitz, mduncan, rtodd, tim_clymo |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2003-06-06 00:26:41 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
ville.sulko
2001-04-18 17:24:16 UTC
More information about the bug.

I have now tried to install RH71 on two similar HP lp2000r machines, and on both machines the install either hangs, or completes but leaves a system that won't boot, or that boots but is corrupt (either the on-disk fs or the in-memory kernel, it's hard to tell).

The first time I managed to complete the install and booted, all critical bootup components were intact and I got to the shell, but that's just about it: the filesystem was too badly corrupted to do any sensible work. On the next install(s) I managed to hang the install totally (no visible panics/oopses). Then I managed to complete the install, but the system refused to boot (hung before loading system services). On the next try the install completed and the system booted (one service failed, most likely due to a corrupted binary file), but trying to log in caused an oops...

So, most likely this is a kernel bug, maybe related to the megaraid driver or fs operations. Note that I used both UP (install + boot) and SMP (boot) kernels, with similar results. And on both of these machines RH70 installs just fine, so I think kernel 2.2.x is just fine.

We've tested a megaraid controller in our QA lab extensively and found no
problems. But that doesn't mean _all_ versions of the megaraid controller work.
What version(s) did you try exactly ?
> However,
> I was left with a feeling that kernel 2.4.2 isn't ready for
> wider use.
We've put a LOT of effort in the kernel to make it usable for wider use;
just look at the number of patches for bugfixes in our kernel.
> We've tested a megaraid controller in our QA lab extensively and found no
> problems. But that doesn't mean _all_ versions of the megaraid controller work.
> What version(s) did you try exactly ?

Kernel boot-information was on my first mail, here's lspci -v :

03:00.0 RAID bus controller: American Megatrends Inc.: Unknown device 1960 (rev 20)
        Subsystem: Hewlett-Packard Company: Unknown device 60e8
        Flags: bus master, fast Back2Back, medium devsel, latency 64, IRQ 16
        Memory at f8000000 (32-bit, prefetchable)
        Capabilities: [80] Power Management version 2

and here's just about all information I was able to dig out from BIOS-screens etc :

HP NetRaid-2M (bios version G.01.02)
Firmware version H.01.07
2* RAID processors(?) SDR GEM318
Disks : 2 * HP 80-8C42

> We've put a LOT of effort in the kernel to make it usable for wider use;
> just look at the number of patches for bugfixes in our kernel.

Yes, I'm sure you have done excellent work trying to stabilize 2.4.x. I was mostly referring to Linus 2.4.2, which has had several problems in it (at least according to pre-patch / ac-patch changenotes).

I'm not sure if it is related, but have you seen:
http://netserver.hp.com/netserver/support/hot_news/bpn04056.asp

> I'm not sure if it is related, but have you seen:
> http://netserver.hp.com/netserver/support/hot_news/bpn04056.asp
I was just browsing HPs site for possible information about the problem,
and came across this one too. I haven't checked the controller yet, since
they are packed quite tightly in a rack... However, the servers are brand new
and were (physically) installed just last week by HP, so I suppose they would
have known if the servers had had faulty components on them. I think I don't
dare to open the case myself, but I might verify this from HP the next time they
are around.
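
For anyone comparing their own hardware against the controller described in this report, the `lspci -v` output quoted earlier can be picked apart mechanically. The following is a minimal sketch, not a tool used by anyone in this thread: the function names `parse_lspci` and `hp_branded_ami_raid` are hypothetical, and the sample text is copied from the reporter's output above.

```python
import re

# Sample copied from the `lspci -v` output quoted in this report.
SAMPLE = """\
03:00.0 RAID bus controller: American Megatrends Inc.: Unknown device 1960 (rev 20)
        Subsystem: Hewlett-Packard Company: Unknown device 60e8
        Flags: bus master, fast Back2Back, medium devsel, latency 64, IRQ 16
"""

# A device header starts at column 0 with bus:slot.function, then the class name.
HEADER = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ([^:]+): (.+)$")


def parse_lspci(text):
    """Return one dict per device with its slot, class, name and subsystem."""
    devices = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            devices.append({
                "slot": match.group(1),
                "class": match.group(2),
                "name": match.group(3),
                "subsystem": None,
            })
        elif devices and line.strip().startswith("Subsystem:"):
            # Indented detail lines belong to the most recent device header.
            devices[-1]["subsystem"] = line.split("Subsystem:", 1)[1].strip()
    return devices


def hp_branded_ami_raid(devices):
    """Hypothetical filter: AMI RAID controllers carrying an HP subsystem ID."""
    return [
        d for d in devices
        if "RAID" in d["class"]
        and "American Megatrends" in d["name"]
        and d["subsystem"] is not None
        and "Hewlett-Packard" in d["subsystem"]
    ]


if __name__ == "__main__":
    for dev in hp_branded_ami_raid(parse_lspci(SAMPLE)):
        print(dev["slot"], dev["name"], "/", dev["subsystem"])
```

Matching on the vendor strings is only a convenience; the numeric IDs (`1960`, `60e8`) in the same lines are the more reliable thing to compare between machines.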
I've looked at the HP NetServer LH3000, with megaraid. There's also a symbios/lsi/ncr 53c896 chipset onboard, I don't see it otherwise reported by the system.

/etc/modules.conf contains:

alias eth0 eepro100
alias scsi_hostadapter megaraid
alias scsi_hostadapter1 aic7xxx
alias parport_lowlevel parport_pc
alias scsi_hostadapter2 aic7xxx

The AMI chip is labeled:

AMI 9942LRM HP Proteus2 Version B.0

lspci on that machine reports the following SCSI controllers:

01:03.0 PCI bridge: Intel Corporation 80960RP [i960 RP Microprocessor/Bridge] (rev 01)
01:03.1 SCSI storage controller: Intel Corporation 80960RP [i960RP Microprocessor] (rev 01)
02:07.0 SCSI storage controller: Adaptec AIC-7880U (rev 02)

The Adaptec controller does not have drives attached, and the aic7xxx module is unused (and unloaded without bad consequences to the running system).

System boot reports this:

SCSI subsystem driver Revision: 1.00
megaraid: v1.14g (Release Date: Feb 5, 2001; 11:42)
megaraid: found 0x8086:0x1960:idx 0:bus 1:slot 3:func 1
scsi0 : Found a MegaRAID controller at 0xc8825000, IRQ: 20
megaraid: [:^A^BB ] detected 2 logical drives
scsi0 : AMI MegaRAID 254 commands 16 targs 2 chans 8 luns
scsi0: scanning channel 1 for devices.
  Vendor: HP Model: SAFTE; U160/M BP Rev: 1020
  Type: Processor ANSI SCSI revision: 02
scsi0: scanning channel 2 for devices.
  Vendor: HP Model: SAFTE; U160/M BP Rev: 1020
  Type: Processor ANSI SCSI revision: 02
scsi0: scanning virtual channel for logical drives.
  Vendor: MegaRAID Model: LD0 RAID0 8677R Rev: E
  Type: Direct-Access ANSI SCSI revision: 02
  Vendor: MegaRAID Model: LD1 RAID0 8677R Rev: E
  Type: Direct-Access ANSI SCSI revision: 02
Attached scsi disk sda at scsi0, channel 2, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 2, id 0, lun 1
SCSI device sda: 17770496 512-byte hdwr sectors (9098 MB)
Partition check:
 sda: sda1
SCSI device sdb: 17770496 512-byte hdwr sectors (9098 MB)
 sdb: sdb1
(scsi1) <Adaptec AIC-7880 Ultra SCSI host adapter> found at PCI 2/7/0
(scsi1) Wide Channel, SCSI ID=7, 16/255 SCBs
(scsi1) Downloading sequencer code... 436 instructions downloaded
scsi1 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 5.2.4/5.2.0
       <Adaptec AIC-7880 Ultra SCSI host adapter>

The previous HP NS LH3k was running for 13 days before reboot (in the test lab), with no report of FS corruption. No FS problems detected on system reboot with fsck forced, either. It's also successfully run/survived the kernel stress testing used in the test lab without error.

The raid-controller in LH3k seems to be different than the one we have in our lp2k(s).
Just browsed hp's website, and it seems that LH3k has integrated 2-channel raid-
controller, but it didn't say if it's HP NetRAID-2M. At least the bootup sequence would
suggest it isn't, since at least the raid processors seem to be different. In lp2k the
raid is not integrated, but sold as an addon.
Don't know what other differences there might be between LH3k and lp2k, but lp2k
is quite a new model (as is lp1k), so there might be a little newer hardware inside.
And of course the problem might be somewhere else than in the raid controller?
> The previous HP NS LH3k was running for 13 days before reboot (in the test lab),
> with no report of FS corruption.
As I said before, the fs corruption I had with these machines was visible immediately
after reboot, so the fs was corrupted already during the install process. However, since
I also experienced a couple of failed installs (hanged), it might be some other kernel-
related problem as well.
BTW, I e-mailed about the problem to HP too, and the reply was that they hadn't
tested RH71 on lp2k yet.
>> http://netserver.hp.com/netserver/support/hot_news/bpn04056.asp
>
> I haven't checked the controller yet, since they are packed quite tightly in a rack...

Asked about this one too, and the controllers were in fact changed, but before I tried to install RH71. So this is not it...

This is NOT an HP-specific problem. Please see http://lwn.net/2001/0412/kernel.php3 . I have personally experienced it 5 times even in moderate load on a stock K7-1200-266 / Asus A7M266 / Seagate Barracuda ATAIII and have reinstalled 5 times. I'm pulling out my hair.
Luke Hutchison.

lukeh.nz: the bug mentioned there was one we fixed before we released Red Hat Linux 7.1. However, you have a VIA chipset and VIA recently announced that their chipset has a bug (and that they knew about it for months). The safest thing to do is to use "ide=nodma" as kernel option (during the installation and in lilo.conf) at all times. This is totally unrelated to the real problem of this bug.

I've experienced the same problem using a PIII with a ServerWorks OSB4 chipset and the stock 2.4.2-2 RH71 kernel. Here is the dmesg:

Linux version 2.4.2-2 (root.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-79)) #1 Sun Apr 8 20:41:30 EDT 2001
BIOS-provided physical RAM map:
 BIOS-e820: 000000000009fc00 @ 0000000000000000 (usable)
 BIOS-e820: 0000000000000400 @ 000000000009fc00 (reserved)
 BIOS-e820: 0000000000020000 @ 00000000000e0000 (reserved)
 BIOS-e820: 0000000007f00000 @ 0000000000100000 (usable)
 BIOS-e820: 0000000000001000 @ 00000000fec00000 (reserved)
 BIOS-e820: 0000000000001000 @ 00000000fec01000 (reserved)
 BIOS-e820: 0000000000001000 @ 00000000fee00000 (reserved)
 BIOS-e820: 0000000000080000 @ 00000000fff80000 (reserved)
On node 0 totalpages: 32768
zone(0): 4096 pages.
zone DMA has max 32 cached pages.
zone(1): 28672 pages.
zone Normal has max 224 cached pages.
zone(2): 0 pages.
zone HighMem has max 1 cached pages.
Kernel command line: auto BOOT_IMAGE=linux ro root=301 BOOT_FILE=/boot/vmlinuz-2.4.2-2
Initializing CPU#0
Detected 999.556 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1992.29 BogoMIPS
Memory: 126472k/131072k available (1365k kernel code, 4212k reserved, 92k data, 236k init, 0k highmem)
Dentry-cache hash table entries: 16384 (order: 5, 131072 bytes)
Buffer-cache hash table entries: 8192 (order: 3, 32768 bytes)
Page-cache hash table entries: 32768 (order: 6, 262144 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
VFS: Diskquotas version dquot_6.5.0 initialized
CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
CPU: Common caps: 0383fbff 00000000 00000000 00000000
CPU: Intel Pentium III (Coppermine) stepping 06
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.37 (20001109) Richard Gooch (rgooch.au)
mtrr: detected mtrr type: Intel
PCI: PCI BIOS revision 2.10 entry at 0xfdbb1, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered primary peer bus 01 [IRQ]
PCI: Using IRQ router ServerWorks [1166/0200] at 00:0f.0
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
apm: BIOS version 1.2 Flags 0x03 (Driver version 1.14)
Starting kswapd v1.8
Detected PS/2 Mouse Port.
pty: 256 Unix98 ptys configured
block: queued sectors max/low 83898kB/27966kB, 256 slots per queue
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
ServerWorks OSB4: chipset revision 0
ServerWorks OSB4: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio
hda: ST320414A, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: 39102336 sectors (20020 MB) w/2048KiB Cache, CHS=2434/255/63, UDMA(33)
Partition check:
 hda:<5>apm: get_event: Interface not connected
 hda1 hda2 hda3
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Serial driver version 5.02 (2000-08-09) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI ISAPNP enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
Real Time Clock Driver v1.10d
md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md.c: sizeof(mdp_super_t) = 4096
autodetecting RAID arrays
autorun ...
... autorun DONE.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 1024 buckets, 8Kbytes
TCP: Hash tables configured (established 8192 bind 8192)
Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 236k freed
Adding Swap: 2097136k swap-space (priority -1)
usb.c: registered new driver usbdevfs
usb.c: registered new driver hub
PCI: Found IRQ 10 for device 00:0f.2
usb-ohci.c: USB OHCI at membase 0xc8915000, IRQ 10
usb-ohci.c: usb-00:0f.2, PCI device 1166:0220 (ServerWorks)
usb.c: new USB bus registered, assigned bus number 1
hub.c: USB hub found
hub.c: 4 ports detected
Winbond Super-IO detection, now testing ports 3F0,370,250,4E,2E ...
SMSC Super-IO detection, now testing Ports 2F0, 370 ...
ip_conntrack (1024 buckets, 8192 max)
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@saw.sw.com.sg> and others
PCI: Found IRQ 11 for device 00:06.0
eth0: Intel Corporation 82557 [Ethernet Pro 100], 00:30:48:21:84:D0, I/O at 0xd800, IRQ 11.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 000000-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
  Receiver lock-up workaround activated.
EXT2-fs error (device ide0(3,1)): ext2_readdir: bad entry in directory #1619184: directory entry across blocks - offset=0, inode=0, rec_len=46320, name_len=24
EXT2-fs error (device ide0(3,1)): ext2_readdir: bad entry in directory #1619187: rec_len % 4 != 0 - offset=0, inode=0, rec_len=46323, name_len=24
EXT2-fs error (device ide0(3,1)): ext2_readdir: bad entry in directory #1619184: directory entry across blocks - offset=0, inode=0, rec_len=46320, name_len=24
EXT2-fs error (device ide0(3,1)): ext2_readdir: bad entry in directory #1619187: rec_len % 4 != 0 - offset=0, inode=0, rec_len=46323, name_len=24
EXT2-fs error (device ide0(3,1)): ext2_readdir: bad entry in directory #1619184: directory entry across blocks - offset=0, inode=0, rec_len=46320, name_len=24
EXT2-fs error (device ide0(3,1)): ext2_readdir: bad entry in directory #1619187: rec_len % 4 != 0 - offset=0, inode=0, rec_len=46323, name_len=24

I noticed in the kernel source RPM that the OSB4 support is disabled because it is known to cause data corruption and I have not changed that option.
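
To make the two EXT2-fs messages in the log above easier to read, here is a small sketch of the kind of directory-entry sanity checks that produce them. It is an illustration only, not the kernel's code: the function names are hypothetical, the 4096-byte block size is an assumption (the log does not state it), and only the two messages that appear in the dmesg are confirmed by this report; the other two messages and the ordering of the checks are assumptions.

```python
def ext2_dir_rec_len(name_len):
    """Smallest valid record: 8-byte fixed header plus name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(offset, rec_len, name_len, block_size=4096):
    """Return an error string for a bad entry, or None if it looks sane.

    block_size=4096 is an assumed value; ext2 block sizes are 1024..4096,
    and the rec_len values in the log exceed all of them.
    """
    if rec_len < ext2_dir_rec_len(1):
        return "rec_len is smaller than minimal"   # assumed message text
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"                  # seen in the log
    if rec_len < ext2_dir_rec_len(name_len):
        return "rec_len is too small for name_len" # assumed message text
    if offset + rec_len > block_size:
        return "directory entry across blocks"     # seen in the log
    return None


if __name__ == "__main__":
    # Values taken from the two distinct error lines in the dmesg.
    for rec_len in (46320, 46323):
        print(rec_len, "->", check_dir_entry(offset=0, rec_len=rec_len, name_len=24))
```

Both logged entries have `inode=0` and a `rec_len` more than ten times larger than any ext2 block, which suggests the directory blocks held garbage when they were read rather than entries that were merely slightly damaged.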
I've since compiled a 2.4.4 kernel with the same config file used to compile the 2.4.2-2 RH71 kernel and while DMA no longer works reliably, data corruption is no longer a problem.
Rob Todd

I forgot to add: We have 32 of these exact machines... this behavior has occurred on 12 of them, ranging from the installation problem described above to data corruption and overall FS weirdness during operation. At one point even data recovered during a swap operation was corrupted (obviously crashing the application that had swapped the data).
Robert

Whoops... I spoke too soon. It appears as if the same problem exists in 2.4.4 also.
Robert

PLEASE open a separate bug for the serverworks problem. This bug is about the megaraid driver which is TOTALLY unrelated to serverworks. (And IDE on serverworks doesn't work. It's a chipset bug. Try using ide=nodma; it might work around the chipset bug.)

Just to add my $0.02 worth... I have had identical experience to the original poster with an LP1000r and a Netraid-1M (single channel version of the AMI sourced Netraid-2M). This system is a 2-way PIII/933 with 750Mb RAM, 3x HP 18Gb drives on the Netraid and an HP DDS-4 drive on the internal Symbios SCSI (which uses the sym53c8xx driver). The Netraid-1M has firmware H.01.07.

I have tried setting up hardware RAID1 with hot spare and also simply presenting the 3 drives as 3 separate LUNs to use with software RAID1. In either case there are serious problems.

Using software RAID set up with Disk Druid, the installer gets all the way to the end where, instead of the expected "performing post configuration tasks" progress bar, I get "Installer terminated abnormally".

Installing to a conventional partition (with or without hardware RAID) sometimes gets an apparently successful install which will last an absolute maximum of 2 reboots before it dies a spectacular death due to massive filesystem corruption. On other occasions the installer will hang part way through.
It is generally noticeable that all does not seem well during the install. The I/O "feels" really choppy, and there are frequent pauses for thought. It takes in excess of 30min as reported by the installer to do a custom "install everything", whereas I usually expect to see no more than around 20min.

This problem is readily reproduced on each of 2 identical LP1000r's.

I also removed the disks from the Netraid-1M and connected them instead to the Symbios SCSI - guess what, it worked... The Netraid was still physically installed and the megaraid driver loaded, just not connected to any disks. The install "felt" much cleaner, and took almost exactly 20min.

I'm also having problems with the megaraid controller. I'm using the megaraid 1600 on a tyan thunder LE motherboard with two 18GB drives on a RAID-1 and a single 9GB as a RAID-0. During installation, the megaraid driver is loaded, but I'm warned of an invalid partition table on all real and RAID devices connected to the megaraid controller. I've repartitioned the drives several times using fdisk and disk druid, but each time it claims the partition table is corrupt and makes me try again. I'm a newbie, so if there is any more technical information that would help you, you'll have to tell me how to find it.

According to bug # 37531, others have been having problems with the megaraid controller that have been solved by the latest firmware update. It seems to have fixed my problems, might it be worth a shot for you?

> According to bug # 37531, others have been having problems with the megaraid
> controller that have been solved by the latest firmware update. It seems to
Does anyone know if HP NetRaids (1M/2M) are identical to some AMI model, or
do they have some HP-specific HW/FW in them?
I am having the same problem, installation failing, HP lp1000r with HP netraid 1M card. Had one good install, but then massive file system corruption. I have verified newest BIOS in SCSI drives (two HP hot swap Fujitsu drives, firmware F612 - trying for RAID1), and card (H.01.07 G.01.02).

The installation seems to go OK, and installs a MegaRAID driver for the netraid card (RedHat 7.1). Just before it should make the boot floppy, installation stops; I reboot manually, system will not boot. I got the install to go once, but had ext2 errors about an hour later. Running RAID1. Borrowed disks from a friend, no change; swapped SCSI cable, no change. Do I need a different RAID driver?
TIA -Keith howanitz

Just tried installing with fdisk instead of disk druid. System formatted the drives, then crashed while it was copying install image to HDD. (Before it started to install any packages.) I have always been using the expert install.
-Keith

Contacted HP support, they asked me to try the RAID driver for 7.0 available on this page: http://www.hp.com/cposupport/swindexes/hpnetserve28162_swen.html
Going to attempt a clean install after lunch.
-Keith

Flashed one of my NetRaid 1M cards with the latest AMI BIOS for the MegaRaid Express 500. (It matched the Series 475 silkscreened to the card.) It is "New Release A159" and can be grabbed from www.ami.com

I can finally install, and reboot without any apparent failures. HP tells me that they are aware of the problem, and a fix is due out 'Any Day Now'. Until then, I will be running the AMI BIOS.
Gordon

The HP drivers for 7.0 did not work for me. I have flashed my 1M board with the same AMI BIOS as Gordon. (I believe mine said 471, at least during the flash.) Everything appears to be working now. I am calling to ask HP to carry better support for RedHat with their hardware and to close my case number (1428411937), but I would feel better if I knew someone at RedHat called also (I do not know if my polite insistence will mean as much).
Thanks Gordon!!!
-Keith

Now that my NetRaid 1M is working, I have noticed that the automatic partitioning during install only uses 12GB of my 18GB available (2 x 18GB in RAID 1). I have no idea if this is related to the 1M card or if it is a problem in Disk Druid. Anyone have any ideas? I tried to manually partition using Druid, but it would not allow me to use the remaining disk space. I remember this kind of thing from ages ago, and back then I was able to use 'fdisk' to gain the lost space back. I will be trying that tomorrow morning.

BTW - we spend major $$$ with HP. Big iron, Intel servers, PCs, printers, and network hubs/switches. I spoke with a couple of techs in the nearby HP office, and they were *very* helpful; they stated that the AMI BIOS would flash without any issues. They stated that HP is a Linux friendly company, and that they specifically support RedHat. Polite insistence is the best policy for a problem that appears to have a temporary band-aid. Just my $.02 worth.
Gordon

Just wanted to let you know that I have a Netserver LC2000 with Netraid 1M and had problems with 7.1 (no problems on a Netraid 1Si). Found this bug report and, as Gordon suggested, downloaded the AMI BIOS for the Express 500 and installed it. RH 7.1 installation worked like a charm.

Anyone know if the newer megaraid driver (from patch-2.4.6-ac5) version
> 1.16 fixes the NetRaid 1M/2M filesystem corruption problem? The following
is from the driver changelog (megaraid.c) :
+ * Check added for HP 1M/2M controllers if having firmware H.01.07 or
+ * H.01.08. If found, disable 64 bit support since these firmware have
+ * limitations for 64 bit addressing
And BTW, is the H.01.08 firmware available from HP? Their support-pages
seem to offer H.01.07 as the latest firmware.
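
The changelog entry quoted above describes a simple decision: on HP 1M/2M controllers running firmware H.01.07 or H.01.08, do not use 64-bit addressing. A rough sketch of that decision follows. It is not the driver's code (the driver is written in C); the function name `use_64bit_addressing` is hypothetical, and identifying an HP board by its PCI subsystem vendor ID is an assumption about how such a check could be keyed.

```python
# PCI vendor ID registered to Hewlett-Packard. Using the subsystem vendor ID
# to recognise an HP-branded board is an assumption made for this sketch.
HP_PCI_VENDOR_ID = 0x103C

# Firmware versions named in the quoted megaraid.c changelog entry.
LIMITED_64BIT_FIRMWARE = {"H.01.07", "H.01.08"}


def use_64bit_addressing(subsys_vendor_id, firmware_version):
    """Return False when the changelog's condition for disabling 64-bit holds."""
    is_hp_board = subsys_vendor_id == HP_PCI_VENDOR_ID
    has_limited_firmware = firmware_version.strip() in LIMITED_64BIT_FIRMWARE
    return not (is_hp_board and has_limited_firmware)


if __name__ == "__main__":
    # The NetRaid-2M from the original report: HP subsystem, firmware H.01.07.
    print(use_64bit_addressing(0x103C, "H.01.07"))
```

Under this reading, every HP NetRaid 1M/2M mentioned in this report with firmware H.01.07 would fall into the restricted case, which would be consistent with the AMI-flashed cards behaving differently; whether the 1.16 driver actually cures the corruption is still the open question asked above.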