Bug 43461
Summary: Netfinity 4500R + serveRAID4L + 2.4.2:(ips0) Controller reset failed
Product: [Retired] Red Hat Linux
Component: kernel
Version: 7.1
Hardware: i686
OS: Linux
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: rosa
Assignee: Arjan van de Ven <arjanv>
QA Contact: Brock Organ <borgan>
Doc Type: Bug Fix
Last Closed: 2003-06-06 13:11:34 UTC
Description
rosa
2001-06-05 03:42:03 UTC
Just two things I wanted to mention:

At the first incident the controller went offline at 04:03, which is right after cron.daily. This time it was after an rpm --rebuild imap-whatever.src.rpm, so although the machine was definitely not under heavy load, I would suspect some extra disk activity in both cases.

I've seen some discussion on the kernel mailing list about a Wiseman System Management adapter, and although there definitely is an `ASMA Advanced System Management Adapter' in the system, I do not know if it is a `Wiseman' (see the threads at http://uwsg.iu.edu/hypermail/linux/kernel/0101.1/0918.html and http://www.uwsg.indiana.edu/hypermail/linux/kernel/0103.1/0681.html ). I'm not yet sure whether the symptoms are identical, though: the machine does not lock up completely .. otoh ..

The other thing I realised is that if you look at the messages you'll see there are twenty minutes between

Jun 5 01:27:02 server3 kernel: (ips0) Controller reset failed - controller now offline.

and:

Jun 5 01:48:12 server3 kernel: scsi: device set offline - command error

During this time the machine did not respond at all, not even with character echo, while I was typing at the serial console. Then suddenly (could be after twenty minutes) I got back the shell prompt, along with loads of messages of the type `EXT2-fs error (device sd(8,5)): ext2_write_inode: unable to read inode block' and `I/O error: dev 08:05, sector 32243712'.

HTH, Harold.

I've asked IBM and they seem to think it's a problem with your hardware. They have run days and days of stress tests with the driver/firmware without any problems at all.

Which driver/kernel, and which firmware version?

I took the liberty of asking ipslinux.com (from ips.c) whether it was still an option to downgrade to 2.2 (given that we upgraded the firmware). They replied:

> The 4.7x ServeRAID release ( firmware / BIOS / drivers ) has been tested
> against the older releases. There is no reason to down-level it to use a
> 2.2.x kernel. It should be backwards compatible.
>
> It sounds to me like you may have a bad controller. Have you tried a
> different one in the failing machine ?

I replied `I read everything I could find on the net and added the `nmi_watchdog=0' to the boot options. I'm not sure I completely understand what it does'. I suspect that as a side effect it disables some buggy code...

Right now the machine is running three simultaneous compiles of a linux kernel and three of the gcc compiler while making tar backups of / (except /tmp), with a load average of 12.35, 11.76, 9.14.

I am willing to believe it is faulty hardware, but what am I going to tell IBM support? We bought the new machine twelve days ago, and the first time I called the answer was `only redhat 7.0 is supported' (see: http://www.pc.ibm.com/us/compat/nos/linux.html ). This is all a bit inconvenient, as we bought the machine to ease the transfer of our primary nameserver to another colocation (other netblock as well), and frankly right now I don't dare run anything important on this machine. It's an expensive kernel compile farm for the next two weeks :)

nmi_watchdog=0 should be the default in our kernel. You can use a 2.2 kernel with RHL 7.1, except that you lose some of the USB hotplug features, but for a nameserver I suspect that isn't an issue.

Since Harold is on holiday, I am taking over some of his activities, and this case is one of them. We followed your advice to upgrade the BIOS to 4.70. We even did the nmi_watchdog=0 trick. We scheduled some kernel build jobs so that there was some load on the machine: no more than two concurrent build jobs, and idle most of the time.
Today, after five days, the server stopped responding normally and showed lots of I/O errors, like this:

I/O error: dev 08:05, sector 3670088
EXT2-fs error (device sd(8,5)): ext2_read_inode: unable to read inode block - inode=229068, block=458761
green.betterbe.com login: I/O error: dev 08:05, sector 3670088
EXT2-fs error (device sd(8,5)): ext2_write_inode: unable to read inode block - inode=229068, block=458761
I/O error: dev 08:05, sector 0

I could not even log in to the server. After a power cycle and manually repairing the filesystem, we found these lines in the logfile again:

Jun 5 17:07:33 green kernel: (ips0) Resetting controller.
Jun 5 20:37:16 green kernel: (ips0) Resetting controller.
Jun 6 19:27:58 green kernel: (ips0) Resetting controller.
Jun 9 09:39:23 green kernel: (ips0) Resetting controller.
Jun 9 10:47:35 green kernel: (ips0) Resetting controller.
Jun 9 15:47:32 green kernel: (ips0) Resetting controller.
Jun 10 10:47:33 green kernel: (ips0) Resetting controller.
Jun 10 13:38:25 green kernel: (ips0) Resetting controller.

The line reporting that the controller is offline was not there (due to the nmi_watchdog=0 setting?). In the lost+found directory I found 8 files dated Jun 10, 15:17; all of these files are kernel-build-related. The last reboot was Jun 5 12:16, so that is almost 5 days of uptime (longest so far :-/).

We have done what we could so far. What should we do now?

I saw that the Linux driver (ips.c) released together with firmware/BIOS 4.70 (the latest official version) has version 4.70.13, while the ips.c that comes with RH7.1 has version 4.71. Could it be that version 4.71 is "too new" or "beta"?

Under what circumstances is a "(ips0) Resetting controller" message generated? Is it normal? What could we do to pinpoint the cause? Please advise.
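One way to start pinpointing the cause is to pull every reset message out of the syslog and look at the spacing between them. The following is only a rough sketch, not anything supplied by IBM or Red Hat: the function names are made up for illustration, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Jun 5 17:07:33 green kernel: (ips0) Resetting controller.
RESET_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\(ips\d+\) Resetting controller\."
)


def reset_times(lines, year=2001):
    """Return the timestamps of all controller-reset lines.

    The year is an assumption: syslog omits it, and this report is from 2001.
    """
    times = []
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            # Collapse the padding syslog uses for single-digit days.
            stamp = " ".join(match.group("stamp").split())
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_hours(times):
    """Hours elapsed between each reset and the next one."""
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


# Lines copied from the log excerpts in this report.
sample = [
    "Jun 5 17:07:33 green kernel: (ips0) Resetting controller.",
    "Jun 5 20:37:16 green kernel: (ips0) Resetting controller.",
    "Jun 15 17:27:07 green kernel: ServerWorks OSB4: IDE controller on PCI bus 00 dev 79",
    "Jun 6 19:27:58 green kernel: (ips0) Resetting controller.",
]

for gap in gaps_in_hours(reset_times(sample)):
    print(f"{gap:.2f} h between resets")
```

In practice the lines would come from the real logfile (for example by iterating over an open /var/log/messages) rather than from a hard-coded list.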
4.71 has a fix for 2.4 kernels issuing large requests and should otherwise be identical to 4.70. I don't have such hardware, and IBM thinks it's a hardware bug.

Reaction from ipslinux.com:
----------------------------------
I wish I could help more, but I don't know what your problem could be. It does appear to me that it could be ServeRAID related. It could be hardware related also. I'm at a loss, however, because we have run stress tests for much longer than this ( before it was released ) and did not see failures like this. When a ServeRAID reset occurs, it usually means that the driver thinks that the controller is not responding anymore. You might try the latest driver ( 4.72 can be downloaded from http://www.developer.ibm.com/xseries/serveraid.html ), but I can't say that it will fix this ( since 4.71 should be OK in the first place ). There are lots of folks now using Red Hat 7.1 who are not seeing this, so I don't have a good answer at this time. Good Luck ... Let me know if there's anything else I can help with.

Another reaction from ipslinux.com:

A couple more items I thought of that are worth considering ....

1.) The device driver does not ever reset the controller. That request must come from Linux ( usually as a response to a series of failed and/or timed-out commands )

2.) In your bugzilla report I saw:
(ips0) timeout waiting for post.
(ips0) Controller reset failed - controller now offline.
This concerns me a lot. That indicates that the reset attempt failed. Reset is fairly simple and I don't know how it would fail unless there is a very serious problem in the server or adapter that is keeping the PCI Reset from occurring.

3.) Have you checked your ServeRAID adapter error logs for errors ?

Last Friday the ServeRAID card was replaced. But the "(ips0) Resetting controller" messages did not stop!!! We found out that the new card did not have the most recent firmware/BIOS. Yesterday (monday june 19th) I upgraded to 4.70 (was 4.50).
Still, the messages did not stop. Today I will ask to have the whole server replaced.

It looks like there is a pattern.... There are 2 cron jobs scheduled:

0 */2 * * * rpm --rebuild /home/build/IN/kernel-2.4.3-7.src.rpm > /home/build/rebuild.$$ 2>&1
0 */3 * * * rpm --rebuild /home/rebuild/IN/kernel-2.4.3-7.src.rpm > /home/rebuild/rebuild.$$ 2>&1

So user "build" does a kernel compile every 2 hours and user "rebuild" does one every 3 hours. It looks like a controller reset happens at the end of such a build process, with a chance of about 30%. Here are the logs and the timestamps of the logfiles of the build processes:

Jun 15 17:27:07 green kernel: ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
Jun 15 20:42:28 green kernel: (ips0) Resetting controller.
Jun 16 07:19:36 green kernel: (ips0) Resetting controller.
Jun 16 08:42:37 green kernel: (ips0) Resetting controller.
Jun 16 10:42:23 green kernel: (ips0) Resetting controller.
Jun 16 13:18:54 green kernel: (ips0) Resetting controller.
Jun 16 14:36:12 green kernel: (ips0) Resetting controller.
Jun 16 14:42:34 green kernel: (ips0) Resetting controller.
Jun 16 19:18:47 green kernel: (ips0) Resetting controller.
Jun 17 01:18:44 green kernel: (ips0) Resetting controller.
Jun 17 03:42:28 green kernel: (ips0) Resetting controller.
Jun 17 04:43:43 green kernel: (ips0) Resetting controller.
Jun 17 07:18:46 green kernel: (ips0) Resetting controller.
Jun 17 10:42:19 green kernel: (ips0) Resetting controller.
Jun 17 15:42:34 green kernel: (ips0) Resetting controller.
Jun 17 19:18:26 green kernel: (ips0) Resetting controller.
-rw-r--r-- 1 build build 2221453 Jun 15 17:12 rebuild.30268
-rw-r--r-- 1 build build 2221945 Jun 15 19:18 rebuild.1056
-rw-r--r-- 1 build build 2221944 Jun 15 20:41 rebuild.30102
-rw-r--r-- 1 build build 2221945 Jun 15 22:41 rebuild.26748
-rw-r--r-- 1 build build 2221945 Jun 16 01:18 rebuild.25063
-rw-r--r-- 1 build build 2221946 Jun 16 02:41 rebuild.21678
-rw-r--r-- 1 build build 2221946 Jun 16 04:42 rebuild.18158
-rw-r--r-- 1 build build 2221946 Jun 16 07:19 rebuild.16614
-rw-r--r-- 1 build build 2221944 Jun 16 08:41 rebuild.13306
-rw-r--r-- 1 build build 2221942 Jun 16 10:41 rebuild.9715
-rw-r--r-- 1 build build 2221944 Jun 16 13:18 rebuild.7977
-rw-r--r-- 1 build build 2221946 Jun 16 14:41 rebuild.4607
-rw-r--r-- 1 build build 2221946 Jun 16 16:41 rebuild.1118
-rw-r--r-- 1 build build 2221946 Jun 16 19:16 rebuild.31761
-rw-r--r-- 1 build build 2221945 Jun 16 20:41 rebuild.28376
-rw-r--r-- 1 build build 2221946 Jun 16 22:41 rebuild.24761
-rw-r--r-- 1 build build 2221945 Jun 17 01:16 rebuild.22985
-rw-r--r-- 1 build build 2221945 Jun 17 02:41 rebuild.19588
-rw-r--r-- 1 build build 2221946 Jun 17 04:42 rebuild.15977
-rw-r--r-- 1 build build 2221945 Jun 17 07:17 rebuild.19901
-rw-r--r-- 1 build build 2221946 Jun 17 08:41 rebuild.16497
-rw-r--r-- 1 build build 2221945 Jun 17 10:41 rebuild.13083
-rw-r--r-- 1 build build 2221946 Jun 17 13:18 rebuild.11289
-rw-r--r-- 1 build build 2221943 Jun 17 14:41 rebuild.7889
-rw-r--r-- 1 build build 2221946 Jun 17 16:41 rebuild.4359
-rw-r--r-- 1 build build 2221946 Jun 17 19:15 rebuild.2675
-rw-r--r-- 1 build build 2221946 Jun 17 20:41 rebuild.31766
-rw-r--r-- 1 build build 2221945 Jun 17 22:41 rebuild.28308

-rw-r--r-- 1 rebuild rebuild 2243177 Jun 15 17:13 rebuild.30269
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 15 19:17 rebuild.1058
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 15 21:41 rebuild.28397
-rw-r--r-- 1 rebuild rebuild 2243668 Jun 16 01:18 rebuild.25065
-rw-r--r-- 1 rebuild rebuild 2243668 Jun 16 03:41 rebuild.19846
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 16 07:18 rebuild.16616
-rw-r--r-- 1 rebuild rebuild 2243669 Jun 16 09:41 rebuild.11527
-rw-r--r-- 1 rebuild rebuild 2243669 Jun 16 13:18 rebuild.7979
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 16 15:41 rebuild.2894
-rw-r--r-- 1 rebuild rebuild 2243669 Jun 16 19:17 rebuild.31763
-rw-r--r-- 1 rebuild rebuild 2243667 Jun 16 21:41 rebuild.26584
-rw-r--r-- 1 rebuild rebuild 2243669 Jun 17 01:17 rebuild.22987
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 17 03:41 rebuild.17798
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 17 07:19 rebuild.19903
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 17 09:41 rebuild.14791
-rw-r--r-- 1 rebuild rebuild 2243668 Jun 17 13:18 rebuild.11291
-rw-r--r-- 1 rebuild rebuild 2243669 Jun 17 15:41 rebuild.6175
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 17 19:17 rebuild.2677
-rw-r--r-- 1 rebuild rebuild 2243670 Jun 17 21:41 rebuild.30002

Yesterday, someone from CSG (a company that does hardware support) asked me to retrieve some information with ipssend (a command line ServeRAID utility). After seeing this, he thought it is one disk that is causing the trouble:

Found 1 IBM ServeRAID controller(s).
Get event table has been initiated for controller 1...
BIOS version : 4.70.17
Firmware version : 4.70.17
Device event table:
|Channel|SCSI ID|Parity |Soft   |Hard   |PFA    |Misc   |
|-------|-------|-------|-------|-------|-------|-------|
|   1   |   0   |   0   |   0   |   0   |  No   |   3   |
|   1   |   1   |   0   |   0   |   0   |  No   |   3   |
|   1   |   2   |   0   |   0   |   0   |  No   |  11   |
|   1   |   3   |   0   |   0   |   0   |  No   |   0   |
|   1   |   4   |   0   |   0   |   0   |  No   |   0   |
|   1   |   5   |   0   |   0   |   0   |  No   |   0   |
|   1   |   6   |   0   |   0   |   0   |  No   |   0   |
|   1   |   7   |   0   |   0   |   0   |  No   |   0   |
|   1   |   8   |   0   |   0   |   0   |  No   |   0   |
|   1   |   9   |   0   |   0   |   0   |  No   |   0   |
|   1   |  10   |   0   |   0   |   0   |  No   |   0   |
|   1   |  11   |   0   |   0   |   0   |  No   |   0   |
|   1   |  12   |   0   |   0   |   0   |  No   |   0   |
|   1   |  13   |   0   |   0   |   0   |  No   |   0   |
|   1   |  14   |   0   |   0   |   0   |  No   |   0   |
|   1   |  15   |   0   |   0   |   0   |  No   |   0   |
Command completed successfully.

The disk will be replaced.
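The suspected link between the end of a build and a controller reset can be checked mechanically by comparing the reset timestamps with the modification times of the build logfiles. The sketch below is only an illustration under stated assumptions: the function names and the 3-minute window are arbitrary choices, the year is assumed because neither syslog nor ls prints it, and only a handful of the timestamps listed above are used as sample data.

```python
from datetime import datetime, timedelta

YEAR = 2001  # assumption: neither syslog nor ls shows the year


def parse_stamp(text, fmt):
    """Parse a year-less timestamp such as 'Jun 15 20:42:28'."""
    return datetime.strptime(f"{YEAR} {text}", f"%Y {fmt}")


def resets_near_builds(reset_stamps, build_stamps, window=timedelta(minutes=3)):
    """Return the resets that happened at most `window` after a build log's mtime.

    ls reports mtimes with minute resolution, so the window has to be
    generous; 3 minutes is an arbitrary choice for this sketch.
    """
    resets = [parse_stamp(s, "%b %d %H:%M:%S") for s in reset_stamps]
    builds = [parse_stamp(s, "%b %d %H:%M") for s in build_stamps]
    return [
        reset
        for reset in resets
        if any(timedelta(0) <= reset - build <= window for build in builds)
    ]


# A few timestamps taken from the kernel log and the ls listing above.
sample_resets = [
    "Jun 15 20:42:28",
    "Jun 16 07:19:36",
    "Jun 16 08:42:37",
    "Jun 16 14:36:12",
]
sample_builds = [
    "Jun 15 20:41",
    "Jun 16 07:19",
    "Jun 16 08:41",
    "Jun 16 14:41",
]

matched = resets_near_builds(sample_resets, sample_builds)
print(f"{len(matched)} of {len(sample_resets)} resets follow a build log write")
```

Feeding in the complete lists instead of this sample would show how many resets fall outside the build windows, such as the one at Jun 16 14:36:12, which precedes the nearest logfile timestamp.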
Since replacing the RAID controller (this was done while I was on holiday) didn't fix the problem, IBM support decided to replace the backplane, which they did early this afternoon (Jun 25 13:30).

After booting the machine I started one job

[rebuild@green ]$ rpm --rebuild IN/gcc-2.96-85.src.rpm >& gcc.rebuild.$$ &
[1] 29045

to get some disk activity going, and soon got the reset message again:

Jun 25 14:02:36 green kernel: ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
Jun 25 14:17:52 green kernel: (ips0) Resetting controller.

To give a detailed report I decided to go through the status commands that IBM support asked us to run last time. Notice there is one successful `ipssend getevent 1 device' below, but the second time it hung the machine...

[root@green /root]# ipssend devinfo 1 1 2
Found 1 IBM ServeRAID controller(s).
Device information has been initiated for controller 1...
Device is a Hard disk
Channel : 1
SCSI ID : 2
PFA (Yes/No) : No
State : Online (ONL)
Size (in MB)/(in sectors): 17357/35548048
Device ID : IBM-PSG DDYS-T18S9HA5EM0Z268
Command completed successfully.

[root@green /root]# ipssend devinfo 1 1 1
Found 1 IBM ServeRAID controller(s).
Device information has been initiated for controller 1...
Device is a Hard disk
Channel : 1
SCSI ID : 1
PFA (Yes/No) : No
State : Online (ONL)
Size (in MB)/(in sectors): 17357/35548048
Device ID : IBM-PSG DDYS-T18S9HA5ELLL611
Command completed successfully.

[root@green /root]# ipssend devinfo 1 1 0
Found 1 IBM ServeRAID controller(s).
Device information has been initiated for controller 1...
Device is a Hard disk
Channel : 1
SCSI ID : 0
PFA (Yes/No) : No
State : Online (ONL)
Size (in MB)/(in sectors): 17357/35548048
Device ID : IBM-PSG DDYS-T18S9HA5ELLH601
Command completed successfully.

[root@green /root]# ipssend getevent 1 device
Found 1 IBM ServeRAID controller(s).
Get event table has been initiated for controller 1...
BIOS version : 4.70.17
Firmware version : 4.70.17
Device event table:
|Channel|SCSI ID|Parity |Soft   |Hard   |PFA    |Misc   |
|-------|-------|-------|-------|-------|-------|-------|
|   1   |   0   |   0   |   0   |   0   |  No   |  14   |
|   1   |   1   |   0   |   0   |   0   |  No   |  19   |
|   1   |   2   |   0   |   0   |   0   |  No   |  45   |
|   1   |   3   |   0   |   0   |   0   |  No   |   0   |
|   1   |   4   |   0   |   0   |   0   |  No   |   0   |
|   1   |   5   |   0   |   0   |   0   |  No   |   0   |
|   1   |   6   |   0   |   0   |   0   |  No   |   0   |
|   1   |   7   |   0   |   0   |   0   |  No   |   0   |
|   1   |   8   |   0   |   0   |   0   |  No   |   0   |
|   1   |   9   |   0   |   0   |   0   |  No   |   0   |
|   1   |  10   |   0   |   0   |   0   |  No   |   0   |
|   1   |  11   |   0   |   0   |   0   |  No   |   0   |
|   1   |  12   |   0   |   0   |   0   |  No   |   0   |
|   1   |  13   |   0   |   0   |   0   |  No   |   0   |
|   1   |  14   |   0   |   0   |   0   |  No   |   0   |
|   1   |  15   |   0   |   0   |   0   |  No   |   0   |
Command completed successfully.

[root@green /root]# ipssend getevent 1 soft
Found 1 IBM ServeRAID controller(s).
Get event table has been initiated for controller 1...
BIOS version : 4.70.17
Firmware version : 4.70.17
Controller soft event log (1023 entries):
10020102 8001D0CD 10050202 100000F0 8001D0CD 10F62902 10070101 8001D0CD 10050201 100000F0 8001D0CD 10F62902 100A0200 8001D0CE 10050200 100000F0 8001D0CE 0172000E 01030100 01041B00 0166BABE 8001D535 100503A7 100000F0 8001D535 010C000E 01830100 01041A00 0110BABE 8001D565 1005030C 100000F0 8001D565 010B000E 01830100 01041A00 0169BABE 8001D5B5 1005030B 100000F0 8001D5B5 011A000E 01030100 01041A00 010FBABE 8001D5BD 1005031A 100000F0 8001D5BD 0123000E 01830100 01021800 0121BABE 8001D5BD 0223FFFF 02010918 8001D5BD 0128000E 01030100 01041A00 0168BABE 8001D5C1 10050328 100000F0 8001D5C1 012C000E 01830100 01041A00 014EBABE 8001D5C1 1005032C 100000F0 8001D5C1 012D000E 01830100 01041A00 012EBABE 8001D5C1 1005032D 100000F0 8001D5C1 0138000E 01830100 01021800 0131BABE 8001D5C5 0238FFFF 02010918 8001D5C5 0147000E 01830100 01021800 016FBABE 8001D5C9 0247FFFF 02010918 8001D5C9 014E000E 01830100 01041A00 0107BABE 8001D5CD 1005034E 100000F0 8001D5CD 0150000E 01830100 01041B00 010CBABE 8001D5CD 100503AB 100000F0 8001D5CD 0167000E 01030100 01041A00 013EBABE
8001D5D1 10050367 100000F0 8001D5D1 0171000E 01830100 01041A00 017FBABE 8001D5D5 10050371 100000F0 8001D5D5 0169000E 01830100 01041A00 0116BABE 8001D5D5 10050369 100000F0 8001D5D5 0177000E 01030100 01021800 0152BABE 8001D5D9 0277FFFF 02010918 8001D5D9 0103000E 01830100 01041B00 0179BABE 8001D605 100503B8 100000F0 8001D605 0104000E 01030100 01041A00 0129BABE 8001D605 10050304 100000F0 8001D605 0106000E 01830100 01041A00 0160BABE 8001D605 10050306 100000F0 8001D605 010D000E 01030100 01021800 0134BABE 8001D609 020DFFFF 02010918 8001D609 010F000E 01030100 01041A00 010ABABE 8001D609 1005030F 100000F0 8001D609 0110000E 01830100 01041B00 017ABABE 8001D609 10050396 100000F0 8001D609 0116000E 01830100 01041A00 015DBABE 8001D609 10050316 100000F0 8001D609 0119000E 01020100 01021000 0136BABE 8001D609 0219FFFF 02010910 8001D609 10050319 100000F0 8001D609 011B000E 01830100 01041B00 0142BABE 8001D60D 100503CB 100000F0 8001D60D 0120000E 01030100 01041B00 010BBABE 8001D60D 10050384 100000F0 8001D60D 0122000E 01830100 01041A00 0177BABE 8001D60D 10050322 100000F0 8001D60D 0124000E 01830100 01041A00 0102BABE 8001D60D 10050324 100000F0 8001D60D 0127000E 01030100 01041B00 014ABABE 8001D611 10050390 100000F0 8001D611 0129000E 01830100 01021800 0171BABE 8001D611 0229FFFF 02010918 8001D611 012E000E 01830100 01041B00 0151BABE 8001D611 10050389 100000F0 8001D611 012F000E 01830100 01021800 0139BABE 8001D611 022FFFFF 02010918 8001D611 0130000E 01030100 01041B00 0124BABE 8001D611 100503DD 100000F0 8001D611 0131000E 01830100 01041B00 0156BABE 8001D611 100503AC 100000F0 8001D611 0132000E 01030100 01041B00 0158BABE 8001D611 100503ED 100000F0 8001D611 0133000E 01830100 01041A00 0140BABE 8001D612 10050333 100000F0 8001D612 013E000E 01830100 01041A00 011CBABE 8001D615 1005033E 100000F0 8001D615 0140000E 01030100 01041A00 0122BABE 8001D615 10050340 100000F0 8001D615 0134000E 01830100 01041A00 0138BABE 8001D615 10050334 100000F0 8001D615 0136000E 01830100 01041A00 011DBABE 8001D615 10050336 100000F0 
8001D615 0137000E 01830100 01041B00 012DBABE 8001D615 100503B2 100000F0 8001D615 0139000E 01830100 01041B00 0128BABE 8001D615 100503AD 100000F0 8001D615 013B000E 01030100 01041A00 0137BABE 8001D616 1005033B 100000F0 8001D616 013C000E 01030100 01041B00 0103BABE 8001D616 1005039E 100000F0 8001D616 0142000E 01030100 01021800 015ABABE 8001D619 0242FFFF 02010918 8001D619 0143000E 01830100 01023700 0143BABE 8001D619 0243FFFF 02010937 8001D619 0144000E 01830100 01041A00 010EBABE 8001D619 10050344 100000F0 8001D619 0145000E 01830100 01021400 0104BABE 8001D619 0146000E 01830100 01041B00 0153BABE 8001D619 100503E4 100000F0 8001D619 0149000E 01830100 01041B00 016DBABE 8001D619 10050393 100000F0 8001D619 014A000E 01830100 01041B00 014BBABE 8001D619 100503D1 100000F0 8001D619 014C000E 01030100 01041B00 0105BABE 8001D619 100503F5 100000F0 8001D619 014D000E 01830100 01023700 0174BABE 8001D619 024DFFFF 02010937 8001D619 0141000E 01830100 01041B00 0167BABE 8001D61A 100503D2 100000F0 8001D61A 014F000E 01830100 01041A00 0155BABE 8001D61D 1005034F 100000F0 8001D61D 0151000E 01830100 01021400 0133BABE 8001D61D 0154000E 01030100 01041B00 012ABABE 8001D61D 10050397 100000F0 8001D61D 0155000E 01830100 01041B00 0113BABE 8001D61D 100503F4 100000F0 8001D61D 0156000E 01030100 01041B00 016EBABE 8001D61D 10050388 100000F0 8001D61D 0157000E 01830100 01041B00 0111BABE 8001D61D 100503F2 100000F0 8001D61D 0159000E 01830100 01041B00 0178BABE 8001D61E 100503BF 100000F0 8001D61E 015A000E 01030100 01041B00 0165BABE 8001D61E 100503A5 100000F0 8001D61E 015B000E 01830100 01041A00 0170BABE 8001D621 1005035B 100000F0 8001D621 015C000E 01830100 01041B00 0144BABE 8001D621 100503EF 100000F0 8001D621 015D000E 01030100 01041A00 0175BABE 8001D621 1005035D 100000F0 8001D621 015E000E 01830100 01021400 0162BABE 8001D621 015F000E 01830100 01041A00 0176BABE 8001D621 1005035F 100000F0 8001D621 0160000E 01830100 01023700 0164BABE 8001D621 0260FFFF 02010937 8001D621 0162000E 01830100 01041B00 011ABABE 8001D621 100503F0 
100000F0 8001D621 0163000E 01830100 01021400 0173BABE 8001D621 0164000E 01830100 01021400 013ABABE 8001D621 0165000E 01030100 01041B00 0141BABE 8001D621 100503A9 100000F0 8001D621 0166000E 01830100 01041A00 013CBABE 8001D621 10050366 100000F0 8001D621 0168000E 01830100 01021400 0154BABE 8001D625 016A000E 01830100 01041A00 0149BABE 8001D625 1005036A 100000F0 8001D625 016B000E 01830100 01021400 017BBABE 8001D625 016C000E 01830100 01041B00 016BBABE 8001D625 100503C0 100000F0 8001D625 016D000E 01830100 01021800 0163BABE 8001D625 026DFFFF 02010918 8001D625 016E000E 01830100 01041B00 0145BABE 8001D625 1005038F 100000F0 8001D625 016F000E 01830100 01041A00 015CBABE 8001D625 1005036F 100000F0 8001D625 0170000E 01030100 01021800 012CBABE 8001D625 0270FFFF 02010918 8001D625 0173000E 01830100 01021400 0146BABE 8001D625 0174000E 01830100 01041B00 017CBABE 8001D625 10050380 100000F0 8001D625 0175000E 01830100 01041B00 014DBABE 8001D629 1005038C 100000F0 8001D629 0176000E 01830100 01041B00 017DBABE 8001D629 100503B6 100000F0 8001D629 0179000E 01830100 01041A00 013FBABE 8001D629 10050379 100000F0 8001D629 017A000E 01830100 01041A00 013BBABE 8001D629 1005037A 100000F0 8001D629 017B000E 01830100 01021400 016ABABE 8001D629 017C000E 01830100 01041A00 0114BABE 8001D629 1005037C 100000F0 8001D629 017D000E 01830100 01041B00 0108BABE 8001D629 100503E8 100000F0 8001D629 017E000E 01830100 01041B00 015FBABE 8001D629 100503B9 100000F0 8001D629 0100000E 01830100 01041B00 011EBABE 8001D655 100503DF 100000F0 8001D655 0105000E 01030100 01021400 0150BABE 8001D655 0109000E 01830100 01021400 0112BABE 8001D655 010E000E 01030100 01041B00 0123BABE 8001D659 100503B1 100000F0 8001D659 0111000E 01830100 01021400 015BBABE 8001D659 0112000E 01830100 01041B00 014CBABE 8001D659 10050394 100000F0 8001D659 0114000E 01830100 01021400 0126BABE 8001D659 0115000E 01830100 01021400 012BBABE 8001D659 0117000E 01830100 01025F00 0157BABE 8001D659 0217FFFF 0201095F 8001D659 0118000E 01830100 01021800 017EBABE 8001D659 
0218FFFF 02010918 8001D659 011E000E 01030100 01041B00 0148BABE 8001D65D 1005038A 100000F0 8001D65D 0121000E 01830100 01021400 0109BABE 8001D65D 012B000E 01830100 01025F00 0172BABE 8001D661 022BFFFF 0201095F 8001D661 013A000E 01830100 01021400 0130BABE 8001D665 013D000E 01030100 01025F00 016CBABE 8001D665 023DFFFF 0201095F 8001D665 013F000E 01830100 01021400 0100BABE 8001D665 0148000E 01830100 01021400 013DBABE 8001D669 0152000E 01030100 01021400 0118BABE 8001D66D 0153000E 01830100 01021400 011BBABE 8001D66D 0101000E 01830000 01021800 014FBABE 8001D6A5 0201FFFF 02000918 8001D6A5 10F62900 10850008 8001D6A5 0108000E 01830000 01021800 0106BABE 8001D6A7 0208FFFF 02000918 8001D6A7 0113000E 01830100 01025F00 010DBABE 8001D6A9 0213FFFF 0201095F 8001D6A9 011C000E 01030100 01021400 012FBABE 8001D6AD 011D000E 01830000 01041B00 0147BABE 8001D6AD 100503C9 100000F0 8001D6AD 10050001 10C90302 1000002A 8001D6AD 10050100 100000F0 8001D6AD 011F000E 01830100 01021400 0159BABE 8001D711 0125000E 01830100 01021400 0115BABE 8001D711 0126000E 01830000 01041B00 0135BABE 8001D711 100503E7 100000F0 8001D711 012A000E 01830000 01351600 0120BABE 8001D711 3F2A041A 8001D712 1005039D 100000F0 8001D712 0135000E 01830000 01021800 0161BABE 8001D712 0235FFFF 02000918 8001D712 014B000E 01830100 01021400 0119BABE 8001D712 0158000E 01830000 01041B00 0125BABE 8001D712 100503C4 100000F0 8001D712 0178000E 01830000 01041B00 015EBABE 8001D712 1005038B 100000F0 8001D712 10F62902 10060002 8001D7E9 10050202 100000F0 8001D7E9 10F62902 10060100 8001D7EA 10050200 100000F0 8001D7EA 04DDD001 8001D7F1 10F62900 108B0008 8001D7F1 04DDD001 8001D7F3 04DDD001 8001D7F3 04DDD001 8001D7F4 04DDD001 8001D7F4 04DDD001 8001D7F6 04DDD001 8001D7F6 04DDD001 8001D7F7 04DDD001 8001D7F7 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F9 04DDD001 8001D7FA 04DDD001 8001D7FB 04DDD001 8001D7FB 04DDD001 8001D7FC 04DDD001 8001D7FC 04DDD001 8001D7FD 04DDD001 8001D826 04DDD001 8001D82C 
0172000E 01030000 01021800 0166BABE 8001DC15 0272FFFF 02000918 8001DC15 010D000E 01030000 01021800 0134BABE 8001DCE9 020DFFFF 02000918 8001DCE9 70108002 800003FB
Command completed successfully.

[root@green /root]# ipssend getevent 1 device
Found 1 IBM ServeRAID controller(s).

And then the machine went dead again: no response anymore, not even on the console ...

Phew, this sure is becoming one expensive server by now ..
100000F0 8001D6AD 10050001 10C90302 1000002A 8001D6AD 10050100 100000F0 8001D6AD 011F000E 01830100 01021400 0159BABE 8001D711 0125000E 01830100 01021400 0115BABE 8001D711 0126000E 01830000 01041B00 0135BABE 8001D711 100503E7 100000F0 8001D711 012A000E 01830000 01351600 0120BABE 8001D711 3F2A041A 8001D712 1005039D 100000F0 8001D712 0135000E 01830000 01021800 0161BABE 8001D712 0235FFFF 02000918 8001D712 014B000E 01830100 01021400 0119BABE 8001D712 0158000E 01830000 01041B00 0125BABE 8001D712 100503C4 100000F0 8001D712 0178000E 01830000 01041B00 015EBABE 8001D712 1005038B 100000F0 8001D712 10F62902 10060002 8001D7E9 10050202 100000F0 8001D7E9 10F62902 10060100 8001D7EA 10050200 100000F0 8001D7EA 04DDD001 8001D7F1 10F62900 108B0008 8001D7F1 04DDD001 8001D7F3 04DDD001 8001D7F3 04DDD001 8001D7F4 04DDD001 8001D7F4 04DDD001 8001D7F6 04DDD001 8001D7F6 04DDD001 8001D7F7 04DDD001 8001D7F7 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F8 04DDD001 8001D7F9 04DDD001 8001D7FA 04DDD001 8001D7FB 04DDD001 8001D7FB 04DDD001 8001D7FC 04DDD001 8001D7FC 04DDD001 8001D7FD 04DDD001 8001D826 04DDD001 8001D82C 0172000E 01030000 01021800 0166BABE 8001DC15 0272FFFF 02000918 8001DC15 010D000E 01030000 01021800 0134BABE 8001DCE9 020DFFFF 02000918 8001DCE9 70108002 800003FB
Command completed successfully.
[root@green /root]# ipssend getevent 1 device
Found 1 IBM ServeRAID controller(s).

And then the machine went dead again, no response anymore not even on the console ... Phew, this sure is becoming one expensive server by now ..

Created attachment 22387 [details]
For the record, see the next attachment for the rest of the story.
Created attachment 22388 [details]
To me this looks like the proof that it's bad hardware that caused all this.