Description of problem: System crash when trying to boot kernel 2.6.20-1.2933.fc6 on system with HighPoint controllers. Version-Release number of selected component (if applicable): kernel 2.6.20-1.2933.fc6 How reproducible: every time Steps to Reproduce: 1. install kernel 2.6.20-1.2933.fc6 2. reboot 3. Actual results: crash Expected results: boot sequence Additional info: This is occurring with a motherboard with an onboard HighPoint HPT372 controller and a PCI card HighPoint HPT302 controller. The system has 4 PATA hard drives, 2 attached to each of the controllers, as masters, one per channel. CPU is Athlon XP 2000. System has been working fine with kernels 2.6.18-1.2869 and 2.6.19-1.2869. Screenshot of crash output is attached.
Created attachment 151093 [details] Screenshot of crash
Can you boot in 50-line mode and get the whole message? (Just add "vga=1" to the kernel command line for that kernel.)
Created attachment 151124 [details] Screenshot of crash in 50-line mode
I've attached the 50-line mode output. I see the message about the bus timing, and the following could be related: The motherboard is an Abit KG7-RAID board. When this board was released, processors were in the 600-800 MHz range. FSB speed was 100. Subsequently Abit provided BIOS updates that would allow for higher speed CPUs, with the last BIOS supporting a 2600+ CPU. The last Athlon XP+ processors were all 133 FSB, but for whatever reason the KG7-RAID board would never operate stably except at 100 FSB. Any of the 133 FSB settings always results in a hung board after a couple of days. Therefore you had to select the highest 100 FSB CPU speed in the BIOS, which actually detuned the processor speed, to get a truly stable configuration (at least one that could be used as a server). So right now the CPU is Athlon XP 2000+. The BIOS is set at 1600(100), which is the highest 100 FSB speed. The board is rock stable at this setting and it's been running Linux like this for over four years without any problem.
What speed is it running the PCI bus? It should be 3 x 33 MHz for the FSB speed but it might be using 4 x 25 MHz (25 or 33 being the PCI bus speed.)
Multipliers: CPU:Memory:PCI Bus 3: 3: 1 PCI Bus is running at 33MHz
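For reference, the divider arithmetic in the two comments above can be sketched as follows. This is only an illustration under a simple assumption: the PCI clock is the FSB clock divided by the FSB:PCI divider (3 or 4, as mentioned above). The function name `pci_clock_mhz` is made up for this sketch and does not come from the kernel or the BIOS.

```python
def pci_clock_mhz(fsb_mhz: float, divider: int) -> float:
    """Return the PCI clock for an FSB clock and an FSB:PCI divider.

    Assumed model: PCI clock = FSB clock / divider. Hypothetical helper.
    """
    if divider <= 0:
        raise ValueError("divider must be positive")
    return fsb_mhz / divider


if __name__ == "__main__":
    # The two cases raised in the thread: 100 MHz FSB with divider 3 or 4.
    for fsb, divider in [(100.0, 3), (100.0, 4)]:
        pci = pci_clock_mhz(fsb, divider)
        print(f"FSB {fsb:.0f} MHz / {divider} = PCI {pci:.1f} MHz")
```

With a 100 MHz FSB, a divider of 3 gives the nominal 33 MHz PCI clock, while a divider of 4 would give 25 MHz.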
There is a patch for Highpoint 302 out now. It will be in the next update.
According to the message at the top of https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=151124 (the [38 2]) the system is using a 38MHz clock. That's unsupported.
That might be what it shows but it is in error. I have verified on several machines that the clock is 33MHz. The clock speed is being misidentified. The 2933 kernel is the only kernel that has ever caused a problem of this nature on these machines and they have been running for years.
The clock detection code in hpt3xx was rewritten recently and may be finding wrong values, or maybe it's right and the PC BIOS is lying. :) Please try kernel parameter "idebus=33"
Before I do that, I checked the CPU speed. It's 1253MHz. Divide that by 33MHz and you get 38.
Ok, I tried adding "idebus=33" to the kernel parameters when booting two of the machines and no change - still crashes same as before. BTW still shows [38 2] even with adding the parameter.
I just tried the 2.6.20-1.2952 kernel and it too crashes with the exact same "unknown bus timing" error. I have not been able to load any of the 2.6.20 kernels on my servers. Something has drastically changed in the bus timing detection code. I've been able to load every kernel for the last four years without any crash occurring except for these 2.6.20 kernels. There needs to be some update that says my bus timing is ok. These boards have stock BIOS with no tweaks. All standard settings right from menus in BIOS. Help please!
Smolt profile: http://smolt.fedoraproject.org/show?UUID=ab65ac7d-1e79-4479-b7bf-14f4936b6e2a
Here is a bug that may be related in some way. Also regarding HPT controllers. https://bugzilla.redhat.com/242270
(In reply to comment #3) > Created an attachment (id=151124) [edit] > Screenshot of crash in 50-line mode Alas, this is still not enough. The only thing I could figure out was that you have either HPT372 or HPT302 chip with N suffix. Which one caused the kernel oops, remained unknown. Have you tried pressing Shift-PgUp after oops to scroll back the screen? (In reply to comment #8) > According to the message at the top of > https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=151124 > (the [38 2]) the system is using a 38MHz clock. That's unsupported. Erm, not exactly. It's a value of f_LOW register at which the DPLL calibration failed. (In reply to comment #10) > The clock detection code in hpt3xx was rewritten recently and may be > finding wrong values, or maybe it's right and the PC BIOS is lying. :) Erm, not before 2.6.21-rc1. The bootlog corresponds to the older code. > Please try kernel parameter "idebus=33" That won't avail as this driver detects PCI clock itself. (In reply to comment #13) > I just tried the 2.6.20-1.2952 kernel and it too crashes with the exact same > "unknown bus timing" error. I have not been able to load any of the 2.6.20 > kernels on my servers. Something has drastically changed in the bus timing > detection code. There were no *drastic* changes at that point yet. Could you try 2.6.21?
I have both HPT372 and HPT302 in this machine. Don't know about the "N". Is that important? I checked yum but no 2.6.21 kernels. What repo do I need to setup to get a 2.6.21 kernel? Is it the updates-testing?
(In reply to comment #17) > I have both HPT372 and HPT302 in this machine. Don't know about the "N". Is > that important? It is important, or I wouldn't have asked. It's also important to know which of the two chips caused the oops. > I checked yum but no 2.6.21 kernels. What repo do I need to > setup to get a 2.6.21 kernel? Is it the updates-testing? Erm, I meant building a kernel from source. Never mind. :-)
(In reply to comment #18) > > I have both HPT372 and HPT302 in this machine. Don't know about the "N". Is > > that important? > It is important, or I wouldn't have asked. It's also important to know which of > the two chips caused the oops. Could you post the output of 'lspci' on a working kernel?
00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] System Controller (rev 13)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 [IGD4-1P] AGP Bridge
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South] (rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1a)
00:07.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 1a)
00:07.4 SMBus: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI] (rev 40)
00:08.0 RAID bus controller: Triones Technologies, Inc. HPT302/302N (rev 02)
00:0d.0 Ethernet controller: Linksys Gigabit Network Adapter (rev 10)
00:0f.0 VGA compatible controller: Trident Microsystems Blade 3D PCI/AGP (rev 3a)
00:11.0 Ethernet controller: Linksys Gigabit Network Adapter (rev 10)
00:13.0 Mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372/372N (rev 04)

00:08.0 RAID bus controller: Triones Technologies, Inc. HPT302/302N (rev 02)
        Subsystem: Triones Technologies, Inc. Unknown device 0001
        Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 11
        I/O ports at a000 [size=8]
        I/O ports at a400 [size=4]
        I/O ports at a800 [size=8]
        I/O ports at ac00 [size=4]
        I/O ports at b000 [size=256]
        Expansion ROM at 88000000 [disabled by cmd] [size=128K]
        Capabilities: [60] Power Management version 2

00:13.0 Mass storage controller: Triones Technologies, Inc. HPT366/368/370/370A/372/372N (rev 04)
        Subsystem: Triones Technologies, Inc. HPT370A
        Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 10
        I/O ports at bc00 [size=8]
        I/O ports at c000 [size=4]
        I/O ports at c400 [size=8]
        I/O ports at c800 [size=4]
        I/O ports at cc00 [size=256]
        Expansion ROM at 88060000 [disabled by cmd] [size=128K]
        Capabilities: [60] Power Management version 2
(In reply to comment #20) > 00:08.0 RAID bus controller: Triones Technologies, Inc. HPT302/302N (rev 02) > Subsystem: Triones Technologies, Inc. Unknown device 0001 OK, from the boot log I can suppose this one (HPT302N indeed) caused the oops. Unfortunately, what the log lacks is couple more lines before that [38 2] one... > Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 11 > I/O ports at a000 [size=8] > I/O ports at a400 [size=4] > I/O ports at a800 [size=8] > I/O ports at ac00 [size=4] > I/O ports at b000 [size=256] > Expansion ROM at 88000000 [disabled by cmd] [size=128K] > Capabilities: [60] Power Management version 2 > 00:13.0 Mass storage controller: Triones Technologies, Inc. > HPT366/368/370/370A/372/372N (rev 04) > Subsystem: Triones Technologies, Inc. HPT370A Hm, I'd think that rev. 4 would match HPT370A, not HPT372 that you reported...
Ok, yes I have different board with HPT372. Sorry. To check I got into one of the cases and actually looked at the boards. The mainboard has an onboard HPT370A and the PCI card is HPT302NLF (on chip). The HPT302N is known as a Rocket133 from some of the old box materials I found. HTH
(In reply to comment #16) > (In reply to comment #3) > > Created an attachment (id=151124) [edit] > > Screenshot of crash in 50-line mode > > Alas, this is still not enough. The only thing I could figure out was that you > have either HPT372 or HPT302 chip with N suffix. Which one caused the kernel > oops, remained unknown. Have you tried pressing Shift-PgUp after oops to scroll > back the screen? That failing, could you connect the target to another box and use the serial console to get the complete bootlog?
In preparation to perform some more testing and try to get a good bootlog I went and removed all 2.6.20 kernels from two of the servers and had yum reinstall the 2952 kernel again from updates repo. And guess what, both servers are now booting 2.6.20-1.2952.fc6.i686 fine without any problems. So what changes were made to the kernel package in the updates repo since 3-Jun? Something must have changed for these kernels to all of a sudden start booting fine on two different servers that could not boot any 2.6.20 kernel previously.
> So what changes were made to the kernel package in the updates repo since > 3-Jun? The repository has not changed since June 3. But if you remove all kernels and then install a new one some config options might be reset to defaults.
This is a real mystery. I just checked one other server that had a 2.6.20-1.2952 kernel and when I booted into that kernel it crashed. So I went and yum removed all 2.6.20 kernels from that machine (it still has 2.6.18 and 2.6.19 kernels), did a yum install to get a new 2.6.20-1.2952 kernel. Rebooted into the newly installed 2952 kernel and it booted fine. So what is going on?
All is not well still. I just tried this with yet another server and despite removing all the 2.6.20 kernels and then installing a new 2.6.20-1.2952 kernel it still crashed on boot (same errors). So maybe this is some uninitialized data issue or some other spurious problem.
Created attachment 156275 [details] screenshot of the boot right before the crash A little blurred but readable.
(In reply to comment #28) > Created an attachment (id=156275) [edit] > screenshot of the boot right before the crash > A little blurred but readable. Hm... the boot log is generally not usable if you're specifying 'quiet' option. ;-)
Created attachment 156278 [details] screenshot #1 of the boot right before the crash
Created attachment 156279 [details] screenshot #2 of the boot right before the crash These lines fly by so fast it is very hard to catch a clear image. Hope these help.
(In reply to comment #31) > Created an attachment (id=156279) [edit] > screenshot #2 of the boot right before the crash Well, it's HPT302N that caused oops indeed -- well, now that we've identified the chip, I think it's time to change the summary... PCI clock detected was below 35 MHz. Let me check the frequency figures using the general equation, not the stupid thresholds like this driver does... pci_clk = (f_cnt * dpll_clk) / 192 = (72 * 77) / 192 ~= 28.9 Hm, could it be that your PCI is underclocked?.. Or has HighPoint changed something WRT how it stores f_CNT average or even clocked HPT302N with some other clock than 77 MHz that I was assuming (90 MHz would give an adequate result)?! > These lines fly by so fast it is very hard to catch a clear image. I hoped that pressing Shift-PgUp after oops should help... > Hope these help. It did, thanks.
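The calculation in the comment above can be reproduced with a short sketch. It uses only the equation quoted there, pci_clk = (f_cnt * dpll_clk) / 192, with the f_CNT reading of 72 and the two DPLL clocks mentioned (77 MHz assumed, 90 MHz as the hypothetical alternative). The function and constant names are made up for illustration and are not the identifiers used in the hpt366 driver.

```python
F_CNT_DIVISOR = 192  # constant from the equation quoted in the comment


def pci_clk_mhz(f_cnt: int, dpll_clk_mhz: float) -> float:
    """Estimate the PCI clock in MHz from an f_CNT reading.

    Implements pci_clk = (f_cnt * dpll_clk) / 192 as quoted in the thread.
    Hypothetical helper name, not taken from the driver source.
    """
    return f_cnt * dpll_clk_mhz / F_CNT_DIVISOR


if __name__ == "__main__":
    f_cnt = 72  # reading taken from the boot screenshot discussed above
    for dpll in (77, 90):
        print(f"f_CNT={f_cnt}, DPLL={dpll} MHz -> PCI ~ {pci_clk_mhz(f_cnt, dpll):.3f} MHz")
```

With a 77 MHz DPLL clock the reading works out to 28.875 MHz, well under the nominal 33 MHz; with a 90 MHz DPLL clock the same reading would give 33.75 MHz, which is why 90 MHz "would give an adequate result".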
(In reply to comment #11) > Before I do that, I checked the CPU speed. It's 1253MHz. Divide that by 33MHz > and you get 38. That's not the only variant. Divide it by 30MHz and you'll get 14 with FSB running at 90 MHz -- that would have explained the failure.
What is truly puzzling is that all my servers are the same board, same BIOS version level, same drives, and other hardware. But some of these are able to boot the 2.6.20-1.2952 that I just installed on them after deleting all other 2.6.20 kernels. But some are not. I checked /proc/cpuinfo and the CPU on problem machines is stepping 0. Others are stepping 1. All machines as stated above are running Athlon XP+ processor (I believe meant for 133FSB) at 100FSB. Now you should be able to select any of the allowable values in the BIOS speed menu for these CPUs and 1600(100) is the highest 100FSB speed so that is where I run them. The reason for this is that for whatever reason there was some hardware instability issue with these boards at 133FSB but they are rock stable at 100FSB so therefore that is how we run them. They have run many many kernels over the past four years like this without any problems until 2.6.20. I also see that there was a bug opened on F7 about problem with HighPoint controller too so this problem is also probably in 2.6.21. HTH
(In reply to comment #15) > Here is a bug that may be related in some way. Also regarding HPT controllers. > > https://bugzilla.redhat.com/242270 Gerry, if you mean this bug, it's installed-specific and completely unrelated.
(In reply to comment #35) > (In reply to comment #15) > > Here is a bug that may be related in some way. Also regarding HPT controllers. > > https://bugzilla.redhat.com/242270 > Gerry, if you mean this bug, it's installed-specific and completely unrelated. BTW, does F7 work for you?
I have only installed it in QEMU VMs so far just to get a look at it. I don't have any spare servers with this hardware that I can use to check it. I haven't looked at any of these LiveCDs. Can I boot from one of them to check?
With LiveCD would it still install the HighPoint controller drivers?
I'm downloading Fedora-7-Live now. I'll try it and let you know what happens.
(In reply to comment #38) > With LiveCD would it still install the HighPoint controller drivers? I have no idea. (In reply to comment #39) > I'm downloading Fedora-7-Live now. I'll try it and let you know what happens. Even if it does, that will be libata driver, not the one from drivers/ide/. But worth trying anyway.
(In reply to comment #32) > (In reply to comment #31) > > Created an attachment (id=156279) [edit] > > screenshot #2 of the boot right before the crash > Well, it's HPT302N that caused oops indeed -- well, now that we've identified > the chip, I think it's time to change the summary... > PCI clock detected was below 35 MHz. Let me check the frequency figures using > the general equation, not the stupid thresholds like this driver does... > pci_clk = (f_cnt * dpll_clk) / 192 = (72 * 77) / 192 ~= 28.9 > Hm, could it be that your PCI is underclocked?.. Or has HighPoint changed > something WRT how it stores f_CNT average Or maybe BIOS somehow incorrectly calculates this value on those step 0 CPUs... It would be interesting to see the boot log from the step 1 CPUs on which booting doesn't fail, if I understood correctly -- that shouldn't be an issue, dmesg will help. > or even clocked HPT302N with some > other clock than 77 MHz that I was assuming (90 MHz would give an adequate result)?! That doesn't seem likely...
Created attachment 156414 [details] dmesg output from step1 machine
LiveCD fails on this hardware. All kinds of block errors on sr0. So no help there.
Created attachment 156416 [details] dmesg output from step0 machine (2.6.19 kernel)
(In reply to comment #42) > Created an attachment (id=156414) [edit] > dmesg output from step1 machine FREQ: 75 corresponds to 30 MHz PCI clock... Probably this clock is still tolerable for the selected starting f_LOW value. The 2.6.21 driver should work better as it doesn't have the fixed values anymore. I'd suggest that RH backport the recent version of the hpt366 driver... (In reply to comment #44) > Created an attachment (id=156416) [edit] > dmesg output from step0 machine (2.6.19 kernel) This one was of little interest -- everything as expected.
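As a cross-check of the "FREQ: 75 corresponds to 30 MHz" statement, the equation quoted earlier in this thread can also be run in reverse to see which f_CNT reading a given PCI clock would produce. This is a sketch that assumes the 77 MHz DPLL clock mentioned earlier; the function names are hypothetical and not taken from the driver.

```python
DPLL_CLK_MHZ = 77     # DPLL clock assumed earlier in the thread
F_CNT_DIVISOR = 192   # constant from the quoted equation


def pci_clk_from_f_cnt(f_cnt: float) -> float:
    """pci_clk = (f_cnt * dpll_clk) / 192, as quoted in the thread."""
    return f_cnt * DPLL_CLK_MHZ / F_CNT_DIVISOR


def f_cnt_from_pci_clk(pci_clk_mhz: float) -> float:
    """Inverse of the equation: the f_CNT reading expected for a PCI clock."""
    return pci_clk_mhz * F_CNT_DIVISOR / DPLL_CLK_MHZ


if __name__ == "__main__":
    print(f"FREQ 75 -> PCI ~ {pci_clk_from_f_cnt(75):.2f} MHz")
    for pci in (30.0, 100.0 / 3):
        print(f"PCI {pci:.2f} MHz -> expected f_CNT ~ {f_cnt_from_pci_clk(pci):.1f}")
```

Under these assumptions a reading of 75 maps to roughly 30 MHz, and a true 33.3 MHz PCI clock would be expected to read about 83, so both the step 0 and step 1 machines appear to report less than the nominal clock.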
Seems sane to me - and explains why an identical board tested with libata drivers did work - must be the later version without the whacko PCI. FC6 doesn't usually backport stuff - it just ships newer kernels so will pick up the newer fixes as they go mainstream.
Is there some way to increase the PCI bus clock on these machines? Some motherboards allow tweaking the speed in 1MHz increments.
I know that these boards were also used by a lot of gamers and they do have a lot of options in this BIOS. Without having checked yet, I think the FSB speed is able to be increased and the PCI bus clock is derived as either 3:3:1 or 4:4:1. I'll check this in a few minutes.
Kudos, Chuck! Good workaround. I found settings in the BIOS for 1MHz increments for FSB speed. So after experimenting with these, when I increase the FSB speed on the stepping 0 boards by about 5%, the 2.6.20-1.2952 kernel starts booting. So I left it increased by 8% and we'll see how stable this is. I don't want to trigger the inherent h/w instability near 133FSB, but at least the board is now booting.
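One possible reading of why a bump of about 5% is enough, sketched below. It assumes the detected PCI clock scales linearly with the FSB setting, and it reuses the two f_CNT readings reported earlier in this thread (72 on a failing machine, 75 on a booting one) together with the 77 MHz DPLL clock assumed there. The helper names are made up, and this is an interpretation of the numbers, not something confirmed by the driver source.

```python
DPLL_CLK_MHZ = 77
F_CNT_DIVISOR = 192


def pci_clk_from_f_cnt(f_cnt: float) -> float:
    """pci_clk = (f_cnt * dpll_clk) / 192, the equation quoted in the thread."""
    return f_cnt * DPLL_CLK_MHZ / F_CNT_DIVISOR


def scaled_clock(base_mhz: float, increase_percent: float) -> float:
    """Clock after raising the FSB by a percentage (assumes linear scaling)."""
    return base_mhz * (1 + increase_percent / 100)


if __name__ == "__main__":
    failing = pci_clk_from_f_cnt(72)   # reading from a machine that crashed
    booting = pci_clk_from_f_cnt(75)   # reading from a machine that booted
    print(f"failing machine: {failing:.2f} MHz, booting machine: {booting:.2f} MHz")
    for pct in (5, 8):
        print(f"+{pct}% FSB -> detected PCI ~ {scaled_clock(failing, pct):.2f} MHz")
```

Under these assumptions, a 5% increase lifts the failing machine's detected clock from about 28.9 MHz to just above the roughly 30.1 MHz seen on the machines that boot, which would be consistent with the observation above.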
Resolving this as NOTABUG but hopefully there will be an upstream patch that we will pick up eventually so the driver will tolerate lower bus speeds.
(In reply to comment #46) > Seems sane to me - and explains why an identical board tested with libata > drivers did work - must be the later version without the whacko PCI. Well, pata_hpt3x2n doesn't use fixed f_LOW thresholds either, so indeed should work on both steppings. > FC6 doesn't usually backport stuff - it just ships newer kernels so will pick up > the newer fixes as they go mainstream They should be in 2.6.21 and this is 2.6.20. :-) (In reply to comment #50) > Resolving this as NOTABUG but hopefully there will be an upstream patch that > we will pick up eventually so the driver will tolerate lower bus speeds. There have been upstream patches for about a year in -mm tree -- they just only got into 2.6.21 due to not being reviewed all that time. So, you may just use the later driver.
Well, I do not agree that this is not a bug. Finding a BIOS tweak workaround does not mean that this is not a bug. These boards have booted dozens of different kernels over the past four years without problem until 2.6.20 series. That constitutes a bug. The kernel tolerance had changed in 2.6.20 and it sounds like maybe that was corrected in later 2.6.21 driver. I'd sure like to know for sure if that is the case. I don't like having to keep special BIOS settings for specific servers. That leads to all kinds of problems later if you forget why some settings are a certain way. All the servers are configured identically and I want to keep them that way. How soon would a 2.6.21 kernel be available in updates for FC6?
FC6 is probably going to jump to kernel 2.6.22 since 2.6.21 is very buggy.