Bug 126936
| Field | Value |
|---|---|
| Summary | Kernel 2.6.x locks up hard, 2.4.x works |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| Reporter | Jonathan Kamens <jik> |
| Assignee | Dave Jones <davej> |
| CC | alan, alex, pfrields, wtogami |
| Doc Type | Bug Fix |
| Last Closed | 2005-07-15 22:43:08 UTC |

Description
Jonathan Kamens
2004-06-29 14:01:03 UTC
Is this still causing problems in the current kernels?

It appears to no longer be a problem with kernel-smp-2.6.9-1.643 and glibc-2.3.3-74.

I may have spoken too soon. My computer locked up hard a few hours after I went to bed, after switching to the kernel referenced above. This may be because of the SiImage problem again, or I suppose there may be something else wrong with this kernel because it's a development kernel.

OK, my system is unusable right now, so I hope you had a good reason to ask me to retest. I can't back down to my old kernel now, because I had to install the new MAKEDEV for compatibility with the kernel you asked me to test, and that apparently deleted most of the devices from /dev, one of which is apparently the device required for the initial console. When I try to boot with the old kernel now (which is 2.4, so keep in mind that I can't use udev), it says it can't open the initial console and won't boot. Recreating the missing devices is rather difficult because I can't get to the actual /dev directory that the 2.4 kernel uses -- it's hidden by udev! I tried using a rescue disk, which caused all kinds of problems because it's old enough that it's incompatible with most of the binaries in my root partition. I managed to get my root partition mounted with the rescue disk and to use mknod to create some console devices (I couldn't use MAKEDEV to create all of them because of the above-referenced binary incompatibility problem), but apparently whatever devices I created weren't enough, because the kernel still couldn't create the initial console. Do you have any suggestions for what I might do now to restore the ability to boot with the old kernel? Also, can you recommend a decent PCI IDE controller which is rock solid with Red Hat kernels? Preferably one that's relatively fast but also relatively inexpensive. I'm perfectly willing to throw a little money at the problem if I can fix it by replacing the SiI card, although obviously it would be preferable if the Linux kernel worked with this card.

Well, I managed to recreate /dev/console and I'm back up and running with 2.4, with a few tweaks I still need to fix. I'd still like a better solution.

I've not heard any reports (good or bad) about the SiI controllers recently, so I'm unsure if that could be the cause. Alan?

There were definitely e-mail messages on LKML indicating that the Silicon Image PCI controller support in the 2.6.x kernel should be considered unstable, and I don't see anything in the ChangeLog files for the 2.6.x kernel series indicating that this instability has been addressed.

SI controllers in 2.6.x should be rock solid. The new SATA driver layer won't be used for the SI/CMD680. If it kicks in with 2.6, I'd start with acpi=off, because the IDE driver is identical.

Alan, I must confess that much of what you wrote above is too cryptic for me to understand. I did get that you want me to try "acpi=off" in my kernel boot, which I've done, and I'll let you know the results. But I'm under the impression that even before I did that, ACPI was still off for my system, because it's SMP and I see only these ACPI-related messages when I boot:

    Oct 31 07:24:39 jik kernel: ACPI: Unable to locate RSDP
    ...
    Oct 31 07:24:42 jik kernel: ACPI: Subsystem revision 20040816
    Oct 31 07:24:42 jik kernel: ACPI: Interpreter disabled.

Sorry, it's never easy to gauge the level of a reply in Bugzilla. There are two sets of Linux drivers for some of the SI devices. The SI680 that you have is handled by an existing stable driver in 2.4 and 2.6, with essentially the same code in both. The driver that has been unstable, but now seems far better, is a SATA-specific driver for the SI3112, which in 2.4 used the old IDE-style driver. As such, I'm puzzled that the SI680 driver should be the cause of the problems.

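Whether ACPI is actually active can be read from the boot log. The following is a minimal sketch, assuming the kernel prints the same "ACPI: Interpreter disabled" line quoted above (or "ACPI: Interpreter enabled" when it is active). `acpi_state` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch only: classify boot-log text by what it says about the ACPI interpreter.
# acpi_state is a hypothetical helper name used for illustration.
acpi_state() {
    # Reads log text on stdin; prints "disabled", "enabled", or "unknown".
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'ACPI: Interpreter disabled'; then
        echo disabled
    elif printf '%s\n' "$log" | grep -q 'ACPI: Interpreter enabled'; then
        echo enabled
    else
        echo unknown
    fi
}

# Typical use on the affected machine (left commented out so the sketch has no side effects):
#   dmesg | acpi_state
#   cat /proc/cmdline    # shows whether acpi=off was really passed at boot
```

A result of "unknown" only means neither message was found in the text supplied, for example because the early boot messages have already scrolled out of the kernel ring buffer.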
Am I right that ACPI was already turned off when I booted 2.6, even before I booted with "acpi=off"?

No. ACPI is on by default.

I tried acpi=off and it didn't help -- the system hung anyway. My newest theory is that there may be an issue with software RAID. I had five different software RAID partitions, /dev/md0 through /dev/md4, each with two mirrors, some with one mirror on my PIIX4 controller and the other on the SiI680, and some with both mirrors on the SiI680. I've disabled all of the RAID and switched back to using the plain partitions, and so far I haven't crashed or hung. We'll see if that keeps up.

Code review shows no apparent problematic differences. I found another bug involving the SI3112 only, but that's another story.

No luck -- the 2.6 kernel just locked up on me with all the RAID disabled. Alan, do you know of *any* extant kernel lockups that I might be able to test for to explain why my system is locking up under 2.6?

I just ran memtest86+ on my machine overnight for more than 7 hours and it didn't report a single error, so I think bad memory is unlikely to be the explanation for the lockups. Any suggestions for other things I can try?

I guess you're right that it isn't a problem with the SIIG controller, because I removed the controller and the machine locked up again.

OK, putting this one down to problematic hardware.

The same hardware works perfectly with 2.4.27-pre2-pac1 and various other 2.4.x kernels, so how can it reasonably be classified as a hardware problem?

Comment 17 seemed to imply that the problem is elsewhere. Is there any improvement with the latest 2.6.9 kernel?

2.6 and 2.4 will show up differing hardware problems in differing ways. OK, so it's not the SI680, so let's see what else it might be. Can you attach an lspci please? (Off either 2.4 or 2.6, it doesn't matter.)

Created attachment 108417
lspci output
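For anyone asked for the same details, the request above can be sketched as a small script that writes the running kernel version and the PCI device list to one file ready for attaching. This is only an illustration: `collect_hw_report` and the output path are hypothetical names, and the sketch assumes `lspci` (from pciutils) is installed.

```shell
#!/bin/sh
# Sketch only: gather the kernel version and PCI device list into one file.
# collect_hw_report is a hypothetical helper name used for illustration.
collect_hw_report() {
    out=$1
    {
        echo "== kernel =="
        uname -r
        echo "== lspci =="
        if command -v lspci >/dev/null 2>&1; then
            # -v adds IRQ and resource details to each device entry
            lspci -v || echo "lspci failed"
        else
            echo "lspci not found (it ships in the pciutils package)"
        fi
    } > "$out" 2>&1
}

# Example (hypothetical path):
#   collect_hw_report /tmp/hw-report.txt
```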
kernel-smp-2.6.9-1.1021_FC4 is no better than any of the other 2.6.x kernels I've tried. One thing I don't think I mentioned before is that when the kernel locks up, it seems to get very sluggish for a few seconds before the lock-up, i.e., the lock-up is not instantaneous.

I think I have a similar problem. Until a couple of months ago, I was using FC3 with kernel 2.6.9-1.667 with no problem. I updated to 2.6.9-1.681 and I get these lockups. I've done further updates since then, am now on kernel 2.6.10-1.737, and it's still happening. The bug is quite reproducible: I just have to do something which initiates some heavy disk activity (preferably including swap space) and the system hangs with no option but a hard reset :-(

Steps to reproduce:
1. Get lots (3.5GB) of data and write an ISO image to hard disk. If this does not do it, try:
2. Execute: split -b 500M <filename> on the ISO image.

For a while I thought it might be kjournald or kswapd, as they always appeared high up in a top window at the time of the crash, so I switched both off, but the crash occurs just the same. I agree with the last comment that the system becomes very sluggish just beforehand. In fact, during one of my attempts to perform the split command above, I realised that the system had become very sluggish, killed the process, and the system recovered just fine. I eventually got the ISO image split into its 7 pieces, so unfortunately the bug does not _always_ occur.

Fergal

Is the machine pingable? The problem I have is that if one process goes berserk (memory-wise) and the machine starts to swap heavily, it will virtually block the entire machine. To a desktop user, it might look as if the machine has halted (keyboard not reacting, mouse not reacting). For example, for some time now, for whatever reason, updating the kernel package in Fedora Core 2 and 3 takes enormous amounts of memory (both using rpm and yum).
The only way to update the kernel RPM is to limit the amount of memory that rpm/yum is allowed to use (and I also renice it):

    # ulimit -Sm 131072
    # ulimit -Hm 131072
    # nice yum update

Without these restrictions, the yum process (which seems to be very hard on memory) would lock up the entire machine. I can still ping, so I know the kernel is alive. And if I am *very* patient (patience measured in minutes), I might even get some feedback from the keyboard or mouse. I can see the disk activity light constantly on. And if I leave the machine powered on overnight and then go to work, by the time I get back home the kernel RPM package has been upgraded and everything is back to normal (until the next kernel upgrade). I've noticed this on two older machines (one desktop, one laptop) with Pentium MMX processors, as well as on one two-year-old Pentium 4 machine with 256 MB of RAM and a (good old) IDE disk (actually, I just locked it up completely by running yum update without first restricting how much memory it is allowed to take).

Now the question is, how is it possible that the kernel allows a single process to bring the entire machine to a complete halt? Sure, if one process makes the machine start swapping heavily, it will be noticed in overall performance and everything will slow down drastically. But it shouldn't bring everything to a stop; the kernel shouldn't allow a single process to monopolize the entire physical memory and prevent anything else from running at all.

No, the machine does not respond to anything. I've even left the computer in its crashed state for hours to see if it finally recovered by itself, but no dice. I'll try your renice/ulimit suggestion...

I don't understand why this ticket is still in "NEEDINFO" state. It would seem that the requested information has been provided, and furthermore, several other people besides me have attested to having the same problem, so I think we've ruled out a problem that is unique to my hardware. What can we do to move this along?

    00:0d.0 RAID bus controller: Silicon Image, Inc. (formerly CMD Technology Inc) PCI0680 Ultra ATA-133 Host Controller (rev 02)

I have TWO of these in my home server driving four disks with heavy load. I have had perfect stability so far. It doesn't matter if the bug is NEEDINFO or ASSIGNED, because there is NO USEFUL INFORMATION in this report to isolate a cause.

You don't need to shout. There are several people here who have experienced this problem and who are eager to give you whatever support you need to isolate it further. If there is no useful information here, then tell us what would constitute useful information and we will endeavor to provide it.

This may not work, but usually you will get useful oops or panic tracebacks if you connect a serial console from this computer to another and set up the tty properly.

I don't believe the two bugs in this bugzilla entry are at all related. One is a VM/out-of-memory handling bug (known, mostly fixed in 2.6.11); the other is a total mystery that smells hardware related (the NMI watchdog didn't help get data). In terms of IDE fixes, the only potentially relevant fix for 2.6.11 was an IDE shared IRQ handling corner case, which wouldn't produce the symptoms, and several md/raid fixes, which again don't fit the symptoms.

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which may contain a fix for your problem. Please update to this new kernel, and report whether or not it fixes your problem. If you have updated to Fedora Core 4 since this bug was opened, and the problem still occurs with the latest updates for that release, please change the version field of this bug to 'fc4'. Thank you.

Sorry, I finally gave up a couple of months ago and bought a new computer. I'm no longer using the computer that locks up with 2.6.x kernels, and I don't have time to put it back together enough to test this.
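
For anyone chasing a similar hang, the serial console and NMI watchdog suggestions above can be sketched as extra kernel boot parameters. This is a minimal illustration, not a recipe from the thread: `add_debug_params` is a hypothetical helper, and the port (`ttyS0`), the speed (115200 baud), and the example `root=LABEL=/` command line are all assumptions that need adjusting to the actual machine.

```shell
#!/bin/sh
# Sketch only: append debugging parameters to an existing kernel command line.
# add_debug_params is a hypothetical helper; ttyS0 and 115200 are assumed defaults.
add_debug_params() {
    cmdline=$1
    port=${2:-ttyS0}
    baud=${3:-115200}
    # Two console= entries send kernel messages to both the local display
    # and the serial port; nmi_watchdog=1 asks the kernel to try to detect
    # hard lockups and print a trace.
    printf '%s console=tty0 console=%s,%sn8 nmi_watchdog=1\n' "$cmdline" "$port" "$baud"
}

# Example: print the line to paste into the boot loader's kernel entry.
# "ro root=LABEL=/" is only a placeholder for the machine's existing options.
add_debug_params "ro root=LABEL=/"
```

On the second machine, a terminal program attached to the other end of the null-modem cable (for example `screen /dev/ttyS0 115200`, assuming the same port name there) can then capture whatever the kernel prints before it locks up.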