From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9) Gecko/20010505

Description of problem:
first effect noticed--after linux has been up a day or so, i come in the next day and do "mount /mnt/zip" to mount a zip drive. at that point, the system would freeze. the only option is to do a hard reboot, since nothing is functioning any more (cannot ping the network card, etc). current symptom--i mount the zip drive and a few minutes later (within like 10 or 20), the system freezes. the new symptoms have started happening since i have been trying different protocols on the parallel port. original effect happened with ps/2 type parallel, have seen worse behavior with ECP and EPP types. this is all using a parallel port 250 meg zip drive with the IMM kernel module.

How reproducible:
Always

Steps to Reproduce:
1. follow zip how to document for zip250 drive.
2. mount a zip disk in the drive.
3. either problem will happen shortly where system becomes totally non-functional, or leave it running (after unmounting the disk) for a day or so, then try mounting again.
4. total system lock up.

Actual Results:
linux machine was totally unresponsive. no log entries were made regarding this problem. it's like the power had been turned off, but without flipping the switch.

Expected Results:
the zip drive should be able to mount and unmount any number of times without a freeze. redhat 6.2 had no problems with this at all. redhat 7.1 can't do it to save my life.

Additional info:
Tim: Sound familiar?
Is this an SMP machine? Try taking out the line 'use_new_eh_code: 1,' from drivers/scsi/imm.h. I have had one report about lock-ups from ppa that went away when the old error handling code was used instead of the new. Perhaps this is another instance of that?
this is not an SMP machine. it's a 75 mhz pentium from micron. i don't really have the resources on that machine to do any recompilation. i have found that by setting the parallel port mode to "AT", the drive is slightly more stable. today, instead of freezing up when i tried to mount a zip disk, it merely said "/dev/sda4 is not a valid device". so there's still some brain damage, but at least there's no freeze. however, this is not really a solution, since i still need to reboot the machine daily to be able to use the zip drive. is it possible y'all could send me a recompiled imm.o to try out?
ftp://people.redhat.com/twaugh/tmp/kernel-2.4.2-2tmw1.{i386,src}.rpm For tonight only; I am out of space on that machine..
got the files. will install and watch them starting tomorrow. thanks... fred.
the patched kernel did not make any difference. the machine has been left with its parallel port type set to "AT". this seemed at first to have better results with the old kernel, but that wasn't borne out. i can still see it freeze up with that parallel port type. and the machine will in fact still freeze for both the new and the old 2.4.2 kernels. the error handling change that went into the newer kernel RPM does not seem to address the issue at all.
Could you try enabling magic-sysrq (edit /etc/sysctl.conf, change kernel.sysrq to 1), and see if Alt+SysRq+P does anything at all (make sure X isn't running, or you won't see any messages on the text console). Have you tried the errata kernel?
Created attachment 22609 [details] /var/log/messages from just after a partial freeze
i tried the magic-sysrq business, but realized something this time. the console _is_ still alive, at least sometimes. mainly my ssh session was frozen up. however, when i logged in as root at the console and entered "eject /mnt/zip", that session also froze up. previously, i could not ping or ssh to the machine, nor could i get any response from it at the console. i think maybe the testing version of the kernel (see above) may have allowed this new behavior. or the effects of the problem might vary, and i could have gotten lucky this time. in any case, this time there were log entries in /var/log/messages that seem related to the problem, including a stack dump. i hope that helps. i've attached them to the bug report. thanks, fred. ps: the disk in the zip 250 is actually a zip 100. i don't know if that's relevant or not.
oops, sorry to omit this: i haven't tried the errata kernel yet. i will try that out, assuming it's included on the redhat discs or is available via gnorpm. or is that the new version of the current kernel being referred to? if i can get it via the update program, i will do that instead... otherwise i might need a pointer. and i was so excited upon seeing the console functioning that i forgot to try hitting ctrl-alt-sysrq. it is enabled though, and i will try it out the next time i see the freeze upon mounting. thanks again... -fred
got the errata kernel (2.4.3-12) installed. will watch and see if the same freeze-up happens. with the testing kernel i was using (2.4.2-2tmw1), the message of "/dev/sda4 is not a valid device" is happening about 50% of the time now. a real freeze of the session trying the mount still happens for the other half of attempts.
have now tried the errata kernel (2.4.3-12)... with the newer kernel, the problem occurred again today when i tried to mount. the system had been idle over the weekend, but froze as soon as i tried to mount the zip disk. this was the full-blown freeze too; the machine was no longer pingable after the freeze. also, there was not a single mention in the log of any problem with the drive. there were only the traditional couple of messages about the zip drive after i logged in via ssh and tried to mount:

Jul 9 09:23:06 zeno kernel: Attached scsi removable disk sda at scsi0, channel 0, id 6, lun 0
Jul 9 09:23:06 zeno kernel: SCSI device sda: 196608 512-byte hdwr sectors (101 MB)

followed by the boot messages from the next reboot. nothing in between indicated a stack dump or crash. so, it seems to me like the newer 2.4.2.tmw kernel did have some slight advantages over the original 2.4.2 and the errata kernels. i still need a few more data points though... (and i forgot to try ctrl-alt-sysrq again, blast it. will try tomorrow or next time i see the freeze.)
i see i have been misquoting the sysrq key sequence in earlier comments... argh. but i did try the recommended sequence of alt-sysrq-P this morning after the freeze. it did absolutely nothing. this is still with the errata kernel. the freeze on mount appears to turn the system completely inactive, even with the magic sysrq thing enabled. as far as i know, i have exhausted all suggestions again. anything else i should try to help debug this? thanks...
Does it seem to be linked with a particular ZIP disk? Are you in X when this happens? It would be very useful for you to try to get the freeze to occur when you are _not_ in X, as then the kernel messages will go to the console (also, do 'dmesg -n 8' first). (If you need to use X on that machine, set up a serial or parallel console instead.)
the problem doesn't seem to be linked to a particular zip disk; today i was able to cause the freeze by doing "mount /mnt/zip" without a disk actually in the zip drive. also, the freeze is happening when the machine is sitting mostly idle without x windows running. this machine is mainly a server for my personal files and the zip drive, so it almost never has X running. i am pretty sure all of the freezes have happened without X. the machine is still going completely brick dead when this problem happens (and i've still got the errata kernel installed). it doesn't seem to be a load issue at all, and it's definitely not that X is hiding the console or something; this is happening from a machine doing basically nothing at the time besides mounting the zip disk. the only recourse still appears to be doing a hard reset of the box. this problem is still a continual pain for me; the same machine running redhat 6.2 never needed a reboot (except for power failures), but now needs to be rebooted every work day. it's quite a hassle and the hard disk errors are probably starting to pile up from all of the hard resets. is there anything that i can do to make the crash a little less destructive (besides rebooting it every day before trying to do anything)? i don't want to bitch and bitch, but this particular problem has the net effect of my being unable to say that redhat linux is more stable than windows right now. in terms of my end user experience, rh7.1 is currently less stable. that grates on me a lot worse than the problem itself.
ps: mount / umount / eject all work great after a reboot, including multiple mounts, filling up zip disks, ejecting the disk, mounting another, etc. it is only when the machine has had a longer time span (like overnight) when this freeze-up problem arises. isn't there some procedure that i haven't tried yet? i'm open to suggestion and can provide any config files needed for debugging. even "upgrading" the machine to 7.1 as released is fine if you think that will help track the problem down. for projects like cygwin, it is clear to me now that any fixes for bugs are going to be my own responsibility. however, i have purchased redhat linux (several releases, including 7.1). for a product i have paid money for, i expect a certain level of support. but it seems that for this bug, i have not gotten any responses to my last few posts and questions. please help me to help you get this problem fixed because it really does detract from my experience with redhat linux. thanks, fred.
This is not a support forum but a bug-report forum. If there is anything we developers could do, we've done so already.
Have you set up a serial console yet?
have not set up a serial console yet. i'm going on vacation for all of next week (8/6-8/10) and so can't try anything else until after then. does it seem likely that the serial console will work when the main display console is frozen (and the machine is also non-pingable)? i'm doubting it currently but will try the experiment the week after next. thanks, fred. ps: one comment mentioned that the programmers have done everything they can as far as suggestions; can i make the suggestion that a simple system be set up at redhat using the imm driver with a (parallel) zip250 drive? hopefully there's at least one of these drives floating around redhat, and that might be a lot more expedient for seeing if the problem is general or just on my older pentium.
I am using one of these drives every single day, which is why your report has me stumped.
wow, i'm glad to hear that this problem is fairly isolated. perhaps i will run through the 7.1 upgrade again first before i do anything else, just ensuring that this is not all caused by a speck on a cd or something similarly random. unless that seems dangerous or futile...
okay, things have finally cleared to the point where i had some time to set up a serial console. the redhat linux 7.1 box still needs to be rebooted every day; otherwise i still get the crash on mounting the zip disk every time. mounts after rebooting just fine, works all day, i eject the disk at 5pm, come in the next day, try to mount, and FREEZE-OLA. and guess what? the serial console also sees a completely dead state, just like the main console and the network connections. after this freeze-up, nothing at all works, as i've mentioned in previous entries. the last thing the machine spits out occurs while it's still in the process of mounting the zip disk:

Attached scsi removable disk sda at scsi0, channel 0, id 6, lun 0
SCSI device sda: 196608 512-byte hdwr sectors (101 MB)
sda: Write Protect is off

that is also the last text that the main console shows. isn't there some process by which y'all can attempt to debug these situations more directly? when i'm trying to find a bug at my job, i ask the customer service people to: (1) gather log files, (2) gather configuration info, (3) perform test actions on the machine and gather results. so far i've only seen option (3) being brought into play. setting up the serial console was, as i feared, a total boondoggle and has only delayed the fixing of the actual bug. don't y'all want to kill this bug off before it makes it into redhat 7.2 also? if it's a kernel bug, it seems even more important to isolate it or at least report it. i really want to help to kill this bug. what more can i do to help find it?
The reason for not asking for log files is simple: as far as the kernel is concerned, the console _is_ the log file. That's why I asked for serial console output first. The reason for not asking for config info is that I already have it: the kernel configuration is done at compile time (.config). As far as reporting the bug goes: I am the maintainer of that code, so you already did that. All I can suggest that you do is gather as much information about the system as you can and provide it. BIOS version, CPU stepping, chipset, etc. As much as you can find out. Basically, yours is the only machine I have heard this happening on (it certainly works fine on all the machines I have), so it's something special there. Of course we want to get this bug fixed, but we need to know how to fix it first. :-)
Created attachment 33881 [details] micron 75 mhz info
okay, i will try to provide this information. the config files i was referring to were the /etc files that dictate the configuration of the zip drive and such. let me know if those are needed. and i have now seen two occurrences of what look like this very same problem on a different machine. this other machine has a scsi card though (not parallel port zip) and has a 100 meg zip drive instead of 250 meg. should i start a separate incident report for it? the effects were different on that machine; there was no freeze, but the scsi device /dev/sda4 was suddenly non-existent. the machine ran for a few weeks before encountering the problem though. but once the device disappeared, rebooting was my only recourse. and actually, most of my information for the primary machine (for this bug) comes from the log files created during startup. i'm attaching the dmesg log file. are there others that would be useful? do you actually want me to open up the case and read off numbers from the chips and such?
Oh, so there is a partial freeze first of all? That would have been handy to know! The oops message in the /var/log/messages file is useful. This seems to be caused by either the Red Hat-specific patch kernel-2.4.0-sard.patch, or by a bug present in both ppa.c and imm.c (now that I have seen this oops I see another bug report just like this).
I had the same problem mounting an external parallel port zip 100 (the old one). As Fred has said, the problem occurs when the machine has been idle for a while. As root I couldn't kill the mount process, and the only solution was a reboot. So, I moved the parallel zip to another machine. And voila! It hasn't hung since. It is a little cumbersome to put the disk in another machine and then go back to yours, but mounting remotely did the trick. Of course this hasn't resolved the problem, which I believe is a kernel problem, judging by the log sent to Tim.
actually the /var/log/messages was posted back in july. that went along with the bug report just after it (also july), where i documented the partial freeze. the more recent log (dmesg) was posted to provide the processor info requested. is more info needed about the machine itself, any config files or any other log files?
Could you try this kernel and see if it still exhibits the problem? <ftp://people.redhat.com/twaugh/tmp/43846/kernel-2.4.3-12tmw.i386.rpm> It is a 2.4.3-12 kernel, with the spec file and patch from that directory, built with --target=i386-redhat-linux.
Created attachment 34281 [details] Here is the patch it uses.
I've a suspicion that major_gendisk never actually gets initialised until after we have grokked the partitions, and the oops trace shows we are inside grok_partitions at the point we crash. Stephen: what do you think? Is my patch right?
i think there's a bit of confusion about one thing... the partial freeze from july happened like maybe once. but the much more "normal" behavior is that a total freeze of the machine occurs. that's what i've experienced pretty much every day before and after that partial freeze in july; every time i don't reboot the box before mounting the zip disk, it croaks right away with no perceivable activity from then on. also, if the provided patch attempts to fix an issue that occurs during startup of the kernel, i think that might be going after the wrong area. the freeze-up only occurs when the disk has been mounted on day X and then another mount is attempted on day X+1, just about 24 hours later. the machine is running through that entire time period, up until it freezes up. but i wasn't sure if grok_partitions was something that was done frequently or just during bootup. i have downloaded and installed the patch. i will provide more info tomorrow on what happens with it.
Tim's patch will almost certainly fix the problem in this specific case, but in principle it's not quite the right fix. Clearing the major_gendisk during grok_partitions() is wrong, because it is possible to have several different disks sharing the same major number, and repartitioning one of those (e.g. rescanning a removable scsi device) should not clear the gendisk index for the other disks, even temporarily. There's also the problem of certain drivers which perform IO even before we call grok_partitions: for those drivers, Tim's fix won't help. I've got a patch now which should properly set and clear the major_gendisk[] entries as gendisks are created and destroyed. It compiles, and I'll follow up once I've tested it a bit.
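To illustrate the principle only (this is a toy userspace sketch, not the actual patch, and every toy_* name is hypothetical rather than a real kernel symbol): each disk is linked into a per-major table when it is created and unlinked when it is destroyed, while a partition rescan never touches the table. The major number 8 and the 16 minors per disk used in the example below are just illustrative values.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_MAJOR 256

/* Hypothetical, simplified stand-in for a gendisk: one entry per disk. */
struct toy_disk {
    int major;              /* major number, possibly shared with other disks */
    int first_minor;        /* first minor number owned by this disk */
    int nr_minors;          /* how many minors this disk owns */
    struct toy_disk *next;  /* next disk registered under the same major */
};

/* One list head per major number; several disks may hang off one slot. */
static struct toy_disk *toy_major_table[TOY_MAX_MAJOR];

/* Called when a disk is created: link it into its major's list. */
void toy_add_disk(struct toy_disk *d)
{
    d->next = toy_major_table[d->major];
    toy_major_table[d->major] = d;
}

/* Called when a disk is destroyed: unlink only this disk, keep siblings. */
void toy_del_disk(struct toy_disk *d)
{
    struct toy_disk **p = &toy_major_table[d->major];

    while (*p != NULL) {
        if (*p == d) {
            *p = d->next;
            d->next = NULL;
            return;
        }
        p = &(*p)->next;
    }
}

/* Find the disk that owns (major, minor), or NULL if none is registered. */
struct toy_disk *toy_lookup(int major, int minor)
{
    struct toy_disk *d;

    if (major < 0 || major >= TOY_MAX_MAJOR)
        return NULL;
    for (d = toy_major_table[major]; d != NULL; d = d->next)
        if (minor >= d->first_minor && minor < d->first_minor + d->nr_minors)
            return d;
    return NULL;
}

/* Rescanning one disk's partitions deliberately leaves the table alone. */
void toy_rescan_partitions(struct toy_disk *d)
{
    (void)d; /* re-reading the partition table would go here */
}
```

With two disks registered under the same major (say minors 0-15 and 16-31), calling toy_rescan_partitions() on the first leaves toy_lookup() for the second unaffected, and toy_del_disk() on the first removes only that one entry.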
fred: An oops can have bad effects later on, because it means that the internal state of the kernel is messed up somehow. It will be interesting to know if fixing the oops fixes the freeze behaviour you see as a side effect.
day 1: still holding. i was able to mount the zip disk today without the usual freeze. i'm also willing to try out other kernels to test the newer versions...
day 2: still holding 10/19/2001 friday, mounted fine. will check after the weekend.
days 5 & 6: still holding. so, i will just report if it fails now. is there another, more correct, version of the kernel to try out? the fix looks extremely relevant to the crashes i was having, but i'd like to avoid any of the problems with the patch mentioned above by SCT.
Created attachment 34882 [details] other zip drive machine's bad behavior
the new attachment is a kernel log file from my other machine that has a zip drive. i saw the first total freeze of that machine a day or so ago, without the newer kernel. this machine is a 133 mhz pentium with an adaptec scsi adaptor. the zip100 drive is on the scsi adaptor. would the newer kernel (2.4.3-12tmw) be applicable to this crash as well?
Yes, that looks like the same failure mode.
Should still be fixed, but reopen if it's not --- current kernels do this whole thing in a somewhat different way which should be safer.