43846 – system freezes after using imm driver with zip250 drive

Summary: system freezes after using imm driver with zip250 drive

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.1
Hardware:	i586
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-06-07 17:23 UTC by Fred T. Hamster
Modified:	2005-10-31 22:00 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2001-10-24 22:21:18 UTC
Embargoed:

Attachments
/var/log/messages from just after a partial freeze (4.35 KB, text/plain) 2001-07-03 19:10 UTC, Fred T. Hamster
micron 75 mhz info (4.10 KB, text/plain) 2001-10-11 14:51 UTC, Fred T. Hamster
Here is the patch it uses. (311 bytes, patch) 2001-10-17 12:35 UTC, Tim Waugh
other zip drive machine's bad behavior (57.73 KB, text/plain) 2001-10-24 18:37 UTC, Fred T. Hamster

Description Fred T. Hamster 2001-06-07 17:23:32 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9)
Gecko/20010505

Description of problem:
first effect noticed--after linux has been up a day or so, i come in the
next day and do "mount /mnt/zip" to mount a zip drive.  at that point, the
system would freeze.  the only option is to do a hard reboot, since nothing
is functioning any more (cannot ping the network card, etc).
current symptom--i mount the zip drive and a few minutes later (within like
10 or 20), the system freezes.
the new symptoms have started happening since i have been trying different
protocols on the parallel port.  original effect happened with ps/2 type
parallel, have seen worse behavior with ECP and EPP types.
this is all using a parallel port 250 meg zip drive with the IMM kernel module.

How reproducible:
Always

Steps to Reproduce:
1. follow zip how to document for zip250 drive.
2. mount a zip disk in the drive.
3. either problem will happen shortly where system becomes totally
non-functional, or leave it running (after unmounting the disk) for a day
or so, then try mounting again.
4. total system lock up.
	

Actual Results:  linux machine was totally unresponsive.  no log entries
were made regarding this problem.  it's like the power had been turned off,
but without flipping the switch.

Expected Results:  the zip drive should be able to mount and unmount any
number of times without a freeze.  redhat 6.2 had no problems with this at
all.  redhat 7.1 can't do it to save my life.

Additional info:

Comment 1 Arjan van de Ven 2001-06-07 17:30:31 UTC

Tim: Sounds familiar?

Comment 2 Tim Waugh 2001-06-07 17:35:52 UTC

Is this an SMP machine?

Try taking out the line 'use_new_eh_code:                1,' from 
drivers/scsi/imm.h.  I have had one report about lock-ups from ppa that went 
away when the old error handling code was used instead of the new.  Perhaps 
this is another instance of that?
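
For readers unfamiliar with the syntax: the quoted line looks like an old GNU-style labeled initializer (`field: value`) inside the driver's host template, so deleting it would leave that flag at zero, which by the description above should select the old error handling. A minimal sketch of that C behaviour, using a hypothetical cut-down struct (`toy_host_template`, `toy_imm_new_eh`, `toy_imm_old_eh` and `toy_eh_path` are illustrative names assumed for this example, not the kernel's definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for a SCSI host template.
 * Only the one flag discussed in this comment is modelled. */
struct toy_host_template {
    const char *proc_name;
    unsigned use_new_eh_code : 1;   /* 1 = new error handling, 0 = old */
};

/* Template as shipped: the flag is set explicitly.
 * (The kernel header spells this "use_new_eh_code: 1,"; the C99
 * ".field = value" form below means the same thing.) */
static const struct toy_host_template toy_imm_new_eh = {
    .proc_name = "imm",
    .use_new_eh_code = 1,
};

/* Template with the line taken out: C zero-initialises every
 * member that the initializer does not mention. */
static const struct toy_host_template toy_imm_old_eh = {
    .proc_name = "imm",
};

/* Report which error-handling path a template would select. */
static const char *toy_eh_path(const struct toy_host_template *t)
{
    assert(t != NULL);
    return t->use_new_eh_code ? "new" : "old";
}
```

The point is only that no replacement value is needed: removing the line is enough to clear the flag, but the driver still has to be rebuilt for the change to take effect.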

Comment 3 Fred T. Hamster 2001-06-08 19:07:41 UTC

  this is not an SMP machine.  it's a 75 mhz pentium from micron.
  i don't really have the resources on that machine to do any recompilation.
  i have found that by setting the parallel port mode to "AT", the drive is
slightly more stable.  today, instead of freezing up when i tried to mount a zip
disk, it merely said "/dev/sda4 is not a valid device".  so there's still some
brain damage, but at least there's no freeze.
  however, this is not really a solution, since i still need to reboot the
machine daily to be able to use the zip drive.
  is it possible y'all could send me a recompiled imm.o to try out?

Comment 4 Tim Waugh 2001-06-12 17:40:24 UTC

ftp://people.redhat.com/twaugh/tmp/kernel-2.4.2-2tmw1.{i386,src}.rpm

For tonight only; I am out of space on that machine.

Comment 5 Fred T. Hamster 2001-06-12 20:54:17 UTC

got the files.  will install and watch them starting tomorrow.
thanks...
fred.

Comment 6 Fred T. Hamster 2001-06-14 20:46:30 UTC

  the patched kernel did not make any difference.
  the machine has been left with its parallel port type set to "AT".  this
seemed at first to have better results with the old kernel, but that wasn't borne
out.  i can still see it freeze up with that parallel port type.
  and the machine will in fact still freeze for both the new and the old 2.4.2
kernels.  the exception handling change that went into the newer kernel RPM does
not seem to address the issue at all.

Comment 7 Tim Waugh 2001-06-26 16:10:15 UTC

Could you try enabling magic-sysrq (edit /etc/sysctl.conf, change kernel.sysrq 
to 1), and see if Alt+SysRq+P does anything at all (make sure X isn't running, 
or you won't see any messages on the text console).

Have you tried the errata kernel?

Comment 8 Fred T. Hamster 2001-07-03 19:10:27 UTC

Created attachment 22609
/var/log/messages from just after a partial freeze

Comment 9 Fred T. Hamster 2001-07-03 19:18:40 UTC

  i tried the magic-sysrq business, but realized something this time.  the
console _is_ still alive, at least sometimes.
  mainly my ssh session was frozen up.  however, when i logged in as root at the
console and entered "eject /mnt/zip", that session also froze up.
  previously, i could not ping or ssh to the machine, nor could i get any
response from it at the console.  i think maybe the testing version of the
kernel (see above) may have allowed this new behavior.  or the effects of the
problem might vary, and i could have gotten lucky this time.
  in any case, this time there were log entries in /var/log/messages that seem
related to the problem, including a stack dump.  i hope that helps.  i've
attached them to the bug report.
thanks,
fred.
ps: the disk in the zip 250 is actually a zip 100.  i don't know if that's
relevant or not.

Comment 10 Fred T. Hamster 2001-07-03 19:27:00 UTC

  oops, sorry to omit this: i haven't tried the errata kernel yet.  i will try
that out, assuming it's included on the redhat discs or is available via gnorpm.
 or is that the new version of the current kernel being referred to?  if i can
get it via the update program, i will do that instead...  otherwise i might need
a pointer.
  and i was so excited upon seeing the console functioning that i forgot to try
hitting ctrl-alt-sysrq.  it is enabled though, and i will try it out the next
time i see the freeze upon mounting.
thanks again...
-fred

Comment 11 Fred T. Hamster 2001-07-06 18:48:07 UTC

  got the errata kernel (2.4.3-12) installed.  will watch and see if the same
freeze-up happens.
  with the testing kernel i was using (2.4.2-2tmw1), the message of "/dev/sda4
is not a valid device" is happening about 50% of the time now.  a real freeze of
the session trying the mount still happens for the other half of attempts.

Comment 12 Fred T. Hamster 2001-07-09 16:56:12 UTC

  have now tried the errata kernel (2.4.3-12)...  with the newer kernel, the
problem occurred again today when i tried to mount.
  the system had been idle over the weekend, but froze as soon as i tried to
mount the zip disk.  this was the full-blown freeze too; the machine was no
longer pingable after the freeze.
  also, there was not a single mention in the log of any problem with the drive.
  there were only the traditional couple of messages about the zip drive after i
logged in via ssh and tried to mount:
Jul  9 09:23:06 zeno kernel: Attached scsi removable disk sda at scsi0, channel 0, id 6, lun 0
Jul  9 09:23:06 zeno kernel: SCSI device sda: 196608 512-byte hdwr sectors (101 MB)
followed by the boot messages from the next reboot.  nothing in between
indicated a stack dump or crash.
  so, it seems to me like the newer 2.4.2.tmw kernel did have some slight
advantages over the original 2.4.2 and the errata kernels.  i still need a few
more data points though...
  (and i forgot to try ctrl-alt-sysrq again, blast it.  will try tomorrow or
next time i see the freeze.)

Comment 13 Fred T. Hamster 2001-07-11 13:43:11 UTC

  i see i have been misquoting the sysrq key sequence in earlier comments... 
argh.  but i did try the recommended sequence of alt-sysrq-P this morning after
the freeze.  it did absolutely nothing.  this is still with the errata kernel.
  the freeze on mount appears to turn the system completely inactive, even with
the magic sysrq thing enabled.
  as far as i know, i have exhausted all suggestions again.  anything else i
should try to help debug this?  thanks...

Comment 14 Tim Waugh 2001-07-30 11:29:58 UTC

Does it seem to be linked with a particular ZIP disk?

Are you in X when this happens?  It would be very useful for you to try to get 
the freeze to occur when you are _not_ in X, as then the kernel messages will 
go to the console (also, do 'dmesg -n 8' first). (If you need to use X on that 
machine, set up a serial or parallel console instead.)

Comment 15 Fred T. Hamster 2001-07-30 15:20:28 UTC

  the problem doesn't seem to be linked to a particular zip disk; today i was
able to cause the freeze by doing "mount /mnt/zip" without a disk actually in
the zip drive.
  also, the freeze is happening when the machine is sitting mostly idle without
x windows running.  this machine is mainly a server for my personal files and
the zip drive, so it almost never has X running.  i am pretty sure all of the
freezes have happened without X.  the machine is still going completely brick
dead when this problem happens (and i've still got the errata kernel installed).
 it doesn't seem to be a load issue at all, and it's definitely not that X is
hiding the console or something; this is happening from a machine doing
basically nothing at the time besides mounting the zip disk.  the only recourse
still appears to be doing a hard reset of the box.
  this problem is still a continual pain for me; the same machine running redhat
6.2 never needed a reboot (except for power failures), but now needs to be
rebooted every work day.  it's quite a hassle and the hard disk errors are
probably starting to pile up from all of the hard resets.  is there anything
that i can do to make the crash a little less destructive (besides rebooting it
every day before trying to do anything)?  i don't want to bitch and bitch, but
this particular problem has the net effect of my being unable to say that redhat
linux is more stable than windows right now.  in terms of my end user
experience, rh7.1 is currently less stable.  that grates on me a lot worse than
the problem itself.

Comment 16 Fred T. Hamster 2001-08-02 20:50:05 UTC

ps:
  mount / umount / eject all work great after a reboot, including multiple
mounts, filling up zip disks, ejecting the disk, mounting another, etc.  it is
only after the machine has been up for a longer time span (like overnight) that this
freeze-up problem arises.
  isn't there some procedure that i haven't tried yet?  i'm open to suggestion
and can provide any config files needed for debugging.  even "upgrading" the
machine to 7.1 as released is fine if you think that will help track the problem
down.
  for projects like cygwin, it is clear to me now that any fixes for bugs are
going to be my own responsibility.  however, i have purchased redhat linux
(several releases, including 7.1).  for a product i have paid money for, i
expect a certain level of support.  but it seems that for this bug, i have not
gotten any responses to my last few posts and questions.  please help me to help
you get this problem fixed because it really does detract from my experience
with redhat linux.
thanks, fred.

Comment 17 Arjan van de Ven 2001-08-02 21:05:28 UTC

this is not a support forum but a bugreport forum.
If there is anything we developers could do, we've done so already.

Comment 18 Tim Waugh 2001-08-02 21:38:30 UTC

Have you set up a serial console yet?

Comment 19 Fred T. Hamster 2001-08-03 18:04:03 UTC

have not set up a serial console yet.  i'm going on vacation for all of next
week (8/6-8/10) and so can't try anything else until after then.  does it seem
likely that the serial console will work when the main display console is frozen
(and the machine is also non-pingable)?  i'm doubting it currently but will try
the experiment the week after next.
thanks, fred.
ps: one comment mentioned that the programmers have done everything they can as
far as suggestions; can i make the suggestion that a simple system be set up at
redhat using the imm driver with a (parallel) zip250 drive?  hopefully there's
at least one of these drives floating around redhat, and that might be a lot
more expedient for seeing if the problem is general or just on my older pentium.

Comment 20 Tim Waugh 2001-08-03 18:11:53 UTC

I am using one of these drives every single day, which is why your report has me
stumped.

Comment 21 Fred T. Hamster 2001-08-04 00:19:18 UTC

wow, i'm glad to hear that this problem is fairly isolated.  perhaps i will run
through the 7.1 upgrade again first before i do anything else, just ensuring
that this is not all caused by a speck on a cd or something similarly random. 
unless that seems dangerous or futile...

Comment 22 Fred T. Hamster 2001-10-11 13:59:02 UTC

  okay, things have finally cleared to the point where i had some time to set up
a serial console.  the redhat linux 7.1 box still needs to be rebooted every
day; otherwise i still get the crash on mounting the zip disk every time.
  mounts after rebooting just fine, works all day, i eject the disk at 5pm, come
in the next day, try to mount, and FREEZE-OLA.
  and guess what?  the serial console also sees a completely dead state, just
like the main console and the network connections.  after this freeze-up,
nothing at all works, as i've mentioned in previous entries.
  the last thing the machine spits out occurs while it's still in the process of
mounting the zip disk:
     Attached scsi removable disk sda at scsi0, channel 0, id 6, lun 0
     SCSI device sda: 196608 512-byte hdwr sectors (101 MB)
     sda: Write Protect is off
that is also the last text that the main console shows.
  isn't there some process by which y'all can attempt to debug these situations
more directly?  when i'm trying to find a bug at my job, i ask the customer
service people to: (1) gather log files, (2) gather configuration info, (3)
perform test actions on the machine and gather results.  so far i've only seen
option (3) being brought into play.
  setting up the serial console was, as i feared, a total boondoggle and has
only delayed the fixing of the actual bug.  don't y'all want to kill this bug
off before it makes it into redhat 7.2 also?  if it's a kernel bug, it seems
even more important to isolate it or at least report it.
  i really want to help to kill this bug.  what more can i do to help find it?

Comment 23 Tim Waugh 2001-10-11 14:14:19 UTC

The reason for not asking for log files is simple: as far as the kernel is 
concerned, the console _is_ the log file.  That's why I asked for serial 
console output first.

The reason for not asking for config info is that I already have it: the 
kernel configuration is done at compile time (.config).

As far as reporting the bug goes: I am the maintainer of that code, so you 
already did that.

All I can suggest that you do is gather as much information about the system 
as you can and provide it.  BIOS version, CPU stepping, chipset, etc.  As much 
as you can find out.

Basically, yours is the only machine I have heard this happening on (it 
certainly works fine on all the machines I have), so it's something special 
there.

Of course we want to get this bug fixed, but we need to know how to fix it 
first. :-)

Comment 24 Fred T. Hamster 2001-10-11 14:51:24 UTC

Created attachment 33881
micron 75 mhz info

Comment 25 Fred T. Hamster 2001-10-11 14:53:23 UTC

  okay, i will try to provide this information.  the config files i was
referring to were the /etc files that dictate the configuration of the zip drive
and such.  let me know if those are needed.
  and i have now seen two occurrences of what look like this very same problem
on a different machine.  this other machine has a scsi card though (not parallel
port zip) and has a 100 meg zip drive instead of 250 meg.  should i start a
separate incident report for it?  the effects were different on that machine;
there was no freeze, but the scsi device /dev/sda4 was suddenly non-existent. 
the machine ran for a few weeks before encountering the problem though.  but
once the device disappeared, rebooting was my only recourse.
  and actually, most of my information for the primary machine (for this bug)
comes from the log files created during startup.  i'm attaching the dmesg log
file.  are there others that would be useful?  do you actually want me to open
up the case and read off numbers from the chips and such?

Comment 26 Tim Waugh 2001-10-16 10:53:36 UTC

Oh, so there is a partial freeze first of all?  That would have been handy to know!

The oops message in the /var/log/messages file is useful.  This seems to be caused 
by either the Red Hat-specific patch kernel-2.4.0-sard.patch, or by a bug present in 
both ppa.c and imm.c (now that I have seen this oops I see another bug report just 
like this).

Comment 27 Eduardo Asada 2001-10-16 11:45:49 UTC

I had the same problem mounting an external parallel port zip 100 (the old one).
As Fred has said, the problem occurs when the machine has been idle for a while. As
root I couldn't kill the mount process, and the only solution was a reboot.
So, I moved the parallel zip to another machine. And voila! It hasn't hung
since. It is a little cumbersome to put the disk in another machine and then
go back to yours, but mounting remotely did the trick.
Of course this hasn't resolved the problem, which I believe is a kernel
problem according to the log sent to Tim.

Comment 28 Fred T. Hamster 2001-10-16 15:32:09 UTC

  actually the /var/log/messages was posted back in july.  that went along with
the bug report just after it (also july), where i documented the partial freeze.
  the more recent log (dmesg) was posted to provide the processor info
requested.  is more info needed about the machine itself, any config files or
any other log files?

Comment 29 Tim Waugh 2001-10-17 12:34:14 UTC

Could you try this kernel and see if it still exhibits the problem?

<ftp://people.redhat.com/twaugh/tmp/43846/kernel-2.4.3-12tmw.i386.rpm>

It is a 2.4.3-12 kernel, with the spec file and patch from that directory, 
built with --target=i386-redhat-linux.

Comment 30 Tim Waugh 2001-10-17 12:35:10 UTC

Created attachment 34281
Here is the patch it uses.

Comment 31 Tim Waugh 2001-10-17 12:37:37 UTC

I've a suspicion that major_gendisk never actually gets initialised until after we have 
grokked the partitions, and the oops trace shows we are inside grok_partitions at the 
point we crash.

Stephen: what do you think?  Is my patch right?

Comment 32 Fred T. Hamster 2001-10-17 18:36:55 UTC

  i think there's a bit of confusion about one thing...  the partial freeze from
july happened like maybe once.  but the much more "normal" behavior is that a
total freeze of the machine occurs.  that's what i've experienced pretty much
every day before and after that partial freeze in july; every time i don't
reboot the box before mounting the zip disk, it croaks right away with no
perceivable activity from then on.
  also, if the provided patch attempts to fix an issue that occurs during
startup of the kernel, i think that might be going after the wrong area.  the
freeze-up only occurs when the disk has been mounted on day X and then another
mount is attempted on day X+1, just about 24 hours later.  the machine is
running through that entire time period, up until it freezes up.  but i wasn't
sure if grok_partitions was something that was done frequently or just during
bootup.
  i have downloaded and installed the patch.  i will provide more info tomorrow
on what happens with it.

Comment 33 Stephen Tweedie 2001-10-18 17:42:27 UTC

Tim's patch will almost certainly fix the problem in this specific case, but in
principle it's not quite the right fix.

Clearing the major_gendisk during grok_partitions() is wrong, because it is
possible to have several different disks sharing the same major number, and
repartitioning one of those (eg. rescanning a removable scsi device) should not
clear the gendisk index for the other disks, even temporarily.

There's also the problem of certain drivers which perform IO even before we call
grok_partitions: for those drivers, Tim's fix won't help.

I've got a patch now which should properly set and clear the major_gendisk[]
entries as gendisks are created and destroyed.  It compiles, and I'll follow up
once I've tested it a bit.
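
The patch itself is not shown in this thread. As a rough illustration of the idea described here (publish the per-major entry when a disk is created, clear it only when the last disk on that major is destroyed, and leave it untouched during a partition rescan), a toy user-space model might look like the following. Every name in it (`toy_gendisk`, `major_table`, `toy_add_disk`, `toy_del_disk`, `toy_lookup`, `toy_rescan_partitions`), the table size, and the use of a simple counter are assumptions made for the sketch, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_MAJOR 256   /* assumed table size for this toy model */

/* Toy stand-in for a per-major disk record; not the kernel's struct. */
struct toy_gendisk {
    int major;
    int users;              /* disks currently sharing this major */
};

static struct toy_gendisk toy_storage[TOY_MAX_MAJOR];
static struct toy_gendisk *major_table[TOY_MAX_MAJOR];

/* Disk created: publish the entry before any I/O or partition scan. */
static struct toy_gendisk *toy_add_disk(int major)
{
    assert(major >= 0 && major < TOY_MAX_MAJOR);
    struct toy_gendisk *gd = &toy_storage[major];
    if (gd->users++ == 0) {
        gd->major = major;
        major_table[major] = gd;
    }
    return gd;
}

/* Disk destroyed: clear the entry only when the last user goes away. */
static void toy_del_disk(int major)
{
    assert(major >= 0 && major < TOY_MAX_MAJOR);
    assert(toy_storage[major].users > 0);
    if (--toy_storage[major].users == 0)
        major_table[major] = NULL;
}

/* What an I/O accounting path would do: look the entry up by major.
 * Callers must cope with NULL for a major that has no disks. */
static struct toy_gendisk *toy_lookup(int major)
{
    assert(major >= 0 && major < TOY_MAX_MAJOR);
    return major_table[major];
}

/* Rescanning one disk's partitions never touches major_table, so other
 * disks on the same major keep a valid entry throughout.
 * Returns 1 if the entry was visible during the rescan, 0 otherwise. */
static int toy_rescan_partitions(int major)
{
    return toy_lookup(major) != NULL;
}
```

With two disks registered on major 8, removing one leaves `toy_lookup(8)` non-NULL; only removing the second clears it. That is the property the objection above asks for: rescanning or removing one removable device should not make the entry vanish for its neighbours, even temporarily.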

Comment 34 Tim Waugh 2001-10-18 19:36:18 UTC

fred: An oops can have bad effects later on, because it means 
that the internal state of the kernel is messed up somehow.  It will be 
interesting to know if fixing the oops fixes the freeze behaviour you see as a 
side effect.

Comment 35 Fred T. Hamster 2001-10-18 20:06:19 UTC

day 1: still holding.
  i was able to mount the zip disk today without the usual freeze.  i'm also
willing to try out other kernels to test the newer versions...

Comment 36 Fred T. Hamster 2001-10-21 19:22:14 UTC

day 2: still holding
10/19/2001 friday, mounted fine.  will check after the weekend.

Comment 37 Fred T. Hamster 2001-10-23 18:08:09 UTC

days 5 & 6: still holding.
so, i will just report if it fails now.
is there another, more correct, version of the kernel to try out?  the fix looks
extremely relevant to the crashes i was having, but i'd like to avoid any of the
problems with the patch mentioned above by SCT.

Comment 38 Fred T. Hamster 2001-10-24 18:37:36 UTC

Created attachment 34882
other zip drive machine's bad behavior

Comment 39 Fred T. Hamster 2001-10-24 18:40:17 UTC

the new attachment is a kernel log file from my other machine that has a zip
drive.  i saw the first total freeze of that machine a day or so ago, without
the newer kernel.  this machine is a 133 mhz pentium with an adaptec scsi
adaptor.  the zip100 drive is on the scsi adaptor.  would the newer kernel
(2.4.3-12tmw) be applicable to this crash as well?

Comment 40 Tim Waugh 2001-10-24 22:21:12 UTC

Yes, that looks like the same failure mode.

Comment 41 Stephen Tweedie 2002-11-11 22:35:46 UTC

Should still be fixed, but reopen if it's not --- current kernels do this whole
thing in a somewhat different way which should be safer.
