120341 – Runaway processes with USB console on Blade Center

Bug 120341 - Runaway processes with USB console on Blade Center

Summary: Runaway processes with USB console on Blade Center

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Pete Zaitcev
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2004-04-07 21:01 UTC by Dana Canfield
Modified:	2007-11-30 22:07 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-12-22 17:13:42 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Output of ps axf (12.26 KB, text/plain) 2004-04-07 21:27 UTC, Dana Canfield	no flags	Details
dmesg output (15.11 KB, text/plain) 2004-04-08 00:26 UTC, Dana Canfield	no flags	Details
/var/log/messages (228.02 KB, text/plain) 2004-04-08 17:49 UTC, Dana Canfield	no flags	Details
Output of various proc files (3.80 KB, text/plain) 2004-04-08 23:53 UTC, Dana Canfield	no flags	Details
var/log/messages output after reproducing the problem (163.49 KB, text/plain) 2004-04-12 14:49 UTC, Dana Canfield	no flags	Details
proc/bus/usb/devices (3.04 KB, text/plain) 2004-04-12 14:50 UTC, Dana Canfield	no flags	Details
usb-storage-0/2 (246 bytes, text/plain) 2004-04-12 14:52 UTC, Dana Canfield	no flags	Details
proc/bus/usb-storage-1/3 (226 bytes, text/plain) 2004-04-12 14:52 UTC, Dana Canfield	no flags	Details
Sysreq trace of processes. (102.44 KB, text/plain) 2004-06-30 03:40 UTC, Bojan Zdrnja	no flags	Details
Dmesg, ps -axf, lsusb and syslog from hanged machine. (9.38 KB, text/plain) 2004-07-01 00:46 UTC, Bojan Zdrnja	no flags	Details
Conserv. Cand. #2 (1.74 KB, patch) 2004-07-04 01:10 UTC, Pete Zaitcev	no flags	Details \| Diff
Sysreq log after applying the patch. (13.92 KB, text/plain) 2004-07-05 02:57 UTC, Bojan Zdrnja	no flags	Details
New dmesg log (6.33 KB, text/plain) 2004-07-07 23:15 UTC, Bojan Zdrnja	no flags	Details
Dmesg log, 2004-07-13 (4.59 KB, text/plain) 2004-07-13 01:58 UTC, Bojan Zdrnja	no flags	Details
The latest dmesg (13.34 KB, text/plain) 2004-07-15 01:23 UTC, Bojan Zdrnja	no flags	Details
Log with -17 and candidate patch (5.56 KB, text/plain) 2004-07-15 23:47 UTC, Bojan Zdrnja	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2004:188	0	normal	SHIPPED_LIVE	Important: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 2	2004-05-11 04:00:00 UTC

Description Dana Canfield 2004-04-07 21:01:02 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6)
Gecko/20040211 Firefox/0.8

Description of problem:
In no consistently repeatable fashion, blades will develop a problem
where ps shows something along these lines:


10583 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10584 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10589 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10590 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10597 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10621 ?        D      0:00 /usr/sbin/updfstab
10631 ?        D      0:00 /usr/sbin/updfstab
10639 ?        D      0:00 /usr/sbin/updfstab
10640 ?        D      0:00 /usr/sbin/updfstab
10642 ?        D      0:00 /usr/sbin/updfstab
10643 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10644 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10645 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
10672 ?        D      0:00 /usr/sbin/updfstab
10676 ?        D      0:00 /usr/sbin/updfstab
10678 ?        D      0:00 /usr/sbin/updfstab
24613 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
24624 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
24625 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
24646 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
24647 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
24690 ?        S      0:00 /bin/bash /sbin/devlabel restart
24694 ?        S      0:00 /bin/bash /sbin/devlabel status
24742 ?        S      0:00 /bin/bash /sbin/devlabel status
24743 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
24788 ?        S      0:00 /bin/bash /sbin/devlabel restart
24789 ?        S      0:00 /bin/bash /sbin/devlabel restart
24793 ?        S      0:00 /bin/bash /sbin/devlabel status
24797 ?        S      0:00 /bin/bash /sbin/devlabel status
24886 ?        S      0:00 /bin/bash /sbin/devlabel status
24887 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
24894 ?        S      0:00 /bin/bash /sbin/devlabel status
24895 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
24935 ?        S      0:00 /bin/bash /sbin/devlabel restart
24948 ?        S      0:00 /bin/bash /sbin/devlabel status
24953 ?        S      0:00 /bin/bash /sbin/devlabel restart
24970 ?        S      0:00 /bin/bash /sbin/devlabel status
25033 ?        S      0:00 /bin/bash /sbin/devlabel status
25036 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25054 ?        S      0:00 /bin/bash /sbin/devlabel status
25055 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25154 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25155 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25160 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25161 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25165 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25192 ?        D      0:00 /usr/sbin/updfstab
25204 ?        D      0:00 /usr/sbin/updfstab
25209 ?        D      0:00 /usr/sbin/updfstab
25212 ?        D      0:00 /usr/sbin/updfstab
25213 ?        D      0:00 /usr/sbin/updfstab
25229 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25240 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25241 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25262 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25263 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25306 ?        S      0:00 /bin/bash /sbin/devlabel restart
25310 ?        S      0:00 /bin/bash /sbin/devlabel status
25358 ?        S      0:00 /bin/bash /sbin/devlabel status
25359 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25404 ?        S      0:00 /bin/bash /sbin/devlabel restart
25405 ?        S      0:00 /bin/bash /sbin/devlabel restart
25412 ?        S      0:00 /bin/bash /sbin/devlabel status
25413 ?        S      0:00 /bin/bash /sbin/devlabel status
25508 ?        S      0:00 /bin/bash /sbin/devlabel status
25509 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25510 ?        S      0:00 /bin/bash /sbin/devlabel status
25511 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25564 ?        S      0:00 /bin/bash /sbin/devlabel restart
25565 ?        S      0:00 /bin/bash /sbin/devlabel restart
25572 ?        S      0:00 /bin/bash /sbin/devlabel status
25573 ?        S      0:00 /bin/bash /sbin/devlabel status
25668 ?        S      0:00 /bin/bash /sbin/devlabel status
25669 ?        S      0:00 /bin/bash /sbin/devlabel status
25670 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25671 ?        D      0:00 /usr/bin/scsi_unique_id /dev/sdb
25674 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25675 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25678 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25680 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25687 ?        S      0:00 /bin/bash /etc/hotplug/usb.agent
25709 ?        D      0:00 /usr/sbin/updfstab
25719 ?        D      0:00 /usr/sbin/updfstab
25723 ?        D      0:00 /usr/sbin/updfstab
25728 ?        D      0:00 /usr/sbin/updfstab
25733 ?        D      0:00 /usr/sbin/updfstab

The only way to fix this completely seems to be a reboot.


Version-Release number of selected component (if applicable):
hotplug-2002_04_01-20.2

How reproducible:
Sometimes

Steps to Reproduce:
1. Switch console on blade center anywhere from 1 to 100 times.
2. Check ps ax.

Additional info:

The only hint from IBM amounts to "don't switch between consoles
quickly."  I'm hoping Red Hat might have a more useful solution.  I
have heard from at least four other IBM/Red Hat customers who are
having the same issue, and none of us have found a solution.

This eventually starts consuming significant numbers of processes and
increases the load number (the box illustrated above is currently at
44), though a significant performance decrease usually is not seen.

Comment 1 Bill Nottingham 2004-04-07 21:21:17 UTC

What's the process tree look like tree-wise?
(ps axf)

Comment 2 Dana Canfield 2004-04-07 21:27:14 UTC

Created attachment 99209 [details]
Output of ps axf

Comment 3 Bill Nottingham 2004-04-07 21:29:59 UTC

Any kernel messages?

My general assumption is that it's trying to access devices that went
away.

Comment 4 Dana Canfield 2004-04-07 21:45:32 UTC

>My general assumption is that it's trying to access devices that went
>away.

This is quite probable.  The IBM BladeCenter has one KB/Mouse and one
CD/Floppy that are internally switched via USB among the 14 blades in
the system.  

For all Linux knows, every time the console switches between the
blades, someone is coming along and unplugging the USB KB/Mouse and
plugging it into another system (same thing with the CD/Floppy unit).

As the BladeCenter is certified for RHEL, I guess I just need to know
if there is a "proper" way to use the console that won't cause this
issue.  FWIW, I don't believe it is necessary for a floppy/CD to be
mounted when the drive tray is switched for this behavior to occur,
though.  I would expect something like this to happen in that case
(much as I'd like to wish the Linux USB system would be better than that).

Comment 5 Dana Canfield 2004-04-07 21:54:12 UTC

BTW, not sure if it's relevant, but when these systems boot, they used
to complain about something related to the floppy on the console to
the point that it was difficult to use the console... I can't remember
the exact error any more, but based on advice from you (Bill) on the
taroon list, all of my systems have 

alias block-major-2 off

in /etc/modules.conf.  Not sure if that helps at all, but I'm also
fairly certain this issue occurred before getting that tip, though.

Comment 6 Bill Nottingham 2004-04-07 21:58:18 UTC

Are there any kernel messages related to USB?

I'm assuming stracing the updfstab or scsi_unique_id processes doesn't
do anything.

Comment 7 Dana Canfield 2004-04-07 22:12:50 UTC

I've not seen any particularly unusual message related to USB, however
the particular machine that I'm seeing the problem on at the moment
has been experiencing NFS connection problems, so any log messages
related to USB have scrolled away.

Also, I've not used strace extensively... if you could give me an
example of how to use it that would be useful to you, I'd be happy to
send the output.

Comment 8 Bill Nottingham 2004-04-07 23:58:37 UTC

strace -p <whatever the PID of updfstab or scsi_unique_id is>

If it prints no output, that means it's stuck in the kernel somewhere.

One other thing that would be useful when it's in this state is to
enable sysrq, and post the output of sysrq-t

Comment 9 Dana Canfield 2004-04-08 00:14:28 UTC

Strace gives nothing on any of the processes.

Sysrq-t says only "Show State" and nothing more.

Tom Coughlan lead me to believe in another bug that Red Hat has a
BladeCenter and at least a PPC blade, if not an i386 blade.  Are these
issues not being seen on that equipment?

Comment 10 Dana Canfield 2004-04-08 00:25:50 UTC

OK, I'm dumb.  Attached is the output of dmesg that I assume was
triggered from sysrq-t.

FWIW, I re-read an e-mail from someone else using the bladecenter, and
he said that this could happen without the KVM ever having been
switched to that blade: 

"I am having the exact same problem.  I am running rhel-3-es on IBM
blades and the usb.agent is occasionally dragging some of the blades
to a halt, but not others.  In fact, I have one that is doing nothing
but sitting there - no services running per se, never switching the
keyboard/mouse nor the cd/floppy over to it, and all of the sudden it
will take on multiple copies of these usb related processes."

Also, if it matters, with the drives and kvm attached, lsusb gives:
Bus 001 Device 001: ID 0000:0000
Bus 001 Device 027: ID 04b4:5203 Cypress Semiconductor Corp.
Bus 001 Device 028: ID 0644:0000 TEAC Corp. Floppy
Bus 001 Device 029: ID 05ab:0030 In-System Design
Bus 001 Device 024: ID 04b4:5204 Cypress Semiconductor Corp.
Bus 001 Device 025: ID 04b3:4002 IBM Corp.
Bus 001 Device 026: ID 04b3:4004 IBM Corp.

Without, it gives:
Bus 001 Device 001: ID 0000:0000

Comment 11 Dana Canfield 2004-04-08 00:26:49 UTC

Created attachment 99219 [details]
dmesg output

Comment 12 Bill Nottingham 2004-04-08 01:14:11 UTC

Is that the full log (just noticing it doesn't have updfstab).

Assigning to the kernel for now... look's like it's stuck in locks in
the kernel.

Comment 13 Dana Canfield 2004-04-08 16:26:46 UTC

That was the result of "dmesg > dmesg.txt" after issuing the sysrq-t.
 If there is a better way to get the output that you need, please let
me know how to do it.  Unfortunately, we're treading a little deeper
into the system than I've yet had a need to explore (which, as an
aside, is actually a compliment to Red Hat as we've been running RHL
on *every* server except the Finance/Student Record since 1998... and
that student/finance system is being coverted to RHEL right now...).

Comment 14 Bill Nottingham 2004-04-08 16:40:08 UTC

It should all be in /var/log/messages, if you just want to attach that.

Comment 15 Dana Canfield 2004-04-08 17:49:19 UTC

Created attachment 99243 [details]
/var/log/messages

Comment 16 Dana Canfield 2004-04-08 17:51:19 UTC

Sorry to have been brain-dead about that.  I was thinking dmesg had
more output than /var/log/messages sometimes. 

And just FYI, this apparently also happens with FC1 on the blades
we're using as terminal servers.

Comment 17 Pete Zaitcev 2004-04-08 20:19:35 UTC

No, it's all right, dmesg is preferred. The syslogd cannot keep
up sometimes and drops records. Ususally, I kill off unrelated
processes before taking a SysRq-T, which may not be applicable
in this case though. I'll go through this and see if I can figure
it out.

Comment 18 Pete Zaitcev 2004-04-08 22:35:12 UTC

I looked at dmesg and messages, so here's the bad news:
/var/log/messages is incomplete. There's no usb-storage
thread for one, but it's plainly visible in attached ps
output, so it didn't die.

Dana, please try to catch this situation as early as possible,
preferably with just one or two scsi_unique_id stuck.
It would be a great help.

Comment 19 Pete Zaitcev 2004-04-08 23:08:48 UTC

OK, this is getting ugly, I recognize the symptoms:

scsi_unique_i D 00000000     0  4626   5057                   (NOTLB)
Call Trace:   [<c01232f4>] schedule [kernel] 0x2f4 (0xd7023b58)
[<c0133de5>] schedule_timeout [kernel] 0x65 (0xd7023b9c)
[<c0133d70>] process_timeout [kernel] 0x0 (0xd7023bbc)
[<c012466d>] io_schedule_timeout [kernel] 0x2d (0xd7023bd4)
[<c01cc147>] __get_request_wait [kernel] 0x147 (0xd7023be0)
[<c0247a60>] dst_output [kernel] 0x0 (0xd7023c08)
[<c025fb8f>] tcp_v4_do_rcv [kernel] 0x13f (0xd7023c30)
[<c01cc7c0>] __make_request [kernel] 0x140 (0xd7023c44)
[<c01cceaa>] generic_make_request [kernel] 0xea (0xd7023c98)
[<c01ccf4c>] submit_bh_rsector [kernel] 0x4c (0xd7023cc0)
[<c0164016>] block_read_full_page [kernel] 0x266 (0xd7023cdc)
[<c014572a>] add_to_page_cache_unique [kernel] 0x5a (0xd7023d2c)
[<c014af3d>] __read_cache_page [kernel] 0x8d (0xd7023d40)
[<c0168990>] blkdev_get_block [kernel] 0x0 (0xd7023d48)
[<c0148a88>] read_cache_page [kernel] 0x38 (0xd7023d64)
[<c0168a80>] blkdev_readpage [kernel] 0x0 (0xd7023d70)
[<c01951cd>] read_dev_sector [kernel] 0x4d (0xd7023d88)
[<c0168a80>] blkdev_readpage [kernel] 0x0 (0xd7023d94)
[<c0195c5d>] handle_ide_mess [kernel] 0x2d (0xd7023da8)
[<c0128413>] call_console_drivers [kernel] 0x63 (0xd7023dc0)
[<c0195e7f>] msdos_partition [kernel] 0x6f (0xd7023dd4)
[<c0195e10>] msdos_partition [kernel] 0x0 (0xd7023e0c)
[<c0194f52>] check_partition [kernel] 0x102 (0xd7023e18)
[<f89af3bb>] sd_init_onedisk [sd_mod] 0x38b (0xd7023e38)
[<f89b099b>] .rodata.str1.1 [sd_mod] 0xa7 (0xd7023e44)
[<c0195117>] grok_partitions [kernel] 0xc7 (0xd7023e80)
[<f89b0409>] revalidate_scsidisk [sd_mod] 0x1b9 (0xd7023ea4)
[<f89b1548>] sd_fops [sd_mod] 0x0 (0xd7023ecc)

So, the I/O was lost, now the question is who lost it and why.

Dana, in addition to trying to capture the khubd, usb-storage-0/1
and SCSI EH threads, please give me several more things:

/proc/bus/usb/devices (with the storage attached - to find the
   protocol and transport)
/proc/scsi/scsi
/proc/version (You know I'm going to bump you to 2.4.21-12.EL :-)
/proc/cmdline
Complete /var/log/messages (to see what happened
before the lockup, maybe I can see a SCSI timeout)

Thanks

Comment 20 Dana Canfield 2004-04-08 23:53:55 UTC

Created attachment 99256 [details]
Output of various proc files

Here's the output of the various /proc files.  I'll have to go back and see
what I can do abou the full /var/log/messages.	I don't know where/when exactly
this was triggered, and it's not very easy to trigger it at will.

Comment 21 Pete Zaitcev 2004-04-10 00:11:17 UTC

So, the priority is to a) catch the problem early, b) get a sysrq
dump which has usb-storage, khubd, scsi_eh threads in it.

If the I/O was lost because, eh, someone just lost it, then
we are dead in the water until we know how to reproduce it.
However, there's a significant chance that usb-storage got
stuck, and since it's not multi-threaded, a stack dump will
show where it's stuck, then we're on track.

When it happens, ftp the /var/log/messages out of the system,
too, please.

One little detail, I forgot to ask about /proc/scsi/usb-storage-*/*.
I thought I'd calculate protocol and transport from
/proc/bus/usb/devices, but let's remove any uncertainty.

Comment 22 Dana Canfield 2004-04-12 14:49:30 UTC

Created attachment 99321 [details]
var/log/messages output after reproducing the problem

Comment 23 Dana Canfield 2004-04-12 14:50:04 UTC

Created attachment 99322 [details]
proc/bus/usb/devices

Comment 24 Dana Canfield 2004-04-12 14:52:05 UTC

Created attachment 99323 [details]
usb-storage-0/2

Comment 25 Dana Canfield 2004-04-12 14:52:30 UTC

Created attachment 99324 [details]
proc/bus/usb-storage-1/3

Comment 26 Dana Canfield 2004-04-12 14:57:30 UTC

These attachments are after a reboot.  I swapped the cd/floppy tray
back and forth much more aggressively than I would in normal use in
order to trigger the problem, so I hope they're still representative
of the actual problem.  Let me know if I missed anything.

Comment 27 Pete Zaitcev 2004-04-12 19:21:30 UTC

Perfect!!

usb-storage-0 S 00000001  3832 17601      1         17602  1611 (L-TLB)
Call Trace:   [<c01232f4>] schedule [kernel] 0x2f4 (0xe54c1ef4)
[<c010ad6a>] __down_interruptible [kernel] 0x8a (0xe54c1f38)
[<c01220c6>] wake_up_process [kernel] 0x26 (0xe54c1f5c)
[<c010ae37>] __down_failed_interruptible [kernel] 0x7 (0xe54c1f70)
[<f8c2a50f>] .text.lock.usb [usb-storage] 0x25 (0xe54c1f7c)
[<f8c3716c>] usb_stor_sense_notready [usb-storage] 0x0 (0xe54c1f84)
[<f8c28eb0>] usb_stor_control_thread [usb-storage] 0x0 (0xe54c1fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xe54c1ff0)

scsi_eh_2     S 00000001  3868 17602      1         17603 17601 (L-TLB)
Call Trace:   [<c01232f4>] schedule [kernel] 0x2f4 (0xe54ddf2c)
[<c010ad6a>] __down_interruptible [kernel] 0x8a (0xe54ddf70)
[<c010ae37>] __down_failed_interruptible [kernel] 0x7 (0xe54ddfa8)
[<f8995b7f>] .text.lock.scsi_error [scsi_mod] 0xc5 (0xe54ddfb4)
[<f899db4d>] .rodata.str1.1 [scsi_mod] 0x2089 (0xe54ddfbc)
[<f8995690>] scsi_error_handler [scsi_mod] 0x0 (0xe54ddfe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xe54ddff0)

However, one "small" difficulty remains: I cannot see how
that could happen. This needs some thinking.

Comment 28 Pete Zaitcev 2004-04-12 22:48:45 UTC

I found that I did some changes after -9.EL in this area.
They are coming out really soon, so I would like not to backtrack.
Please try out a kernel from the following location:
 ftp://people.redhat.com/zaitcev/120341

Comment 29 Dana Canfield 2004-04-13 03:54:51 UTC

I'll see if it works during my maintenance window Wednesday night. 
Unfortunately, we're also fighting problems with the qlogic driver in
-11 & -12 (being worked on by Tom Coughlan and is also in bugzilla)
which are required for these boxes to boot from the SAN they're on, so
I may be stuck until that issue is sorted out.

Comment 30 Pete Zaitcev 2004-04-13 20:53:32 UTC

Dana, I uploaded an interim release -9.4.EL to the same place,
which does have my patch applied, but does not seem to have qlogic
changes (at least according to the changelog).

[zaitcev@devserv i686]$ rpm -q --changelog -p
kernel-smp-2.4.21-9.4.EL.i686.rpm | more
* Wed Feb 04 2004 Ernie Petrides <petrides>
kernel-2.4.21-9.4.EL
 
- update nfsd mode-with-acls fix for upstream conformance (Steve Dickson)
- rework usb-storage locking to avoid hangs on disconnect (Pete Zaitcev)
- fix mmio alignment error in olympic token ring driver (Julie DeWandel)
- prevent tux logger-thread shutdown problems on ppc64 (Ingo Molnar)
- add ppc64 AltiVec chip support for Power Blades (Julie DeWandel)
- eliminate multi-threaded coredump hang potential (Ingo Molnar)
- fix ppc64 segfaults with tasks > 2GB in size (Julie DeWandel)
- fix usb-storage memory-allocation/iowait deadlock (Alan Cox)
 
* Mon Feb 02 2004 Ernie Petrides <petrides>
kernel-2.4.21-9.3.EL
 
- add amd64 support for PT_GNU_STACK with exec shield (Jim Paradis)
- fix s390 raid1 sync problem w/writes at end of dev (Pete Zaitcev)
- update s390 zcrypt driver to version 1.1.4 (Pete Zaitcev)
- fix output of large /proc/pid/statm values (Mark DeWandel)
- fix output of large /proc/partitions values (Mark DeWandel)
- enable ppc64 support for JS20 Power Blades (Julie DeWandel)
- fix kernel stack memory leak on fork failures (Dave Anderson)
- fix ipv6 packet quoting problem in icmpv6_send() (Dave Miller)
 
* Thu Jan 29 2004 Ernie Petrides <petrides>
kernel-2.4.21-9.2.EL
 
- apply Intel's x86 microcode loader update (Jim Paradis)
- enhance PCI multifunction support for ppc64 (Julie DeWandel)
- fix nfsd returning bad mode bits when acls set (Steve Dickson)
- prevent unkillable zombies from detached NPTL threads (Ingo Molnar)
- avoid missing scsi disks in /proc/partitions on detach (Jason Baron)
- correct pte calculation inside zeromap_pte_range() loop (Ben Woodard)
- add inline function PDE() to help driver backporting (Matthew Galgoci)
                                                                     
          
* Wed Jan 21 2004 Ernie Petrides <petrides>
kernel-2.4.21-9.1.EL
                                                                     
          
- (this is the first build for RHEL 3 Update 2)
- relocate various pieces of the "undo" patch (Patch Review Committee)
- rename s390 zcrypt and zfcp driver patches (Pete Zaitcev)
- add LKML discussion to COPYING.modules (Arjan van de Ven)

.................

I think it should do the trick. If it does, mark the bug and
work it out with Tom.

Comment 31 Dana Canfield 2004-04-28 19:55:18 UTC

Sorry for the delay... my test box was unavailable for a while.  This
version of the kernel does appear to be more stable.  Switching the
console before all the individual processes completed still caused
some extra devlabel processes, but then I was able to kill those
processes off, and then things seemed to fix themselves.  I'll watch
this a while longer, but it seems to be a very big improvement.

Comment 34 Dana Canfield 2004-05-07 21:23:29 UTC

FWIW, I'll be putting the -14 kernel into production on a number of
boxes this week in order to solve both this problem and a qlogic
driver issue.  I've not noticed the problem recurring in any of the
test blades I've used running -11 through -14, so as far as I'm
concerned this ticket can be marked resolved for now.

Comment 35 Pete Zaitcev 2004-05-07 21:36:11 UTC

Kevin, get them to run -14.EL and let me know how it goes, please.

If Dana closes but your case persists I'm not letting you
to hang on this one (e.g. bug morphing), you'll need to file
a new one. At that point it would be likely a different root cause.
But we'll see.

Comment 36 Ernie Petrides 2004-05-07 22:43:17 UTC

Dana/Kevin, we are intending to officially release RHEL3 U2 on
Monday, May 10th.  The U2 kernel is 2.4.21-15.EL, which contains
an important security fix that is not in the latest beta channel
kernel -14.EL.  Please wait until Monday night or Tuesday in order
to get the official U2 (-15.EL) kernel.

I'm changing the state of this bugzilla report to "modified" based
on Dana's comment #34, and I will add this bug id to the errata
associated with the U2 kernel (RHSA-2004:188), since tonight is
my last chance to update it before it goes public.

If the originally reported problem persists after the U2 upgrade,
please reopen the bug.

Thanks.  -ernie

Comment 37 John Flanagan 2004-05-12 01:08:49 UTC

An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-188.html

Comment 38 Jeff O'Neill 2004-06-24 15:07:12 UTC

All, I'm not sure if you all are still experiencing this problem 
after the latest kernel release 2.4.21-15 but at the current time I 
still experience the same problem as before.  Seems to occur upon 
scanning the floppy drive via the USB bus.  I ran an strace against a 
command that generally works well on the other 5 blades in my HS20 
which is the EMC binary inq.  

My process table currently looks like the following:

[root@thdl005 root]# ps -ef|grep -E "inq|bash"|grep -v grep
root     17995     6  0 May25 ?        
00:00:00 /bin/bash /etc/hotplug/usb.agent
root     18102 17995  0 May25 ?        
00:00:00 /bin/bash /sbin/devlabel restart
root     18106 18102  0 May25 ?        
00:00:00 /bin/bash /sbin/devlabel status
root     18222 18106  0 May25 ?        
00:00:00 /bin/bash /sbin/devlabel status
root      6258     1  0 Jun05 ?        00:00:00 /usr/local/bin/inq
root     23091     1  0 Jun12 ?        00:00:00 /usr/local/bin/inq
root     28012     1  0 Jun19 ?        00:00:00 /usr/local/bin/inq
root     14701     1  0 Jun23 ?        00:00:00 inq
root     29073     1  0 Jun23 ?        00:00:00 inq
root     29232     1  0 Jun23 ?        00:00:00 inq
root     26836 26832  0 09:25 pts/1    00:00:00 -bash
root      5269  5264  0 10:38 pts/5    00:00:00 -bash

With 11 jobs in the blocked on I/O:

vmstat 
It responded with the following output:
[root@thdl005 root]# vmstat 2
procs                      memory      swap          io     
system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us 
sy id wa
 0 11 193708  27876 300184 2139968    0    0     1     1    0     2  
1  1  2  2
 0 11 193708  27872 300184 2139968    0    0     0   212  359   634  
0  0 89 11

open("/dev/sgc", O_RDWR|O_NONBLOCK|O_SYNC) = 3
getpid()                                = 29232
fcntl64(3, F_GETFL)                     = 0x1802 (flags 
O_RDWR|O_NONBLOCK|O_SYNC)
fcntl64(3, F_SETFL, O_RDWR|O_SYNC)      = 0
write(3, "S\0\0\0\375\377\377\377\6\377\0\0\220\0\0\0\270\374
(\10"..., 64

At which point there is no activity and the process for all intents 
and purposes is hung.  

Now /dev/sgc appears as the following in /dev:

[root@thdl005 root]# ls -la /dev/sgc
lrwxrwxrwx    1 root     root            3 May 17 13:18 /dev/sgc -> 
sg2
[root@thdl005 root]# ls -la /dev/sg2
crw-rw----    1 root     disk      21,   2 Feb 18 10:26 /dev/sg2

and in /etc/sysconfig/hwconfig I have the following entry 
for /dev/sg2:

-
class: FLOPPY
bus: USB
detached: 0
driver: usb-storage
desc: "TEAC TEAC FD-05PUB"
usbclass: 8
usbsubclass: 4
usbprotocol: 0
usbbus: 1
usblevel: 2
usbport: 0
usbdev: 6
vendorId: 0644
deviceId: 0000
usbmfr: TEAC
usbprod: TEAC FD-05PUB
-
class: FLOPPY
bus: SCSI
detached: 0
device: sdb
driver: ignore
desc: "Teac FD-05PUB"
host: 1
id: 0
channel: 0
lun: 0
generic: sg2
-

Comment 39 Jeff O'Neill 2004-06-24 15:09:23 UTC

Please reopen this bug as I do not believe it is resolved.  If you 
feel otherwise please indicate why.

Thanks.
Jeff

Comment 40 Ernie Petrides 2004-06-24 16:07:56 UTC

Reopening as requested in last comment.  -ernie

Comment 41 Dana Canfield 2004-06-24 16:58:35 UTC

I can confirm that the problem has started happening to us again.  It
doesn't seem to be quite as frequent as before, and I still can't
reliably reproduce it on demand, but eventually it kicks in and bumps
the load numbers way up.

Comment 42 Pete Zaitcev 2004-06-24 17:54:49 UTC

Dana, please re-do the drill like the last time: catch the situation
early, kill off all extra processes to preserve dmesg from overflowing,
trip sysrq-t with "echo t > /proc/sysrq-trigger" or from VT console,
capture dmesg and attach as attachement.

Comment 43 Jeff O'Neill 2004-06-24 18:11:06 UTC

Please let me know if there is anything I can do to assist or 
information I can provide.

Comment 44 Pete Zaitcev 2004-06-29 22:57:42 UTC

Jeff, please look at comment #21 and before. Also, comment #42 has
all necessary instructions, only they are ... condensed. Let me explain.

The objective is to recover the kernel stack traces of all hung
processes. It is straightforward with "echo t > /proc/sysrq-trigger".
However, since dmesg overflows, you better catch it early
(while few processes hang), and kill off irrelevant processes
such as system daemons, before tripping the dump.

Do not drop dmesg result into the bug comment box.
Attach it to the bug instead.

Comment 45 Bojan Zdrnja 2004-06-30 02:55:26 UTC

I can confirm that this happens.
Among other servers there are 2 blades running with the lates RHEL 
kernel (2.4.21-15.0.2.ELsmp).

After changing console and CDROM around a bit (between two of them), 
both of them remain with hanged processes and console keyboard is not 
working anymore (keyboard is USB as well).

Hanged processes are:

root        83  0.0  0.0     0    0 ?        DW   13:59   0:00 [khubd]
root      2615  0.0  0.0     0    0 ?        DW   13:59   0:00 [usb-
storage-0]
root      2617  0.0  0.0     0    0 ?        DW   13:59   0:00 
[scsi_eh_1]
root     14105  0.0  0.0  1380  236 ?        D    14:01   
0:00 /usr/bin/scsi_unique_id /dev/sdb


Other than that machine seems to be working ok.

Is there anything special I should do about that sysrq trigger? After 
I echo t to it, nothing happens. If there is something else I can do 
to help let me know.

Comment 46 Bojan Zdrnja 2004-06-30 03:40:05 UTC

Created attachment 101527 [details]
Sysreq trace of processes.

Comment 47 Pete Zaitcev 2004-06-30 03:56:14 UTC

Bojan, your trace is obviously corrupted. I asked to KILL OFF EXTRA
PROCESSES before taking the dmesg (which is preferable to
/var/log/messages), several times (see comments #17, 42, 44).
Please, take it again. I cannot do anything until I get
that little piece of info which your syslogd ate because you fed
too many stacks to it.

Comment 48 Bojan Zdrnja 2004-06-30 06:42:07 UTC

Ok Pete, will do tomorrow because I can't touch those servers anymore 
today :/

Btw, they seem to work ok - they are fast and everything seems good 
except that I can't access them through the console and load average 
is always 4+.

Comment 49 Bojan Zdrnja 2004-07-01 00:46:36 UTC

Created attachment 101549 [details]
Dmesg, ps -axf, lsusb and syslog from hanged machine.

I hope this will help.

I found it easy to reproduce the bug. If I change the console and CDROM between
2 blades in intervals of ~10 seconds, soon they hang.

I'm not sure if waiting will help (like waiting 30+ seconds between console
switch) - will test that and let know.

This attachment has the following files:

1) dmesg_output.txt - sysrq of dmesg
2) syslog.txt - syslog log of sysrq, seems like it caught everything because I
killed almost all other processes. I also left couple of lines at the top which
show what happens when you change cdrom to the Blade.
3) ps_axf.txt - ps -axf when it hangs
4) lsusb.txt - lsusb - doesn't have anything after it hangs.

I hope this will help, if there's something else let me know.

Comment 50 Pete Zaitcev 2004-07-01 12:44:36 UTC

Thanks, Bojan. These appear the key parts:

khubd         D 00000002  3584    81      1           205    26 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xf678de60)
[<c010ac83>] __down [kernel] 0x73 (0xf678dea4)
[<c010ae2c>] __down_failed [kernel] 0x8 (0xf678ded8)
[<f8edd51f>] .text.lock.usb [usb-storage] 0x85 (0xf678dee8)
[<f8eeb0a0>] usb_storage_driver [usb-storage] 0x20 (0xf678def0)
[<f8eeb080>] usb_storage_driver [usb-storage] 0x0 (0xf678def4)
[<f8bb2380>] usb_disconnect_Rsmp_8c20b3cd [usbcore] 0xa0 (0xf678def8)
[<f8bc3720>] hub_driver [usbcore] 0x0 (0xf678df18)
[<f8bb2428>] usb_disconnect_Rsmp_8c20b3cd [usbcore] 0x148 (0xf678df24)
[<f8bb538f>] usb_hub_port_connect_change [usbcore] 0x27f (0xf678df50)
[<f8bb4dac>] usb_hub_port_status [usbcore] 0x6c (0xf678df64)
[<f8bb56a2>] usb_hub_events [usbcore] 0x302 (0xf678df84)
[<f8bb5755>] usb_hub_thread [usbcore] 0x55 (0xf678dfb0)
[<f8bb5700>] usb_hub_thread [usbcore] 0x0 (0xf678dfe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf678dff0)

usb-storage-0 D 00000002  3584  2692      1          2693   211 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc4f5fe2c)
[<c01238e1>] wait_for_completion [kernel] 0x71 (0xc4f5fe70)
[<f8eda8a5>] usb_stor_control_msg [usb-storage] 0x175 (0xc4f5fec4)
[<f8edb0f5>] usb_stor_CBI_transport [usb-storage] 0xa5 (0xc4f5fef0)
[<f8edae8c>] usb_stor_invoke_transport [usb-storage] 0x18c (0xc4f5ff20)
[<c0170000>] open_namei [kernel] 0x330 (0xc4f5ff48)
[<f8eda341>] usb_stor_ufi_command [usb-storage] 0x61 (0xc4f5ff60)
[<f8edc245>] usb_stor_control_thread [usb-storage] 0x355 (0xc4f5ff78)
[<f8edbef0>] usb_stor_control_thread [usb-storage] 0x0 (0xc4f5ffe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc4f5fff0)

scsi_eh_2     D 00000001  4100  2693      1          3230  2692 (L-TLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xc4f37ec8)
[<c01238e1>] wait_for_completion [kernel] 0x71 (0xc4f37f0c)
[<c0123274>] schedule [kernel] 0x2f4 (0xc4f37f2c)
[<f8ed9279>] command_abort [usb-storage] 0x79 (0xc4f37f60)
[<f8b14621>] scsi_try_to_abort_command [scsi_mod] 0x61 (0xc4f37f6c)
[<f8b15421>] scsi_unjam_host [scsi_mod] 0x8a1 (0xc4f37f7c)
[<f8b157e8>] scsi_error_handler [scsi_mod] 0x148 (0xc4f37fb4)
[<f8b1db75>] .rodata.str1.1 [scsi_mod] 0x2071 (0xc4f37fbc)
[<f8b156a0>] scsi_error_handler [scsi_mod] 0x0 (0xc4f37fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc4f37ff0)

Now I am going to deduce how this could have happened.
If someone else wishes to try their hand at this, they
are very welcome, too.

Comment 51 Bojan Zdrnja 2004-07-02 00:33:35 UTC

I'm far away from kernel expert but some wild guesses ...

When I switch from one blade to the other, it obviously tries to 
initialize the CDROM and the floppy.

So, maybe this is the culprit, from scsi_eh_2:

[<f8b14621>] scsi_try_to_abort_command [scsi_mod] 0x61 (0xc4f37f6c)
[<f8b15421>] scsi_unjam_host [scsi_mod] 0x8a1 (0xc4f37f7c)
[<f8b157e8>] scsi_error_handler [scsi_mod] 0x148 (0xc4f37fb4)

I can see that it does something to the CDROM because LED goes on.


Also, I did more tests today. It doesn't matter what's the interval 
between switching consoles, eventually it will hang.

I also tested it on 2 blades with Win2K3 - it worked ok after a 
really nasty test (20 times switching with couple of seconds between 
each time only). Win2K3 (at least console) freezes for couple of 
seconds but it continues to work ok.

Comment 52 Pete Zaitcev 2004-07-04 01:10:42 UTC

Created attachment 101618 [details]
Conserv. Cand. #2

This is a follow-up candidate after the bug reopenining.
It does NOT obolete the original fix.
We shoud've asked Dana to refile this...

"Conservative" means that timeouts aren't added. I just fixed
the most idiotic races around the current URB. We'll see if
this if this is sufficient.

Comment 53 Pete Zaitcev 2004-07-04 01:21:08 UTC

I'm a little away in hotel in Disneyland right now, so a binary
test kernel might be delayed until Tuesday 7/6 at least. If someone
gives a try to the patch #2 by building from the source, I'll be
grateful.

Dana and Bojan have their boxes in production, so perhaps Jeff can?

Comment 54 Bojan Zdrnja 2004-07-05 02:57:44 UTC

Created attachment 101627 [details]
Sysreq log after applying the patch.

Hi Pete,

I managed to get one test Blade and I installed new kernel with your patch. I
applied it to 2.4.21-15.0.2.EL kernel.

Things still lock up, but differently. First time it was only khubd and the
other time there were lots of processes, like Dana reported at the beginning.

But, the difference second time was that the console remained alive! I was able
to work on it. Those other processes were probably stuck on the CD-ROM.

New log file I attached has both dmesg's and syslog output for second test.

Comment 57 Pete Zaitcev 2004-07-06 03:44:21 UTC

I am disappointed, but we are on the right track. Thank you, Bojan.

Comment 58 Pete Zaitcev 2004-07-07 03:00:34 UTC

Bojan, I have a special request, seeing how you were good to me
with all the info gathering. In the "newlog" batch, the messages.txt
appears to be missing something. Firstly, scsi_eh processes are not
consequently numbered. Also, the ps output has more processes.
And finally, all processes are idle and yet the partition reading
is not succeeding. I'm not saying it's not possible, but highly
unlikely.

So, would you be so kind to do this: change the
constant LOG_BUF_LEN in kernel/printk.c to something like 131072,
rebuild the kernel, catch the problem early, and take the dmesg.
Your "newlog" messages file size is 96KB. 128KB dmesg ought to
be enough to capture everything, including processes lost by syslogd.

If you can capture dmesg two or three times (with two or three
reboots for cleanup), all the better. It looks that the special
routines for isd200 may be implicated AND something generic as
well. The place is crawling with bugs.

Comment 59 Bojan Zdrnja 2004-07-07 23:15:10 UTC

Created attachment 101701 [details]
New dmesg log

Pete, I did as you wrote - increased ring buffer size for dmesg and captured
output after it hanged (I did sysrq trigger twice).

Hope this will help you, one scsi_unique looks strange to me.
If you need more testing let me know, I have one spare (test) blade now.

Comment 62 Pete Zaitcev 2004-07-09 23:18:40 UTC

I think I reproduced the mysterious hang that Bojan found. Symptoms
are: both usb-storage threads and both scsi_eh threads are idle,
but a user task is stuck doing the page read for partition scan:

scsi_unique_i D 00000003   760 10963      1         11111 11349 (NOTLB)
Call Trace:   [<c0123274>] schedule [kernel] 0x2f4 (0xdb923b58)
[<c01340b5>] schedule_timeout [kernel] 0x65 (0xdb923b9c)
[<c0134040>] process_timeout [kernel] 0x0 (0xdb923bbc)
[<c01245ed>] io_schedule_timeout [kernel] 0x2d (0xdb923bd4)
[<c01cc55a>] __get_request_wait [kernel] 0xfa (0xdb923be0)
[<c01ccbac>] __make_request [kernel] 0x15c (0xdb923c3c)
[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xdb923c60)
[<c01cd22a>] generic_make_request [kernel] 0xea (0xdb923c98)
........ lotsa stack

Looks like a lost wakeup way outside of usb-storage to me.

Comment 63 Pete Zaitcev 2004-07-12 19:11:26 UTC

So what do you know, the struct request leak which Bojan captured
did exist (in generic SCSI stack), and Doug Ledford actually fixed
it for U3 Beta. I ran a test kernel on Bladecenter in RDU and
it doesn't die anymore at all.

Bojan, please install and try out the kernel 2.4.21-17.EL from
 ftp://people.redhat.com/zaitcev/120341/

Those kernels do not have the fix to sanitize semaphoring of
the current URB, but it so happens that -17.EL also has exclusive
access semaphore to help broken CD-ROMs, and so it's a moot point
apparently.

Comment 66 Bojan Zdrnja 2004-07-13 01:58:53 UTC

Created attachment 101848 [details]
Dmesg log, 2004-07-13

Well, I managed to crash it even with this latest kernel - switched console
around 10 times between two blades and it hang (on both blades, both with this
kernel).

I'd say it looks a bit better though.
From what I was able to see while it was still working, I think that the
culprit might be (this time) /sbin/usbmodules process - this was the last
process spawned which sits in D so it could prevent other things from
finishing.
That process was present on both blades which hanged.

In the attachment there is dmesg log of this machine - it wasn't complete (had
the default size) but it has usbmodules process in it. Let me know if you need
me to repeat this or if this is enough.

Comment 67 Pete Zaitcev 2004-07-13 04:57:47 UTC

I interpret the latest dmesg (07-13) differently. scsi_unique_id
PID 4032 took the open semaphore and got stuck, but since we
do not see one of scsi_eh and usb-storage threads, there's no
telling why.

I'll pound our Bladecenter and try to reproduce. I only can use
one blade, so I switch between None and Blade #5. Maybe it matters.

Bojan, sorry to ask you again, but can you do the same treatment
with bigger dmesg window to the -17.EL and try again?

Comment 68 Pete Zaitcev 2004-07-13 22:31:02 UTC

No success so far with -16.EL, no matter what I do, it works ok.
Mike Gahagan inserted a CD into drive, it made no difference.
Before, it was relatively easy to make request starvation to
happen through the Web remote control interface.

Debby Townsend sent the following note though (she tested -17.EL):
 The scsi_unique_id hang was not easy to recreate -- it involved
 pressing both the CD and console buttons repeatedly on three blades.
Her dmesg with SysRq was truncated unfortunately.

Comment 69 Bojan Zdrnja 2004-07-15 01:23:15 UTC

Created attachment 101928 [details]
The latest dmesg

Ok, I finally had time to test it a little bit more.

I'd say that it works better than before, but still has problems.
I'm switching the console between two identical blades (same kernel). One of
them hanged, the other didn't! I switched the console between 20-30 times.

Pete, please find attached latest dmesg log - I recompiled kernel so everything
was caught by dmesg (this is whole machine log since it booted). Hope this will
help.

Comment 70 Pete Zaitcev 2004-07-15 20:39:24 UTC

So the candidate #2 is needed after all, and my comment #63 was wrong.
I cannot construct a specific scenario though.

Bojan, please apply that patch again to -17.EL and test for me.

Comment 71 Bojan Zdrnja 2004-07-15 23:47:01 UTC

Created attachment 101951 [details]
Log with -17 and candidate patch

Well, it hanged again for me. :(

Maybe I have a Blade which is an older revision? Wonder if that would make any
difference.

This one was doing ok first round of tests and then it hanged on both blades.
Processes looked about the same, here's dmesg of one of them.

Comment 72 Pete Zaitcev 2004-07-16 06:09:49 UTC

There's no way to blame a blade here. It's a bunch of bugs.

However, the usb-storage code is extremely crufty. It needs a good
overhaul, such as a backport from 2.6. But in fact I have already
started a rewrite for 2.6, it was just so bad. This makes me
to want a backport less. Might as well get the new driver...

So, the bug is pretty much long term. But to get Dana, Bojan,
and Jeff on track quickly, I suggest to comment out devlabel
invocations from /etc/hotplug/usb.agent. It should work around
the problem for now. My reasoning is that a blade
is not a desktop, it should not be a problem to mount manually
with a mount command, right?

Comment 73 Dana Canfield 2004-07-16 14:38:52 UTC

I trust that this workaround would only affect the storage components,
and not the KB/Mouse?  If so, I for one would deem that acceptable for
now.  Especially if the issue stays on the radar for possibly being
resolved in a later QU.  

BTW, my thanks to Bojan for picking up the ball and running on the
testing.  Summers are just too hectic around here, so I haven't had
the time to contribute to testing.

Comment 74 Bojan Zdrnja 2004-07-17 04:19:48 UTC

Pete, so the complete workaround would be:

1) Use -17 kernel
2) Apply candidate #2
3) Comment out devlabel invocations

As long as the keyboard and mouse work this is great (will try on 
monday). Your reasoning is correct - at least here we use them as 
servers only and rarely work on console.

Np for this Dana, I'm glad we're finally getting somewhere, my 
management had bad opinion about Linux on Blades after initial 
problems. Besides, it's winter here so I have to work anyway. :)

Comment 76 Pete Zaitcev 2004-07-19 02:11:27 UTC

The #2 doesn't solve the full problem. The SCSI EH process still
gets stuck. The reason is pretty sick really... In 2.4 the wisdom
was that USB runs no its own timeouts, because the EH watches over
SCSI commands and will abort anything through the SCSI interface
(scsiglue.c). We feared that running two timeouts will race.
It's a bad design to layer timeouts in general. Unfortunately,
scsiglue faces a dilemma if the USB thread is doing something: 
to bust locks and unlink URBs, or to hang on locks. 2.4 does both,
inconsistently. It tries to unlink and races abort notifications
AND it can hang, because it needs notifications, but URBs can hang.
In 2.6 all URBs have timeouts and locking is always observed by
the scsiglue.c. So I think long term we have (I have) to look at
a backport from 2.6.

Bojan, please test (1)+(3) and let me know if it works.
Your own comment #71 demonstrates that patch #2 is pointless.
It helps a little, but not enough.

Comment 80 Bojan Zdrnja 2004-07-20 03:54:07 UTC

I can confirm that the workaround works ok!

After installing kernel -17 and commenting out devlabel invocations 
everything worked ok. I switched console around 40 times and 
everything went fine.

Any ETA when this kernel will be officialy out? I'll apply this on 
other blades we have.

Comment 81 Pete Zaitcev 2004-07-20 05:21:30 UTC

Bojan, I'm not exactly sure if the schedule of RHEL3 U3
is communicated externaly outside of NDAs. You have to ask your
account representative about this sort of things. I only can
promise you that -17.EL represents a trunk (not a branch) and
all changes there will be present in U3.

Comment 88 Pete Zaitcev 2005-05-14 04:04:50 UTC

What is the situation here? Are we still running with devlabel commented
out at customer site?

Comment 89 Dana Canfield 2005-05-16 14:30:07 UTC

I'm running all my boxes with devlabel commented out.  Was there something else
that happened on Red Hat's end that would have improved this situation and that
we should be trying?

Comment 90 Pete Zaitcev 2005-05-16 22:50:21 UTC

I'd like to close this as WONTFIX.

RHEL 4 has a vastly superior mechanism to process hotplug events which
does not require devlabel to be restarted, and also its kernel component
is not as racy. I advocate upgrading, at least long term.

Comment 91 Bojan Zdrnja 2005-05-16 23:10:57 UTC

I will check this with the latest kernel on my test blade.
Lately we have been installing only RHEL AS 2.1 (due to PeopleSoft NOT 
supporting anything else :/).
Interesting thing is that the bug mentioned here doesn't exist at 2.1 either.

Will let you know when I test with the latest kernel and devlabel uncommented.

Comment 92 Dana Canfield 2005-05-16 23:15:53 UTC

I don't disagree with your assessment of this bug, HOWEVER, you do know that
this problem occurs on the IBM BladeCenter, for which the functional qlogic SAN
support was removed in RHEL 4, and for which there won't be a suitable fix until
Update 2 or later?  And when one asks why the SAN support was removed without a
suitable replacement being available, the response is that there should be no
hurry for Red Hat customers to upgrade to RHEL4...  I know, it's a bit of an
off-topic rant, but this bug just helps to highlight my frustration with the
hit-or-miss level of support the BladeCenter has received by Red Hat, given that
it's supposedly been certified by both Red Hat and IBM since AS 2.1.

Comment 93 Pete Zaitcev 2005-05-19 21:58:55 UTC

By the way, while I'm looking to wontfix this bug report (the outright race
and the hang), I work on something which ought to help a lot, hopefuly.
See bug 156831. There, we deal with open succeeding when it should fail.
A side effect of that fix may be hotplug exiting instantaneously
instead of sticking around, thus resolving whole issue. Well, we'll see...

Bojan, please leave the devlabel commented.

Comment 95 Pete Zaitcev 2005-12-22 17:13:42 UTC

Like I said, while officially I do not remove this recommendation yet, I hope
that we may live without it now (with but 156831 fixed). So, if someone could
test U7 Beta and let me know if the same issue is present, it would be great.
This bug was WONTFIX'ed but I keep it in mind and continue to seek improvements.

Actually, I only asked requestors for WONTFIX permission, but did not close...
I'll do it now.

BTW, Dana mentioned that SAN was not supported in comment of 2005-05-16, but
that surely is not the case anymore. Jeff, please steer customers to try
RHEL 4 U2. I personally saw it working on a Bladecenter with HS-20.

Comment 96 Dana Canfield 2005-12-22 17:26:59 UTC

FWIW, Boot-from-SAN support has yet to be implemented for RHEL4, and multipath
support is still apparently undocumented and at your own risk.  Sure would be
nice if the "stable" version of RHEL wasn't having major development like this
being written as we go, given that a stable and tested configuration existed
right from the start.  There should have been no question that
device-mapper-multipath should have waited for RHEL5.

I still don't disagree that it's not worth trying to fix *this* bug at this
point, though.  Just wish some of the kernel developers and managers at Red Hat
understood what a colossal mess RHEL4 is on a SAN due to what appears to be one
man's hate of the existing qlogic drivers.  Running a bladecenter with Red Hat
has been a huge pain all along and there's no reason why it should be.

Note You need to log in before you can comment on or make changes to this bug.