Bug 120341
Description
Dana Canfield
2004-04-07 21:01:02 UTC
What does the process tree look like? (ps axf)

Created attachment 99209 [details]
Output of ps axf

Any kernel messages? My general assumption is that it's trying to access
devices that went away.

> My general assumption is that it's trying to access devices that went
> away.
This is quite probable. The IBM BladeCenter has one KB/Mouse and one
CD/Floppy that are internally switched via USB among the 14 blades in
the system.
For all Linux knows, every time the console switches between the
blades, someone is coming along and unplugging the USB KB/Mouse and
plugging it into another system (same thing with the CD/Floppy unit).
As the BladeCenter is certified for RHEL, I guess I just need to know
if there is a "proper" way to use the console that won't cause this
issue. FWIW, I don't believe it is necessary for a floppy/CD to be
mounted when the drive tray is switched for this behavior to occur,
though. I would expect something like this to happen in that case
(much as I'd wish the Linux USB system were better than that).
BTW, not sure if it's relevant, but when these systems boot, they used to
complain about something related to the floppy on the console to the point
that it was difficult to use the console... I can't remember the exact error
any more, but based on advice from you (Bill) on the taroon list, all of my
systems have "alias block-major-2 off" in /etc/modules.conf. Not sure if that
helps at all, but I'm fairly certain this issue occurred before getting that
tip, though.

Are there any kernel messages related to USB? I'm assuming stracing the
updfstab or scsi_unique_id processes doesn't do anything.

I've not seen any particularly unusual messages related to USB; however, the
particular machine that I'm seeing the problem on at the moment has been
experiencing NFS connection problems, so any log messages related to USB have
scrolled away. Also, I've not used strace extensively... if you could give me
an example of how to use it in a way that would be useful to you, I'd be
happy to send the output.

strace -p <whatever the PID of updfstab or scsi_unique_id is>

If it prints no output, that means it's stuck in the kernel somewhere. One
other thing that would be useful when it's in this state is to enable sysrq
and post the output of sysrq-t.

Strace gives nothing on any of the processes. Sysrq-t says only "Show State"
and nothing more. Tom Coughlan led me to believe in another bug that Red Hat
has a BladeCenter and at least a PPC blade, if not an i386 blade. Are these
issues not being seen on that equipment?

OK, I'm dumb. Attached is the output of dmesg that I assume was triggered
from sysrq-t. FWIW, I re-read an e-mail from someone else using the
BladeCenter, and he said that this could happen without the KVM ever having
been switched to that blade:

"I am having the exact same problem. I am running rhel-3-es on IBM blades and
the usb.agent is occasionally dragging some of the blades to a halt, but not
others. In fact, I have one that is doing nothing but sitting there - no
services running per se, never switching the keyboard/mouse nor the cd/floppy
over to it, and all of a sudden it will take on multiple copies of these usb
related processes."

Also, if it matters, with the drives and KVM attached, lsusb gives:

Bus 001 Device 001: ID 0000:0000
Bus 001 Device 027: ID 04b4:5203 Cypress Semiconductor Corp.
Bus 001 Device 028: ID 0644:0000 TEAC Corp. Floppy
Bus 001 Device 029: ID 05ab:0030 In-System Design
Bus 001 Device 024: ID 04b4:5204 Cypress Semiconductor Corp.
Bus 001 Device 025: ID 04b3:4002 IBM Corp.
Bus 001 Device 026: ID 04b3:4004 IBM Corp.

Without, it gives:

Bus 001 Device 001: ID 0000:0000

Created attachment 99219 [details]
dmesg output
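The strace drill described above (find the stuck updfstab/scsi_unique_id PIDs, then attach strace and see whether anything prints) can be sketched as a small script. This is my own illustration, not an existing tool; the function names are made up, and `timeout` assumes GNU coreutils is available.

```shell
#!/bin/sh
# Sketch of the diagnostic step suggested above: find processes stuck in
# uninterruptible sleep (state "D") and probe them with strace. Function
# names are illustrative, not part of any Red Hat tooling.

# Print "PID COMMAND" for every process whose state field starts with D.
# Expects `ps axo pid,stat,comm` style output on stdin, so it can also be
# exercised against a canned sample.
hung_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Attach strace to one PID for a few seconds; no output at all usually
# means the process is blocked inside the kernel, as noted in the bug.
# (`timeout` is a GNU coreutils assumption.)
probe_pid() {
    timeout 5 strace -p "$1" 2>&1 | head -n 20
}

# Typical use on a live box:
#   ps axo pid,stat,comm | hung_procs
#   probe_pid <pid>
```

In the bug, strace printed nothing at all, which is what pointed at a kernel-side hang rather than a userspace loop.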
Is that the full log? (Just noticing it doesn't have updfstab.) Assigning to
the kernel for now... looks like it's stuck on locks in the kernel.

That was the result of "dmesg > dmesg.txt" after issuing the sysrq-t. If
there is a better way to get the output that you need, please let me know how
to do it. Unfortunately, we're treading a little deeper into the system than
I've yet had a need to explore (which, as an aside, is actually a compliment
to Red Hat, as we've been running RHL on *every* server except the
Finance/Student Record system since 1998... and that student/finance system
is being converted to RHEL right now...).

It should all be in /var/log/messages, if you just want to attach that.

Created attachment 99243 [details]
/var/log/messages
Sorry to have been brain-dead about that. I was thinking dmesg had more
output than /var/log/messages sometimes. And just FYI, this apparently also
happens with FC1 on the blades we're using as terminal servers.

No, it's all right, dmesg is preferred. The syslogd cannot keep up sometimes
and drops records. Usually, I kill off unrelated processes before taking a
SysRq-T, which may not be applicable in this case though. I'll go through
this and see if I can figure it out.

I looked at dmesg and messages, so here's the bad news: /var/log/messages is
incomplete. There's no usb-storage thread for one, but it's plainly visible
in the attached ps output, so it didn't die. Dana, please try to catch this
situation as early as possible, preferably with just one or two
scsi_unique_id stuck. It would be a great help.

OK, this is getting ugly, I recognize the symptoms:

scsi_unique_i D 00000000     0  4626   5057                (NOTLB)
Call Trace:
[<c01232f4>] schedule [kernel] 0x2f4 (0xd7023b58)
[<c0133de5>] schedule_timeout [kernel] 0x65 (0xd7023b9c)
[<c0133d70>] process_timeout [kernel] 0x0 (0xd7023bbc)
[<c012466d>] io_schedule_timeout [kernel] 0x2d (0xd7023bd4)
[<c01cc147>] __get_request_wait [kernel] 0x147 (0xd7023be0)
[<c0247a60>] dst_output [kernel] 0x0 (0xd7023c08)
[<c025fb8f>] tcp_v4_do_rcv [kernel] 0x13f (0xd7023c30)
[<c01cc7c0>] __make_request [kernel] 0x140 (0xd7023c44)
[<c01cceaa>] generic_make_request [kernel] 0xea (0xd7023c98)
[<c01ccf4c>] submit_bh_rsector [kernel] 0x4c (0xd7023cc0)
[<c0164016>] block_read_full_page [kernel] 0x266 (0xd7023cdc)
[<c014572a>] add_to_page_cache_unique [kernel] 0x5a (0xd7023d2c)
[<c014af3d>] __read_cache_page [kernel] 0x8d (0xd7023d40)
[<c0168990>] blkdev_get_block [kernel] 0x0 (0xd7023d48)
[<c0148a88>] read_cache_page [kernel] 0x38 (0xd7023d64)
[<c0168a80>] blkdev_readpage [kernel] 0x0 (0xd7023d70)
[<c01951cd>] read_dev_sector [kernel] 0x4d (0xd7023d88)
[<c0168a80>] blkdev_readpage [kernel] 0x0 (0xd7023d94)
[<c0195c5d>] handle_ide_mess [kernel] 0x2d (0xd7023da8)
[<c0128413>] call_console_drivers [kernel] 0x63 (0xd7023dc0)
[<c0195e7f>] msdos_partition [kernel] 0x6f (0xd7023dd4)
[<c0195e10>] msdos_partition [kernel] 0x0 (0xd7023e0c)
[<c0194f52>] check_partition [kernel] 0x102 (0xd7023e18)
[<f89af3bb>] sd_init_onedisk [sd_mod] 0x38b (0xd7023e38)
[<f89b099b>] .rodata.str1.1 [sd_mod] 0xa7 (0xd7023e44)
[<c0195117>] grok_partitions [kernel] 0xc7 (0xd7023e80)
[<f89b0409>] revalidate_scsidisk [sd_mod] 0x1b9 (0xd7023ea4)
[<f89b1548>] sd_fops [sd_mod] 0x0 (0xd7023ecc)

So, the I/O was lost; now the question is who lost it and why. Dana, in
addition to trying to capture the khubd, usb-storage-0/1 and SCSI EH threads,
please give me several more things:

/proc/bus/usb/devices (with the storage attached - to find the protocol and
transport)
/proc/scsi/scsi
/proc/version (You know I'm going to bump you to 2.4.21-12.EL :-)
/proc/cmdline
Complete /var/log/messages (to see what happened before the lockup, maybe I
can see a SCSI timeout)

Thanks

Created attachment 99256 [details]
Output of various proc files
Here's the output of the various /proc files. I'll have to go back and see
what I can do about the full /var/log/messages. I don't know where/when exactly
this was triggered, and it's not very easy to trigger it at will.
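The /proc collection requested above could be scripted along these lines. The helper name and archive path are my own illustration; the point is just to grab every requested file into one attachable tarball.

```shell
#!/bin/sh
# Sketch of collecting the requested /proc files into one attachment.
# The function name and paths are illustrative.

collect_proc_info() {
    # $1 = output tar file, remaining args = files to grab
    out="$1"; shift
    tmp="$(mktemp -d)"
    for f in "$@"; do
        # Flatten the path, e.g. /proc/scsi/scsi -> proc_scsi_scsi,
        # so all files land in one flat directory inside the tar.
        name="$(printf '%s' "$f" | sed 's|^/||; s|/|_|g')"
        cat "$f" > "$tmp/$name" 2>/dev/null || true
    done
    tar -cf "$out" -C "$tmp" .
    rm -rf "$tmp"
}

# On the affected blade, something like:
#   collect_proc_info /tmp/bug120341-proc.tar \
#       /proc/bus/usb/devices /proc/scsi/scsi /proc/version /proc/cmdline
```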
So, the priority is to a) catch the problem early, and b) get a sysrq dump
which has the usb-storage, khubd, and scsi_eh threads in it. If the I/O was
lost because, eh, someone just lost it, then we are dead in the water until
we know how to reproduce it. However, there's a significant chance that
usb-storage got stuck, and since it's not multi-threaded, a stack dump will
show where it's stuck; then we're on track. When it happens, ftp the
/var/log/messages out of the system, too, please.

One little detail: I forgot to ask about /proc/scsi/usb-storage-*/*. I
thought I'd calculate the protocol and transport from /proc/bus/usb/devices,
but let's remove any uncertainty.

Created attachment 99321 [details]
var/log/messages output after reproducing the problem
Created attachment 99322 [details]
proc/bus/usb/devices
Created attachment 99323 [details]
usb-storage-0/2
Created attachment 99324 [details]
proc/bus/usb-storage-1/3
These attachments are after a reboot. I swapped the cd/floppy tray back and
forth much more aggressively than I would in normal use in order to trigger
the problem, so I hope they're still representative of the actual problem.
Let me know if I missed anything.

Perfect!!

usb-storage-0 S 00000001  3832 17601      1 17602  1611 (L-TLB)
Call Trace:
[<c01232f4>] schedule [kernel] 0x2f4 (0xe54c1ef4)
[<c010ad6a>] __down_interruptible [kernel] 0x8a (0xe54c1f38)
[<c01220c6>] wake_up_process [kernel] 0x26 (0xe54c1f5c)
[<c010ae37>] __down_failed_interruptible [kernel] 0x7 (0xe54c1f70)
[<f8c2a50f>] .text.lock.usb [usb-storage] 0x25 (0xe54c1f7c)
[<f8c3716c>] usb_stor_sense_notready [usb-storage] 0x0 (0xe54c1f84)
[<f8c28eb0>] usb_stor_control_thread [usb-storage] 0x0 (0xe54c1fe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xe54c1ff0)

scsi_eh_2 S 00000001  3868 17602      1 17603 17601 (L-TLB)
Call Trace:
[<c01232f4>] schedule [kernel] 0x2f4 (0xe54ddf2c)
[<c010ad6a>] __down_interruptible [kernel] 0x8a (0xe54ddf70)
[<c010ae37>] __down_failed_interruptible [kernel] 0x7 (0xe54ddfa8)
[<f8995b7f>] .text.lock.scsi_error [scsi_mod] 0xc5 (0xe54ddfb4)
[<f899db4d>] .rodata.str1.1 [scsi_mod] 0x2089 (0xe54ddfbc)
[<f8995690>] scsi_error_handler [scsi_mod] 0x0 (0xe54ddfe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xe54ddff0)

However, one "small" difficulty remains: I cannot see how that could happen.
This needs some thinking.

I found that I did some changes after -9.EL in this area. They are coming out
really soon, so I would like not to backtrack. Please try out a kernel from
the following location: ftp://people.redhat.com/zaitcev/120341

I'll see if it works during my maintenance window Wednesday night.
Unfortunately, we're also fighting problems with the qlogic driver in -11 &
-12 (being worked on by Tom Coughlan; it is also in bugzilla), which are
required for these boxes to boot from the SAN they're on, so I may be stuck
until that issue is sorted out.
Dana, I uploaded an interim release -9.4.EL to the same place, which does
have my patch applied, but does not seem to have the qlogic changes (at least
according to the changelog).

[zaitcev@devserv i686]$ rpm -q --changelog -p kernel-smp-2.4.21-9.4.EL.i686.rpm | more
* Wed Feb 04 2004 Ernie Petrides <petrides> kernel-2.4.21-9.4.EL
- update nfsd mode-with-acls fix for upstream conformance (Steve Dickson)
- rework usb-storage locking to avoid hangs on disconnect (Pete Zaitcev)
- fix mmio alignment error in olympic token ring driver (Julie DeWandel)
- prevent tux logger-thread shutdown problems on ppc64 (Ingo Molnar)
- add ppc64 AltiVec chip support for Power Blades (Julie DeWandel)
- eliminate multi-threaded coredump hang potential (Ingo Molnar)
- fix ppc64 segfaults with tasks > 2GB in size (Julie DeWandel)
- fix usb-storage memory-allocation/iowait deadlock (Alan Cox)
* Mon Feb 02 2004 Ernie Petrides <petrides> kernel-2.4.21-9.3.EL
- add amd64 support for PT_GNU_STACK with exec shield (Jim Paradis)
- fix s390 raid1 sync problem w/writes at end of dev (Pete Zaitcev)
- update s390 zcrypt driver to version 1.1.4 (Pete Zaitcev)
- fix output of large /proc/pid/statm values (Mark DeWandel)
- fix output of large /proc/partitions values (Mark DeWandel)
- enable ppc64 support for JS20 Power Blades (Julie DeWandel)
- fix kernel stack memory leak on fork failures (Dave Anderson)
- fix ipv6 packet quoting problem in icmpv6_send() (Dave Miller)
* Thu Jan 29 2004 Ernie Petrides <petrides> kernel-2.4.21-9.2.EL
- apply Intel's x86 microcode loader update (Jim Paradis)
- enhance PCI multifunction support for ppc64 (Julie DeWandel)
- fix nfsd returning bad mode bits when acls set (Steve Dickson)
- prevent unkillable zombies from detached NPTL threads (Ingo Molnar)
- avoid missing scsi disks in /proc/partitions on detach (Jason Baron)
- correct pte calculation inside zeromap_pte_range() loop (Ben Woodard)
- add inline function PDE() to help driver backporting (Matthew Galgoci)
* Wed Jan 21 2004 Ernie Petrides <petrides> kernel-2.4.21-9.1.EL
- (this is the first build for RHEL 3 Update 2)
- relocate various pieces of the "undo" patch (Patch Review Committee)
- rename s390 zcrypt and zfcp driver patches (Pete Zaitcev)
- add LKML discussion to COPYING.modules (Arjan van de Ven)
.................

I think it should do the trick. If it does, mark the bug and work it out
with Tom.

Sorry for the delay... my test box was unavailable for a while. This version
of the kernel does appear to be more stable. Switching the console before all
the individual processes completed still caused some extra devlabel
processes, but I was able to kill those processes off, and then things seemed
to fix themselves. I'll watch this a while longer, but it seems to be a very
big improvement.

FWIW, I'll be putting the -14 kernel into production on a number of boxes
this week in order to solve both this problem and a qlogic driver issue. I've
not noticed the problem recurring in any of the test blades I've used running
-11 through -14, so as far as I'm concerned this ticket can be marked
resolved for now.

Kevin, get them to run -14.EL and let me know how it goes, please. If Dana
closes this but your case persists, I'm not letting you hang onto this one
(i.e. bug morphing); you'll need to file a new one. At that point it would
likely be a different root cause. But we'll see.

Dana/Kevin, we are intending to officially release RHEL3 U2 on Monday, May
10th. The U2 kernel is 2.4.21-15.EL, which contains an important security fix
that is not in the latest beta channel kernel, -14.EL. Please wait until
Monday night or Tuesday in order to get the official U2 (-15.EL) kernel. I'm
changing the state of this bugzilla report to "modified" based on Dana's
comment #34, and I will add this bug id to the errata associated with the U2
kernel (RHSA-2004:188), since tonight is my last chance to update it before
it goes public.
If the originally reported problem persists after the U2 upgrade, please
reopen the bug. Thanks. -ernie

An errata has been issued which should help the problem described in this bug
report. This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report if the solution
does not work for you.
http://rhn.redhat.com/errata/RHSA-2004-188.html

All, I'm not sure if you are still experiencing this problem after the latest
kernel release 2.4.21-15, but at the current time I still experience the same
problem as before. It seems to occur upon scanning the floppy drive via the
USB bus. I ran an strace against a command that generally works well on the
other 5 blades in my HS20, the EMC binary inq. My process table currently
looks like the following:

[root@thdl005 root]# ps -ef|grep -E "inq|bash"|grep -v grep
root     17995     6  0 May25 ?        00:00:00 /bin/bash /etc/hotplug/usb.agent
root     18102 17995  0 May25 ?        00:00:00 /bin/bash /sbin/devlabel restart
root     18106 18102  0 May25 ?        00:00:00 /bin/bash /sbin/devlabel status
root     18222 18106  0 May25 ?        00:00:00 /bin/bash /sbin/devlabel status
root      6258     1  0 Jun05 ?        00:00:00 /usr/local/bin/inq
root     23091     1  0 Jun12 ?        00:00:00 /usr/local/bin/inq
root     28012     1  0 Jun19 ?        00:00:00 /usr/local/bin/inq
root     14701     1  0 Jun23 ?        00:00:00 inq
root     29073     1  0 Jun23 ?        00:00:00 inq
root     29232     1  0 Jun23 ?        00:00:00 inq
root     26836 26832  0 09:25 pts/1    00:00:00 -bash
root      5269  5264  0 10:38 pts/5    00:00:00 -bash

With 11 jobs blocked on I/O, vmstat responded with the following output:

[root@thdl005 root]# vmstat 2
procs                    memory      swap        io     system        cpu
 r  b  swpd   free   buff  cache   si so   bi   bo   in   cs us sy id wa
 0 11 193708  27876 300184 2139968  0  0    1    1    0    2  1  1  2  2
 0 11 193708  27872 300184 2139968  0  0    0  212  359  634  0  0 89 11

The strace of inq ends with:

open("/dev/sgc", O_RDWR|O_NONBLOCK|O_SYNC) = 3
getpid()                                = 29232
fcntl64(3, F_GETFL)                     = 0x1802 (flags O_RDWR|O_NONBLOCK|O_SYNC)
fcntl64(3, F_SETFL, O_RDWR|O_SYNC)      = 0
write(3, "S\0\0\0\375\377\377\377\6\377\0\0\220\0\0\0\270\374 (\10"..., 64

At which point there is no activity and the process, for all intents and
purposes, is hung. Now /dev/sgc appears as the following in /dev:

[root@thdl005 root]# ls -la /dev/sgc
lrwxrwxrwx    1 root     root            3 May 17 13:18 /dev/sgc -> sg2
[root@thdl005 root]# ls -la /dev/sg2
crw-rw----    1 root     disk      21,   2 Feb 18 10:26 /dev/sg2

and in /etc/sysconfig/hwconfig I have the following entry for /dev/sg2:

-
class: FLOPPY
bus: USB
detached: 0
driver: usb-storage
desc: "TEAC TEAC FD-05PUB"
usbclass: 8
usbsubclass: 4
usbprotocol: 0
usbbus: 1
usblevel: 2
usbport: 0
usbdev: 6
vendorId: 0644
deviceId: 0000
usbmfr: TEAC
usbprod: TEAC FD-05PUB
-
class: FLOPPY
bus: SCSI
detached: 0
device: sdb
driver: ignore
desc: "Teac FD-05PUB"
host: 1
id: 0
channel: 0
lun: 0
generic: sg2
-

Please reopen this bug, as I do not believe it is resolved. If you feel
otherwise, please indicate why. Thanks. Jeff

Reopening as requested in last comment. -ernie

I can confirm that the problem has started happening to us again. It doesn't
seem to be quite as frequent as before, and I still can't reliably reproduce
it on demand, but eventually it kicks in and bumps the load numbers way up.
Dana, please re-do the drill like last time: catch the situation early, kill
off all extra processes to keep dmesg from overflowing, trip sysrq-t with
"echo t > /proc/sysrq-trigger" or from the VT console, capture dmesg and
attach it as an attachment.

Please let me know if there is anything I can do to assist or information I
can provide.

Jeff, please look at comment #21 and before. Also, comment #42 has all the
necessary instructions, only they are ... condensed. Let me explain. The
objective is to recover the kernel stack traces of all hung processes. It is
straightforward with "echo t > /proc/sysrq-trigger". However, since dmesg
overflows, you had better catch it early (while few processes hang) and kill
off irrelevant processes such as system daemons before tripping the dump. Do
not drop the dmesg result into the bug comment box. Attach it to the bug
instead.

I can confirm that this happens. Among other servers there are 2 blades
running with the latest RHEL kernel (2.4.21-15.0.2.ELsmp). After changing the
console and CDROM around a bit (between two of them), both of them remain
with hung processes, and the console keyboard is not working anymore (the
keyboard is USB as well). Hung processes are:

root 83 0.0 0.0 0 0 ? DW 13:59 0:00 [khubd]
root 2615 0.0 0.0 0 0 ? DW 13:59 0:00 [usb-storage-0]
root 2617 0.0 0.0 0 0 ? DW 13:59 0:00 [scsi_eh_1]
root 14105 0.0 0.0 1380 236 ? D 14:01 0:00 /usr/bin/scsi_unique_id /dev/sdb

Other than that, the machine seems to be working ok. Is there anything
special I should do about that sysrq trigger? After I echo t to it, nothing
happens. If there is something else I can do to help, let me know.

Created attachment 101527 [details]
Sysreq trace of processes.
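The capture drill described above (kill extras, trip sysrq-t, save dmesg) might be scripted like this. It is a sketch only; `SYSRQ_TRIGGER` is parameterized purely so the logic can be exercised safely off the affected box, and on a real machine the script must run as root with sysrq enabled.

```shell
#!/bin/sh
# Sketch of the sysrq-t capture drill. On a real box SYSRQ_TRIGGER is
# /proc/sysrq-trigger; the variable exists only to make this testable.

SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

capture_task_dump() {
    # $1 = file to save the dump into
    # Step 1 (manual, before calling this): kill off unrelated daemons so
    # the kernel ring buffer is not flooded and syslogd does not drop lines.
    # Step 2: ask the kernel for stack traces of all tasks.
    echo t > "$SYSRQ_TRIGGER"
    # Step 3: read the ring buffer directly; dmesg is preferred over
    # /var/log/messages because syslogd can lose records under load.
    dmesg > "$1"
}

# Typical use as root:
#   capture_task_dump /tmp/sysrq-t.txt   # then attach the file to the bug
```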
Bojan, your trace is obviously corrupted. I asked to KILL OFF EXTRA PROCESSES
before taking the dmesg (which is preferable to /var/log/messages) several
times (see comments #17, 42, 44). Please take it again. I cannot do anything
until I get that little piece of info which your syslogd ate because you fed
too many stacks to it.

Ok Pete, will do tomorrow because I can't touch those servers anymore today
:/ Btw, they seem to work ok - they are fast and everything seems good except
that I can't access them through the console and the load average is always
4+.

Created attachment 101549 [details]
Dmesg, ps -axf, lsusb and syslog from hanged machine.
I hope this will help.
I found it easy to reproduce the bug. If I change the console and CDROM between
2 blades in intervals of ~10 seconds, soon they hang.
I'm not sure if waiting will help (like waiting 30+ seconds between console
switches) - will test that and let you know.
This attachment has the following files:
1) dmesg_output.txt - sysrq of dmesg
2) syslog.txt - syslog log of sysrq; it seems like it caught everything
because I killed almost all other processes. I also left a couple of lines at
the top which show what happens when you change the cdrom to the Blade.
3) ps_axf.txt - ps -axf when it hangs
4) lsusb.txt - lsusb - doesn't have anything after it hangs.
I hope this will help, if there's something else let me know.
Thanks, Bojan. These appear to be the key parts:

khubd D 00000002  3584    81      1   205    26 (L-TLB)
Call Trace:
[<c0123274>] schedule [kernel] 0x2f4 (0xf678de60)
[<c010ac83>] __down [kernel] 0x73 (0xf678dea4)
[<c010ae2c>] __down_failed [kernel] 0x8 (0xf678ded8)
[<f8edd51f>] .text.lock.usb [usb-storage] 0x85 (0xf678dee8)
[<f8eeb0a0>] usb_storage_driver [usb-storage] 0x20 (0xf678def0)
[<f8eeb080>] usb_storage_driver [usb-storage] 0x0 (0xf678def4)
[<f8bb2380>] usb_disconnect_Rsmp_8c20b3cd [usbcore] 0xa0 (0xf678def8)
[<f8bc3720>] hub_driver [usbcore] 0x0 (0xf678df18)
[<f8bb2428>] usb_disconnect_Rsmp_8c20b3cd [usbcore] 0x148 (0xf678df24)
[<f8bb538f>] usb_hub_port_connect_change [usbcore] 0x27f (0xf678df50)
[<f8bb4dac>] usb_hub_port_status [usbcore] 0x6c (0xf678df64)
[<f8bb56a2>] usb_hub_events [usbcore] 0x302 (0xf678df84)
[<f8bb5755>] usb_hub_thread [usbcore] 0x55 (0xf678dfb0)
[<f8bb5700>] usb_hub_thread [usbcore] 0x0 (0xf678dfe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xf678dff0)

usb-storage-0 D 00000002  3584  2692      1  2693   211 (L-TLB)
Call Trace:
[<c0123274>] schedule [kernel] 0x2f4 (0xc4f5fe2c)
[<c01238e1>] wait_for_completion [kernel] 0x71 (0xc4f5fe70)
[<f8eda8a5>] usb_stor_control_msg [usb-storage] 0x175 (0xc4f5fec4)
[<f8edb0f5>] usb_stor_CBI_transport [usb-storage] 0xa5 (0xc4f5fef0)
[<f8edae8c>] usb_stor_invoke_transport [usb-storage] 0x18c (0xc4f5ff20)
[<c0170000>] open_namei [kernel] 0x330 (0xc4f5ff48)
[<f8eda341>] usb_stor_ufi_command [usb-storage] 0x61 (0xc4f5ff60)
[<f8edc245>] usb_stor_control_thread [usb-storage] 0x355 (0xc4f5ff78)
[<f8edbef0>] usb_stor_control_thread [usb-storage] 0x0 (0xc4f5ffe0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc4f5fff0)

scsi_eh_2 D 00000001  4100  2693      1  3230  2692 (L-TLB)
Call Trace:
[<c0123274>] schedule [kernel] 0x2f4 (0xc4f37ec8)
[<c01238e1>] wait_for_completion [kernel] 0x71 (0xc4f37f0c)
[<c0123274>] schedule [kernel] 0x2f4 (0xc4f37f2c)
[<f8ed9279>] command_abort [usb-storage] 0x79 (0xc4f37f60)
[<f8b14621>] scsi_try_to_abort_command [scsi_mod] 0x61 (0xc4f37f6c)
[<f8b15421>] scsi_unjam_host [scsi_mod] 0x8a1 (0xc4f37f7c)
[<f8b157e8>] scsi_error_handler [scsi_mod] 0x148 (0xc4f37fb4)
[<f8b1db75>] .rodata.str1.1 [scsi_mod] 0x2071 (0xc4f37fbc)
[<f8b156a0>] scsi_error_handler [scsi_mod] 0x0 (0xc4f37fe4)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc4f37ff0)

Now I am going to deduce how this could have happened. If someone else wishes
to try their hand at this, they are very welcome, too.

I'm far from a kernel expert, but some wild guesses... When I switch from one
blade to the other, it obviously tries to initialize the CDROM and the
floppy. So, maybe this is the culprit, from scsi_eh_2:

[<f8b14621>] scsi_try_to_abort_command [scsi_mod] 0x61 (0xc4f37f6c)
[<f8b15421>] scsi_unjam_host [scsi_mod] 0x8a1 (0xc4f37f7c)
[<f8b157e8>] scsi_error_handler [scsi_mod] 0x148 (0xc4f37fb4)

I can see that it does something to the CDROM because the LED goes on. Also,
I did more tests today. It doesn't matter what the interval between switching
consoles is; eventually it will hang. I also tested it on 2 blades with
Win2K3 - it worked ok after a really nasty test (20 times switching with only
a couple of seconds between each time). Win2K3 (at least the console) freezes
for a couple of seconds but it continues to work ok.

Created attachment 101618 [details]
Conserv. Cand. #2
This is a follow-up candidate after the bug reopening.
It does NOT obsolete the original fix.
We should've asked Dana to refile this...
"Conservative" means that timeouts aren't added. I just fixed
the most idiotic races around the current URB. We'll see if
this is sufficient.
I'm away in a hotel in Disneyland right now, so a binary test kernel might be
delayed until Tuesday 7/6 at least. If someone gives the patch #2 a try by
building from source, I'll be grateful. Dana and Bojan have their boxes in
production, so perhaps Jeff can?

Created attachment 101627 [details]
Sysreq log after applying the patch.
Hi Pete,
I managed to get one test Blade and I installed a new kernel with your patch.
I applied it to the 2.4.21-15.0.2.EL kernel.
Things still lock up, but differently. The first time it was only khubd; the
other time there were lots of processes, like Dana reported at the beginning.
But the difference the second time was that the console remained alive! I was
able to work on it. Those other processes were probably stuck on the CD-ROM.
The new log file I attached has both dmesg's and the syslog output for the
second test.
I am disappointed, but we are on the right track. Thank you, Bojan.

Bojan, I have a special request, seeing how you were good to me with all the
info gathering. In the "newlog" batch, the messages.txt appears to be missing
something. Firstly, the scsi_eh processes are not consecutively numbered.
Also, the ps output has more processes. And finally, all processes are idle
and yet the partition reading is not succeeding. I'm not saying it's not
possible, but it's highly unlikely. So, would you be so kind as to do this:
change the constant LOG_BUF_LEN in kernel/printk.c to something like 131072,
rebuild the kernel, catch the problem early, and take the dmesg. Your
"newlog" messages file size is 96KB. A 128KB dmesg ought to be enough to
capture everything, including processes lost by syslogd. If you can capture
dmesg two or three times (with two or three reboots for cleanup), all the
better.

It looks like the special routines for isd200 may be implicated AND something
generic as well. The place is crawling with bugs.

Created attachment 101701 [details]
New dmesg log
Pete, I did as you wrote - I increased the ring buffer size for dmesg and
captured the output after it hung (I did the sysrq trigger twice).
Hope this will help you; one scsi_unique looks strange to me.
If you need more testing let me know, I have one spare (test) blade now.
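The LOG_BUF_LEN change Pete asked for amounts to a one-line edit of kernel/printk.c before rebuilding. A hedged sketch: the original value varies by kernel tree (16384 is typical for 2.4), so the sed simply rewrites whatever number is currently there, and the source path in the usage comment is an assumption.

```shell
#!/bin/sh
# Sketch of enlarging the kernel log ring buffer as requested above.
# Rewrites the LOG_BUF_LEN define regardless of its current value.

bump_log_buf_len() {
    # $1 = path to kernel/printk.c, $2 = new buffer size in bytes;
    # prints the edited file on stdout so nothing is modified in place.
    sed "s|^#define LOG_BUF_LEN.*|#define LOG_BUF_LEN ($2)|" "$1"
}

# Usage (path is illustrative; rebuild and reboot the kernel afterwards):
#   bump_log_buf_len /usr/src/linux-2.4/kernel/printk.c 131072 > printk.c.new
```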
I think I reproduced the mysterious hang that Bojan found. Symptoms are: both
usb-storage threads and both scsi_eh threads are idle, but a user task is
stuck doing the page read for the partition scan:

scsi_unique_i D 00000003   760 10963      1 11111 11349 (NOTLB)
Call Trace:
[<c0123274>] schedule [kernel] 0x2f4 (0xdb923b58)
[<c01340b5>] schedule_timeout [kernel] 0x65 (0xdb923b9c)
[<c0134040>] process_timeout [kernel] 0x0 (0xdb923bbc)
[<c01245ed>] io_schedule_timeout [kernel] 0x2d (0xdb923bd4)
[<c01cc55a>] __get_request_wait [kernel] 0xfa (0xdb923be0)
[<c01ccbac>] __make_request [kernel] 0x15c (0xdb923c3c)
[<c0121af0>] wake_up_cpu [kernel] 0x20 (0xdb923c60)
[<c01cd22a>] generic_make_request [kernel] 0xea (0xdb923c98)
........ lotsa stack

Looks like a lost wakeup way outside of usb-storage to me.

So what do you know, the struct request leak which Bojan captured did exist
(in the generic SCSI stack), and Doug Ledford actually fixed it for the U3
Beta. I ran a test kernel on the BladeCenter in RDU and it doesn't die
anymore at all. Bojan, please install and try out the kernel 2.4.21-17.EL
from ftp://people.redhat.com/zaitcev/120341/

Those kernels do not have the fix to sanitize the semaphoring of the current
URB, but it so happens that -17.EL also has an exclusive access semaphore to
help broken CD-ROMs, so it's a moot point apparently.

Created attachment 101848 [details]
Dmesg log, 2004-07-13
Well, I managed to crash it even with this latest kernel - I switched the
console around 10 times between two blades and it hung (on both blades, both
with this kernel).
I'd say it looks a bit better though.
From what I was able to see while it was still working, I think that the
culprit might be (this time) the /sbin/usbmodules process - this was the
last process spawned and it sits in D, so it could prevent other things
from finishing.
That process was present on both blades which hung.
In the attachment there is dmesg log of this machine - it wasn't complete (had
the default size) but it has usbmodules process in it. Let me know if you need
me to repeat this or if this is enough.
I interpret the latest dmesg (07-13) differently. scsi_unique_id PID 4032
took the open semaphore and got stuck, but since we do not see one of the
scsi_eh and usb-storage threads, there's no telling why. I'll pound our
BladeCenter and try to reproduce. I can only use one blade, so I switch
between None and Blade #5. Maybe it matters. Bojan, sorry to ask you again,
but can you do the same treatment with the bigger dmesg window to the -17.EL
and try again?

No success so far with -16.EL; no matter what I do, it works ok. Mike Gahagan
inserted a CD into the drive, it made no difference. Before, it was
relatively easy to make the request starvation happen through the Web remote
control interface. Debby Townsend sent the following note though (she tested
-17.EL): The scsi_unique_id hang was not easy to recreate -- it involved
pressing both the CD and console buttons repeatedly on three blades. Her
dmesg with SysRq was truncated, unfortunately.

Created attachment 101928 [details]
The latest dmesg
Ok, I finally had time to test it a little bit more.
I'd say that it works better than before, but still has problems.
I'm switching the console between two identical blades (same kernel). One of
them hung, the other didn't! I switched the console 20-30 times.
Pete, please find attached the latest dmesg log - I recompiled the kernel so
everything was caught by dmesg (this is the whole machine log since it
booted). Hope this will help.
So the candidate #2 is needed after all, and my comment #63 was wrong. I
cannot construct a specific scenario, though. Bojan, please apply that patch
again to -17.EL and test for me.

Created attachment 101951 [details]
Log with -17 and candidate patch
Well, it hung again for me. :(
Maybe I have a Blade which is an older revision? Wonder if that would make any
difference.
This one was doing ok the first round of tests and then it hung on both
blades. The processes looked about the same; here's the dmesg of one of them.
There's no way to blame a blade here. It's a bunch of bugs. However, the usb-storage code is extremely crufty. It needs a good overhaul, such as a backport from 2.6. But in fact I have already started a rewrite for 2.6, it was just so bad. This makes me to want a backport less. Might as well get the new driver... So, the bug is pretty much long term. But to get Dana, Bojan, and Jeff on track quickly, I suggest to comment out devlabel invocations from /etc/hotplug/usb.agent. It should work around the problem for now. My reasoning is that a blade is not a desktop, it should not be a problem to mount manually with a mount command, right? I trust that this workaround would only affect the storage components, and not the KB/Mouse? If so, I for one would deem that acceptable for now. Especially if the issue stays on the radar for possibly being resolved in a later QU. BTW, my thanks to Bojan for picking up the ball and running on the testing. Summers are just too hectic around here, so I haven't had the time to contribute to testing. Pete, so the complete workaround would be: 1) Use -17 kernel 2) Apply candidate #2 3) Comment out devlabel invocations As long as the keyboard and mouse work this is great (will try on monday). Your reasoning is correct - at least here we use them as servers only and rarely work on console. Np for this Dana, I'm glad we're finally getting somewhere, my management had bad opinion about Linux on Blades after initial problems. Besides, it's winter here so I have to work anyway. :) The #2 doesn't solve the full problem. The SCSI EH process still gets stuck. The reason is pretty sick really... In 2.4 the wisdom was that USB runs no its own timeouts, because the EH watches over SCSI commands and will abort anything through the SCSI interface (scsiglue.c). We feared that running two timeouts will race. It's a bad design to layer timeouts in general. 
Unfortunately, scsiglue faces a dilemma if the USB thread is doing something: bust locks and unlink URBs, or hang on locks. 2.4 does both, inconsistently: it tries to unlink and races abort notifications, AND it can hang, because it needs notifications but URBs can hang. In 2.6 all URBs have timeouts and locking is always observed by scsiglue.c. So I think that, long term, we have (I have) to look at a backport from 2.6. Bojan, please test (1)+(3) and let me know if it works. Your own comment #71 demonstrates that patch #2 is pointless; it helps a little, but not enough.

I can confirm that the workaround works OK! After installing kernel -17 and commenting out the devlabel invocations, everything worked. I switched the console around 40 times and everything went fine. Any ETA on when this kernel will be officially out? I'll apply this on the other blades we have.

Bojan, I'm not exactly sure whether the schedule of RHEL 3 U3 is communicated externally outside of NDAs. You have to ask your account representative about this sort of thing. I can only promise you that -17.EL represents a trunk (not a branch) and all changes there will be present in U3.

What is the situation here? Are we still running with devlabel commented out at the customer site? I'm running all my boxes with devlabel commented out. Was there something else that happened on Red Hat's end that would have improved this situation and that we should be trying?

I'd like to close this as WONTFIX. RHEL 4 has a vastly superior mechanism for processing hotplug events which does not require devlabel to be restarted, and its kernel component is not as racy either. I advocate upgrading, at least long term. I will check this with the latest kernel on my test blade. Lately we have been installing only RHEL AS 2.1 (due to PeopleSoft NOT supporting anything else :/). The interesting thing is that the bug mentioned here doesn't exist on 2.1 either. Will let you know when I test with the latest kernel and devlabel uncommented.
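For anyone applying the workaround discussed above (comment out the devlabel invocations in /etc/hotplug/usb.agent), a minimal sketch follows. It is hedged: it operates on a throwaway demo file rather than the real agent script, and the devlabel line inside the demo file is an assumption about what the invocation looks like - inspect your actual usb.agent before editing it.

```shell
# Hypothetical demonstration of the devlabel workaround.  The real file
# is /etc/hotplug/usb.agent; we use a throwaway copy here so these
# commands are safe to run anywhere.  The devlabel line is illustrative.
demo=/tmp/usb.agent.demo
cat > "$demo" <<'EOF'
/sbin/devlabel restart
EOF

# Comment out every uncommented line that invokes devlabel
# (sed keeps a backup of the original as .bak).
sed -i.bak '/^[^#].*devlabel/s/^/#/' "$demo"
cat "$demo"
```

On a real blade you would point the sed command at /etc/hotplug/usb.agent itself, and then mount USB storage by hand when needed (e.g. `mount /dev/sda1 /mnt`), which matches Pete's "a blade is not a desktop" reasoning.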
I don't disagree with your assessment of this bug; HOWEVER, you do know that this problem occurs on the IBM BladeCenter, for which the functional QLogic SAN support was removed in RHEL 4, and for which there won't be a suitable fix until Update 2 or later? And when one asks why the SAN support was removed without a suitable replacement being available, the response is that there should be no hurry for Red Hat customers to upgrade to RHEL 4... I know it's a bit of an off-topic rant, but this bug just helps to highlight my frustration with the hit-or-miss level of support the BladeCenter has received from Red Hat, given that it has supposedly been certified by both Red Hat and IBM since AS 2.1.

By the way, while I'm looking to WONTFIX this bug report (the outright race and the hang), I am working on something which ought to help a lot, hopefully. See bug 156831. There, we deal with open succeeding when it should fail. A side effect of that fix may be hotplug exiting instantaneously instead of sticking around, thus resolving the whole issue. Well, we'll see... Bojan, please leave devlabel commented out. Like I said, while I do not officially withdraw this recommendation yet, I hope that we may be able to live without it now (with bug 156831 fixed). So, if someone could test the U7 Beta and let me know if the same issue is present, that would be great. This bug was WONTFIX'ed, but I keep it in mind and continue to seek improvements. Actually, I only asked the requestors for WONTFIX permission but did not close... I'll do it now. BTW, Dana mentioned that SAN was not supported in a comment of 2005-05-16, but that surely is not the case anymore. Jeff, please steer customers to try RHEL 4 U2. I personally saw it working on a BladeCenter with an HS-20.

FWIW, boot-from-SAN support has yet to be implemented for RHEL 4, and multipath support is still apparently undocumented and at your own risk.
Sure would be nice if the "stable" version of RHEL weren't having major development like this written as we go, given that a stable and tested configuration existed right from the start. There should have been no question that device-mapper-multipath should have waited for RHEL 5. I still don't disagree that it's not worth trying to fix *this* bug at this point, though. I just wish some of the kernel developers and managers at Red Hat understood what a colossal mess RHEL 4 is on a SAN due to what appears to be one man's hatred of the existing QLogic drivers. Running a BladeCenter with Red Hat has been a huge pain all along, and there's no reason why it should be.
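Since the standing recommendation in this thread is to keep devlabel commented out until a fixed update ships, here is a small audit sketch. It is a hedged illustration: it defaults to a demo file so it is runnable anywhere, and the demo file's contents are an assumption; on a blade you would set AGENT to the real /etc/hotplug/usb.agent.

```shell
# Sketch: report whether any uncommented devlabel invocation remains.
# AGENT defaults to a hypothetical demo file; on a real blade, run with
# AGENT=/etc/hotplug/usb.agent instead.
AGENT=${AGENT:-/tmp/usb.agent.check}
[ -f "$AGENT" ] || printf '#/sbin/devlabel restart\n' > "$AGENT"

if grep -q '^[^#].*devlabel' "$AGENT"; then
    echo "WARNING: active devlabel calls remain in $AGENT"
else
    echo "OK: devlabel calls are commented out in $AGENT"
fi
```

A check like this could be dropped into a cron job or kickstart post-script so the workaround survives package updates that rewrite usb.agent.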