Description of problem: USB causes a kernel panic on reboot
Version-Release number of selected component (if applicable):
How reproducible: Every time
Steps to Reproduce:
1. Install OS
2. Reboot from firstboot
3. Kernel panic
Actual results: After rebooting for the first time, USB causes a kernel panic and is disabled entirely, so we cannot use a USB keyboard or mouse.
Expected results: Reboot into OS
Additional info: Please let me know if there are any patches or diff modules I can use. I am at your full disposal to resolve this issue ASAP. My dmesg is attached.
Created attachment 130588 [details] Dmesg
I forgot to mention that this is not reproducible on 3 Update 7, 4 Update 2, or any other version of Red Hat Enterprise Linux.
The good news is that the oops may be fixed in RHEL 4 U4 (see bug 182433), or at worst in U5 (depending on its position in the priority queue). The bad news is that if U2 works and U3 does not, we'll need to look closer. Please DO NOT DUP this into 182433, at least until we figure this out.
I have seen interrupt tables move between U2 and U3, wreaking havoc with USB. To cut this branch of the fault tree, it would be very useful to install and boot an old kernel from U2 (I think 2.6.9-15.EL), then collect a dmesg so we can look for differences.
Created attachment 130630 [details] Dmesg from 2.6.9-22.ELsmp kernel
OK, so I grabbed the kernel from 4 U2 (2.6.9-22.ELsmp) and USB seems to be functioning properly. The requested dmesg is attached above.
The dmesgs were not taken on the same computer, so they are less useful than they otherwise would be. Here's just a little part of the diff -u:
 CPU: L2 Cache: 1024K (64 bytes/line)
-CPU 0(2) -> Node 0
-CPU0: Physical Processor ID: 0
-CPU0: Processor Core ID: 0
-CPU0: Initial APIC ID: 0
-CPU0: Dual Core AMD Opteron(tm) Processor 265 HE stepping 02
-per-CPU timeslice cutoff: 1024.04 usecs.
+AMD CPU0: Physical Processor ID: 0
+AMD CPU0: Processor Core ID: 0
+AMD CPU0: Initial APIC ID: 0
+CPU 0(1) -> Node 0
+CPU0: AMD Opteron(tm) Processor 248 HE stepping 01
+per-CPU timeslice cutoff: 1024.20 usecs.
 task migration cache decay timeout: 2 msecs.
I need one of these dmesgs to be taken again, on the same unit as the other dmesg.
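For anyone repeating the comparison once both captures come from the same unit, here is a minimal sketch of the diff step in Python, using the standard difflib module in place of diff -u. The function names and the default file labels are hypothetical, chosen only for illustration.

```python
import difflib


def diff_dmesg(old_text, new_text, old_name="dmesg-old", new_name="dmesg-new"):
    """Return a unified diff (as a list of lines) between two dmesg captures."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def changed_lines(diff):
    """Keep only added and removed lines from a unified diff.

    The first two lines of a non-empty unified diff are the ---/+++ file
    headers, so they are skipped by position rather than by prefix.
    """
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Calling changed_lines(diff_dmesg(old, new)) on two captures gives a short list to read through. Hardware-specific lines such as the CPU model will dominate that list whenever the captures come from different machines, which is the problem described above.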
I see why you took them: the motherboards seem to have a very similar layout, at least as far as USB is concerned. So, the diff looks like this:
... everything the same ...
@@ -326,12 +327,16 @@
 md: Autodetecting RAID arrays.
 md: autorun ...
 md: ... autorun DONE.
+ohci_hcd 0000:00:03.1: wakeup
 usb 3-1: new full speed USB device using address 2
 hub 3-1:1.0: USB hub found
 hub 3-1:1.0: 4 ports detected
 usb 3-1.1: new low speed USB device using address 3
 input: USB HID v1.00 Keyboard [USBPS2] on usb-0000:00:03.1-1.1
 input: USB HID v1.00 Mouse [USBPS2] on usb-0000:00:03.1-1.1
+usb 3-1.2: new low speed USB device using address 4
+input: USB HID v1.10 Keyboard [CHESEN PS2 to USB Converter] on usb-0000:00:03.1-1.2
+input: USB HID v1.10 Mouse [CHESEN PS2 to USB Converter] on usb-0000:00:03.1-1.2
 ACPI: Power Button (FF) [PWRF]
 EXT3 FS on dm-0, internal journal
 device-mapper: dm-multipath version 1.0.4 loaded
@@ -339,12 +344,45 @@
 EXT3 FS on sda1, internal journal
 EXT3-fs: mounted filesystem with ordered data mode.
 Adding 2031608k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1
+ohci_hcd 0000:00:03.1: OHCI Unrecoverable Error, disabled
+bad: scheduling while atomic!
.....
Which is fine, but on the other hand, there weren't any changes to the OHCI driver that could create this problem. It cannot possibly regress. So... maybe the mouse not being present on the box with -22.EL allows it to work, or the BIOS is somehow different. Which is why doing it on the same box with the same peripherals is important.
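When checking many captures for this failure, the two added lines at the end of the diff make a usable signature. A minimal sketch in Python follows. The marker strings are taken from the dmesg excerpt above; the function name and the idea of scanning for them automatically are assumptions for illustration, not something the kernel or Bugzilla provides.

```python
# Substrings taken from the failing dmesg quoted in this report.
FAILURE_MARKERS = ("OHCI Unrecoverable Error", "scheduling while atomic")


def find_failures(dmesg_text):
    """Return (line_number, line) pairs for lines that contain a failure marker."""
    hits = []
    for lineno, line in enumerate(dmesg_text.splitlines(), start=1):
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append((lineno, line.strip()))
    return hits
```

An empty result only means the markers were not logged in that capture. It does not prove the controller stayed healthy.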
Things I forgot to mention: this is a blade unit that uses USB for the mouse, keyboard, and KVM functions. Also, we are constantly switching from one blade to another, so the USB devices disconnect and reconnect frequently. I also do not have access to bug 182433.
My co-workers have just mentioned that they have seen this with SLES 9 SP3. In that case it was weird because they only saw it with REV E procs and not with REV C/G. The systems were identical in every other way. Possibly a weird race condition? We have also reproduced this once in Fedora Core 5.
We reloaded the system, so it took some time to reproduce this bug. That's why we were unable to get back to you right away. What eventually triggered it was putting some load on the system by running dd if=/dev/urandom of=/dev/null and then switching the KVM to another blade. We tried these same steps with the 2.6.9-22 kernel but have not been able to reproduce the problem yet. Also, we don't seem to be able to reproduce this on every AMD blade we've tried.
Again, if you need any data I will be quick to provide it. Thanks for your help on this.
Created attachment 130702 [details] 2.6.9-34 Dmesg
Created attachment 130704 [details] 2.6.9-22 Dmesg
The comment #10 contains the same file as comment #9. So I need a dmesg of -22 taken on the same system (with the same load, e.g. the dd).
I understand that this can happen on a variety of systems and with several OS versions. So, our hunt for a strict regression may ultimately be misguided. Frankly, I doubt that this is a regression, because the OHCI driver and USB stack didn't see changes between U2 and U3 that could account for the symptoms. However, it can be something else.
The symptom is that the OHCI hardware was unable to access its control data, which should not happen. What can it be? The possibilities are wide:
#1 a hardware bug (or bugs): can be the PCI-2-HT bridge, OHCI, or anything
#2 another driver mistakenly invalidating an IOMMU entry
#3 a race in the platform core when it handles the IOMMU
#4 SMM BIOS accessing the chip and not restoring the state fully
#5 .... other .....
If we nail a regression, it's going to be a quicker trip. Otherwise it can take months, because this is something very hard to catch. I'd need hardware access, most likely...
Created attachment 130715 [details] 2.6.9-22 Dmesg
My bad, for some reason I uploaded the same dmesg twice. Here is the correct one for -22.
Hmmm, we lost this comment:
------- Additional Comments From sprelutsky 2006-06-09 12:53 EST -------
Created an attachment (id=130832) --> (https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=130832&action=view)
Dmesg 2.6.9-34 usb-handoff
Here is the dmesg for the usb-handoff option. I will go ahead and put this blade aside so it is always available for us to test on. I do have other blades that have this same issue, so if you would like the dmesg from those machines please let me know.
Right, thanks, Jason. Fortunately, I downloaded and examined the attachment before the crash, when Sarah attached it.
I looked at the issue head-on, but wasn't successful. So we have to bisect, I cannot help it. Test kernels are uploaded to http://people.redhat.com/zaitcev/ftp/194190/
Bisection is done in this way:
1. Save the list, don't trust your memory! (The list is below.) We have two tested kernels on the list, -22.EL and -34.EL. Mark the first one "ok", mark the last one "bad".
2. Pick a kernel to test in the middle of the distance between "ok" and "bad". At the first step, it is 2.6.9-22.19.EL.
3. If that kernel works, mark it ok in the list and move to the higher half. At the first step, that would be 2.6.9-24.EL. If it does not work, mark it bad and move to the lower half. At the first step, that would be 2.6.9-22.9.EL.
4. If something is present between "ok" and "bad", go to step 2.
Once you're done, you should end with two kernels next to each other, one good, one bad.
Please let me know how it goes soon, these kernels take a lot of space.
The list:
2.6.9-22.EL      Shipping
2.6.9-22.1.EL    4E-kernel
2.6.9-22.2.EL    4E-kernel
2.6.9-22.3.EL    4E-kernel
2.6.9-22.4.EL    4E-kernel
2.6.9-22.5.EL    4E-kernel
2.6.9-22.6.EL    4E-kernel
2.6.9-22.7.EL    4E-kernel
2.6.9-22.8.EL    4E-kernel
2.6.9-22.9.EL    4E-kernel
2.6.9-22.10.EL   4E-kernel
2.6.9-22.11.EL   4E-kernel
2.6.9-22.12.EL   4E-kernel
2.6.9-22.13.EL   4E-U3
2.6.9-22.14.EL   4E-kernel
2.6.9-22.15.EL   4E-kernel
2.6.9-22.16.EL   4E-kernel
2.6.9-22.17.EL   4E-U3
2.6.9-22.18.EL   4E-kernel
2.6.9-22.19.EL   4E-kernel
2.6.9-22.20.EL   4E-kernel
2.6.9-22.21.EL   4E-kernel
2.6.9-22.22.EL   4E-kernel
2.6.9-22.23.EL   4E-kernel
2.6.9-22.24.EL   4E-kernel
2.6.9-22.25.EL   4E-U3
2.6.9-22.26.EL   4E-kernel
2.6.9-22.27.EL   4E-kernel
2.6.9-23.EL      4E-kernel
2.6.9-24.EL      4E-U3
2.6.9-25.EL      4E-U3
2.6.9-26.EL      4E-U3
2.6.9-27.EL      4E-U3
2.6.9-28.EL      4E-U3
2.6.9-29.EL      4E-U3
2.6.9-30.EL      4E-U3
2.6.9-31.EL      4E-U3
2.6.9-32.EL      4E-U3
2.6.9-33.EL      4E-U3
2.6.9-34.EL      4E-U3
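To keep track of the bisection without relying on memory, the bookkeeping in the steps above can be sketched in a few lines of Python. This is only an illustration: the names KERNELS and next_to_test are hypothetical, the list is rebuilt from the version numbers given above, and the sketch assumes the results are consistent (every "ok" kernel is older than every "bad" one), which an intermittent failure may violate.

```python
# The 40 kernels from the list above, oldest first.
KERNELS = (
    ["2.6.9-22.EL"]
    + [f"2.6.9-22.{n}.EL" for n in range(1, 28)]
    + [f"2.6.9-{n}.EL" for n in range(23, 35)]
)


def next_to_test(kernels, status):
    """Return the next kernel to boot, or None when the bisection is finished.

    `status` maps a kernel name to "ok" or "bad". At least one kernel must be
    marked "ok" and one "bad" before calling this.
    """
    last_ok = max(i for i, k in enumerate(kernels) if status.get(k) == "ok")
    first_bad = min(i for i, k in enumerate(kernels) if status.get(k) == "bad")
    if first_bad - last_ok <= 1:
        return None  # two adjacent kernels remain: one good, one bad
    return kernels[(last_ok + first_bad) // 2]


if __name__ == "__main__":
    status = {"2.6.9-22.EL": "ok", "2.6.9-34.EL": "bad"}
    print(next_to_test(KERNELS, status))
```

After each test boot, record the result in status and call next_to_test again. With 40 kernels on the list this finishes in about five or six boots.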
Created attachment 131212 [details] 2.6.9-22 Console Capture of call trace
So today, after a lot of effort, I was able to force the 2.6.9-22.ELsmp kernel to fail. It took a lot of plugging and unplugging of USB and switching the KVM while running dd if=/dev/urandom of=/dev/null. This is very difficult to reproduce at times. In all honesty, I do not think this is a regression issue at all.
OK, I'm deleting the test kernels from the upload area. I'm afraid this is not good news. I was concerned about this scenario when I wrote comment #10 on 6/07. Observe that the EHCI hardware dies too, only its driver can restart cleanly. It may be the same reason. Please attach the output of dmidecode and lspci -v. I'll pass them over to Jim Paradis and Andi Kleen to ask if they know of DMA problems with this specific hardware.
Created attachment 131295 [details] Dmidecode 2.6.9-34
Created attachment 131296 [details] lspci -v 2.6.9-34
Pete-- I have gotten the OK to get you guys hardware for testing. Would you like to do this? --Sarah
I reckon that getting the hardware is unavoidable if we want this resolved. Since I am remote in California, we're going to have this discussed and perhaps someone else would take the bug.
I have talked this over with Andi Kleen, and the result is negative. HT1000 is not known for issues of this kind. We may be the first, though, if something on the board is not connected right...
(In reply to comment #21)
> I reckon that getting the hardware is unavoidable if we want this resolved.
> Since I am remote in California, we're going to have this discussed
> and perhaps someone else would take the bug.
California where? I am located in California also.
--Sarah
Created attachment 131375 [details] FC5 kernel panic I have also gotten this to fail on FC5. Attached is the dmesg if this helps any.
Today we found something pretty interesting involving this bug on our AMD Opteron servers. During the previous tests we had "PowerNow" enabled in the BIOS, and at some point or another we could lose USB functionality on all AMD blades in the chassis. After disabling PowerNow in the BIOS and re-running the previous test, USB no longer fails!
That's a relief. If a workaround is available, we'll have the severity lowered. But still... I need to backport the HC restart code from the current 2.6.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please See https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.