Red Hat Bugzilla – Bug 194190
usb kernel panics on reboot with usb mouse/keyboard
Last modified: 2012-06-20 12:11:08 EDT
Description of problem:
usb kernel panics on reboot
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. install os
2. reboot from firstboot
3. kernel panic
After rebooting for the first time, USB causes a kernel panic and disables USB
entirely, so we cannot use a USB keyboard or mouse.
Reboot into OS
Please let me know if there are any patches or diff modules I can use. I am at
your full disposal to resolve this issue asap. My dmesg is attached.
Created attachment 130588 [details]
I forgot to mention that this is not reproducible on 3 update 7 or 4 update 2 or
any other version of Red Hat Enterprise Linux.
The good news is, the oops may be fixed in RHEL 4 U4 (see bug 182433),
or at worst in U5 (depending on its position in the priority queue).
But the bad news is, if U2 works and U3 does not, we'll need to look closer.
Please DO NOT DUP this into 182433, at least until we figure this out.
I saw interrupt tables move between U2 and U3, wreaking havoc with USB.
In order to cut this branch of the fault tree, it would be very useful
to install and boot an old kernel from U2 (I think 2.6.9-15.EL). Then,
collect a dmesg so we can look for differences.
Created attachment 130630 [details]
Dmesg from 2.6.9-22.ELsmp kernel
OK, so I grabbed the kernel from 4u2 (2.6.9-22.ELsmp) and USB seems to be
functioning properly. Above is the requested dmesg.
The dmesgs are not taken on the same computer, so they are less useful than
they otherwise would be. Here's just a little part of the diff -u:
CPU: L2 Cache: 1024K (64 bytes/line)
-CPU 0(2) -> Node 0
-CPU0: Physical Processor ID: 0
-CPU0: Processor Core ID: 0
-CPU0: Initial APIC ID: 0
-CPU0: Dual Core AMD Opteron(tm) Processor 265 HE stepping 02
-per-CPU timeslice cutoff: 1024.04 usecs.
+AMD CPU0: Physical Processor ID: 0
+AMD CPU0: Processor Core ID: 0
+AMD CPU0: Initial APIC ID: 0
+CPU 0(1) -> Node 0
+CPU0: AMD Opteron(tm) Processor 248 HE stepping 01
+per-CPU timeslice cutoff: 1024.20 usecs.
task migration cache decay timeout: 2 msecs.
I need one of these dmesgs to be taken again, on the same unit
as the other dmesg.
I see why you took them, the motherboards seem to have very similar layout
at least as far as USB is concerned. So, the diff looks like this:
... everything the same ...
@@ -326,12 +327,16 @@
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
+ohci_hcd 0000:00:03.1: wakeup
usb 3-1: new full speed USB device using address 2
hub 3-1:1.0: USB hub found
hub 3-1:1.0: 4 ports detected
usb 3-1.1: new low speed USB device using address 3
input: USB HID v1.00 Keyboard [USBPS2] on usb-0000:00:03.1-1.1
input: USB HID v1.00 Mouse [USBPS2] on usb-0000:00:03.1-1.1
+usb 3-1.2: new low speed USB device using address 4
+input: USB HID v1.10 Keyboard [CHESEN PS2 to USB Converter] on usb-0000:00:03.1-1.2
+input: USB HID v1.10 Mouse [CHESEN PS2 to USB Converter] on usb-0000:00:03.1-1.2
ACPI: Power Button (FF) [PWRF]
EXT3 FS on dm-0, internal journal
device-mapper: dm-multipath version 1.0.4 loaded
@@ -339,12 +344,45 @@
EXT3 FS on sda1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 2031608k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1
+ohci_hcd 0000:00:03.1: OHCI Unrecoverable Error, disabled
+bad: scheduling while atomic!
Which is fine, but on the other hand, there weren't any changes to
OHCI driver to create this problem. It cannot possibly regress.
So... Maybe the mouse not present on the box with -22.EL allows it
to work, or the BIOS is somehow different. Which is why doing it on
the same box with same peripherals is important.
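A comparison like the one above can be scripted once both captures come from the same box. The following is only a minimal sketch: the function name `added_lines` and the two sample strings are illustrative assumptions modelled on the messages quoted above, not the contents of any attachment.

```python
import difflib
from itertools import islice


def added_lines(good_dmesg, bad_dmesg):
    """Return the lines that appear only in the second capture."""
    diff = difflib.unified_diff(
        good_dmesg.splitlines(),
        bad_dmesg.splitlines(),
        fromfile="good",
        tofile="bad",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    # Skip the two header lines ("--- good", "+++ bad"), then keep additions.
    return [line[1:] for line in islice(diff, 2, None) if line.startswith("+")]


# Hypothetical excerpts, shortened from the messages quoted in this bug.
good = (
    "md: ... autorun DONE.\n"
    "usb 3-1: new full speed USB device using address 2\n"
)
bad = (
    "md: ... autorun DONE.\n"
    "ohci_hcd 0000:00:03.1: wakeup\n"
    "usb 3-1: new full speed USB device using address 2\n"
    "ohci_hcd 0000:00:03.1: OHCI Unrecoverable Error, disabled\n"
)

for line in added_lines(good, bad):
    print(line)
```

Run against captures from two different machines, a script like this mostly reports hardware differences, which is the problem described above.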
Things I forgot to mention: this is a blade unit that uses USB for the mouse,
keyboard, and KVM functions. Also, we are constantly switching from one blade to
another, so the USB disconnects and connects frequently.
I also do not have access to bug 182433.
My co-workers have just mentioned that they have seen this with sles 9 sp3. In
that case it was weird because they only saw it when it had REV E procs and not
REV C/G. The systems were identical in every other way. Possibly a weird race
We have also reproduced this once in fedora core 5.
We reloaded the system, so it took some time to reproduce this bug. That's why we
were unable to get back to you right away.
What eventually triggered it is that we put some load on the system by typing
dd if=/dev/urandom of=/dev/null and then switched the KVM to use another blade.
We tried these same steps with the 2.6.9-22 kernel but were not able to
reproduce the problem yet.
Also we don't seem to be able to reproduce this on every AMD blade we've tried.
Again, if you need any data I will be quick to provide it. Thanks for your help on this.
Created attachment 130702 [details]
Created attachment 130704 [details]
The comment #10 contains the same file as comment #9. So I need a dmesg
of -22 taken, on the same system (with the same load, e.g. the dd).
I understand that this can happen on a variety of systems and with
several OS versions. So, our hunt for the strict regression may ultimately
Frankly, I doubt that this is a regression, because OHCI driver and USB
stack didn't see changes which could account for the symptoms between
U2 and U3. However, it can be something else.
The symptom is that OHCI hardware was unable to access its control data,
which should not happen. What can it be? Possibilities are wide
#1 is a hardware bug(s): can be PCI-2-HT bridge, OHCI, or anything
#2 other driver mistakenly invalidating an IOMMU entry
#3 a race in the platform core when it handles IOMMU
#4 SMM BIOS accessing the chip and not restoring the state fully
#5 .... other .....
If we nail a regression, it's going to be a quicker trip. Otherwise
it can be months, because this is something very hard to catch.
I'd need hardware access, most likely...
Created attachment 130715 [details]
My bad, for some reason I uploaded the same dmesg twice. Here is the correct one.
hmmm, we lost this comment:
------- Additional Comments From firstname.lastname@example.org 2006-06-09
12:53 EST -------
Created an attachment (id=130832)
Dmesg 2.6.9-34 usb-handoff
Here is the dmesg for the usb-handoff option.
I will go ahead and put this blade aside so it is always available for us to
test on. I do have other blades that have this same issue so if you would like
the dmesg from those machines please let me know.
Right, thanks, Jason. Fortunately, I downloaded and examined the attachment
before the crash, when Sarah attached it.
I looked at the issue head-on, but wasn't successful. So we have to bisect,
I cannot help it.
Test kernels are uploaded to http://people.redhat.com/zaitcev/ftp/194190/
Bisection is done in this way.
1. Save the list, don't trust your memory! (The list is below)
We have two tested kernels on the list, -22.EL and -34.EL.
Mark first "ok", mark the last one "bad".
2. Pick a kernel to test in the middle of the distance between
"ok" and "bad". At the first step, it is 2.6.9-22.19.EL.
3. If that kernel works, mark it ok in the list, and move to the
higher half. At the first step, it would be 2.6.9-24.EL
If it does not work, mark it bad, move to lower half.
At the first step, that would be 2.6.9-22.9.EL
4. If something is present between "ok" and "bad", go to step 2
Once you're done, you should end with two kernels next to each other,
one good, one bad.
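The steps above amount to a binary search over the ordered kernel list. Here is a minimal sketch, with several assumptions: the function name `bisect_kernels` and the `boots_ok` callback are hypothetical, and the five-entry list contains only the versions named in this comment, not the full list of test kernels. It also assumes every test gives a reliable answer, which an intermittent failure would break.

```python
from typing import Callable, List, Tuple


def bisect_kernels(
    kernels: List[str], boots_ok: Callable[[str], bool]
) -> Tuple[str, str]:
    """Return (last ok, first bad); kernels[0] must be ok, kernels[-1] bad."""
    ok, bad = 0, len(kernels) - 1      # step 1: mark first "ok", last "bad"
    while bad - ok > 1:                # step 4: something is still in between
        mid = (ok + bad) // 2          # step 2: pick the middle kernel
        if boots_ok(kernels[mid]):     # step 3: works, so move to higher half
            ok = mid
        else:                          # does not work, so move to lower half
            bad = mid
    return kernels[ok], kernels[bad]


# Shortened, hypothetical list built only from versions named in this comment.
kernels = [
    "2.6.9-22.EL",
    "2.6.9-22.9.EL",
    "2.6.9-22.19.EL",
    "2.6.9-24.EL",
    "2.6.9-34.EL",
]

# Pretend, purely for illustration, that the problem first appears in -24.EL.
first_bad = kernels.index("2.6.9-24.EL")
tested = []


def boots_ok(kernel: str) -> bool:
    tested.append(kernel)              # save the list, don't trust your memory
    return kernels.index(kernel) < first_bad


print(bisect_kernels(kernels, boots_ok))
print(tested)
```

In practice `boots_ok` would be a person booting the kernel and running the reproduction steps, and each result would be written down before moving on.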
Please let me know how it goes soon, these kernels take a lot of space.
Created attachment 131212 [details]
2.6.9-22 Console Capture of call trace
So today, after a lot of effort, I was able to force the 2.6.9-22.ELsmp kernel to fail.
It took a lot of plugging and unplugging of USB and switching the
KVM while doing dd if=/dev/urandom of=/dev/null
This is very difficult to reproduce at times. In all honesty, I do not think
this is a regression issue at all.
OK, I'm deleting test kernels from the upload area.
I'm afraid it's not good news. I was concerned about this scenario
when I wrote comment #10 on 6/07. Observe that EHCI hardware dies too,
only its driver can restart cleanly. It may be the same reason.
Please attach the output of dmidecode and lspci -v. I'll pass them
over to Jim Paradis and Andi Kleen to ask if they know of DMA problems
with the specific hardware.
Created attachment 131295 [details]
Created attachment 131296 [details]
lspci -v 2.6.9-34
I have gotten the OK to get you guys hardware for testing. Would you like to do
I reckon that getting the hardware is unavoidable if we want this resolved.
Since I am remote in California, we're going to have this discussed
and perhaps someone else would take the bug.
I have turned this over with Andi Kleen; the result is negative.
HT1000 is not known for issues of this kind. We may be first though,
if something on the board is not connected right...
(In reply to comment #21)
> I reckon that getting the hardware is unavoidable if we want this resolved.
> Since I am remote in California, we're going to have this discussed
> and perhaps someone else would take the bug.
California where? I am located in California also.
Created attachment 131375 [details]
FC5 kernel panic
I have also gotten this to fail on FC5. Attached is the dmesg if this helps
Today we found something pretty interesting involving this bug on our AMD
During the previous tests we had "PowerNow" enabled in the BIOS and at some
point or another could lose USB functionality on all AMD blades in the chassis.
Disabling PowerNow in the bios and re-running the previous test, the usb no
That's a relief. If a workaround is available, we'll have the severity lowered.
But still... I need to backport the HC restart code from the current 2.6.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please See https://access.redhat.com/support/policy/updates/errata/
If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.