Bug 194190 - usb kernel panics on reboot with usb mouse/keyboard
Summary: usb kernel panics on reboot with usb mouse/keyboard
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
: ---
Assignee: Pete Zaitcev
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-06-06 00:23 UTC by Sarah Prelutsky
Modified: 2012-06-20 16:11 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 16:11:08 UTC
Target Upstream Version:
Embargoed:


Attachments
Dmesg (19.58 KB, text/plain), 2006-06-06 00:23 UTC, Sarah Prelutsky
Dmesg from 2.6.9-22.ELsmp kernel (16.48 KB, text/plain), 2006-06-06 18:19 UTC, Sarah Prelutsky
2.6.9-34 Dmesg (21.65 KB, text/plain), 2006-06-07 20:08 UTC, Sarah Prelutsky
2.6.9-22 Dmesg (21.65 KB, text/plain), 2006-06-07 20:08 UTC, Sarah Prelutsky
2.6.9-22 Dmesg (27.24 KB, text/plain), 2006-06-08 00:14 UTC, Sarah Prelutsky
2.6.9-22 Console Capture of call trace (14.26 KB, application/octet-stream), 2006-06-20 18:53 UTC, Sarah Prelutsky
Dmidecode 2.6.9-34 (10.91 KB, text/plain), 2006-06-21 17:19 UTC, Sarah Prelutsky
lspci -v 2.6.9-34 (7.86 KB, text/plain), 2006-06-21 17:20 UTC, Sarah Prelutsky
FC5 kernel panic (24.52 KB, text/plain), 2006-06-22 19:13 UTC, Sarah Prelutsky

Description Sarah Prelutsky 2006-06-06 00:23:35 UTC
Description of problem:
usb kernel panics on reboot


Version-Release number of selected component (if applicable):


How reproducible:
Every time


Steps to Reproduce:
1. install os
2. reboot from firstboot
3. kernel panic
  
Actual results:
After rebooting for the first time, USB causes a kernel panic and is disabled
entirely, so we cannot use a USB keyboard or mouse.

Expected results:
Reboot into OS

Additional info:
Please let me know if there are any patches or diff modules I can use. I am at
your full disposal to resolve this issue ASAP. My dmesg is attached.

Comment 1 Sarah Prelutsky 2006-06-06 00:23:35 UTC
Created attachment 130588 [details]
Dmesg

Comment 2 Sarah Prelutsky 2006-06-06 00:30:54 UTC
I forgot to mention that this is not reproducible on 3 update 7 or 4 update 2 or
any other version of Red Hat Enterprise Linux.

Comment 3 Pete Zaitcev 2006-06-06 02:39:40 UTC
The good news is, the oops may be fixed in RHEL 4 U4 (see bug 182433),
or at worst in U5 (depending on its position in the priority queue).
But the bad news is, if U2 works and U3 does not, we'll need to look closer.
Please DO NOT DUP this into 182433, at least until we figure this out.

I have seen interrupt tables move between U2 and U3, wreaking havoc with USB.
In order to cut this branch of the fault tree, it would be very useful
to install and boot an old kernel from U2 (I think 2.6.9-15.EL). Then,
collect a dmesg so we can look for differences.


Comment 4 Sarah Prelutsky 2006-06-06 18:19:51 UTC
Created attachment 130630 [details]
Dmesg from 2.6.9-22.ELsmp kernel

Comment 5 Sarah Prelutsky 2006-06-06 18:21:21 UTC
OK, so I grabbed the kernel from 4u2 (2.6.9-22.ELsmp) and USB seems to be
functioning properly. Above is the requested dmesg.

Comment 6 Pete Zaitcev 2006-06-06 22:24:59 UTC
The dmesgs were not taken on the same computer, so they are less useful than
they otherwise would be. Here's just a little part of the diff -u:

 CPU: L2 Cache: 1024K (64 bytes/line)
-CPU 0(2) -> Node 0
-CPU0: Physical Processor ID: 0
-CPU0: Processor Core ID: 0
-CPU0: Initial APIC ID: 0
-CPU0: Dual Core AMD Opteron(tm) Processor 265 HE stepping 02
-per-CPU timeslice cutoff: 1024.04 usecs.
+AMD CPU0: Physical Processor ID: 0
+AMD CPU0: Processor Core ID: 0
+AMD CPU0: Initial APIC ID: 0
+CPU 0(1) -> Node 0
+CPU0: AMD Opteron(tm) Processor 248 HE stepping 01
+per-CPU timeslice cutoff: 1024.20 usecs.
 task migration cache decay timeout: 2 msecs.

I need one of these dmesgs to be taken again, on the same unit
as the other dmesg.


Comment 7 Pete Zaitcev 2006-06-06 22:30:12 UTC
I see why you took them, the motherboards seem to have very similar layout
at least as far as USB is concerned. So, the diff looks like this:

... everything the same ...
@@ -326,12 +327,16 @@
 md: Autodetecting RAID arrays.
 md: autorun ...
 md: ... autorun DONE.
+ohci_hcd 0000:00:03.1: wakeup
 usb 3-1: new full speed USB device using address 2
 hub 3-1:1.0: USB hub found
 hub 3-1:1.0: 4 ports detected
 usb 3-1.1: new low speed USB device using address 3
 input: USB HID v1.00 Keyboard [USBPS2] on usb-0000:00:03.1-1.1
 input: USB HID v1.00 Mouse [USBPS2] on usb-0000:00:03.1-1.1
+usb 3-1.2: new low speed USB device using address 4
+input: USB HID v1.10 Keyboard [CHESEN PS2 to USB Converter] on usb-0000:00:03.1-1.2
+input: USB HID v1.10 Mouse [CHESEN PS2 to USB Converter] on usb-0000:00:03.1-1.2
 ACPI: Power Button (FF) [PWRF]
 EXT3 FS on dm-0, internal journal
 device-mapper: dm-multipath version 1.0.4 loaded
@@ -339,12 +344,45 @@
 EXT3 FS on sda1, internal journal
 EXT3-fs: mounted filesystem with ordered data mode.
 Adding 2031608k swap on /dev/VolGroup00/LogVol01.  Priority:-1 extents:1
+ohci_hcd 0000:00:03.1: OHCI Unrecoverable Error, disabled
+bad: scheduling while atomic!
.....

Which is fine, but on the other hand, there weren't any changes to the
OHCI driver that could create this problem. It cannot possibly regress.

So... Maybe the mouse not present on the box with -22.EL allows it
to work, or the BIOS is somehow different. Which is why doing it on
the same box with same peripherals is important.
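For anyone repeating this comparison without diff -u at hand, here is a minimal sketch of the same idea in Python. It is only an illustration: the function name added_lines is made up for this example, and the sample lines are copied from the diff quoted above rather than read from the real attachments.

```python
import difflib


def added_lines(old_lines, new_lines):
    """Return the lines that the newer capture has and the older one lacks."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" opcodes both contribute lines on the new side.
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return added


# Sample data only: a few lines from the diff quoted in this comment.
WORKING_DMESG = [
    "md: ... autorun DONE.",
    "usb 3-1: new full speed USB device using address 2",
    "Adding 2031608k swap on /dev/VolGroup00/LogVol01.  Priority:-1 extents:1",
]
FAILING_DMESG = [
    "md: ... autorun DONE.",
    "ohci_hcd 0000:00:03.1: wakeup",
    "usb 3-1: new full speed USB device using address 2",
    "Adding 2031608k swap on /dev/VolGroup00/LogVol01.  Priority:-1 extents:1",
    "ohci_hcd 0000:00:03.1: OHCI Unrecoverable Error, disabled",
    "bad: scheduling while atomic!",
]

for line in added_lines(WORKING_DMESG, FAILING_DMESG):
    print("+" + line)
```

As with diff -u, the result is only meaningful when both captures come from the same box with the same peripherals attached.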


Comment 8 Sarah Prelutsky 2006-06-07 19:33:07 UTC
Things I forgot to mention: this is a blade unit that uses USB for the mouse,
keyboard and KVM functions. Also, we are constantly switching from one blade to
another, so the USB disconnects and connects frequently.

I also do not have access to bug 182433.

My co-workers have just mentioned that they have seen this with SLES 9 SP3. In
that case it was weird because they only saw it with REV E procs and not with
REV C/G. The systems were identical in every other way. Possibly a weird race
condition?

We have also reproduced this once in fedora core 5.

We reloaded the system, so it took some time to reproduce this bug. That's why we
were unable to get back to you right away.

What eventually triggered it is that we put some load on the system by typing
dd if=/dev/urandom of=/dev/null and then switched the KVM to use another blade.
We tried these same steps with the 2.6.9-22 kernel but have not been able to
reproduce the problem yet.

Also, we don't seem to be able to reproduce this on every AMD blade we've tried.

Again, if you need any data I will be quick to provide it. Thanks for your help on this.
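The load half of the reproduction above can be sketched in Python as below. This is a rough sketch under stated assumptions, not a tested reproducer: generate_load is a bounded stand-in for dd if=/dev/urandom of=/dev/null, find_usb_failures and its name are made up for this example, the two signature strings are the ones visible in the dmesg excerpt in comment 7, and switching the KVM between blades still has to be done by hand.

```python
import os
import time

# Failure signatures copied from the dmesg excerpt quoted earlier in this bug.
FAILURE_SIGNATURES = (
    "OHCI Unrecoverable Error, disabled",
    "bad: scheduling while atomic!",
)


def generate_load(seconds, chunk_size=1 << 20):
    """Copy /dev/urandom to the null device for roughly `seconds` seconds.

    Returns the number of bytes copied. Assumes a Linux system.
    """
    deadline = time.monotonic() + seconds
    copied = 0
    with open("/dev/urandom", "rb") as source, open(os.devnull, "wb") as sink:
        while True:
            copied += sink.write(source.read(chunk_size))
            if time.monotonic() >= deadline:
                break
    return copied


def find_usb_failures(dmesg_text):
    """Return the dmesg lines that contain one of the known signatures."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(signature in line for signature in FAILURE_SIGNATURES)
    ]
```

One possible use: call generate_load(60) while switching the KVM to another blade, then pass the text of a fresh dmesg capture to find_usb_failures. An empty list only means these two strings were absent, not that the controller is healthy.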

Comment 9 Sarah Prelutsky 2006-06-07 20:08:04 UTC
Created attachment 130702 [details]
2.6.9-34 Dmesg

Comment 10 Sarah Prelutsky 2006-06-07 20:08:48 UTC
Created attachment 130704 [details]
2.6.9-22 Dmesg

Comment 11 Pete Zaitcev 2006-06-07 23:51:09 UTC
The comment #10 contains the same file as comment #9. So I need a dmesg
of -22 taken, on the same system (with the same load, e.g. the dd).

I understand that this can happen on a variety of systems and with
several OS versions. So, our hunt for the strict regression may ultimately
be misguided.

Frankly, I doubt that this is a regression, because OHCI driver and USB
stack didn't see changes which could account for the symptoms between
U2 and U3. However, it can be something else.

The symptom is that the OHCI hardware was unable to access its control data,
which should not happen. What can it be? The possibilities are wide:
 #1 is a hardware bug(s): can be PCI-2-HT bridge, OHCI, or anything
 #2 other driver mistakenly invalidating an IOMMU entry
 #3 a race in the platform core when it handles IOMMU
 #4 SMM BIOS accessing the chip and not restoring the state fully
 #5 .... other .....

If we nail a regression, it's going to be a quicker trip. Otherwise
it can be months, because this is something very hard to catch.
I'd need hardware access, most likely...


Comment 12 Sarah Prelutsky 2006-06-08 00:14:43 UTC
Created attachment 130715 [details]
2.6.9-22 Dmesg

My bad, for some reason I uploaded the same dmesg twice. Here is the correct one
for -22.

Comment 13 Jason Baron 2006-06-15 17:57:39 UTC
hmmm, we lost this comment:


------- Additional Comments From sprelutsky  2006-06-09
12:53 EST -------
Created an attachment (id=130832)
 --> (https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=130832&action=view)
Dmesg 2.6.9-34 usb-handoff

Here is the dmesg for the usb-handoff option.

I will go ahead and put this blade aside so it is always available for us to
test on. I do have other blades that have this same issue so if you would like
the dmesg from those machines please let me know.


Comment 14 Pete Zaitcev 2006-06-15 18:11:46 UTC
Right, thanks, Jason. Fortunately, I downloaded and examined the attachment
before the crash, when Sarah attached it.


Comment 15 Pete Zaitcev 2006-06-20 05:13:49 UTC
I looked at the issue head-on, but wasn't successful. So we have to bisect,
I cannot help it.

Test kernels are uploaded to http://people.redhat.com/zaitcev/ftp/194190/

Bisection is done in this way.
 1. Save the list, don't trust your memory! (The list is below.)
    We have two tested kernels on the list, -22.EL and -34.EL.
    Mark the first one "ok", mark the last one "bad".
 2. Pick a kernel to test in the middle of the distance between
    "ok" and "bad". At the first step, it is 2.6.9-22.19.EL.
 3. If that kernel works, mark it ok in the list, and move to the
    higher half. At the first step, that would be 2.6.9-24.EL.
    If it does not work, mark it bad, and move to the lower half.
    At the first step, that would be 2.6.9-22.9.EL.
 4. If something is present between "ok" and "bad", go to step 2.

Once you're done, you should end with two kernels next to each other,
one good, one bad.
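The steps above amount to a binary search over the ordered kernel list, which can be sketched as follows. This is a sketch under assumptions, not a tool used in this bug: build_kernel_list regenerates the 40 names from the list below, and is_good is a hypothetical callback standing in for "boot that kernel, run the reproduction, report whether USB survived". The demo at the bottom uses an invented breaking point purely to exercise the code.

```python
def build_kernel_list():
    """Rebuild the ordered list of test kernels given in this comment."""
    kernels = ["2.6.9-22.EL"]
    kernels += [f"2.6.9-22.{n}.EL" for n in range(1, 28)]   # -22.1 .. -22.27
    kernels += [f"2.6.9-{n}.EL" for n in range(23, 35)]     # -23 .. -34
    return kernels


def bisect_kernels(kernels, is_good):
    """Narrow down to an adjacent (last ok, first bad) pair.

    Assumes kernels[0] is known ok and kernels[-1] is known bad.
    Returns (last_ok, first_bad, kernels_tested_in_order).
    """
    ok, bad = 0, len(kernels) - 1
    tested = []
    while bad - ok > 1:            # step 4: something is still in between
        mid = (ok + bad) // 2      # step 2: pick the middle
        tested.append(kernels[mid])
        if is_good(kernels[mid]):  # step 3: mark it and move to one half
            ok = mid
        else:
            bad = mid
    return kernels[ok], kernels[bad], tested


# Demo with a made-up breaking point; this is NOT a finding from this bug.
KERNELS = build_kernel_list()
PRETEND_FIRST_BAD = KERNELS.index("2.6.9-22.13.EL")


def pretend_is_good(kernel):
    return KERNELS.index(kernel) < PRETEND_FIRST_BAD


last_ok, first_bad, tested = bisect_kernels(KERNELS, pretend_is_good)
print("last ok:", last_ok)
print("first bad:", first_bad)
print("kernels booted:", len(tested))
```

With 40 kernels the loop needs at most six boots, and its first pick is 2.6.9-22.19.EL, the same midpoint named in step 2. The result is only as trustworthy as is_good: a failure that is hard to trigger can make a bad kernel look ok and send the search into the wrong half.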

Please let me know how it goes soon, these kernels take a lot of space.

The list:
2.6.9-22.EL    Shipping
2.6.9-22.1.EL  4E-kernel
2.6.9-22.2.EL  4E-kernel
2.6.9-22.3.EL  4E-kernel
2.6.9-22.4.EL  4E-kernel
2.6.9-22.5.EL  4E-kernel
2.6.9-22.6.EL  4E-kernel
2.6.9-22.7.EL  4E-kernel
2.6.9-22.8.EL  4E-kernel
2.6.9-22.9.EL  4E-kernel
2.6.9-22.10.EL 4E-kernel
2.6.9-22.11.EL 4E-kernel
2.6.9-22.12.EL 4E-kernel
2.6.9-22.13.EL 4E-U3
2.6.9-22.14.EL 4E-kernel
2.6.9-22.15.EL 4E-kernel
2.6.9-22.16.EL 4E-kernel
2.6.9-22.17.EL 4E-U3
2.6.9-22.18.EL 4E-kernel
2.6.9-22.19.EL 4E-kernel
2.6.9-22.20.EL 4E-kernel
2.6.9-22.21.EL 4E-kernel
2.6.9-22.22.EL 4E-kernel
2.6.9-22.23.EL 4E-kernel
2.6.9-22.24.EL 4E-kernel
2.6.9-22.25.EL 4E-U3
2.6.9-22.26.EL 4E-kernel
2.6.9-22.27.EL 4E-kernel
2.6.9-23.EL    4E-kernel
2.6.9-24.EL    4E-U3
2.6.9-25.EL    4E-U3
2.6.9-26.EL    4E-U3
2.6.9-27.EL    4E-U3
2.6.9-28.EL    4E-U3
2.6.9-29.EL    4E-U3
2.6.9-30.EL    4E-U3
2.6.9-31.EL    4E-U3
2.6.9-32.EL    4E-U3
2.6.9-33.EL    4E-U3
2.6.9-34.EL    4E-U3


Comment 16 Sarah Prelutsky 2006-06-20 18:53:45 UTC
Created attachment 131212 [details]
2.6.9-22 Console Capture of call trace

So today, after a lot of effort, I was able to force the 2.6.9-22.ELsmp kernel
to fail. It took a lot of plugging and unplugging of USB and switching the
KVM while running dd if=/dev/urandom of=/dev/null.

This is very difficult to reproduce at times. In all honesty I do not think
this is a regression issue at all.

Comment 17 Pete Zaitcev 2006-06-21 02:35:23 UTC
OK, I'm deleting test kernels from the upload area.

I'm afraid it's not good news. I was concerned about this scenario
when I wrote comment #11 on 6/07. Observe that EHCI hardware dies too,
only its driver can restart cleanly. It may be the same reason.

Please attach the output of dmidecode and lspci -v. I'll pass them
over to Jim Paradis and Andi Kleen to ask if they know of DMA problems
with the specific hardware.


Comment 18 Sarah Prelutsky 2006-06-21 17:19:42 UTC
Created attachment 131295 [details]
Dmidecode 2.6.9-34

Comment 19 Sarah Prelutsky 2006-06-21 17:20:28 UTC
Created attachment 131296 [details]
lspci -v 2.6.9-34

Comment 20 Sarah Prelutsky 2006-06-21 21:58:50 UTC
Pete--

I have gotten the OK to get you guys hardware for testing. Would you like to do
this?

--Sarah

Comment 21 Pete Zaitcev 2006-06-21 22:15:31 UTC
I reckon that getting the hardware is unavoidable if we want this resolved.
Since I am remote in California, we're going to have this discussed
and perhaps someone else would take the bug.

Comment 22 Pete Zaitcev 2006-06-21 22:24:35 UTC
I have turned this over with Andi Kleen; the result is negative.
HT1000 is not known for issues of this kind. We may be first though,
if something on the board is not connected right...


Comment 23 Sarah Prelutsky 2006-06-21 22:55:08 UTC
(In reply to comment #21)
> I reckon that getting the hardware is unavoidable if we want this resolved.
> Since I am remote in California, we're going to have this discussed
> and perhaps someone else would take the bug.

California where? I am located in California also.

--Sarah

Comment 24 Sarah Prelutsky 2006-06-22 19:13:09 UTC
Created attachment 131375 [details]
FC5 kernel panic

I have also gotten this to fail on FC5. Attached is the dmesg if this helps
any.

Comment 25 Sarah Prelutsky 2006-07-19 21:37:50 UTC
Today we found something pretty interesting involving this bug on our AMD
Opteron servers.

During the previous tests we had "PowerNow" enabled in the BIOS, and at some
point or another we could lose USB functionality on all AMD blades in the chassis.

After disabling PowerNow in the BIOS and re-running the previous test, USB no
longer fails!

Comment 26 Pete Zaitcev 2006-07-20 23:37:50 UTC
That's a relief. If workaround is available, we'll have severity lowered.
But still... I need to backport the HC restart code from the current 2.6.


Comment 27 Jiri Pallich 2012-06-20 16:11:08 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

