Bug 472005 - [Stratus 4.8 bug REVERT] panic reading /proc/bus/input/devices during input device removal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: beta
Target Release: 4.8
Assignee: Jim Paradis
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 491940
Depends On:
Blocks: 501064
Reported: 2008-11-18 05:10 UTC by Robert N. Evans
Modified: 2009-06-20 08:11 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 491940 501064
Environment:
Last Closed: 2009-05-18 19:19:48 UTC
Target Upstream Version:
Embargoed:


Attachments
Crash analysis from USB device re-route (11.89 KB, text/plain)
2008-11-18 05:18 UTC, Robert N. Evans
Crash analysis from USB Root hub removal (19.36 KB, text/plain)
2008-11-18 05:27 UTC, Robert N. Evans
Patch to add mutex locking to the dev list (3.35 KB, patch)
2009-02-27 20:46 UTC, Jim Paradis


Links
System: Red Hat Product Errata
ID: RHSA-2009:1024
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update
Last Updated: 2009-05-18 14:57:26 UTC

Description Robert N. Evans 2008-11-18 05:10:09 UTC
Description of problem:
General protection exception during input device removal.


Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.5.ELsmp

How reproducible:
Surprise removal of a USB root hub or of USB input devices triggers this problem.  It is infrequent, occurring perhaps once per several hundred device removals, and has only been seen on systems with 8 CPUs.

Steps to Reproduce:
1. Induce moderate (disk-IO) workload.
2. Perform surprise device removals.

Actual results:
Kernel panic occurs

Expected results:
No panic

Additional info:
Two memory dumps from this problem are available.  Analysis of the dumps will be attached.  In summary, the problem seems to occur because there is no locking or reference counting to protect input_devices_read from referencing structures concurrently with their deallocation by unregistering input devices.
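
To illustrate the missing protection described above, here is a minimal user-space sketch in C that uses a pthread mutex as a stand-in for kernel locking.  The names demo_dev, demo_register, demo_unregister, and demo_devices_read are hypothetical and are not taken from input.c.  The only point is that the reader and the unregister path take the same lock, so an entry cannot be freed while it is being formatted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space stand-in for an input device list entry. */
struct demo_dev {
    char name[32];
    struct demo_dev *next;
};

static struct demo_dev *dev_list;
static pthread_mutex_t dev_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Add a device to the head of the list under the lock. */
int demo_register(const char *name)
{
    struct demo_dev *dev;

    assert(name != NULL);
    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return -1;
    snprintf(dev->name, sizeof(dev->name), "%s", name);

    pthread_mutex_lock(&dev_list_lock);
    dev->next = dev_list;
    dev_list = dev;
    pthread_mutex_unlock(&dev_list_lock);
    return 0;
}

/* Unlink under the lock, then free once no reader can reach the entry. */
int demo_unregister(const char *name)
{
    struct demo_dev **pp, *victim = NULL;

    pthread_mutex_lock(&dev_list_lock);
    for (pp = &dev_list; *pp; pp = &(*pp)->next) {
        if (strcmp((*pp)->name, name) == 0) {
            victim = *pp;
            *pp = victim->next;
            break;
        }
    }
    pthread_mutex_unlock(&dev_list_lock);

    if (!victim)
        return -1;
    free(victim);
    return 0;
}

/* Format every device name into buf while holding the same lock.
 * Returns the number of devices written. */
int demo_devices_read(char *buf, size_t len)
{
    struct demo_dev *dev;
    size_t used = 0;
    int count = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    pthread_mutex_lock(&dev_list_lock);
    for (dev = dev_list; dev; dev = dev->next) {
        int n = snprintf(buf + used, len - used, "N: Name=\"%s\"\n", dev->name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partially written entry */
            break;
        }
        used += (size_t)n;
        count++;
    }
    pthread_mutex_unlock(&dev_list_lock);
    return count;
}
```

Without the lock in demo_devices_read, a concurrent demo_unregister could free an entry between the reader loading dev and dereferencing dev->name, which is the kind of window the dump analysis suggests.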

Comment 1 Robert N. Evans 2008-11-18 05:18:35 UTC
Created attachment 323846
Crash analysis from USB device re-route

This is the analysis of a panic on 2008-11-12.  The trigger for this panic was an AC switch.  This operation moves the external USB devices from one root hub to another.  Apparently the panic occurred during unregistration of the KB and mouse, before they were re-registered on the other root hub.

Comment 2 Robert N. Evans 2008-11-18 05:27:42 UTC
Created attachment 323848
Crash analysis from USB Root hub removal

This is the analysis of a panic on 2008-11-16.  The active IO subsystem was broken.  As a result, the PCI devices in that chassis are removed and the USB devices are switched over to the control of the other IO chassis.  Apparently the panic occurred due to un-registration of the KB and mouse; however, the memory image shows them re-registered on the surviving USB root hub (PCI device 0000:0b:1d.0).

Comment 3 Andrius Benokraitis 2008-11-18 06:38:04 UTC
Robert - are you saying this is a regression?

Comment 4 Robert N. Evans 2008-11-18 14:00:58 UTC
I do not have a conclusion whether this is a regression.

Stratus hit bug 453507 early in this test cycle.  To eliminate that, we have moved to the latest errata kernel for RHEL4.7 since that is what our customers would be running.  Consequently we do not have enough test time on the kernel released with RHEL4.7 to determine whether this is a regression in the errata kernel.

Given that we ran similar tests (on slower processors) with RHEL4.6 without seeing this panic, it seems this problem may have been introduced in RHEL4.7.  But the problem may already have been in the RHEL4.6 code base, and the faster processors may be necessary to open the window enough to get hit by a race condition.

Comment 5 Jim Paradis 2008-11-21 23:11:24 UTC
I don't believe this is a regression; rather, it's a latent issue that only shows up when you (a) have a lot of CPUs and (b) are doing very fast surprise device removals while also reading /proc/bus/input/devices.
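
The read side of such a reproducer can be sketched in C as below.  This is an assumption about how one might exercise the race, not the test that was actually run; read_repeatedly is a hypothetical helper that opens and drains a file in a loop.  On an affected system it would be pointed at /proc/bus/input/devices while devices are being removed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open, fully read, and close `path` `iterations`
 * times.  Returns the total number of bytes read, or -1 if an open fails. */
long read_repeatedly(const char *path, int iterations)
{
    char buf[4096];
    long total = 0;
    int i;

    assert(path != NULL);
    for (i = 0; i < iterations; i++) {
        FILE *f = fopen(path, "r");
        size_t n;

        if (!f)
            return -1;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            total += (long)n;
        fclose(f);
    }
    return total;
}
```

For example, read_repeatedly("/proc/bus/input/devices", 100000) could be run on several CPUs at once to widen the window.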

This bug is similar to the RHEL5 Bug 468915.  Note that the input.c code is very different between the two kernels, though, so a different fix will be required for this one.  The underlying issue remains the same: in both RHEL4 and RHEL5 kernels there is insufficient locking of the input device lists.

This bug is a bit more difficult to reproduce than Bug 468915, though.  I have not been able to reproduce it in the Red Hat lab using the 4-CPU system at my disposal.  Bob has been able to reproduce it within hours in the Stratus lab using a faster 8-CPU system.

I'm working on a patch.

Comment 6 RHEL Program Management 2008-12-17 14:28:29 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Vivek Goyal 2009-01-09 13:56:19 UTC
Committed in 78.26.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 8 Andrius Benokraitis 2009-01-20 19:06:41 UTC
Hey Jim - I think you can start testing this test kernel on top of 4.7...

Comment 10 Jim Paradis 2009-02-27 20:46:33 UTC
Created attachment 333534
Patch to add mutex locking to the dev list

Comment 11 Andrius Benokraitis 2009-03-24 18:08:16 UTC
The attachment in Comment #10 needs to be in a new bugzilla, since the bug is already ON_QA. Will create one now.

Comment 12 Chris Ward 2009-03-27 14:20:21 UTC
~~ Attention Partners! Snap 1 Released ~~
RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should
be a fix present that addresses this bug. NOTE: there is only a short time
left to test, so please test and report back results on this bug
at your earliest convenience.

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the
Verified field above. Please leave a comment with your test result details.
Include which arches were tested, the package version, and any applicable logs.

 - Red Hat QE Partner Management

Comment 13 Chris Ward 2009-04-14 07:42:31 UTC
What's the status of this fix? Did a new bug get created for the patch provided in comment #10?

Comment 14 Andrius Benokraitis 2009-04-14 15:03:21 UTC
Yup - see Blocks bug section above.

Comment 15 Chris Ward 2009-04-15 09:12:23 UTC
Could you please provide a few details regarding Stratus's verification?

Comment 16 Peter Martuccelli 2009-04-15 17:11:16 UTC
This patch caused a regression, see bug 454479.

Comment 17 Andrius Benokraitis 2009-04-15 17:43:01 UTC
Peter - there was a follow-on patch to this in bug 491940 - are you saying this specific patch in this bz caused a regression?

Comment 18 Peter Martuccelli 2009-04-15 19:03:44 UTC
The patch from comment #10 caused a deadlock.

Comment 19 Andrius Benokraitis 2009-04-15 22:20:28 UTC
...which was committed and tracked in bug 491940. Correct. The patch posted in Comment #10 wasn't committed in this bz. The blocker should be on bug 491940.

Comment 20 Andrius Benokraitis 2009-04-15 22:21:34 UTC
Comment on attachment 333534
Patch to add mutex locking to the dev list

Obsoleting since this follow-on patch was posted in another BZ (bug 491940).

Comment 22 Chris Ward 2009-04-16 16:08:12 UTC
~~ Attention! Snap 4 Released ~~
RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There should
be a fix present that addresses this bug. NOTE: there is only a short time
left to test, so please test and report back results on this bug ASAP.

The latest kernel build can be obtained here:
http://people.redhat.com/vgoyal/rhel4/

If you encounter any issues, please set the bug back to the ASSIGNED state and
describe the issues you encountered. If you have found a NEW bug, clone this
bug and describe the issues you encountered. Further questions can be
directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the
Verified field above. Please leave a comment with your test result details.
Include which arches were tested, the package version, and any applicable logs.

Comment 23 Andrius Benokraitis 2009-04-16 19:42:52 UTC
*** Bug 491940 has been marked as a duplicate of this bug. ***

Comment 24 Andrius Benokraitis 2009-04-16 19:45:49 UTC
Deferring to RHEL 4.9 due to concerns by kernel management - confirmed by Stratus.

Comment 25 Linda Wang 2009-04-16 21:25:11 UTC
incremental patch posted on Wed, 15 Apr 2009 16:12:31 -0400 (EDT)

--- linux-2.6.9/drivers/input/input.c.orig      2009-04-15 15:39:20.000000000 -0400
+++ linux-2.6.9/drivers/input/input.c   2009-04-15 15:47:00.000000000 -0400
@@ -492,7 +492,6 @@ void input_unregister_device(struct inpu
        input_call_hotplug("remove", dev);
        mutex_lock(&input_mutex);
 #endif
-       mutex_lock(&input_mutex);

        list_del_init(&dev->node);
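
The line removed above appears to be a second mutex_lock() on a mutex that the same path has already taken when the conditional block before it is compiled in.  Kernel mutexes are not recursive, so that pattern blocks forever, which is consistent with the deadlock reported in comment #18.  A user-space sketch of the same mistake follows, assuming a POSIX error-checking mutex, which returns EDEADLK instead of hanging; demo_double_lock is a hypothetical name.

```c
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Lock the same mutex twice from one thread and return the result of
 * the second attempt.  With an error-checking mutex this is EDEADLK;
 * with a default mutex the thread would typically hang instead. */
int demo_double_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);        /* first lock succeeds */
    rc = pthread_mutex_lock(&m);   /* second lock by the same thread */

    pthread_mutex_unlock(&m);      /* release the one lock actually held */
    pthread_mutex_destroy(&m);
    return rc;
}
```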

Comment 28 Andrius Benokraitis 2009-04-22 17:26:08 UTC
*** Bug 491940 has been marked as a duplicate of this bug. ***

Comment 32 errata-xmlrpc 2009-05-18 19:19:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

