Description of problem:
General protection exception during input device removal.

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.5.ELsmp

How reproducible:
Surprise removal of a USB root hub or of USB input devices triggers this problem. It is infrequent, occurring perhaps once per several hundred device removals. It has only been seen on systems with 8 CPUs.

Steps to Reproduce:
1. Induce a moderate (disk I/O) workload.
2. Perform surprise device removals.

Actual results:
Kernel panic occurs.

Expected results:
No panic.

Additional info:
Two memory dumps from this problem are available. Analysis of the dumps will be attached. In summary, the problem seems to occur because there is no locking or reference counting to prevent input_devices_read from referencing structures while they are being deallocated by the unregistration of input devices.
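For illustration only, here is a minimal userspace sketch of the kind of protection the summary says is missing: the reader and the unregister path take the same lock, so the reader can never walk a node that is being freed. This is not the kernel code or the eventual patch. The names toy_dev, toy_register, toy_unregister, toy_devices_read and dev_list_lock are all hypothetical, and a pthread mutex stands in for whatever primitive the kernel fix uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a registered input device. */
struct toy_dev {
    char name[32];
    struct toy_dev *next;
};

static struct toy_dev *dev_list;  /* head of the device list */
static pthread_mutex_t dev_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Register: link a new device into the list while holding the lock. */
int toy_register(const char *name)
{
    struct toy_dev *d;

    assert(name != NULL);
    d = calloc(1, sizeof(*d));
    if (!d)
        return -1;
    snprintf(d->name, sizeof(d->name), "%s", name);

    pthread_mutex_lock(&dev_list_lock);
    d->next = dev_list;
    dev_list = d;
    pthread_mutex_unlock(&dev_list_lock);
    return 0;
}

/* Unregister: unlink and free under the lock, so a concurrent reader
 * cannot be left holding a pointer to freed memory. */
int toy_unregister(const char *name)
{
    int found = -1;

    assert(name != NULL);
    pthread_mutex_lock(&dev_list_lock);
    for (struct toy_dev **pp = &dev_list; *pp; pp = &(*pp)->next) {
        if (strcmp((*pp)->name, name) == 0) {
            struct toy_dev *victim = *pp;
            *pp = victim->next;
            free(victim);
            found = 0;
            break;
        }
    }
    pthread_mutex_unlock(&dev_list_lock);
    return found;
}

/* Reader: walk the list under the same lock. Returns the number of
 * devices written to buf. */
int toy_devices_read(char *buf, size_t size)
{
    int count = 0;
    size_t used = 0;

    assert(buf != NULL);
    if (size == 0)
        return 0;
    buf[0] = '\0';

    pthread_mutex_lock(&dev_list_lock);
    for (struct toy_dev *d = dev_list; d; d = d->next) {
        int n = snprintf(buf + used, size - used, "N: Name=\"%s\"\n", d->name);
        if (n < 0 || (size_t)n >= size - used)
            break;  /* output buffer full, stop listing */
        used += (size_t)n;
        count++;
    }
    pthread_mutex_unlock(&dev_list_lock);
    return count;
}
```

Without the lock in toy_devices_read, a thread in toy_unregister could free a node between the reader loading d and dereferencing d->next, which is the shape of the race described above.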
Created attachment 323846 [details] Crash analysis from USB device re-route This is the analysis of a panic on 2008-11-12. The trigger for this panic was an AC switch. This operation moves the external USB devices from one root hub to another. Apparently the panic occurred during unregistration of the KB and mouse, before they were re-registered on the other root hub.
Created attachment 323848 [details] Crash analysis from USB Root hub removal This is the analysis of a panic on 2008-11-16. The active IO subsystem was broken. As a result, the PCI devices in that chassis were removed and the USB devices were switched over to the control of the other IO chassis. Apparently the panic occurred due to unregistration of the KB and mouse. However, the memory image shows them re-registered on the surviving USB root hub (PCI device 0000:0b:1d.0).
Robert - are you saying this is a regression?
I do not have a conclusion whether this is a regression. Stratus hit bug 453507 early in this test cycle. To eliminate that, we have moved to the latest errata kernel for RHEL4.7 since that is what our customers would be running. Consequently we do not have enough test time on the kernel released with RHEL4.7 to determine whether this is a regression in the errata kernel. Given that we have run similar tests (but on slower processors) with RHEL4.6 it seems this problem may have been introduced in RHEL4.7. But the problem may have already been in the RHEL4.6 code base and the faster processors may be necessary to open the window enough to get hit by a race condition.
I don't believe this is a regression; rather, it's a latent issue that only shows up when you (a) have a lot of CPUs and (b) are doing very fast surprise device removals while also reading /proc/bus/input/devices. This bug is similar to the RHEL5 Bug 468915. Note that the input.c code is very different between the two kernels, though, so a different fix will be required for this one. The underlying issue remains the same: in both RHEL4 and RHEL5 kernels there is insufficient locking of the input device lists. This bug is a bit more difficult to reproduce than Bug 468915, though. I have not been able to reproduce it in the Red Hat lab using the 4-CPU system at my disposal. Bob has been able to reproduce it within hours in the Stratus lab using a faster 8-CPU system. I'm working on a patch.
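For anyone trying to reproduce this, the reader side of the race can be approximated with a small loop that keeps re-reading /proc/bus/input/devices while the removals are performed. The following is only a sketch: hammer_reads is a hypothetical helper, not part of any existing test suite, and the iteration count is arbitrary.

```c
#include <assert.h>
#include <stdio.h>

/* Read the file at path from start to EOF, iterations times.
 * Returns the total number of bytes read, or -1 if the file
 * cannot be opened. */
long hammer_reads(const char *path, int iterations)
{
    char buf[4096];
    long total = 0;

    assert(path != NULL);
    for (int i = 0; i < iterations; i++) {
        FILE *f = fopen(path, "r");
        size_t n;

        if (!f)
            return -1;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            total += (long)n;
        fclose(f);
    }
    return total;
}
```

A call such as hammer_reads("/proc/bus/input/devices", 1000000), run on several CPUs at once during the surprise removals, should widen the window. Whether that is enough to hit the panic on a given system is not something this sketch can promise.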
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 78.26.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Hey Jim - I think you can start testing this test kernel on top of 4.7...
Created attachment 333534 [details] Patch to add mutex locking to the dev list
The attachment in Comment #10 needs to be in a new bugzilla, since the bug is already ON_QA. Will create one now.
~~ Attention Partners! Snap 1 Released ~~ RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. There should be a fix present that addresses this bug. NOTE: there is only a short time left to test. Please test and report back results on this bug at your earliest convenience. If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager. If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above and leave a comment with the details of your test results. Include which arches were tested, the package version, and any applicable logs. - Red Hat QE Partner Management
What's the status of this fix? Did a new bug get created for the patch provided in comment #10?
Yup - see Blocks bug section above.
Could you please provide a few details regarding Stratus's verification?
This patch caused a regression, see bug 454479.
Peter - there was a follow-on patch to this in bug 491940 - are you saying this specific patch in this bz caused a regression?
The patch from comment #10 caused a deadlock.
...which was committed and tracked in bug 491940. Correct. The patch posted in Comment #10 wasn't committed in this bz. The blocker should be on bug 491940.
Comment on attachment 333534 [details] Patch to add mutex locking to the dev list Obsoleting since this follow-on patch was posted in another BZ (bug 491940).
~~ Attention! Snap 4 Released ~~ RHEL 4.8 Snapshot 4 has been released on partners.redhat.com. There should be a fix present that addresses this bug. NOTE: there is only a short time left to test. Please test and report back results on this bug ASAP. The latest kernel build can be obtained here: http://people.redhat.com/vgoyal/rhel4/ If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager. If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above and leave a comment with the details of your test results. Include which arches were tested, the package version, and any applicable logs.
*** Bug 491940 has been marked as a duplicate of this bug. ***
Deferring to RHEL 4.9 due to concerns by kernel management - confirmed by Stratus.
Incremental patch posted on Wed, 15 Apr 2009 16:12:31 -0400 (EDT):

--- linux-2.6.9/drivers/input/input.c.orig	2009-04-15 15:39:20.000000000 -0400
+++ linux-2.6.9/drivers/input/input.c	2009-04-15 15:47:00.000000000 -0400
@@ -492,7 +492,6 @@ void input_unregister_device(struct inpu
 	input_call_hotplug("remove", dev);
 	mutex_lock(&input_mutex);
 #endif
-	mutex_lock(&input_mutex);
 	list_del_init(&dev->node);
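As read from the hunk above, when the #ifdef branch is compiled in, the unregister path appears to take input_mutex twice in a row before the incremental patch, which would match the deadlock reported earlier. The sketch below is a toy, single-threaded model of that pattern only. The toy_mutex type and the unregister_before and unregister_after functions are hypothetical and do not reflect the kernel's mutex implementation. Where a real non-recursive mutex would block forever, the toy lock returns -1 so the effect can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a non-recursive mutex (illustration only). */
struct toy_mutex {
    bool held;
};

/* Returns 0 on success, or -1 in the case where a real non-recursive
 * mutex would block forever because it is already held. */
int toy_lock(struct toy_mutex *m)
{
    if (m->held)
        return -1;
    m->held = true;
    return 0;
}

void toy_unlock(struct toy_mutex *m)
{
    assert(m->held);
    m->held = false;
}

/* Shape of the path before the incremental patch: the lock taken
 * inside the #ifdef block is followed by a second, duplicate lock. */
int unregister_before(struct toy_mutex *m)
{
    if (toy_lock(m) != 0)   /* lock taken after the hotplug call */
        return -1;
    if (toy_lock(m) != 0)   /* duplicate lock: self-deadlock */
        return -1;
    /* list_del_init(&dev->node) would run here */
    toy_unlock(m);
    return 0;
}

/* Shape of the path after the incremental patch: a single lock. */
int unregister_after(struct toy_mutex *m)
{
    if (toy_lock(m) != 0)
        return -1;
    /* list_del_init(&dev->node) would run here */
    toy_unlock(m);
    return 0;
}
```

With a fresh toy_mutex, unregister_before returns -1 (the point at which the real code would hang), while unregister_after returns 0 and leaves the mutex released.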
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html