Bug 468915

Summary: [Stratus/NEC 5.3 bug] System can crash when removing input device
Product: Red Hat Enterprise Linux 5
Reporter: Jim Paradis <jparadis>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: high
Docs Contact:
Priority: high
Version: 5.2
CC: andriusb, chas.horvath, jparadis, mgahagan, peterm, robert.evans, syeghiay
Target Milestone: rc
Keywords: OtherQA
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 19:57:27 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 432518
Attachments:
Description: This patch fixes
Flags: none

Description Jim Paradis 2008-10-28 19:40:25 UTC
Description of problem:

Under certain circumstances, unplugging an input device can cause the system to crash.  In particular, reading /proc/bus/input/devices while a device removal is in progress can cause a NULL pointer dereference.

This happens because there is no locking around the list of input devices (input_dev_list) that is traversed while reading /proc/bus/input/devices.  The code even admits as much:

static void *input_devices_seq_start(struct seq_file *seq, loff_t *pos)
{
        /* acquire lock here ... Yes, we do need locking, I knowi, I know... */

        return list_get_nth_element(&input_dev_list, pos);
}

If a device is removed while such a traversal is in progress, one could hit a dangling pointer.
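
The locking that is missing here can be illustrated outside the kernel. The following is a minimal userspace sketch, not the kernel patch itself: a pthread mutex stands in for whatever lock the input core uses, and every `demo_*` name is hypothetical. It shows the general pattern of holding one lock both while the list is modified and for the full duration of a traversal, so a reader can never follow a pointer to a node that is being removed.

```c
/* Userspace illustration only. All demo_* names are hypothetical and
 * pthread_mutex_t merely stands in for a kernel lock. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct demo_dev {
    struct demo_dev *next;
    char name[32];
};

static struct demo_dev *demo_dev_list;   /* plays the role of input_dev_list */
static pthread_mutex_t demo_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Register a device: the list is only modified with the lock held. */
static int demo_register(const char *name)
{
    struct demo_dev *dev;

    assert(name != NULL);
    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return -1;
    strncpy(dev->name, name, sizeof(dev->name) - 1);

    pthread_mutex_lock(&demo_list_lock);
    dev->next = demo_dev_list;
    demo_dev_list = dev;
    pthread_mutex_unlock(&demo_list_lock);
    return 0;
}

/* Unregister: unlink under the lock, free only after the node is unreachable. */
static int demo_unregister(const char *name)
{
    struct demo_dev **pp, *victim = NULL;
    int found;

    pthread_mutex_lock(&demo_list_lock);
    for (pp = &demo_dev_list; *pp; pp = &(*pp)->next) {
        if (strcmp((*pp)->name, name) == 0) {
            victim = *pp;
            *pp = victim->next;
            break;
        }
    }
    pthread_mutex_unlock(&demo_list_lock);

    found = (victim != NULL);
    free(victim);
    return found ? 0 : -1;
}

/* Reader: the whole traversal runs under the lock, so no node can be
 * unlinked and freed while it is being visited. */
static size_t demo_count_devices(void)
{
    size_t n = 0;
    const struct demo_dev *d;

    pthread_mutex_lock(&demo_list_lock);
    for (d = demo_dev_list; d; d = d->next)
        n++;
    pthread_mutex_unlock(&demo_list_lock);
    return n;
}
```

Without the lock in `demo_count_devices()`, a concurrent `demo_unregister()` could free a node between the reader loading `d` and dereferencing `d->next`, which is the userspace analogue of the dangling pointer described above.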


Version-Release number of selected component (if applicable):

This bug is present in all 5.2 kernels as of the date of this report.

How reproducible:

Easy, once you know how.

Steps to Reproduce:

We managed to reproduce this issue on a Stratus ftServer.  It is particularly easy to reproduce on this system because the act of switching active consoles from one CRU to another removes and re-adds the input devices in very rapid succession.

1. Install RHEL5.2 and Stratus ftSSS on an ftServer.
2. In one session, start a tight loop that runs cat on /proc/bus/input/devices.
3. In another session, start a tight loop that runs /opt/ft/bin/ftsmaint acSwitch.
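
For anyone who prefers a compiled reader to a shell loop for step 2, the sketch below shows one possible shape. It is an assumption-laden illustration, not the test program used for this report: the helper names `drain_once` and `hammer` are hypothetical, and the caller is expected to pass /proc/bus/input/devices as the path. Each pass re-opens the file, so every iteration starts a fresh traversal of the device list.

```c
/* Hypothetical reader helpers; names are illustrative only. */
#include <assert.h>
#include <stdio.h>

/* Read the whole file once and discard the contents.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
static long drain_once(const char *path)
{
    char buf[4096];
    long total = 0;
    size_t n;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        total += (long)n;
    fclose(f);
    return total;
}

/* Re-open and re-read the file `iterations` times.
 * Returns the number of passes in which the file could be opened. */
static long hammer(const char *path, long iterations)
{
    long ok = 0;
    long i;

    for (i = 0; i < iterations; i++)
        if (drain_once(path) >= 0)
            ok++;
    return ok;
}
```

Calling `hammer("/proc/bus/input/devices", n)` from a small `main()` with a large `n` would approximate the tight read loop in step 2; the device-switching loop in step 3 still has to run separately.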

Actual results:

Typically within five minutes, the system will crash as follows:

10-21 18:08:07 Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
10-21 18:08:07 <6>EVLOG: INFORMATION - 10 is now STATE_DUPLEX / REASON_PRIMARY
10-21 18:08:07  [<ffffffff80057a74>] kobject_get_path+0x81/0xc1
10-21 18:08:07 PGD 1564d4067 PUD 159dea067 PMD 0

Expected results:

The system should switch the active console back and forth every few seconds without incident.

Comment 1 Jim Paradis 2008-10-28 19:48:42 UTC
Dmitry Torokhov submitted a patch entitled "implement proper locking in input core" (git id: 8006479c9b75fb6594a7b746af3d7f1fbb68f18f) on 8/30/2007.  I cherry-picked the portions of this patch that deal with input_dev_list.  When I apply this patch I can run the above reproduction scenario for hours and the system stays up.

Comment 3 Jim Paradis 2008-11-04 19:36:11 UTC
Created attachment 322471 [details]
This patch fixes

Comment 9 Don Zickus 2008-11-12 16:37:48 UTC
in kernel-2.6.18-123.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Chris Ward 2008-11-18 18:14:16 UTC
~~ Snapshot 3 is now available ~~ 

Snapshot 3 is now available for Partner Testing, which should contain a fix that resolves this bug. ISOs are available as usual at ftp://partners.redhat.com. Your testing feedback is vital! Please let us know if you encounter any NEW issues (file a new bug) or if you have VERIFIED the fix is present and functioning as expected (add PartnerVerified Keyword).

Ping your Partner Manager with any additional questions. Thanks!

Comment 12 Chris Ward 2008-11-28 06:45:41 UTC
~~ Attention ~~ Snapshot 4 is now available for testing @ partners.redhat.com ~~

Partners, it is vital that we get your testing feedback on this important bug fix / feature request. If you are unable to test, please clearly indicate this in a comment to this bug or directly with your partner manager. If we do not receive your test feedback, this bug is at risk of being dropped from the release.

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field, along with a description of the test results. 

If you encounter a new bug, CLONE this bug and ask your Partner Manager to review it. We are no longer accepting new bugs into the release, bar critical regressions.

Comment 13 Jim Paradis 2008-12-03 19:27:54 UTC
I have verified this fix.  My old test case would crash the system within five iterations; with the latest kernel I have gone 1000 iterations without incident.

Comment 15 errata-xmlrpc 2009-01-20 19:57:27 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html