Bug 450786 - [Stratus 5.3 bug] kernel NULL pointer dereference at usbdev_read
[Stratus 5.3 bug] kernel NULL pointer dereference at usbdev_read
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
5.2
All Linux
high Severity high
: beta
: ---
Assigned To: Pete Zaitcev
Martin Jenner
: OtherQA
Depends On:
Blocks: 432518
  Show dependency treegraph
 
Reported: 2008-06-10 20:21 EDT by Robert N. Evans
Modified: 2009-06-20 00:15 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-01-20 15:05:37 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Experimental patch for 2.6.18-92.el5 with instrumentation (2.27 KB, patch)
2008-06-11 09:06 EDT, Robert N. Evans
no flags Details | Diff
Fix without instrumentation (2.22 KB, patch)
2008-06-12 12:59 EDT, Robert N. Evans
no flags Details | Diff
Alan Stern's second patch (3.67 KB, text/plain)
2008-06-12 14:12 EDT, Pete Zaitcev
no flags Details
Alan Stern's 2nd patch backported to 2.6.18 (2.92 KB, patch)
2008-06-12 16:48 EDT, Robert N. Evans
no flags Details | Diff
Previous patch with an extra check (3.55 KB, application/octet-stream)
2008-06-13 18:16 EDT, Pete Zaitcev
no flags Details
Same as before, but with locks (3.99 KB, patch)
2008-06-18 01:13 EDT, Pete Zaitcev
no flags Details | Diff
Snapshot (same, one fix) (8.67 KB, patch)
2008-06-25 15:50 EDT, Pete Zaitcev
no flags Details | Diff
libusb_test2.c (1.55 KB, text/plain)
2008-07-26 13:40 EDT, Pete Zaitcev
no flags Details
Test 3 - ad-hoc history (27.55 KB, patch)
2008-07-27 01:53 EDT, Pete Zaitcev
no flags Details | Diff
Crash dmesg and bt (6.04 KB, text/plain)
2008-08-07 11:29 EDT, Robert N. Evans
no flags Details
Test patch for i_private (602 bytes, patch)
2008-08-07 22:12 EDT, Pete Zaitcev
no flags Details | Diff

  None (edit)
Description Robert N. Evans 2008-06-10 20:21:21 EDT
Description of problem:
Kernel panic at usbdev_read+0xf5 due to NULL pointer dereference after removing
USB root hub device.  Crash dump examination shows a thread from command "pcscd"
active in sys_read hits this exception.

Version-Release number of selected component (if applicable):
2.6.18-92.el5

How reproducible:
This crash has occurred several times in the Stratus lab.  It should be
reproducable on generic hardware, for example via a PCcard USB adapter.

Steps to Reproduce:
Repeatedly remove and insert USB root hub device with pcscd running.
  
Actual results:
Kernel panic in thread of pcscd.

Expected results:
Error status returned to pcscd with no kernel panic.

Additional info:
This crash was discussed a few days ago on the linux-usb mailing list.  Alan
Stern provided a patch
 http://marc.info/?l=linux-usb&m=121269125717403&w=2
for which he claims:
 "That could easily be true.  Oddly enough, although there is locking to 
 synchronize open with disconnect, there is no flag to tell open that 
 the device is gone."

Stratus engineering is adapting that patch for 2.6.18 and testing it.

Analysis of crash follows:

crash> bt
PID: 8897   TASK: ffff81027dc170c0  CPU: 5   COMMAND: "pcscd"
 #0 [ffff81026ff8dbe0] crash_kexec at ffffffff800aa977
 #1 [ffff81026ff8dca0] __die at ffffffff800650af
 #2 [ffff81026ff8dce0] do_page_fault at ffffffff80066aa1
 #3 [ffff81026ff8ddd0] error_exit at ffffffff8005dde9
    [exception RIP: usbdev_read+0xf5]
    RIP: ffffffff801e6063  RSP: ffff81026ff8de88  RFLAGS: 00010297
    RAX: 0000000000000000  RBX: ffff81027de6ecc0  RCX: 0000000000000012
    RDX: 0000000000000000  RSI: 000000004159eef0  RDI: ffff81023476da28
    RBP: 0000000000000012   R8: 0000000000000002   R9: 0000000000000012
    R10: 0000000000000003  R11: 0000000000000202  R12: 0000000000000008
    R13: ffff81023476d800  R14: 0000000000000000  R15: ffff81026ff8df50
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81026ff8de80] usbdev_read at ffffffff801e5fa6
 #5 [ffff81026ff8df10] vfs_read at ffffffff8000b337
 #6 [ffff81026ff8df40] sys_read at ffffffff80011715
 #7 [ffff81026ff8df80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 0000003199a0cc8b  RSP: 000000004159ce00  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000000008  RSI: 000000004159eef0  RDI: 0000000000000009
    RBP: 0000000000000009   R8: 0000000000000000   R9: 000000001a2b76a2
    R10: 0000000000000003  R11: 0000000000000202  R12: 0000000000000000
    R13: 000000001a2b6670  R14: 000000004159eed0  R15: 000000014159eef0
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

The fault is at usbdev_read line 167:

122 static ssize_t usbdev_read(struct file *file, char __user *buf,
size_t nbytes, loff_t *ppos)
123 {
124         struct dev_state *ps = (struct dev_state*)file->private_data;
125         struct usb_device *dev = ps->dev;
...
164         pos = sizeof(struct usb_device_descriptor);
165         for (i = 0; nbytes && i < dev->descriptor.bNumConfigurations; i++) {
166                 struct usb_config_descriptor *config =
167                         (struct usb_config_descriptor*) dev->rawdescriptors[i];

"dev" is in r13. Part of it looks valid, but at least the beginning of
it has been overwritten:
crash> print ((struct usb_device *) 0xffff81023476d800)->rawdescriptors
$2 = (char **) 0x0
crash> print ((struct usb_device *)
0xffff81023476d800)->descriptor.bNumConfigurations
$3 = 0x1
crash> rd ffff81023476d800 100
ffff81023476d800:  723d4e4f49544341 45440065766f6d65   ACTION=remove.DE
ffff81023476d810:  642f3d4854415056 702f736563697665   VPATH=/devices/p
ffff81023476d820:  303a303030306963 303a303030302f30   ci0000:00/0000:0
ffff81023476d830:  302f302e32303a30 303a31303a303030   0:02.0/0000:01:0
ffff81023476d840:  303030302f302e30 302e30303a32303a   0.0/0000:02:00.0
ffff81023476d850:  33303a303030302f 30302f302e31303a   /0000:03:01.0/00
ffff81023476d860:  64313a62303a3030 595342555300302e   00:0b:1d.0.SUBSY
ffff81023476d870:  6963703d4d455453 3d4d554e51455300   STEM=pci.SEQNUM=
ffff81023476d880:  303a003132333435 303030302f302e32   54321.:02.0/0000
ffff81023476d890:  594850303a31303a 3d53554256454453   :01:0PHYSDEVBUS=
ffff81023476d8a0:  5f49435000696370 30433d5353414c43   pci.PCI_CLASS=C0
ffff81023476d8b0:  5f49435000303033 3a363830383d4449   300.PCI_ID=8086:
ffff81023476d8c0:  4943500038383632 5f5359534255535f   2688.PCI_SUBSYS_
ffff81023476d8d0:  3a333330313d4449 4943500030353338   ID=1033:8350.PCI
ffff81023476d8e0:  414e5f544f4c535f 3a303030303d454d   _SLOT_NAME=0000:
ffff81023476d8f0:  00302e64313a6230 5341494c41444f4d   0b:1d.0.MODALIAS
ffff81023476d900:  3030763a6963703d 3064363830383030   =pci:v00008086d0
ffff81023476d910:  7338383632303030 3330313030303076   0002688sv0000103
ffff81023476d920:  3830303030647333 7343306362303533   3sd00008350bc0Cs
ffff81023476d930:  0000303069333063 0000000000100100   c03i00..........
ffff81023476d940:  0000000000200200 0000000000000001   .. .............
ffff81023476d950:  0000000000000000 0000000000000001   ................
ffff81023476d960:  ffff81023476d960 ffff81023476d960   `.v4....`.v4....
ffff81023476d970:  ffff81022b06b070 0000000000000000   p..+............
ffff81023476d980:  0000000031627375 0000000000000000   usb1............
ffff81023476d990:  0000000200000000 ffff81023476d998   ..........v4....
ffff81023476d9a0:  ffff81023476d998 ffff81022b06b160   ..v4....`..+....
ffff81023476d9b0:  ffffffff80314420 0000000000000000    D1.............
ffff81023476d9c0:  0000000000000000 0000000000000001   ................
ffff81023476d9d0:  ffff81023476d9d0 ffff81023476d9d0   ..v4......v4....
ffff81023476d9e0:  0000000031627375 0000000000000000   usb1............
ffff81023476d9f0:  0000000000000000 ffffffff802965fa   .........e).....
ffff81023476da00:  0000000000000000 0000000000000080   ................
ffff81023476da10:  0000000000000000 ffffffff801ad638   ........8.......
ffff81023476da20:  0000000000000000 0000000000000000   ................
ffff81023476da30:  0000000000000001 ffff81023476da38   ........8.v4....
ffff81023476da40:  ffff81023476da38 ffffffff8031f100   8.v4......1.....
ffff81023476da50:  0000000000000000 ffffffff804c40d8   .........@L.....
ffff81023476da60:  0000000000000000 0000000000000000   ................
ffff81023476da70:  0000000300000000 0000000000000000   ................
ffff81023476da80:  0000000000000000 ffff81022b06b070   ........p..+....
ffff81023476da90:  ffff81023476da90 ffff81023476da90   ..v4......v4....
ffff81023476daa0:  ffff81022b06b060 0000000000000000   `..+............
ffff81023476dab0:  ffff81023476dab0 ffff81023476dab0   ..v4......v4....
ffff81023476dac0:  0000000000000000 ffff81023476dac8   ..........v4....
ffff81023476dad0:  ffff81023476dac8 0000000000000000   ..v4............
ffff81023476dae0:  0000000000000000 ffffffff801da45b   ........[.......
ffff81023476daf0:  4000000901100112 0203020600000000   .......@........
ffff81023476db00:  0000000000000101 0000000000000000   ................
ffff81023476db10:  0000000000000000 0000000000000000   ................

Stratus internal tracking number is: Bug 5707.
Comment 1 Robert N. Evans 2008-06-10 20:52:13 EDT
One issue with the patch proposed on linux-usb is that it changes struct
usb_device.  I think this is a KABI change.
Comment 2 Pete Zaitcev 2008-06-10 22:47:00 EDT
It may be the case, yes. I haven't started looking into workarounds for
the kABI though. I wanted a known-working solution, because the patch
Alan Stern posted initially looked a bit odd.
Comment 3 Robert N. Evans 2008-06-11 09:06:42 EDT
Created attachment 308923 [details]
Experimental patch for 2.6.18-92.el5 with instrumentation

This is an adaptation of Alan Stern's patch for RHEL 5.2.  Printk and
dump_stack instrumentation were added to the test in usbdev_open to indicate
the patch avoided opening an unregistered device.
Comment 4 Robert N. Evans 2008-06-12 12:59:37 EDT
Created attachment 309101 [details]
Fix without instrumentation

We have qualified this patch with testing that included 800 surprise removals
of a USB root hub.  During this testing, three times the new code avoided
opening a usbdev because usbdev_registered was zero.  There was no kernel
panic.

Please consider how to deal with the kABI issue and the possibility of
including this patch in 5.2.z.
Comment 5 Pete Zaitcev 2008-06-12 14:12:49 EDT
Created attachment 309120 [details]
Alan Stern's second patch

This avoids kABI issues, incidentially.
Comment 6 Robert N. Evans 2008-06-12 16:48:36 EDT
Created attachment 309142 [details]
Alan Stern's 2nd patch backported to 2.6.18

I've modified Alan's 2nd patch so it can be applied to kernel 2.6.18-92.el5 and
will run an overnight test with USB root hub removals.

I did not see a convenient place for instrumentation to indicate when the
modified code prevented a panic.  Initial progress of this test shows that
about half the times when a root hub is removed, usbdev_remove is called with
dev->filelist not empty.   The new code would empty this list, but this does
not match the panic frequency.
Comment 7 Pete Zaitcev 2008-06-13 18:16:23 EDT
Created attachment 309305 [details]
Previous patch with an extra check

I don't know if this is even relevant. We haven't seen class_device_create
fail, right? The crash traceback would've been different. So, the previous
patch by Peter & Robert should be sufficient.
Comment 8 Robert N. Evans 2008-06-17 09:50:25 EDT
I have been testing with the latest patch (id=309305).  It seems this does not
solve the reported problem.  After 243 IO breaks (surprise removals) there was a
kernel panic with the same footprint.  Here's the last gasp from the serial console:

uhci_hcd 0000:7e:1d.0: HCRESET not completed yet!
uhci_hcd 0000:7e:1d.1: HCRESET not completed yet!
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff801e611e>] usbdev_read+0xf5/0x1c8
PGD 64164067 PUD 627bc067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /class/vtm/vtmdmp_0c:01.1/intr_count
CPU 3 
Modules linked in: nfs lockd fscache nfs_acl ppp_deflate ppp_async crc_ccitt
ppp_generic slhc deflate zlib_deflate af_key autofs4 hidp rfcomm l2cap bluetooth
sunrpc bonding(U) iscsi_tcp ib_iser libiscsi scsi_transport_iscsi ib_srp ib_sdp
ib_ipoib ipv6(U) xfrm_nalgo crypto_api rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm
ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_mirror(U) dm_multipath(U) dm_mod(U)
video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac
parport_pc lp parport txen(U) ipmi_devintf ftmod(PU) ipmi_msghandler vtm(FU)
sr_mod cdrom(U) sg(U) i5000_edac radeonfb(FU) edac_mc fosil(U) e1000(U) pcspkr
ata_piix(U) aic94xx(U) libsas(U) libata(U) scsi_transport_sas(U) sd_mod(U)
scsi_mod(U) raid1(U) ext3 jbd uhci_hcd(U) ohci_hcd ehci_hcd
Pid: 12495, comm: pcscd Tainted: PF     2.6.18-92.el5_with_usbdev_fix_03 #1
RIP: 0010:[<ffffffff801e611e>]  [<ffffffff801e611e>] usbdev_read+0xf5/0x1c8
RSP: 0018:ffff81006472fe88  EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff810042aa6b80 RCX: 0000000000000012
RDX: 0000000000000000 RSI: 00000000427daef0 RDI: ffff8100185d2228
RBP: 0000000000000012 R08: 0000000000000002 R09: 0000000000000012
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000008
R13: ffff8100185d2000 R14: 0000000000000000 R15: ffff81006472ff50
FS:  00000000427db940(0063) GS:ffff81007fe0c640(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000072ef9000 CR4: 00000000000006e0
Process pcscd (pid: 12495, threadinfo ffff81006472e000, task ffff810079dfb0c0)
Stack:  0000000000000008 0000000000000000 00000000427daef0 0000000080027338
 ffff8100185d2228 0000000000000000 0800000901100112 0000000000000000
 0000000000000000 0000000100000000 ffff81007cf8d000 ffff810042aa6b80
Call Trace:
 [<ffffffff8000b337>] vfs_read+0xcb/0x171
 [<ffffffff80011715>] sys_read+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b 34 10 0f b7 7e 02 89 f8 49 8d 2c 01 48 39 e9 7d 5d 49 
RIP  [<ffffffff801e611e>] usbdev_read+0xf5/0x1c8
RSP <ffff81006472fe88>
Comment 9 Pete Zaitcev 2008-06-18 01:13:42 EDT
Created attachment 309692 [details]
Same as before, but with locks

Here's the thinking:

The only way to get what we get is to get a pointer to usb_device and had
that usb_device freed and reused before usb_get_dev happens. The distance
between the lookup by minor and usb_get_dev is very small, but interrupts
are enabled, so a CPU A can become distracted for a million cycles easy.
Enough for CPU B to kill and reinitialize the dev.

Secondly, I do not trust the fs_remove_file. It's a homemade contraption
by usbfs, so it may fail to lock out opens and so ->i_private may be
stale too. So let's cover that as well.

I need this patch tested. It introduces a danger of unforseen deadlocks.
My mind is too small to contain all indirections from library functions.
Comment 10 Robert N. Evans 2008-06-18 11:48:42 EDT
(In reply to comment #9)
> I need this patch tested. It introduces a danger of unforseen deadlocks.
> My mind is too small to contain all indirections from library functions.

This patch caused a deadlock on the first IO break.  The io_poll thread takes
the usbfs_mutex in usb_notify_remove_device( stack frame #6).  Then it blocks
trying to acquire the same mutex in usbdev_remove, inlined into usbdev_notify(
in frame #3).

crash> struct mutex usbfs_mutex
struct mutex {
  count = {
    counter = -1
  },
  wait_lock = {
    raw_lock = {
      slock = 1
    }
  },
  wait_list = {
    next = 0xffff81017db5ba70,
    prev = 0xffff8101721e3dd8
  }
}

crash> struct mutex_waiter 0xffff81017db5ba70
struct mutex_waiter {
  list = {
    next = 0xffff8101721e3dd8,
    prev = 0xffffffff80320428
  },
  task = 0xffff81017cf58040
}

crash> ps 0xffff81017cf58040
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
   6205      1   5  ffff81017cf58040  UN   0.0       0      0  [io_poll]

crash> bt 6205
PID: 6205   TASK: ffff81017cf58040  CPU: 5   COMMAND: "io_poll"
 #0 [ffff81017db5b990] schedule at ffffffff80062efb
 #1 [ffff81017db5ba68] __mutex_lock_slowpath at ffffffff80063bc7
 #2 [ffff81017db5baa8] .text.lock.mutex at ffffffff80063c11 (via mutex_lock)
 #3 [ffff81017db5bac8] usbdev_notify at ffffffff801e4b60
 #4 [ffff81017db5bae8] notifier_call_chain at ffffffff80066ba4
 #5 [ffff81017db5bb08] blocking_notifier_call_chain at ffffffff80098503
 #6 [ffff81017db5bb28] usb_notify_remove_device at ffffffff801e6355
 #7 [ffff81017db5bb38] usb_disconnect at ffffffff801dc32d
 #8 [ffff81017db5bb68] usb_remove_hcd at ffffffff801ddc7c
 #9 [ffff81017db5bb88] usb_hcd_pci_remove at ffffffff801e65e8
#10 [ffff81017db5bba8] pci_device_remove at ffffffff8014f34f
#11 [ffff81017db5bbc8] __device_release_driver at ffffffff801af19c
#12 [ffff81017db5bbe8] device_release_driver at ffffffff801af447
#13 [ffff81017db5bc08] bus_remove_device at ffffffff801aea6f
#14 [ffff81017db5bc28] device_del at ffffffff801ad8d7
#15 [ffff81017db5bc58] device_unregister at ffffffff801ad94e
#16 [ffff81017db5bc68] pci_stop_dev at ffffffff8014c525
#17 [ffff81017db5bc78] pci_remove_bus_device at ffffffff8014c63a
#18 [ffff81017db5bc98] pci_remove_behind_bridge at ffffffff8014c6c7
#19 [ffff81017db5bcb8] pci_remove_bus_device at ffffffff8014c618
#20 [ffff81017db5bcd8] remove_io at ffffffff88274bbd
#21 [ffff81017db5bd08] io_state_change at ffffffff882775a1
#22 [ffff81017db5bd28] IoStateChange at ffffffff882d293d
#23 [ffff81017db5be38] CcIoPoll at ffffffff882d4cff
#24 [ffff81017db5bee8] io_poll at ffffffff88277855
#25 [ffff81017db5bf48] kernel_thread at ffffffff8005dfb1

void usb_notify_remove_device(struct usb_device *udev)
{
	/* Protect against simultaneous usbfs open */
	mutex_lock(&usbfs_mutex);
	blocking_notifier_call_chain(&usb_notifier_list,
			USB_DEVICE_REMOVE, udev);
	mutex_unlock(&usbfs_mutex);
}

static void usbdev_remove(struct usb_device *dev)
{
	mutex_lock(&usbfs_mutex);
	if (dev->class_dev)
		class_device_unregister(dev->class_dev);
	usb_fs_classdev_common_remove(dev);
	mutex_unlock(&usbfs_mutex);
}

static int usbdev_notify(struct notifier_block *self, unsigned long action,
			 void *dev)
{
	switch (action) {
	case USB_DEVICE_ADD:
		usbdev_add(dev);
		break;
	case USB_DEVICE_REMOVE:
		usbdev_remove(dev);
		break;
	}
	return NOTIFY_OK;
}
Comment 11 Pete Zaitcev 2008-06-19 13:51:36 EDT
This is a very bad news. I completely forgot about that kludge of taking
usbfs_mutex for all disconnect notifications. Therefore, the fault scenario
that I have built was incorrect. I am facing making a special instrumen-
tation and running it in Stratus' environment (unless it's something simple
suddenly pops up, like maybe bus-removal vs. device-removal -- I'm still
grasping at straws here before doing full-on custom tracing).
Comment 12 Pete Zaitcev 2008-06-25 15:50:55 EDT
Created attachment 310292 [details]
Snapshot (same, one fix)

I have to suspend work until next week.

I started working on reproducing this locally, because I am unable form
the failure scenario. Doing so, I found a silly bug in the code I added.
If lookup by minor finds nothing, NULL dereference Oops happens. This
is not what Stratus issue is, just my bug. It is fixed in the attachment.
Comment 13 Robert N. Evans 2008-07-01 14:55:36 EDT
(In reply to comment #12)  
I've tested this patch by running several hundred IO breaks on systems that
reproduced this problem in the past.  This testing is not conclusive and it was
not intended to demonstrate that the NULL dereference has been fixed.  However,
it  shows no ill effect from the latest patch.
Comment 17 Robert N. Evans 2008-07-09 15:19:49 EDT
RE: Attachment (id=310292) from 2008-06-25
I've done further tests with this patch.  As expected, the original problem
(NULL pointer dereference in pcscd at usbdev_read+0xf5) was reproduced.  However
there were no problems found due to this code change.
Comment 18 Pete Zaitcev 2008-07-10 13:58:06 EDT
I'm still unable to reproduce the issue. I looping rmmod/modprobe uhci_hcd
and run a test that does what pcscd does: a loop of usb_find_busses();
usb_find_devices() (it triggers reading of descriptors from
/dev/bus/usb/008/001 and etc.). The system is a 2-way SMP.
I'll have to come up with instrumentation what Robert can run on site.
Comment 20 Pete Zaitcev 2008-07-26 13:40:51 EDT
Created attachment 312710 [details]
libusb_test2.c

This test is intended to make the bug easier to reproduce.
E.g. where Robert has to do 800 cycles normally, with this in
background it ought to blow up in 50 or so -- hopefuly.
Comment 21 Pete Zaitcev 2008-07-27 01:53:57 EDT
Created attachment 312722 [details]
Test 3 - ad-hoc history

Robert, would you mind to run this patch and capture the crash dmesg?
Please feel free to look at it and give me any suggestions you may have too.
Comment 23 Robert N. Evans 2008-07-29 17:35:46 EDT
I've started a test with the RHEL5.2 kernel patched to include "Test 3 - ad-hoc
history".  Next I wanted to use the libusb_test2.c to get a faster bug
reproduction, but I think something may be missing or I don't understand how to
deploy it:
# make libusb_test2
cc     libusb_test2.c   -o libusb_test2
/tmp/cc0dKI3Y.o: In function `main':
libusb_test2.c:(.text+0x10): undefined reference to `usb_init'
libusb_test2.c:(.text+0x15): undefined reference to `usb_find_busses'
libusb_test2.c:(.text+0x1a): undefined reference to `usb_find_devices'
collect2: ld returned 1 exit status
make: *** [libusb_test2] Error 1
Comment 24 Pete Zaitcev 2008-07-29 18:04:53 EDT
It's not strictly required to use libusb_test2, in fact I'm not sure if
it's going to increase the incidence of the crash. To build it -lusb
needs to be passed:
 gcc -Wall -O -o libusb_test2 libusb_test2.c -lusb

What about the patch though, does it work? Please attach the dmesg
after the crash (it'll be necessary to have a netconsole or serial console,
sorry about the complication).
Comment 25 Robert N. Evans 2008-08-01 10:41:53 EDT
I was unable to induce the panic with libusb_test2 running.  This is similar to
our experience with homegrown scripts that repeatedly accesses sysfs.  We are
continuing to test without running libusb_test2.
Comment 26 Robert N. Evans 2008-08-07 11:29:34 EDT
Created attachment 313707 [details]
Crash dmesg and bt

I was afraid that the logging had skewed timing so the bug would not re-occur.  Today I reproduced the reported problem with custom logging.  The attachment gives info from the dump harvested at the time of the panic.
Comment 27 Robert N. Evans 2008-08-07 17:04:42 EDT
(In reply to comment #26)

When the OS crashed:
 CPUs 0,2,4,5,7 were idle,
 CPU 6 was in the kernel, it had an exception causing the OS to panic,
 CPU 3 was in the kernel,
 CPU 1 was in removing PCI devices, causing callbacks into the kernel.
       It seems the USB host controller was being removed at the crash point.

crash> ps | grep '^>'
>     0      0   0  ffffffff802e4ae0  RU   0.0       0      0  [swapper]
>     0      1   2  ffff81010551a7a0  RU   0.0       0      0  [swapper]
>     0      1   4  ffff810105596860  RU   0.0       0      0  [swapper]
>     0      1   5  ffff8101055a1040  RU   0.0       0      0  [swapper]
>     0      1   7  ffff810105651100  RU   0.0       0      0  [swapper]
>   737      1   3  ffff81017ebeb860  RU   0.0   12996   1168  udevd
>  5529      1   1  ffff8101770d6040  RU   0.0       0      0  [io_poll]
> 11943      1   6  ffff810176cb4100  RU   0.1   35740   5808  pcscd 

crash> bt ffff8101770d6040
PID: 5529   TASK: ffff8101770d6040  CPU: 1   COMMAND: "io_poll"
 #0 [ffff810105530f30] crash_nmi_callback at ffffffff80079a6d
 #1 [ffff810105530f40] do_nmi at ffffffff800659cc
 #2 [ffff810105530f50] nmi at ffffffff80064e47
    [exception RIP: __delay+0x8]
    RIP: ffffffff8000c67d  RSP: ffff8101756238b8  RFLAGS: 00000097
    RAX: 00000000cfd68927  RBX: ffffffff804b9740  RCX: 00000000cfd67ea4
    RDX: 000000000000bb24  RSI: 0000000000000005  RDI: 0000000000000bb7
    RBP: 0000000000000000   R8: 0000000000000020   R9: 0000000000000000
    R10: 0000000000000001  R11: ffff81017fe4e300  R12: 00000000000026fc
    R13: 0000000000000063  R14: 0000000000000032  R15: ffffffff80463c39
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
 #3 [ffff8101756238b8] __delay at ffffffff8000c67d
 #4 [ffff8101756238b8] serial8250_console_putchar at ffffffff801ac437
 #5 [ffff8101756238e8] uart_console_write at ffffffff801a7c51
 #6 [ffff810175623918] serial8250_console_write at ffffffff801abdfb
 #7 [ffff810175623958] __call_console_drivers at ffffffff8008f6b4
 #8 [ffff810175623978] release_console_sem at ffffffff80016bf1
 #9 [ffff8101756239a8] vprintk at ffffffff8008fe0e
#10 [ffff810175623a28] printk at ffffffff8008fec4
#11 [ffff810175623b18] uhci_reset_hc at ffffffff801e9122
#12 [ffff810175623b38] uhci_hc_died at ffffffff880249f1
#13 [ffff810175623b48] uhci_stop at ffffffff88025534
#14 [ffff810175623b68] usb_remove_hcd at ffffffff801dddf7
#15 [ffff810175623b88] usb_hcd_pci_remove at ffffffff801e682d
#16 [ffff810175623ba8] pci_device_remove at ffffffff8014f3f6
#17 [ffff810175623bc8] __device_release_driver at ffffffff801af24a
#18 [ffff810175623be8] device_release_driver at ffffffff801af4f5
#19 [ffff810175623c08] bus_remove_device at ffffffff801aeb1d
#20 [ffff810175623c28] device_del at ffffffff801ad985
#21 [ffff810175623c58] device_unregister at ffffffff801ad9fc
#22 [ffff810175623c68] pci_stop_dev at ffffffff8014c5cc
#23 [ffff810175623c78] pci_remove_bus_device at ffffffff8014c6e1
#24 [ffff810175623c98] pci_remove_behind_bridge at ffffffff8014c76e
#25 [ffff810175623cb8] pci_remove_bus_device at ffffffff8014c6bf
#26 [ffff810175623cd8] remove_io at ffffffff8826dbbd
#27 [ffff810175623d08] io_state_change at ffffffff882705a1
#28 [ffff810175623d28] IoStateChange at ffffffff882cb925
#29 [ffff810175623e38] CcIoPoll at ffffffff882cdce7
#30 [ffff810175623ee8] io_poll at ffffffff88270855
#31 [ffff810175623f48] kernel_thread at ffffffff8005dfb1

crash> bt ffff81017ebeb860
PID: 737    TASK: ffff81017ebeb860  CPU: 3   COMMAND: "udevd"
 #0 [ffff8101055b1f30] crash_nmi_callback at ffffffff80079a6d
 #1 [ffff8101055b1f40] do_nmi at ffffffff800659cc
 #2 [ffff8101055b1f50] nmi at ffffffff80064e47
    [exception RIP: find_next_bit+0x64]
    RIP: ffffffff800416d2  RSP: ffff81010571fdc8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff80424268  RCX: 0000000000000002
    RDX: 0000000000000000  RSI: 00000000000000bf  RDI: ffffffff80424270
    RBP: 000000000000b2b4   R8: 0000000000000008   R9: 0000000000000038
    R10: ffffffff80424260  R11: 0000000000000000  R12: 000000000000003a
    R13: 0000000000049b78  R14: 000000001c546b7c  R15: 0000000000b8e925
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
 #3 [ffff81010571fdc8] find_next_bit at ffffffff800416d2
 #4 [ffff81010571fdd0] __next_cpu at ffffffff80142ffc
 #5 [ffff81010571fde0] show_stat at ffffffff80043dde
 #6 [ffff81010571feb0] seq_read at ffffffff8003f15b
 #7 [ffff81010571ff10] vfs_read at ffffffff8000b332
 #8 [ffff81010571ff40] sys_read at ffffffff80011710
 #9 [ffff81010571ff80] system_call at ffffffff8005d116
    RIP: 00002af223b2c0d0  RSP: 00007fff878c97c0  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff8005d116  RCX: 00002af2233eb8d0
    RDX: 0000000000001000  RSI: 00002af2231ec000  RDI: 0000000000000008
    RBP: 00007fff878c9380   R8: 00000000ffffffff   R9: 0000000000000000
    R10: 0000000000000022  R11: 0000000000000246  R12: 00000000000003ff
    R13: 0000000000000000  R14: 00002af22420fc20  R15: 000000002420fc20
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

crash> bt ffff810176cb4100
PID: 11943  TASK: ffff810176cb4100  CPU: 6   COMMAND: "pcscd"
 #0 [ffff81016b97bbe0] crash_kexec at ffffffff800aaa1f
 #1 [ffff81016b97bca0] __die at ffffffff800650af
 #2 [ffff81016b97bce0] do_page_fault at ffffffff80066aa1
 #3 [ffff81016b97bdd0] error_exit at ffffffff8005dde9
    [exception RIP: usbdev_read+0x106]
    RIP: ffffffff801e637e  RSP: ffff81016b97be88  RFLAGS: 00010297
    RAX: 0000000000000000  RBX: ffff810175669180  RCX: 0000000000000012
    RDX: 0000000000000000  RSI: 0000000000000296  RDI: ffff810040a2e228
    RBP: 0000000000000012   R8: 0000000000000002   R9: 0000000000000012
    R10: 0000000000000003  R11: 0000000000000202  R12: 0000000000000008
    R13: ffff810040a2e000  R14: 0000000000000000  R15: ffff81016b97bf50
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81016b97be80] usbdev_read at ffffffff801e62c1
 #5 [ffff81016b97bf10] vfs_read at ffffffff8000b332
 #6 [ffff81016b97bf40] sys_read at ffffffff80011710
 #7 [ffff81016b97bf80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 0000003289c0cc8b  RSP: 0000000041b7be00  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 0000000000000008  RSI: 0000000041b7def0  RDI: 0000000000000008
    RBP: 0000000000000008   R8: 0000000000000000   R9: 000000001ff679e2
    R10: 0000000000000003  R11: 0000000000000202  R12: 0000000000000000
    R13: 000000001ff669b0  R14: 0000000041b7ded0  R15: 0000000141b7def0
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
Comment 28 Pete Zaitcev 2008-08-07 22:12:32 EDT
Created attachment 313771 [details]
Test patch for i_private
Comment 29 Pete Zaitcev 2008-08-07 22:14:39 EDT
Dear Robert, thanks a lot for the trace.
Please try the patch in comment #28.
Comment 30 Robert N. Evans 2008-08-08 09:19:19 EDT
I assume the patch in comment 28 is a proposed fix.  Therefore we will test with just that patch applied to kernel 2.6.18-92.1.10.el5.  Because there is no instrumentation to indicate the panic situation occurred but was avoided, we will need to run for a long time without a panic to have confidence in the fix.  I plan to execute 10000 surprise IO removals during this test.
Comment 31 Andrius Benokraitis 2008-08-14 10:54:51 EDT
Robert - any luck with the recent test?
Comment 32 Robert N. Evans 2008-08-14 14:29:44 EDT
(In reply to comment #31)
> Robert - any luck with the recent test?

We are about 40% through the long duration run with no problems detected.  If there are no problems with the test environment, I hope to have results next week.
Comment 33 Robert N. Evans 2008-08-19 12:58:56 EDT
(In reply to comment #31)
> Robert - any luck with the recent test?

At this time, we have more than 9570 IO subsystem removals.  During this test, the kernel paniced twice due to bug 455460.  However, there were no other panics.  I consider the fix for the present bug (450786), attached to comment 28, to be qualified by this testing.
Comment 34 Andrius Benokraitis 2008-08-19 14:12:35 EDT
Pete - we should be good to go with the patch Stratus tested if you think that's the right patchset to propose for 5.3, since we are getting close to the beta patch deadline... Are you needing anything from Status from here on out? Thanks!
Comment 38 Don Zickus 2008-09-02 23:40:12 EDT
in kernel-2.6.18-107.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 41 Chris Ward 2008-11-14 09:04:16 EST
~~~ Attention Partners! ~~~

Please test this URGENT / HIGH priority bug at your earliest convenience to ensure it makes it into the upcoming RHEL 5.3 release. The fix should be present in the Partner Snapshot #2 (kernel*-122), available NOW at ftp://partners.redhat.com. As we are approaching the end of the RHEL 5.3 test cycle, it is critical that you report back testing results as soon as possible. 

If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field to indicate this. If you find that this issue has not been properly fixed, set the bug status to ASSIGNED with a comment describing the issues you encountered.

All NEW issues encountered (not part of this bug fix) should have a new bug created with the proper keywords and flags set to trigger a review for their inclusion in the upcoming RHEL 5.3 or other future release. Post a link in this bugzilla pointing to the new issue to ensure it is not overlooked.

For any additional questions, speak with your Partner Manager.
Comment 42 Robert N. Evans 2008-11-14 15:34:09 EST
This fix was verified when I tested the patch attached to comment 28.  That took almost two weeks of run time.  I've examined the source of /kernel-2.6.18-122.el5 and verified the same fix is in place.  On that basis I verify the fix is present.    I am adding the PartnerVerified keyword.
Comment 44 errata-xmlrpc 2009-01-20 15:05:37 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.