Description of problem: Kernel panic at usbdev_read+0xf5 due to NULL pointer dereference after removing USB root hub device. Crash dump examination shows a thread from command "pcscd" active in sys_read hits this exception. Version-Release number of selected component (if applicable): 2.6.18-92.el5 How reproducible: This crash has occurred several times in the Stratus lab. It should be reproducable on generic hardware, for example via a PCcard USB adapter. Steps to Reproduce: Repeatedly remove and insert USB root hub device with pcscd running. Actual results: Kernel panic in thread of pcscd. Expected results: Error status returned to pcscd with no kernel panic. Additional info: This crash was discussed a few days ago on the linux-usb mailing list. Alan Stern provided a patch http://marc.info/?l=linux-usb&m=121269125717403&w=2 for which he claims: "That could easily be true. Oddly enough, although there is locking to synchronize open with disconnect, there is no flag to tell open that the device is gone." Stratus engineering is adapting that patch for 2.6.18 and testing it. Analysis of crash follows: crash> bt PID: 8897 TASK: ffff81027dc170c0 CPU: 5 COMMAND: "pcscd" #0 [ffff81026ff8dbe0] crash_kexec at ffffffff800aa977 #1 [ffff81026ff8dca0] __die at ffffffff800650af #2 [ffff81026ff8dce0] do_page_fault at ffffffff80066aa1 #3 [ffff81026ff8ddd0] error_exit at ffffffff8005dde9 [exception RIP: usbdev_read+0xf5] RIP: ffffffff801e6063 RSP: ffff81026ff8de88 RFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff81027de6ecc0 RCX: 0000000000000012 RDX: 0000000000000000 RSI: 000000004159eef0 RDI: ffff81023476da28 RBP: 0000000000000012 R8: 0000000000000002 R9: 0000000000000012 R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000008 R13: ffff81023476d800 R14: 0000000000000000 R15: ffff81026ff8df50 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #4 [ffff81026ff8de80] usbdev_read at ffffffff801e5fa6 #5 [ffff81026ff8df10] vfs_read at ffffffff8000b337 #6 [ffff81026ff8df40] sys_read at ffffffff80011715 #7 [ffff81026ff8df80] tracesys at ffffffff8005d28d (via system_call) RIP: 0000003199a0cc8b RSP: 000000004159ce00 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff RDX: 0000000000000008 RSI: 000000004159eef0 RDI: 0000000000000009 RBP: 0000000000000009 R8: 0000000000000000 R9: 000000001a2b76a2 R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000 R13: 000000001a2b6670 R14: 000000004159eed0 R15: 000000014159eef0 ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b The fault is at usbdev_read line 167: 122 static ssize_t usbdev_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) 123 { 124 struct dev_state *ps = (struct dev_state*)file->private_data; 125 struct usb_device *dev = ps->dev; ... 164 pos = sizeof(struct usb_device_descriptor); 165 for (i = 0; nbytes && i < dev->descriptor.bNumConfigurations; i++) { 166 struct usb_config_descriptor *config = 167 (struct usb_config_descriptor*) dev->rawdescriptors[i]; "dev" is in r13. Part of it looks valid, but at least the beginning of it has been overwritten: crash> print ((struct usb_device *) 0xffff81023476d800)->rawdescriptors $2 = (char **) 0x0 crash> print ((struct usb_device *) 0xffff81023476d800)->descriptor.bNumConfigurations $3 = 0x1 crash> rd ffff81023476d800 100 ffff81023476d800: 723d4e4f49544341 45440065766f6d65 ACTION=remove.DE ffff81023476d810: 642f3d4854415056 702f736563697665 VPATH=/devices/p ffff81023476d820: 303a303030306963 303a303030302f30 ci0000:00/0000:0 ffff81023476d830: 302f302e32303a30 303a31303a303030 0:02.0/0000:01:0 ffff81023476d840: 303030302f302e30 302e30303a32303a 0.0/0000:02:00.0 ffff81023476d850: 33303a303030302f 30302f302e31303a /0000:03:01.0/00 ffff81023476d860: 64313a62303a3030 595342555300302e 00:0b:1d.0.SUBSY ffff81023476d870: 6963703d4d455453 3d4d554e51455300 STEM=pci.SEQNUM= ffff81023476d880: 303a003132333435 303030302f302e32 54321.:02.0/0000 ffff81023476d890: 594850303a31303a 3d53554256454453 :01:0PHYSDEVBUS= ffff81023476d8a0: 5f49435000696370 30433d5353414c43 pci.PCI_CLASS=C0 ffff81023476d8b0: 5f49435000303033 3a363830383d4449 300.PCI_ID=8086: ffff81023476d8c0: 4943500038383632 5f5359534255535f 2688.PCI_SUBSYS_ ffff81023476d8d0: 3a333330313d4449 4943500030353338 ID=1033:8350.PCI ffff81023476d8e0: 414e5f544f4c535f 3a303030303d454d _SLOT_NAME=0000: ffff81023476d8f0: 00302e64313a6230 5341494c41444f4d 0b:1d.0.MODALIAS ffff81023476d900: 3030763a6963703d 3064363830383030 =pci:v00008086d0 ffff81023476d910: 7338383632303030 3330313030303076 0002688sv0000103 ffff81023476d920: 3830303030647333 7343306362303533 3sd00008350bc0Cs ffff81023476d930: 0000303069333063 0000000000100100 c03i00.......... ffff81023476d940: 0000000000200200 0000000000000001 .. ............. ffff81023476d950: 0000000000000000 0000000000000001 ................ ffff81023476d960: ffff81023476d960 ffff81023476d960 `.v4....`.v4.... ffff81023476d970: ffff81022b06b070 0000000000000000 p..+............ ffff81023476d980: 0000000031627375 0000000000000000 usb1............ ffff81023476d990: 0000000200000000 ffff81023476d998 ..........v4.... ffff81023476d9a0: ffff81023476d998 ffff81022b06b160 ..v4....`..+.... ffff81023476d9b0: ffffffff80314420 0000000000000000 D1............. ffff81023476d9c0: 0000000000000000 0000000000000001 ................ ffff81023476d9d0: ffff81023476d9d0 ffff81023476d9d0 ..v4......v4.... ffff81023476d9e0: 0000000031627375 0000000000000000 usb1............ ffff81023476d9f0: 0000000000000000 ffffffff802965fa .........e)..... ffff81023476da00: 0000000000000000 0000000000000080 ................ ffff81023476da10: 0000000000000000 ffffffff801ad638 ........8....... ffff81023476da20: 0000000000000000 0000000000000000 ................ ffff81023476da30: 0000000000000001 ffff81023476da38 ........8.v4.... ffff81023476da40: ffff81023476da38 ffffffff8031f100 8.v4......1..... ffff81023476da50: 0000000000000000 ffffffff804c40d8 .........@L..... ffff81023476da60: 0000000000000000 0000000000000000 ................ ffff81023476da70: 0000000300000000 0000000000000000 ................ ffff81023476da80: 0000000000000000 ffff81022b06b070 ........p..+.... ffff81023476da90: ffff81023476da90 ffff81023476da90 ..v4......v4.... ffff81023476daa0: ffff81022b06b060 0000000000000000 `..+............ ffff81023476dab0: ffff81023476dab0 ffff81023476dab0 ..v4......v4.... ffff81023476dac0: 0000000000000000 ffff81023476dac8 ..........v4.... ffff81023476dad0: ffff81023476dac8 0000000000000000 ..v4............ ffff81023476dae0: 0000000000000000 ffffffff801da45b ........[....... ffff81023476daf0: 4000000901100112 0203020600000000 .......@........ ffff81023476db00: 0000000000000101 0000000000000000 ................ ffff81023476db10: 0000000000000000 0000000000000000 ................ Stratus internal tracking number is: Bug 5707.
One issue with the patch proposed on linux-usb is that it changes struct usb_device. I think this is a KABI change.
It may be the case, yes. I haven't started looking into workarounds for the kABI though. I wanted a known-working solution, because the patch Alan Stern posted initially looked a bit odd.
Created attachment 308923 [details] Experimental patch for 2.6.18-92.el5 with instrumentation This is an adaptation of Alan Stern's patch for RHEL 5.2. Printk and dump_stack instrumentation were added to the test in usbdev_open to indicate the patch avoided opening an unregistered device.
Created attachment 309101 [details] Fix without instrumentation We have qualified this patch with testing that included 800 surprise removals of a USB root hub. During this testing, three times the new code avoided opening a usbdev because usbdev_registered was zero. There was no kernel panic. Please consider how to deal with the kABI issue and the possibility of including this patch in 5.2.z.
Created attachment 309120 [details] Alan Stern's second patch This avoids kABI issues, incidentially.
Created attachment 309142 [details] Alan Stern's 2nd patch backported to 2.6.18 I've modified Alan's 2nd patch so it can be applied to kernel 2.6.18-92.el5 and will run an overnight test with USB root hub removals. I did not see a convenient place for instrumentation to indicate when the modified code prevented a panic. Initial progress of this test shows that about half the times when a root hub is removed, usbdev_remove is called with dev->filelist not empty. The new code would empty this list, but this does not match the panic frequency.
Created attachment 309305 [details] Previous patch with an extra check I don't know if this is even relevant. We haven't seen class_device_create fail, right? The crash traceback would've been different. So, the previous patch by Peter & Robert should be sufficient.
I have been testing with the latest patch (id=309305). It seems this does not solve the reported problem. After 243 IO breaks (surprise removals) there was a kernel panic with the same footprint. Here's the last gasp from the serial console: uhci_hcd 0000:7e:1d.0: HCRESET not completed yet! uhci_hcd 0000:7e:1d.1: HCRESET not completed yet! Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [<ffffffff801e611e>] usbdev_read+0xf5/0x1c8 PGD 64164067 PUD 627bc067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /class/vtm/vtmdmp_0c:01.1/intr_count CPU 3 Modules linked in: nfs lockd fscache nfs_acl ppp_deflate ppp_async crc_ccitt ppp_generic slhc deflate zlib_deflate af_key autofs4 hidp rfcomm l2cap bluetooth sunrpc bonding(U) iscsi_tcp ib_iser libiscsi scsi_transport_iscsi ib_srp ib_sdp ib_ipoib ipv6(U) xfrm_nalgo crypto_api rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_mirror(U) dm_multipath(U) dm_mod(U) video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport txen(U) ipmi_devintf ftmod(PU) ipmi_msghandler vtm(FU) sr_mod cdrom(U) sg(U) i5000_edac radeonfb(FU) edac_mc fosil(U) e1000(U) pcspkr ata_piix(U) aic94xx(U) libsas(U) libata(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) raid1(U) ext3 jbd uhci_hcd(U) ohci_hcd ehci_hcd Pid: 12495, comm: pcscd Tainted: PF 2.6.18-92.el5_with_usbdev_fix_03 #1 RIP: 0010:[<ffffffff801e611e>] [<ffffffff801e611e>] usbdev_read+0xf5/0x1c8 RSP: 0018:ffff81006472fe88 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff810042aa6b80 RCX: 0000000000000012 RDX: 0000000000000000 RSI: 00000000427daef0 RDI: ffff8100185d2228 RBP: 0000000000000012 R08: 0000000000000002 R09: 0000000000000012 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000008 R13: ffff8100185d2000 R14: 0000000000000000 R15: ffff81006472ff50 FS: 00000000427db940(0063) GS:ffff81007fe0c640(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000072ef9000 CR4: 00000000000006e0 Process pcscd (pid: 12495, threadinfo ffff81006472e000, task ffff810079dfb0c0) Stack: 0000000000000008 0000000000000000 00000000427daef0 0000000080027338 ffff8100185d2228 0000000000000000 0800000901100112 0000000000000000 0000000000000000 0000000100000000 ffff81007cf8d000 ffff810042aa6b80 Call Trace: [<ffffffff8000b337>] vfs_read+0xcb/0x171 [<ffffffff80011715>] sys_read+0x45/0x6e [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 48 8b 34 10 0f b7 7e 02 89 f8 49 8d 2c 01 48 39 e9 7d 5d 49 RIP [<ffffffff801e611e>] usbdev_read+0xf5/0x1c8 RSP <ffff81006472fe88>
Created attachment 309692 [details] Same as before, but with locks Here's the thinking: The only way to get what we get is to get a pointer to usb_device and had that usb_device freed and reused before usb_get_dev happens. The distance between the lookup by minor and usb_get_dev is very small, but interrupts are enabled, so a CPU A can become distracted for a million cycles easy. Enough for CPU B to kill and reinitialize the dev. Secondly, I do not trust the fs_remove_file. It's a homemade contraption by usbfs, so it may fail to lock out opens and so ->i_private may be stale too. So let's cover that as well. I need this patch tested. It introduces a danger of unforseen deadlocks. My mind is too small to contain all indirections from library functions.
(In reply to comment #9) > I need this patch tested. It introduces a danger of unforseen deadlocks. > My mind is too small to contain all indirections from library functions. This patch caused a deadlock on the first IO break. The io_poll thread takes the usbfs_mutex in usb_notify_remove_device( stack frame #6). Then it blocks trying to acquire the same mutex in usbdev_remove, inlined into usbdev_notify( in frame #3). crash> struct mutex usbfs_mutex struct mutex { count = { counter = -1 }, wait_lock = { raw_lock = { slock = 1 } }, wait_list = { next = 0xffff81017db5ba70, prev = 0xffff8101721e3dd8 } } crash> struct mutex_waiter 0xffff81017db5ba70 struct mutex_waiter { list = { next = 0xffff8101721e3dd8, prev = 0xffffffff80320428 }, task = 0xffff81017cf58040 } crash> ps 0xffff81017cf58040 PID PPID CPU TASK ST %MEM VSZ RSS COMM 6205 1 5 ffff81017cf58040 UN 0.0 0 0 [io_poll] crash> bt 6205 PID: 6205 TASK: ffff81017cf58040 CPU: 5 COMMAND: "io_poll" #0 [ffff81017db5b990] schedule at ffffffff80062efb #1 [ffff81017db5ba68] __mutex_lock_slowpath at ffffffff80063bc7 #2 [ffff81017db5baa8] .text.lock.mutex at ffffffff80063c11 (via mutex_lock) #3 [ffff81017db5bac8] usbdev_notify at ffffffff801e4b60 #4 [ffff81017db5bae8] notifier_call_chain at ffffffff80066ba4 #5 [ffff81017db5bb08] blocking_notifier_call_chain at ffffffff80098503 #6 [ffff81017db5bb28] usb_notify_remove_device at ffffffff801e6355 #7 [ffff81017db5bb38] usb_disconnect at ffffffff801dc32d #8 [ffff81017db5bb68] usb_remove_hcd at ffffffff801ddc7c #9 [ffff81017db5bb88] usb_hcd_pci_remove at ffffffff801e65e8 #10 [ffff81017db5bba8] pci_device_remove at ffffffff8014f34f #11 [ffff81017db5bbc8] __device_release_driver at ffffffff801af19c #12 [ffff81017db5bbe8] device_release_driver at ffffffff801af447 #13 [ffff81017db5bc08] bus_remove_device at ffffffff801aea6f #14 [ffff81017db5bc28] device_del at ffffffff801ad8d7 #15 [ffff81017db5bc58] device_unregister at ffffffff801ad94e #16 [ffff81017db5bc68] pci_stop_dev at ffffffff8014c525 #17 [ffff81017db5bc78] pci_remove_bus_device at ffffffff8014c63a #18 [ffff81017db5bc98] pci_remove_behind_bridge at ffffffff8014c6c7 #19 [ffff81017db5bcb8] pci_remove_bus_device at ffffffff8014c618 #20 [ffff81017db5bcd8] remove_io at ffffffff88274bbd #21 [ffff81017db5bd08] io_state_change at ffffffff882775a1 #22 [ffff81017db5bd28] IoStateChange at ffffffff882d293d #23 [ffff81017db5be38] CcIoPoll at ffffffff882d4cff #24 [ffff81017db5bee8] io_poll at ffffffff88277855 #25 [ffff81017db5bf48] kernel_thread at ffffffff8005dfb1 void usb_notify_remove_device(struct usb_device *udev) { /* Protect against simultaneous usbfs open */ mutex_lock(&usbfs_mutex); blocking_notifier_call_chain(&usb_notifier_list, USB_DEVICE_REMOVE, udev); mutex_unlock(&usbfs_mutex); } static void usbdev_remove(struct usb_device *dev) { mutex_lock(&usbfs_mutex); if (dev->class_dev) class_device_unregister(dev->class_dev); usb_fs_classdev_common_remove(dev); mutex_unlock(&usbfs_mutex); } static int usbdev_notify(struct notifier_block *self, unsigned long action, void *dev) { switch (action) { case USB_DEVICE_ADD: usbdev_add(dev); break; case USB_DEVICE_REMOVE: usbdev_remove(dev); break; } return NOTIFY_OK; }
This is a very bad news. I completely forgot about that kludge of taking usbfs_mutex for all disconnect notifications. Therefore, the fault scenario that I have built was incorrect. I am facing making a special instrumen- tation and running it in Stratus' environment (unless it's something simple suddenly pops up, like maybe bus-removal vs. device-removal -- I'm still grasping at straws here before doing full-on custom tracing).
Created attachment 310292 [details] Snapshot (same, one fix) I have to suspend work until next week. I started working on reproducing this locally, because I am unable form the failure scenario. Doing so, I found a silly bug in the code I added. If lookup by minor finds nothing, NULL dereference Oops happens. This is not what Stratus issue is, just my bug. It is fixed in the attachment.
(In reply to comment #12) I've tested this patch by running several hundred IO breaks on systems that reproduced this problem in the past. This testing is not conclusive and it was not intended to demonstrate that the NULL dereference has been fixed. However, it shows no ill effect from the latest patch.
RE: Attachment (id=310292) from 2008-06-25 I've done further tests with this patch. As expected, the original problem (NULL pointer dereference in pcscd at usbdev_read+0xf5) was reproduced. However there were no problems found due to this code change.
I'm still unable to reproduce the issue. I looping rmmod/modprobe uhci_hcd and run a test that does what pcscd does: a loop of usb_find_busses(); usb_find_devices() (it triggers reading of descriptors from /dev/bus/usb/008/001 and etc.). The system is a 2-way SMP. I'll have to come up with instrumentation what Robert can run on site.
Created attachment 312710 [details] libusb_test2.c This test is intended to make the bug easier to reproduce. E.g. where Robert has to do 800 cycles normally, with this in background it ought to blow up in 50 or so -- hopefuly.
Created attachment 312722 [details] Test 3 - ad-hoc history Robert, would you mind to run this patch and capture the crash dmesg? Please feel free to look at it and give me any suggestions you may have too.
I've started a test with the RHEL5.2 kernel patched to include "Test 3 - ad-hoc history". Next I wanted to use the libusb_test2.c to get a faster bug reproduction, but I think something may be missing or I don't understand how to deploy it: # make libusb_test2 cc libusb_test2.c -o libusb_test2 /tmp/cc0dKI3Y.o: In function `main': libusb_test2.c:(.text+0x10): undefined reference to `usb_init' libusb_test2.c:(.text+0x15): undefined reference to `usb_find_busses' libusb_test2.c:(.text+0x1a): undefined reference to `usb_find_devices' collect2: ld returned 1 exit status make: *** [libusb_test2] Error 1
It's not strictly required to use libusb_test2, in fact I'm not sure if it's going to increase the incidence of the crash. To build it -lusb needs to be passed: gcc -Wall -O -o libusb_test2 libusb_test2.c -lusb What about the patch though, does it work? Please attach the dmesg after the crash (it'll be necessary to have a netconsole or serial console, sorry about the complication).
I was unable to induce the panic with libusb_test2 running. This is similar to our experience with homegrown scripts that repeatedly accesses sysfs. We are continuing to test without running libusb_test2.
Created attachment 313707 [details] Crash dmesg and bt I was afraid that the logging had skewed timing so the bug would not re-occur. Today I reproduced the reported problem with custom logging. The attachment gives info from the dump harvested at the time of the panic.
(In reply to comment #26) When the OS crashed: CPUs 0,2,4,5,7 were idle, CPU 6 was in the kernel, it had an exception causing the OS to panic, CPU 3 was in the kernel, CPU 1 was in removing PCI devices, causing callbacks into the kernel. It seems the USB host controller was being removed at the crash point. crash> ps | grep '^>' > 0 0 0 ffffffff802e4ae0 RU 0.0 0 0 [swapper] > 0 1 2 ffff81010551a7a0 RU 0.0 0 0 [swapper] > 0 1 4 ffff810105596860 RU 0.0 0 0 [swapper] > 0 1 5 ffff8101055a1040 RU 0.0 0 0 [swapper] > 0 1 7 ffff810105651100 RU 0.0 0 0 [swapper] > 737 1 3 ffff81017ebeb860 RU 0.0 12996 1168 udevd > 5529 1 1 ffff8101770d6040 RU 0.0 0 0 [io_poll] > 11943 1 6 ffff810176cb4100 RU 0.1 35740 5808 pcscd crash> bt ffff8101770d6040 PID: 5529 TASK: ffff8101770d6040 CPU: 1 COMMAND: "io_poll" #0 [ffff810105530f30] crash_nmi_callback at ffffffff80079a6d #1 [ffff810105530f40] do_nmi at ffffffff800659cc #2 [ffff810105530f50] nmi at ffffffff80064e47 [exception RIP: __delay+0x8] RIP: ffffffff8000c67d RSP: ffff8101756238b8 RFLAGS: 00000097 RAX: 00000000cfd68927 RBX: ffffffff804b9740 RCX: 00000000cfd67ea4 RDX: 000000000000bb24 RSI: 0000000000000005 RDI: 0000000000000bb7 RBP: 0000000000000000 R8: 0000000000000020 R9: 0000000000000000 R10: 0000000000000001 R11: ffff81017fe4e300 R12: 00000000000026fc R13: 0000000000000063 R14: 0000000000000032 R15: ffffffff80463c39 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <exception stack> --- #3 [ffff8101756238b8] __delay at ffffffff8000c67d #4 [ffff8101756238b8] serial8250_console_putchar at ffffffff801ac437 #5 [ffff8101756238e8] uart_console_write at ffffffff801a7c51 #6 [ffff810175623918] serial8250_console_write at ffffffff801abdfb #7 [ffff810175623958] __call_console_drivers at ffffffff8008f6b4 #8 [ffff810175623978] release_console_sem at ffffffff80016bf1 #9 [ffff8101756239a8] vprintk at ffffffff8008fe0e #10 [ffff810175623a28] printk at ffffffff8008fec4 #11 [ffff810175623b18] uhci_reset_hc at ffffffff801e9122 #12 [ffff810175623b38] uhci_hc_died at ffffffff880249f1 #13 [ffff810175623b48] uhci_stop at ffffffff88025534 #14 [ffff810175623b68] usb_remove_hcd at ffffffff801dddf7 #15 [ffff810175623b88] usb_hcd_pci_remove at ffffffff801e682d #16 [ffff810175623ba8] pci_device_remove at ffffffff8014f3f6 #17 [ffff810175623bc8] __device_release_driver at ffffffff801af24a #18 [ffff810175623be8] device_release_driver at ffffffff801af4f5 #19 [ffff810175623c08] bus_remove_device at ffffffff801aeb1d #20 [ffff810175623c28] device_del at ffffffff801ad985 #21 [ffff810175623c58] device_unregister at ffffffff801ad9fc #22 [ffff810175623c68] pci_stop_dev at ffffffff8014c5cc #23 [ffff810175623c78] pci_remove_bus_device at ffffffff8014c6e1 #24 [ffff810175623c98] pci_remove_behind_bridge at ffffffff8014c76e #25 [ffff810175623cb8] pci_remove_bus_device at ffffffff8014c6bf #26 [ffff810175623cd8] remove_io at ffffffff8826dbbd #27 [ffff810175623d08] io_state_change at ffffffff882705a1 #28 [ffff810175623d28] IoStateChange at ffffffff882cb925 #29 [ffff810175623e38] CcIoPoll at ffffffff882cdce7 #30 [ffff810175623ee8] io_poll at ffffffff88270855 #31 [ffff810175623f48] kernel_thread at ffffffff8005dfb1 crash> bt ffff81017ebeb860 PID: 737 TASK: ffff81017ebeb860 CPU: 3 COMMAND: "udevd" #0 [ffff8101055b1f30] crash_nmi_callback at ffffffff80079a6d #1 [ffff8101055b1f40] do_nmi at ffffffff800659cc #2 [ffff8101055b1f50] nmi at ffffffff80064e47 [exception RIP: find_next_bit+0x64] RIP: ffffffff800416d2 RSP: ffff81010571fdc8 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff80424268 RCX: 0000000000000002 RDX: 0000000000000000 RSI: 00000000000000bf RDI: ffffffff80424270 RBP: 000000000000b2b4 R8: 0000000000000008 R9: 0000000000000038 R10: ffffffff80424260 R11: 0000000000000000 R12: 000000000000003a R13: 0000000000049b78 R14: 000000001c546b7c R15: 0000000000b8e925 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <exception stack> --- #3 [ffff81010571fdc8] find_next_bit at ffffffff800416d2 #4 [ffff81010571fdd0] __next_cpu at ffffffff80142ffc #5 [ffff81010571fde0] show_stat at ffffffff80043dde #6 [ffff81010571feb0] seq_read at ffffffff8003f15b #7 [ffff81010571ff10] vfs_read at ffffffff8000b332 #8 [ffff81010571ff40] sys_read at ffffffff80011710 #9 [ffff81010571ff80] system_call at ffffffff8005d116 RIP: 00002af223b2c0d0 RSP: 00007fff878c97c0 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff8005d116 RCX: 00002af2233eb8d0 RDX: 0000000000001000 RSI: 00002af2231ec000 RDI: 0000000000000008 RBP: 00007fff878c9380 R8: 00000000ffffffff R9: 0000000000000000 R10: 0000000000000022 R11: 0000000000000246 R12: 00000000000003ff R13: 0000000000000000 R14: 00002af22420fc20 R15: 000000002420fc20 ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b crash> bt ffff810176cb4100 PID: 11943 TASK: ffff810176cb4100 CPU: 6 COMMAND: "pcscd" #0 [ffff81016b97bbe0] crash_kexec at ffffffff800aaa1f #1 [ffff81016b97bca0] __die at ffffffff800650af #2 [ffff81016b97bce0] do_page_fault at ffffffff80066aa1 #3 [ffff81016b97bdd0] error_exit at ffffffff8005dde9 [exception RIP: usbdev_read+0x106] RIP: ffffffff801e637e RSP: ffff81016b97be88 RFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff810175669180 RCX: 0000000000000012 RDX: 0000000000000000 RSI: 0000000000000296 RDI: ffff810040a2e228 RBP: 0000000000000012 R8: 0000000000000002 R9: 0000000000000012 R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000008 R13: ffff810040a2e000 R14: 0000000000000000 R15: ffff81016b97bf50 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #4 [ffff81016b97be80] usbdev_read at ffffffff801e62c1 #5 [ffff81016b97bf10] vfs_read at ffffffff8000b332 #6 [ffff81016b97bf40] sys_read at ffffffff80011710 #7 [ffff81016b97bf80] tracesys at ffffffff8005d28d (via system_call) RIP: 0000003289c0cc8b RSP: 0000000041b7be00 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: ffffffff8005d28d RCX: ffffffffffffffff RDX: 0000000000000008 RSI: 0000000041b7def0 RDI: 0000000000000008 RBP: 0000000000000008 R8: 0000000000000000 R9: 000000001ff679e2 R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000 R13: 000000001ff669b0 R14: 0000000041b7ded0 R15: 0000000141b7def0 ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b
Created attachment 313771 [details] Test patch for i_private
Dear Robert, thanks a lot for the trace. Please try the patch in comment #28.
I assume the patch in comment 28 is a proposed fix. Therefore we will test with just that patch applied to kernel 2.6.18-92.1.10.el5. Because there is no instrumentation to indicate the panic situation occurred but was avoided, we will need to run for a long time without a panic to have confidence in the fix. I plan to execute 10000 surprise IO removals during this test.
Robert - any luck with the recent test?
(In reply to comment #31) > Robert - any luck with the recent test? We are about 40% through the long duration run with no problems detected. If there are no problems with the test environment, I hope to have results next week.
(In reply to comment #31) > Robert - any luck with the recent test? At this time, we have more than 9570 IO subsystem removals. During this test, the kernel paniced twice due to bug 455460. However, there were no other panics. I consider the fix for the present bug (450786), attached to comment 28, to be qualified by this testing.
Pete - we should be good to go with the patch Stratus tested if you think that's the right patchset to propose for 5.3, since we are getting close to the beta patch deadline... Are you needing anything from Status from here on out? Thanks!
in kernel-2.6.18-107.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
~~~ Attention Partners! ~~~ Please test this URGENT / HIGH priority bug at your earliest convenience to ensure it makes it into the upcoming RHEL 5.3 release. The fix should be present in the Partner Snapshot #2 (kernel*-122), available NOW at ftp://partners.redhat.com. As we are approaching the end of the RHEL 5.3 test cycle, it is critical that you report back testing results as soon as possible. If you have VERIFIED the fix, please add PartnerVerified to the Bugzilla Keywords field to indicate this. If you find that this issue has not been properly fixed, set the bug status to ASSIGNED with a comment describing the issues you encountered. All NEW issues encountered (not part of this bug fix) should have a new bug created with the proper keywords and flags set to trigger a review for their inclusion in the upcoming RHEL 5.3 or other future release. Post a link in this bugzilla pointing to the new issue to ensure it is not overlooked. For any additional questions, speak with your Partner Manager.
This fix was verified when I tested the patch attached to comment 28. That took almost two weeks of run time. I've examined the source of /kernel-2.6.18-122.el5 and verified the same fix is in place. On that basis I verify the fix is present. I am adding the PartnerVerified keyword.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html