Description of problem: Unable to handle kernel paging request at 0000000100000248 RIP: <ffffffff8028166c>{cdrom_analyze_sense_data+53} PML4 1dc1b3067 PGD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: netconsole netdump autofs4 i2c_dev i2c_core ib_srp ib_sdp ib_ipoib inet_lro md5 ipv6 rdma_ucm rdma_cm iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core dm_round_robin dm_multipath button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 bonding(U) floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd raid1 raid0 dm_mod qla2300 qla2xxx scsi_transport_fc mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod Pid: 15825, comm: oracle Not tainted 2.6.9-89.0.20.ELsmp RIP: 0010:[<ffffffff8028166c>] <ffffffff8028166c>{cdrom_analyze_sense_data+53} RSP: 0000:00000101fabcbe18 EFLAGS: 00010246 RAX: 0000000000000200 RBX: 0000000000000000 RCX: 0000000100000246 RDX: 0000000100000200 RSI: 00000100c03cbd18 RDI: ffffffff804e4f90 RBP: 0000000100000246 R08: 00000000000d0cdd R09: 000000000f8d0954 R10: 0000000000000246 R11: 0000000000000246 R12: 00000100c03cbd18 R13: ffffffff804e4f90 R14: ffffffff804e4f90 R15: 0000000000000001 FS: 0000002a975ad920(0000) GS:ffffffff80505000(0000) knlGS:00000000f6aed6c0 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000100000248 CR3: 0000000001032000 CR4: 00000000000006e0 Process oracle (pid: 15825, threadinfo 0000010054214000, task 00000100716ca030) Stack: 0000000000002706 ffffffff804e4f90 0000010080011460 0000000000000001 0000769a0788937e 00000101f37377f0 000001008007fc80 0000000000000000 00000101fabcbe88 ffffffff80132cc7 Call Trace:<IRQ> <ffffffff80132cc7>{activate_task+124} <ffffffff80281caf>{cdrom_end_request+129} <ffffffff802829e0>{cdrom_pc_intr+229} <ffffffff802828fb>{cdrom_pc_intr+0} <ffffffff802767f5>{ide_intr+379} <ffffffff80112ff6>{handle_IRQ_event+41} <ffffffff80113270>{do_IRQ+197} <ffffffff801108c3>{ret_from_intr+0} <EOI> Code: 0f b6 45 02 83 e0 0f 0f b6 c0 83 f8 06 77 37 89 c0 ff 24 c5 RIP <ffffffff8028166c>{cdrom_analyze_sense_data+53} RSP <00000101fabcbe18> CR2: 0000000100000248 Version-Release number of selected component (if applicable): 2.6.9-89.20.EL How reproducible: Reported to have occurred several times. Steps to Reproduce: Unknown - awaiting clarification from reporter. Actual results: Above oops. Expected results: No oops. Additional info:
Created attachment 409758 [details] dmesg leading up to panic dmesg leading up to panic in comment #0
At the time of the oops, CPU1 was processing an IDE interrupt that ended up calling cdrom_analyze_sense_data: crash> set 15825 PID: 15825 COMMAND: "oracle" TASK: 100716ca030 [THREAD_INFO: 10054214000] CPU: 1 STATE: TASK_RUNNING (PANIC) crash> bt PID: 15825 TASK: 100716ca030 CPU: 1 COMMAND: "oracle" #0 [101fabcbc50] netpoll_start_netdump at ffffffffa02ac366 #1 [101fabcbc80] do_page_fault at ffffffff80124c50 #2 [101fabcbd60] error_exit at ffffffff80110e21 [exception RIP: cdrom_analyze_sense_data+53] RIP: ffffffff8028166c RSP: 00000101fabcbe18 RFLAGS: 00010246 RAX: 0000000000000200 RBX: 0000000000000000 RCX: 0000000100000246 RDX: 0000000100000200 RSI: 00000100c03cbd18 RDI: ffffffff804e4f90 RBP: 0000000100000246 R8: 00000000000d0cdd R9: 000000000f8d0954 R10: 0000000000000246 R11: 0000000000000246 R12: 00000100c03cbd18 R13: ffffffff804e4f90 R14: ffffffff804e4f90 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 #3 [101fabcbe90] cdrom_end_request at ffffffff80281caf #4 [101fabcbed0] cdrom_pc_intr at ffffffff802829e0 #5 [101fabcbf10] ide_intr at ffffffff802767f5 #6 [101fabcbf50] handle_IRQ_event at ffffffff80112ff6 #7 [101fabcbf80] do_IRQ at ffffffff80113270 --- <IRQ stack> --- #8 [10054215f58] ret_from_intr at ffffffff801108c3 RIP: ffffffffff6000a5 RSP: 0000007fbfffe0c8 RFLAGS: 00000246 RAX: 00000000ee74565f RBX: 000482078624efcb RCX: 00000000000d08f5 RDX: 000000000000038a RSI: 0000000000000000 RDI: 0000007fbfffe0f0 RBP: 0000007fbfffe100 R8: 00000000000d0cdd R9: 000000000f8d0954 R10: 000000004ba1667a R11: 0000000000000246 R12: 000000000a19e480 R13: 0000000060004650 R14: 0000000000004d58 R15: 0000000000003cd8 ORIG_RAX: ffffffffffffff0f CS: 0033 SS: 002b The actual oops takes place in cdrom_log_sense(): 388 void cdrom_analyze_sense_data(ide_drive_t *drive, 389 struct request *failed_command, 390 struct request_sense *sense) 391 { 392 unsigned long sector; 393 unsigned long bio_sectors; 394 unsigned long valid; 395 396 if (!cdrom_log_sense(drive, failed_command, sense)) <--- 397 return; [...] 342 static int cdrom_log_sense(ide_drive_t *drive, struct request *rq, 343 struct request_sense *sense) 344 { 345 int log = 0; 346 347 if (!sense || !rq || (rq->flags & REQ_QUIET)) 348 return 0; 349 350 switch (sense->sense_key) { <--- 1st access to *sense 351 case NO_SENSE: case RECOVERED_ERROR: 352 break; [...] Pulling out the ide_drive_t: crash> ide_drive_s ffffffff804e4f90 struct ide_drive_s { name = "hdc", driver_req = "\000\000\000\000\000\000\000\000\000", queue = 0x10080de03d0, rq = 0x0, next = 0xffffffff804e4f90, driver = 0xffffffff804228c0, driver_data = 0x10080011400, id = 0x10037ff5e00, proc = 0x100081a8080, settings = 0x10080db7b00, devfs_name = "ide/host0/bus1/target0/lun0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", hwif = 0xffffffff804e4e48, sleep = 0, service_start = 4425133547, service_time = 0, timeout = 0, special = { all = 0, b = { set_geometry = 0, recalibrate = 0, set_multmode = 0, set_tune = 0, serviced = 0, reserved = 0 } }, select = { all = 160, b = { head = 0, unit = 0, bit5 = 1, lba = 0, bit7 = 1 } }, keep_settings = 0 '\0', autodma = 1 '\001', using_dma = 1 '\001', retry_pio = 0 '\0', state = 0 '\0', waiting_for_dma = 0 '\0', unmask = 1 '\001', bswap = 0 '\0', dsc_overlap = 0 '\0', nice1 = 1 '\001', present = 1, dead = 0, id_read = 1, noprobe = 0, removable = 1, attach = 0, is_flash = 0, forced_geom = 0, no_unmask = 0, no_io_32bit = 0, atapi_overlap = 0, nice0 = 0, nice2 = 0, doorlocking = 0, autotune = 1, remap_0_to_1 = 0, blocked = 0, vdma = 0, stroke = 0, addressing = 0, scsi = 0, quirk_list = 0 '\0', suspend_reset = 0 '\0', init_speed = 12 '\f', pio_speed = 0 '\0', current_speed = 66 'B', dn = 2 '\002', wcache = 0 '\0', acoustic = 0 '\0', media = 5 '\005', ctl = 8 '\b', ready_stat = 0 '\0', mult_count = 0 '\0', mult_req = 0 '\0', tune_req = 0 '\0', io_32bit = 1 '\001', bad_wstat = 161 '<A1>', nowerr = 0 '\0', sect0 = 0 '\0', head = 0 '\0', sect = 0 '\0', bios_head = 0 '\0', bios_sect = 0 '\0', doing_barrier = 0 '\0', bios_cyl = 0, cyl = 0, drive_data = 0, usage = 1, failures = 0, max_failures = 1, capacity64 = 0, probed_capacity = 8388604, lun = 0, crc_count = 0, list = { next = 0xffffffff80422a18, prev = 0xffffffff80422a18 }, gendev = { node = { next = 0xffffffff804e5700, prev = 0xffffffff804e5700 }, bus_list = { next = 0xffffffff80421c68, prev = 0xffffffff80421c68 }, driver_list = { next = 0xffffffff804e50e0, prev = 0xffffffff804e50e0 }, children = { next = 0xffffffff804e50f0, prev = 0xffffffff804e50f0 }, parent = 0xffffffff804e56d0, kobj = { k_name = 0xffffffff804e5110 "1.0", name = "1.0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", kref = { refcount = { counter = 5 } }, entry = { next = 0x101fa84b1b0, prev = 0xffffffff804e5738 }, parent = 0xffffffff804e5718, kset = 0xffffffff804190a0, ktype = 0x0, dentry = 0x1000818b318 }, bus_id = "1.0\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", bus = 0xffffffff80421b40, driver = 0xffffffff80422960, driver_data = 0xffffffff804e4f90, platform_data = 0x0, power = { power_state = 0, prev_state = 0, saved_state = 0x0, pm_users = { counter = 0 }, pm_parent = 0xffffffff804e56d0, entry = { next = 0x101fa84b238, prev = 0xffffffff804e57c0 } }, power_state = 0, saved_state = 0x0, detach_state = 0, dma_mask = 0x0, coherent_dma_mask = 0, dma_pools = { next = 0xffffffff804e51e8, prev = 0xffffffff804e51e8 }, dma_mem = 0x0, release = 0xffffffff80279c38 <drive_release_dev> }, gendev_rel_sem = { count = { counter = 0 }, sleepers = 0, wait = { lock = { lock = 1, magic = 3735899821 }, task_list = { next = 0xffffffff804e5218, prev = 0xffffffff804e5218 } } }, disk = 0x10037ff5a00 } From that get back to the rq via the hwgroup from the hwif: crash> ide_drive_s ffffffff804e4f90 | grep hwif hwif = 0xffffffff804e4e48, crash> hwif_s 0xffffffff804e4e48|grep hwg hwgroup = 0x10037ff5200, Using: struct request *rq = HWGROUP(drive)->rq; #define HWGROUP(drive) ((ide_hwgroup_t *)(HWIF(drive)->hwgroup)) #define HWIF(drive) ((ide_hwif_t *)((drive)->hwif)) crash> hwgroup_s 0x10037ff5200 | grep rq rq = 0x10080011460, crash> request 0x10080011460 struct request { queuelist = { next = 0x10080011460, prev = 0x10080011460 }, flags = 264256, sector = 0, nr_sectors = 0, current_nr_sectors = 0, hard_sector = 0, hard_nr_sectors = 0, hard_cur_sectors = 0, nr_cbio_segments = 0, nr_cbio_sectors = 0, cbio = 0x0, bio = 0x0, biotail = 0x0, elevator_private = 0x0, rq_status = -1, rq_disk = 0x10037ff5a00, errors = 0, start_time = 0, nr_phys_segments = 0, nr_hw_segments = 0, tag = 0, buffer = 0x100c03cbd18 "<FF>?<F9><A8>", ref_count = 0, q = 0x0, rl = 0x0, waiting = 0x0, special = 0x0, cmd_len = 0, cmd = "\003\000\000\000\022\000\000\000\000\000\000\000\000\000\000", data_len = 0, data = 0x100c03cbcaa, sense_len = 18, sense = 0x0, timeout = 60000, pm = 0x0 } At this point, rq->buffer should point to the "original" (failed) request but it's garbage: crash> request 0x100c03cbd18 struct request { queuelist = { next = 0x100a8f93fff, prev = 0xffffffff801f1077 }, flags = 0, sector = 18446744071564103241, nr_sectors = 4397712, current_nr_sectors = 937394277, hard_sector = 937394277, hard_nr_sectors = 18446744071563492786, hard_cur_sectors = 0, nr_cbio_segments = 0, nr_cbio_sectors = 18446744071563558405, cbio = 0x101f6055180, bio = 0x101f8769028, biotail = 0xa31190, elevator_private = 0x101f88e49f8, rq_status = 48, rq_disk = 0x100c03cbe68, errors = 937394279, start_time = 937394277, nr_phys_segments = 32869, nr_hw_segments = 14303, tag = 0, buffer = 0x37df8065 <Address 0x37df8065 out of bounds>, ref_count = -165768824, q = 0x10038def030, rl = 0xa31190, waiting = 0x101fa07a040, special = 0x101f88e49f8, cmd_len = 2148977923, cmd = "<FF><FF><FF><FF>\000\000\000\000\000\000\000\000\000\000\000", data_len = 0, data = 0x2a95573, sense_len = 18, sense = 0x100000246, <---- crash value timeout = 3588521672, pm = 0x101f2a87148 } At the same time on CPU0 hald is issuing an open(2) on the same IDE CD device that's issuing a cdrom_check_status as part of check_disk_change()->media_changed()->ide_cdrom_check_media_change_real(): crash> set 7210 PID: 7210 COMMAND: "hald" TASK: 100790b27f0 [THREAD_INFO: 10078e0a000] CPU: 0 STATE: TASK_UNINTERRUPTIBLE crash> bt PID: 7210 TASK: 100790b27f0 CPU: 0 COMMAND: "hald" #0 [10078e0b738] schedule at ffffffff80314c64 #1 [10078e0b810] wait_for_completion at ffffffff80314ea8 #2 [10078e0b890] ide_do_drive_cmd at ffffffff80276973 #3 [10078e0b900] cdrom_queue_packet_command at ffffffff80282b99 #4 [10078e0b970] cdrom_check_status at ffffffff802839bc #5 [10078e0bab0] ide_cdrom_check_media_change_real at ffffffff80284802 #6 [10078e0bac0] media_changed at ffffffff802888ec #7 [10078e0bae0] check_disk_change at ffffffff8018388c #8 [10078e0bb00] cdrom_open at ffffffff80288475 #9 [10078e0bdc0] idecd_open at ffffffff80284b5c #10 [10078e0bdf0] do_open at ffffffff80183c31 #11 [10078e0be40] blkdev_open at ffffffff80184088 #12 [10078e0be60] __dentry_open at ffffffff8017b8c1 #13 [10078e0bea0] filp_open at ffffffff8017ba9a #14 [10078e0bf50] sys_open at ffffffff8017bc89 #15 [10078e0bf80] system_call at ffffffff801102f6 RIP: 0000002a959d25e2 RSP: 0000007fbffff808 RFLAGS: 00000202 RAX: 0000000000000002 RBX: ffffffff801102f6 RCX: 0000000000649c70 RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00000000006b7db0 RBP: 0000000000000000 R8: 0000000000000000 R9: 000000000f8d0952 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00000000006b3ef0 R14: 00000000006b72f0 R15: 0000000000000000 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b This isn't code that I'm too familiar with but I'm trying to understand if the hald->open() and cdrom_pc_intr() paths are racing here or if there's something else going on to have left the bogus pointer in the request..
There's a known issue with some IDE CDROM drivers, where it takes a long time to check if a CDROM is inserted, and if the CDROM is empty or not. The proper fix is to lock subsequent calls, to avoid OOPS/Panic. This is what happens on RHEL5. As hald periodically checks if a CDROM is inserted, in order to mount the driver and open a window at the X interface, if, for some reason, the previous check didn't ended, this could cause troubles. On RHEL5, the adopted solution were to lock the calls to disk status stuff. Sometimes, this results on long wait periods (see bz#500416). While such lock is not backported to RHEL4, a workaround for this issue is to disable hald polling for CDROM.
(In reply to comment #8) > Another option would be to provide the customer a custom kernel with cdrom > compiled as module, and blacklist it via /etc/modprobe.conf Another solution that won't require using a custom kernel would be to disable it via boot parameter, with something like: hdc=none (assuming that the cdrom is at hdc on all machines)
Another case with a recent 4.8 kernel looks to be suffering from the same root issue. From the core: PID: 2504 TASK: 10094902030 CPU: 0 COMMAND: "ls" #0 [ffffffff8046e610] netpoll_start_netdump at ffffffffa016c366 #1 [ffffffff8046e640] do_page_fault at ffffffff80124c50 #2 [ffffffff8046e720] error_exit at ffffffff80110e21 [exception RIP: __end_that_request_first+202] RIP: ffffffff80257a6d RSP: ffffffff8046e7d8 RFLAGS: 00010006 RAX: 0000000000000000 RBX: 0000000000000046 RCX: 0000000000000046 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000100e5ddf968 RBP: 0000014a00000000 R8: 000000000000ea69 R9: 0000000000000206 R10: ffffffff802769c9 R11: ffffffff802769c9 R12: 0000000000000000 R13: 00000100e5ddf968 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 #3 [ffffffff8046e810] cdrom_end_request at ffffffff80281d07 #4 [ffffffff8046e850] cdrom_pc_intr at ffffffff802829f0 #5 [ffffffff8046e890] ide_intr at ffffffff80276805 #6 [ffffffff8046e8d0] handle_IRQ_event at ffffffff80112ff6 #7 [ffffffff8046e900] do_IRQ at ffffffff80113270 --- <IRQ stack> --- #8 [1009723bf58] ret_from_intr at ffffffff801108c3 RIP: 00000037fcc106b3 RSP: 0000007fbfffee78 RFLAGS: 00000287 RAX: 0000000000000017 RBX: 0000007fbfffeef0 RCX: 00000037fd033a18 RDX: 00000000000005e8 RSI: 0000000000000000 RDI: 00000037fd033a18 RBP: 0000007fbffff1f0 R8: 0000000000000000 R9: 00000037fd034000 R10: 0000000000000812 R11: 0000000000000206 R12: 00000037fd033a18 R13: 00000037fd038028 R14: 0000002a956664a0 R15: 0000007fbffff298 ORIG_RAX: ffffffffffffff0e CS: 0033 SS: 002b In this case, the ide-cd driver made it a little past cdrom_analyze_sense_data() and died in __end_that_request_first(). The bad request at 0x100e5ddf968 is a stack address and looks to have been from a call to cdrom_check_status() from within ide_cdrom_drive_status(). The cdrom_check_status() has completed and the process continued on to call cdrom_get_media_event() crash> bt 3881 PID: 3881 TASK: 100ed033030 CPU: 0 COMMAND: "coda" #0 [100e5ddf6b8] schedule at ffffffff80314c74 #1 [100e5ddf790] wait_for_completion at ffffffff80314eb8 #2 [100e5ddf810] ide_do_drive_cmd at ffffffff80276983 #3 [100e5ddf880] cdrom_queue_packet_command at ffffffff80282ba9 #4 [100e5ddf8f0] ide_cdrom_packet at ffffffff80284154 #5 [100e5ddfa20] cdrom_get_media_event at ffffffff80287767 #6 [100e5ddfaa0] ide_cdrom_drive_status at ffffffff8028469e #7 [100e5ddfb00] cdrom_open at ffffffff80287d8f #8 [100e5ddfdc0] idecd_open at ffffffff80284b6c #9 [100e5ddfdf0] do_open at ffffffff80183e15 #10 [100e5ddfe40] blkdev_open at ffffffff8018409c #11 [100e5ddfe60] __dentry_open at ffffffff8017b8d5 #12 [100e5ddfea0] filp_open at ffffffff8017baae #13 [100e5ddff50] sys32_open at ffffffff80128d8a This double attempt to complete the request is being caused by the way the ide-cd's sense requests interact with the noop io scheduler. From the attached dmesg and looking at the new cores, the original system and the new case were all using the noop scheduler instead of the default cfq. In RHEL4, the noop scheduler isn't honoring the ELEVATOR_INSERT_FRONT flag. Instead, it always appends requests to the end of the queue. The ide-cd driver using a single, pre-allocated request (cdrom_info.request_sense_request) for each cdrom to gather sense information. Whenever this special request is used, ide uses ELEVATOR_INSERT_FRONT to place it at the front of the queue. This should insure that the single special request is run right after the failed command. Since the noop scheduler ignores ELEVATOR_INSERT_FRONT, if another command was already in the queue and it also fails, the request_sense_request request will be used again, overwriting the pointers to the original failed request and being inserted into the queue list a second time. Thus, the queue list is now also corrupted. Sense blkdev_dequeue_request() uses list_del_init(), no poisoned pointers are ever inserted into the corrupted list. Instead, the corrupt list causes the request_sense_request to be executed multiple times. The extra completion then causes the system to crash since the failed command was already freed by the first completion of request_sense_request. In summary, the RHEL4 noop io scheduler's current behavior which ignores ELEVATOR_INSERT_FRONT is incompatible with ide-cd's handling of sense requests.
I'm trying to follow this more closely and a bit confused by "noop scheduler ignores ELEVATOR_INSERT_FRONT". Looking at: drivers/block/noop-iosched.c in 2.6.9-89.0.26.EL 38 void elevator_noop_add_request(request_queue_t *q, struct request *rq, 39 int where) 40 { 41 struct list_head *insert = q->queue_head.prev; 42 43 if (where == ELEVATOR_INSERT_FRONT) 44 insert = &q->queue_head; 45 46 list_add_tail(&rq->queuelist, &q->queue_head); 47 48 /* 49 * new merges must not precede this barrier 50 */ 51 if (rq->flags & REQ_HARDBARRIER) 52 q->last_merge = NULL; 53 else if (!q->last_merge) 54 q->last_merge = rq; 55 } It seems to be that this would put our request at the head of the queue. ide_do_drive_cmd() seems a little tricky and may try to use ELEVATOR_INSERT_BACK. That would seem to indicate the problem is in the driver, not the elevator. Or did I misunderstand?
Created attachment 490591 [details] dmesg from bootup of affected system Here is their dmesg from startup of the system that panicked. We're still waiting to hear back on whether they were using the IDE drive, or what they were doing at the time. They have 11 "identical" servers, and only 6 of them have demonstrated the issue thus far. They are unsure what the difference is between them. FWIW, they've removed noop for now and have been stable since.
Prarit, Any update on where we stand with this or what else needs to be done? Thanks very much, Justin