Description of problem: Watchdog detected LOCKUP __make_request was waiting on q->queue_lock. Version-Release number of selected component (if applicable): kernel-2.6.9-42.0.8.ELsmp How reproducible: Not attempted Steps to Reproduce: Unknown Actual results: Watchdog Timeout NMI Expected results: No Panic Additional info: crash> bt PID: 2279 TASK: 10037d41030 CPU: 3 COMMAND: "kjournald" #0 [10428f92bd0] start_disk_dump at ffffffffa013f36d #1 [10428f92c00] try_crashdump at ffffffff8014bd05 #2 [10428f92c10] die at ffffffff80111c00 #3 [10428f92c30] do_invalid_op at ffffffff80111fc8 #4 [10428f92cf0] error_exit at ffffffff80110d91 [exception RIP: panic+211] RIP: ffffffff8013794e RSP: 0000010428f92da8 RFLAGS: 00010086 RAX: 000000000000002c RBX: ffffffff8031e13d RCX: 0000000000000046 RDX: 0000000000009f36 RSI: 0000000000000046 RDI: ffffffff803e2480 RBP: 0000010428f92f58 R8: 0000000000000001 R9: ffffffff8031e13d R10: 0000000000000000 R11: 0000ffff803fd180 R12: 0000000000000008 R13: 0000000004c371dd R14: 00000103e170f888 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [10428f92da0] panic at ffffffff8013793a #6 [10428f92e80] die_nmi at ffffffff80111c91 #7 [10428f92ea0] nmi_watchdog_tick at ffffffff8011d055 #8 [10428f92ee0] default_do_nmi at ffffffff8011255e #9 [10428f92f40] do_nmi at ffffffff8011d10b [exception RIP: .text.lock.spinlock+46] RIP: ffffffff8030b586 RSP: 0000010227379928 RFLAGS: 00000086 RAX: 00000103e170f888 RBX: 0000010037c0fb5c RCX: 0000000000000000 RDX: 00000103e170f888 RSI: 0000000000000000 RDI: 0000010037c0fb5c RBP: 00000100cfc14080 R8: 0000000000000000 R9: 0000010412f6ec20 R10: 0000000000000216 R11: ffffffff80256741 R12: 0000000000000008 R13: 0000000004c371dd R14: 00000103e170f888 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 --- <exception stack> --- #10 [10227379928] .text.lock.spinlock at ffffffff8030b586 (via _spin_lock_irq) #11 [10227379930] __make_request at ffffffff8024f3fb #12 [10227379980] generic_make_request at ffffffff8024fa42 #13 [102273799d0] __split_bio at ffffffffa003f588 #14 [10227379a80] dm_request at ffffffffa003f982 #15 [10227379aa0] generic_make_request at ffffffff8024fa42 #16 [10227379af0] submit_bio at ffffffff8024fb4e #17 [10227379b50] submit_bh at ffffffff8017baaa #18 [10227379b80] ll_rw_block at ffffffff8017bb2e #19 [10227379bb0] journal_commit_transaction at ffffffffa0053a88 #20 [10227379e80] kjournald at ffffffffa0056914 #21 [10227379f50] kernel_thread at ffffffff80110f47
It looks as though the disks in the system are connected to the cciss controller. Is that right? The lpfc module is loaded, but it looks unused. The cciss driver calls into blk_init_queue to register a request function and a queue spinlock (to protect the request queue). In the case of the cciss driver, this lock is embedded in a list of ctlr_info structures. Now, the reason that the spinlock call is hung is that the spinlock data structure has been corrupted: lock = { magic = 0xdead4ead, lock = 0x10037c20000, babble = 0x37c20000, module = 0x10037c12000 "", owner = 0x37c12000 <Address 0x37c12000 out of bounds>, oline = 0x330e9d20 }, The magic is correct, but everything else looks corrupt. So, it appears that something has overwritten a part of the ctrl_info structure.
I found that there is a bit more corruption - another piece of the hba data structure is: gendisk = {0x10428ff1c00, 0x10428ff1a00, 0x10428ff1800, 0x10428ff1600, 0x10428 ff1400, 0x10428ff1200, 0x10428ff2e00, 0x10428ff2c00, 0x10428ff2a00, 0x10428ff280 0, 0x10037e4c800, 0x10037e4c600, 0x0, 0x0, 0x0, 0x0}, Most of the gendisks look valid, but the last one does not (though the pointer looks like it should be valid): struct gendisk { major = 0, first_minor = 0, minors = 935428096, disk_name = "\000\001\000\000\000\200<C1>7\000\001\000\000<A8>\202<C1>7\000\00 1\000\000P\205<C1>7\000\001\000\000<F8>\207<C1>7", part = 0x10037c18aa0, fops = 0x10037c18d48, queue = 0x10037c18ff0, private_data = 0x10037c19298, capacity = 1100447061312, flags = 935434216, devfs_name = "\000\001\000\000\220\232<C1>7\000\001\000\0008\235<C1>7\000\001\ 000\000<E0>\237<C1>7\000\001\000\000\210<A2><C1>7\000\001\000\0000<A5><C1>7\000\ 001\000\000ا<C1>7\000\001\000\000\200<AA><C1>7\000\001\000\000(<AD><C1>7", number = 256, driverfs_dev = 0x10037c1afd0, kobj = { k_name = 0x10037c1b278 "", name = " <B5><C1>7\000\001\000\000<U+0237><C1>7\000\001\000\000p<BA><C1>7", kref = { refcount = { counter = 256 } }, entry = { next = 0x10037c1bd18, prev = 0x10037c1bfc0 }, parent = 0x10037c1c268, kset = 0x10037c1c510, ktype = 0x10037c1c7b8, dentry = 0x10037c1ca60 }, random = 0x10037c1cd08, policy = 935448496, sync_io = { counter = 256 }, stamp = 1100447076952, stamp_idle = 1100447077632, in_flight = 935450536, dkstats = 0x37c18000 }
I dumped the data pointed to by the bad gendisk pointer. It looks like a bunch of pointers followed by 21, and a spin lock: crash> rd 0x10037e4c600 296 10037e4c600: 0000000000000000 0000010037c18000 ...........7.... 10037e4c610: 0000010037c18000 0000010037c182a8 ...7.......7.... 10037e4c620: 0000010037c18550 0000010037c187f8 P..7.......7.... 10037e4c630: 0000010037c18aa0 0000010037c18d48 ...7....H..7.... 10037e4c640: 0000010037c18ff0 0000010037c19298 ...7.......7.... 10037e4c650: 0000010037c19540 0000010037c197e8 @..7.......7.... 10037e4c660: 0000010037c19a90 0000010037c19d38 ...7....8..7.... 10037e4c670: 0000010037c19fe0 0000010037c1a288 ...7.......7.... 10037e4c680: 0000010037c1a530 0000010037c1a7d8 0..7.......7.... 10037e4c690: 0000010037c1aa80 0000010037c1ad28 ...7....(..7.... 10037e4c6a0: 0000010037c1afd0 0000010037c1b278 ...7....x..7.... 10037e4c6b0: 0000010037c1b520 0000010037c1b7c8 ..7.......7.... 10037e4c6c0: 0000010037c1ba70 0000010037c1bd18 p..7.......7.... 10037e4c6d0: 0000010037c1bfc0 0000010037c1c268 ...7....h..7.... 10037e4c6e0: 0000010037c1c510 0000010037c1c7b8 ...7.......7.... 10037e4c6f0: 0000010037c1ca60 0000010037c1cd08 `..7.......7.... 10037e4c700: 0000010037c1cfb0 0000010037c1d258 ...7....X..7.... 10037e4c710: 0000010037c1d500 0000010037c1d7a8 ...7.......7.... 10037e4c720: 0000000037c18000 0000000000000021 ...7....!....... 10037e4c730: dead4ead00000001 0000000000000000 .....N.......... 10037e4c740: 0000000000000000 0000000000000000 ................ I dumped the data pointed to a few of those pointers. They seem to point to blocks that were zeroed out.