493429 – Watchdog Timeout while waiting for q->queue_lock in __make_request

Bug 493429 - Watchdog Timeout while waiting for q->queue_lock in __make_request

Summary: Watchdog Timeout while waiting for q->queue_lock in __make_request

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.4
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Red Hat Kernel Manager
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-04-01 18:00 UTC by Marc Milgram
Modified:	2018-10-27 14:29 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-06-07 04:27:35 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Marc Milgram 2009-04-01 18:00:52 UTC

Description of problem:
Watchdog detected LOCKUP
__make_request was waiting on q->queue_lock.

Version-Release number of selected component (if applicable):
kernel-2.6.9-42.0.8.ELsmp

How reproducible:
Not attempted

Steps to Reproduce:
Unknown
  
Actual results:
Watchdog Timeout NMI

Expected results:
No Panic

Additional info:
crash> bt
PID: 2279   TASK: 10037d41030       CPU: 3   COMMAND: "kjournald"
 #0 [10428f92bd0] start_disk_dump at ffffffffa013f36d
 #1 [10428f92c00] try_crashdump at ffffffff8014bd05
 #2 [10428f92c10] die at ffffffff80111c00
 #3 [10428f92c30] do_invalid_op at ffffffff80111fc8
 #4 [10428f92cf0] error_exit at ffffffff80110d91
    [exception RIP: panic+211]
    RIP: ffffffff8013794e  RSP: 0000010428f92da8  RFLAGS: 00010086
    RAX: 000000000000002c  RBX: ffffffff8031e13d  RCX: 0000000000000046
    RDX: 0000000000009f36  RSI: 0000000000000046  RDI: ffffffff803e2480
    RBP: 0000010428f92f58   R8: 0000000000000001   R9: ffffffff8031e13d
    R10: 0000000000000000  R11: 0000ffff803fd180  R12: 0000000000000008
    R13: 0000000004c371dd  R14: 00000103e170f888  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [10428f92da0] panic at ffffffff8013793a
 #6 [10428f92e80] die_nmi at ffffffff80111c91
 #7 [10428f92ea0] nmi_watchdog_tick at ffffffff8011d055
 #8 [10428f92ee0] default_do_nmi at ffffffff8011255e
 #9 [10428f92f40] do_nmi at ffffffff8011d10b
    [exception RIP: .text.lock.spinlock+46]
    RIP: ffffffff8030b586  RSP: 0000010227379928  RFLAGS: 00000086
    RAX: 00000103e170f888  RBX: 0000010037c0fb5c  RCX: 0000000000000000
    RDX: 00000103e170f888  RSI: 0000000000000000  RDI: 0000010037c0fb5c
    RBP: 00000100cfc14080   R8: 0000000000000000   R9: 0000010412f6ec20
    R10: 0000000000000216  R11: ffffffff80256741  R12: 0000000000000008
    R13: 0000000004c371dd  R14: 00000103e170f888  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <exception stack> ---
#10 [10227379928] .text.lock.spinlock at ffffffff8030b586 (via _spin_lock_irq)
#11 [10227379930] __make_request at ffffffff8024f3fb
#12 [10227379980] generic_make_request at ffffffff8024fa42
#13 [102273799d0] __split_bio at ffffffffa003f588
#14 [10227379a80] dm_request at ffffffffa003f982
#15 [10227379aa0] generic_make_request at ffffffff8024fa42
#16 [10227379af0] submit_bio at ffffffff8024fb4e
#17 [10227379b50] submit_bh at ffffffff8017baaa
#18 [10227379b80] ll_rw_block at ffffffff8017bb2e
#19 [10227379bb0] journal_commit_transaction at ffffffffa0053a88
#20 [10227379e80] kjournald at ffffffffa0056914
#21 [10227379f50] kernel_thread at ffffffff80110f47

Comment 2 Jeff Moyer 2009-04-02 17:26:13 UTC

It looks as though the disks in the system are connected to the cciss controller.  Is that right?  The lpfc module is loaded, but it looks unused.

The cciss driver calls into blk_init_queue to register a request function and a queue spinlock (to protect the request queue).  In the case of the cciss driver, this lock is embedded in a list of ctlr_info structures.

Now, the reason that the spinlock call is hung is that the spinlock data structure has been corrupted:

  lock = {
    magic = 0xdead4ead, 
    lock = 0x10037c20000, 
    babble = 0x37c20000, 
    module = 0x10037c12000 "", 
    owner = 0x37c12000 <Address 0x37c12000 out of bounds>, 
    oline = 0x330e9d20
  }, 

The magic is correct, but everything else looks corrupt.  So, it appears that something has overwritten a part of the ctrl_info structure.

Comment 3 Marc Milgram 2009-04-21 20:47:34 UTC

I found that there is a bit more corruption - another piece of the hba data structure is:
  gendisk = {0x10428ff1c00, 0x10428ff1a00, 0x10428ff1800, 0x10428ff1600, 0x10428
ff1400, 0x10428ff1200, 0x10428ff2e00, 0x10428ff2c00, 0x10428ff2a00, 0x10428ff280
0, 0x10037e4c800, 0x10037e4c600, 0x0, 0x0, 0x0, 0x0}, 

Most of the gendisks look valid, but the last one does not (though the pointer looks like it should be valid):

struct gendisk {
  major = 0, 
  first_minor = 0, 
  minors = 935428096, 
  disk_name = "\000\001\000\000\000\200<C1>7\000\001\000\000<A8>\202<C1>7\000\00
1\000\000P\205<C1>7\000\001\000\000<F8>\207<C1>7", 
  part = 0x10037c18aa0, 
  fops = 0x10037c18d48, 
  queue = 0x10037c18ff0, 
  private_data = 0x10037c19298, 
  capacity = 1100447061312, 
  flags = 935434216, 
  devfs_name = "\000\001\000\000\220\232<C1>7\000\001\000\0008\235<C1>7\000\001\
000\000<E0>\237<C1>7\000\001\000\000\210<A2><C1>7\000\001\000\0000<A5><C1>7\000\
001\000\000ا<C1>7\000\001\000\000\200<AA><C1>7\000\001\000\000(<AD><C1>7", 
  number = 256, 
  driverfs_dev = 0x10037c1afd0, 
  kobj = {
    k_name = 0x10037c1b278 "", 
    name = " <B5><C1>7\000\001\000\000<U+0237><C1>7\000\001\000\000p<BA><C1>7", 
    kref = {
      refcount = {
        counter = 256
      }
    }, 
    entry = {
      next = 0x10037c1bd18, 
      prev = 0x10037c1bfc0
    }, 
    parent = 0x10037c1c268, 
    kset = 0x10037c1c510, 
    ktype = 0x10037c1c7b8, 
    dentry = 0x10037c1ca60
  }, 
  random = 0x10037c1cd08, 
  policy = 935448496, 
  sync_io = {
    counter = 256
  }, 
  stamp = 1100447076952, 
  stamp_idle = 1100447077632, 
  in_flight = 935450536, 
  dkstats = 0x37c18000
}

Comment 4 Marc Milgram 2009-04-23 19:19:19 UTC

I dumped the data pointed to by the bad gendisk pointer.  It looks like a bunch of pointers followed by 21, and a spin lock:

crash> rd 0x10037e4c600 296
     10037e4c600:  0000000000000000 0000010037c18000   ...........7....
     10037e4c610:  0000010037c18000 0000010037c182a8   ...7.......7....
     10037e4c620:  0000010037c18550 0000010037c187f8   P..7.......7....
     10037e4c630:  0000010037c18aa0 0000010037c18d48   ...7....H..7....
     10037e4c640:  0000010037c18ff0 0000010037c19298   ...7.......7....
     10037e4c650:  0000010037c19540 0000010037c197e8   @..7.......7....
     10037e4c660:  0000010037c19a90 0000010037c19d38   ...7....8..7....
     10037e4c670:  0000010037c19fe0 0000010037c1a288   ...7.......7....
     10037e4c680:  0000010037c1a530 0000010037c1a7d8   0..7.......7....
     10037e4c690:  0000010037c1aa80 0000010037c1ad28   ...7....(..7....
     10037e4c6a0:  0000010037c1afd0 0000010037c1b278   ...7....x..7....
     10037e4c6b0:  0000010037c1b520 0000010037c1b7c8    ..7.......7....
     10037e4c6c0:  0000010037c1ba70 0000010037c1bd18   p..7.......7....
     10037e4c6d0:  0000010037c1bfc0 0000010037c1c268   ...7....h..7....
     10037e4c6e0:  0000010037c1c510 0000010037c1c7b8   ...7.......7....
     10037e4c6f0:  0000010037c1ca60 0000010037c1cd08   `..7.......7....
     10037e4c700:  0000010037c1cfb0 0000010037c1d258   ...7....X..7....
     10037e4c710:  0000010037c1d500 0000010037c1d7a8   ...7.......7....
     10037e4c720:  0000000037c18000 0000000000000021   ...7....!.......
     10037e4c730:  dead4ead00000001 0000000000000000   .....N..........
     10037e4c740:  0000000000000000 0000000000000000   ................

I dumped the data pointed to a few of those pointers.  They seem to point to blocks that were zeroed out.

Note You need to log in before you can comment on or make changes to this bug.