Bug 127615 - 2.6.7: cfq io scheduler paniced?
Summary: 2.6.7: cfq io scheduler paniced?
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brian Brock
Depends On:
Reported: 2004-07-10 23:54 UTC by Kaj J. Niemi
Modified: 2007-11-30 22:10 UTC
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2004-07-29 09:21:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
one of the two pdflushes vanished (16.90 KB, image/png)
2004-07-11 00:09 UTC, Kaj J. Niemi
spectacular load spike (16.09 KB, image/png)
2004-07-11 00:10 UTC, Kaj J. Niemi
meanwhile interface traffic dropped but did not stop completely (18.69 KB, image/png)
2004-07-11 00:10 UTC, Kaj J. Niemi
rapid rise of tcp sessions established and in close wait (15.50 KB, image/png)
2004-07-11 00:15 UTC, Kaj J. Niemi
just the time wait sessions (12.34 KB, image/png)
2004-07-11 00:18 UTC, Kaj J. Niemi

Description Kaj J. Niemi 2004-07-10 23:54:51 UTC
Description of problem:
A four-way (2x Xeon DP with HT) system paniced tonight in the
following fashion:

Unable to handle kernel NULL pointer dereference at virtual address
 printing eip:
*pde = 00003001
Oops: 0000 [#1]
Modules linked in: iptable_filter e1000 ipt_REDIRECT iptable_nat
ip_conntrack ip_tables floppy sg microcode dm_mod button battery
asus_acpi ac ext3 jbd sata_sil libata sd_mod scsi_mod
CPU:    2
EIP:    0060:[<0221fef1>]    Not tainted
EFLAGS: 00010293   (2.6.7-1.476smp)
EIP is at cfq_get_queue+0x28/0x98
eax: 00000000   ebx: 00000034   ecx: c1c834d8   edx: 00000000
esi: 00002008   edi: 00000220   ebp: 04d6dc64   esp: 19aa4cb0
ds: 007b   es: 007b   ss: 0068
Process pdflush (pid: 8200, threadinfo=19aa4000 task=39ec17b0)
Stack: 04d6dc64 00000220 61f1bd10 04dbe3ac 022201f7 022201d7 04d91bb4
       022173ef 61f1bd10 02218f7e 04d91c40 00000000 00000220 00000008
       00000000 0000007b c1069704 04d91bb4 00000008 00000000 02219b72
Call Trace:
 [<022201f7>] cfq_set_request+0x20/0x63
 [<022201d7>] cfq_set_request+0x0/0x63
 [<022173ef>] elv_set_request+0xa/0x17
 [<02218f7e>] get_request+0x18b/0x2b0
 [<02219b72>] __make_request+0x2de/0x4d6
 [<02219ef6>] generic_make_request+0x18c/0x19c
 [<02219fd0>] submit_bio+0xca/0xd2
 [<02160092>] submit_bh+0x60/0x103
 [<0215ecb8>] __block_write_full_page+0x1dd/0x2c4
 [<02162d62>] blkdev_get_block+0x0/0x46
 [<0215ffcd>] block_write_full_page+0xc5/0xce
 [<02162d62>] blkdev_get_block+0x0/0x46
 [<0217cef1>] mpage_writepages+0x157/0x272
 [<02162e45>] blkdev_writepage+0x0/0xc
 [<02141fe4>] do_writepages+0x19/0x27
 [<0217b48c>] __sync_single_inode+0x84/0x1f8
 [<02129cbf>] process_timeout+0x0/0x5
 [<0217b6db>] __writeback_single_inode+0xdb/0xe1
 [<0217b887>] sync_sb_inodes+0x1a6/0x2be
 [<02142ab0>] pdflush+0x0/0x1e
 [<0217baaa>] writeback_inodes+0x10b/0x1ce
 [<02161ba3>] sync_supers+0xf7/0x137
 [<02141e7f>] wb_kupdate+0x89/0xec
 [<021429c5>] __pdflush+0x1b9/0x2a4
 [<02142aca>] pdflush+0x1a/0x1e
 [<02141df6>] wb_kupdate+0x0/0xec
 [<02142ab0>] pdflush+0x0/0x1e
 [<0213458d>] kthread+0x73/0x9b
 [<0213451a>] kthread+0x0/0x9b
 [<021041f1>] kernel_thread_helper+0x5/0xb
Code: 8b 02 0f 18 00 90 39 ca 74 0d 39 72 14 75 04 89 d0 eb 06 8b
 <6>TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets

The load spiked to an artificial value of about 250.

The interesting thing was that out-of-band console sessions worked fine,
the UDP side of the network stack worked fine (SNMP requests were
serviced), and ICMP requests were answered normally, but the TCP stack
got completely hosed: all new connections were refused, and connections
already established were kept in the state tables but were not serviced
at all.
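The repeated "too many of orphaned sockets" messages appear when TCP hits its orphan limit, which is plausible here once nothing was servicing connections. As a quick way to inspect that limit and the live orphan count (these are standard Linux procfs paths, not values taken from this report):

```shell
# Kernel-wide cap on orphaned (unreferenced, closing) TCP sockets
cat /proc/sys/net/ipv4/tcp_max_orphans

# Current orphan count, among other TCP socket statistics
grep 'TCP:' /proc/net/sockstat
```

When the orphan count reaches tcp_max_orphans, the kernel resets further orphaned connections and logs the message seen in the trace above.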

Version-Release number of selected component (if applicable):

Additional info:

Comment 1 Kaj J. Niemi 2004-07-11 00:09:39 UTC
Created attachment 101783 [details]
one of the two pdflushes vanished

There were two pdflush kernel threads running, one of which died
according to the panic message.

Comment 2 Kaj J. Niemi 2004-07-11 00:10:05 UTC
Created attachment 101784 [details]
spectacular load spike

the load started going up at the same time

Comment 3 Kaj J. Niemi 2004-07-11 00:10:46 UTC
Created attachment 101785 [details]
meanwhile interface traffic dropped but did not stop completely

Comment 4 Kaj J. Niemi 2004-07-11 00:15:05 UTC
Created attachment 101786 [details]
rapid rise of tcp sessions established and in close wait

tcp sessions in time wait are not shown as they would throw off the graph.

Comment 5 Kaj J. Niemi 2004-07-11 00:18:25 UTC
Created attachment 101787 [details]
just the time wait sessions

tcp sessions in time wait dropped to zero at the same time close wait sessions
flatlined (about 00:20 local time in the graph)

Comment 6 Kaj J. Niemi 2004-07-13 13:30:27 UTC
Happened again on another otherwise identical system. Unfortunately
the console was dead so there wasn't anything on it. The following got
logged to syslog, though.

kernel: Debug: sleeping function called from invalid context at
kernel: in_atomic():0[expected: 0], irqs_disabled():1
kernel:  [<0211f978>] __might_sleep+0x7d/0x87
kernel:  [<0213f8f3>] mempool_alloc+0x6a/0x198
kernel:  [<021441e6>] poison_obj+0x1d/0x3d
kernel:  [<0211ff27>] autoremove_wake_function+0x0/0x2d
kernel:  [<02145982>] cache_alloc_debugcheck_after+0xcf/0x103
kernel:  [<0211ff27>] autoremove_wake_function+0x0/0x2d
kernel:  [<0213f904>] mempool_alloc+0x7b/0x198
kernel:  [<02220c34>] __cfq_get_queue+0x53/0x98
kernel:  [<02220cc8>] cfq_get_queue+0x4f/0x86
kernel:  [<02220f95>] cfq_set_request+0x20/0x63
kernel:  [<02220f75>] cfq_set_request+0x0/0x63
kernel:  [<02218107>] elv_set_request+0xa/0x17
kernel:  [<02219c82>] get_request+0x18b/0x2b0
kernel:  [<02219e24>] get_request_wait+0x7d/0xb9
kernel:  [<0211ff27>] autoremove_wake_function+0x0/0x2d

Seems to be reproducible, though unreliably, just by installing a new
kernel with "rpm -ivh kernel-*.rpm"; this happened with the 476, 478,
and 481 kernels.

Comment 7 Warren Togami 2004-07-13 13:36:41 UTC
To isolate the problem more conclusively, can you try booting with the
anticipatory or deadline elevator instead and see if the system
survives?

Comment 8 Kaj J. Niemi 2004-07-13 13:40:19 UTC
Would that be elevator=anticipatory and/or elevator=deadline?

Comment 9 Kaj J. Niemi 2004-07-13 13:45:06 UTC
Hmm, I think anticipatory is actually "elevator=as".
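For reference, in 2.6-era kernels the I/O scheduler was selected system-wide with the elevator= boot parameter. A sketch of a GRUB kernel line (the kernel version and root device below are hypothetical examples, not taken from this report):

```shell
# grub.conf kernel line selecting the deadline elevator instead of CFQ
# (kernel version and root device are hypothetical examples):
#
#   kernel /vmlinuz-2.6.7-1.481smp ro root=/dev/sda2 elevator=deadline
#
# Valid 2.6-era elevator= values: noop, as (anticipatory), deadline, cfq
```

Later kernels also expose the active scheduler per device at /sys/block/&lt;dev&gt;/queue/scheduler, but on 2.6.7 the boot parameter was the way to switch.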

Comment 10 Kaj J. Niemi 2004-07-16 10:50:52 UTC
Booting with elevator=deadline has kept the server up for 2+ days under
simulated load (about 600 Java threads, net-snmp full table walks
against it). I installed 492, booted with elevator=deadline, and will
see what happens.

Comment 11 Kaj J. Niemi 2004-07-16 10:54:27 UTC
Btw, all the panicking systems are Supermicro 6013P-T systems. A lot of
companies also OEM these and sell them as their own.

Comment 12 Kaj J. Niemi 2004-07-26 12:38:24 UTC
With elevator=deadline, the uptimes are now around 12+ days.

Comment 13 Warren Togami 2004-07-26 12:41:00 UTC
If this issue is not resolved with the latest rawhide kernels, you can
help by bringing this report to the attention of upstream LKML and the
CFQ author, Jens Axboe.
