Description of problem: I hit the following panic when I was running revolver on a 4-node cluster with 3 gfs2 filesystems. This is using the debug kernel that has the fix for bug 426234. (Kernel 2.6.18-79.el5.bz426234.debug2)

/var/crash/2008-03-04-23:33/ on merit.lab.msp.redhat.com has the vmcore.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
e049fc94
*pde = 118c9067
Oops: 0000 [#1]
SMP
last sysfs file: /fs/gfs2/Smoke4:smoke42/lock_module/recover_done
Modules linked in: sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2(U) dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp floppy sg i2c_i810 e1000 i2c_algo_bit i2c_i801 pcspkr i2c_core parport_pc parport ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<e049fc94>]    Tainted: G      VLI
EFLAGS: 00010202   (2.6.18-79.el5.bz426234.debug2 #1)
EIP is at sctp_datamsg_put+0x1d/0xde [sctp]
eax: d183f800   ebx: d183f820   ecx: 00000000   edx: d194b3ac
esi: be0db0e0   edi: 00000000   ebp: 00000000   esp: d1ebbc10
ds: 007b   es: 007b   ss: 0068
Process d_doio (pid: 3779, ti=d1ebb000 task=d1c0d550 task.ti=d1ebb000)
Stack: d194b390 00ebbc10 d18bec80 be0db0e0 00000000 00000000 e04a0691 d18bed8c
       e04a4345 df7a8644 d194b390 d194b3b4 d194a000 d194a0cc be0db0e0 d1ebbc7c
       d194b390 d18beb80 d194a000 e049cf1e 00000004 00000003 00000001 00000001
Call Trace:
 [<e04a0691>] sctp_chunk_free+0x32/0x3a [sctp]
 [<e04a4345>] sctp_outq_sack+0x27d/0x334 [sctp]
 [<e049cf1e>] sctp_do_sm+0x353/0xc21 [sctp]
 [<e049f679>] sctp_assoc_bh_rcv+0x8e/0xc1 [sctp]
 [<e04a2f7d>] sctp_inq_push+0x15/0x17 [sctp]
 [<e04ab1d5>] sctp_backlog_rcv+0x87/0xb5 [sctp]
 [<c05a7cf3>] release_sock+0x44/0x91
 [<e04aa1d6>] sctp_recvmsg+0x347/0x353 [sctp]
 [<c04551e8>] sync_page+0x0/0x3b
 [<c0450355>] __delayacct_blkio_end+0x32/0x35
 [<c05a77ca>] sock_common_recvmsg+0x2f/0x45
 [<c05a5583>] sock_recvmsg+0xe5/0x100
 [<c0435fc7>] autoremove_wake_function+0x0/0x2d
 [<c04810bb>] core_sys_select+0x1c5/0x2ca
 [<c045739a>] generic_file_read+0x0/0xab
 [<c05a6cb0>] sys_recvfrom+0xcf/0x130
 [<c045ff85>] unmap_vmas+0x361/0x530
 [<c05a6d2a>] sys_recv+0x19/0x1d
 [<c05a7155>] sys_socketcall+0x117/0x19e
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: 80 98 00 00 00 89 50 18 c3 e9 b1 03 00 00 55 57 56 53 89 c3 83 ec 08 90 ff 48 10 0f 94 c0 84 c0 0f 84 bf 00 00 00 8b 0b 8a 43 1c <8b> 29 3c 01 19 ff f7 d7 c7 44 24 04 00 00 00 00 c7 04 24 00 00
EIP: [<e049fc94>] sctp_datamsg_put+0x1d/0xde [sctp] SS:ESP 0068:d1ebbc10
Did you say on IRC yesterday that you managed to capture a core on this bug? Do you have it available anywhere?
nm, I just saw it in the description.
Actually, can you provide me the login account/password for merit? It's not an NIS-enabled machine, and the normal root passwords don't seem to be correct.
Created attachment 296925 [details]
patch to attempt to prevent sctp_datamsg from freeing until all child chunks are freed

Ok, the problem is pretty clear: the msg->chunks list that gets traversed in sctp_datamsg_destroy is corrupted (specifically, the list_head->next pointer is NULL), so we oops in the setup of list_for_each_safe. I think the underlying reason is a reference counting problem. When we set up an sctp_datamsg structure we only hold a single reference count on it, even if the chunk assigned to it is fragmented (i.e., there are elements on the chunk->frag_list). These fragments get appended to sctp_datamsg->chunks after we call sctp_datamsg_assign. When sctp_datamsg_from_user is run in sctp_sendmsg, we call sctp_datamsg_track on all of these fragments, but that just increments the reference count for each chunk, not the datamsg refcount.

I think what's happening is that a chunk with fragments is getting sent, and the lead chunk (which holds the reference to the datamsg) gets freed, while its other fragments have some extra latency (perhaps they need to be resent, if they are dropped during transmit). As a result they get resent (successfully) and removed from a retransmit timer context while we are freeing the parent datamsg in the original send context. The result is that we can have two cpus manipulating chunk->frag_list (aliased as sctp_datamsg->chunks) at the same time, which leads to corruption.

I think what we need to do is have each chunk hold a reference on the parent datamsg, so as to ensure that the datamsg is not deleted until all the chunks belonging to it have been sent and deleted. The attached patch is an attempt to implement this. Fair warning: I've not even built it yet, so it may not compile, but I wanted you all to have it since this is a testing blocker, in case you want to fiddle with it. I'm going to build it shortly and fix up any bugs in it. Once I have a build, I'll post a reference to it.
Created attachment 297049 [details]
cleaned up version of patch

Here's a cleaned up version of the patch. It's building for me. Why don't you roll it into your test kernel with the other patch you're testing with, and let me know your results. Thanks!
Hold off on that last patch; I think I've found a corner case that I need to fix up. I'll repost when it's good to go.
Created attachment 297108 [details]
new patch

Ok, here's a new version of the patch. It takes care of the previous bug I was worried about (a botched reference count in the opposite direction, leading to a stalled association), and I think it should fix your list corruption too. Please give it a try and let me know the results. I've been passing traffic with it locally for a few hours now.
I hit the NULL ptr dereference again with the above patch (comment #9) and the last patch posted on rhkernel-list for bz 426234. The machine I hit it on is merit.lab.msp.redhat.com. The kernel I'm running is http://brewweb.devel.redhat.com/brew/buildinfo?buildID=69847 (2.6.18-84.abhi002), which I built with the above two patches.

/var/crash/2008-03-06-18:39/ on merit has the vmcore too.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
e04a7c96
*pde = 13f99067
Oops: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:01:00.0/host0/rport-0:0-2/target0:0:0/0:0:0:1/vendor
Modules linked in: sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2(U) dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp sg floppy ide_cd pcspkr parport_pc e1000 cdrom parport i2c_i810 i2c_algo_bit i2c_i801 i2c_core dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<e04a7c96>]    Tainted: G      VLI
EFLAGS: 00010202   (2.6.18-84.abhi002 #1)
EIP is at sctp_datamsg_put+0x1f/0xe1 [sctp]
eax: d3f3d700   ebx: d3f3d7e0   ecx: 00000000   edx: 00000000
esi: 00000002   edi: 00000000   ebp: 00000000   esp: c072ad48
ds: 007b   es: 007b   ss: 0068
Process d_doio (pid: 2769, ti=c072a000 task=d3f18000 task.ti=d3f50000)
Stack: b3062326 d39fb390 d3841d80 00000002 00000000 00000000 e04a8695 d3841c8c
       b3062326 e04ac351 df606c44 d39fb390 d39fb3b4 d39fa000 d39fa0cc b3062326
       c072adb8 d39fb390 d3841e80 d39fa000 e04a4f1e 00000004 00000003 00000001
Call Trace:
 [<e04a8695>] sctp_chunk_free+0x36/0x48 [sctp]
 [<e04ac351>] sctp_outq_sack+0x27d/0x334 [sctp]
 [<e04a4f1e>] sctp_do_sm+0x353/0xc21 [sctp]
 [<e04a7679>] sctp_assoc_bh_rcv+0x8e/0xc1 [sctp]
 [<e04aaf89>] sctp_inq_push+0x15/0x17 [sctp]
 [<e04b3944>] sctp_rcv+0x5f4/0x6a0 [sctp]
 [<c042db0b>] __mod_timer+0x99/0xa3
 [<c05c942b>] ip_local_deliver+0x159/0x202
 [<c05c9298>] ip_rcv+0x475/0x4af
 [<c05aeae2>] netif_receive_skb+0x2b1/0x325
 [<e022c091>] e1000_clean_rx_irq+0x347/0x41c [e1000]
 [<e022b175>] e1000_clean+0x6b/0x21f [e1000]
 [<c05b0482>] net_rx_action+0x92/0x175
 [<c042a7e6>] __do_softirq+0x5a/0xbb
 [<c0407461>] do_softirq+0x52/0x9d
 [<c0407406>] do_IRQ+0xa5/0xae
 [<c040592e>] common_interrupt+0x1a/0x20
 =======================
Code: 00 00 00 89 50 18 c3 e9 b1 03 00 00 55 57 56 53 89 c3 83 ec 08 90 ff 48 10 0f 94 c0 31 d2 84 c0 0f 84 be 00 00 00 8b 0b 8a 43 1c <8b> 29 3c 01 19 ff f7 d7 c7 44 24 04 00 00 00 00 c7 04 24 00 00
EIP: [<e04a7c96>] sctp_datamsg_put+0x1f/0xe1 [sctp] SS:ESP 0068:c072ad48
Yeah, I was going through the mental exercise of how the refcounting for the two data structures in question worked a bit ago, and I realized that they're working properly. We must have an imbalanced reference decrement somewhere, which is causing the chunk to be freed at the same time that sctp_datamsg_destroy runs. I think I can write a debug patch to try to catch it. I'll post it in the AM.
Created attachment 297214 [details]
debug patch

Build and run this patch if you would, please. It won't fix anything, but it will oops out when we detect a double free, and that should help us identify the real root cause. Thanks
I hit this with a kernel with the debug patch above: kernel-2.6.18-84.003, http://brewweb.devel.redhat.com/brew/buildinfo?buildID=69888. I haven't been able to hit it again, and I don't know if it's a consequence of your debugging code. I'll do a bit of testing again tomorrow to see if I can get something new.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
e049d000
*pde = 00000000
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000001
 printing eip:
e049d000
*pde = 00000000
Oops: 0002 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:01:00.0/host0/rport-0:0-3/target0:0:0/0:0:0:1/vendor
Modules linked in: sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2(U) dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp sg floppy i2c_i810 i2c_algo_bit i2c_i801 pcspkr i2c_core parport_pc parport ide_cd e1000 cdrom dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<e049d000>]    Tainted: G      VLI
EFLAGS: 00010296   (2.6.18-84.003 #1)
EIP is at sctp_sm_lookup_event+0x0/0xf4 [sctp]
eax: 00000001   ebx: ce508000   ecx: 00000000   edx: 00000004
esi: ce575280   edi: ce575280   ebp: ce508000   esp: c072ad98
ds: 007b   es: 007b   ss: 0068
Process d_doio (pid: 2753, ti=c072a000 task=ce9ed550 task.ti=cfc0b000)
Stack: e04a1bf8 00000004 00000000 00000001 c041fa67 c13f4ee0 00000000 00000003
       00000000 00000000 00000001 c13f4ee0 c04202dd 00000000 00000000 00000000
       00000000 c13f5a40 00000003 00000086 dce02fa4 d6e34ad8 00000001 c072ae18
Call Trace:
 [<e04a1bf8>] sctp_do_sm+0x2d/0xc21 [sctp]
 [<c041fa67>] enqueue_task+0x29/0x39
 [<c04202dd>] try_to_wake_up+0x371/0x37b
 [<c041eaa0>] __wake_up_common+0x2f/0x53
 [<c041f89a>] __wake_up+0x2a/0x3d
 [<e04a4679>] sctp_assoc_bh_rcv+0x8e/0xc1 [sctp]
 [<e04a7f91>] sctp_inq_push+0x15/0x17 [sctp]
 [<e04b094c>] sctp_rcv+0x5f4/0x6a0 [sctp]
 [<c042db0b>] __mod_timer+0x99/0xa3
 [<c05c942b>] ip_local_deliver+0x159/0x202
 [<c05c9298>] ip_rcv+0x475/0x4af
 [<c05aeae2>] netif_receive_skb+0x2b1/0x325
 [<e01c7091>] e1000_clean_rx_irq+0x347/0x41c [e1000]
 [<e01c6175>] e1000_clean+0x6b/0x21f [e1000]
 [<c05b0482>] net_rx_action+0x92/0x175
 [<c042a7e6>] __do_softirq+0x5a/0xbb
 [<c0407461>] do_softirq+0x52/0x9d
 [<c04059bf>] apic_timer_interrupt+0x1f/0x24
 [<c04266a2>] vprintk+0x27d/0x2b1
 [<c043616b>] remove_wait_queue+0x16/0x23
 [<c048075a>] free_poll_entry+0xe/0x16
 [<c048077a>] poll_freewait+0x18/0x4c
 [<c04266ee>] printk+0x18/0x8e
 [<c060a604>] do_page_fault+0x3c3/0x4b8
 [<c048141f>] __pollwait+0x0/0xb2
 [<c060a241>] do_page_fault+0x0/0x4b8
 [<c0405a71>] error_code+0x39/0x40
 [<e049d000>] sctp_sm_lookup_event+0x0/0xf4 [sctp]
 [<e04a1bf8>] sctp_do_sm+0x2d/0xc21 [sctp]
 [<e04a51da>] sctp_chunkify+0x16/0xb3 [sctp]
 [<c04e63d5>] copy_from_user+0x31/0x5d
 [<c05abe2d>] memcpy_fromiovecend+0x35/0x54
 [<e04a575e>] sctp_user_addto_chunk+0x58/0x97 [sctp]
 [<e04a4fd2>] sctp_datamsg_from_user+0x275/0x2b7 [sctp]
 [<e04af55f>] sctp_primitive_SEND+0x2e/0x33 [sctp]
 [<e04aed15>] sctp_sendmsg+0x7ff/0x98d [sctp]
 [<c0607c54>] schedule+0x920/0x9cd
 [<c05ea2d5>] inet_sendmsg+0x35/0x3f
 [<c05a568c>] sock_sendmsg+0xce/0xe8
 [<c0450286>] delayacct_end+0x58/0x7a
 [<c0435fc7>] autoremove_wake_function+0x0/0x2d
 [<c04810bb>] core_sys_select+0x1c5/0x2ca
 [<c0455aca>] do_generic_mapping_read+0x421/0x468
 [<c05a6844>] sys_sendto+0x116/0x140
 [<c0456361>] __generic_file_aio_read+0x166/0x198
 [<c0454f3b>] file_read_actor+0x0/0xd1
 [<c045739a>] generic_file_read+0x0/0xab
 [<c0457431>] generic_file_read+0x97/0xab
 [<c05a6887>] sys_send+0x19/0x1d
 [<c05a714b>] sys_socketcall+0xed/0x19e
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: Bad EIP value.
EIP: [<e049d000>] sctp_sm_lookup_event+0x0/0xf4 [sctp] SS:ESP 0068:c072ad98
<0>Kernel panic - not syncing: Fatal exception in interrupt
No, this wouldn't be something from the debug code. If you have a core, I'll look at it though.
Raising priority and severity levels. We are hitting these problems when running the latest QE test suites for cluster filesystem testing, also fixing version.
Ok, but I'm waiting on the results of the debug patch I posted in comment 13. Once I get that back, I can move forward.
When I started running 2.6.18-84.003, which has the fix to 426234 and the debug patch for this bug, I swear the first time I ran the test I made progress. However, after that, every time I run either the coherency test from bz428751 or

# dd_io -i 1 -l /usr/tests/sts-rhel5.2 -m /mnt/fs0 -r /usr/tests/sts-rhel5.2 -u root -R hedge-123.xml -S FULL -O output

I get nothing. There are no errors, but none of the d_doio processes are doing any work at all. sysrq-t on them shows

d_doio        S 000038BF  2452  5077   5075  5076 (NOTLB)
       d4206b68 00000082 65a19fc0 000038bf d98c5e00 00000000 d4628400 0000000a
       f7843aa0 f7c1a550 65a19fc0 000038bf 00000000 00000001 f7843bac ca0138c4
       f7ffd340 00000000 00000000 00000040 00000000 ffffffff 00000000 00000000
Call Trace:
 [<c0608341>] schedule_timeout+0x13/0x8c
 [<f8e8b393>] sctp_poll+0x14/0xe9 [sctp]
 [<c0480e9c>] do_select+0x371/0x3cb
 [<c048141f>] __pollwait+0x0/0xb2
 [<c04202e7>] default_wake_function+0x0/0xc
 [<c04e21e0>] __next_cpu+0x12/0x21
 [<c041efff>] find_busiest_group+0x177/0x462
 [<c0607c54>] schedule+0x920/0x9cd
 [<c0455184>] find_lock_page+0x1a/0x7e
 [<c04578c7>] find_or_create_page+0x2f/0x8c
 [<f8e31f74>] getbuf+0xfc/0x106 [gfs2]
 [<f8e320fb>] gfs2_meta_wait+0x12/0x79 [gfs2]
 [<f8e3219f>] gfs2_meta_read+0x3d/0x54 [gfs2]
 [<f8e325e0>] gfs2_meta_indirect_buffer+0xbb/0x14b [gfs2]
 [<f8e231b1>] gfs2_block_map+0x2dc/0x364 [gfs2]
 [<f8e2322d>] gfs2_block_map+0x358/0x364 [gfs2]
 [<f8e232a3>] gfs2_extent_map+0x6a/0x9e [gfs2]
 [<c048119f>] core_sys_select+0x2a9/0x2ca
 [<f8e2c337>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<f8e2af16>] gfs2_glock_put+0x1e/0x109 [gfs2]
 [<f8e2c1e6>] gfs2_holder_uninit+0xb/0x1b [gfs2]
 [<f8e38c00>] gfs2_sharewrite_nopage+0x2b0/0x2bc [gfs2]
 [<c042d9fa>] lock_timer_base+0x15/0x2f
 [<c045ff85>] unmap_vmas+0x361/0x530
 [<c0466c88>] free_pages_and_swap_cache+0x6b/0x7f
 [<c048179b>] sys_select+0xcf/0x180
 [<c0404eff>] syscall_call+0x7/0xb
 =======================

which has a bunch of junk in it.
However, if I do a sysrq-c and look at the crashdump, I see a much more sensible

PID: 5077  TASK: f7843aa0  CPU: 1  COMMAND: "d_doio"
 #0 [d4206afc] schedule at c0607c54
 #1 [d4206b6c] schedule_timeout at c060833c
 #2 [d4206b90] do_select at c0480e97
 #3 [d4206e34] core_sys_select at c048119a
 #4 [d4206f74] sys_select at c0481796
 #5 [d4206fb8] system_call at c0404ef8
    EAX: ffffffda  EBX: 00000004  ECX: 00000000  EDX: bfe168e4
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bfe1689c  EBP: bfe16978
    CS:  0073      EIP: 00819402  ERR: 0000008e  EFLAGS: 00000246

It looks like all the processes are just sitting there waiting for the server. Using the regular 84.el5 kernel, I see the sctp issues in spurts, but when I'm not panicking, I'm correctly getting messages. I assume this explains why Abhi can't recreate this bug with the latest patches. It's easy to miss the fact that no work is getting done during the revolver test, since the machines are constantly being rebooted.
Oh, gosh, there's a problem with my debug patch. Sorry about that. It's minor. I'll correct it and repost in a second.
Created attachment 297967 [details]
corrected debug patch

Sorry about that; I had a condition reversed, and as a result the situation we were trying to trap with a BUG halt was instead just not reducing our reference count. As a result we had chunks that would hang around forever, eating our receive space and leading to the hung connections you saw. This patch corrects that.