Description of problem: I hit the following panic when I was running revolver on a 4-node cluster with 3 gfs2 filesystems. This is using the debug kernel that has the fix for bug 426234. (Kernel 2.6.18-79.el5.bz426234.debug2)

/var/crash/2008-03-04-23:33/ on merit.lab.msp.redhat.com has the vmcore.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
e049fc94
*pde = 118c9067
Oops: 0000 [#1]
SMP
last sysfs file: /fs/gfs2/Smoke4:smoke42/lock_module/recover_done
Modules linked in: sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2(U) dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp floppy sg i2c_i810 e1000 i2c_algo_bit i2c_i801 pcspkr i2c_core parport_pc parport ide_cd cdrom dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<e049fc94>]    Tainted: G      VLI
EFLAGS: 00010202   (2.6.18-79.el5.bz426234.debug2 #1)
EIP is at sctp_datamsg_put+0x1d/0xde [sctp]
eax: d183f800   ebx: d183f820   ecx: 00000000   edx: d194b3ac
esi: be0db0e0   edi: 00000000   ebp: 00000000   esp: d1ebbc10
ds: 007b   es: 007b   ss: 0068
Process d_doio (pid: 3779, ti=d1ebb000 task=d1c0d550 task.ti=d1ebb000)
Stack: d194b390 00ebbc10 d18bec80 be0db0e0 00000000 00000000 e04a0691 d18bed8c
       e04a4345 df7a8644 d194b390 d194b3b4 d194a000 d194a0cc be0db0e0 d1ebbc7c
       d194b390 d18beb80 d194a000 e049cf1e 00000004 00000003 00000001 00000001
Call Trace:
 [<e04a0691>] sctp_chunk_free+0x32/0x3a [sctp]
 [<e04a4345>] sctp_outq_sack+0x27d/0x334 [sctp]
 [<e049cf1e>] sctp_do_sm+0x353/0xc21 [sctp]
 [<e049f679>] sctp_assoc_bh_rcv+0x8e/0xc1 [sctp]
 [<e04a2f7d>] sctp_inq_push+0x15/0x17 [sctp]
 [<e04ab1d5>] sctp_backlog_rcv+0x87/0xb5 [sctp]
 [<c05a7cf3>] release_sock+0x44/0x91
 [<e04aa1d6>] sctp_recvmsg+0x347/0x353 [sctp]
 [<c04551e8>] sync_page+0x0/0x3b
 [<c0450355>] __delayacct_blkio_end+0x32/0x35
 [<c05a77ca>] sock_common_recvmsg+0x2f/0x45
 [<c05a5583>] sock_recvmsg+0xe5/0x100
 [<c0435fc7>] autoremove_wake_function+0x0/0x2d
 [<c04810bb>] core_sys_select+0x1c5/0x2ca
 [<c045739a>] generic_file_read+0x0/0xab
 [<c05a6cb0>] sys_recvfrom+0xcf/0x130
 [<c045ff85>] unmap_vmas+0x361/0x530
 [<c05a6d2a>] sys_recv+0x19/0x1d
 [<c05a7155>] sys_socketcall+0x117/0x19e
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: 80 98 00 00 00 89 50 18 c3 e9 b1 03 00 00 55 57 56 53 89 c3 83 ec 08 90 ff 48 10 0f 94 c0 84 c0 0f 84 bf 00 00 00 8b 0b 8a 43 1c <8b> 29 3c 01 19 ff f7 d7 c7 44 24 04 00 00 00 00 c7 04 24 00 00
EIP: [<e049fc94>] sctp_datamsg_put+0x1d/0xde [sctp] SS:ESP 0068:d1ebbc10
Did you say on IRC yesterday that you managed to capture a core on this bug? Do you have it available anywhere?
nm, I just saw it in the description.
Actually, can you provide me the login account/password for merit? It's not an NIS-enabled machine, and the normal root passwords don't seem to be correct.
Created attachment 296925 [details]
patch to attempt to prevent sctp_datamsg from freeing until all child chunks are freed

Ok, the problem is pretty clear: the msg->chunks list that gets traversed in sctp_datamsg_destroy is corrupted (specifically, the list_head->next pointer is NULL), so we oops in the setup of list_for_each_safe. I think the underlying reason is a reference counting problem. When we set up an sctp_datamsg structure we only hold a single reference count on it, even if the chunk assigned to it is fragmented (i.e., there are elements on the chunk->frag_list). These fragments get appended to sctp_datamsg->chunks after we call sctp_datamsg_assign. When sctp_datamsg_from_user is run in sctp_sendmsg, we call sctp_datamsg_track on all of these fragments, but that just increments the reference count for each chunk, not the datamsg refcount.

I think what's happening is that a chunk with fragments is getting sent, and the lead chunk (which holds the reference to the datamsg) gets freed, while its other fragments have some extra latency (perhaps they need to be resent, if they are dropped during transmit). As a result they get resent (successfully) and removed from a retransmit timer context while we are freeing the parent datamsg in the original send context. The result is that we can have two cpus manipulating chunk->frag_list (aliased as sctp_datamsg->chunks) at the same time, which leads to corruption.

I think what we need to do is have each chunk hold a reference on the parent datamsg, so as to ensure that the datamsg is not deleted until all the chunks belonging to it have been sent and deleted. The attached patch is an attempt to implement this. Fair warning: I've not even built it yet, so it may not compile, but I wanted you all to have it since this is a testing blocker, in case you want to fiddle with it. I'm going to build it shortly and fix up any bugs in it. Once I have a build, I'll post a reference to it.
Created attachment 297049 [details]
cleaned up version of patch

Here's a cleaned up version of the patch. It's building for me. Why don't you roll it into your test kernel with the other patch you're testing with, and let me know your results. Thanks!
Hold off on that last patch; I think I've found a corner case that I need to fix up. I'll repost when it's good to go.
Created attachment 297108 [details]
new patch

Ok, here's a new version of the patch. It takes care of the previous bug I was worried about (a botched reference count in the opposite direction, leading to a stalled association), and I think it should fix your list corruption too. Please give it a try and let me know the results. I've been passing traffic with it locally for a few hours now.
I hit the NULL ptr dereference again with the above patch (comment #9) and the last patch posted on rhkernel-list for bz 426234. The machine I hit it on is merit.lab.msp.redhat.com. The kernel I'm running is http://brewweb.devel.redhat.com/brew/buildinfo?buildID=69847 (2.6.18-84.abhi002), which I built with the above two patches.

/var/crash/2008-03-06-18:39/ on merit has the vmcore too.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
e04a7c96
*pde = 13f99067
Oops: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:01:00.0/host0/rport-0:0-2/target0:0:0/0:0:0:1/vendor
Modules linked in: sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2(U) dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp sg floppy ide_cd pcspkr parport_pc e1000 cdrom parport i2c_i810 i2c_algo_bit i2c_i801 i2c_core dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<e04a7c96>]    Tainted: G      VLI
EFLAGS: 00010202   (2.6.18-84.abhi002 #1)
EIP is at sctp_datamsg_put+0x1f/0xe1 [sctp]
eax: d3f3d700   ebx: d3f3d7e0   ecx: 00000000   edx: 00000000
esi: 00000002   edi: 00000000   ebp: 00000000   esp: c072ad48
ds: 007b   es: 007b   ss: 0068
Process d_doio (pid: 2769, ti=c072a000 task=d3f18000 task.ti=d3f50000)
Stack: b3062326 d39fb390 d3841d80 00000002 00000000 00000000 e04a8695 d3841c8c
       b3062326 e04ac351 df606c44 d39fb390 d39fb3b4 d39fa000 d39fa0cc b3062326
       c072adb8 d39fb390 d3841e80 d39fa000 e04a4f1e 00000004 00000003 00000001
Call Trace:
 [<e04a8695>] sctp_chunk_free+0x36/0x48 [sctp]
 [<e04ac351>] sctp_outq_sack+0x27d/0x334 [sctp]
 [<e04a4f1e>] sctp_do_sm+0x353/0xc21 [sctp]
 [<e04a7679>] sctp_assoc_bh_rcv+0x8e/0xc1 [sctp]
 [<e04aaf89>] sctp_inq_push+0x15/0x17 [sctp]
 [<e04b3944>] sctp_rcv+0x5f4/0x6a0 [sctp]
 [<c042db0b>] __mod_timer+0x99/0xa3
 [<c05c942b>] ip_local_deliver+0x159/0x202
 [<c05c9298>] ip_rcv+0x475/0x4af
 [<c05aeae2>] netif_receive_skb+0x2b1/0x325
 [<e022c091>] e1000_clean_rx_irq+0x347/0x41c [e1000]
 [<e022b175>] e1000_clean+0x6b/0x21f [e1000]
 [<c05b0482>] net_rx_action+0x92/0x175
 [<c042a7e6>] __do_softirq+0x5a/0xbb
 [<c0407461>] do_softirq+0x52/0x9d
 [<c0407406>] do_IRQ+0xa5/0xae
 [<c040592e>] common_interrupt+0x1a/0x20
 =======================
Code: 00 00 00 89 50 18 c3 e9 b1 03 00 00 55 57 56 53 89 c3 83 ec 08 90 ff 48 10 0f 94 c0 31 d2 84 c0 0f 84 be 00 00 00 8b 0b 8a 43 1c <8b> 29 3c 01 19 ff f7 d7 c7 44 24 04 00 00 00 00 c7 04 24 00 00
EIP: [<e04a7c96>] sctp_datamsg_put+0x1f/0xe1 [sctp] SS:ESP 0068:c072ad48
Yeah, I was going through the mental exercise of how the refcounting for the two data structures in question worked a bit ago, and I realized that they're working properly. We must have an imbalanced reference decrement somewhere, which is causing the chunk to be freed at the same time that sctp_datamsg_destroy runs. I think I can write a debug patch to try to catch it. I'll post it in the AM.
Created attachment 297214 [details]
debug patch

Build and run this patch if you would, please. It won't fix anything, but it will oops out when we detect a double free, and that should help us identify the real root cause. Thanks
I hit this with a kernel with the debug patch above: kernel-2.6.18-84.003, http://brewweb.devel.redhat.com/brew/buildinfo?buildID=69888. I haven't been able to hit it again, and I don't know if it's a consequence of your debugging code. I'll do a bit of testing again tomorrow to see if I can get something new.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
e049d000
*pde = 00000000
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000001
 printing eip:
e049d000
*pde = 00000000
Oops: 0002 [#1]
SMP
last sysfs file: /devices/pci0000:00/0000:00:1e.0/0000:01:00.0/host0/rport-0:0-3/target0:0:0/0:0:0:1/vendor
Modules linked in: sctp autofs4 hidp rfcomm l2cap bluetooth lock_dlm gfs2(U) dlm configfs sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac lp sg floppy i2c_i810 i2c_algo_bit i2c_i801 pcspkr i2c_core parport_pc parport ide_cd e1000 cdrom dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU:    0
EIP:    0060:[<e049d000>]    Tainted: G      VLI
EFLAGS: 00010296   (2.6.18-84.003 #1)
EIP is at sctp_sm_lookup_event+0x0/0xf4 [sctp]
eax: 00000001   ebx: ce508000   ecx: 00000000   edx: 00000004
esi: ce575280   edi: ce575280   ebp: ce508000   esp: c072ad98
ds: 007b   es: 007b   ss: 0068
Process d_doio (pid: 2753, ti=c072a000 task=ce9ed550 task.ti=cfc0b000)
Stack: e04a1bf8 00000004 00000000 00000001 c041fa67 c13f4ee0 00000000 00000003
       00000000 00000000 00000001 c13f4ee0 c04202dd 00000000 00000000 00000000
       00000000 c13f5a40 00000003 00000086 dce02fa4 d6e34ad8 00000001 c072ae18
Call Trace:
 [<e04a1bf8>] sctp_do_sm+0x2d/0xc21 [sctp]
 [<c041fa67>] enqueue_task+0x29/0x39
 [<c04202dd>] try_to_wake_up+0x371/0x37b
 [<c041eaa0>] __wake_up_common+0x2f/0x53
 [<c041f89a>] __wake_up+0x2a/0x3d
 [<e04a4679>] sctp_assoc_bh_rcv+0x8e/0xc1 [sctp]
 [<e04a7f91>] sctp_inq_push+0x15/0x17 [sctp]
 [<e04b094c>] sctp_rcv+0x5f4/0x6a0 [sctp]
 [<c042db0b>] __mod_timer+0x99/0xa3
 [<c05c942b>] ip_local_deliver+0x159/0x202
 [<c05c9298>] ip_rcv+0x475/0x4af
 [<c05aeae2>] netif_receive_skb+0x2b1/0x325
 [<e01c7091>] e1000_clean_rx_irq+0x347/0x41c [e1000]
 [<e01c6175>] e1000_clean+0x6b/0x21f [e1000]
 [<c05b0482>] net_rx_action+0x92/0x175
 [<c042a7e6>] __do_softirq+0x5a/0xbb
 [<c0407461>] do_softirq+0x52/0x9d
 [<c04059bf>] apic_timer_interrupt+0x1f/0x24
 [<c04266a2>] vprintk+0x27d/0x2b1
 [<c043616b>] remove_wait_queue+0x16/0x23
 [<c048075a>] free_poll_entry+0xe/0x16
 [<c048077a>] poll_freewait+0x18/0x4c
 [<c04266ee>] printk+0x18/0x8e
 [<c060a604>] do_page_fault+0x3c3/0x4b8
 [<c048141f>] __pollwait+0x0/0xb2
 [<c060a241>] do_page_fault+0x0/0x4b8
 [<c0405a71>] error_code+0x39/0x40
 [<e049d000>] sctp_sm_lookup_event+0x0/0xf4 [sctp]
 [<e04a1bf8>] sctp_do_sm+0x2d/0xc21 [sctp]
 [<e04a51da>] sctp_chunkify+0x16/0xb3 [sctp]
 [<c04e63d5>] copy_from_user+0x31/0x5d
 [<c05abe2d>] memcpy_fromiovecend+0x35/0x54
 [<e04a575e>] sctp_user_addto_chunk+0x58/0x97 [sctp]
 [<e04a4fd2>] sctp_datamsg_from_user+0x275/0x2b7 [sctp]
 [<e04af55f>] sctp_primitive_SEND+0x2e/0x33 [sctp]
 [<e04aed15>] sctp_sendmsg+0x7ff/0x98d [sctp]
 [<c0607c54>] schedule+0x920/0x9cd
 [<c05ea2d5>] inet_sendmsg+0x35/0x3f
 [<c05a568c>] sock_sendmsg+0xce/0xe8
 [<c0450286>] delayacct_end+0x58/0x7a
 [<c0435fc7>] autoremove_wake_function+0x0/0x2d
 [<c04810bb>] core_sys_select+0x1c5/0x2ca
 [<c0455aca>] do_generic_mapping_read+0x421/0x468
 [<c05a6844>] sys_sendto+0x116/0x140
 [<c0456361>] __generic_file_aio_read+0x166/0x198
 [<c0454f3b>] file_read_actor+0x0/0xd1
 [<c045739a>] generic_file_read+0x0/0xab
 [<c0457431>] generic_file_read+0x97/0xab
 [<c05a6887>] sys_send+0x19/0x1d
 [<c05a714b>] sys_socketcall+0xed/0x19e
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: Bad EIP value.
EIP: [<e049d000>] sctp_sm_lookup_event+0x0/0xf4 [sctp] SS:ESP 0068:c072ad98
<0>Kernel panic - not syncing: Fatal exception in interrupt
No, this wouldn't be something from the debug code. If you have a core, I'll look at it though.
Raising priority and severity levels. We are hitting these problems when running the latest QE test suites for cluster filesystem testing, also fixing version.
Ok, but I'm waiting on the results of the debug patch I posted in comment 13. Once I get that back, I can move forward.
When I started running 2.6.18-84.003, which has the fix to 426234 and the debug patch for this bug, I swear the first time I ran the test I made progress. However, after that, every time I run either the coherency test from bz428751 or

# dd_io -i 1 -l /usr/tests/sts-rhel5.2 -m /mnt/fs0 -r /usr/tests/sts-rhel5.2 -u root -R hedge-123.xml -S FULL -O output

I get nothing. There are no errors, but none of the d_doio processes are doing any work at all. sysrq-t on them shows

d_doio        S 000038BF  2452  5077   5075  5076 (NOTLB)
       d4206b68 00000082 65a19fc0 000038bf d98c5e00 00000000 d4628400 0000000a
       f7843aa0 f7c1a550 65a19fc0 000038bf 00000000 00000001 f7843bac ca0138c4
       f7ffd340 00000000 00000000 00000040 00000000 ffffffff 00000000 00000000
Call Trace:
 [<c0608341>] schedule_timeout+0x13/0x8c
 [<f8e8b393>] sctp_poll+0x14/0xe9 [sctp]
 [<c0480e9c>] do_select+0x371/0x3cb
 [<c048141f>] __pollwait+0x0/0xb2
 [<c04202e7>] default_wake_function+0x0/0xc
 [<c04e21e0>] __next_cpu+0x12/0x21
 [<c041efff>] find_busiest_group+0x177/0x462
 [<c0607c54>] schedule+0x920/0x9cd
 [<c0455184>] find_lock_page+0x1a/0x7e
 [<c04578c7>] find_or_create_page+0x2f/0x8c
 [<f8e31f74>] getbuf+0xfc/0x106 [gfs2]
 [<f8e320fb>] gfs2_meta_wait+0x12/0x79 [gfs2]
 [<f8e3219f>] gfs2_meta_read+0x3d/0x54 [gfs2]
 [<f8e325e0>] gfs2_meta_indirect_buffer+0xbb/0x14b [gfs2]
 [<f8e231b1>] gfs2_block_map+0x2dc/0x364 [gfs2]
 [<f8e2322d>] gfs2_block_map+0x358/0x364 [gfs2]
 [<f8e232a3>] gfs2_extent_map+0x6a/0x9e [gfs2]
 [<c048119f>] core_sys_select+0x2a9/0x2ca
 [<f8e2c337>] gfs2_glock_dq+0x9e/0xb2 [gfs2]
 [<f8e2af16>] gfs2_glock_put+0x1e/0x109 [gfs2]
 [<f8e2c1e6>] gfs2_holder_uninit+0xb/0x1b [gfs2]
 [<f8e38c00>] gfs2_sharewrite_nopage+0x2b0/0x2bc [gfs2]
 [<c042d9fa>] lock_timer_base+0x15/0x2f
 [<c045ff85>] unmap_vmas+0x361/0x530
 [<c0466c88>] free_pages_and_swap_cache+0x6b/0x7f
 [<c048179b>] sys_select+0xcf/0x180
 [<c0404eff>] syscall_call+0x7/0xb
 =======================

which has a bunch of junk in it.
However, if I do a sysrq-c and look at the crashdump, I see a much more sensible

PID: 5077  TASK: f7843aa0  CPU: 1  COMMAND: "d_doio"
 #0 [d4206afc] schedule at c0607c54
 #1 [d4206b6c] schedule_timeout at c060833c
 #2 [d4206b90] do_select at c0480e97
 #3 [d4206e34] core_sys_select at c048119a
 #4 [d4206f74] sys_select at c0481796
 #5 [d4206fb8] system_call at c0404ef8
    EAX: ffffffda  EBX: 00000004  ECX: 00000000  EDX: bfe168e4
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bfe1689c  EBP: bfe16978
    CS:  0073      EIP: 00819402  ERR: 0000008e  EFLAGS: 00000246

It looks like all the processes are just sitting there waiting for the server. Using the regular 84.el5 kernel, I see the sctp issues in spurts, but when I'm not panicking, I'm correctly getting messages. I assume this explains why Abhi can't recreate this bug with the latest patches. It's easy to miss the fact that no work is getting done during the revolver test, since the machines are constantly being rebooted.
Oh, gosh, there's a problem with my debug patch. Sorry about that. It's minor. I'll correct it and repost in a second.
Created attachment 297967 [details]
corrected debug patch

Sorry about that; I had a condition reversed, and as a result the situation we were trying to trap with a BUG halt was instead just not reducing our reference count. As a result we had chunks that would hang around forever, eating our receive space and leading to the hung connections you saw. This patch corrects that.