Description of problem:
When sending small messages over SCTP and suspending the receiver, SCTP uses a huge amount of memory for buffering. When the receiver is then switched back to the foreground, after a few seconds the system freezes. The freeze lasts 5-30 seconds with one association, more than 5 minutes with two associations. Obviously there are no problems if TCP is used in the same test instead of SCTP.

Version-Release number of selected component (if applicable):
kernel-2.6.9-5.EL

How reproducible:
Every time.

Steps to Reproduce:
1. One host, two processes: sender and receiver.
2. Create N SCTP associations from sender to receiver.
3. Sender: send (via loopback) 1-byte messages round-robin to the N assocs in a loop.
4. Receiver: receive (via loopback) 1-byte messages round-robin from the N assocs in a loop.
5. Suspend the receiver, wait until the sender blocks.
6. Check memory consumption.
7. Foreground the receiver.

Actual results:
In step 6, the memory consumption is about 200 megabytes per assoc. In step 7, the receiver runs for a few seconds, then the system freezes. The freeze lasts 5-30 seconds with one association, more than 5 minutes with two associations. With two associations, the system sometimes resumes normally after the freeze; sometimes it resumes only intermittently, freezing again after a few seconds.

Expected results:
In step 6, the memory consumption should be limited by the SCTP socket buffer limits, plus possibly some overhead. A default of 200 megabytes per assoc may be too much. Step 7 is pretty obvious: the system shouldn't freeze.

Additional info:
This behaviour may depend on CPU/memory, so here's the info from my test rig (Dell Optiplex GX270):

[jere@dhcp094189 ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 9
cpu MHz         : 2993.440
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 5914.62

[jere@dhcp094189 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       1027228     260576     766652          0      20796     143516
-/+ buffers/cache:      96264     930964
Swap:      2031608          0    2031608
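For readers without the attached script, here is a minimal sketch of the sender half described in steps 3 and 5 (my illustration only, not the reporter's test program; the port number and the single-association simplification are assumptions). With a one-to-one style SCTP socket, every 1-byte send() becomes its own DATA chunk:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_SCTP
#define IPPROTO_SCTP 132	/* not present in all libc headers of this era */
#endif

int main(void)
{
	struct sockaddr_in sin;
	int fd;

	/* One-to-one style SCTP socket, driven with the plain sockets API. */
	fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5000);	/* assumed receiver port */
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		perror("connect");
		return 1;
	}

	/* Each 1-byte message gets its own sk_buff on both the send and
	 * receive paths. SIGSTOP the receiver and this loop keeps going
	 * until the sender finally blocks (step 5). */
	for (;;) {
		if (send(fd, "x", 1, 0) < 0) {
			perror("send");
			break;
		}
	}
	close(fd);
	return 0;
}

Run a matching receiver (socket/bind/listen/accept/recv loop), SIGSTOP it, and watch free(1); that reproduces steps 5-6 with N=1.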
Forgot one thing: in step 5, when SCTP (kernel code) allocates 200 megabytes per assoc and you run out of RAM, the system freezes and the CapsLock and ScrollLock LEDs start blinking (the usual kernel panic indication).
Could you please provide the test case you're using to reproduce this problem, along with memory consumption information from your box when the problem occurs, so that we can compare/confirm results? Thanks
Created attachment 110565 [details] Script to produce the problem.
Created attachment 110566 [details] Output of the script on my box.
I've attached a script and its output on my box. The script uses sctp_darn from lksctp-tools. In sctp_darn the minimum message size is 3 bytes. With one assoc the memory consumption is "only" ~67 megabytes, as you can see from the output. With 1-byte messages the memory consumption was ~200 megabytes; with 3-byte messages it's ~67, which is of course 200 / 3 = 67. Inverse linear. Also, the severity of the intermittent freezes seems proportional to the amount of consumed memory.

This brings us to a theory: SCTP is allocating entities with a lot of per-entity overhead (DATA chunks?), while the maximum number of entities is limited only by the payload they contain. If a buffer limit of L payload bytes admits L/s entities of message size s, each with fixed overhead h, the total memory is roughly L * h/s - inversely proportional to message size, exactly as observed. Small messages --> a huge number of entities. SCTP ends up processing a huge list of entities, and during that processing nothing much else is happening in the kernel.
Created attachment 110572 [details] my test case for sctp lowmem exhaustion

I just wrote my own test case here and got semi-similar results. My test case tends to result in ENOMEM errors, but that could just be the way I have my socket options set. I'll confirm results with your scripts. It appears that in the case of my test, the skbuff_head_cache is growing without bound, until the kernel is in a state of lowmem exhaustion. Not sure quite what to do about it yet....
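A quick way to watch that growth (a throwaway helper of my own, not part of the attached test case): echo the skbuff_head_cache line of /proc/slabinfo once per second. The column layout of slabinfo varies between kernel versions, so this deliberately does no parsing.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[512];
	FILE *f;

	/* Poll /proc/slabinfo and print the skbuff_head_cache line. */
	for (;;) {
		f = fopen("/proc/slabinfo", "r");
		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (strstr(line, "skbuff_head_cache"))
				fputs(line, stdout);
		}
		fclose(f);
		sleep(1);
	}
}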
I was just browsing through the sctp code, and found this in the header of sctp_sendmsg:

BUG: We do not implement the equivalent of wait_for_tcp_memory()

Upstream seems to hold a variant of this comment, and I expect this is what we're seeing: without it, send-side memory usage will likely grow without bound. I'll see what's involved in implementing it.
No, I don't mind at all. I'm going to contact Dave M about this tomorrow to figure out how best to handle this. Thanks!
Created attachment 110796 [details] Sysreport from my box

Sysreport as submitted by customer to Issue Tracker.
Created attachment 110797 [details] A more elaborate log file

A more elaborate log file. This time I was on a serial console instead of an X session, hence the different free-memory numbers. The consumption is still the same ~67 megabytes. This log includes several samples of SysRq-m: one before script execution, several during the freeze, and one more after I managed to killall sctp_darn and the system came back to normal.
Thanks. I'm in the process of tracking this down at the moment.
Can you all do me a favor? I think I was mistaken before regarding what I thought the problem was. sctp does appear to provide reasonable accounting of its memory usage, and while it could probably stand to have better accounting at the socket level, I think this problem can be easily avoided by tuning, rather than code adjustments.

It would appear that by default (not surprisingly), the socket send and receive buffer sizes are set to whatever /proc/sys/net/core/[r|w]mem_default is. Normally, under most protocols, the boot-time default here is fine. UDP doesn't care so much, since frames are freed once they are sent, leading to a quick recovery of allocated memory. Under TCP the values are reasonable, since TCP coalesces data into large frames under a single sk_buff. In sctp, however, we can send 1 byte of data at a time (which we are doing here), and each 1-byte datagram requires its own sk_buff structure. Because of this extreme fragmentation, and because the socket buffer accounting counts only payload size and not the associated structure size, the socket buffer limits appear effectively unlimited to the send and receive code. As such, the protocol continues to send data, which gets enqueued on the receive path; when we suspend the listener half of the application, the sk_buff_head slab cache grows and grows until we run out of lowmem.

The easy workaround is to either:

a) reduce /proc/sys/net/core/rmem_default and wmem_default to values reasonable for the system you are running on. I tested with values between 4096 and 32768, and detected no slowdowns or lockups on my 1GB machine.

b) reduce the SO_SNDBUF and SO_RCVBUF values via setsockopt inside the program being run (see the sketch after this comment). This has the same effect as (a), but is nicer because it avoids any odd side effects for programs which do not explicitly set SND/RCV buffer sizes.

As for a fix, I'm not sure there needs to be one. About the only thing I can think of at the moment is that we include the sk_buff structure size in the accounting of the SND/RCV buf values (since, unlike tcp, sctp can send very small data packets, each of which requires an sk_buff and an ACK of receipt from its peer). It kind of feels wrong to me to do that, but on the other hand, it would be a more accurate reflection of how much memory the protocol is using. I'll solicit comments and post an update here with a verdict.
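For workaround (b), the application-side change is a couple of setsockopt() calls; a minimal sketch (the 8192 figure is just an example from the 4096-32768 range tested above):

#include <stdio.h>
#include <sys/socket.h>

/* Clamp a socket's send/receive buffers so that the payload-only
 * accounting saturates long before lowmem is exhausted. Call this on
 * each SCTP socket before the association starts moving data. */
int clamp_buffers(int fd)
{
	int bufsz = 8192;	/* example value; tune for your association count */

	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)) < 0) {
		perror("setsockopt(SO_SNDBUF)");
		return -1;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)) < 0) {
		perror("setsockopt(SO_RCVBUF)");
		return -1;
	}
	return 0;
}

Note that the kernel doubles the value passed in to leave room for bookkeeping overhead, so an SO_SNDBUF of 8192 yields an effective sk_sndbuf of 16384.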
The area of effect of workaround a) is way too wide. Workaround b) kind of solves the problem, but requires modification and recompilation of user programs, and is therefore about as effective as workaround c): don't send so many / such small messages. Even with workaround b), one would be forced to use really small buffers, considering that there can be thousands of associations. Be that as it may, the _real_ problem is that if the sctp module is loaded, any user can freeze the system, either intentionally or accidentally.
You are correct; the problem here is that using small buffers can lead to an exhaustion of lowmem, bringing the system to its knees. As I mentioned in my last post, this stems from the fact that sending/receiving small buffers without accounting for the size of the requisite sk_buffs leads to inaccurate accounting of socket memory usage. I'm going to write a patch to fix this shortly. Until then, whichever workaround (a, b or c) you find most suitable for your environment should avoid this problem just fine.
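To make that concrete, here is a rough sketch of the direction such a patch can take (my illustration, modeled on helper names in the lksctp sources like sctp_set_owner_w()/sctp_wfree(); it is not the attached patch): when a DATA chunk is queued for transmit, charge the socket for the whole sk_buff footprint rather than just the payload bytes.

/* Sketch only. Charging truesize plus the chunk structure makes
 * sk_sndbuf a real bound even for 1-byte messages, where the struct
 * overhead dwarfs the payload. */
static void sctp_set_owner_w(struct sctp_chunk *chunk)
{
	struct sock *sk = chunk->asoc->base.sk;

	chunk->skb->sk = sk;
	chunk->skb->destructor = sctp_wfree;	/* uncharges on free */

	sk->sk_wmem_queued +=
		chunk->skb->truesize + sizeof(struct sctp_chunk);
}

With the charge in place, the send path can compare sk_wmem_queued against sk_sndbuf and sleep (the wait_for_tcp_memory() analogue mentioned earlier) instead of allocating without bound.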
Right, correcting the buffer usage accounting should fix this. It can make the buffers seem small to the user process if small messages are used. Still, in my opinion that's much better than accidentally running out of memory and freezing the system. Even with more correct accounting of buffer usage, there might still be problems with large numbers of associations, but I guess it's the same even with TCP: fill ten thousand TCP socket buffers and you'll run out of memory. If you come up with a patch, I'd be happy to try it. Meanwhile, I'll try to cope with the workarounds. Thanks.
Created attachment 111436 [details] patch to fix sctp sendbuffer accounting
Created attachment 111439 [details] patch to fix sctp receive buffer accounting
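For context on what "receive buffer accounting" (the attachment above) means at the socket layer: the generic kernel already has a helper that charges an skb's truesize (payload plus structure overhead) to the receiving socket, and the receive-side patch essentially makes SCTP's queued data pay that cost. Quoted from memory from the 2.6 include/net/sock.h, so treat it as illustrative:

static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_rfree;	/* uncharges sk_rmem_alloc on free */
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
}

Once sk_rmem_alloc reflects truesize, the receive path can apply backpressure when it exceeds sk_rcvbuf, which is what keeps a suspended receiver from dragging the box into lowmem exhaustion.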
Created attachment 111440 [details] patch to stall sending processes on full peer receive window
I just uploaded three patches to this bug:

1) a send buffer accounting patch
2) a receive buffer accounting patch
3) a receive window stall patch

I'm still waiting on internal feedback on these, but I thought if anyone wanted to test them out, I'd put them up. (1) and (2) are independent, but (3) is cumulative with (1), so if you want to test it, you need to take both (1) and (3). I recommend that they all be included, as they fix what I believe are pretty clearly bugs in sctp, based on my reading of the RFC (although, bear in mind, I'm a little new to this protocol, so I might be somewhat mistaken). Patch (3) isn't technically a bug fix, but it avoids a lot of really unnecessary thrashing of the slab cache caused by slab allocs that are immediately freed again. I'm also building a kernel with all three patches incorporated, and I'll post a link once it's available.
I've built and uploaded test kernels with the three attached patches incorporated to: http://people.redhat.com/nhorman/kernels

The UP and SMP versions are there. If anyone seeing this problem wouldn't mind, test feedback would be greatly appreciated. Thanks!
My observations testing the three patches:

Patches 1+2: My script produces allocations of more than 750 Mbytes, and the kernel freezes. So these two patches alone are actually worse than the original.

Patches 1+2+3 (tried this with Neil's binaries as well as ones compiled by myself, with identical results): The script still produces an allocation of about 28 Mbytes. No kernel freezes. Better, but 28 Mbytes per assoc is still too much, _and_ if I drop the message size from 3 bytes to 1 byte, the allocation becomes 85 Mbytes per assoc.
How are you measuring the amount of memory consumed by each association?

Also, just FYI, these patches won't be the final revision. The receive buffer patch has been accepted upstream, but I'm still working with the project maintainers on the send buffer accounting scheme. As it turns out, there is some language in the sctp socket extensions and implementation guide that suggests that each association should have sk_sndbuf worth of bytes to allocate on the send side, even when multiplexed over one socket. Clearly this violates socket-layer accounting rules, so I'm walking a fine line here. I'll post whatever patch we wind up going with upstream here. If anyone wants to follow along, the mailing-list archives are available at the lksctp project page on SourceForge.
free:

             total       used       free     shared    buffers     cached
Mem:       1027228     306240     720988          0      32532     155232
-/+ buffers/cache:     118476     908752
Swap:      2031608          0    2031608

I'm looking at changes in the last number on the middle line, in this case 908752. I've also checked (using top) that user processes don't increase memory consumption during testing.
Created attachment 111849 [details] patch to do rcv buffer accounting on ulpevent sk_buffs

Sorry, you're right. I was looking into it after your last post, and I noticed that when the receiver was suspended, the size-512 slab cache was growing quite large. I backtracked it to skb allocations in the ulpevent code. Please add this new patch to the set that you are currently testing with and let me know how it goes. I tried it with my test case, and ran the lksctp regression suite against the entire set of 4, and memory usage seemed to be reduced to whatever I set /proc/sys/net/core/[r|w]mem_default to. Thanks!
I applied the fourth patch also, but as a result, my box runs out of its 1G RAM when I run my test script. Ouch.
That doesn't make any sense. The fourth patch only increases the amount of memory accounted for on the receive side of the network stack for this protocol. Can you run the test again, and see if you get consistent results?
It's consistent, although I don't always run out of RAM. But even then, the memory consumption is huge.
Well, I don't understand how that can happen, but there are some comments around the ulpevent code that suggest other things may be wrong in that functionality as well. I'll try to see if I can figure out what's going on there. Thanks.
Created attachment 112070 [details] this is the final version of the sndbuf accounting patch which is getting upstream testing
Current status:

1) Receive buffer accounting patch applied upstream and in Red Hat kernels
2) Send buffer accounting patch has a consensus upstream and will receive additional testing before application
3) ulpevent receive buffer accounting patch pending review upstream

In addition to these patches, I've also found other code paths in sctp which can allocate sk_buffs without having them accounted for on the send side. Unfortunately they are all reachable via the net rx softirq, which means we can't sleep waiting for send buffer space. It looks like this is the result of a hack I don't yet completely understand. It should be pretty easy to reverse, but I'll need to understand it better before we know exactly how to fix this. This is most likely the source of the extra memory allocations that jere reported in comment #27 above. I'll get something put together for that asap.
Created attachment 112266 [details] patch to break up bottom half to avoid deadlock when receive path sends frames
Created attachment 112267 [details] patch to add sendbuffer accounting to sctp_make_chunk for all paths.
Sorry, but there are more patches here. I'm still cleaning up accounting issues (and there appear to be more still). I'll put together a new kernel rpm for testing which incorporates all of them and post a link here soon.
I've built a kernel with the following patches in it for testing:

1) patch to fix sctp receive buffer accounting
2) the final version of the sndbuf accounting patch which is getting upstream testing
3) patch to do rcv buffer accounting on ulpevent sk_buffs
4) patch to stall sending processes on full peer receive window

The kernels are here: http://people.redhat.com/nhorman/kernels/

Please note that patch 2 requires you to set the sctp accounting mode; it should be set to zero in the sysctl. Also, the upstream maintainer has some more concerns regarding sctp rfc/implementation guide compliance, so the final (sigh :)) version of the sctp sendbuffer patch may vary somewhat from what is in here, but it will be close.
I tested with Neil's binaries (link in the previous comment). With association_sndbuf_policy set to 0 or 2, there are no problems with my test case. With association_sndbuf_policy set to 1, the first sendmsg() in sctp_darn gives ENOMEM. I'm a bit unsure about the wisdom of the sysctl - shouldn't there be one agreed-upon, "correct" method of buffer accounting in sctp? Oh well, what do I know. Anyway, this is certainly an improvement.
I agree that the sysctl is a little silly. There is, however, a statement in the SCTP implementors guide that indicates that buffer accounting should be done on a per-association basis, rather than on a per-socket basis. It's clearly a violation of socket-level accounting rules, but the upstream developers are adamant about maintaining the ability to do per-association accounting. The sysctl is the only way around this that I see.
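To make the two policies concrete, the send-space check ends up looking something like this sketch (modeled on sctp_wspace() in net/sctp/socket.c; field and sysctl names may differ from the shipped patch):

/* Sketch of per-socket vs. per-association send space. */
static int sctp_wspace(struct sctp_association *asoc)
{
	struct sock *sk = asoc->base.sk;
	int amt;

	if (asoc->ep->sndbuf_policy)
		/* policy 1: each association gets sk_sndbuf to itself */
		amt = sk->sk_sndbuf - asoc->sndbuf_used;
	else
		/* policy 0: all associations share the socket's budget */
		amt = sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc);

	return amt < 0 ? 0 : amt;
}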
Yeah, I guess you're right ... but I still don't like it. ;-)

I managed a little bit more testing. I'm not sure how this is supposed to behave now, but there's still some dependency between user message size and memory consumption. I tested with one-to-one sockets, and the memory consumption was:

1-byte message, policy=0: 485 KB per assoc
1-byte message, policy=2: 570 KB per assoc
10 KB message, policy=0: 179 KB per assoc
10 KB message, policy=2: 173 KB per assoc
Created attachment 113789 [details] final patch for sendbuffer accounting

This is the patch for sendbuffer accounting that finally got upstream acceptance and has been posted to the internal RHEL list for RHEL4 inclusion.
Setting state to MODIFIED, as this got incorporated into [2.6.9-6.43], which will be in U1.
Created attachment 113919 [details] Test run results on U1-beta

I installed the kernel from U1-beta (2.6.9-6.37.EL) and ran my test script. free still shows ~68 megabytes of memory consumption, and I still get system freezes up to 20 seconds in duration. Unlike before, the system now usually recovers within one minute even if I don't kill sctp_darn - with this particular test case ... but if I push it a little (or a lot) harder, I get kernel panics. The attachment shows the system freezes as missing seconds in the script output, in this case only 1-3 second freezes. I snapped the first SysRq showmem during a freeze, the second after the system had recovered.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-420.html