Description of problem:
When sending small messages over SCTP and suspending the receiver,
SCTP uses a huge amount of memory for buffering. When the receiver is
then switched back to foreground, after a few seconds the system
freezes. The freeze lasts 5-30 seconds with one association and more
than 5 minutes with two. No such problems occur if TCP is used in the
same test instead of SCTP.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. One host, two processes: sender and receiver.
2. Create N SCTP associations from sender to receiver.
3. Sender: Send (via loopback) 1-byte messages round-robin to N
assocs in a loop.
4. Receiver: Receive (via loopback) 1-byte messages round-robin from
N assocs in a loop.
5. Suspend the receiver, wait until sender blocks.
6. Check memory consumption.
7. Foreground the receiver.
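The steps above can be sketched roughly as follows. This is a minimal
Python sketch, not the actual test program: UDP over loopback stands in
for SCTP so it runs without the sctp kernel module, and N and the round
count are arbitrary. With lksctp present, the sockets would instead be
created with socket.IPPROTO_SCTP.

```python
import socket

N = 2          # number of "associations" in this sketch (arbitrary)
ROUNDS = 100   # number of round-robin rounds (arbitrary)

# UDP over loopback stands in for SCTP so this sketch runs without the
# sctp kernel module; with lksctp you would create the sockets with
# socket.socket(socket.AF_INET, socket.SOCK_SEQPACKET, socket.IPPROTO_SCTP).
pairs = []
for _ in range(N):
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    rx.settimeout(5)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.connect(rx.getsockname())
    pairs.append((tx, rx))

received = 0
for _ in range(ROUNDS):
    for tx, _rx in pairs:        # step 3: send 1-byte messages round-robin
        tx.send(b"x")
    for _tx, rx in pairs:        # step 4: receive them round-robin
        if rx.recv(1) == b"x":
            received += 1
# Step 5 would be suspending the receiving loop (e.g. with SIGSTOP)
# while the sending loop keeps running until it blocks.

for tx, rx in pairs:
    tx.close()
    rx.close()
```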
In step 6, the memory consumption is about 200 megabytes per assoc.
In step 7, the receiver runs for a few seconds, then the system
freezes. The freeze lasts 5-30 seconds with one association, more
than 5 minutes with two associations. With two associations, the
system sometimes resumes normally after the freeze, and sometimes it
resumes only intermittently, freezing again after a few seconds.
In step 6, the memory consumption should be limited by SCTP socket
buffer limits plus possibly some overhead. A default of 200 megabytes
per assoc may be too much.
Step 7 is pretty obvious: the system shouldn't freeze.
This behaviour may be dependent on CPU/Mem, so here's the info from
my test rig (Dell Optiplex GX270):
[jere@dhcp094189 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 9
cpu MHz : 2993.440
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
bogomips : 5914.62
[jere@dhcp094189 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       1027228     260576     766652          0      20796     143516
-/+ buffers/cache:      96264     930964
Swap:      2031608          0    2031608
Forgot one thing:
In step 5, when SCTP (kernel code) allocates 200 megabytes per assoc,
if you run out of RAM, the system freezes and starts blinking the
CapsLock and ScrollLock LEDs.
Could you please provide the test case you're using to reproduce this
problem, and memory consumption information from your box when the
problem occurs, so that we can compare/confirm results? Thanks
Created attachment 110565 [details]
Script to produce the problem.
Created attachment 110566 [details]
Output of the script on my box.
I've attached a script and the output on my box.
The script uses sctp_darn from lksctp-tools. In sctp_darn the minimum message
size is 3 bytes. With one assoc the memory consumption is "only" ~67 megabytes,
as you can see from the output.
With one-byte messages the memory consumption was ~200 megabytes; with
3-byte messages it's ~67, which is of course 200 / 3 = 67. Inverse
linear. Also, the severity of the intermittent freezes seems
proportional to the amount of consumed memory. This suggests a theory:
SCTP is allocating entities with a lot of per-entity overhead (DATA
chunks?), while the maximum number of entities is limited only by the
payload contained within them. Small messages --> a huge number of
entities. SCTP ends up processing a huge list of entities, and during
that processing not much else happens in the kernel.
Created attachment 110572 [details]
my test case for sctp lowmem exhaustion
I just wrote my own test case here and got semi-similar results. My
test case tends to result in ENOMEM errors, but that could just be the
way I have my socket options set. I'll confirm results with your
scripts. It would appear that in my test, the skbuff_head_cache grows
without bound until the kernel is in a state of lowmem exhaustion. Not
sure quite what to do about it yet....
I was just browsing through the sctp code, and found this in the
header of one of the functions:
  BUG: We do not implement the equivalent of wait_for_tcp_memory()
Upstream seems to carry a variant of this comment, and I expect this
is what we're seeing: without it, send-side memory usage will likely
grow without bound. I'll see what's involved in implementing this.
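For what it's worth, the behavior that comment refers to can be
sketched as a toy model. This is purely illustrative Python, not
kernel code, and all names here are invented: a sender is charged for
each queued message and blocks when its accounted bytes would exceed
the send buffer, then gets woken as transmitted messages release their
charge.

```python
import threading

class AccountedSendQueue:
    """Toy model of send buffer accounting (illustrative only; this
    class and its names are invented, not kernel code)."""
    def __init__(self, sndbuf):
        self.sndbuf = sndbuf      # accounted limit, like a socket sndbuf
        self.used = 0             # bytes currently charged
        self.queue = []
        self.cv = threading.Condition()

    def send(self, payload, overhead=200):
        # Charge structure overhead as well as payload, so tiny
        # messages cannot make the limit meaningless.
        cost = len(payload) + overhead
        with self.cv:
            while self.used + cost > self.sndbuf:
                self.cv.wait()    # the missing "wait for memory" step
            self.used += cost
            self.queue.append((payload, cost))

    def transmit_one(self):
        # Transmission releases the charge and wakes blocked senders.
        with self.cv:
            payload, cost = self.queue.pop(0)
            self.used -= cost
            self.cv.notify_all()
            return payload

q = AccountedSendQueue(sndbuf=1000)
for _ in range(4):
    q.send(b"x")                  # 4 x (1 + 200) = 804 bytes charged
q.transmit_one()                  # frees one charge: 603 bytes remain
```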
No, I don't mind at all. I'm going to contact Dave M about this tomorrow to
figure out how best to handle this. Thanks!
Created attachment 110796 [details]
Sysreport from my box
Sysreport as submitted by customer to Issue Tracker
Created attachment 110797 [details]
A more elaborate log file
A more elaborate log file. This time I was on a serial console instead of X
session, hence the different free memory numbers. The consumption is still the
same ~67 megabytes. This log includes several samples of SysRq-m: one
before script execution, several during the freeze, and one more after
I managed to killall sctp_darn and the system came back to normal.
Thanks. I'm in the process of tracking this down at the moment.
Can you all do me a favor? I think I was mistaken before regarding
what I thought the problem was. sctp does appear to provide reasonable
accounting of its memory usage, and while it could probably stand to
have better accounting at the socket level, I think this problem can
easily be avoided by tuning rather than by code adjustments.
It would appear that by default (not surprisingly), the socket send
and receive buffer sizes are set to whatever
/proc/sys/net/core/[r|w]mem_default is. Normally the default value on
boot is fine for most protocols. UDP doesn't care so much about it,
as frames are freed once they are sent, leading to a quick recovery of
allocated memory. Under TCP the values are reasonable, since TCP
coalesces data into large frames under a single sk_buff. However, in
sctp we can send 1 byte of data at a time (which we are doing here),
and each 1-byte datagram requires its own sk_buff structure. This
extreme fragmentation, coupled with the fact that the socket buffer
accounting counts only payload size and not the associated structure
size, means the socket buffer limits appear effectively unlimited to
the send and receive code. As such, the protocol keeps sending data,
which gets enqueued on the receive path; when we suspend the listener
half of the application, the sk_buff_head slab cache grows and grows
until we run out of lowmem.
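To make the arithmetic concrete, here is a toy model of what
payload-only accounting permits. Both numbers below are illustrative
assumptions for the sketch, not measured kernel values.

```python
# Toy model of payload-only socket buffer accounting.
RMEM_DEFAULT = 128 * 1024      # assumed payload-only receive limit (bytes)
SKB_OVERHEAD = 512             # assumed real cost of one queued sk_buff

def true_memory(msg_size):
    # Payload-only accounting admits this many queued messages...
    queued = RMEM_DEFAULT // msg_size
    # ...but each one really costs payload plus structure overhead.
    return queued * (msg_size + SKB_OVERHEAD)

# Smaller messages -> more sk_buffs -> more unaccounted memory,
# roughly inversely proportional to message size, which matches the
# 200 MB (1-byte) vs ~67 MB (3-byte) observation above.
m1 = true_memory(1)
m3 = true_memory(3)
```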
The easy workaround is to either:
a) reduce /proc/sys/net/core/rmem_default and wmem_default to values
reasonable for the system you are running on. I tested with values
between 4096 and 32768, and detected no slow-downs or lockups on my
system.
b) reduce the SO_SNDBUF and SO_RCVBUF values via setsockopt inside the
program being run. This has the same effect as (a), but is nicer
because it will avoid any odd side effects for programs which do not
explicitly set SND/RCV buffer sizes.
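Workaround (b) can be sketched like this. A plain TCP socket is used
here so the sketch runs without the sctp module loaded; the same
setsockopt calls apply unchanged to a socket created with
socket.IPPROTO_SCTP, and the 16384 value is just an example.

```python
import socket

# Workaround (b): shrink SO_SNDBUF/SO_RCVBUF per socket.  A plain TCP
# socket is used so this runs anywhere; the identical calls apply to
# an SCTP socket created with socket.IPPROTO_SCTP.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 16384)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)

# On Linux the kernel doubles the requested value to leave room for
# bookkeeping overhead, so getsockopt() reports at least what was asked.
sndbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
rcvbuf = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
s.close()
```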
As for a fix, I'm not sure there needs to be one. About the only
thing I can think of at the moment (since, unlike tcp, sctp can send
very small data packets, each of which requires an sk_buff and an ACK
of receipt from its peer) is to include the sk_buff structure size in
the accounting of the SND/RCV buf values. It kind of feels wrong to
me to do that, but on the other hand, it would be a more accurate
reflection of how much memory the protocol is using. I'll solicit
comments and post an update here with a verdict.
The area of effect of workaround a) is way too wide.
Workaround b) kind of solves the problem, but requires modification
and recompilation of user programs, and is therefore about as
practical as:
c) Don't send so many / such small messages.
Even with workaround b), one would be forced to use really small
buffers, considering that there can be thousands of associations.
Be that as it may, the _real_ problem is that if the sctp module is
loaded, any user can freeze the system, either intentionally or
accidentally.
You are correct, the problem here is that using small buffers can lead
to an exhaustion of lowmem, bringing the system to its knees. As I
mentioned in my last post, this stems from the fact that
sending/receiving small buffers without accounting for the size of the
requisite sk_buffs leads to inaccurate accounting of socket memory
usage. I'm going to write a patch to fix this shortly. Until then,
whichever workaround (a, b or c) you find most suitable for your
environment should avoid this problem.
Right, correcting the buffer usage accounting should fix this. It can
make the buffers seem small to the user process if small messages are
used. Still, in my opinion it's much better than accidentally running
out of memory and freezing the system. Even with more correct
accounting of buffer usage, there might still be problems with large
numbers of associations, but I guess it's the same even with TCP:
fill ten thousand TCP socket buffers and you'll run out of memory.
If you come up with a patch I'd be happy to try it. Meanwhile, I'll
try to cope with the workarounds. Thanks.
Created attachment 111436 [details]
patch to fix sctp sendbuffer accounting
Created attachment 111439 [details]
patch to fix sctp receive buffer accounting
Created attachment 111440 [details]
patch to stall sending processes on full peer receive window
I just uploaded three patches to this bug:
1) a send buffer accounting patch
2) a receive buffer accounting patch
3) a receive window stall patch
I'm still waiting on internal feedback on these, but I thought if
anyone wanted to test them out, I'd put them up. (1) and (2) are
independent, but 3 is cumulative with 1, so if you want to test it,
you need to take 1 and 3. I recommend that they all be included, as
they fix what I believe are pretty clearly bugs in sctp, based on my
reading of the RFC (although, bear in mind, I'm a little new to this
protocol, so I might be somewhat mistaken). Patch 3 isn't technically
a bug fix, but it avoids a lot of really unnecessary thrashing of the
slab cache caused by slab allocs that are immediately freed again.
I'm also building a kernel with all three patches incorporated, and
I'll post a link once it's available.
I've built and uploaded test kernels with the three attached patches
applied. The up and smp versions are there; if anyone seeing this
problem wouldn't mind, test feedback would be greatly appreciated.
Thanks!
My observations testing the three patches:
Patches 1+2: My script produces allocations of more than 750 Mbytes,
and the kernel freezes. So these two patches alone are actually worse
than the original.
Patches 1+2+3 (tried this with Neil's binaries as well as ones
compiled by myself, identical results): The script still produces
allocations of about 28 Mbytes. No kernel freezes. Better, but 28
Mbytes per assoc is still too much, _and_ if I drop the message size
from 3 bytes to 1 byte, the allocation becomes 85 Mbytes per assoc.
How are you measuring the amount of memory consumed by each association?
Also, just FYI, these patches won't be the final revision. The
receive buffer patch has been accepted upstream, but I'm still working
with the project maintainers on the send buffer accounting scheme. As
it turns out, there is some language in the sctp socket extensions and
implementation guide suggesting that each association should have
sk_sndbuf worth of bytes to allocate on the send side, even when
multiplexed over one socket. Clearly this violates socket layer
accounting rules, so I'm walking a fine line here. I'll post whatever
patch we wind up going with upstream here. If anyone wants to follow
along, the mailing list archives are available at the lksctp project
page on sourceforge.
             total       used       free     shared    buffers     cached
Mem:       1027228     306240     720988          0      32532     155232
-/+ buffers/cache:     118476     908752
Swap:      2031608          0    2031608
I'm looking at changes in the last number on the middle line, in this
case 908752. I've also checked (using top) that user processes don't
increase memory consumption during testing.
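Concretely, the arithmetic behind that middle line, using the numbers
from the free output above:

```python
# Numbers from the `free` output above (kilobytes).
total, used, free_, buffers = 1027228, 306240, 720988, 32532
bufcache_used = 118476   # "used" from the -/+ buffers/cache line

# free's -/+ line subtracts reclaimable buffer and page-cache memory
# from "used" and adds it to "free"; the last number on that line is
# the one being watched during the tests.
cached = used - buffers - bufcache_used
free_plus_cache = free_ + buffers + cached
```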
Created attachment 111849 [details]
patch to do rcv buffer accounting on ulpevent sk_buffs
Sorry, you're right. I was looking into it after your last post, and
I noticed that when the receiver was suspended, the size-512 slab
cache was growing quite large. I backtracked it to skb allocations in
the ulpevent code. Please add this new patch to the set you are
currently testing with and let me know how it goes. I tried it with
my test case, and ran the lksctp regression suite against the entire
set of 4, and memory usage seemed to be reduced to whatever I set
/proc/sys/net/core/[r|w]mem_default to. Thanks!
I applied the fourth patch also, but as a result, my box runs out of
its 1G RAM when I run my test script. Ouch.
That doesn't make any sense. The fourth patch only increases the
amount of memory accounted for on the receive side of the network
stack for this protocol. Can you run the test again and see if you
get consistent results?
It's consistent, although I don't always run out of RAM. But even
then, the memory consumption is huge.
Well, I don't understand how that can happen, but there are some
comments around the ulpevent code that suggest other things may be
wrong in that functionality as well. I'll see if I can figure out
what's going on there. Thanks.
Created attachment 112070 [details]
this is the final version of the sndbuf accounting patch which is getting upstream testing
1) Receive buffer accounting patch applied upstream and in Red Hat kernels
2) Sendbuffer accounting patch has a consensus upstream and will be receiving
additional testing before application
3) ulpevent receive buffer accounting patch pending review upstream
In addition to these patches, I've also found other code paths in sctp
which can allocate skbuffs without having them accounted for on the
send side. Unfortunately they are all reachable via the net rx
softirq, which means we can't sleep waiting for sendbuffer space. It
looks like this is the result of a hack I don't yet completely
understand. It should be pretty easy to reverse, but I'll need to
understand it better before we know exactly how to fix this. This is
most likely the source of the extra memory allocations that jere
reported in comment #27 above. I'll get something put together for
that asap.
Created attachment 112266 [details]
patch to break up bottom half to avoid deadlock when receive path sends frames
Created attachment 112267 [details]
patch to add sendbuffer accounting to sctp_make_chunk for all paths.
Sorry, but there are more patches here. I'm still cleaning up
accounting issues (and there appear to be more still). I'll put
together a new kernel rpm for testing which incorporates all of them
and post a link here soon.
I've built the kernel with the following patches in it for testing:
1)patch to fix sctp receive buffer accounting
2) the final version of the sndbuf accounting patch, which is getting
upstream testing
3) patch to do rcv buffer accounting on ulpevent sk_buffs
4) patch to stall sending processes on full peer receive window
The kernels are here:
Please note that patch 2 requires you to set the sctp accounting mode. It
should be set to zero in the sysctl.
Also, the upstream maintainer has some more concerns regarding sctp
rfc/implementation guide compliance, so the final (sigh :)) version of
the sctp sendbuffer patch may vary somewhat from what is in here, but
it will be close.
I tested with Neil's binaries (link in the previous comment.)
With association_sndbuf_policy set to 0 or 2, no problems with my test case.
With association_sndbuf_policy set to 1, the first sendmsg() in
sctp_darn fails.
I'm a bit unsure about the sensibility of the sysctl - shouldn't there
be one, agreed upon, "correct" method of buffer accounting in sctp?
Oh well, what do I know.
Anyway, this is certainly an improvement.
I agree that the sysctl is a little silly. There is, however, a
statement in the SCTP implementors guide indicating that buffer
accounting should be done on a per-association basis rather than on a
per-socket basis. It's clearly a violation of socket level accounting
rules, but the upstream developers are adamant about maintaining the
ability to do per-association accounting. The sysctl is the only way
around this that I see.
Yeah, I guess you're right ... but I still don't like it. ;-)
I managed to do a little more testing. I'm not sure how this was
supposed to behave now, but there's still some dependency between user
message size and memory consumption. I tested with one-to-one sockets
and the memory consumptions were:
1 byte message, policy=0: 485KB per assoc
1 byte message, policy=2: 570KB per assoc
10KB message, policy=0: 179KB per assoc
10KB message, policy=2: 173KB per assoc
Created attachment 113789 [details]
final patch for sendbuffer accounting
This is the patch for sendbuffer accounting that finally got upstream
acceptance and has been posted to the internal RHEL list for RHEL4 inclusion
Setting state to modified, as this got incorporated into [2.6.9-6.43],
which will be in U1.
Created attachment 113919 [details]
Test run results on U1-beta
I installed the kernel from U1-beta (2.6.9-6.37.EL) and ran my test
script. Free still shows ~68 megabytes of memory consumption, and I
still get system freezes up to 20 seconds in duration. Unlike before,
the system usually recovers within one minute even if I don't kill
sctp_darn - with this particular test case ... but if I push it a
little (or a lot) more, I get kernel panics.
The attachment shows the system freezes as missing seconds in the
script output, in this case only 1-3 second freezes. I snapped the
first sysrq showmem during a freeze, the second after the system had
recovered.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.