Description of problem:
When sending small messages over SCTP and suspending the receiver, SCTP uses a huge amount of memory for buffering. When the receiver is then switched back to the foreground, after a few seconds the system freezes. The freeze lasts 5-30 seconds with one association, more than 5 minutes with two associations. Obviously there are no problems if TCP is used in the same test instead of SCTP.

Version-Release number of selected component (if applicable):
kernel-2.6.9-5.EL

How reproducible:
Every time.

Steps to Reproduce:
1. One host, two processes: sender and receiver.
2. Create N SCTP associations from sender to receiver.
3. Sender: send (via loopback) 1-byte messages round-robin to the N assocs in a loop.
4. Receiver: receive (via loopback) 1-byte messages round-robin from the N assocs in a loop.
5. Suspend the receiver, wait until the sender blocks.
6. Check memory consumption.
7. Foreground the receiver.

Actual results:
In step 6, the memory consumption is about 200 megabytes per assoc. In step 7, the receiver runs for a few seconds, then the system freezes. The freeze lasts 5-30 seconds with one association, more than 5 minutes with two associations. With two associations, the system sometimes resumes normally after the freeze; sometimes it resumes only intermittently, freezing again after a few seconds.

Expected results:
In step 6, the memory consumption should be limited by the SCTP socket buffer limits, plus possibly some overhead. A default of 200 megabytes per assoc may be too much. Step 7 is pretty obvious: the system shouldn't freeze.

Additional info:
This behaviour may depend on CPU/memory, so here's the info from my test rig (Dell Optiplex GX270):

[jere@dhcp094189 ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 9
cpu MHz         : 2993.440
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 5914.62

[jere@dhcp094189 ~]$ free
             total       used       free     shared    buffers     cached
Mem:       1027228     260576     766652          0      20796     143516
-/+ buffers/cache:      96264     930964
Swap:      2031608          0    2031608
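For readers without the attached script, here is a minimal sketch of the sender half described in steps 3 and 5 (my illustration only, not the reporter's test program; the port number and the single-association simplification are assumptions). With a one-to-one style SCTP socket, every 1-byte send() becomes its own DATA chunk:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_SCTP
#define IPPROTO_SCTP 132	/* not present in all libc headers of this era */
#endif

int main(void)
{
	struct sockaddr_in sin;
	int fd;

	/* One-to-one style SCTP socket, driven with the plain sockets API. */
	fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5000);	/* assumed receiver port */
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		perror("connect");
		return 1;
	}

	/* Each 1-byte message gets its own sk_buff on both the send and
	 * receive paths. SIGSTOP the receiver and this loop keeps going
	 * until the sender finally blocks (step 5). */
	for (;;) {
		if (send(fd, "x", 1, 0) < 0) {
			perror("send");
			break;
		}
	}
	close(fd);
	return 0;
}

Run a matching receiver (socket/bind/listen/accept/recv loop), SIGSTOP it, and watch free(1); that reproduces steps 5-6 with N=1.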
Forgot one thing: in step 5, when SCTP (kernel code) allocates 200 megabytes per assoc and you run out of RAM, the system freezes and the CapsLock and ScrollLock LEDs start blinking (the usual kernel panic indication).
Could you please provide the test case you're using to reproduce this problem, along with memory consumption information from your box when the problem occurs, so that we can compare/confirm results? Thanks
Created attachment 110565 [details] Script to produce the problem.
Created attachment 110566 [details] Output of the script on my box.
I've attached a script and its output on my box. The script uses sctp_darn from lksctp-tools. In sctp_darn the minimum message size is 3 bytes. With one assoc the memory consumption is "only" ~67 megabytes, as you can see from the output. With 1-byte messages the memory consumption was ~200 megabytes; with 3-byte messages it's ~67, which is of course 200 / 3 = 67. Inverse linear. Also, the severity of the intermittent freezes seems proportional to the amount of consumed memory.

This brings us to a theory: SCTP is allocating entities with a lot of per-entity overhead (DATA chunks?), while the maximum number of entities is limited only by the payload they contain. If a buffer limit of L payload bytes admits L/s entities of message size s, each with fixed overhead h, the total memory is roughly L * h/s - inversely proportional to message size, exactly as observed. Small messages --> a huge number of entities. SCTP ends up processing a huge list of entities, and during that processing nothing much else is happening in the kernel.
Created attachment 110572 [details] my test case for sctp lowmem exhaustion

I just wrote my own test case here and got semi-similar results. My test case tends to result in ENOMEM errors, but that could just be the way I have my socket options set. I'll confirm results with your scripts. It appears that in the case of my test, the skbuff_head_cache is growing without bound, until the kernel is in a state of lowmem exhaustion. Not sure quite what to do about it yet....
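A quick way to watch that growth (a throwaway helper of my own, not part of the attached test case): echo the skbuff_head_cache line of /proc/slabinfo once per second. The column layout of slabinfo varies between kernel versions, so this deliberately does no parsing.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[512];
	FILE *f;

	/* Poll /proc/slabinfo and print the skbuff_head_cache line. */
	for (;;) {
		f = fopen("/proc/slabinfo", "r");
		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (strstr(line, "skbuff_head_cache"))
				fputs(line, stdout);
		}
		fclose(f);
		sleep(1);
	}
}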
I was just browsing through the sctp code, and found this in the header of sctp_sendmsg:

BUG: We do not implement the equivalent of wait_for_tcp_memory()

Upstream seems to hold a variant of this comment, and I expect this is what we're seeing: without it, send-side memory usage will likely grow without bound. I'll see what's involved in implementing it.
No, I don't mind at all. I'm going to contact Dave M about this tomorrow to figure out how best to handle this. Thanks!
Created attachment 110796 [details] Sysreport from my box

Sysreport as submitted by customer to Issue Tracker.
Created attachment 110797 [details] A more elaborate log file

A more elaborate log file. This time I was on a serial console instead of an X session, hence the different free-memory numbers. The consumption is still the same ~67 megabytes. This log includes several samples of SysRq-m: one before script execution, several during the freeze, and one more after I managed to killall sctp_darn and the system came back to normal.
Thanks. I'm in the process of tracking this down at the moment.
Can you all do me a favor? I think I was mistaken before regarding what I thought the problem was. sctp does appear to provide reasonable accounting of its memory usage, and while it could probably stand to have better accounting at the socket level, I think this problem can be easily avoided by tuning, rather than code adjustments.

It would appear that by default (not surprisingly), the socket send and receive buffer sizes are set to whatever /proc/sys/net/core/[r|w]mem_default is. Normally, under most protocols, the boot-time default here is fine. UDP doesn't care so much, since frames are freed once they are sent, leading to a quick recovery of allocated memory. Under TCP the values are reasonable, since TCP coalesces data into large frames under a single sk_buff. In sctp, however, we can send 1 byte of data at a time (which we are doing here), and each 1-byte datagram requires its own sk_buff structure. Because of this extreme fragmentation, and because the socket buffer accounting counts only payload size and not the associated structure size, the socket buffer limits appear effectively unlimited to the send and receive code. As such, the protocol continues to send data, which gets enqueued on the receive path; when we suspend the listener half of the application, the sk_buff_head slab cache grows and grows until we run out of lowmem.

The easy workaround is to either:

a) reduce /proc/sys/net/core/rmem_default and wmem_default to values reasonable for the system you are running on. I tested with values between 4096 and 32768, and detected no slowdowns or lockups on my 1GB machine.

b) reduce the SO_SNDBUF and SO_RCVBUF values via setsockopt inside the program being run (see the sketch after this comment). This has the same effect as (a), but is nicer because it avoids any odd side effects for programs which do not explicitly set SND/RCV buffer sizes.

As for a fix, I'm not sure there needs to be one. About the only thing I can think of at the moment is that we include the sk_buff structure size in the accounting of the SND/RCV buf values (since, unlike tcp, sctp can send very small data packets, each of which requires an sk_buff and an ACK of receipt from its peer). It kind of feels wrong to me to do that, but on the other hand, it would be a more accurate reflection of how much memory the protocol is using. I'll solicit comments and post an update here with a verdict.
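For workaround (b), the application-side change is a couple of setsockopt() calls; a minimal sketch (the 8192 figure is just an example from the 4096-32768 range tested above):

#include <stdio.h>
#include <sys/socket.h>

/* Clamp a socket's send/receive buffers so that the payload-only
 * accounting saturates long before lowmem is exhausted. Call this on
 * each SCTP socket before the association starts moving data. */
int clamp_buffers(int fd)
{
	int bufsz = 8192;	/* example value; tune for your association count */

	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)) < 0) {
		perror("setsockopt(SO_SNDBUF)");
		return -1;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)) < 0) {
		perror("setsockopt(SO_RCVBUF)");
		return -1;
	}
	return 0;
}

Note that the kernel doubles the value passed in to leave room for bookkeeping overhead, so an SO_SNDBUF of 8192 yields an effective sk_sndbuf of 16384.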
The area of effect of workaround a) is way too wide. Workaround b) kind of solves the problem, but requires modification and recompilation of user programs, and is therefore about as effective as workaround c): don't send so many / such small messages. Even with workaround b), one would be forced to use really small buffers, considering that there can be thousands of associations. Be that as it may, the _real_ problem is that if the sctp module is loaded, any user can freeze the system, either intentionally or accidentally.
You are correct; the problem here is that using small buffers can lead to an exhaustion of lowmem, bringing the system to its knees. As I mentioned in my last post, this stems from the fact that sending/receiving small buffers without accounting for the size of the requisite sk_buffs leads to inaccurate accounting of socket memory usage. I'm going to write a patch to fix this shortly. Until then, whichever workaround (a, b or c) you find most suitable for your environment should avoid this problem just fine.
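To make that concrete, here is a rough sketch of the direction such a patch can take (my illustration, modeled on helper names in the lksctp sources like sctp_set_owner_w()/sctp_wfree(); it is not the attached patch): when a DATA chunk is queued for transmit, charge the socket for the whole sk_buff footprint rather than just the payload bytes.

/* Sketch only. Charging truesize plus the chunk structure makes
 * sk_sndbuf a real bound even for 1-byte messages, where the struct
 * overhead dwarfs the payload. */
static void sctp_set_owner_w(struct sctp_chunk *chunk)
{
	struct sock *sk = chunk->asoc->base.sk;

	chunk->skb->sk = sk;
	chunk->skb->destructor = sctp_wfree;	/* uncharges on free */

	sk->sk_wmem_queued +=
		chunk->skb->truesize + sizeof(struct sctp_chunk);
}

With the charge in place, the send path can compare sk_wmem_queued against sk_sndbuf and sleep (the wait_for_tcp_memory() analogue mentioned earlier) instead of allocating without bound.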
Right, correcting the buffer usage accounting should fix this. It can make the buffers seem small to the user process if small messages are used. Still, in my opinion that's much better than accidentally running out of memory and freezing the system. Even with more correct accounting of buffer usage, there might still be problems with large numbers of associations, but I guess it's the same even with TCP: fill ten thousand TCP socket buffers and you'll run out of memory. If you come up with a patch, I'd be happy to try it. Meanwhile, I'll try to cope with the workarounds. Thanks.
Created attachment 111436 [details] patch to fix sctp sendbuffer accounting
Created attachment 111439 [details] patch to fix sctp receive buffer accounting
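For context on what "receive buffer accounting" (the attachment above) means at the socket layer: the generic kernel already has a helper that charges an skb's truesize (payload plus structure overhead) to the receiving socket, and the receive-side patch essentially makes SCTP's queued data pay that cost. Quoted from memory from the 2.6 include/net/sock.h, so treat it as illustrative:

static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb->sk = sk;
	skb->destructor = sock_rfree;	/* uncharges sk_rmem_alloc on free */
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
}

Once sk_rmem_alloc reflects truesize, the receive path can apply backpressure when it exceeds sk_rcvbuf, which is what keeps a suspended receiver from dragging the box into lowmem exhaustion.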
Created attachment 111440 [details] patch to stall sending processes on full peer receive window
I just uploaded three patches to this bug:

1) a send buffer accounting patch
2) a receive buffer accounting patch
3) a receive window stall patch

I'm still waiting on internal feedback on these, but I thought if anyone wanted to test them out, I'd put them up. (1) and (2) are independent, but (3) is cumulative with (1), so if you want to test it, you need to take both (1) and (3). I recommend that they all be included, as they fix what I believe are pretty clearly bugs in sctp, based on my reading of the RFC (although, bear in mind, I'm a little new to this protocol, so I might be somewhat mistaken). Patch (3) isn't technically a bug fix, but it avoids a lot of really unnecessary thrashing of the slab cache caused by slab allocs that are immediately freed again. I'm also building a kernel with all three patches incorporated, and I'll post a link once it's available.
I've built and uploaded test kernels with the three attached patches incorporated to: http://people.redhat.com/nhorman/kernels

The UP and SMP versions are there. If anyone seeing this problem wouldn't mind, test feedback would be greatly appreciated. Thanks!
My observations testing the three patches:

Patches 1+2: My script produces allocations of more than 750 Mbytes, and the kernel freezes. So these two patches alone are actually worse than the original.

Patches 1+2+3 (tried this with Neil's binaries as well as ones compiled by myself, with identical results): The script still produces an allocation of about 28 Mbytes. No kernel freezes. Better, but 28 Mbytes per assoc is still too much, _and_ if I drop the message size from 3 bytes to 1 byte, the allocation becomes 85 Mbytes per assoc.
How are you measuring the amount of memory consumed by each association?

Also, just FYI, these patches won't be the final revision. The receive buffer patch has been accepted upstream, but I'm still working with the project maintainers on the send buffer accounting scheme. As it turns out, there is some language in the sctp socket extensions and implementation guide that suggests that each association should have sk_sndbuf worth of bytes to allocate on the send side, even when multiplexed over one socket. Clearly this violates socket-layer accounting rules, so I'm walking a fine line here. I'll post whatever patch we wind up going with upstream here. If anyone wants to follow along, the mailing-list archives are available at the lksctp project page on SourceForge.
free:

             total       used       free     shared    buffers     cached
Mem:       1027228     306240     720988          0      32532     155232
-/+ buffers/cache:     118476     908752
Swap:      2031608          0    2031608

I'm looking at changes in the last number on the middle line, in this case 908752. I've also checked (using top) that user processes don't increase memory consumption during testing.
Created attachment 111849 [details] patch to do rcv buffer accounting on ulpevent sk_buffs

Sorry, you're right. I was looking into it after your last post, and I noticed that when the receiver was suspended, the size-512 slab cache was growing quite large. I backtracked it to skb allocations in the ulpevent code. Please add this new patch to the set that you are currently testing with and let me know how it goes. I tried it with my test case, and ran the lksctp regression suite against the entire set of 4, and memory usage seemed to be reduced to whatever I set /proc/sys/net/core/[r|w]mem_default to. Thanks!
I applied the fourth patch also, but as a result, my box runs out of its 1G RAM when I run my test script. Ouch.
That doesn't make any sense. The fourth patch only increases the amount of memory accounted for on the receive side of the network stack for this protocol. Can you run the test again, and see if you get consistent results?
It's consistent, although I don't always run out of RAM. But even then, the memory consumption is huge.
Well, I don't understand how that can happen, but there are some comments around the ulpevent code that suggest other things may be wrong in that functionality as well. I'll try to see if I can figure out what's going on there. Thanks.
Created attachment 112070 [details] this is the final version of the sndbuf accounting patch which is getting upstream testing
Current status:

1) Receive buffer accounting patch applied upstream and in Red Hat kernels
2) Send buffer accounting patch has a consensus upstream and will receive additional testing before application
3) ulpevent receive buffer accounting patch pending review upstream

In addition to these patches, I've also found other code paths in sctp which can allocate sk_buffs without having them accounted for on the send side. Unfortunately they are all reachable via the net rx softirq, which means we can't sleep waiting for send buffer space. It looks like this is the result of a hack I don't yet completely understand. It should be pretty easy to reverse, but I'll need to understand it better before we know exactly how to fix this. This is most likely the source of the extra memory allocations that jere reported in comment #27 above. I'll get something put together for that asap.
Created attachment 112266 [details] patch to break up bottom half to avoid deadlock when receive path sends frames
Created attachment 112267 [details] patch to add sendbuffer accounting to sctp_make_chunk for all paths.
Sorry, but there are more patches here. I'm still cleaning up accounting issues (and there appear to be more still). I'll put together a new kernel rpm for testing which incorporates all of them and post a link here soon.
I've built a kernel with the following patches in it for testing:

1) patch to fix sctp receive buffer accounting
2) the final version of the sndbuf accounting patch which is getting upstream testing
3) patch to do rcv buffer accounting on ulpevent sk_buffs
4) patch to stall sending processes on full peer receive window

The kernels are here: http://people.redhat.com/nhorman/kernels/

Please note that patch 2 requires you to set the sctp accounting mode; it should be set to zero in the sysctl. Also, the upstream maintainer has some more concerns regarding sctp rfc/implementation guide compliance, so the final (sigh :)) version of the sctp sendbuffer patch may vary somewhat from what is in here, but it will be close.
I tested with Neil's binaries (link in the previous comment). With association_sndbuf_policy set to 0 or 2, there are no problems with my test case. With association_sndbuf_policy set to 1, the first sendmsg() in sctp_darn gives ENOMEM. I'm a bit unsure about the wisdom of the sysctl - shouldn't there be one agreed-upon, "correct" method of buffer accounting in sctp? Oh well, what do I know. Anyway, this is certainly an improvement.
I agree that the sysctl is a little silly. There is, however, a statement in the SCTP implementors guide that indicates that buffer accounting should be done on a per-association basis, rather than on a per-socket basis. It's clearly a violation of socket-level accounting rules, but the upstream developers are adamant about maintaining the ability to do per-association accounting. The sysctl is the only way around this that I see.
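To make the two policies concrete, the send-space check ends up looking something like this sketch (modeled on sctp_wspace() in net/sctp/socket.c; field and sysctl names may differ from the shipped patch):

/* Sketch of per-socket vs. per-association send space. */
static int sctp_wspace(struct sctp_association *asoc)
{
	struct sock *sk = asoc->base.sk;
	int amt;

	if (asoc->ep->sndbuf_policy)
		/* policy 1: each association gets sk_sndbuf to itself */
		amt = sk->sk_sndbuf - asoc->sndbuf_used;
	else
		/* policy 0: all associations share the socket's budget */
		amt = sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc);

	return amt < 0 ? 0 : amt;
}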
Yeah, I guess you're right ... but I still don't like it. ;-)

I managed a little bit more testing. I'm not sure how this is supposed to behave now, but there's still some dependency between user message size and memory consumption. I tested with one-to-one sockets, and the memory consumption was:

1-byte message, policy=0: 485 KB per assoc
1-byte message, policy=2: 570 KB per assoc
10 KB message, policy=0: 179 KB per assoc
10 KB message, policy=2: 173 KB per assoc
Created attachment 113789 [details] final patch for sendbuffer accounting

This is the patch for sendbuffer accounting that finally got upstream acceptance and has been posted to the internal RHEL list for RHEL4 inclusion.
Setting state to MODIFIED, as this got incorporated into [2.6.9-6.43], which will be in U1.
Created attachment 113919 [details] Test run results on U1-beta

I installed the kernel from U1-beta (2.6.9-6.37.EL) and ran my test script. free still shows ~68 megabytes of memory consumption, and I still get system freezes up to 20 seconds in duration. Unlike before, the system now usually recovers within one minute even if I don't kill sctp_darn - with this particular test case ... but if I push it a little (or a lot) harder, I get kernel panics. The attachment shows the system freezes as missing seconds in the script output, in this case only 1-3 second freezes. I snapped the first SysRq showmem during a freeze, the second after the system had recovered.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-420.html