Created attachment 331637 [details]
ip_dst_cache (in blue) VS `route -Cn | wc -l` (in green) over time
Description of problem:
The value of ip_dst_cache (in /proc/slabinfo) grows constantly, even though the cached route table remains fairly constant. Eventually ip_dst_cache reaches the value of /proc/sys/net/ipv4/route/max_size; when this happens, the kernel complains with 'dst cache overflow' and the server no longer responds to any network activity.
Version-Release number of selected component (if applicable): kernel 2.6.18-92.1.22.el5
How reproducible: a live system is currently affected by this issue.
Steps to Reproduce:
1. Configure test machine as a router between two networks
2. send packets from network A to network B with a large number of different source/dest IPs
3. Watch the values of ip_dst_cache and rt_cache
ip_dst_cache continues to grow while rt_cache grows and shrinks with the traffic
ip_dst_cache and rt_cache follow each other closely. Values return to zero after traffic stops and route cache entries expire.
Leak rate is slower during higher traffic loads.
Created attachment 331719 [details]
ip_dst_cache (in blue) vs. `ip route ls cache | wc -l` (in green)
Higher network traffic loads are present between 17:00 and 04:00
Confirmed the leak is present in 2.6.18-131 (hence it's also present in 5.3)
Report sent to oCERT with details of remote DoS exploiting this issue.
(In reply to comment #4)
> Report sent to oCERT with details of remote DoS exploiting this issue.
Hector, is this bug considered public? I noticed the bz was created without the Security keyword. https://bugzilla.redhat.com/show_activity.cgi?id=485163
Can you please share with us the report for oCERT if it has additional information not already in this bz? Thanks.
(In reply to comment #6)
> Created an attachment (id=332786) [details]
> Report sent to oCERT with regards to this issue
> Public disclosure to follow oCERT disclosure guidelines and only upon agreement
> with oCERT and reporter
Thanks Hector. The reason I asked is because this bug was first created as a public bug, and it was made private many days later. So, I am not sure why this bug should be kept private if it is already made public previously.
(In reply to comment #7)
> (In reply to comment #6)
> > Created an attachment (id=332786) [details] [details]
> > Report sent to oCERT with regards to this issue
> > Public disclosure to follow oCERT disclosure guidelines and only upon agreement
> > with oCERT and reporter
> Thanks Hector. The reason I asked is because this bug was first created as a
> public bug, and it was made private many days later. So, I am not sure why this
> bug should be kept private if it is already made public previously.
Btw, have you tested the upstream kernel?
I changed it to a security bug because once I figured out how the bug reacted to network traffic, I was able to come up with a simple remote DoS attack, which in my mind exposed vulnerable servers to a high risk.
My initial intent was to report a bug that was present on some of the servers I manage.
I haven't tested the issue with any mainline kernels. I tried looking for a howto/guide on building a mainline kernel for RHEL or CentOS but didn't find anything too useful. If you can give me pointers I can try it on my virtual machines.
I realize that there are probably very few vulnerable servers, which makes this a minor issue. But those servers would most likely be part of network infrastructure (routers for ISPs, for example), so an attack on one of them would affect a larger user base.
I'm starting to have second thoughts about the seriousness of this, as I look over the code and the analysis thereof. In this test you're adding a route of type RTN_UNREACHABLE to the fib, then sending a ton of packets through the system, and observing that the number of dst cache entries on the slab grows unboundedly while the actual number of dst entries in the route cache remains constant at or near its max value. I agree that on the surface that looks like a leak, but the path from creating a dst entry in ip_route_input_slow (at the local_input label, for routes of type RTN_UNREACHABLE) to where they are hashed into place in rt_intern_hash is very short and concise, and I don't see any way we can leak a dst entry out of there.
That said, I started to think about the data above, and it's missing a bit. Depending on which field you looked at in /proc/slabinfo, that data could be perfectly valid. If dst entries are freed back to their slab cache, the active number for that cache will go down, but the total number will stay up until the kernel shrinks the cache, which may not happen if there is sufficient memory in the system.
1) If you stop traffic on the system, and wait for a gc_cycle on the router, does it become possible to pass traffic again?
2) If the answer to 1) is yes, can this problem be avoided by:
a) increasing /proc/sys/net/ipv4/route/max_size
b) lowering /proc/sys/net/ipv4/route/gc_elasticity
c) lowering /proc/sys/net/ipv4/route/gc_interval
d) raising /proc/sys/net/ipv4/route/gc_thresh
That should instruct the garbage collector for the route cache to be much more aggressive in its collection.
Bear in mind that I'm not suggesting this isn't a bug; I'm just trying to get straight whether this is a true leak, or simply a set of conditions in which the garbage collector isn't as aggressive as it needs to be. It's fairly clear in ip_route_input_slow that when we go to allocate a dst, if dst_alloc fails (which it will if the number of route cache entries is greater than gc_thresh), we will not route the frame. So I think we likely just need to make sure that if dst_alloc is going to fail, we get much more aggressive in our garbage collection.
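For reference, the tunables listed above can be adjusted from the shell. The values below are illustrative starting points for experimentation, not tested recommendations (a sketch, assuming stock 2.6.18-era defaults):

```shell
# Make route-cache garbage collection more aggressive (run as root).
# All values here are placeholders -- tune against your own traffic levels.
sysctl -w net.ipv4.route.max_size=65536      # a) allow more cached routes
sysctl -w net.ipv4.route.gc_elasticity=2     # b) stricter per-bucket elasticity
sysctl -w net.ipv4.route.gc_interval=30      # c) run the GC more often (seconds)
sysctl -w net.ipv4.route.gc_thresh=4096      # d) adjust the GC start threshold

# Equivalent /proc interface, as used elsewhere in this report:
echo 65536 > /proc/sys/net/ipv4/route/max_size
```

These take effect immediately and do not survive a reboot; persistent values would go in /etc/sysctl.conf.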
The value of /proc/sys/net/ipv4/route/gc_interval is 60
I performed the following test:
(all times are in minutes)
00: Sent packet storm
01: Ceased all network traffic
14: observed the count of active routes drop to zero
14: value of ip_dst_cache is 32767
14: confirmed server is still offline by sending 1 ping
19: rt_cache count is 2 (probably from ping at time 14)
19: value of ip_dst_cache is 32768
19: confirmed server is still offline by sending 1 ping
28: value of rt_cache is 0
28: value of ip_dst_cache is 32768
28: confirmed server is still offline by sending 1 ping
I'll also attach a graph of our live router values showing the history for the last few days.
Created attachment 333083 [details]
rt_cache values over an extended time period
The blips on Mon, Tue and Wed are reboots. The blue line tracks the value of ip_dst_cache, the green area tracks the value of `ip -o route ls cache | wc -l`.
When I learned that the issue was caused by the "REJECT" route, I removed the route on Thursday. Since then the value of ip_dst_cache has remained constant.
What value are you looking at when you say that ip_dst_cache is 32767? Is it the first column or the second? I would not be surprised if the second column stays high; in fact it should, until there is a good deal of memory pressure on the system and the cache needs to shrink. If that's the first column, on the other hand, then yes, that seems to be an issue, especially if the route cache entry count goes to 2. That would suggest that you do in fact have a leak somewhere.
What would be really helpful would be if you could capture and submit a vmcore, so that I could look through the route cache by hand. Also, when you sent those pings through the system, did you continue to get the dst cache overflow messages? That would be of great interest. If you didn't, then you hit an existing route in the table and the frame should have been processed further, which in turn suggests this might not be a routing problem. It would also be good if, after you performed the above test, you captured /proc/net/snmp. That would give us a better idea of where you were dropping frames.
1) details about how you are tracking ip_dst_cache size, so that we can confirm that active objects are or are not being reclaimed.
2) a vmcore if possible, so that I can look through the kernel memory image by hand, and get some idea of where these errant dst entries are living.
3) a capture of /proc/net/snmp after the test in comment #14 is performed, so that we can get a better idea of where these frames are getting dropped.
Also, how are you generating your various input frames? Are you using a hardware solution like an Ixia or SmartBits box, or are you doing it in software? I'm just trying to get an idea of the volume and variation of traffic I need for a reproducer here. Thanks.
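For item 1), the relevant numbers are the first two columns of the ip_dst_cache line in /proc/slabinfo: active objects versus total allocated objects in the 2.6 slabinfo layout. A minimal way to extract them, demonstrated here on a sample line shaped like the values in this report (on the router itself you would read /proc/slabinfo directly):

```shell
# 2.6 slabinfo layout: name  active_objs  num_objs  objsize ...
# On a live router: awk '/^ip_dst_cache /{print $2, $3}' /proc/slabinfo
# Sample line below uses the counts reported in this bug, other fields illustrative:
sample='ip_dst_cache 32767 32768 384 10 1 : tunables 54 27 8 : slabdata 3277 3277 0'
active=$(echo "$sample" | awk '{print $2}')
total=$(echo "$sample" | awk '{print $3}')
echo "active=$active total=$total"
```

Comparing the `active` figure against `ip -o route ls cache | wc -l` over time is what distinguishes a true leak from a slab cache that simply has not been shrunk yet.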
If you look at the report I sent to oCERT (it's attached to this bug), you will find details on the setup I'm using to test as well as my test scripts.
The packet generator is the pktgen kernel module configured to send randomized src IPs to random destination IPs inside the "REJECT"'d route.
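The pktgen /proc interface described above can be driven with a few echo commands. This is a sketch only; the device name, IP ranges, and packet count are placeholders, not the exact values from the attached oCERT script:

```shell
# Flood randomized-source packets at a prefix covered by the REJECT route.
# eth0, the 10.1/16 source range, and the 10.99/16 destination range are
# assumptions -- substitute the interfaces and prefixes of your test setup.
modprobe pktgen

echo "add_device eth0"       > /proc/net/pktgen/kpktgend_0
echo "count 100000"          > /proc/net/pktgen/eth0
echo "flag IPSRC_RND"        > /proc/net/pktgen/eth0   # randomize source IPs
echo "src_min 10.1.0.1"      > /proc/net/pktgen/eth0
echo "src_max 10.1.255.254"  > /proc/net/pktgen/eth0
echo "dst_min 10.99.0.1"     > /proc/net/pktgen/eth0   # inside the REJECT'd route
echo "dst_max 10.99.255.254" > /proc/net/pktgen/eth0

echo "start" > /proc/net/pktgen/pgctrl
```

This must run as root on the sending machine; the REJECT route and default route live on the router under test.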
I'm looking at the first value of ip_dst_cache.
When I ping the server and it doesn't respond, I do see a 'dst cache overflow' message, precisely one console message for each packet received, no matter the source or destination IP.
I'll perform the test again and obtain a vmcore and dump of /proc/net/snmp and post it here once it's done.
OK, thanks. It'll take me a few days to get the systems together to set up a reproducer, but I think that's going to be the best way to figure out exactly what's going on here.
Created attachment 333204 [details]
stap script to help diagnose potential router leak
Hey there, while I'm getting the systems together to recreate this, would you mind running this systemtap script on your router and providing me with the output? It would help me diagnose the problem, I think. Note it will fill up your console with messages and may reduce your performance a bit. Thanks!
Created attachment 333266 [details]
Stdout of stap script
Executed the stap script as requested, split the stdout and stderr into two files. stap.1.gz is stdout, stap.2.gz is stderr.
Created attachment 333267 [details]
Stderr of stap script
So, I've got my reproduction environment set up here, and I've got some bad news of sorts: it's working just fine. I'm using RHEL 5.3 with the -128.el5 kernel and your script from your oCERT submission. I'm repeatedly sending 100000 packets over and over (about every 10 seconds), and when I check the active objects in ip_dst_cache against the number of cached routes, they always track fairly closely against both each other and the number of packets I'm sending (as I would expect). Then when I discontinue packet sends to the unreachable route and let the router quiesce, both the slab cache and the route cache quickly shrink again, also as I would expect. So it would seem that there is some more subtle nuance in your system that is triggering this issue. Can you send me a sysreport of your system? I'd like to compare your tunables to mine to see if anything is off between our setups.
Created attachment 333557 [details]
sosreport with kernel 2.6.18-128.el5
I got a hold of kernel-2.6.18-128.el5 and I can still reproduce the bug with it.
Thank you. Was this taken on the system after the problem had occurred, or prior to it? It looks like it was taken prior, but I'd like to be sure. When the problem does happen, do you see any stats change in /proc/net/snmp, netstat, or ifconfig?
The sosreport was taken after the issue appeared.
Created attachment 333635 [details]
sosreport before issue (kernel 2.6.18-128)
sosreport immediately after a reboot.
Created attachment 333636 [details]
sosreport after issue (kernel 2.6.18-128)
sosreport taken after 'dst cache overflow' messages appear. `ip route ls cache | wc -l` has returned to zero and ip_dst_cache is 32767. At the time this report was taken, any network packet received by the system produced a 'dst cache overflow' message on the console.
Note to self: found something interesting, the problem seems to hinge on the addition of a default route. I've now managed to reproduce the problem, and I can only do it if I add a default route to the route cache (as per the oCERT docs). Previously I had forgotten to add a default route, and everything worked like a charm: slab cache and route cache grew and shrank as you would expect. But as soon as I added a default route, the cache filled up and overflowed. The route cache shrinks again, and it appears the slab cache is shrinking as well, although I need to monitor it to see if it returns to its expected size in correlation with the route cache.
It's interesting to note that even after the route cache is back to a steady-state size for my environment (between 2 and 10 routes), I don't see dst cache overflow messages, but I do see Neighbour table overflow messages, indicating something has gone wrong with the arp table as well (perhaps the overuse of src macs with multiple IP sources via pktgen, I'm not sure).
I'm getting the feeling that we're not looking up the reject route properly when we have a default gw route (it's as if the default gw is at a higher priority or something). I'll try a few tests and update again this afternoon.
Note to self: Just tried to reproduce with 2.6.18-8.el5 and was unable to, so this problem was introduced sometime during one of the RHEL update cycles. I'm going over the changelog now to see if anything stands out
Created attachment 333786 [details]
patch to revert a previous leak fix
Hey, would you please build a kernel with this patch? I'm going through our RHEL5 changelog and found this patch. I'm not sure about it, but I think it's keeping my system up here (it reverts a fix for another leak, so you'll still see some leaked entries, which is why I'm not sure of it). It's not a final fix, since it does revert a previous change that fixed a leak, but I'd like to confirm that it does something for you too. Thanks!
(In reply to comment #31)
> Created an attachment (id=333786) [details]
> patch to revert a previous leak fix
> Hey, would you please build a kernel with this patch. I'm going through our
> RHEL5 changelog and found this patch, I'm not sure, but I think its keeping my
> system up here (it reverts a fix for another leak, so you'll still seem some
> leaked entries, which is why I'm not sure of it), but I seem able to keep my
> system up with this patch here. Its not a final fix, since it does revert a
> previous change that fixed a leak, but I'd like to confirm that it does
> something for you too. Thanks!
I'm building the kernel with this patch for Hector. I will post the rpm soon.
(In reply to comment #32)
> (In reply to comment #31)
> > Created an attachment (id=333786) [details] [details]
> > patch to revert a previous leak fix
> > Hey, would you please build a kernel with this patch. I'm going through our
> > RHEL5 changelog and found this patch, I'm not sure, but I think its keeping my
> > system up here (it reverts a fix for another leak, so you'll still seem some
> > leaked entries, which is why I'm not sure of it), but I seem able to keep my
> > system up with this patch here. Its not a final fix, since it does revert a
> > previous change that fixed a leak, but I'd like to confirm that it does
> > something for you too. Thanks!
> I'm building the kernel with this patch for Hector. I will post the rpm soon.
Hector, you can download them here: http://people.redhat.com/eteo/485163/
Thank you, I was still getting my build environment setup when I saw the posting by Eugene.
I downloaded the kernel-2.6.18-131.el5.bz485163.rpm (and kernel-devel too) and installed both. I rebooted into the new kernel, then followed my usual testing process.
At time 0, I flooded the server until I received dst cache overflow messages. After a few seconds the values of rt_cache and ip_dst_cache settled at 32767.
After 5 minutes, both rt_cache and ip_dst_cache dropped to 32766.
After another 8 minutes, rt_cache dropped to zero and ip_dst_cache remained at 32766. A ping test at this time showed that the server is unreachable from the network. Each ping packet caused one dst cache overflow message.
Would you like sosreports for this kernel test?
No, thank you, it's not going to show me anything. I can't understand why my cache is leaking so much less than yours.
By the way, when I reproduce this, I occasionally get Neighbour table overflow indications rather than dst cache overflow notifications (which, odd as it may seem, are both attributable to route cache overflows). Do you see those, or are yours strictly dst cache overflow messages?
I only see 'dst cache overflow' messages. I checked /var/log/messages but I found nothing else.
OK, making slow progress here.
I'm able to observe the leak with a single packet. If I send in a single packet using the pktflood script from the oCERT document, and then flush the cache via /proc/sys/net/ipv4/route/flush, the route cache is cleaned but a single slab cache entry remains active.
I've instrumented the kernel and found that on flush, both of the route cache entries that were added (one from the host to the unreachable network via lo, and another from the local router interface to the sending system) are removed. That tells me we're dealing with an out-of-whack refcount when entering rt_intern_hash. That's good progress. I'll report more soon.
Neil is able to reproduce the leak, in single packet quantities using the setup described in the oCERT doc. It requires exactly what the setup says, an unreachable route and a default route.
It is definitely a remote DoS, as anyone with the described configuration can have the router's ability to forward traffic terminated. It is possible to work around this issue until it is fixed, using either the loopback route mentioned in the ticket or an iptables-based solution.
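A sketch of the two workarounds just mentioned. The 10.99.0.0/16 prefix is a placeholder, and these exact commands are my illustration of the approach, not commands taken from the report:

```shell
# Workaround 1: replace the REJECT/unreachable route with a route to the
# loopback device, so the leaking ICMP-unreachable path is never exercised.
ip route del unreachable 10.99.0.0/16
ip route add 10.99.0.0/16 dev lo

# Workaround 2: the iptables-based alternative -- drop the unreachable
# route entirely and reject the prefix in netfilter instead, which keeps
# the same observable behaviour for senders.
iptables -A FORWARD -d 10.99.0.0/16 -j REJECT --reject-with icmp-host-unreachable
```

Both require root on the router and should be applied against whatever prefix the vulnerable REJECT route actually covers.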
Am I correct to assume that this issue is not present in recent kernels? I'm trying to understand whether it deserves a ping to the kernel maintainers as well (they would likely ignore it if it affects only 2.6.18, which is quite old).
Other than that, while the full impact discussion might not be public, the bug itself is. Considering that there is a workaround, I think this bug deserves to be opened (and if we feel it necessary, a preliminary advisory could be released, since there is a workaround), hoping of course that a patch will be available asap.
I'm almost certain I've found the problem. It's fixed in upstream commit 7c0ecc4c4f8fd90988aab8a95297b9c0038b6160. The problem is that we fail to dst_release the result of a route lookup when sending the icmp host unreachable response for frames directed along the unreachable route. I'm building a test kernel for it now that we can all use to verify it.
Created attachment 334377 [details]
patch to fix the dst_entry leak
I've confirmed it, the backport patch I'm attaching here solves the leak. I've got an x86_64 build if you would like to play with it:
I'll post this for inclusion Monday afternoon. Please test the kernel and make sure it solves the problem for you as well. If I don't hear from you by Monday afternoon, I'll go ahead and post.
(In reply to comment #42)
> Created an attachment (id=334377) [details]
> patch to fix the dst_entry leak
> I've confirmed it, the backport patch I'm attaching here solves the leak. I've
> got an x86_64 build if you would like to play with it:
> I'll post this for inclusion monday afternoon. Please test the kernel and make
> sure if solves the problem for you as well. If I don't hear from you by monday
> afternoon, I'll go ahead and post.
Or if you prefer the 32-bit rpms:
Please test the kernel to ensure that the issue is resolved.
The issue does not appear in kernel-2.6.18-134.el5.bz485163b.i686.rpm. Thank you all. Andrea, I'll leave it up to oCERT and the Red Hat security team to determine if and how this issue is to be disclosed.
(In reply to comment #48)
> The issue does not appear in kernel-2.6.18-134.el5.bz485163b.i686.rpm Thank
> you all. Andrea, I'll leave it up to oCERT and the Red Hat security team to
> determine if and how this issue is to be disclosed.
Thanks from oCERT too Hector.
Eugene, do you feel oCERT should release an advisory about this? Anyone else we need to reach out/investigate if affected?
(In reply to comment #52)
> Thanks from oCERT too Hector.
> Eugene, do you feel oCERT should release an advisory about this? Anyone else we
> need to reach out/investigate if affected?
Andrea, if you want to, it's fine with us. But I will send out a note in oss-security@ anyway with or without the advisory. I am not sure who else is affected.
CVSS2 score of high, 7.1 (AV:N/AC:M/Au:N/C:N/I:N/A:C)
This issue has been addressed in the following products:
Red Hat Enterprise Linux 5
Via RHSA-2009:0326 https://rhn.redhat.com/errata/RHSA-2009-0326.html
This issue has been addressed in the following products:
Red Hat Enterprise Linux 5.2 Z Stream
Via RHSA-2010:0079 https://rhn.redhat.com/errata/RHSA-2010-0079.html