159181 – kernel: dst cache overflow causes network loss then panic

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	David Miller
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-05-31 04:01 UTC by Trevor Cordes
Modified:	2007-11-30 22:11 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-01-19 07:18:21 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
text transcribed from the panic screen (849 bytes, text/plain) 2005-06-03 03:58 UTC, Trevor Cordes	no flags
cricket graph of ip_dst_cache value from slabinfo (7.08 KB, image/png) 2005-06-22 17:59 UTC, Trevor Cordes	no flags
the other side of the ipsec tunnel (7.18 KB, image/png) 2005-06-22 18:03 UTC, Trevor Cordes	no flags
readings from the use column of /proc/net/rt_cache (1.47 KB, text/plain) 2005-06-24 09:52 UTC, Trevor Cordes	no flags
cricket graph of ip_dst_cache using 1376 kernel (7.59 KB, image/png) 2005-09-19 07:14 UTC, Trevor Cordes	no flags

Description Trevor Cordes 2005-05-31 04:01:00 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
A firewall box I manage was dropping 90% of packets today.  I managed to issue a reboot over ssh.  It never came back up.  I went onsite and found a panic screen (I'll attach the panic output).

The only weird things in the logs are lots of "kernel: dst cache overflow" starting about 4 days ago and occurring a few times every 5 minutes or so.  They seem to coincide with a watchdog script of mine (the cache error occurs 1 second after my watchdog starts nmap).  I have a script that runs "nmap -S 192.168.100.1 -sP -PE 192.168.100.2 192.168.100.100-254" every few minutes to make sure that other (XP) machines on the LAN are reachable.  I know that causes a lot of conntrack entries to be made.  I'm not sure if this is related.

This bug appears to be similar to bug 149427, bug 138040 and bug 64472.  My kernel is not tainted with non-FC modules.  However, the kernel I run has had one extra patch applied: the patch that fixes NAT over 2.6 native ipsec (bug 143374).  Other than that it is a pure stock kernel.  I can't easily "just upgrade to the latest kernel" as the patch doesn't apply to 2.6.11 yet.


Version-Release number of selected component (if applicable):
kernel-2.6.10-1.766_FC3

How reproducible:
Didn't try


Additional info:

Comment 1 Trevor Cordes 2005-05-31 04:03:08 UTC

Oh, I should also say that this box and its sister (other side of the ipsec VPN)
have been working 100% fine with this kernel and setup for many months now. 
This is the first time this has happened.

Comment 2 Warren Togami 2005-05-31 04:05:32 UTC

- You didn't mention which kernel driver.
- You should retest it with the latest FC3 kernel update, and try to replicate
the conditions that caused the panic.

Comment 3 Trevor Cordes 2005-05-31 05:28:11 UTC

How do I tell which "kernel driver"?  I have the panic text I handwrote on paper
I will transcribe shortly, if that helps.

I'll eventually get the latest FC3 kernel in once I can find the patch I need in
a version that works on 2.6.11, or when the kernel guys get to putting the patch
in the mainstream.  My boxes cannot function without the patch as I *need* NAT
through ipsec.

The box (and many others that are almost identical) never once showed this
behaviour before.  I think it's a mem leak related thing since it seems to have
occurred only once the uptime got very long and the box was all of a sudden put
under relatively high network ipsec loads.

Comment 4 Warren Togami 2005-05-31 05:55:23 UTC

If it is handwritten, then it is likely missing most of the beginning of the
panic dump and won't be useful at all.

It is unlikely that RH can help you with your problem because it is extremely
rare, and this is not nearly enough information to possibly diagnose the
problem.  You should go to upstream kernel.org mailing lists and bugzilla for help.

Comment 5 David Miller 2005-06-01 19:05:08 UTC

If you are applying the huge NAT patch I think you are, you are asking
a lot from us to debug this with that patch applied.  Please tell us exactly
what patch you have applied on top of the stock kernel tree.

If it's huge and invasive, you're going to be on your own, sorry.

Comment 6 Jim Phillips 2005-06-02 16:30:29 UTC

We have recently started to see this same problem here.  Our firewall box is
experiencing the same troubles described above, but we are running a stock
kernel 2.6.10-1.770_FC3 (no patches of any kind).  

I googled this problem and found that it could be related to
net.ipv4.route.max_size which on our firewall was set to 2048 by default.  We
are using Quagga/OSPF to do dynamic routing, and at last check the slabinfo shows:
ip_dst_cache         825    825    256   15    1 : tunables  120   60    0 : slabdata     55     55      0
This is the peak that it has reached since I've been watching it, but it has
grown from less than 400 when I first looked at it a few hours ago.

I wasn't around the last time the firewall crashed so I couldn't verify this at
the time, but I'm theorizing that we reached a point where ip_dst_cache reached
2048 and couldn't proceed.  I've increased max_size to 32752 to hopefully
prolong the life of the firewall, but if this is indicative of a leak of some
sort, eventually we'll get to a point where it crashes again.

Comment 7 Jim Phillips 2005-06-02 17:59:50 UTC

I have continued googling this problem and have found that there is likely a
kernel bug that causes this problem.  See the thread: 

http://lkml.org/lkml/2005/1/21/141

Towards the end of the thread they do actually post a patch:

http://lkml.org/lkml/2005/1/30/87

Since the thread indicates that the problem is present in 2.6.11-rc1, I'm
willing to bet it's also present in 2.6.10-1.770_FC3 but I haven't yet confirmed
this.

I have confirmed that the patch posted is present in the 2.6.11-1.27_FC3 kernel
now available on fedora updates.  I am about to apply this update to our
firewall to find out if it solves our problem.

Comment 8 Trevor Cordes 2005-06-03 03:48:08 UTC

Re: comment #5: I'm not sure how to specify what patch I'm running.  It's from
the lartc/netfilter mailing lists and the patchfiles are around 900 lines over 4
patch files that modify a couple of dozen source files like netfilter.c,
ip_forward.c, etc.  AFAIK it's the only patch that enables NAT on native ipsec.
I'm not sure if the problem has something to do with the patch.  I'd give it a
50/50 chance.  I can't run without this patch as I need NAT over ipsec.  I was
(up until yesterday) under the impression that this patch was going to get put
in the mainstream kernel, but new talk in bug 143374 indicates that this
probably won't be the case.

Re: comment #6: net.ipv4.route.max_size appears to default to a value based on
your RAM size.  This could be a good reason for why only one (of 3) sets of boxes I
administer with this setup has had the issue so far: it's the one with only
256MB of RAM and a default max_size of 8k (the others are 16k).

How do you check what the current value is?  You mention "the slabinfo" but I
can't figure out what you mean -- it doesn't appear to be an installed command.
 Let me know and I'll watch the values on the various boxes I administer (with
and without the NAT patch) to see if it is growing over time.

Comment 9 Trevor Cordes 2005-06-03 03:58:09 UTC

Created attachment 115125
text transcribed from the panic screen

Hope this is somewhat helpful.  I believe the top part had NOT scrolled off the
screen yet so this should be the entire text of the panic.  This panic occurred
during the execution of the "reboot" command and occurred right after it
started "unmounting file systems".

Comment 10 Trevor Cordes 2005-06-03 04:01:34 UTC

We are doing only static routing.  My googling revealed the same links you found
which surprised me since our routing tables really don't change much at all
AFAIK except for the odd time interfaces go down, etc.

If the latest FC3 update does fix the issue, let us know.  Then I'll try to do
the NAT patch on 2.6.11 and see if this doesn't happen again.

Comment 11 Jim Phillips 2005-06-03 11:40:59 UTC

To check the current size of the ip_dst_cache, you can grep ip_dst_cache from
/proc/slabinfo.  The current value is the first number.  I'm not 100% sure what
the second number is, but it appears to be the peak of the first value since the
most recent flush (or something along those lines).
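
For reference, the header of a version 2.x /proc/slabinfo names the first two
numeric columns active_objs and num_objs, so the second number is the total
number of objects currently allocated in the cache (in use or not) rather than
a peak.  A minimal Python sketch of pulling those counters out of the line
quoted in comment 6 follows.  The function names are hypothetical, the 2048
limit is the max_size value reported in comment 6, and treating active_objs as
a stand-in for the number of cached routes is an assumption.

```python
def parse_slabinfo_line(line):
    """Split one /proc/slabinfo (version 2.x) row into named counters."""
    fields = line.split()
    return {
        "name": fields[0],
        "active_objs": int(fields[1]),  # objects currently in use
        "num_objs": int(fields[2]),     # objects allocated, in use or not
        "objsize": int(fields[3]),      # size of one object in bytes
    }


def find_cache(slabinfo_text, cache_name="ip_dst_cache"):
    """Return the parsed row for cache_name, or None if it is absent."""
    for line in slabinfo_text.splitlines():
        if line.split()[:1] == [cache_name]:
            return parse_slabinfo_line(line)
    return None


# The row quoted in comment 6, joined back onto one line.
SAMPLE = (
    "ip_dst_cache         825    825    256   15    1 : "
    "tunables  120   60    0 : slabdata     55     55      0"
)

entry = find_cache(SAMPLE)
max_size = 2048  # net.ipv4.route.max_size as reported in comment 6
fill_ratio = entry["active_objs"] / max_size
print(entry["active_objs"], entry["num_objs"], round(fill_ratio, 3))
```

On a live system the same functions could be fed the contents of
/proc/slabinfo, which may require root to read.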

My digging has also indicated that ip_dst_cache is not directly tied to the
number of routes you see in a typical ip route list, but rather "ip route list
cache" which includes the routes the system creates for each host it knows
about.  In other words, every packet that comes through the system has a source
and destination.  Rather than calculating the route for each new packet, the
system caches the calculated route for each source/destination pair.  Thus, if
you have a lot of traffic going through your router from a lot of different
hosts to a lot of different hosts, you can have a very large cache.
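
As a rough illustration of that point, the sketch below counts distinct
source/destination pairs for the nmap sweep described in the original report.
It is a toy model only: keying the cache purely on (source, destination) is a
simplification assumed here, and the function name is hypothetical.

```python
def cache_keys(packets):
    """Return the distinct (source, destination) pairs seen in packets."""
    return {(src, dst) for src, dst in packets}


# Targets of the watchdog sweep from the report:
# 192.168.100.2 plus 192.168.100.100-254, probed from 192.168.100.1.
targets = ["192.168.100.2"] + [f"192.168.100.{host}" for host in range(100, 255)]
sweep = [("192.168.100.1", dst) for dst in targets]

# Repeating the sweep sends more packets but adds no new pairs.
three_sweeps = sweep * 3

print(len(three_sweeps), len(cache_keys(three_sweeps)))
```

Under this model each sweep touches 156 pairs no matter how often it runs, so
in a healthy cache the sweep alone should produce a bounded number of entries.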

By the way, since applying the updated kernel yesterday around 3pm, there have
been no new crashes and the current ip_dst_cache is down around where it should
be for 7:30 in the morning.  Of course, now I'm getting "eth0: Too much work in
interrupt, status 8401." in my syslog, which may or may not be related.  I've got
some more googling to do, it appears.

Comment 12 Trevor Cordes 2005-06-05 01:07:09 UTC

I've added ip_dst_cache to my cricket monitoring and will report back here in a
few days once I have some good visual idea of how that parameter behaves over
time.  I'll be able to directly compare boxes I administer that both have and
don't have the ipsec nat patch.

Comment 13 Trevor Cordes 2005-06-22 17:59:51 UTC

Created attachment 115824
cricket graph of ip_dst_cache value from slabinfo

(The graph takes a snapshot value every 5 mins and so does not record transient
spikes unless it is by chance.)

I had a different pair of ipsec machines go mental in what I believe
was the same way.  I don't think there was a panic this time though as it was
rebooted fairly early once the symptoms started.  However, this time I had my
customized cricket grapher running and the attached shows the machine I think
went mental first.  The sharp drop off Wed morning is when the symptoms started
and the blank area is when I rebooted; it took me a while to realize I had to
recompile cricket's data files before it started graphing again.  The dropoff
on Sunday is after another reboot.

It is obvious that this value seems to grow without bounds over time.  However,
it is hard to see how this relates to /proc/sys/net/ipv4/route/max_size because
on this box that value is 16384 and this ip_dst_cache value never gets close to
that.

I will keep an eye on it to see how it behaves when it reaches the high values
again.

Since I really have no idea what these values represent I will leave it to the
experts to interpret.

Comment 14 Trevor Cordes 2005-06-22 18:03:15 UTC

Created attachment 115826
the other side of the ipsec tunnel

I'm not positive that last graph's machine is the one that "went mental" first
so I am here including the graph from the other machine in the 2-machine ipsec
VPN.  In fact, from what the end-user described, this may very well be the
machine that went nuts first.  The reboot times were within a few hours between
these machines.

Comment 15 Adam Thompson 2005-06-24 03:20:53 UTC

Trevor, if these are all connected to Shaw cablemodems and are subject to the
regular DDoS / portscan traffic we all get :-(  then ip_dst *will* in fact grow
without bound until some sort of garbage collection happens.  I don't know
enough about how Linux does that to predict how/when GC happens.  My first guess
would be when max_size is reached...

The DST stuff is very similar to Cisco Express Forwarding, most modern IP stacks
have something along these lines nowadays.  It can't be a pure ip_dst problem,
otherwise thousands of linux-based hosts worldwide would be crashing on a
regular basis.

My only suggestion would be to work around the problem, by preventing the dst
table from filling up in the first place.  Limit as strictly as possible the
number of IP src/dst pairs the routing code ever gets to see, possibly going as
far as using ebtables as well as iptables, if the host doesn't expose any public
services to the world.

Comment 16 Adam Thompson 2005-06-24 04:31:28 UTC

Looking at net/ipv4/route.c, it occurs to me that rt_check_expire never actually
calls rt_garbage_collect, which appears to be responsible for cleaning up
ip_dst_cache... in fact, I can't see anywhere that GC occurs except in the
middle of the in_ and out_ paths.

So, one suggestion would be: instead of INCREASING the max_size parameter,
DECREASE it to force more aggressive GC to happen, and see if that prevents
overflows.  I'm guessing there should be an inflection point (i.e. ip_dst_cache
size should have a nonlinear response to net.ipv4.route.max_size somewhere
around the actual *legitimate* working set size) but where that point is will be
highly dependent on the specific traffic patterns each system handles.

The other piece of information that would be nice to capture in concert with the
current slabinfo for ip_dst_cache would be a) the size of, and b) the contents
of /proc/net/rt_cache.  (There are also some related statistics in
/proc/net/stat/rt_cache which should correlate perfectly with the size of
ip_dst_cache, if I understand the routing code correctly.)

Trevor: I'm assuming this is Dr. Nick's office you're talking about... the total
set of valid internal IP addresses should be very small, on the order of 20 or
so (?), I would try setting max_size to something ridiculously small like 64 or
32.  Beware, however, that if it gets too small you'll probably lose
connectivity entirely - you may need to experiment onsite, or at least make
changes dynamically with a scheduled reboot pending in ~10min to recover.  Also
consider that you need a large enough rt_cache to hold all the "local" entries
in addition to remote hosts.

I would particularly pay attention to the "use" column of /proc/net/rt_cache -
what sort of statistical distribution are you seeing?  In a perfect world, you
should have significant y-axis clustering somewhere near the mean/median of the
data set with a handful of significant outliers.  If you're seeing clustering
near the median but NOT near the arithmetic mean of the Use column, your
max_size parameter is too *big*, not too small.

I suspect you'll find your nmap script is causing a large number of entries with
low Use counts (0..10), whereas IP addresses corresponding to real workstations
will see a secondary concentration of Use values that increase monotonically
with firewall uptime.
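
One way to get at the distribution asked about above is sketched here in
Python.  It assumes /proc/net/rt_cache is a whitespace-separated table whose
header row contains a column named "Use" holding decimal counts.  The sample
rows are made up for illustration and are not taken from any attachment, and
the function names and the threshold of 10 are hypothetical.

```python
import statistics


def use_counts(rt_cache_text):
    """Extract the Use column, locating it by its header name."""
    lines = [ln for ln in rt_cache_text.splitlines() if ln.strip()]
    column = lines[0].split().index("Use")
    return [int(ln.split()[column]) for ln in lines[1:]]


def summarize(counts, low_threshold=10):
    """Report the mean, the median and how many entries are rarely used."""
    return {
        "entries": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "low_use": sum(1 for c in counts if c <= low_threshold),
    }


# Hypothetical, trimmed-down sample: three probe-like entries and one busy host.
SAMPLE = """Iface Destination Gateway Flags RefCnt Use Metric Source
eth0 0264A8C0 00000000 0 0 1 0 0164A8C0
eth0 6464A8C0 00000000 0 0 0 0 0164A8C0
eth0 6564A8C0 00000000 0 0 2 0 0164A8C0
eth1 0A00000A 00000000 0 1 950 0 0164A8C0
"""

summary = summarize(use_counts(SAMPLE))
print(summary)
```

A median far below the mean, as in this made-up sample, is the pattern
described above: many low-use entries plus a few heavily used ones.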

I agree that there seems to be a kernel bug of some sort involved; I'm focused
more on characterizing the problem and understanding how to work around it than
fixing it.

Comment 17 Trevor Cordes 2005-06-24 09:52:27 UTC

Created attachment 115921
readings from the use column of /proc/net/rt_cache

As per the "use" column of /proc/net/rt_cache, see the attached for some
samples from some machines.

Comment 18 Trevor Cordes 2005-06-24 10:22:56 UTC

Re: Decreasing max_size: I can try that when next onsite and will report back. 
It may be hard to get a bead on what value to set it to to match the system
since the linecount of /proc/net/rt_cache varies quite dramatically (1000+ to
20) over a very short period of time.

I have programmed cricket to capture the line count of /proc/net/rt_cache and
we'll see how that varies over time after a day or two.  If you meant something
other than the line count, let me know.  I also am dumping the contents every 5
mins to files so if the bug symptom reoccurs we will have the contents near that
exact time.

Location: Yes, it's Dr.N's and other locations.  Simple small network with a 2
office native ipsec VPN.  About 15-22 machines at each location.  The other pair
that has had the bug symptoms is only 13/5 machines, but with half the RAM on
the server (so smaller max_size), and the 13 one was the first to show the symptoms.

Yes, the nmap script appears to cause a large number of entries with low counts.
 However, I have yet to see the symptoms at the 12+ locations I run the nmap
scripts with no VPN.  2 of the 3 pairs of locations I run a VPN on have shown
the symptoms, and the third is so low traffic that it's not a big surprise it
hasn't shown up yet.

The very strange thing is, the patched 2.6.10 kernel I am running ran fine since
Mar 1 through till the first problem around May 25.  Sure, I might have rebooted
the systems once in a while, but it's a good bet they ran at least 30-60 days in
some cases.  Also, prior to that I was running a patched 2.6.9 from Dec '04, and
that never showed a problem either.  It could be coincidence, or it could be
something introduced in 2.6.10 or even a non-kernel rpm update that triggered it.

Do you think the bug could be dependent on the ipsec NAT patch I am using as
some have suggested?  You would think the gc would be independent of that.

Comment 19 Trevor Cordes 2005-06-24 10:26:12 UTC

Comment #7: jphillps, anything to report after running patched for nearly a
month?  Is there anything you were/are doing that is "weird" like in my setup
(ipsec VPN, nmap, etc)?  I'm looking for commonalities to explain why you and I
see the bug but most of the linux population does not.  If we can find a common
"weirdness" then maybe we can have a better understanding.

Comment 20 Trevor Cordes 2005-06-24 10:38:31 UTC

Comment #15: One pair of machines was on cablemodem, the other on DSL PPPoE;
both see the usual noise and both have strict iptables blocking it all.  The
servers do expose smtp + http to the world.

What if the ip_dst_cache is a red herring in the sense that it's the only thing
that's complaining but has not much to do with the root cause?  As you said, if
it was a big deal, everyone's system would be flaking after a month or two.

Comment 21 Jim Phillips 2005-06-24 14:54:41 UTC

Since patching, we have not had a single case of the network failing on the
router.  I would have to say that the patch did succeed in doing what it needed
to do.  At least for us.  Keep in mind though, this is an internal router with
nothing fancy going on.  The only non-out-of-the-box thing we are doing is that
we are using Quagga/OSPF to keep our routing tables updated with our various
gateway servers (we have a large number of tunnels between us and our customers
and use OSPF to keep track of all the routes).

Comment 22 Dave Jones 2005-07-15 18:01:38 UTC

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 23 Trevor Cordes 2005-09-19 07:11:03 UTC

I've now been running 3 production pairs of systems for 1.5, 1 and .5 months
using 2.6.12-1.1376_FC3 with ipsec/nat patches, but otherwise a straight up FC3
kernel.  I have had zero issues or crashes with regards to this bug.  I think
this bug is fixed, though I would probably give it another 2-3 months before
declaring absolute victory considering that previously it took many months for
the bug to show.

Interestingly, the slabinfo/ip_dst_cache data I'm tracking/graphing with cricket
show a completely different pattern compared to the 2.6.10 kernel I was using
before.  If the bug was due to ip_dst_cache then this could be significant.

Comment 24 Trevor Cordes 2005-09-19 07:14:46 UTC

Created attachment 118959
cricket graph of ip_dst_cache using 1376 kernel

Here's the new graph with the new kernel.  This is from the same host as
attachment #115824 and makes a good comparison with it.  The main difference that I
think is important is the fact that after the peak, the troughs in the new
kernel all return to a low value.  In 2.6.10 the troughs would slowly creep up
higher and higher (and so would the peaks).
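
The difference described above could be checked mechanically on the sampled
values.  The following Python sketch is one possible test, not something taken
from the cricket setup: the function names are hypothetical and both sample
series are invented to mimic the two shapes.

```python
def troughs(samples):
    """Return local minima: points lower than both of their neighbours."""
    return [
        samples[i]
        for i in range(1, len(samples) - 1)
        if samples[i] < samples[i - 1] and samples[i] < samples[i + 1]
    ]


def troughs_creep_up(samples):
    """True when every trough is higher than the trough before it."""
    lows = troughs(samples)
    return len(lows) >= 2 and all(b > a for a, b in zip(lows, lows[1:]))


# Invented series: the first never drops back to its baseline, the second does.
creeping = [100, 900, 200, 1100, 350, 1300, 500, 1500]
returning = [100, 900, 120, 1000, 110, 950, 105, 980]

print(troughs_creep_up(creeping), troughs_creep_up(returning))
```

With samples taken only every 5 minutes, transient spikes can be missed, so a
check like this is better suited to the slow trend than to individual peaks.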

Comment 25 Dave Jones 2006-01-16 22:26:06 UTC

This is a mass-update to all currently open Fedora Core 3 kernel bugs.

Fedora Core 3 support has transitioned to the Fedora Legacy project.
Due to the limited resources of this project, typically only
updates for new security issues are released.

As this bug isn't security related, it has been migrated to a
Fedora Core 4 bug.  Please upgrade to this newer release, and
test if this bug is still present there.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

Thank you.

Comment 26 Trevor Cordes 2006-01-17 09:34:22 UTC

I have not seen this bug for many months now (I think... but we have some new
bugs now).  If Jim agrees then I think this should be marked as closed and fixed.

Comment 27 Jim Phillips 2006-01-17 15:51:11 UTC

I haven't seen this bug in quite a while either.  I think it's probably safe to
call it fixed.
