Bug 77775

Summary: (NET)Neighbour Table and Lost Packets
Product: [Retired] Red Hat Linux
Reporter: CJeness <cj>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 7.3
CC: davem
Target Milestone: ---   
Target Release: ---   
Hardware: i586   
OS: Linux   
URL: www.sforest.org
Doc Type: Bug Fix
Last Closed: 2003-06-09 05:33:13 UTC

Description CJeness 2002-11-13 13:34:14 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.3-emp_3 i686)

Description of problem:
After replacing RedHat 7.1 with RedHat 7.3 on our web server computer, we see
the following problems on a DAILY basis:
1.  Frequent "kernel-neighbour table overflow" messages.
2.  Large percentage of packet losses on ping after system has been running less
than 24 hours.

I am not sure if the two problems are related, but after our system has been
freshly rebooted and has run from 6-12 hours, we begin to see it slow down
especially in web-related activity.  If we perform a ping on a web site, then we
will typically see up to 50% packet loss.  If we restart the network (./network
restart), then fewer packets are lost; however, the only thing which seems to
solve the problem at least temporarily is rebooting.  If we reboot and then ping
the same web site, there will be no packet loss.
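
To track how the loss grows between reboots, the percentage can be pulled out of ping's summary line and logged at intervals. The following is a minimal sketch, assuming the summary format printed by common Linux ping implementations; the function name and the sample text are hypothetical and were not captured from the affected machine.

```python
import re

# Matches the summary line printed by common ping implementations, e.g.
# "10 packets transmitted, 5 received, 50% packet loss, time 9012ms".
# Older versions print "5 packets received", so "packets " is optional.
LOSS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received"
    r".*?(\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent) from ping output, or None."""
    match = LOSS_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


if __name__ == "__main__":
    # Hypothetical sample text, not output from the reporter's system.
    sample = "10 packets transmitted, 5 received, 50% packet loss, time 9012ms"
    print(parse_ping_summary(sample))
```

Running this from cron against a fixed host, and recording the result with a timestamp, would show whether the loss climbs gradually or appears suddenly.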

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.  Install RedHat 7.3 on an IBM 300 PL running Apache
2.  Wait about 6 hours
3.  Try pinging and observe the large packet loss
	

Actual Results:  Very slow performance

Expected Results:  Reasonable and consistent performance levels over time

Additional info:

Comment 1 Arjan van de Ven 2002-11-13 13:41:35 UTC
so which kernel are you running? I assume 2.4.18-17.7.x
What network card/driver are you using ?

Comment 2 CJeness 2002-11-13 14:34:21 UTC
The "uname -a" command returns the following:

2.4.18-17.7.x

I upgrade at least weekly.

We have two NICs in the computer.  Our modules.conf file shows the following:

alias eth1 eepro100
alias eth0 3c59x

eth0 is connected to a BellSouth ADSL modem.   This is the original ADSL which
used ethernet and DHCP.  eth1 connects to our internal lan.  I am using IPCHAINS
for masquerading.  In an attempt to resolve the slowdown issues, I have now
turned off all of my IPCHAINS rules except FORWARD MASQ.

Comment 3 Arjan van de Ven 2002-11-13 14:36:09 UTC
can you try using the e100 module instead ?
(replace eepro100 with e100 in modules.conf)

Comment 4 CJeness 2002-11-27 14:43:53 UTC
I have changed my driver from eepro100 to e100 as requested.  Apparently, I
missed the notification about this request; otherwise, I would have made the
change sooner.   I will provide an updated status after the system has run for a
day.

Comment 5 CJeness 2002-12-02 00:25:15 UTC
Changing the driver from eepro100 to e100 has not resolved the problem.  We
continue to see the neighbour table overflow error.  More importantly, this
error seems to coincide with a high level of packet loss which makes the
computer unusable for Internet activities.

Please keep in mind that this computer has been operating successfully under a
version of RedHat using the eepro100 driver since 5.2.  It was the install of
7.3 that triggered the problems.  The previous version was 7.1.  Therefore,
there is something different in the kernel or some component of the 7.3
distribution which is triggering the problem.  This problem is very serious
since the computer has to be rebooted about every 12 hours.

Comment 6 David Miller 2002-12-02 10:40:17 UTC
This message shows up when either of two things have happened:

1) The loopback device is misconfigured

2) The netmask on one of your interfaces is wrong

I am extremely confident in this statement, so please go
and double, no in fact triple check, the loopback interface configuration
and that of your other interfaces.

2.2.x kernels used to be very lenient on misconfiguration in this area;
2.4.x is not, and you absolutely must get this right.  That would explain
why 7.1 did not show the behavior and 7.3 does.

Comment 7 CJeness 2002-12-02 22:44:45 UTC
Below are the results from "ifconfig".  eth0 is set up by DHCP and I assume that
the netmask is correct.  The netmask for eth1 is correct.    I am not sure how
the "loopback" might be misconfigured since I have never done anything special
to configure it.  I assumed that it should be appropriately set up as part of
the installation process.  Can you clarify what I should check on the loopback?  

eth0      Link encap:Ethernet  HWaddr 00:10:4B:25:5D:97  
          inet addr:66.20.72.252  Bcast:66.20.75.255  Mask:255.255.252.0
          UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:58754 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32899 errors:0 dropped:0 overruns:0 carrier:0
          collisions:2 txqueuelen:100 
          RX bytes:58712841 (55.9 Mb)  TX bytes:5738838 (5.4 Mb)
          Interrupt:10 Base address:0x7c40 

eth1      Link encap:Ethernet  HWaddr 00:04:AC:1D:FF:13  
          inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:29034 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22556 errors:0 dropped:0 overruns:0 carrier:0
          collisions:37 txqueuelen:100 
          RX bytes:4542571 (4.3 Mb)  TX bytes:11461679 (10.9 Mb)
          Interrupt:11 Base address:0x7c20 Memory:f3eff000-f3eff038 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:462 errors:0 dropped:0 overruns:0 frame:0
          TX packets:462 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:35052 (34.2 Kb)  TX bytes:35052 (34.2 Kb)


Comment 8 CJeness 2002-12-11 00:20:45 UTC
It has now been more than one week since I asked for clarification as to how the
loopback interface could be misconfigured and what should be checked.  Any help
would be greatly appreciated.

Comment 9 Arjan van de Ven 2002-12-11 08:56:29 UTC
so are you 
1) 100% sure your netmask and broadcast address of eth0 are correct
2) 100% sure you didn't firewall off the loopback device ?
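
One way to check that an address, netmask and broadcast address are at least self-consistent is to derive the network and broadcast from the first two and compare with what ifconfig reports. This is a minimal sketch using Python's standard ipaddress module; the function names are hypothetical, and the values are the ones shown in the ifconfig output in Comment 7. It cannot tell whether the mask matches what other hosts on the subnet use, only whether the reported broadcast agrees with the reported mask.

```python
import ipaddress


def describe_interface(addr, netmask):
    """Return (network, broadcast) implied by an IPv4 address and netmask."""
    iface = ipaddress.IPv4Interface(f"{addr}/{netmask}")
    return iface.network, iface.network.broadcast_address


def broadcast_is_consistent(addr, netmask, reported_broadcast):
    """True if the reported broadcast equals the one implied by the mask."""
    _network, broadcast = describe_interface(addr, netmask)
    return broadcast == ipaddress.IPv4Address(reported_broadcast)


if __name__ == "__main__":
    # Values taken from the ifconfig output in this report.
    for addr, mask, bcast in [
        ("66.20.72.252", "255.255.252.0", "66.20.75.255"),   # eth0
        ("192.168.1.14", "255.255.255.0", "192.168.1.255"),  # eth1
    ]:
        network, _ = describe_interface(addr, mask)
        print(addr, network, broadcast_is_consistent(addr, mask, bcast))
```

For both interfaces the reported broadcast agrees with the reported mask (eth0 is in 66.20.72.0/22, eth1 in 192.168.1.0/24), so any mismatch would have to be against what the ISP's other hosts actually use.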

Comment 10 David Miller 2002-12-11 08:57:38 UTC
Show us your routing table as well.

Also, do you think the behavior occurs when dhcp renegs eth0's address?
That might be a clue.


Comment 11 David Miller 2002-12-11 09:03:00 UTC
Also when I say "netmask is correct", I mean does it match what
other systems on that subnet are using.

Having this not match is what causes neighbour table overflow
messages.

You say eth1 is correct, fine, but go and make sure eth0 is getting
something legitimate.  Probably, when these messages are being printed,
the contents of /proc/net/arp is full of bogus ARP entries because the
netmask is incorrect.

Next time it triggers, capture /proc/net/arp and attach it to this
bug report.  Thanks.
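
A capture of /proc/net/arp is easier to judge when the entries are counted per device. The sketch below assumes the six-column layout that /proc/net/arp uses (IP address, HW type, Flags, HW address, Mask, Device); the function names are hypothetical, and the sample rows use documentation addresses rather than data from this report.

```python
from collections import Counter


def parse_arp_table(text):
    """Parse /proc/net/arp text into a list of (ip, hw_address, device)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "IP address  HW type ..." header row.
        if len(fields) < 6 or fields[0] == "IP":
            continue
        ip, _hw_type, _flags, hw_address, _mask, device = fields[:6]
        entries.append((ip, hw_address, device))
    return entries


def entries_per_device(entries):
    """Count distinct ARP entries on each device (duplicates ignored)."""
    return Counter(device for _ip, _hw, device in set(entries))


def snapshot(path="/proc/net/arp"):
    """Read and parse the live table; only meaningful on Linux."""
    with open(path) as handle:
        return parse_arp_table(handle.read())


if __name__ == "__main__":
    # Hypothetical sample in the /proc/net/arp layout.
    sample = """\
IP address       HW type     Flags       HW address            Mask     Device
192.0.2.25       0x1         0x2         00:00:5E:00:53:01     *        eth1
192.0.2.26       0x1         0x2         00:00:5E:00:53:02     *        eth1
198.51.100.1     0x1         0x2         00:00:5E:00:53:03     *        eth0
"""
    print(entries_per_device(parse_arp_table(sample)))
```

A table holding hundreds of entries on the external interface would point toward the wrong-netmask explanation; a table with only a handful would not.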


Comment 12 CJeness 2002-12-18 13:48:13 UTC
With regard to netmask, we have reviewed all of the computers that participate
in the network and have verified that they all use a netmask of
255.255.255.0.

At the time that we received the many neighbour table overflow messages
yesterday, here are the contents of /proc/net/arp:

IP address       HW type     Flags       HW address            Mask     Device
192.168.1.25     0x1         0x2         00:20:E0:65:EA:4A     *        eth1
66.20.72.1       0x1         0x2         00:02:3B:01:6B:94     *        eth0

This is obviously not what you expected.  Also, here is the result of netstat
-r:

Kernel IP routing table

Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     *               255.255.255.0   U        40 0          0 eth1
66.20.72.0      *               255.255.252.0   U        40 0          0 eth0
127.0.0.0       *               255.0.0.0       U        40 0          0 lo
default         adsl-20-72-1.as 0.0.0.0         UG       40 0          0 eth0

In terms of eth0 through DHCP, the results are always consistent.  In
particular, I always see:

eth0      Link encap:Ethernet  HWaddr 00:10:4B:25:5D:97  
          inet addr:66.20.72.252  Bcast:66.20.75.255  Mask:255.255.252.0
          UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:56263 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41404 errors:0 dropped:0 overruns:0 carrier:0
          collisions:1 txqueuelen:100 
          RX bytes:33599901 (32.0 Mb)  TX bytes:3403690 (3.2 Mb)
          Interrupt:10 Base address:0x7c40 

This is really baffling to me.  I have 4 other computers that I upgraded from
RedHat 7.1 to 7.3.  All of these other computers are working fine.  One of the 4
serves the web site for the Atlanta Java Users Group.  The only differences
between the problem computer and the other 3 include the following:

1.  Problem computer uses the newer file system ext3.  I had the installation
upgrade my existing ext2 for /usr/local and /home
2.  Problem computer uses IPCHAINS with only the following statements:

:input ACCEPT
:forward ACCEPT
:output ACCEPT
-P forward MASQ

So it uses IP Masquerading.

3.  Problem computer interacts with BellSouth DSL.

4.  Hardware is different.  The computer is an IBM 300 PL.

However, this problem computer has been running RedHat Linux and performing the
same functionality for about 4 years.  In the past, its primary problem related
to its S3Trio3D video card.  This now seems to work OK.    

What else can I look at? Should I try upgrading to RedHat 8 which I have already
purchased?




Comment 13 David Miller 2002-12-21 07:05:28 UTC
Upgrading to 8.0 isn't likely to help much, as the errata
kernels are nearly identical.

I won't be able to help more with this until the new year.
You could try taking masquerading out of the equation, if such
an experiment is possible.


Comment 14 CJeness 2002-12-28 16:22:56 UTC
The primary reason that we have this computer is to do IP masquerading. 
Disabling masquerading would shut off our Internet access.   We have been using
the same IPCHAINS command now for as long as we have had BellSouth DSL or about
4 years.  When I first installed RedHat 7.3, I had accepted your firewall
settings to try to make our environment more secure.  However, when we started
having problems, I eliminated all the security and went back to just the single
masquerading command.  

Comment 15 CJeness 2003-01-17 19:11:35 UTC
Good news for all.  This bug has been resolved by the following action.  We
disabled the network interface on the motherboard (eepro100) and installed a
Netgear PCI LAN card.  I don't know whether this was a software incompatibility
(i.e. RedHat 7.3 and EEPRO100 driver) or just a hardware failure which arose
around the time that we upgraded to 7.3.  We also upgraded the memory from 128
to 256 MB.  Therefore, you may close this bug with whatever resolution code you
deem appropriate.  Thanks for the help and suggestions.

Comment 16 David Miller 2003-01-18 08:35:07 UTC
Just to clarify, you were using the eepro100 and e100 drivers from the Red Hat
kernel rpms, right?  Or were you using a vendor supplied kernel module image?

It'd be nice if this was indeed a convenient hardware failure of some sort,
but I'm not convinced of that just yet :)


Comment 17 CJeness 2003-01-20 15:29:34 UTC
Yes, we tried both the e100 and the eepro100 driver which are part of the RedHat
distribution.  There is another issue that we see at work with this card.  At
work, we have a 100 Mb ethernet switch running in full-duplex mode.   When I
installed RedHat 7.3 on one of the IBM 300 PL's at work, everything operated
correctly except the network interactions which were extremely slow.  We thought
that perhaps the NIC was not going into full-duplex mode.  However, when I
researched the driver parameters on the RedHat site, I drew the conclusion that
there was no way to force full-duplex.  The card was supposed to sense this.  I
did some searches on Google which seemed to confirm this.  So we have seen other
issues with this on-board NIC.   This full-duplex issue is actually a big
problem for getting Linux adopted at work.  The only desktop computers we have
are IBM 300 PL's.  I just have not had time to pursue it further.


Comment 18 David Miller 2003-06-09 05:33:13 UTC
You can control the duplex setting using the "ethtool" utility.

Your original problem is gone so I'm closing this.
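
For the full-duplex question raised in Comment 17, ethtool's `-s` option takes `speed`, `duplex` and `autoneg` settings. The following is a minimal sketch that builds such a command and can optionally run it; the function names and the choice of eth1 and 100 Mb/s are illustrative assumptions. Whether a given driver and kernel honour these settings is not confirmed anywhere in this report.

```python
import subprocess


def ethtool_duplex_command(device, speed_mbps=100, duplex="full"):
    """Build an ethtool invocation that forces speed and duplex on a device."""
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    return [
        "ethtool", "-s", device,
        "speed", str(speed_mbps),
        "duplex", duplex,
        "autoneg", "off",  # forcing a mode requires autonegotiation off
    ]


def apply_duplex(device, speed_mbps=100, duplex="full"):
    """Run the command; needs root and a driver with ethtool support."""
    subprocess.run(ethtool_duplex_command(device, speed_mbps, duplex), check=True)


if __name__ == "__main__":
    # Only prints the command; call apply_duplex() to actually change settings.
    print(" ".join(ethtool_duplex_command("eth1")))
```

Running `ethtool eth1` without `-s` reports the current speed and duplex, which is a useful first check before forcing anything, since a forced mode that disagrees with the switch port produces exactly the kind of slow, lossy link described above.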