Bug 500925

Summary: IP Fragments Dropped when ARP is needed
Product: Red Hat Enterprise Linux 5
Reporter: Tuan Hoang <tqhoang>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: low
Version: 5.3
CC: nhorman, tgraf
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-06-15 11:06:14 UTC

Attachments: tcpdump logs

Description Tuan Hoang 2009-05-14 21:27:38 UTC
Created attachment 344051 [details]
tcpdump logs

Description of problem:
My company develops network test software. One of the features of our software is to push UDP traffic, namely large UDP packets that get fragmented at the IP network layer. 

It appears that the link layer discards the IP fragments until it has a resolved MAC address for Host2.  After the ARP exchange succeeds, it finally sends the IP fragments, but only the last few of them.  When the ARP cache entry is current, all fragments are transmitted successfully.

The only related report I could find online is rather old and concerns the 2.4 kernel series: http://lkml.org/lkml/2003/1/29/54
This also happens with a variety of NIC hardware, so it is not tied to a particular NIC driver; that points me to a bug or behavior of the Linux IP stack itself.

Version-Release number of selected component (if applicable):
kernel-2.6.18-128.1.6.el5

How reproducible:
Every time.

Steps to Reproduce:
1. Clear the ARP cache entries for Host1 and Host2 (ex: 148.34.190.x); see the example after this list.
2. Start tcpdump on Host1 and Host2 on the dedicated interface (ex: tcpdump -nn -v -i eth1).
3. Send a large ping (ex: ping 148.34.190.10 -s 50000 -c 1).
4. Host1 needs to ARP for Host2.
5. Only the last few IP fragments of the ICMP echo request are actually transmitted from Host1, so no ICMP echo reply comes back from Host2.
6. Since the ARP cache is now current on both nodes, send the large ping again.
7. All IP fragments of the ICMP echo request are sent from Host1 and received by Host2.
8. All IP fragments of the ICMP echo reply are sent from Host2 and received by Host1.
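
For step 1, a minimal sketch of clearing the stale entries, assuming the dedicated interface is eth1 and using the peer address from the example above (Host1's address is not given in this report, so it is only a placeholder here):

# on Host1: drop the neighbour entry for Host2
arp -d 148.34.190.10
# on Host2: drop the neighbour entry for Host1 (substitute Host1's real address)
arp -d <Host1-address>
# alternatively, flush the whole neighbour cache on the test interface
ip neigh flush dev eth1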
  
Actual results:
First ping (ICMP echo request) is not sent entirely, so no ICMP echo reply.
Second ping is sent and received entirely.

Expected results:
The link layer does not discard the IP fragments, and both pings are sent and received properly.

Additional info:
Our temporary work-around is to increase the ARP cache stale timeout (/proc/sys/net/ipv4/neigh/*/gc_stale_time) to something very large, such as 3600 seconds, and then ping before the test to refresh the ARP cache.
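
Concretely, the work-around amounts to something like this (a sketch; the 3600-second value and the peer address are the ones mentioned above):

# raise the stale timeout to one hour on every interface
for f in /proc/sys/net/ipv4/neigh/*/gc_stale_time; do echo 3600 > "$f"; done
# prime the ARP cache before starting the real test
ping -c 1 148.34.190.10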

Comment 1 Neil Horman 2009-06-11 13:34:41 UTC
This sounds an awful lot like the arp_queue overflowing (see __neigh_event_send). It used to be a silent discard, but sometime in the 5.4 devel cycle I added an unres_discard stat to the /proc/net/stat/[arp|ndisc]_cache files so the drops can be observed.  Nominally that queue length is only 3 frames, but it can be adjusted upward via /proc/sys/net/ipv[4|6]/neigh/<iface>/unres_qlen.  That would be the appropriate adjustment to make for the test described.
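
For illustration, the inspection and adjustment would look roughly like this (a sketch, assuming the test interface is eth1; the unres_discard column is only present on kernels that carry that 5.4-era stat):

# default: only 3 frames may wait per unresolved neighbour
cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen
# let more frames queue while ARP resolution is in progress
echo 50 > /proc/sys/net/ipv4/neigh/eth1/unres_qlen
# on kernels with the new stat, watch for unresolved-queue drops
cat /proc/net/stat/arp_cache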

Comment 2 Cong Wang 2009-06-12 05:49:17 UTC
(In reply to comment #1)
> This sounds an awful lot like the arp_queue overflowing (see
> __neigh_event_send). It used to be a silent discard, but sometime in the 5.4
> devel cycle I added an unres_discard stat to the
> /proc/net/stat/[arp|ndisc]_cache files, so they could be observed.  nominally
> that queue length is only 3 frames, but it can be adjusted upward via
> /proc/sys/net/ipv[4|6]/neigh/<iface>/unres_qlen.  That would be the appropriate
> adjustment to make for the test described.  

Hello, Neil.

Thanks for your helpful hints!

Yes, /proc/sys/net/ipv[4|6]/neigh/<iface>/unres_qlen is the reason here; its value is exactly the number of ICMP packets that we receive on Host2 with the above test.

Hmm, in the source code it should be the last 'if' in __neigh_event_send().

So... we don't need to fix this? Or just change the default value of unres_qlen?

Comment 3 Neil Horman 2009-06-12 13:20:32 UTC
I think, given that the setting is tunable, no code change is needed.  If the tests this customer is conducting require no UDP frame loss, the answer is for them to tune that value appropriately.  I'd close this as NOTABUG, and provide documentation on how the user can scale that tunable appropriately.
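
For reference, one way such documentation might present the tuning (a sketch; eth1 and the value 50 are only examples, and the 'default' entry covers interfaces brought up later):

# apply at runtime on the test interface
sysctl -w net.ipv4.neigh.eth1.unres_qlen=50
# make it persistent across reboots
echo "net.ipv4.neigh.default.unres_qlen = 50" >> /etc/sysctl.conf
sysctl -p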

Comment 4 Tuan Hoang 2009-06-12 13:50:44 UTC
Thank you for the valuable information.  I will set up the same test and report back.

Out of curiosity, is there any adverse side effect of setting "unres_qlen" to a value of, say, 50 or even 100?

Comment 5 Neil Horman 2009-06-12 14:56:55 UTC
Only that you potentially create a large backlog of frames in the system.  IIRC that queue is per-peer, so if you have a lot of hosts that need revalidation frequently, you can get lots of frames backing up.  But if lost frames on tx are unacceptable, that's your only recourse.
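
As a rough back-of-the-envelope check (assuming roughly 1500-byte frames, which is an assumption here, not a measured value):

# worst-case backlog per unresolved neighbour at unres_qlen=100
echo $((100 * 1500))   # 150000 bytes, i.e. about 150 KB queued per peer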

Comment 6 Cong Wang 2009-06-15 09:50:37 UTC
(In reply to comment #3)
> I think, given that the setting is tunable, no code change is needed.  If the
> tests this customer is conducting require no UDP frame loss, the answer is for
> them to tune that value appropriately.  I'd close this as NOTABUG, and provide
> documentation on how the user can scale that tunable appropriately.  

That is fine with me; please close this as NOTABUG.

Comment 7 Neil Horman 2009-06-15 11:06:14 UTC
As discussed.  I think the discussion in this bug serves as sufficient documentation.  Tuan, please feel free to reopen this bug (or file a new one) if any subsequent problems come up.