Red Hat Bugzilla – Bug 172777
UDP packets with bad checksum not dropped
Last modified: 2014-06-18 04:28:42 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511
Description of problem:
When using 'snmpwalk' (from the net-snmp-utils package), we sometimes get bad packets in the responses from one of our network devices (which is a separate problem). When the packets come back corrupted, snmpwalk hangs. This also happens when using the getnext() Perl function in SNMP.pm (also part of the Net-SNMP suite).
In both cases, running strace on the program that hangs shows that the recvfrom() function is what locks up. A normal kill or Ctrl-C will successfully terminate the hung program (kill -9 not required). No other detrimental effect to the system has been observed.
A packet capture in tandem with the queries shows that in every case, an SNMP response with a bad UDP checksum is the last packet to arrive when the program hangs.
A bug report was submitted to the Net-SNMP maintainers via SourceForge. They say that the Linux kernel should drop UDP packets with bad checksums before the Net-SNMP library ever sees them.
Version-Release number of selected component (if applicable):
seen in multiple versions (since at least RHEL3U3)
Steps to Reproduce:
1. Create a scenario in which packets are damaged in transit (perhaps with crafted responses).
2. Start 'tcpdump -s 0 -n -w bad.packet.pcap "udp port 161"'.
3. Use 'strace snmpwalk -c yourcommunity -v 2c host mib' to send the query.
4. Ensure that a response with a bad UDP checksum is sent to the querier. Note the bad packet in the tcpdump capture file; the strace output shows that snmpwalk hangs in recvfrom().
Actual Results: Application hangs on recvfrom function every time. Ethereal or tcpdump captures show that this occurs if and only if the packet has a bad UDP checksum (due to alterations elsewhere in the packet).
Expected Results: Kernel should drop packets with bad checksums, causing query to time out (forcing retransmission of the original query).
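For reference, the check the kernel is expected to perform on each received IPv4 UDP datagram can be sketched as follows. This is only an illustration of the standard UDP checksum (one's-complement sum over the IPv4 pseudo-header plus the UDP header and payload), not kernel code. The function names and the example addresses are made up for this sketch, and it assumes the captured segment is complete, so that its length matches the UDP length field.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Return the 16-bit one's-complement sum of data (zero-padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header that is covered by the UDP checksum."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))


def build_udp_segment(src_ip, dst_ip, sport, dport, payload: bytes) -> bytes:
    """Hypothetical helper: build a UDP header + payload with a correct checksum."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", sport, dport, length, 0)
    csum = ~ones_complement_sum(pseudo_header(src_ip, dst_ip, length)
                                + header + payload) & 0xFFFF
    if csum == 0:
        csum = 0xFFFF  # 0 on the wire means "no checksum", so send 0xFFFF instead
    return struct.pack("!HHHH", sport, dport, length, csum) + payload


def udp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """Check the checksum of a complete UDP segment (header + payload), IPv4 only."""
    (checksum_field,) = struct.unpack("!H", segment[6:8])
    if checksum_field == 0:
        return True  # sender did not compute a checksum (permitted over IPv4)
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return total == 0xFFFF


if __name__ == "__main__":
    seg = build_udp_segment("192.0.2.1", "192.0.2.2", 161, 40000, b"response")
    print(udp_checksum_ok("192.0.2.1", "192.0.2.2", seg))
    damaged = seg[:-1] + bytes([seg[-1] ^ 0x01])  # flip one bit in the payload
    print(udp_checksum_ok("192.0.2.1", "192.0.2.2", damaged))
```

Running the same check against the UDP segment extracted from the capture file would confirm independently of Ethereal or tcpdump that the last packet before a hang really does fail the checksum.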
If there is a way to PROVE that this is, in fact, a bug in Net-SNMP, I would be more than happy to add this information to my open bug report on SourceForge. However, I will need instructions on how to verify this so I can convince them that it's their problem and not Red Hat's.
If it IS a problem that must be solved by Red Hat, I believe it should be fixed, but we could wait until Update 7, if necessary. We have a workaround in place on our network (ugly, but functional).
For reference, here is the bug report I submitted to the Net-SNMP maintainers:
That is right, the kernel is supposed to drop UDP fragments with invalid
checksums. It would be helpful to have the following additional information to
resolve the issue:
- What kind of network device is used on the receiving side?
- Checksumming settings of the device (run ethtool)
The network device is as follows (per the kernel at boot time):
Tigon3 [partno(BCM95703A30) rev 1002 PHY(5703)] (PCIX:133MHz:64-bit)
It's the Broadcom 10/100/1000 Ethernet over UTP NIC that came standard with the
Dell PowerEdge 2650.
As for the checksumming settings, I'm not as familiar with ethtool as perhaps I
should be, but is this correct? See below:
[root@mybox]# ethtool -a eth0
Pause parameters for eth0:
I assume this means checksumming is disabled for both inbound and outbound
packets. I could try enabling this, of course, but wouldn't this be at layer 2?
The layer 2 checksums are not failing, so the error wouldn't be caught by the
NIC. The errors we've been seeing are at layer 4 and higher (specifically bad
UDP checksums). If there is a setting for that, shouldn't it be hidden
somewhere in /proc?
This same bug manifests itself on my laptop as well (Dell Latitude D600) which
has both TX and RX checksumming enabled (according to 'ethtool -a eth0'), which
would again suggest that the problem is not at layer 2 but higher up in the stack.
Additional information from the Net-SNMP maintainers:
"My suspicion is that the recv call is possibly blocking until it receives a
(valid) packet, having been led to believe by 'select' that there was one
waiting. If the network driver discards the mangled packet after having
signalled it using select, but before passing it back via recv, then this might
indeed have the effect of locking up within the recvfrom call."
What do you think?
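The scenario described in that quote matches a caveat in the Linux select(2) man page: select() may report a socket as readable, and a subsequent blocking read may still block, for example when a datagram arrived but was discarded on examination because of a bad checksum. A common way for an application to protect itself is to put the socket in non-blocking mode, so that recvfrom() returns an error instead of hanging. The sketch below illustrates that pattern only. It is not Net-SNMP code, and the function name recv_with_timeout is made up for this example.

```python
import select
import socket


def recv_with_timeout(sock: socket.socket, timeout: float, bufsize: int = 65535):
    """Return (data, addr), or None if no usable datagram arrived within timeout."""
    # Non-blocking mode: recvfrom() raises BlockingIOError instead of waiting.
    sock.setblocking(False)
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None  # plain timeout; the caller can retransmit the query
    try:
        return sock.recvfrom(bufsize)
    except BlockingIOError:
        # select() reported the socket as ready, but no datagram is available
        # any more (for example, it was dropped for a bad checksum).
        # Treat this exactly like a timeout.
        return None


if __name__ == "__main__":
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(b"hello", rx.getsockname())
    print(recv_with_timeout(rx, 1.0))
    print(recv_with_timeout(rx, 0.1))  # nothing pending, so this returns None
    rx.close()
    tx.close()
```

With this pattern a discarded packet leads to an ordinary timeout and retransmission of the query, which is the behaviour requested under Expected Results, regardless of where the packet was dropped.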
Have heard no updates on this issue. Any ideas or word on if/when this might be
fixed? Have you been able to reproduce the problem? Is it something that has
been fixed in RHEL 4 (or 5)?
This problem continues to crop up from time to time, which means that processes
will hang out on my management station until someone comes along to run strace
on them to confirm they are locked up and kill them. Some investigation into
this issue would be appreciated.
Yours is RHEL3 but in RHEL4 there's bug #212321 (and all its reported duplicates). I'm not sure how 2.4
behaves with regard to UDP or if there is a similar small patch available upstream. You might want to
look into that. :) HTH
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
For more information on the RHEL errata support policy, please visit:
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.
So even though I reported this bug TWO YEARS AGO before RHEL 3 was in
maintenance mode, you're still not going to fix it. That's just great. Thanks