Bug 491549

Summary:

Netfilter NAT mishandles udp checksum 0 => 0xffff transition - also disabled HW checksum

Product:

Red Hat Enterprise Linux 5

Reporter:

David Bein <d.bein>

Component:

kernel

Assignee:

Jiri Olsa <jolsa>

Status:

CLOSED WONTFIX

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

low

Version:

5.3

CC:

anton, davem, d.bein, dzickus, jpirko, nhorman, qcai, tgraf, tis

Target Milestone:

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-02-22 08:03:44 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Mongo checksum/netfilter bits from 2.6.19.7	none
Addition patch from 2.6.20 which reverts 1 bit from 2.6.19 patches	none
A few more important bits in net/ipv4/netfilter	none
generates/sends a udp packet with checksum 0xFFFF	none
patch to detect the zero checksum packet	none
patch to detect the zero checksum packet and display counts in proc file	none

Description David Bein 2009-03-22 22:30:49 UTC

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.20) Gecko/20081217 Firefox/2.0.0.20

If a udp datagram which would normally compute a checksum of 0, then => -1 (0xffff) is sent on a system with CSO (e.g. on an e1000), what happens is
udp_push_pending_frames() does the partial checksum on the header + src/dst addrs. The problem is seen in RHEL5.2 2.6.18-92.1.22 kernels and as well
with RHEL5.3 2.6.18-128.1.1. kernels. I think this is a problem for all
flavors of RHEL5 going back to GA in 2007.

So far so good - it then pushes it down to ip who ultimately feeds it to the
netfilter code. Running on a system with NAT enabled, but without any rules
which would rewrite it ends up ignoring the partial checksum and re-computes
it using skb_checksum_help(). From my reading of that function, if the
final full udp checksum would be 0, it is not translated to -1 which means
that it goes out on the wire with a checksum of 0 (no checksum).

I am not yet sure if this is where the bug is or if the final bit of
code in ip_nat_mangle_udp_packet() which also does not account for the
0 -> -1 transition. It appears the case we are hitting is the skb_checksum_help
case because we are not running any NAT rewrites on it. It looks
like udp_manip_pkt() has the same problem as well.

What I do know is that if I run the exhaustive tests with the same set of
nfs client inputs and run tcpdump on the machine while it is happening
I see nfs client soft retries and the associated udp checksum is 0 somewhere
in the middle of the test run (it takes about an hour to run the test).

[NB: I know that sending out udp checksum == 0 is legal, but in this case
NB: we have other actors in the network who are getting confused by it].

Leaving everything the same, but substituting 2.6.19-7 (last stable version)
I can run the same tests on the same machines (same network topology) over
and over and never see any udp packets sent with the udp checksum == 0.

I did a little bit of digging into the history of this and there was
a set of patches under commit number 4cf411de49c65140b3c259748629b561c0d3340f
by Patrick McHardy, subject [NETFILTER]: Get rid of HW checksum invalidation
which changes how this stuff is handled and goes to great lengths to avoid
disabling HW assisted checksum in the process. Those patches are not quite
right for RHEL5.3 because the previous CHECKSUM_HW => CHECKSUM_PARTIAL
conversions that were done in 2.6.19.

For the moment, this is not a problem for us because it turns out we
have the NAT .config bits set needlessly. I took a kernel .config and
ripped out CONFIG_NF_NAT* (and the MASQUERADE bits as well) and re-ran
the same tests and it completed as it is supposed to.

I hope this provides enough context for the bug. I can't trivially
give you a reproduction case, but by simply taking the NAT code
in net/ipv4/netfilter out of the picture, I see no more evidence
using tcpdump of either 0 checksums or correct checksums (which made
me suspicious that something in netfilter-land was re-computing things).
Without the NAT code, using tcpdump -s 1500, I see "incorrect" checksums
consistently because tcpdump does not know about the partial checksums
associated with the HW assisted checksum offload.

From the patches, the following places clearly do the conversion:

ip_nat_mangle_udp_packet()
udp_manip_packet()

In addition the code in ip_nat_fn() does not do anything special
if skb->ip_summed is CHECKSUM_HW (in 2.6.19 language would be CHECKSUM_PARTIAL).

I realize that it would be a fair amount of patching to adopt that
conversion because it touches a lot of drivers, but I have yet
to determine if the 2.6.19 patches can be faithfully represented
without the 3rd value for CHECKSUM_COMPLETE.

Reproducible: Sometimes

Steps to Reproduce:
This one is very hard to reproduce because the inputs have to be "perfect"
to produce the natural udp checksum of 0 (before conversion to 0xffff).
It would not be strictly wrong to have it be 0 because that just means
there is no checksum (which is a perfectly valid case).

I found this one by adding some special code at the device layer looking
for a udp packet with checksum == 0 and ip_summed == CHECKSOME_NONE after
proving from the top-level (udp) that it was categorically not sending
on because some context asked it to not do udp checksumming.

Comment 1 David Bein 2009-03-24 22:41:14 UTC

Created attachment 336569 [details]
Mongo checksum/netfilter bits from 2.6.19.7

I am attaching patches relative to 2.6.18-128.1.1 which
implement 2 patches from 2.6.19.7:

(1) [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.19
.y.git;a=commitdiff_plain;h=84fa7933a33f806bbbaae6775e87459b1ec584c0;hp=8584d6df
39db5601965f9bc5e3bf2fea833ad7bb
(2) [NETFILTER]: Get rid of HW checksum invalidation
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.19
.y.git;a=commitdiff_plain;h=4cf411de49c65140b3c259748629b561c0d3340f

NOTES: Some of these patches are within CONFIG_XEN bits, but I have
NOTES: no ready way to test the XEN code, so someone who is knowledgeable
NOTES: about the XEN frontend/backend logic would need to look at these
NOTES: and maybe make changes. They do compile properly. Other than
NOTES: the CONFIG_XEN specific code segments, these changes have been
NOTES: run and show that (a) having NAT present even w/o any rules does
NOTES: not disable outbound HW checksum (setup by tcp/udp as CHECKSUM_PARTIAL).
NOTES: And it fixes the problem we were seeing with missing udp 0 => 0xffff
NOTES: on-the-wire transitions. Also changes to drivers not used on i686
NOTES: or x86_64 configs have not been compiled, There are only a few
NOTES: of these.

Comment 2 David Bein 2009-03-24 22:43:41 UTC

Because the existing code has the side-effect is disabling HW assisted
outbound checksum, I am marking this as a high priority issue because
that is for most people a more serious issue than the udp checksum 0 => 0xffff
not happening properly.

Comment 3 David Bein 2009-03-24 22:54:23 UTC

If someone wants the full modified source files, please let me know.
The context diffs work as I have tested them, but they don't necessarily
show enough context especially bits in CONFIG_XEN ... #endif segments.

Comment 4 David Bein 2009-03-25 00:03:26 UTC

One bit of the patches sometimes can produce a bad tcp checksum
in tcp_retrans_try_collapse() [ in net/ipv4/tcp_output.c ].

There is a patch reverting this one bit in 2.6.20:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=52d570aabe921663a987b2e4bae2bdc411cee480

I will add an additional patch which has this adjusted relative
to the mongo patch I've already attached.

Comment 5 David Bein 2009-03-25 00:11:34 UTC

Created attachment 336573 [details]
Addition patch from 2.6.20 which reverts 1 bit from 2.6.19 patches

This is from

From: Jarek Poplawski <jarkao2>
Date: Wed, 24 Jan 2007 06:07:12 +0000 (-0800)
Subject: [TCP]: rare bad TCP checksum with 2.6.19
X-Git-Tag: v2.6.20-rc6~5^2~3

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=52d570aabe921663a987b2e4bae2bdc411cee480

Comment 6 Jiri Olsa 2009-04-02 17:02:24 UTC

I understand the first issue with 'wrong' zero checksum and I'll try to reproduce it. You mentioned nfs client, could you plz be more specific about your test. I'll try smth on my own meanwhile.

I dont follow the second part about 'disabled HW checksum'. I guess I'm missing something but could you plz be more specific what's currently wrong and why is it a show stopper.

I'm pointing this message from you:

> Because the existing code has the side-effect is disabling HW assisted
> outbound checksum, I am marking this as a high priority issue because
> that is for most people a more serious issue than the udp checksum 0 => 0xffff
> not happening properly.  


All the patches you attached deal with the HW checksum. I found current RHEL stream has:

    commit: 84fa7933a33f806bbbaae6775e87459b1ec584c0
    [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE

I dont see the rest you post. I guess all those patches are related
to the 'disabled HW checksum' problem..

Comment 7 David Bein 2009-04-02 17:56:57 UTC

If it isn't clear from the patches, what is happening is that some outbound
udp packet is setup in udp.c with CHECKSUM_HW [with the patches => CHECKSUM_PARTIAL]. It is not at that point completely checksummed
and is (absent NAT) going to be handed to the outgoing nic (e.g. e1000)
which will complete the checksum and if the final checksum would compute
to 0, the nic would transition that to -1 (0xffff). From the tests we
ran, roughly 25 to 50 packets out of 17 million (using the exact udp
datagrams being sent over roughly 1 hour) will naturally checksum to 0
and if it were being done completely in software (e.g. tx checksum is
disabled using ethtool -K eth0 tx off), the 0 => 0xffff conversion
would happen at the udp layer itself (would be CHECKSUM_NONE as the
NAT layer would see it). In this case there is no bug provided there
is no matching NAT rule - in our case this would have been true.
However since we are using tx checksum offloading (this is the normal
case for NICs which have this feature), the partial checksum is being
stepped on in ip_nat_fn() [net/ipv4/netfilter/ip_standalone_nat.c] in
the following (unpatched) code segment:

        ct = ip_conntrack_get(*pskb, &ctinfo);

        /* Don't try to NAT if this packet is not conntracked */
        if (ct == &ip_conntrack_untracked)
                return NF_ACCEPT;

        /* If we had a hardware checksum before, it's now invalid */
        if ((*pskb)->ip_summed == CHECKSUM_HW)
                if (skb_checksum_help(*pskb, (out == NULL)))
                        return NF_DROP;

Even if (especially if) there are no matching NAT rules (there
were not in our test environment), this fails to check for the
0 to 0xffff transition on the end of it. After skb_checksum_help()
(out != NULL in this case), the checksum could be 0 and it is
there that the bug of udp checksum == 0 is happening.

But - the bigger problem is that it is touching the checksum
which is partially setup for an outbound nic which should do
the majority of the checksum calculation including the 0 -> -1
transition. Simply fixing skb_checksum_help() would be one option,
but the fact that it runs at this point is disabling HW checksum
offload, with a corresponding performance hit.

The supplied patches (all 3 of them) applied in order do not
touch the packet until/unless there is a matching NAT rule
and have special handling for the partially computed checksum
(after the patch ip_summed would be CHECKSUM_PARTIAL).

In a nutshell, you could fix just the udp checksum == 0 problem,
but the bigger problem is that without the upstream patches, no
outbound packet from tcp or udp will have the checksumming completed
by capable hardware. This is the real issue.

Last but not least - the existing netfilter/nat code w/o the supplied
patches does not do anything illegal, but it is quite sub-optimal.
sending a udp packet over the wire with checksum == 0 is legal, but
it should only happen if some agent (nfs or user program) has overtly
disabled the checksum calculation for a given udp socket.

Comment 8 David Bein 2009-04-03 14:02:36 UTC

Of the patches submitted, there is one which I did based on upstream
(2.6.28) sources. The submitted patch is:

--- ./drivers/net/cxgb3/sge.c.rhel5.3   2009-03-06 19:25:02.000000000 -0500
+++ ./drivers/net/cxgb3/sge.c   2009-03-24 10:32:32.000000000 -0400
@@ -1092,7 +1092,7 @@ static void write_tx_pkt_wr(struct adapt
        } else {
                cntrl |= V_TXPKT_OPCODE(CPL_TX_PKT);
                cntrl |= F_TXPKT_IPCSUM_DIS;    /* SW calculates IP csum */
-               cntrl |= V_TXPKT_L4CSUM_DIS(skb->ip_summed != CHECKSUM_HW);
+               cntrl |= V_TXPKT_L4CSUM_DIS(skb->ip_summed != CHECKSUM_PARTIAL)\
;
                cpl->cntrl = htonl(cntrl);

                if (skb->len <= WR_LEN - sizeof(*cpl)) {
@@ -1195,7 +1195,7 @@ int t3_eth_xmit(struct sk_buff *skb, str
        }

        /* update port statistics */
-       if (skb->ip_summed == CHECKSUM_HW)
+       if (skb->ip_summed == CHECKSUM_COMPLETE)
                qs->port_stats[SGE_PSTAT_TX_CSUM]++;
        if (skb_shinfo(skb)->gso_size)
                qs->port_stats[SGE_PSTAT_TSO]++;

The change from CHECKSUM_HW => CHECKSUM_COMPLETE should be
CHECKSUM_HW => CHECKSUM_PARTIAL since this is outbound and
is counting checksum offload tx packets. No outbound packet
with the 4 possible values for ip_summed should be CHECKSUM_COMPLETE.
That only has meaning in input from the driver contexts.

This is incorrect IMO upstream. Maybe it has been fixed.

Also - I have a few more bits in net/ipv4/netfilter which are from later
patches in 2.6.19-7 which fix a few more checksum bits among other things.
I will attach that patch next. Note that the patches I have included
so far should not be considered as making netfilter identical to 2.6.19-7
(as that drags on a lot of other things).

Comment 9 David Bein 2009-04-03 14:08:47 UTC

Created attachment 338040 [details]
A few more important bits in net/ipv4/netfilter

This fixes a few more obvious things - based on changes made by
Patrick McHardy after the initial 2 patches he made to 2.6.19.

Comment 10 Jiri Olsa 2009-04-16 09:32:01 UTC

I backprop those patches you identified to help you to the RHEL:

[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
84fa7933a33f806bbbaae6775e87459b1ec584c0

[NETFILTER]: Get rid of HW checksum invalidation
4cf411de49c65140b3c259748629b561c0d3340f

[TCP]: rare bad TCP checksum with 2.6.19
52d570aabe921663a987b2e4bae2bdc411cee480


I built and published kernel rpm and the patches on http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch than published).

Could you plz try it and see if it works for you..

As it is a quite extensive change, especially the first patch (and the 2nd one
highly depends on it), I'm thinking about some other way to fix it, so far,
following hunk from the second patch seems to me like probable fix:

diff --git a/net/ipv4/netfilter/ip_nat_standalone.c b/net/ipv4/netfilter/ip_nat_standalone.c
index 8686b49..163db98 100644
--- a/net/ipv4/netfilter/ip_nat_standalone.c
+++ b/net/ipv4/netfilter/ip_nat_standalone.c
@@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum,
        if (ct == &ip_conntrack_untracked)
                return NF_ACCEPT;
 
-       /* If we had a hardware checksum before, it's now invalid */
-       if ((*pskb)->ip_summed == CHECKSUM_HW)
-               if (skb_checksum_help(*pskb, (out == NULL)))
-                       return NF_DROP;
-
        /* Can't track?  It's not due to stress, or conntrack would
           have dropped it.  Hence it's the user's responsibilty to
           packet filter it out, or implement conntrack/NAT for that


thoughts?

Would you plz share with me more details about your testing, like what application you use and where do you catch the bug (printk?),
so I'd be able to reproduce...

You also mentioned you can share the source code with changes, plz do.

thanks
jirka

Comment 11 David Bein 2009-04-16 13:35:19 UTC

(In reply to comment #10)
> I backprop those patches you identified to help you to the RHEL:
> 
> [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
> 84fa7933a33f806bbbaae6775e87459b1ec584c0
> 
> [NETFILTER]: Get rid of HW checksum invalidation
> 4cf411de49c65140b3c259748629b561c0d3340f
> 
> [TCP]: rare bad TCP checksum with 2.6.19
> 52d570aabe921663a987b2e4bae2bdc411cee480
> 
> 
> I built and published kernel rpm and the patches on
> http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch
> than published).
> 
> Could you plz try it and see if it works for you..

Thanks for the builds - I could test that the kernels work, but
the way I diagnosed this was by adding per-cpu counters and
a /proc interface to collect them at the udp layer where the
outbound packets are being checksummed (partially or completely
depending on (a) hardware CSO support and (b) single vs multi
fragment packets. I then added the same as the layer where packets
are handed to the outbound nic xmit routine for processing.

It was only when I could see that with NAT present (and no rules)
that the lower level was seeing none of the partial checksums
and only the fully checksummed packets [only for udp] that I
became suspicious and removed NAT from the mix.

If you want me to test what you have, I will need the full patchset
for the 138 builds you did so I can add the counters and do the
testing I did to discover this and point out where it was happening.
With these patches, it can be shown via counters that any higher
level packet that tcp or udp generate partial checksums on are
delivered to the hardware without any netfilter/nat routine
disabling it.
> 
> As it is a quite extensive change, especially the first patch (and the 2nd one
> highly depends on it), I'm thinking about some other way to fix it, so far,
> following hunk from the second patch seems to me like probable fix:
> 
> diff --git a/net/ipv4/netfilter/ip_nat_standalone.c
> b/net/ipv4/netfilter/ip_nat_standalone.c
> index 8686b49..163db98 100644
> --- a/net/ipv4/netfilter/ip_nat_standalone.c
> +++ b/net/ipv4/netfilter/ip_nat_standalone.c
> @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum,
>         if (ct == &ip_conntrack_untracked)
>                 return NF_ACCEPT;
> 
> -       /* If we had a hardware checksum before, it's now invalid */
> -       if ((*pskb)->ip_summed == CHECKSUM_HW)
> -               if (skb_checksum_help(*pskb, (out == NULL)))
> -                       return NF_DROP;
> -
>         /* Can't track?  It's not due to stress, or conntrack would
>            have dropped it.  Hence it's the user's responsibilty to
>            packet filter it out, or implement conntrack/NAT for that
> 
> 
> thoughts?

Doing this would fix it if NAT does not touch the packet, but
if NAT touches the packet in any way, it has to do different
things in the handling of the differential checksum depending
on whether or not the packet is partially or fully checksummed.
That is the nature of this fix. The first patchset is seemingly
large, but conceptually it is mostly a transparent change to
properly mark the ip_summed field distinctly in ways that do
not confuse a packet (fragment) that was received from a nic
that computed the checksum -> skb->csum and outbound partially
checksummed packets. You want this because of things like bridges
and I think forwarding as well. So the first patch is really
a good patch because 3 states can not represent what 4 states
can. From the way the 2nd patch is done, the dependency is on
the distinctness of the 4 state patch because the now distinct
CHECKSUM_COMPLETE value means only one thing. It would be possible
to not change the outbound handling of CHECKSUM_HW to CHECKSUM_PARTIAL
only changing inbound handling of it to CHECKSUM_COMPLETE, but that
only means anything if there are 4 distinct values.

I think for KABI reasons, you folks may not be able to use this
approach to fixing this problem in any RHEL5.x release unless
I am mistaken about your KABI requirements. Once RHEL6 hits the
streets (whenever that is), the basis for it will be a lot
newer than 2.6.18.4 and this problem will be fixed in the base.
I don't expect you have KABI requirements across major releases
(e.g. from RHEL5 to RHEL6).

> 
> Would you plz share with me more details about your testing, like what
> application you use and where do you catch the bug (printk?),
> so I'd be able to reproduce...
> 
> You also mentioned you can share the source code with changes, plz do.


I'll dig up the counter code (relative to 128.1.1) but I think
you folks will need to first decide if the basic changes to skbuff.h
make it possible to fix any of this using this approach in RHEL5.x.
Absent the 3 to 4 state changes there, none of the submitted
patches mean anything. I think you could cut the cases down
by ending up with CHECKSUM_HW being equivalent to CHECKSUM_PARTIAL
(as it is now), but you still have to hunt for changes that are
about inbound processing related to CHECKSUM_COMPLETE and it
must be a distinct value [it isn't in the current releases].

> thanks
> jirka

Comment 12 Jiri Olsa 2009-04-16 14:11:02 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > I backprop those patches you identified to help you to the RHEL:
> > 
> > [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
> > 84fa7933a33f806bbbaae6775e87459b1ec584c0
> > 
> > [NETFILTER]: Get rid of HW checksum invalidation
> > 4cf411de49c65140b3c259748629b561c0d3340f
> > 
> > [TCP]: rare bad TCP checksum with 2.6.19
> > 52d570aabe921663a987b2e4bae2bdc411cee480
> > 
> > 
> > I built and published kernel rpm and the patches on
> > http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch
> > than published).
> > 
> > Could you plz try it and see if it works for you..
> 
> Thanks for the builds - I could test that the kernels work, but
> the way I diagnosed this was by adding per-cpu counters and
> a /proc interface to collect them at the udp layer where the
> outbound packets are being checksummed (partially or completely
> depending on (a) hardware CSO support and (b) single vs multi
> fragment packets. I then added the same as the layer where packets
> are handed to the outbound nic xmit routine for processing.
> 
> It was only when I could see that with NAT present (and no rules)
> that the lower level was seeing none of the partial checksums
> and only the fully checksummed packets [only for udp] that I
> became suspicious and removed NAT from the mix.
> 
> If you want me to test what you have, I will need the full patchset
> for the 138 builds you did so I can add the counters and do the
> testing I did to discover this and point out where it was happening.
> With these patches, it can be shown via counters that any higher
> level packet that tcp or udp generate partial checksums on are
> delivered to the hardware without any netfilter/nat routine
> disabling it.

the full patchset is on http://people.redhat.com/jolsa/bz491549/patch/
it is based on the 138 build

[jolsa@jolsa kernel]$ git describe
2.6.18-138.el5

> > 
> > As it is a quite extensive change, especially the first patch (and the 2nd one
> > highly depends on it), I'm thinking about some other way to fix it, so far,
> > following hunk from the second patch seems to me like probable fix:
> > 
> > diff --git a/net/ipv4/netfilter/ip_nat_standalone.c
> > b/net/ipv4/netfilter/ip_nat_standalone.c
> > index 8686b49..163db98 100644
> > --- a/net/ipv4/netfilter/ip_nat_standalone.c
> > +++ b/net/ipv4/netfilter/ip_nat_standalone.c
> > @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum,
> >         if (ct == &ip_conntrack_untracked)
> >                 return NF_ACCEPT;
> > 
> > -       /* If we had a hardware checksum before, it's now invalid */
> > -       if ((*pskb)->ip_summed == CHECKSUM_HW)
> > -               if (skb_checksum_help(*pskb, (out == NULL)))
> > -                       return NF_DROP;
> > -
> >         /* Can't track?  It's not due to stress, or conntrack would
> >            have dropped it.  Hence it's the user's responsibilty to
> >            packet filter it out, or implement conntrack/NAT for that
> > 
> > 
> > thoughts?
> 
> Doing this would fix it if NAT does not touch the packet, but
> if NAT touches the packet in any way, it has to do different
> things in the handling of the differential checksum depending
> on whether or not the packet is partially or fully checksummed.
> That is the nature of this fix. The first patchset is seemingly
> large, but conceptually it is mostly a transparent change to
> properly mark the ip_summed field distinctly in ways that do
> not confuse a packet (fragment) that was received from a nic
> that computed the checksum -> skb->csum and outbound partially
> checksummed packets. You want this because of things like bridges
> and I think forwarding as well. So the first patch is really
> a good patch because 3 states can not represent what 4 states
> can. From the way the 2nd patch is done, the dependency is on
> the distinctness of the 4 state patch because the now distinct
> CHECKSUM_COMPLETE value means only one thing. It would be possible
> to not change the outbound handling of CHECKSUM_HW to CHECKSUM_PARTIAL
> only changing inbound handling of it to CHECKSUM_COMPLETE, but that
> only means anything if there are 4 distinct values.
> 
> I think for KABI reasons, you folks may not be able to use this
> approach to fixing this problem in any RHEL5.x release unless
> I am mistaken about your KABI requirements. Once RHEL6 hits the
> streets (whenever that is), the basis for it will be a lot
> newer than 2.6.18.4 and this problem will be fixed in the base.
> I don't expect you have KABI requirements across major releases
> (e.g. from RHEL5 to RHEL6).

I did some modifications so it passed the KABI check,
however I need to discuss this more

> 
> > 
> > Would you plz share with me more details about your testing, like what
> > application you use and where do you catch the bug (printk?),
> > so I'd be able to reproduce...
> > 
> > You also mentioned you can share the source code with changes, plz do.
> 
> 
> I'll dig up the counter code (relative to 128.1.1) but I think
> you folks will need to first decide if the basic changes to skbuff.h
> make it possible to fix any of this using this approach in RHEL5.x.
> Absent the 3 to 4 state changes there, none of the submitted
> patches mean anything. I think you could cut the cases down
> by ending up with CHECKSUM_HW being equivalent to CHECKSUM_PARTIAL
> (as it is now), but you still have to hunt for changes that are
> about inbound processing related to CHECKSUM_COMPLETE and it
> must be a distinct value [it isn't in the current releases].

I'd appreciate it, at least for my curiosity :) thanks

> 
> > thanks
> > jirka

Comment 13 David Bein 2009-04-19 13:12:19 UTC

(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #10)
> > > I backprop those patches you identified to help you to the RHEL:
> > > 
> > > [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE
> > > 84fa7933a33f806bbbaae6775e87459b1ec584c0
> > > 
> > > [NETFILTER]: Get rid of HW checksum invalidation
> > > 4cf411de49c65140b3c259748629b561c0d3340f
> > > 
> > > [TCP]: rare bad TCP checksum with 2.6.19
> > > 52d570aabe921663a987b2e4bae2bdc411cee480
> > > 
> > > 
> > > I built and published kernel rpm and the patches on
> > > http://people.redhat.com/jolsa/bz491549. (let me know if you need other arch
> > > than published).
> > > 
> > > Could you plz try it and see if it works for you..
> > 
> > Thanks for the builds - I could test that the kernels work, but
> > the way I diagnosed this was by adding per-cpu counters and
> > a /proc interface to collect them at the udp layer where the
> > outbound packets are being checksummed (partially or completely
> > depending on (a) hardware CSO support and (b) single vs multi
> > fragment packets. I then added the same as the layer where packets
> > are handed to the outbound nic xmit routine for processing.
> > 
> > It was only when I could see that with NAT present (and no rules)
> > that the lower level was seeing none of the partial checksums
> > and only the fully checksummed packets [only for udp] that I
> > became suspicious and removed NAT from the mix.
> > 
> > If you want me to test what you have, I will need the full patchset
> > for the 138 builds you did so I can add the counters and do the
> > testing I did to discover this and point out where it was happening.
> > With these patches, it can be shown via counters that any higher
> > level packet that tcp or udp generate partial checksums on are
> > delivered to the hardware without any netfilter/nat routine
> > disabling it.
> 
> the full patchset is on http://people.redhat.com/jolsa/bz491549/patch/
> it is based on the 138 build
> 
> [jolsa@jolsa kernel]$ git describe
> 2.6.18-138.el5

If you want me to test relative to 138, then what I am missing
is the source rpm for 138. Is this available for testing purposes?

> > > 
> > > As it is a quite extensive change, especially the first patch (and the 2nd one
> > > highly depends on it), I'm thinking about some other way to fix it, so far,
> > > following hunk from the second patch seems to me like probable fix:
> > > 
> > > diff --git a/net/ipv4/netfilter/ip_nat_standalone.c
> > > b/net/ipv4/netfilter/ip_nat_standalone.c
> > > index 8686b49..163db98 100644
> > > --- a/net/ipv4/netfilter/ip_nat_standalone.c
> > > +++ b/net/ipv4/netfilter/ip_nat_standalone.c
> > > @@ -116,11 +116,6 @@ ip_nat_fn(unsigned int hooknum,
> > >         if (ct == &ip_conntrack_untracked)
> > >                 return NF_ACCEPT;
> > > 
> > > -       /* If we had a hardware checksum before, it's now invalid */
> > > -       if ((*pskb)->ip_summed == CHECKSUM_HW)
> > > -               if (skb_checksum_help(*pskb, (out == NULL)))
> > > -                       return NF_DROP;
> > > -
> > >         /* Can't track?  It's not due to stress, or conntrack would
> > >            have dropped it.  Hence it's the user's responsibilty to
> > >            packet filter it out, or implement conntrack/NAT for that
> > > 
> > > 
> > > thoughts?
> > 
> > Doing this would fix it if NAT does not touch the packet, but
> > if NAT touches the packet in any way, it has to do different
> > things in the handling of the differential checksum depending
> > on whether or not the packet is partially or fully checksummed.
> > That is the nature of this fix. The first patchset is seemingly
> > large, but conceptually it is mostly a transparent change to
> > properly mark the ip_summed field distinctly in ways that do
> > not confuse a packet (fragment) that was received from a nic
> > that computed the checksum -> skb->csum and outbound partially
> > checksummed packets. You want this because of things like bridges
> > and I think forwarding as well. So the first patch is really
> > a good patch because 3 states can not represent what 4 states
> > can. From the way the 2nd patch is done, the dependency is on
> > the distinctness of the 4 state patch because the now distinct
> > CHECKSUM_COMPLETE value means only one thing. It would be possible
> > to not change the outbound handling of CHECKSUM_HW to CHECKSUM_PARTIAL
> > only changing inbound handling of it to CHECKSUM_COMPLETE, but that
> > only means anything if there are 4 distinct values.
> > 
> > I think for KABI reasons, you folks may not be able to use this
> > approach to fixing this problem in any RHEL5.x release unless
> > I am mistaken about your KABI requirements. Once RHEL6 hits the
> > streets (whenever that is), the basis for it will be a lot
> > newer than 2.6.18.4 and this problem will be fixed in the base.
> > I don't expect you have KABI requirements across major releases
> > (e.g. from RHEL5 to RHEL6).
> 
> I did some modifications so it passed the KABI check,
> however I need to discuss this more

I am curious how you work around the KABI for 3 vs 4 distinct states
since there are drivers that use all 4 of the defines. Granted
they are in different contexts, so *maybe* there is a way
to differentiate between _HW in the outbound case and _HW
in the inbound case (what should be _COMPLETE). The original
code did not need this, but relied on certain assumptions
that I think may break the forwarding/bridging case.

> 
> > 
> > > 
> > > Would you plz share with me more details about your testing, like what
> > > application you use and where do you catch the bug (printk?),
> > > so I'd be able to reproduce...
> > > 
> > > You also mentioned you can share the source code with changes, plz do.
> > 
> > 
> > I'll dig up the counter code (relative to 128.1.1) but I think
> > you folks will need to first decide if the basic changes to skbuff.h
> > make it possible to fix any of this using this approach in RHEL5.x.
> > Absent the 3 to 4 state changes there, none of the submitted
> > patches mean anything. I think you could cut the cases down
> > by ending up with CHECKSUM_HW being equivalent to CHECKSUM_PARTIAL
> > (as it is now), but you still have to hunt for changes that are
> > about inbound processing related to CHECKSUM_COMPLETE and it
> > must be a distinct value [it isn't in the current releases].
> 
> I'd appreciate it, at least for my curiosity :) thanks

I'll prepare an attachment relative to 128.1.1 w/o any of the
patches I have already submitted for this. It will take me a little time.

> 
> > 
> > > thanks
> > > jirka

Comment 14 David Bein 2009-04-19 16:34:46 UTC

I just pulled the 138 base from Don Zickus's http directory so I could see your patches in full. Not changing skb_checksum_help() from 2 to 1 arg fixes part of
your KABI needs, but what about the effective change from CHECKSUM_COMPLETE from 1
(aliased to CHECKSUM_HW) to a distinct value of 3? Any drivers that were
correctly using this before would be busted w/o a re-compile if used with
a kernel that has the changes in skbuff.h in it.

Now that I have your patch set, I will generate the udp debugging I added
relative to the patched version in 138. There are other bits that are
changed in udp.c I think, but my patch will be added on top of your
patch set.

If the src rpm from Don's tree for 138 is NOT what you used, please
let me know, but from the output of folding in your patch set, it seems
like it is the same.

Comment 15 David Bein 2009-04-19 16:50:56 UTC

The changes that you've made look basically fine and I certainly see
why you have preserved the nat cheat ip check functions etc. However
I think that Patrick McHardy is probably the person who should see
what you've changed (not in ./drivers, but in ./net) where it deviates
from the patches I submitted relative to rhel5.3 (-128.1.1) because
other than the XEN stuff (which I couldn't adequately test), all of
the changes were really based on his changes. The only bits I added
to the mix would be including all drivers in my search for CHECKSUM_HW
conversions and dealing with bits added by RH having to do with XEN.

Also - if you are going to use these, I recommend the 4th attachment
because it makes the source in netfilter self-consistent. This
is separate from how you handled the route me harder bits.

Comment 16 David Bein 2009-04-19 17:39:42 UTC

(In reply to comment #15)
> The changes that you've made look basically fine and I certainly see
> why you have preserved the nat cheat ip check functions etc. However
> I think that Patrick McHardy is probably the person who should see
> what you've changed (not in ./drivers, but in ./net) where it deviates
> from the patches I submitted relative to rhel5.3 (-128.1.1) because
> other than the XEN stuff (which I couldn't adequately test), all of
> the changes were really based on his changes. The only bits I added
> to the mix would be including all drivers in my search for CHECKSUM_HW
> conversions and dealing with bits added by RH having to do with XEN.
> 
> Also - if you are going to use these, I recommend the 4th attachment
> because it makes the source in netfilter self-consistent. This
> is separate from how you handled the route me harder bits.  

In looking more closely, I see what you did with a local version
of the route me harder functionality in ipt_REJECT.c. I guess
that should work as well, but you are a better judge of that.

Sometime in the next week, I'll find some time to try this
full set of patches (as written) with debugging added to
it to verify that the basic problem of NAT stomping on the
checksums is fixed. I didn't have a test case for the udp
checksum == 0 case, but code inspection and the fact that
the same mongo test on 2.6.19-7 was not failing is sufficient
for me to know that part works properly. The more interesting
test is the checksum offloading which I can do pretty easily
on one of the test machines (Nehalem server).

Note that since 138 isn't yet a release (beta or otherwise)
if there are bits that are unrelated that stick me, I'll
let you know - but am not expecting that to be the case
and won't file a new bug against it since I can see that
139 is already there.

Comment 17 Jiri Olsa 2009-04-20 08:39:04 UTC

(In reply to comment #14)
> I just pulled the 138 base from Don Zickus's http directory so I could see your
> patches in full. Not changing skb_checksum_help() from 2 to 1 arg fixes part of
> your KABI needs, but what about the effective change from CHECKSUM_COMPLETE
> from 1
> (aliased to CHECKSUM_HW) to a distinct value of 3? Any drivers that were
> correctly using this before would be busted w/o a re-compile if used with
> a kernel that has the changes in skbuff.h in it.
> 

You're right, I discussed with Neil Horman and this actually looks as a big problem. I did not realized that even this change passed the automated KABI check, it actually breaks the KABI.

My current understanding is that either you'd need to wait for another
release which will have this one fixed from upstream, or we need to
find another solution.

> Now that I have your patch set, I will generate the udp debugging I added
> relative to the patched version in 138. There are other bits that are
> changed in udp.c I think, but my patch will be added on top of your
> patch set.
> 
> If the src rpm from Don's tree for 138 is NOT what you used, please
> let me know, but from the output of folding in your patch set, it seems
> like it is the same.  

you're right I used 138.

I'm trying to reproduce the bug. Plz check attachments and find the
send application, which prepare/send the UDP packet with 0xFFFF checksum and
the kernel patch (138 based) that printks any 0 checksum occurance.

Plz let me know if I miss something..

Comment 18 Jiri Olsa 2009-04-20 08:41:11 UTC

Created attachment 340300 [details]
generates/sends a udp packet with checksum 0xFFFF

Makefile:
---
CC=gcc
send: send.o

Comment 19 Jiri Olsa 2009-04-20 08:42:49 UTC

Created attachment 340301 [details]
patch to detect the zero checksum packet

based on  kernel-2.6.18-138.el5

Comment 20 David Bein 2009-04-20 13:28:51 UTC

(In reply to comment #17)
> (In reply to comment #14)
> > I just pulled the 138 base from Don Zickus's http directory so I could see your
> > patches in full. Not changing skb_checksum_help() from 2 to 1 arg fixes part of
> > your KABI needs, but what about the effective change from CHECKSUM_COMPLETE
> > from 1
> > (aliased to CHECKSUM_HW) to a distinct value of 3? Any drivers that were
> > correctly using this before would be busted w/o a re-compile if used with
> > a kernel that has the changes in skbuff.h in it.
> > 
> 
> You're right, I discussed with Neil Horman and this actually looks as a big
> problem. I did not realized that even this change passed the automated KABI
> check, it actually breaks the KABI.
> 
> My current understanding is that either you'd need to wait for another
> release which will have this one fixed from upstream, or we need to
> find another solution.

We can brew our own kernels in any case (including drivers).

For other people who do not use the nat module, this isn't a problem at all.

There ought to be ways to be able to tell the "original" direction
of a packet and based on that know the distinction between the
2 values. One way to do that is tag the skb's that come upstream
(from a driver). I don't know if there any unused bitfields in the
skb, but I think there are. Since all skb's come from the kernel
at least for non-Xen (maybe Xen too - I didn't count the unused bits)
this could be possible (based on the path of the packet). But that's
a hack - maybe a worthwhile hack. All upstream packets have to
be recognizable. i.e. if it tagged by the driver as hw/partial
it can be converted on the fly to complete (distinct value). Then
the KABI issue should go away provided all the inbound paths are
covered.


<snip>

Your udp checksum 0 program is handy. Thank you.

As mentioned before I will generate a 138 based kernel and brew the
full counters that show before/after patching behavior.

Your test for the checksum 0 case is one piece of it and hooking it into 
dst_output() is a good place to do it. I actually put my low
level counter at the nic layer (igb) so I could really see if
it ever received any partial checksum packets.

It should be noted that tiny tcp packets (e.g syn,fin,ack-only,rst) are
checksummed by the host, so you have to take this into account.
But by counting the number of none or hw/partial packets (outbound) it
was clear what is happening.

Comment 21 Jiri Olsa 2009-04-21 22:02:32 UTC

<snip>

I was able to recreate the problem with the 'send' tool and the additional kernel patch (attached test-code.patch) -> I saw ZERO checksum packets in the proc file and also in the tcpdump from the UDP peer server.

I also verified the fix we have so far is working properly - no ZERO checksum packets in the proc file and the proper tcpdump


<snip>

> There ought to be ways to be able to tell the "original" direction
> of a packet and based on that know the distinction between the
> 2 values. One way to do that is tag the skb's that come upstream
> (from a driver). I don't know if there any unused bitfields in the
> skb, but I think there are. Since all skb's come from the kernel
> at least for non-Xen (maybe Xen too - I didn't count the unused bits)
> this could be possible (based on the path of the packet). But that's
> a hack - maybe a worthwhile hack. All upstream packets have to
> be recognizable. i.e. if it tagged by the driver as hw/partial
> it can be converted on the fly to complete (distinct value). Then
> the KABI issue should go away provided all the inbound paths are
> covered.

I'll look more closely on this possibility, but I'm guessing that we will
probably end up with the patch of the same magnitude

Comment 22 Jiri Olsa 2009-04-21 22:05:29 UTC

Created attachment 340644 [details]
patch to detect the zero checksum packet and display counts in proc file

based on kernel-2.6.18-139.el5

Comment 23 Jiri Olsa 2009-04-23 14:53:51 UTC

(In reply to comment #21)
> <snip>
> 
> I was able to recreate the problem with the 'send' tool and the additional
> kernel patch (attached test-code.patch) -> I saw ZERO checksum packets in the
> proc file and also in the tcpdump from the UDP peer server.
> 
> I also verified the fix we have so far is working properly - no ZERO checksum
> packets in the proc file and the proper tcpdump
> 
> 
> <snip>
> 
> > There ought to be ways to be able to tell the "original" direction
> > of a packet and based on that know the distinction between the
> > 2 values. One way to do that is tag the skb's that come upstream
> > (from a driver). I don't know if there any unused bitfields in the
> > skb, but I think there are. Since all skb's come from the kernel
> > at least for non-Xen (maybe Xen too - I didn't count the unused bits)
> > this could be possible (based on the path of the packet). But that's
> > a hack - maybe a worthwhile hack. All upstream packets have to
> > be recognizable. i.e. if it tagged by the driver as hw/partial
> > it can be converted on the fly to complete (distinct value). Then
> > the KABI issue should go away provided all the inbound paths are
> > covered.
> 
> I'll look more closely on this possibility, but I'm guessing that we will
> probably end up with the patch of the same magnitude  


I discussed with Neil Horman and it seems there's no easy way to tell
the packet direction out of the sk_buff:

<email>
> Those flags are used for telling the packet's direction and card's capability
> to compute the HW checksum.
>
> Is there any other way we could get the packet direction out of the sk_buff?
> If there is, we could probably bypass the first patch and stay KABI compliant.
>
> thoughts?
>
Nothing in the skbuff thats definitive, no, unfortunately.
Neil
</email>

So I guess the best would be to push this bug out to the RHEL 6.

Comment 24 David Bein 2009-04-23 15:05:00 UTC

(In reply to comment #23)
> (In reply to comment #21)
> > <snip>
> > 
> > I was able to recreate the problem with the 'send' tool and the additional
> > kernel patch (attached test-code.patch) -> I saw ZERO checksum packets in the
> > proc file and also in the tcpdump from the UDP peer server.
> > 
> > I also verified the fix we have so far is working properly - no ZERO checksum
> > packets in the proc file and the proper tcpdump
> > 
> > 
> > <snip>
> > 
> > > There ought to be ways to be able to tell the "original" direction
> > > of a packet and based on that know the distinction between the
> > > 2 values. One way to do that is tag the skb's that come upstream
> > > (from a driver). I don't know if there any unused bitfields in the
> > > skb, but I think there are. Since all skb's come from the kernel
> > > at least for non-Xen (maybe Xen too - I didn't count the unused bits)
> > > this could be possible (based on the path of the packet). But that's
> > > a hack - maybe a worthwhile hack. All upstream packets have to
> > > be recognizable. i.e. if it tagged by the driver as hw/partial
> > > it can be converted on the fly to complete (distinct value). Then
> > > the KABI issue should go away provided all the inbound paths are
> > > covered.
> > 
> > I'll look more closely on this possibility, but I'm guessing that we will
> > probably end up with the patch of the same magnitude  
> 
> 
> I discussed with Neil Horman and it seems there's no easy way to tell
> the packet direction out of the sk_buff:
> 
> <email>
> > Those flags are used for telling the packet's direction and card's capability
> > to compute the HW checksum.
> >
> > Is there any other way we could get the packet direction out of the sk_buff?
> > If there is, we could probably bypass the first patch and stay KABI compliant.
> >
> > thoughts?
> >
> Nothing in the skbuff thats definitive, no, unfortunately.
> Neil
> </email>
> 
> So I guess the best would be to push this bug out to the RHEL 6.  

What I was thinking about is the following:

A network device receives a hardware packet. If it marks it
with CHECKSUM_NONE or CHECKSUM_UNNECESSARY nothing needs to
change. Only if it uses the ABI value for HW (PARTIAL) does
anything need to happen. Since the packets have to ultimately
be handed to netif_receive_skb() [or variants of that] those
functions could transform CHECKSUM_PARTIAL => CHECKSUM_COMPLETE.
Since netif_receive_skb() is in the kernel [along with variants
of it for LRO and vlan], it could handle this transformation.

If that is feasible, then drivers compiled with the wrong value
of CHECKSUM_COMPLETE [aliased to CHECKSUM_HW => CHECKSUM_PARTIAL]
would work w/o breaking the KABI.

Comment 26 Jiri Olsa 2010-02-17 15:42:47 UTC

proposing to close as wontfix due to KABI issues

Comment 27 David Bein 2010-02-17 15:47:11 UTC

(In reply to comment #26)
> proposing to close as wontfix due to KABI issues    

I expected this to be final resolution - when RHEL6
comes out I expect the base 2.6 release to be changed
to something newer than 2.6.18 and in that case this
should be fixed w/o doing anything.